Stephen Toub - MSFT
Each year, summer arrives to find me daunted and excited to write about the performance improvements in the upcoming release of .NET. “Daunted,” because these posts, covering .NET 8, .NET 7, .NET 6, .NET 5, .NET Core 3.0, .NET Core 2.1, and .NET Core 2.0, have garnered a bit of a reputation, one I want to ensure the next iteration lives up to. And “excited,” because there’s such an abundance of material to cover due to just how much goodness has been packed into the next .NET release, I struggle to get it all written down as quickly as my thoughts whirl.
And so, every year, I start these posts talking about how the next release of .NET is the fastest and best release to date. That’s true for .NET 9 as well, of course, but the statement that .NET 9 is the fastest and best release of .NET to date is now a bit… mundane. So, let’s spice it up a bit. How about… a haiku?
Or, maybe a limerick:
A little gimmicky? Maybe something more classical, a sonnet perhaps:
Ok, so, yeah, I should stick to writing software rather than poetry (something with which my college poetry professor likely agreed). Nevertheless, the sentiment remains: .NET 9 is an incredibly exciting release. More than 7,500 pull requests (PRs) have merged into dotnet/runtime in the last year, of which a significant percentage have touched on performance in one way, shape, or form. In this post, we’ll take a tour through over 350 PRs that have all found their way into packing .NET 9 full of performance yumminess. Please grab a large cup of your favorite hot beverage, sit back, settle in, and enjoy.
In this post, I’ve included micro-benchmarks to showcase various performance improvements. Most of these benchmarks are implemented using BenchmarkDotNet v0.14.0, and, unless otherwise noted, there is a simple setup for each.
To follow along, first make sure you have .NET 8 and .NET 9 installed. The numbers I share were gathered using the .NET 9 Release Candidate.
Once you have the appropriate prerequisites installed, create a new C# project in a new benchmarks directory:
The resulting directory will contain two files:
The preceding project file tells the build system we want:
For each benchmark, I’ve then included the full
Running the benchmarks is then simple. Each test includes a comment at its top for the
That:
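Although the post’s exact scaffolding isn’t reproduced above, a minimal harness along these lines is all that’s assumed (the class name and the sample benchmark are illustrative, not the post’s exact code):

```csharp
// Program.cs — minimal BenchmarkDotNet scaffolding; Tests and Sum are illustrative names.
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[MemoryDiagnoser(displayGenColumns: false)]
public partial class Tests
{
    private readonly int[] _values = new int[1_000];

    [Benchmark]
    public int Sum()
    {
        int sum = 0;
        foreach (int value in _values) sum += value;
        return sum;
    }
}
```

Benchmarks are then typically run with something like `dotnet run -c Release -f net8.0 --filter "*"`, letting BenchmarkDotNet handle process isolation, warmup, and statistics.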
Throughout the post, I’ve shown many benchmarks and the results I received from running them. Unless otherwise stated (e.g. because I’m demonstrating an OS-specific improvement), the results shown for benchmarks are from running them on Linux (Ubuntu 22.04) on an x64 processor.
My standard caveat: these are micro-benchmarks, often measuring operations that take very short periods of time, but where improvements to those times add up to be impactful when executed over and over and over. Different hardware, different operating systems, what other processes might be running on your machine, who you had breakfast with this morning, and the alignment of the planets can all impact the numbers you get out. In short, the numbers you see are unlikely to match exactly the numbers I share here; however, I’ve chosen benchmarks that should be broadly repeatable.
With all that out of the way, let’s do this!
Improvements in .NET show up at all levels of the stack. Some changes result in large improvements in one specific area. Other changes result in small improvements across many things. When it comes to broad-reaching impact, though, few changes are more broadly impactful than those made to the Just-In-Time (JIT) compiler. Code generation improvements help make everything better, and it’s where we’ll start our journey.
In Performance Improvements in .NET 8, I called out the enabling of dynamic profile guided optimization (PGO) as my favorite feature in the release, so PGO seems like a good place to start for .NET 9.
As a brief refresher, dynamic PGO is a feature that enables the JIT to profile code and use what it learns from that profiling to help it generate more efficient code based on the exact usage patterns of the application. The JIT utilizes tiered compilation, which allows code to be compiled and then re-compiled, possibly multiple times, achieving something new each time the code is compiled. For example, a typical method might start out at “tier 0,” where the JIT applies very few optimizations and has a goal of simply getting to functional assembly as quickly as possible. This helps with startup performance, as optimizations are one of the most costly things a compiler does. Then the runtime tracks the number of times the method is invoked, and if the number of invocations trips over a particular threshold, such that it seems like performance could actually matter, the JIT will re-generate code for it, still at tier 0, but this time with a bunch of additional instrumentation injected into the method, tracking all manner of things that could help the JIT better optimize, e.g. for a given virtual dispatch, what is the most common type on which the call is being performed. Then after enough data has been gathered, the JIT can compile the method yet again, this time at “tier 1,” fully optimized, also incorporating all of the learnings from that profile data. This same flow is relevant as well for code that’s already been pre-compiled with ReadyToRun (R2R), except instead of instrumenting tier 0 code, the JIT will generate optimized, instrumented code on its way to generating a re-optimized implementation.
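To make that concrete, here’s a hedged sketch (the types here are mine, not the post’s) of the kind of call site dynamic PGO helps with: if profiling shows the element is almost always a Circle, the tier-1 code can include a type-check-guarded, devirtualized, and potentially inlined fast path for Circle.Area.

```csharp
// Dynamic PGO can observe the most common concrete type at this virtual call site
// and emit a guarded fast path for it in the optimized tier-1 code.
abstract class Shape
{
    public abstract double Area();
}

sealed class Circle : Shape
{
    public double Radius;
    public override double Area() => Math.PI * Radius * Radius;
}

static double TotalArea(Shape[] shapes)
{
    double total = 0;
    foreach (Shape shape in shapes)
        total += shape.Area(); // virtual dispatch, profiled at tier 0
    return total;
}
```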
In .NET 8, the JIT in particular paid attention to PGO data about types and methods involved in virtual, interface, and delegate dispatch. In .NET 9, it’s also able to use PGO data to optimize casts. Thanks to dotnet/runtime#90594, dotnet/runtime#90735, dotnet/runtime#96597, dotnet/runtime#96731, and dotnet/runtime#97773, dynamic PGO is now able to track the most common input types to cast operations (
That
but now on .NET 9, it produces this:
On .NET 8, it’s loading the reference to the object and the desired method token for
It’s also capable of optimizing for the negative case where the cast most often fails. Consider this benchmark:
On .NET 9, we get this assembly:
Here the incoming object is always a
dotnet/runtime#96311 also breaks new ground with dynamic PGO, by teaching it how to profile integers and paying attention to their most common values. Then in conjunction with dotnet/runtime#96571, it uses this super power to optimize
Tier 0 is all about getting to functioning code quickly, and as such most optimizations are disabled. However, every now and then there’s a reason to do a bit more optimization in tier 0, in situations where the benefits of doing so outweigh the cons. Several of those occurred in .NET 9.
dotnet/runtime#104815 is a simple example. The
When I run that on .NET 8, I get results like this:
The first few iterations are invoking
With these tweaks to tier 0, the boxing is also elided in tier 0, and so starts out without any allocation.
Another tier 0 boxing example is dotnet/runtime#90496. There’s a hot path method in the
Optimizations are avoided in tier 0 because they might slow down compilation. If there are really cheap optimizations, though, and they can have a meaningful impact, they can be worth enabling. That’s especially true if the optimizations can actually help to make compilations and startup faster, such as by minimizing calls to helpers that may take locks, trigger certain kinds of loading, etc. And that’s what dotnet/runtime#105190 does, enabling some constant folding in tier 0 at relatively little cost. Even with the low cost, though, there were still concerns about possible impact to JIT throughput, and the PR was fast-followed by dotnet/runtime#105250 which optimized some JIT code paths to make up for any impact from the former change.
Another similar case is dotnet/runtime#91403 from @MichalPetryka, which allows optimizations around
Applications spend a lot of time iterating through loops, and finding ways to reduce the overheads of loops has been a key focus for .NET 9. It’s also been quite successful.
dotnet/runtime#102261 and dotnet/runtime#103181 help to remove some instructions from even the tightest of loops by converting upward counting loops into downward counting loops. Consider a loop like the following:
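The post’s original benchmark isn’t reproduced above; a representative upward-counting loop, where the iteration variable isn’t otherwise needed, looks like this:

```csharp
// Counts upward: each iteration must compare i against n to decide whether to continue.
static int CountUp(int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        count++;
    return count;
}
```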
Here’s what the generated assembly code for that core loop looks like on .NET 8:
It’s incrementing
Now let’s manually rewrite the loop to be downward counting:
And here’s what the generated assembly code for that core loop looks like there:
The key observation here is that by counting down, we can replace a
However, the JIT is only able to do this transformation if the iteration variable (
One such optimization is strength reduction in loops. In compilers, “strength reduction” is the simple idea of taking something relatively expensive and replacing it with something cheaper. In the context of loops, that typically means introducing more “induction variables” (variables whose values change in a predictable pattern on each iteration, such as being incremented by a constant amount). For example, consider a simple loop that sums all of the elements of an array:
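A sketch of such a summing loop (illustrative; the post’s exact benchmark may differ slightly):

```csharp
// values[i] conceptually requires computing baseAddress + i * sizeof(int) each
// iteration; strength reduction instead maintains a pointer that's simply bumped
// by sizeof(int) each time through the loop.
static int Sum(int[] values)
{
    int sum = 0;
    for (int i = 0; i < values.Length; i++)
        sum += values[i];
    return sum;
}
```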
We get the following assembly on .NET 8:
The interesting part is the loop starting at
Note the loop at
A lot of work went into enabling this strength reduction, including providing the basic implementation (dotnet/runtime#104243), enabling it by default (dotnet/runtime#105131), finding more opportunities to apply it (dotnet/runtime#105169), and using it to enable post-indexed addressing (dotnet/runtime#105181 and dotnet/runtime#105185), which is an Arm addressing mode where the address stored in the base register is used but then that register is updated to point to the next target memory location. A new phase was also added to the JIT to help with optimizing such induction variables (dotnet/runtime#97865), and in particular, to do induction variable widening where 32-bit induction variables (think of every loop you’ve ever written that starts with
These optimizations are all new, but of course there are also many loop optimizations already present in the JIT compiler, from loop unrolling to loop cloning to loop hoisting. In order to apply such loop optimizations, though, the JIT first needs to recognize loops, and that can sometimes be more challenging than it would seem (dotnet/runtime#43713 describes a case where the JIT was failing to do so). Historically, the JIT’s loop recognition was based on a relatively simplistic lexical analysis. In .NET 8, as part of the work to improve dynamic PGO, a more powerful graph-based loop analyzer was added that was able to recognize many more loops. For .NET 9 with dotnet/runtime#95251, that analyzer was factored out so that it could be used for generalized loop reasoning. And then with PRs like dotnet/runtime#96756 for loop alignment, dotnet/runtime#96754 and dotnet/runtime#96553 for loop cloning, dotnet/runtime#96752 for loop unrolling, dotnet/runtime#96751 for loop canonicalization, and dotnet/runtime#96753 for loop hoisting, many of these loop-related optimizations have now been moved to the better scheme. All of that means that more loops get optimized.
.NET code is, by default, “memory safe.” Unlike in C, where you can iterate through an array and easily walk off the end of it, by default accesses to arrays, strings, and spans are “bounds checked” to ensure you can’t walk off the end or before the beginning. Of course, such bounds checking adds overhead, and so wherever the JIT can prove that adding such checks would be unnecessary, it’ll elide the bounds check, knowing that it’s impossible for the guarded accesses to be problematic. The quintessential example of this is a loop over an array from 0 to its Length.
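That quintessential pattern, sketched (the field and local names are mine):

```csharp
private int[] _values = new int[1_000];

public int Sum()
{
    int[] arr = _values; // copy the field into a local the JIT knows won't change
    int sum = 0;
    for (int i = 0; i < arr.Length; i++)
        sum += arr[i];   // provably in bounds: no bounds check emitted
    return sum;
}
```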
That
The key part to pay attention to is the loop at
Now, let’s tweak the benchmark ever so slightly. In the above, I was copying the
now we get this on .NET 8:
That’s a whole lot worse. Note how much that loop starting at
Every release the JIT gets better at removing more and more bounds checks where it can prove they’re superfluous. One of my favorite such improvements in .NET 9 made my favorites list because I’ve historically expected the optimization to “just work,” for various reasons it didn’t, and now it does (it also shows up in a fair amount of real code, which is why I’ve bumped up against it). In this benchmark, the function is handed an offset and a span, and its job is to sum all of the numbers from that offset to the end of the span.
By casting
But, this is a fairly awkward way to write such a condition. A more natural way would be to have that check as part of the loop condition:
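A sketch of that more natural shape (the method and parameter names are mine):

```csharp
static int SumFrom(ReadOnlySpan<int> span, int offset)
{
    int sum = 0;
    for (int i = offset; i < span.Length; i++)
        sum += span[i]; // the loop condition already guarantees i < span.Length
    return sum;
}
```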
Unfortunately, as a result of my code cleanup here to make the code more canonical, the JIT in .NET 8 fails to see that the bounds check can be elided… note the
But in .NET 9, thanks to dotnet/runtime#100777, the JIT is better able to track the knowledge about guarantees made by the loop condition and is able to elide the bounds check on this variation as well.
Yay!
Now consider this benchmark:
The test method here has a span of data initialized in a way where the JIT is able to see how long it is. It’s then indexing into the span, using the supplied index to read not from the start but from the end (the
But, now on .NET 9, thanks to dotnet/runtime#96123, the bounds check gets elided.
Here’s another case. We’re special-casing spans of lengths less than or equal to 1, returning
You and I can see that the access to
The JIT keeps track of what it knows about the lengths of various things, what conditions it’s proved, but here it’s lost track of the fact that, for the else branch of the ternary,
Most if not all of these bounds check elimination improvements come about because someone is optimizing something and sees a bounds check that could have been eliminated but wasn’t. In the case that inspired the improvement in dotnet/runtime#101352, that someone was me, while working on improving
That bounds check wasn’t previously being removed, but now in .NET 9, it is:
Sometimes eliding bounds checks is about learning new tricks; other times, it’s about fixing old ones. Consider this benchmark:
Note the bounds check in
Interestingly, sometimes even if we can’t elide a bounds check, we can learn things from the fact that one occurred, and then use that knowledge to optimize subsequent things. Consider this benchmark:
There’s nothing the JIT can do here to elide the bounds check on
However, all is not lost. After indexing into the array, we proceed to use
Making .NET on Arm an awesome and fast experience has been a critical, multi-year investment. You can read about it in Arm64 Performance Improvements in .NET 5, Arm64 Performance Improvements in .NET 7, and Arm64 Performance Improvements in .NET 8. And things continue to improve even further in .NET 9. Here are some examples:
Bringing up a new instruction set is a huge deal and a huge undertaking. I’ve mentioned in the past my process for gearing up to write one of these “Performance Improvements in .NET X” posts, including that throughout the year I keep a running list of the PRs I might want to talk about when it comes time to actually put pen to paper. Just for “SVE”, I found myself with over 200 links. I’m not going to bore you with such a laundry list; if you’re interested, you can search for SVE PRs, which includes PRs from @a74nh, from @ebepho, from @mikabl-arm, from @snickolls-arm, and from @SwapnilGaikwad. But, we can still talk a bit about what it is and what it means for .NET.
Single instruction, multiple data (SIMD) is a kind of parallel processing where one instruction performs the same operation on multiple pieces of data at the same time, rather than one instruction manipulating just a single piece of data. For example, a single 128-bit SIMD addition can add four pairs of 32-bit integers in one instruction.
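As a hedged illustration in C#, using the cross-platform Vector128 APIs:

```csharp
using System.Runtime.Intrinsics;

// One 128-bit SIMD addition sums four pairs of 32-bit integers in a single operation.
Vector128<int> left = Vector128.Create(1, 2, 3, 4);
Vector128<int> right = Vector128.Create(10, 20, 30, 40);
Vector128<int> sum = left + right; // (11, 22, 33, 44)
```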
SVE, or “Scalable Vector Extensions,” is an ISA from Arm that’s a bit different. The instructions in SVE don’t operate on a fixed size. Rather, the specification allows for them to operate on sizes from 128 bits up to 2048 bits, and the specific hardware can choose which size to use (allowed sizes are multiples of 128, and with SVE 2 further constrained to be powers of 2). The same assembly code using these instructions might operate on 128 bits at a time on one piece of hardware and 256 bits at a time on another piece of hardware.
There are multiple ways such an ISA impacts .NET, and in particular the JIT. The JIT needs to be able to work with the ISA, understand the associated registers and be able to do register allocation, be taught about encoding and emitting the instructions, and so on. The JIT needs to be taught when and where it’s appropriate to use these instructions, so that as part of compiling IL down to assembly, if operating on a machine that supports SVE, the JIT might be able to pick SVE instructions for use in the generated assembly. And the JIT needs to be taught how to represent this data, these vectors, to user code. All of that is a huge amount of work, especially when you consider that there are thousands of operations represented. What makes it even more work is hardware intrinsics.
Hardware intrinsics are a feature of .NET where, effectively, each of these instructions shows up as its own dedicated .NET method, such as
Two interesting things to notice if you open that file (beyond its sheer length):
Even with the size of the SVE effort, it’s not the only new ISA available in .NET 9. Thanks in large part to dotnet/runtime#99784 from @Ruihan-Yin and dotnet/runtime#101938 from @khushal1996, .NET 9 now also supports AVX10.1 (AVX10 version 1). AVX10.1 provides everything AVX512 provides, all of the base support, the updated encodings, support for embedded broadcasts, masking, and so on, but it only requires 256-bit support in the hardware (with 512-bits being optional, whereas AVX512 requires 512-bit support), and it does so in a much less incremental manner (AVX512 has multiple instruction sets like “F”, “DQ”, “Vbmi”, etc.). That’s modeled in the .NET APIs as well, where you can check
On the subject of ISAs, it’s worth mentioning AVX512. .NET 8 added broad support for AVX512, including support in the JIT and employment of it throughout the libraries. Both of those improve further in .NET 9. We’ll talk more about places it’s better used in the libraries later. For now, here are some JIT-specific improvements.
One of the things the JIT needs to generate code for is zeroing, e.g. by default all locals in a method need to be set to zero, and even if
Here’s the assembly for
This is on a machine with AVX512 hardware support, but we can see the zero’ing is happening using a loop (
Now there’s no loop, because
Zeroing also shows up elsewhere, such as when initializing structs. Those have also previously employed SIMD instructions where relevant, e.g. this:
produces this assembly today on .NET 8:
But, if we tweak
This is due to alignment requirements in order to provide necessary atomicity guarantees. But rather than giving up wholesale, dotnet/runtime#102132 allows the SIMD zeroing to be used for the contiguous portions that don’t contain GC references, so now on .NET 9 we get this:
This optimization isn’t specific to AVX512, but it includes the ability to use AVX512 instructions when available. (dotnet/runtime#99140 provides similar support for Arm64.)
Other optimizations improve the JIT’s ability to select AVX512 instructions as part of generating code. One neat example of this is dotnet/runtime#91227 from @Ruihan-Yin, which utilizes the cool
Here we’ve put our Boolean operation into an
We then take that last
So the values are
also yields:
Why those specific three values of
Another neat example is in dotnet/runtime#92017 from @Ruihan-Yin, which optimizes 512-bit vector constants via
that’s broadcasting the single value
This would result in that whole byte sequence being stored in the assembly data section, and then the JIT would emit the code to load that data into the appropriate registers; no broadcasting. But instead, the JIT should be able to recognize that this is actually the same 16-byte sequence repeated four times, store the sequence once, and then use a
This is beneficial for a variety of reasons, including less data to store, less data to load, and if the register containing this state needed to be spilled (meaning something else needs to be put into the register, so the value currently in the register is temporarily stored in memory), reloading it is similarly cheaper.
Two of the more far-reaching changes related to AVX512, though, come from dotnet/runtime#97675 and dotnet/runtime#101886, which do the work to enable the JIT to utilize AVX512 “embedded masking.” Masking is a commonly needed solution when writing SIMD code; anywhere one vector is blended with another based on a per-element condition, that’s masking at work.
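An explicit masking pattern in C# typically looks something like this (illustrative):

```csharp
using System.Runtime.Intrinsics;

// Build a per-element mask (all-ones or all-zeros lanes), then blend two candidate
// results with it. Without embedded masking, the compare and the blend are separate
// instructions; AVX512 can fold the blend into the consuming instruction.
static Vector128<float> ClampNegativesToZero(Vector128<float> values)
{
    Vector128<float> mask = Vector128.GreaterThan(values, Vector128<float>.Zero);
    return Vector128.ConditionalSelect(mask, values, Vector128<float>.Zero);
}
```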
And guess what I’d get for assembly? You guessed it, our good friend
We can see it’s computing both the
AVX512 supports this implicitly via embedded masking, which means that instructions can include the masking operation as part of them rather than performing the operation separately and then doing the masking via
Here we still have a
That
dotnet/runtime#97529 also improved casting from
In addition to improvements that teach the JIT about entirely new architectures, there have also been a plethora of improvements that simply help the JIT to better employ SIMD in general.
One of my favorites is dotnet/runtime#92852, which merges consecutive stores into a single operation. Consider wanting to implement a method like
Pretty simple: we’re writing out each individual value. That’s a bit unfortunate, though, in that we’re naively then spending several
The developer has manually done the work of computing the value of the merged writes, e.g.
in order to be able to perform a single write rather than doing four individual ones. For this particular case, now in .NET 9, the JIT can automatically do this merging so the developer doesn’t have to. The developer just writes the code that’s natural to write, and the JIT does the heavy lifting of optimizing its output (note below the
dotnet/runtime#92939 improves this further by enabling longer sequences to similarly be merged using SIMD instructions.
Of course, you may then wonder, why wasn’t
Another nice improvement is dotnet/runtime#86811 from @BladeWise, which adds SIMD support for multiplying two vectors of
dotnet/runtime#103555 (x64, when AVX512 isn’t available) and dotnet/runtime#104177 (Arm64) also improve vector multiplication, this time for
It’s also evident, however, on higher-level benchmarks, for example on this benchmark for
This benchmark references the System.IO.Hashing nuget package. Note that we’re explicitly adding in a reference to the 8.0.0 version; that means that even when running on .NET 9, we’re using the .NET 8 version of the hashing code, yet it’s still significantly faster, because of these runtime improvements.
Some other notable examples:
Just like the JIT tries to elide redundant bounds checking, where it can prove the bounds check is unnecessary, it similarly does so for branching.
The ability to handle the relationships between branches is improved in .NET 9. Consider this benchmark:
The
We can see in the original code that the branch within the inlined
Just the one outer
Similar to the
Another PR that eliminated some redundant branches is dotnet/runtime#94563, which feeds information from value numbering (a technique used to eliminate redundant expressions by giving every unique expression its own unique identifier) into the building of PHIs (a kind of node in the JIT’s intermediate representation of the code that aids in determining a variable’s value based on control flow). Consider this benchmark:
This is allocating a
As such, there’s actually an extra branch not visible in the
But now on .NET 9, that branch (in fact, multiple redundant branches) is removed:
dotnet/runtime#87656 is another nice example and addition to the JIT’s optimization repertoire. As was discussed earlier, branches have costs associated with them. A hardware’s branch predictor can often do a very good job of mitigating the bulk of those costs, but there’s still some, and even if it were fully mitigated in the common case, a branch prediction failure can be relatively very costly. As such, minimizing branches can be very helpful, and if nothing else, turning branch-based operations into branchless ones leads to more consistent and predictable throughput, as it’s then less subject to the nature of the data being processed. Consider the following function that’s used to determine whether a character is a particular subset of whitespace characters:
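The shape of such a check (the method name is mine):

```csharp
// Branchy in source: four separate equality comparisons.
static bool IsWhitespaceSubset(char c) =>
    c == ' ' || c == '\t' || c == '\r' || c == '\n';
```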
On .NET 8, we get what you’d probably expect, a series of
On .NET 9, though, we now get this:
It’s now using a
Unfortunately, this also highlights that such optimizations, which are looking for a particular pattern, can get knocked off their golden path, at which point the optimization won’t kick in. In this case, there are several ways it can get knocked off. The most obvious is if there are too many values or if they’re too spread out, such that they can’t fit into the 32-bit or 64-bit bit mask. More interesting, if you switch it to instead use C# pattern matching (e.g.
A related optimization was added in dotnet/runtime#93521. Consider a function like the following, which is checking to see whether a character is a lower-case hexadecimal char:
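A sketch of such a check (illustrative; the post’s exact ranges may differ):

```csharp
// Two range tests, written the natural way.
static bool IsLowerHex(char c) =>
    (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f');
```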
On .NET 8, we get a comparison against
But on .NET 9, we instead get this:
Effectively the JIT has rewritten the condition as if I’d written it like this:
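For a pair of range checks like a digit/lower-hex-letter test, that rewritten form looks like this (illustrative):

```csharp
// Each range test becomes one subtraction plus one unsigned comparison:
// if c < '0', (uint)(c - '0') wraps around to a huge value and fails the <= test.
static bool IsLowerHex(char c) =>
    (uint)(c - '0') <= ('9' - '0') || (uint)(c - 'a') <= ('f' - 'a');
```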
which is nice, because it’s replaced two of the conditional branches with two (cheaper) subtractions.
The .NET garbage collector (GC) is a generational collector. That means it divides the heap up logically by object age, where “generation 0” (or “gen0”) are the newest objects that haven’t been around for very long, “gen2” are the objects that have been around for a while, and “gen1” are in the middle. This approach is based on the theory (that also generally plays out in practice) that most objects end up being very short-lived, created for some task and then quickly dropped, and conversely that if an object has been around for a while, there’s a really good chance it’ll continue to be around for a while. By partitioning up objects like this, the GC can reduce the amount of work it needs to do when it scans for objects to be collected. It can do a scan focused only on gen0 objects, allowing it to ignore anything in gen1 or gen2 and thereby make its scan much faster. Or at least, that’s the goal. If it were to only scan gen0 objects, though, it could easily think a gen0 object wasn’t referenced because it couldn’t find any references to one from other gen0 objects… but there may have been a reference from a gen1 or gen2 object. That would be bad. How does the GC deal with this then, having its cake and eating it, too? It colludes with the rest of the runtime to track any time its generational assumptions might be violated. The GC maintains a table (called the “card table”) that indicates whether an object in a higher generation might contain a reference to a lower generation object, and any time a reference is written such that there could end up being a reference from a higher generation to a lower one, this table is updated. 
Then when the GC does its scan, it only needs to examine higher generation objects if the relevant bit in the table is set (the table doesn’t track individual objects, just ranges of them, so it’s similar to a “Bloom filter”, where the lack of a bit means there’s definitely not a reference but the presence of a bit only means there might be a reference).
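The generational behavior itself is observable directly (a sketch; exact promotion depends on GC configuration and timing):

```csharp
// A freshly allocated object starts in gen0; surviving collections promotes it.
object o = new object();
Console.WriteLine(GC.GetGeneration(o)); // 0
GC.Collect();
Console.WriteLine(GC.GetGeneration(o)); // typically 1 now
GC.Collect();
Console.WriteLine(GC.GetGeneration(o)); // typically 2 now
```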
The code that’s executed to track the reference write and possibly update the card table is referred to as a GC write barrier. And, obviously, if that code is happening every time a reference is written to an object, you really, really, really want that code to be efficient. There are actually multiple different forms of GC write barriers, all specialized for slightly different purposes.
The standard GC write barrier is
dotnet/runtime#98166 helps the JIT do better in a certain case. If you have a static field of a value type:
the runtime implements that by having a box associated with that field for storing that struct. Such static boxes are always on the heap, so if you then do:
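Illustratively (the type and member names here are mine):

```csharp
struct Holder
{
    public object Obj;
}

static Holder s_holder; // the runtime stores this struct in a heap-allocated box

static void Store(object o) =>
    s_holder.Obj = o; // the destination is provably on the heap
```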
the JIT can prove that the cheaper unchecked write barrier may be used, and this PR teaches it that. Previously sometimes the JIT would be able to figure it out, but this effectively ensures it.
Another similar improvement comes from dotnet/runtime#97953. Here’s an example based on
Here as well we can see on .NET 8 it’s using the more expensive checked write barrier, but on .NET 9 the JIT has recognized it can use the cheaper unchecked write barrier:
dotnet/runtime#101761 actually introduces a new form of write barrier. Consider this:
Previously as part of copying that struct, each of those fields (represented by
Now in .NET 9, this PR added a new bulk write barrier, which can implement the operation more efficiently.
Making GC write barriers faster is good; after all, they’re used a lot. However, switching from the checked write barrier to the non-checked write barrier is a very micro optimization; the extra overhead of the checked variant is often just a couple of comparisons. A better optimization is avoiding the need for a barrier entirely! dotnet/runtime#103503 recognizes that
On .NET 8, we have two barriers; on .NET 9, zero:
Similarly, dotnet/runtime#102084 is able to remove some barriers on Arm64 as part of
For years, .NET has explored the possibility of stack-allocating managed objects. It’s something that other managed languages like Java are already capable of doing, but it’s also more critical in Java, which lacks the equivalent of value types (e.g. if you want a list of integers, that’d most likely be
The hardest part of stack allocating objects is ensuring that it’s safe. If a reference to the object were to escape and end up being stored somewhere that outlived the stack frame containing the stack-allocated object, that would be very bad; when the method returned, those outstanding references would be pointing to garbage. So, the JIT needs to perform escape analysis to ensure that never happens, and doing that well is extremely challenging. For .NET 9, the support was introduced in dotnet/runtime#103361 (and brought to Native AOT in dotnet/runtime#104411), and it doesn’t do any interprocedural analysis, which means it’s limited to only handling cases where it can easily prove the object reference doesn’t leave the current frame. Even so, there are plenty of situations where this will help to eliminate allocations, and I expect it’ll be expanded to handle more and more cases in the future. When the JIT does choose to allocate an object on the stack, it effectively promotes the fields of the object to be individual variables in the stack frame.
Here’s a very simple example of the mechanism in action:
On .NET 8, the generated code for
The generated code is allocating a new object, populating that object’s
The JIT has inlined the constructor, inlined accesses to the
Here’s another more impactful example. When it comes to performance optimization, it’s really nice when the right things just happen; otherwise, developers need to learn the minute differences between performing an operation this way or that way. Every programming language and platform has non-trivial amounts of such things, but we really want to drive the number of them down. One interesting case for .NET has had to do with structs and casting. Consider these two
Ideally, if you call them with a value type
This is one of those things that a developer would have to “just know,” and also fight against tooling like IDE0038 that pushes developers to write this code like in my first version, whereas for structs the latter ends up being more efficient. This work on stack allocation makes that difference go away, because the boxing that occurs as part of the first version is a quintessential example of the allocation the compiler is now able to stack allocate. On .NET 9, we now end up with this:
Improvements in inlining were a major focus of previous releases, and will likely be a major focus again in the future. For .NET 9, there weren’t a ton of changes, but there was one particularly impactful improvement.
As a motivating example, consider
Notably, it’s non-generic, and that’s a question we get asked about with some frequency. We chose not to make it generic for three reasons:
So,
on .NET 8 we get this for the
Two things in particular to notice here. First, we see there’s a
to my
I see this as part of the output:
In other words,
Now on .NET 9, thanks to dotnet/runtime#99265, it is inlined! The resulting assembly is too large for me to show here, but we can see the impact in the benchmark results:
and we can see it in the inlining report as successfully inlining.
Applications end up having different needs when it comes to memory management. Would you be willing to throw more memory at maximizing throughput, or do you care more about minimizing working set? How important is it that unused memory be returned to the system aggressively? Is your expected workload constant or ebbing and flowing in nature? The GC has long had lots of knobs for configuring behavior based on these kinds of questions, but none more apparent than the choice of whether to use the “workstation GC” or “server GC”.
By default, an application uses the workstation GC, though some environments (like ASP.NET) opt-in to using server GC automatically. You can explicitly opt-in in a variety of ways, including by adding
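For example, one common way to opt in is via a property in the project file; a typical snippet looks like this (standard MSBuild property shown):

```xml
<PropertyGroup>
  <ServerGarbageCollection>true</ServerGarbageCollection>
</PropertyGroup>
```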
The decision for which to use isn’t always so clear. Especially in the presence of containers, you frequently still care about really good throughput, but also don’t want to be spending memory uselessly. Enter “DATAS,” or “Dynamically Adapting To Application Sizes”. DATAS was introduced in .NET 8 and serves to narrow the gap between workstation and server GC, bringing server GC closer to workstation memory consumption. It dynamically scales how much memory is being consumed by server GC, such that in times of less load, less memory is being used. While DATAS shipped in .NET 8, it was only on by default for Native AOT-based projects, and even there it still had some issues to be sorted. Those issues have now been sorted (e.g. dotnet/runtime#98743, dotnet/runtime#100390, dotnet/runtime#102368, and dotnet/runtime#105545), such that in .NET 9, as of dotnet/runtime#103374, DATAS is now enabled by default for server GC.
If you have a workload where absolute best possible throughput is paramount and you’re ok with additional memory being consumed to enable that, you should feel free to disable DATAS, e.g. by adding this to your project file:
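The relevant MSBuild property is `GarbageCollectionAdaptationMode`; setting it to `0` turns DATAS off while leaving server GC enabled:

```xml
<PropertyGroup>
  <ServerGarbageCollection>true</ServerGarbageCollection>
  <GarbageCollectionAdaptationMode>0</GarbageCollectionAdaptationMode>
</PropertyGroup>
```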
While DATAS being enabled by default is a very impactful improvement for .NET 9, there are other GC-related improvements in the release as well. For example, when compacting heaps, the GC may end up sorting objects by addresses. For large numbers of objects, this sort can be relatively expensive, and it behooves the GC to speed up the sorting operation. For this purpose, several releases ago the GC incorporated a vectorized sorting algorithm called vxsort, which is effectively a quicksort with a vectorized partitioning step. However, it was only enabled for Windows (and only on x64). In .NET 9, it’s enabled for Linux as well as part of dotnet/runtime#98712. This helps to reduce GC pause times.
The .NET runtime provides many services to managed code. There’s the GC, of course, and the JIT compiler, and then there’s a whole bunch of functionality around things like assembly and type loading, exception handling, configuration management, virtual dispatch, interop infrastructure, stub management, and so on. All of that functionality is generally referred to as being part of the coreclr virtual machine (VM).
Many performance changes in this area are hard to demonstrate, but they’re still worth mentioning. dotnet/runtime#101580 lazily allocates some information related to method entrypoints, resulting in smaller heap sizes and less work on startup. dotnet/runtime#96857 also removed some unnecessary allocation related to the data structures around methods. dotnet/runtime#96703 reduced the algorithmic complexity of some key functions involved in building up method tables, while dotnet/runtime#96466 streamlined access to those tables, minimizing the number of indirections involved.
Another set of changes went into improving various calls from managed code into the VM. When managed code needs to call into the runtime, it has a couple of mechanisms it can employ. One is a “QCALL,” which is effectively just a P/Invoke /
But arguably the most impactful change in this area for .NET 9 is around exceptions. Exceptions are expensive and should be avoided where performance matters. But… just because they’re expensive doesn’t mean it’s not valuable to make them less expensive. And in fact, there are cases where it’s really worthwhile to make them less expensive. One of the things we sporadically observe in the wild are “exception storms.” Some failure happens, which causes another failure, which causes another. Each of those incurs exceptions. CPU consumption starts to spike as the overhead of those exceptions is incurred. Now other things start to time out because they’re getting starved, and they throw exceptions, which in turn causes more failures. You get the idea.
In Performance Improvements in .NET 8, I highlighted that in my opinion the single most important performance improvement in the release was a single character change, enabling dynamic PGO by default. Now in .NET 9, dotnet/runtime#98570 is similar, a super small and simple PR that belies all the work that came before it. Earlier on, dotnet/runtime#88034 had ported the Native AOT exception handling implementation over to coreclr, but it was disabled by default due to still needing bake time. It’s now had that bake time, and the new implementation is now on by default in .NET 9. And it’s much faster. Things get better still with dotnet/runtime#103076, which removes a global spinlock involved in the handling of exceptions.
We frequently say “the runtime,” but in reality there are currently multiple runtime implementations in .NET. “coreclr” is the runtime thus far referred to, which is the default runtime used on Windows, Linux, and macOS, and for services and desktop applications, but there’s also “mono,” which is mainly used when the form factor of the target application requires a small runtime: by default, it’s the runtime that’s used when building mobile apps for Android and iOS today, as well as the runtime used for Blazor WASM apps. mono has also seen a multitude of performance improvements in .NET 9:
Native AOT is a solution for generating native executables directly from .NET applications. The resulting binary doesn’t require .NET to be installed and does not require JIT’ing; instead it contains all of the assembly code for the whole app, inclusive of the code for any core library functionality accessed, the assembly for the garbage collector, and so on. Native AOT first shipped in .NET 7 and was then significantly improved for .NET 8, in particular around reducing the size of the resulting applications. Now in .NET 9, investment in Native AOT continues, with some very nice fruits from that labor. (Note that the Native AOT tool chain uses the JIT to generate assembly code, so most of the code generation improvements discussed in the JIT section and elsewhere in this post accrue to Native AOT as well.)
One of the biggest concerns for Native AOT is size and trimming. Native AOT-based applications and libraries compile everything, all user code, all the library code, the runtime, everything, into the single native binary. It’s thus imperative that the tool chain goes to extra lengths to get rid of as much as possible in order to keep that size down. This can include being more clever about how the runtime stores the state necessary for execution. It can include being more thoughtful about generics in order to reduce the possible code size explosion that can result from lots of generic instantiations (effectively multiple copies of the exact same code all specialized for different type arguments). And it can include being very diligent about avoiding dependencies that can bring in lots of code unexpectedly and that the trimming tools are unable to reason about enough to remove. Here are some examples of all of these in .NET 9:
Previous releases saw a considerable amount of time spent on driving down binary sizes, but these kinds of changes chip away at them even further. Let’s create a new ASP.NET minimal APIs application using Native AOT. This command uses the
Replace the contents of the generated
All I’ve done on top of the template’s defaults is have both
We can publish this app with Native AOT:
which yields:
We can see here that the whole site, web server, garbage collector, everything, are contained in
which results in:
Now, just by moving to the new version, that same
Beyond a focus on size, ahead-of-time compilation also differs from just-in-time compilation in that each has their own opportunities for unique optimizations. The JIT can see the exact details of the current machine and employ the best possible instructions based on what’s available (e.g. using AVX512 instructions on hardware that supports it), and the JIT can use dynamic PGO to evolve the code based on execution characteristic. But, Native AOT is capable of doing whole program optimization, where it can look at everything in the program and optimize based on the totality of everything involved (in contrast, a JIT’d .NET application may load additional .NET libraries at any point). For example, dotnet/runtime#92923 enables automatically making fields
dotnet/runtime#99761 provides a nice example where, based on whole program analysis, the compiler can see that a particular type is never instantiated. If it’s never instantiated, then type checks for that type will never succeed. And thus if a program has a check like
Whole program analysis is also applicable to devirtualization in really cool ways. With dotnet/runtime#92440, the compiler can now devirtualize all calls to a virtual method
Native AOT also has a super power in its ability to do pre-initialization. The compiler contains an interpreter that’s able to evaluate code at build time and replace that code with just the result; for some objects, it’s then also able to blit the object’s data into the binary in a way that it can be cheaply rehydrated at execution time. The interpreter is limited in what it’s able and allowed to do, but over time its capabilities are improving. dotnet/runtime#92470 extends it to support more type checks, static interface method calls, constrained method calls, and various operations on spans, while dotnet/runtime#92666 expands it to have some support for hardware intrinsics and the various
Since the beginning of .NET, general wisdom has been that the vast majority of code that needs to synchronize access to shared state should just use
Over 20 years after the introduction of .NET, that’s evolving, just a bit.
As is evident from this benchmark, the syntax for using both can be identical.
Note that C# 13 has special-recognition of
but even though the syntax is identical, here’s an equivalent of what’s generated for
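Roughly speaking, and assuming a simple critical section, locking on a plain object lowers to `Monitor` calls, while locking on the new `System.Threading.Lock` type lowers to its `EnterScope` method; a sketch:

```csharp
using System;
using System.Threading;

object monitorBased = new();
Lock lockBased = new();

int counter = 0;

// Locking on an object: the compiler emits Monitor.Enter/Monitor.Exit
// wrapped in try/finally.
lock (monitorBased) { counter++; }

// Locking on a System.Threading.Lock: C# 13 recognizes the type and
// lowers the statement to EnterScope/Dispose instead.
lock (lockBased) { counter++; }

// Approximately what the compiler generates for the statement above:
Lock.Scope scope = lockBased.EnterScope();
try { counter++; }
finally { scope.Dispose(); }

Console.WriteLine(counter); // 3
```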
We’ve also started using
Of course, while locks are the default recommendation for synchronization, there’s still a lot of code that demands the higher throughput and scalability that comes from more lock-free programming, and the workhorse for such implementations is
Even with those new overloads, though, there are still places it’s desirable to use
Another place it’d be really nice to use
Now in .NET 9, as of dotnet/runtime#104558 the generic
This is not only good for usability, it’s good for performance in a few ways. First, it enables performance improvements like the
but these are really just
Any additional padding aside, that reduces 12 bytes down to 3 bytes on the object.
Also related to
You’ll see a very similar loop any time you want to use optimistic concurrency to create a new value and substitute it for the original in an atomic manner. The actual
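As a sketch of the shape of such a loop, here’s a hypothetical helper that atomically adds with saturation, retrying until its `CompareExchange` succeeds in publishing the update:

```csharp
using System;
using System.Threading;

int total = 0;

AddSaturating(ref total, 70, max: 100);
AddSaturating(ref total, 70, max: 100);
Console.WriteLine(total); // 100

// Hypothetical helper: atomically add, clamping at max. This is the standard
// optimistic-concurrency loop: read, compute the desired new value, attempt
// to publish it with CompareExchange, and retry if another thread won.
static void AddSaturating(ref int location, int amount, int max)
{
    int current = Volatile.Read(ref location);
    while (true)
    {
        int desired = Math.Min(max, current + amount);
        int witnessed = Interlocked.CompareExchange(ref location, desired, current);
        if (witnessed == current)
        {
            return; // success: our update was published
        }
        current = witnessed; // lost the race: retry against the observed value
    }
}
```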
such that
The approach employed by
And luckily, the JIT can see that it’s ignoring the result. As such, on x86/64, the JIT can use the optimal sequence when it can see that the result isn’t being used, and even if it is being used, it can still emit a slightly more concise instruction sequence than would have naturally resulted from our open-coded implementation:
Locks and interlocked operations are about coordinating between operations, at a relatively low level. There are higher level coordination constructs as well; that’s effectively what
and you can then explicitly join with that returned task to observe any exceptions it may have incurred or consume its result value if it has one. But what do you then do to join with the remaining two tasks? You might end up writing code something like this:
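Something along these lines (a sketch; note the repeated scans):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

List<Task<int>> tasks = new()
{
    Task.FromResult(1),
    Task.FromResult(2),
    Task.FromResult(3),
};

int sum = 0;
while (tasks.Count > 0)
{
    // WhenAny scans every remaining task on each iteration, and Remove is
    // another linear scan, making the overall loop O(N^2).
    Task<int> completed = await Task.WhenAny(tasks);
    tasks.Remove(completed);
    sum += completed.Result;
}

Console.WriteLine(sum); // 6
```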
That’s not terribly hard, but it’s also not terribly efficient. Or, rather, for larger numbers of tasks, it’s terribly inefficient, as it’s an
Enter
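`Task.WhenEach`, new in .NET 9, yields each task as it completes; a sketch:

```csharp
using System;
using System.Threading.Tasks;

Task<int>[] tasks =
{
    Task.FromResult(1),
    Task.FromResult(2),
    Task.FromResult(3),
};

int sum = 0;

// Task.WhenEach hands back each task as it completes, so every completed
// task is processed exactly once, with no repeated scans of what remains.
await foreach (Task<int> completed in Task.WhenEach(tasks))
{
    sum += await completed;
}

Console.WriteLine(sum); // 6
```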
It’s a little hard to get a good apples-to-apples comparison of the overhead here, but this benchmark is a reasonable approximation:
There are some other interesting performance improvements in threading in .NET 9 as well.
Reflection is a very powerful (though sometimes overused) capability of .NET that enables code to load and introspect .NET assemblies and invoke their functionality. It is used in all manner of library and application, including by the core .NET libraries themselves, and it’s important that we continue to find ways to reduce the overheads associated with reflection.
Several PRs in .NET 9 whittle away at some of the allocation overheads in reflection. dotnet/runtime#92310 and dotnet/runtime#93115 avoid some defensive array copies by instead handing around
that’ll print out
that’ll print out:
However, that
Reflection is particularly important with libraries involved in dependency injection, where object construction is frequently done in a more dynamic fashion.
The aforementioned
Of course, if you can avoid using these more expensive reflection approaches in the first place, that’s very desirable. One reason for using reflection is to access private members of other types, and while that’s a scary thing to do and generally something to be avoided, there are valid cases for it where having an efficient solution is highly desirable. .NET 8 added such a mechanism in
I get this:
However, in .NET 8, this mechanism could only be used with non-generic members. Now in .NET 9, thanks to dotnet/runtime#99468 and dotnet/runtime#99830, this capability now extends to generics, as well.
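As a sketch of what that enables (hypothetical `Container` type; the accessor grants direct, reflection-free access to a private field of a generic type):

```csharp
using System;
using System.Runtime.CompilerServices;

Container<int> container = new();
Accessors.GetValue(container) = 42;
int stored = Accessors.GetValue(container);
Console.WriteLine(stored); // 42

internal static class Accessors
{
    // In .NET 9, [UnsafeAccessor] can target members of generic types; the
    // accessor's generic parameters mirror those of the target type.
    [UnsafeAccessor(UnsafeAccessorKind.Field, Name = "_value")]
    public static extern ref T GetValue<T>(Container<T> container);
}

// Hypothetical type with private state, standing in for a real library type.
internal sealed class Container<T>
{
    private T _value = default!;
}
```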
Parsing that occurs as part of reflection, and in particular as part of type names, was also improved as part of some work to consolidate type name parsing into a reusable component. dotnet/runtime#100094‘s primary purpose wasn’t to improve performance, but it ended up doing so, anyway.
And then there are more intrinsics. In compiler speak, an “intrinsic” is something the compiler has “intrinsic” knowledge of, a fancy way of saying it’s something the compiler implicitly knows about. This typically manifests as a method whose implementation is provided by the compiler, sometimes always or sometimes based on certain conditions. For example,
Several new members became intrinsics in .NET 9. dotnet/runtime#96226 turns
With
or
based on the nature of
on .NET 8 we end up with over 250 bytes of assembly code for implementing this operation. On .NET 9, we get just this:
The magic of intrinsics.
The core data types in .NET sit at the very bottom of the stack and are used everywhere. It’s thus a desire every release to whittle away at any overheads we can avoid. .NET 9 is no exception, where a multitude of PRs have gone into reducing overheads of various operations on these core types.
Consider
such that its success path might include the failure path from some number of these primitives’
Both
The JIT is able to do a bit better job with the latter, for the former producing:
but for the latter producing:
The net result is a nice improvement to these operations, e.g.
Various operations on the primitive types were also improved across a plethora of PRs:
Not exactly a “primitive” type, but in the same ballpark, is
dotnet/runtime#91176 from @Rob-Hague improved
Parsing is another common way of creating
This isn’t the first time efforts have been made to improve
Once you have a
dotnet/runtime#92208 from @kzrnm also improved
Lastly, in addition to parsing,
Numerics has been a big focus for .NET over the last several releases. A large stable of numerical operations is now exposed on every numerical type as well as on a set of generic interfaces those types implement. But sometimes you want to perform the same operation on a set of values rather than on an individual value, and for that, we have
which provides the hyperbolic cosine of one
In .NET 8,
such that it can be used with
A huge number of APIs is available, most of which see similar or better gains over the simple loop. Here’s what’s currently available in .NET 9, all as generic methods, and with multiple overloads available for most:
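As a sketch of the shape of these generic APIs (from the System.Numerics.Tensors library):

```csharp
using System;
using System.Numerics.Tensors;

// Each TensorPrimitives method processes entire spans at a time,
// vectorizing under the covers where possible.
double[] x = { 1, 2, 3, 4 };
double[] y = { 10, 20, 30, 40 };
double[] destination = new double[4];

TensorPrimitives.Add<double>(x, y, destination); // element-wise: 11, 22, 33, 44
double sum = TensorPrimitives.Sum<double>(x);    // 10

Console.WriteLine($"{destination[3]} {sum}"); // 44 10
```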
The possible speedups are even more pronounced on other operations and data types; for example, here is a manual implementation of hamming distance on two input
A slew of PRs went into making this happen. The generic method surface area was added via dotnet/runtime#94555, dotnet/runtime#97192, dotnet/runtime#97572, dotnet/runtime#101435, dotnet/runtime#103305, and dotnet/runtime#104651. And then many more PRs added or improved vectorization, including dotnet/runtime#97361, dotnet/runtime#97623, dotnet/runtime#97682, dotnet/runtime#98281, dotnet/runtime#97835, dotnet/runtime#97846, dotnet/runtime#97874, dotnet/runtime#97999, dotnet/runtime#98877, dotnet/runtime#103214 from @neon-sunset, and dotnet/runtime#103820.
As part of all of this work, there was also a recognition that we had the scalar operations and we had the operations on an unbounded number of elements as part of spans, but doing the latter efficiently required effectively having the same set of operations on the various
Other related numerical types have also seen improvements. Quaternion multiplication was vectorized in dotnet/runtime#96624 by @TJHeuvel, and dotnet/runtime#103527 accelerated a variety of operations on
dotnet/runtime#102301 also moves a lot of the implementation for types like
As previously noted in Performance Improvements in .NET 8 and earlier in this post, my single favorite performance improvement in .NET 8 came from enabling dynamic PGO. But my second favorite improvement came from the introduction of
The
If you print out the name of the type of the instance returned by that
That type provides a specialization of
So, in .NET 8,
This also highlights another interesting difference from the existing
But, let’s say we did want to perform such a search, without that functionality existing in the core libraries. One approach is to simply walk through the input, position by position, comparing each of the target values at that location:
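Such a position-by-position scan might be sketched as follows (hypothetical helper; searching case-insensitively for day names, per the surrounding discussion):

```csharp
using System;

string[] dayNames =
{
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
};

// At every position, try each needle in turn:
// O(input length * needle count * needle length).
static int IndexOfAnyDay(ReadOnlySpan<char> text, string[] needles)
{
    for (int i = 0; i < text.Length; i++)
    {
        foreach (string needle in needles)
        {
            if (text.Slice(i).StartsWith(needle, StringComparison.OrdinalIgnoreCase))
            {
                return i;
            }
        }
    }

    return -1;
}

int index = IndexOfAnyDay("It was a bright sunday morning.", dayNames);
Console.WriteLine(index); // 16
```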
Classic. Functional. And slow. This is doing a fair amount of work for every single character in the input, for each looping over every day name and doing a comparison. How can we do better? First, we could try making the inner loop more efficient. Rather than iterating through the strings, we could hardcode our own switch:
The main benefit of this is it makes the
Much better, more than a 16x improvement. What if we instead just kept things simple and searched for each individual string using the already-optimized
Nice and simple, but…
Ouch. On the positive side, this approach benefits from vectorization, as the
yields this:
That means that whereas
Ok, so what if we changed approach, and instead searched for the first letter in each word in order to quickly skip past the locations that couldn’t possibly match. We could even use
In some situations, this is a very viable strategy; in fact, it’s a technique often employed by
Now, let’s try with
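`SearchValues<string>`, new in .NET 9, lets us build the multi-string searcher once, up front, and then reuse it for every subsequent search; a sketch:

```csharp
using System;
using System.Buffers;

// The expensive analysis of the needles happens once, in Create; each
// IndexOfAny call then uses the precomputed, vectorized searcher.
SearchValues<string> dayNames = SearchValues.Create(
    new[] { "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday" },
    StringComparison.OrdinalIgnoreCase);

int index = "It was a bright sunday morning.".AsSpan().IndexOfAny(dayNames);
Console.WriteLine(index); // 16
```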
The functionality is built-in, so we haven’t had to write any custom logic other than the call to
Not only simpler, then, but also several times faster than the fastest result we’d previously managed, and ~105x faster than our original attempt. Sweet!
How does this all work? The algorithms behind it are quite fascinating. As with
Beyond those special cases, it starts to get really interesting. There’s been a lot of research done over the last 50 years for the most efficient ways to perform a multi-string search. One popular algorithm is Rabin-Karp, which was created by Richard Karp and Michael Rabin in the 1980s, and which works via a “rolling hash.” Imagine creating a hash of the first N characters in the haystack (input) text, where N is the length of the needle (the substring) for which you’re searching, and comparing the haystack hash against the needle hash; if they match, do the actual full comparison at that location, otherwise continue. Then update the hash by removing the first character and adding the next character, and repeat the check. And then repeat, and repeat, and so on. Each time you move forward, you’re just updating the hash via a fixed number of operations, meaning that all of the updates to the hash function for the whole operation are only
This supports one needle; extending it to support multiple needles can be accomplished in a variety of ways, such as by bucketing needles by their hash codes (akin to what a hash map does), and then either checking all needles in the corresponding bucket when there’s a hit, or further reducing what needs to be checked by using a Bloom filter or similar technique.
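A toy, scalar sketch of the rolling-hash idea (hypothetical helper; real implementations choose hash constants and multi-needle handling much more carefully):

```csharp
using System;

// The hash of each window is derived from the previous window's hash in
// O(1), so hashing every window of the haystack is O(haystack length),
// not O(window length * haystack length).
static int RabinKarpIndexOf(string haystack, string needle)
{
    const int B = 31; // hash base; arithmetic deliberately wraps (unchecked)
    int n = needle.Length;
    if (n == 0 || n > haystack.Length) return n == 0 ? 0 : -1;

    int pow = 1; // B^(n-1), used to remove the outgoing character
    for (int i = 1; i < n; i++) pow *= B;

    int needleHash = 0, windowHash = 0;
    for (int i = 0; i < n; i++)
    {
        needleHash = needleHash * B + needle[i];
        windowHash = windowHash * B + haystack[i];
    }

    for (int i = 0; ; i++)
    {
        // Only do the full comparison when the cheap hash check passes.
        if (windowHash == needleHash &&
            haystack.AsSpan(i, n).SequenceEqual(needle))
        {
            return i;
        }

        if (i + n >= haystack.Length) return -1;

        // Roll: drop haystack[i], append haystack[i + n].
        windowHash = (windowHash - haystack[i] * pow) * B + haystack[i + n];
    }
}

int idx = RabinKarpIndexOf("wednesunday", "sunday");
Console.WriteLine(idx); // 5
```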
Another popular algorithm is Aho-Corasick, which was designed by Alfred Aho and Margaret Corasick even earlier, in the 1970s. Its primary purpose is multi-string search, enabling a match to be found in time linear in the length of the input, assuming a fixed set of needles. It works by building up a form of a trie, a finite automaton where you start at the root of the graph and transition to children based on matching the character associated with the edge to that child. But, it extends a typical trie with additional edges between nodes that can be used as fallbacks. Consider the automaton for the days of the week discussed earlier: given the input text “wednesunday”, this will start at the root, progress through the “w”, “we”, “wed”, “wedn”, “wedne”, and “wednes” nodes, but then upon encountering the subsequent ‘u’ and not being able to progress down that path, it’ll employ the fallback link over to the “s” node, at which point it’ll be able to continue down through “su”, “sun”, and so on, until it hits the leaf “sunday” node and can declare success. Aho-Corasick efficiently supports larger sets of needles, and is the go-to implementation
The real workhorse of
Earlier, I gave a rough summary of how the
Rather than doing two individual searches, we can perform a single search:
This is searching “War and Peace” for both “warning” and “error”, but even though both appear in the text, such that the second search for “error” in the original code will never happen, the
Beyond
The same thing applies with four characters: instead of doing four vector comparisons and three OR operations to combine them, we can do a single OR on the input to mix in
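The trick can be illustrated in scalar form (a sketch; the real code applies the same idea across whole vectors at a time):

```csharp
using System;

// ASCII letters differ between cases only in bit 0x20, so ORing that bit in
// folds 'A'..'Z' onto 'a'..'z', letting a single comparison cover both
// cases. (Real vectorized implementations must also guard against
// non-letters that alias under the fold, e.g. '[' | 0x20 yields '{'.)
static bool EqualsIgnoreAsciiCase(char c, char lowercaseLetter)
    => (char)(c | 0x20) == lowercaseLetter;

bool upper = EqualsIgnoreAsciiCase('A', 'a'); // true
bool lower = EqualsIgnoreAsciiCase('a', 'a'); // true
bool other = EqualsIgnoreAsciiCase('!', 'a'); // false
Console.WriteLine($"{upper} {lower} {other}");
```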
Other
That same PR also made another significant improvement related to the probabilistic map: not using it as much. It’s a terrific implementation for some sets of inputs, but for others it can end up performing poorly. .NET 8 included a
The needle here includes all of the characters in the Greek and Greek Extended Unicode blocks, approximately 400 characters. With the way the probabilistic map builds up its filter bitmap, every single bit in the bitmap ends up being set, which means every examined character will fall back to the expensive path. Now in .NET 9, it’ll use a simpler, non-probabilistic bitmap, and even though it’s not vectorized, it yields significantly faster throughput.
dotnet/runtime#96931 also extended this probabilistic map support to benefit from AVX512, such that when the probabilistic map implementation is used, it can be significantly faster. Previously, its implementation would utilize 128-bit or 256-bit vectors, depending on hardware support, but now in .NET 9, it can also use 512-bit vectors. Not only does this potentially double throughput due to the wider vectors, but AVX512 also includes some applicable instructions that the older instruction sets don’t have (e.g.
Further, while the probabilistic map implementations were previously vectorized for
In many of my examples, I’ve used
Improvements around
It’s also worth highlighting that there have been improvements around
What’s interesting about this is the
Why then is this extra
if the JIT can’t prove the argument is a constant or if it’s not exactly one character in length, or:
if it can. This eliminates a bit of overhead from the call.
Regular expression support in .NET has received a lot of love over the past few years. The implementation was overhauled in .NET 5 to yield significant performance gains, and then in .NET 7 it not only saw another round of huge performance gains, it also gained a source generator, a new non-backtracking implementation, and more. In .NET 8, it saw additional performance improvements, in part because of using
Now in .NET 9, the trend continues. First and foremost, it’s important to recognize that many of the changes discussed thus far implicitly accrue to
There are multiple engines backing
As of dotnet/runtime#98791, dotnet/runtime#103496, and dotnet/runtime#98880, all of the engines other than the interpreter avail themselves of the new
In Visual Studio, you can right-click on
The code is using an
The source generator is employing the approach we looked at earlier, searching for a single character from each string. Here it’s decided that its best chance for an optimal search is to look for the character at offset 5 in each string, so ‘y’ for “Monday”, ‘a’ for “Tuesday”, etc., plus looking for the upper-case variants since
Now, here’s what we get in .NET 9:
Again, we see an
The sharp-eyed amongst you might notice that there’s no ‘y’ at the end of “Wednesday”; that’s simply due to a heuristic in the
we still end up with a
Interestingly, as of dotnet/runtime#96402,
On .NET 8, the code generated for
whereas on .NET 9, the generated code is:
And the results:
This highlights that using
My seeming obsession with
dotnet/runtime#93190 is a nice addition. One of the optimizations introduced to
The pattern starts with a word character loop. We don’t have a good way to vectorize a search for any word character, nor would we really want to; there are over 50,000 word characters that are part of the
There were, however, a number of gaps in this optimization. Most notably, the implementation needs to examine the pattern to determine whether it’s applicable. If the starting loop was wrapped in a capture or an atomic group, it was unnecessarily giving up and would fail to discover the loop for the purposes of enabling the “literal-after-loop” mechanism. The search would also give up if the literal after the loop was a set inside of various grouping constructs, like a concatenation.
This PR fixed those gaps. The impact of this can be seen by looking at another industry benchmark, this time from the BurntSushi/rebar repo:
The impact of the literal-after-loop optimization ends up being obvious in the resulting numbers:
Improvements in
What jumped out about this pattern is that it should trigger an optimization in the regex source generator that emits alternations like this as a C#
The non-backtracking engine also got some nice attention in .NET 9. dotnet/runtime#102655 from @ieviev (who submitted a small subset of the changes they’d made as part of some exciting regex research being done in a fork of the library), followed by dotnet/runtime#104766 and dotnet/runtime#105668, made a variety of changes to the non-backtracking implementation, including:
The net result of these changes is most patterns get faster, some significantly, especially on non-ASCII inputs.
The dotnet/runtime repo employs an automated performance regression testing system, with tests in dotnet/performance constantly running on various operating systems and hardware, with the goal of detecting regressions. When a possible regression is noticed, an issue is opened containing the details. However, the system also notices statistically significant improvements and opens issues for those as well, just to ensure that we’re all aware of when and how things change in a meaningful way. When possible, the issues reference the PR known to have caused the regression or improvement, so it’s always a treat to see a list of references like this on a PR, as was the case with dotnet/runtime#102655:
Both the non-backtracking engine and the interpreter now also gain additional optimized searching for certain classes of prefixes they didn’t previously support. With dotnet/runtime#100315, patterns that begin with ranges can now be optimized with an
Finally,
Base64 encoding has been supported in .NET since the beginning, with methods like
Base64 is a fairly simple encoding scheme for taking arbitrary binary data and converting it to ASCII text, splitting the input up into groups of 6 bits (2^6 == 64 possible values) and mapping each of those values to a specific character in the Base64 alphabet: the 26 upper-case ASCII letters, the 26 lower-case ASCII letters, the 10 ASCII digits,
While .NET has had Base64 support for a long time, it hasn’t had Base64Url support, and as such, developers have had to craft their own. Many have done so by layering on top of the Base64 implementations in
We can obviously write more code to do that more efficiently, but with .NET 9 we don’t have to. With dotnet/runtime#102364, .NET now has a fully-featured
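A quick sketch of the difference between the two alphabets, using bytes whose sextets land on the characters that differ:

```csharp
using System;
using System.Buffers.Text;

byte[] data = { 0xFB, 0xFF, 0xFE };

// Classic Base64 uses '+' and '/' (plus '=' padding), characters that are
// problematic in URLs; Base64Url substitutes '-' and '_' and omits padding.
string classic = Convert.ToBase64String(data); // "+//+"
string url = Base64Url.EncodeToString(data);   // "-__-"

Console.WriteLine($"{classic} vs {url}");
```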
This also benefits from a set of changes that improved the performance of
Another simpler form of encoding is hex, effectively employing an alphabet of 16 characters (for each group of 4 bits) rather than 64 (for each group of 6 bits). .NET 5 introduced the
where
Before .NET 5 introduced
There are a variety of reasons for that difference, including the obvious one that the
In the other direction,
The introduction of
One great example of this is the new C# 13 support for “params collections,” which merged into the C# compiler’s main branch in dotnet/roslyn#72511. This feature enables the C#
the IL the C# compiler generates for
This is using the
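As a sketch of the feature from the caller’s perspective (hypothetical `Sum` method):

```csharp
using System;

// C# 13 "params collections": params is no longer limited to arrays. With
// params ReadOnlySpan<T>, the compiler can place the arguments in
// stack-allocated storage rather than allocating an array per call.
static int Sum(params ReadOnlySpan<int> values)
{
    int sum = 0;
    foreach (int value in values) sum += value;
    return sum;
}

// Call sites look exactly like classic params-array calls.
int total = Sum(1, 2, 3, 4);
Console.WriteLine(total); // 10
```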
The C# compiler has also improved around spans in other ways. For example, dotnet/roslyn#71261 extends the assembly data support for initializing arrays and
the compiler will generate code along the lines of the following:
The compiler has taken that char data and blit it into the assembly; then when it creates the array, rather than setting each individual value into the array, it just copies that data directly from the assembly into the array. Similarly, if you have:
the compiler recognizes that all of the data is constant and is being stored into a “read-only” location, so it doesn’t actually need to allocate an array. Instead, it emits code like:
which effectively creates a span that points directly into the assembly data; no allocation and no copy needed. However, if you have this:
or this:
you’d get codegen more like this:
But now, thanks to dotnet/roslyn#71261, that last example will also be unified with the same approach for the other constructions, resulting in code more like this:
(the compiler will actually generate a
The C# compiler has also improved its ability to avoid allocations when creating
got lowered into something more like this:
The optimization at the time was limited to only single-byte primitive types because of endianness concerns, but .NET 7 added a
gets lowered into something more like this:
Lovely. But… what about types that are supported as constants at the C# level but that aren’t blittable in this fashion? That includes
which are lowered to, well, themselves, such that there’s still an allocation. Or, at least there was. Thanks to dotnet/roslyn#69820, these cases are now handled as well. They’re addressed by lazily allocating an array that’s then cached for all subsequent use. So now, that same example gets lowered into the equivalent of something more like this:
There are, of course, many more span-related improvements in the libraries, too. One improvement for an existing span-related method is dotnet/runtime#103728, which further optimizes
New span-related functionality also shows up in .NET 9. String splitting is an operation that’s used all over the place; a search for “.Split(” in C# code on GitHub yields millions of hits, and data from a variety of sources suggests that just the simplest overload
The devil is in the details, of course, and it’s taken a long time to figure out exactly how it should be exposed. There are largely two different use cases for splitting we see in the wild. One case is where the content being split has an expected or max number of segments, and splitting is used to extract them. For example,
The
With that, this same operation can be written as:
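The original sample was lost in transcription; here is a sketch of the kind of usage being described, using the span-based Split overload that writes Ranges into a caller-supplied buffer (available since .NET 8) rather than allocating a string[]:

```csharp
using System;

// Split a date into a known number of segments without allocating.
// The Ranges land in a stack-allocated buffer; the segments are sliced
// out of the original span, so no substrings are created either.
ReadOnlySpan<char> date = "2024-08-15";
Span<Range> parts = stackalloc Range[3];
int count = date.Split(parts, '-');

int year  = int.Parse(date[parts[0]]);
int month = int.Parse(date[parts[1]]);
int day   = int.Parse(date[parts[2]]);
Console.WriteLine($"{count} segments: {year}/{month}/{day}");
```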
In doing so, it becomes allocation-free, as this
whereas that wouldn’t be possible if
You might notice that there’s no
There are several possible performance problems here. Imagine the
The net result is that such operations can be significantly more efficient while not sacrificing much if anything in the way of maintainability.
The nature of this new set of splitting APIs is that they find just the next separator / segment; that’s both practical and possibly a performance improvement by itself. It’s practical because we’re only yielding a single segment at a time, and we don’t have anywhere to store all possible found separator positions (nor do we want to allocate space to do so). And it’s desirable because the consumer may early-exit from the consuming loop, in which case we don’t want to have spent time unnecessarily searching for additional segments that are going to be ignored. The existing set of splitting APIs, however, hand back all found segments in one go, either via a returned
Spans show up in other new methods as well. dotnet/runtime#93938 from @TheMaximum added new overloads of
This is having to perform two
but if
and, boom, no more allocation for the arguments.
A variety of other improvements were made to
More interesting to me than these nice gains is the code that was generated to achieve them. This is what the assembly for this benchmark looked like with .NET 8:
Pretty straightforward, a bit of argument manipulation and then jumping to the actual
Notice there’s no call to
Finally, we’ve not talked much about arrays separate from spans, but there have been improvements there as well. dotnet/runtime#102739 and dotnet/runtime#104103 move more logic for array handling from native code in the runtime up into C# code in
In addition to other benefits that come from moving such logic into managed code (better maintainability, more implicitly safe code, reduced overhead from transitioning between managed and native, etc.), there’s another less obvious benefit: impact on GC pause times. And that’s nowhere more obvious than with dotnet/runtime#98623, which moved the implementations of
This is sitting in a loop that simply times how long it takes to perform 10 gen2 collections, each spaced out by ~15 ms. If each collection were free then, this loop should take ~150 ms. Since it’s not free, let’s round up and estimate that the loop should be around ~200 ms. Before we run the loop, though, we launch a thread that just sits in an infinite loop filling a span. That shouldn’t mess with our timing loop… or should it? When I run this on .NET 8, I get values like this:
Those values are in seconds, and that’s approximately 5x larger than we’d predicted. Now I try on .NET 9, and I get results like this:
What happened? In order to do some of its work, the GC needs to be able to get a consistent view of the world, which is violated if things are concurrently changing out from under it. As such, it may need to temporarily suspend all threads in the process, but to do that, it needs to wait for each thread to get to a safe point, and if a thread is executing code in the runtime, that can be hard to do. In this particular case, there’s a thread spending almost all of its time sitting in a call to
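The benchmark this section refers to isn’t reproduced above, so here is a rough sketch of its shape (sizes and timings chosen arbitrarily, not the original code): a background thread spends nearly all of its time inside Span.Clear over a large buffer while we time ten spaced-out gen2 collections.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Background thread that sits almost continuously in a long-running span clear.
byte[] buffer = new byte[100_000_000];
var background = new Thread(() =>
{
    while (true) buffer.AsSpan().Clear();
}) { IsBackground = true };
background.Start();

// Time 10 gen2 collections, each spaced out by ~15 ms. If collections were
// free, this loop would take ~150 ms; suspension delays inflate that.
var sw = Stopwatch.StartNew();
for (int i = 0; i < 10; i++)
{
    Thread.Sleep(15);
    GC.Collect(2);
}
sw.Stop();
Console.WriteLine(sw.Elapsed);
```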
Language Integrated Query, or LINQ, is a mainstay of .NET. At its heart, LINQ is a specification for hundreds of overloads of methods that manipulate data, and then implementations of that specification for different types. One of the most prominent implementations comes from
One of the more sweeping LINQ changes in .NET 9 has to do with how various optimizations are implemented. In the original implementation of LINQ circa 2007, almost every method was logically independent from every other. A method like
And then
and that
This is useful because it allows for avoiding one of the major sources of overhead with enumerables. Without this optimization, if I had
Over the last decade with .NET, those optimizations have been significantly extended, and in some cases to much greater benefit than just saving a few interface calls. For example, in a previous .NET release, a similar mechanism was used to special-case
This additional special-casing was achieved with internal interfaces in the library. An
dotnet/runtime#98969 and dotnet/runtime#99344 remove those internal interfaces and consolidate all of their members down to the base
dotnet/runtime#97905, dotnet/runtime#97956, dotnet/runtime#98874, and dotnet/runtime#99216 also added more implementations.
Subsequent PRs also further benefited from this consolidation. dotnet/runtime#99218, for example, uses it to improve
Another cross-cutting improvement across LINQ comes in dotnet/runtime#96602 and has to do with empty inputs. It’s also a nice example of how what’s considered an optimization ebbs and flows. In the beginning of LINQ,
dotnet/runtime#98963 also has to do with emptiness, but actually improves non-empty cases.
Another change taking advantage of emptiness is dotnet/runtime#99256, this time for
The statement about checking for empty now and permanently applies in particular to methods that accept and return enumerables. It’s the laziness of these methods that makes that relevant. There is, however, a set of LINQ methods that are not lazy because they produce things that aren’t enumerables, such as
dotnet/runtime#97004 from @neon-sunset uses that same mechanism to improve performance for
dotnet/runtime#104365 from @andrewjsaid followed up on this to use that same
A few more tweaks were made to
Not to be left out from the fun its
dotnet/runtime#96605 updated
The aforementioned special code paths for the primitive types also support vectorization. Previously that vectorization only supported 128-bit and 256-bit vector widths, but as of dotnet/runtime#93369 from @Spacefish, it now also supports 512-bit vector widths, possibly doubling the throughput of
One caveat here about AVX512. Some AVX512 hardware, even on recent chips, can take a measurable amount of time to “power up,” such that it might be tens, hundreds, or even thousands of cycles before AVX512 processing ends up actually dispatching 512-bit vectors. Until then, the hardware might end up doing the equivalent of dispatching two 256-bit vectors. On my machine, for example, if I lower the size in the previous benchmark from 10,000 elements to 1,000 elements, the .NET 9 improvement disappears and it ends up running at exactly the same throughput as on .NET 8; on a colleague’s machine with a different processor, even at 1,000 elements the .NET 9 throughput is almost twice that of .NET 8. This is all to say, your mileage may vary. In some of the micro-benchmarks discussed in this post, small improvements are made to already very fast operations, and the gains then come from those operations being done many, many, many times on hot paths. In others, the gains come from taking an expensive operation and making it measurably cheaper. In general the benefits with using AVX512 in these kinds of vectorized implementations come for the latter case, where large data sizes lead to operations taking significant amounts of time, and the use of 512-bit vectors instead of 256-bit vectors measurably speeds up those longer operations.
The
In general we favor using C# iterators over manual implementations (unless we’re going to go all out and implement all of the
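For readers unfamiliar with the term, a C# iterator is a method containing yield statements; the compiler generates the entire IEnumerable&lt;T&gt;/IEnumerator&lt;T&gt; state machine that would otherwise be written by hand. A small sketch (the `EvensUpTo` helper is illustrative, not from the original):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// One method with `yield return`; the compiler synthesizes the enumerable
// and enumerator types behind it.
IEnumerable<int> EvensUpTo(int max)
{
    for (int i = 0; i <= max; i += 2)
    {
        yield return i;
    }
}

int[] evens = EvensUpTo(6).ToArray(); // [0, 2, 4, 6]
Console.WriteLine(string.Join(",", evens));
```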
As shared in Performance Improvements in .NET 8,
One of the most common uses for a dictionary is as a cache, often indexed by a
This has been addressed in .NET 9 with the introduction of
I’m returning a
Note the distinct lack of a
The
Another possibly perplexing thing for anyone reading this and who’s well versed in the ways of
Of course, all of this working hinges on the supplied comparer sporting the appropriate
Note the huge reduction in allocation.
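The benchmark itself was lost in transcription; here is a minimal sketch of the .NET 9 alternate-lookup API in question, assuming a .NET 9 runtime: a string-keyed dictionary can be queried with a ReadOnlySpan&lt;char&gt;, so no temporary string needs to be allocated just to perform the lookup.

```csharp
using System;
using System.Collections.Generic;

// The default string comparer supports span-based alternate lookups in .NET 9.
var counts = new Dictionary<string, int> { ["hello"] = 1 };
var lookup = counts.GetAlternateLookup<ReadOnlySpan<char>>();

// Slice "hello" out of a larger input without allocating a substring.
ReadOnlySpan<char> span = "say hello".AsSpan(4);
lookup[span] = lookup.TryGetValue(span, out int current) ? current + 1 : 1;

int result = counts["hello"]; // 2
Console.WriteLine(result);
```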
For fun, we can also take this example one step further. .NET 6 introduced the
“But wait, there’s more!” dotnet/runtime#104202 extends the alternate comparer implementation for
dotnet/runtime#96573 from @ndsvw also identified a few places in various libraries where a
A variety of other collection types have also seen improvements in .NET 9:
It’s an important goal of the core .NET libraries to be as platform-agnostic as possible. Things should generally behave the same way regardless of which operating system or which hardware is being used, excepting things that really are operating system or hardware specific (e.g. we purposefully don’t try to paper over casing differences of different file systems). To that end, we generally implement as much as possible in C#, deferring down to the operating system and native platform libraries only when necessary; for example, the default .NET HTTP implementation,
There are, however, just a few specific places where we’ve actively made the choice to defer more to something in the platform. The most important case here is cryptography, where we want to rely on the operating system for such security-related functionality; so on Windows, for example, TLS is implemented in terms of components like
To simplify things, to improve consistency and performance across more platforms, and to move to an actively supported and evolving implementation, this changes for .NET 9. Thanks to a stream of PRs, and in particular dotnet/runtime#104454 and dotnet/runtime#105771, .NET 9 now includes the
Benchmarking just throughput is easy with BenchmarkDotNet. Unfortunately, while I love the tool, the lack of dotnet/BenchmarkDotNet#784 makes it very challenging to appropriately benchmark compression, because throughput is only one part of the equation. Compression ratio is also a key part of the equation (you can make “compression” super fast by just outputting the input without actually manipulating it at all), so we also need to know about compressed output size when discussing compression speeds. To do that for this post, I’ve hacked up just enough in this benchmark to make it work for this example, implementing a custom column for BenchmarkDotNet, but please note this is not a general-purpose implementation.
Running that for .NET 8, I get this:
and for .NET 9, I get this:
A few interesting things to note here:
The net effect of this is, especially if you’re using
Investments in
Let’s start with random number generation. .NET 8 added a new
The core implementation is very simple, and is just a convenience implementation for something you could easily do yourself:
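The snippet that followed was lost in transcription; as a hedged sketch of that straightforward approach (the `GetItemsSketch` name is illustrative, not the actual BCL source), it's just one round trip into the random number generator per element:

```csharp
using System;

// Naive per-element approach: each element costs one call into Random.
T[] GetItemsSketch<T>(Random random, ReadOnlySpan<T> choices, int length)
{
    var result = new T[length];
    for (int i = 0; i < length; i++)
    {
        result[i] = choices[random.Next(choices.Length)];
    }
    return result;
}

char[] id = GetItemsSketch(new Random(42), "ACGT".AsSpan(), 8);
Console.WriteLine(new string(id));
```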
Easy peasy. However, in some situations we can do better. This implementation ends up making a call to the random number generator for each element, and that roundtrip adds measurable overhead. If we could instead make fewer calls, we could amortize that overhead across however many elements could be filled by that single call. That’s exactly what dotnet/runtime#92229 does. If the number of choices is less than or equal to 256 and a power of two, rather than asking for a random integer for each element, we can instead get a byte for each element, and we can do that in bulk with a single call to NextBytes. The max of 256 choices is because that’s the number of values a byte can represent, and the power of two is so that we can simply mask off unnecessary bits from the byte, which helps to avoid bias. This makes a measurable impact for
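The bulk idea described above can be sketched as follows (the `GetItemsBulk` helper is illustrative, not the actual BCL implementation): one NextBytes call supplies all the randomness, and because the number of choices is a power of two, masking each byte yields an unbiased index.

```csharp
using System;
using System.Numerics;

char[] GetItemsBulk(Random random, ReadOnlySpan<char> choices, int length)
{
    if (choices.Length is 0 or > 256 || !BitOperations.IsPow2(choices.Length))
        throw new ArgumentException("requires a power-of-two number of choices <= 256");

    byte[] bytes = new byte[length];
    random.NextBytes(bytes);        // one call instead of `length` calls
    int mask = choices.Length - 1;  // e.g. 16 choices -> 0b1111

    var result = new char[length];
    for (int i = 0; i < length; i++)
    {
        // Masking keeps only the bits needed to index `choices`, avoiding bias.
        result[i] = choices[bytes[i] & mask];
    }
    return result;
}

char[] token = GetItemsBulk(new Random(42), "0123456789abcdef".AsSpan(), 32);
Console.WriteLine(new string(token));
```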
Sometimes performance improvements are about revisiting past assumptions. .NET 5 added a new
On the subject of memory, multiple PRs went into the crypto libraries to reduce allocation. Here are some examples:
Of course, improving performance is more than just avoiding allocation. A variety of changes helped in other ways.
dotnet/runtime#99053 “memoizes” (caches) various properties on
There were also several improvements related to loading certificates and keys. dotnet/runtime#97267 from @birojnayak addressed an issue on Linux where the same certificate was being processed multiple times rather than just once, and dotnet/runtime#97827 improved the performance of RSA key loading by avoiding some unnecessary work that the key validation was performing.
Quick, when was the last time you worked on a real application or service that didn’t involve networking at all? I’ll wait… (I’m so funny.) Effectively every modern application relies on networking in one way, shape, or form, especially one that’s following more cloud-native architectures, involving microservices, and the like. Driving down the costs associated with networking is something we take very seriously, and the .NET community whittles away at these costs every release, including in .NET 9.
In .NET 9, a few PRs focused on steady-state throughput, such as dotnet/runtime#95595, which addressed an issue where some packets were being unnecessarily split into two, leading to extra overhead associated with needing to send and receive that extra packet. This was particularly impactful when writing out exactly 16K, and especially on Windows (where I’ve run this test):
dotnet/runtime#100513 also reduced the cost of checking
However, the bigger impacts in .NET 9 weren’t on steady-state throughput but rather on TLS connection establishment, aka the handshake. Establishing a TLS connection requires the client and server to engage in a conversation where they agree on details like TLS version, what cipher suite to use, confirm the other is who they say they are, create and exchange dedicated symmetric keys for the communication, and so on. That’s a relatively expensive endeavor. For long-lived connections, that overhead is generally not a big deal, but there are plenty of scenarios where connections are more routinely established and torn down, and for those, we want to drive down the overhead associated with setting up such a connection.
dotnet/runtime#87874 focused on reducing allocations associated with the TLS handshake, by renting some buffers from
Of course, while driving down the costs of doing something is good, avoiding that thing altogether is even better. “TLS resumption” is a capability in the TLS protocol where, if a connection is closed and the same client later opens a new connection to the same server, the client may be able to effectively pick up where it left off with the previous TLS connection rather than starting a brand new one from scratch. Support for TLS resumption on Linux was added in .NET 7, but clients using client certificates weren’t supported… now in .NET 9 thanks to dotnet/runtime#102656, even such clients can benefit from this significant time saver.
TLS resumption is an optimization where information is stored to enable more efficient operation later. In some ways, it’s not unlike pooling in that regard. We frequently talk about pooling as a way to optimize. Often our conversations are around avoiding allocations, where employing a pool is betting that you can be more efficient than the garbage collector. For small, cheap to create objects, that’s often a bad bet. For larger objects, such as for larger arrays, it can be a good bet, which is why
However, as with any pool, the pool itself has cost. In the case of the HTTP connection pool in a
Of course once you’ve got the connection, there’s all of the costs associated with actually making the request and handling the response, and those have been whittled away at as well.
While this simple benchmark doesn’t touch on all of these changes, it does highlight that the end-to-end performance of HTTP requests gets cheaper:
Related to HTTP, the
There were also changes elsewhere in the networking stack that contribute to HTTP use cases. In dotnet/runtime#98074, for example,
Beyond raw HTTP, there were also some new features for
There are of course a variety of reasons that performance could have improved, e.g. maybe
and then re-ran (on Windows). That’s it. While the test is running, you’ll see the same output as you’re used to seeing, plus a little more. For example, at the end of the benchmarking, I now also see this:
I then simply opened that
System.Text.Json hit the scene in .NET Core 3.0, and every release since it’s gotten more capable and more efficient. .NET 9 is no exception. In addition to new features like support for exporting JSON schema, deep semantic equality comparison of
One improvement comes from the integration of
One last improvement I want to call out. The feature itself is not actually about performance, but the workarounds I’ve seen folks employ for the lack of this capability do have a significant performance impact, and so having the feature built-in will be a net performance win. dotnet/runtime#104328 adds support to both
Being able to observe one’s application in production is critical to the operation of modern services.
it used an interlocked operation to perform the addition atomically. Here
That helps, but while this does reduce the overhead and improve scalability, it still represents a bottleneck under heavy contention. To address that, the change also split the single
There’s another interesting aspect of the improvement worth mentioning, and that’s the padding employed in the array. Going from a single
but if you look at the code, it’s instead:
where
This effectively increases the size of each value from 8 bytes to 64 bytes, where only the first 8 bytes of each value is used and the other 56 bytes are padding. That’s odd, right? Normally we’d jump at an opportunity to shrink 64 bytes down to 8 bytes in order to reduce allocation and memory consumption, but here we’re purposefully going in the other direction.
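The striping-plus-padding scheme can be sketched like this (an illustrative stand-in for the actual counter implementation): each counter is placed 64 bytes (8 longs) from the next, so writers on different cores land on different cache lines. Note that an array’s start isn’t guaranteed to be cache-line aligned, so this reduces rather than strictly eliminates sharing.

```csharp
using System;
using System.Threading;

const int Stride = 64 / sizeof(long); // 8 longs = one 64-byte cache line
int stripes = Environment.ProcessorCount;
long[] cells = new long[stripes * Stride]; // only every 8th slot is actually used

var threads = new Thread[stripes];
for (int t = 0; t < stripes; t++)
{
    int slot = t * Stride; // each thread increments a slot on its own cache line
    threads[t] = new Thread(() =>
    {
        for (int i = 0; i < 100_000; i++)
        {
            Interlocked.Increment(ref cells[slot]);
        }
    });
    threads[t].Start();
}
foreach (Thread th in threads) th.Join();

// Reading the counter sums the per-stripe values.
long total = 0;
for (int i = 0; i < cells.Length; i += Stride) total += cells[i];
Console.WriteLine(total); // stripes * 100_000
```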
The reason for that is “false sharing.” Consider this benchmark, which I’ve shamelessly borrowed from a conversation Scott Hanselman and I recently recorded for the Deep .NET series but which hasn’t yet posted online:
When I run that, I get results like this:
In this benchmark, one thread is incrementing
As an aside, there are some additional BenchmarkDotNet diagnosers that can help to highlight the effects of false sharing. ETW on Windows enables collecting various CPU performances counters, such as for branch misses or instructions retired, and BenchmarkDotNet has a
Here I’ve asked for both instructions retired, which reflects how many instructions were fully executed (this in and of itself can be a useful metric when analyzing performance, as it’s not as prone to variation as wall-clock measurements), and cache misses, which reflects how many times data wasn’t available in the CPU’s cache.
In the two benchmarks, we can see that the number of instructions executed is almost the same between when false sharing occurred (Index == 1) and didn’t (Index == 31), but the number of cache misses is more than three times larger in the false sharing case, and reasonably well correlated with the time increase. When one core performs a write, that invalidates the corresponding cache line in the other core’s cache, such that the other core then needs to reload the cache line, resulting in cache misses. But I digress…
Another nice improvement comes in dotnet/runtime#105011 from @stevejgordon, adding a new constructor to
that would end up boxing the
Throughout this post, I’ve tried to group improvements by topic area in order to create a more fluid and interesting discussion. However, over the course of a year, with as vibrant a community as exists for .NET, and with the breadth of functionality that exists across the platform, there are invariably a large number of one-off PRs that improve this or that by a little. It’s often challenging to imagine any one of these significantly “moving the needle,” but altogether, such changes reduce the “peanut butter” of performance overhead spread thinly across the libraries. In no particular order, here’s a non-comprehensive look at some of these:
Maybe one more poem? An acrostic this time:
Several hundred pages later and still not a poet. Oh well.
I’m asked from time to time why I invest in writing these “Performance Improvements in .NET” posts. There’s no one answer. In no particular order:
If you’ve read this far, I hope you indeed have learned something and are excited about the .NET 9 release. As is likely obvious from my enthusiastic ramblings and awkward poetry, I’m incredibly excited about .NET, everything that’s been achieved in .NET 9, and the future of the platform. If you’re already using .NET 8, upgrading to .NET 9 should be a breeze (the .NET 9 Release Candidate is available for download), and I’d love it if you’d do so and share with us any successes you achieve or issues you face along the way. We’d love to learn from you. And if you have ideas about how to further improve the performance of .NET for .NET 10, please join us in dotnet/runtime.
Happy coding!
The post Performance Improvements in .NET 9 appeared first on .NET Blog.
And so, every year, I start these posts talking about how the next release of .NET is the fastest and best release to date. That’s true for .NET 9 as well, of course, but the statement that .NET 9 is the fastest and best release of .NET to date is now a bit… mundane. So, let’s spice it up a bit. How about… a haiku?
Code:
As the falcon flies,
.NET 9 speeds joy into
Developers' hearts.
Or, maybe a limerick:
Code:
In the coding world, there's a star,
.NET 9, the best by far.
With speed that's supreme,
It's every coder's dream,
Taking development to a new par.
A little gimmicky? Maybe something more classical, a sonnet perhaps:
Code:
In realms of code where brilliance finds its way,
.NET 9 shines with an unmatched array.
Its speed and grace, a marvel to behold,
Transforming tasks to treasures, fast and bold.
Developers, with joy, embrace its might,
Their projects soar, efficiency in sight.
No longer bound by limits of the past,
In .NET 9, their dreams are built to last.
Its libs, a symphony of pure delight,
Turning complex to simple, dim to light.
With every line of code, a masterpiece,
In .NET 9, dev burdens find release.
Oh, wondrous .NET 9, you light the way,
In your embrace, our future's bright as day.
Ok, so, yeah, I should stick to writing software rather than poetry (something with which my college poetry professor likely agreed). Nevertheless, the sentiment remains: .NET 9 is an incredibly exciting release. More than 7,500 pull requests (PRs) have merged into dotnet/runtime in the last year, of which a significant percentage have touched on performance in one way, shape, or form. In this post, we’ll take a tour through over 350 PRs that have all found their way into packing .NET 9 full of performance yumminess. Please grab a large cup of your favorite hot beverage, sit back, settle in, and enjoy.
Benchmarking Setup
In this post, I’ve included micro-benchmarks to showcase various performance improvements. Most of these benchmarks are implemented using BenchmarkDotNet v0.14.0, and, unless otherwise noted, there is a simple setup for each.
To follow along, first make sure you have .NET 8 and .NET 9 installed. The numbers I share were gathered using the .NET 9 Release Candidate.
Once you have the appropriate prerequisites installed, create a new C# project in a new benchmarks directory:
Code:
dotnet new console -o benchmarks
cd benchmarks
The resulting directory will contain two files: benchmarks.csproj, which is the project file with information about how the application should be compiled, and Program.cs, which contains the code for the application. Replace the entire contents of benchmarks.csproj with this:
Code:
<Project Sdk="Microsoft.NET.Sdk">
<PropertyGroup>
<OutputType>Exe</OutputType>
<TargetFrameworks>net9.0;net8.0</TargetFrameworks>
<LangVersion>Preview</LangVersion>
<ImplicitUsings>enable</ImplicitUsings>
<Nullable>enable</Nullable>
<AllowUnsafeBlocks>true</AllowUnsafeBlocks>
<ServerGarbageCollection>true</ServerGarbageCollection>
</PropertyGroup>
<ItemGroup>
<PackageReference Include="BenchmarkDotNet" Version="0.14.0" />
</ItemGroup>
</Project>
The preceding project file tells the build system we want:
- to build a runnable application, as opposed to a library.
- to be able to run on both .NET 8 and .NET 9, so that BenchmarkDotNet can build multiple versions of the application, one to run on each version, in order to compare the results.
- to be able to use all of the latest features from the C# language even though C# 13 hasn’t officially shipped yet.
- to automatically import common namespaces.
- to be able to use nullable reference type annotations in the code.
- to be able to use the unsafe keyword in the code.
- to configure the garbage collector (GC) into its “server” configuration, which impacts the trade-offs it makes between memory consumption and throughput. This isn’t required, but it’s how most services are configured.
- to pull in BenchmarkDotNet v0.14.0 from NuGet so that we’re able to use the library in Program.cs.
For each benchmark, I’ve then included the full Program.cs source; to test it, just replace the entire contents of your Program.cs with the shown benchmark. Each test may be configured slightly differently from others, in order to highlight the key aspects being shown. For example, some tests include the [MemoryDiagnoser(false)] attribute, which tells BenchmarkDotNet to track allocation-related metrics, or the [DisassemblyDiagnoser] attribute, which tells BenchmarkDotNet to find and share the assembly code for the test, or the [HideColumns] attribute, which removes some output columns that BenchmarkDotNet might otherwise emit but that are unnecessary clutter for our needs in this post.
Running the benchmarks is then simple. Each test includes a comment at its top for the dotnet command to use to run the benchmark. It’s typically something like this:
Code:
dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
That:
- builds the benchmarks in a Release build. Compiling for Release is important as both the C# compiler and the JIT compiler have optimizations that are disabled for Debug. Thankfully, BenchmarkDotNet warns if Debug is accidentally used:
Code:
// Validating benchmarks:
//    * Assembly Benchmarks which defines benchmarks is non-optimized
Benchmark was built without optimization enabled (most probably a DEBUG configuration). Please, build it in RELEASE.
If you want to debug the benchmarks, please see https://benchmarkdotnet.org/articles/guides/troubleshooting.html#debugging-benchmarks.
- targets .NET 8 for the host project. There are multiple builds involved here: the “host” application you run with the above command, which uses BenchmarkDotNet, which will in turn generate and build an application per target runtime. Because the code for the benchmark is compiled into all of these, you typically want the host project to target the oldest runtime you’ll be testing, so that building the host application will fail if you try to use an API that’s not available in all of the target runtimes.
- runs all of the benchmarks in the whole program. If you don’t specify the --filter argument, BenchmarkDotNet will prompt you to ask which benchmarks to run. By specifying “*”, we’re saying “don’t prompt, just run ’em all.” You can also specify an expression to filter down which subset of the tests you want invoked.
- runs the tests on both .NET 8 and .NET 9.
Throughout the post, I’ve shown many benchmarks and the results I received from running them. Unless otherwise stated (e.g. because I’m demonstrating an OS-specific improvement), the results shown for benchmarks are from running them on Linux (Ubuntu 22.04) on an x64 processor.
Code:
BenchmarkDotNet v0.14.0, Ubuntu 22.04.3 LTS (Jammy Jellyfish) WSL
11th Gen Intel Core i9-11950H 2.60GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100-rc.1.24452.12
[Host] : .NET 9.0.0 (9.0.24.43107), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
My standard caveat: these are micro-benchmarks, often measuring operations that take very short periods of time, but where improvements to those times add up to be impactful when executed over and over and over. Different hardware, different operating systems, what other processes might be running on your machine, who you had breakfast with this morning, and the alignment of the planets can all impact the numbers you get out. In short, the numbers you see are unlikely to match exactly the numbers I share here; however, I’ve chosen benchmarks that should be broadly repeatable.
With all that out of the way, let’s do this!
JIT
Improvements in .NET show up at all levels of the stack. Some changes result in large improvements in one specific area. Other changes result in small improvements across many things. When it comes to broad-reaching impact, there are few areas of .NET that result in changes more broadly-impactful than those changes made to the Just In Time (JIT) compiler. Code generation improvements help make everything better, and it’s where we’ll start our journey.
PGO
In Performance Improvements in .NET 8, I called out the enabling of dynamic profile guided optimization (PGO) as my favorite feature in the release, so PGO seems like a good place to start for .NET 9.
As a brief refresher, dynamic PGO is a feature that enables the JIT to profile code and use what it learns from that profiling to help it generate more efficient code based on the exact usage patterns of the application. The JIT utilizes tiered compilation, which allows code to be compiled and then re-compiled, possibly multiple times, achieving something new each time the code is compiled. For example, a typical method might start out at “tier 0,” where the JIT applies very few optimizations and has a goal of simply getting to functional assembly as quickly as possible. This helps with startup performance, as optimizations are one of the most costly things a compiler does. Then the runtime tracks the number of times the method is invoked, and if the number of invocations trips over a particular threshold, such that it seems like performance could actually matter, the JIT will re-generate code for it, still at tier 0, but this time with a bunch of additional instrumentation injected into the method, tracking all manner of things that could help the JIT better optimize, e.g. for a given virtual dispatch, what is the most common type on which the call is being performed. Then after enough data has been gathered, the JIT can compile the method yet again, this time at “tier 1,” fully optimized, also incorporating all of the learnings from that profile data. This same flow is relevant as well for code that’s already been pre-compiled with ReadyToRun (R2R), except instead of instrumenting tier 0 code, the JIT will generate optimized, instrumented code on its way to generating a re-optimized implementation.
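To see the effect of tiering and dynamic PGO for yourself, the runtime exposes documented configuration knobs as environment variables; a sketch of the relevant ones (set before launching the app):

```shell
export DOTNET_TieredCompilation=0  # disable tiering: everything jitted fully optimized up front
export DOTNET_TieredPGO=0          # keep tiering, but disable dynamic PGO instrumentation
export DOTNET_ReadyToRun=0         # ignore precompiled R2R code so all code flows through the JIT
```

Comparing benchmark results with and without DOTNET_TieredPGO is an easy way to quantify how much of a win the profile-guided codegen is for a given workload.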
In .NET 8, the JIT in particular paid attention to PGO data about types and methods involved in virtual, interface, and delegate dispatch. In .NET 9, it’s also able to use PGO data to optimize casts. Thanks to dotnet/runtime#90594, dotnet/runtime#90735, dotnet/runtime#96597, dotnet/runtime#96731, and dotnet/runtime#97773, dynamic PGO is now able to track the most common input types to cast operations (castclass/isinst, e.g. what you get from doing operations like (T)obj or obj is T), and then when generating the optimized code, emit special checks that add fast paths for the most common types. For example, in the following benchmark, we have a field of type A initialized to a type C that’s derived from both B and A. Then the benchmark is type checking the instance stored in that A field to see whether it’s a B or anything derived from B.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser(maxDepth: 0)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private A _obj = new C();
[Benchmark]
public bool IsInstanceOf() => _obj is B;
public class A { }
public class B : A { }
public class C : B { }
}
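Before diving into the disassembly, it can help to see the .NET 9 fast path written out in C#. This is only a conceptual sketch of the guarded check the JIT emits (in reality it compares method tables in assembly rather than calling `GetType()`):

```csharp
public class A { }
public class B : A { }
public class C : B { }

static class CastFastPath
{
    // If profiling shows C is the most common concrete type, the JIT can
    // guard with an exact-type check: since C derives from B, a match means
    // "is B" is true without calling the general cast helper.
    public static bool IsB(A? obj)
    {
        if (obj is null)
            return false;
        if (obj.GetType() == typeof(C))
            return true;   // fast path: the profiled common type
        return obj is B;   // fallback: the general type check
    }
}
```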
That
IsInstanceOf
benchmark results in the following disassembly on .NET 8:
Code:
; Tests.IsInstanceOf()
push rax
mov rsi,[rdi+8]
mov rdi,offset MT_Tests+B
call qword ptr [7F3D91524360]; System.Runtime.CompilerServices.CastHelpers.IsInstanceOfClass(Void*, System.Object)
test rax,rax
setne al
movzx eax,al
add rsp,8
ret
; Total bytes of code 35
but now on .NET 9, it produces this:
Code:
; Tests.IsInstanceOf()
push rbp
mov rbp,rsp
mov rsi,[rdi+8]
mov rcx,rsi
test rcx,rcx
je short M00_L00
mov rax,offset MT_Tests+C
cmp [rcx],rax
jne short M00_L01
M00_L00:
test rcx,rcx
setne al
movzx eax,al
pop rbp
ret
M00_L01:
mov rdi,offset MT_Tests+B
call System.Runtime.CompilerServices.CastHelpers.IsInstanceOfClass(Void*, System.Object)
mov rcx,rax
jmp short M00_L00
; Total bytes of code 62
On .NET 8, it's loading the reference to the object and the desired method table for B, and calling the CastHelpers.IsInstanceOfClass JIT helper to do the type check. On .NET 9, instead it's loading the method table for C, which it saw during profiling to be the most common type used, and then comparing that against the actual object's method table. If they match, since the JIT knows that C derives from B, it then knows the object is in fact a B. If they don't match, then it jumps down to the fallback path where it does the same thing that was being done on .NET 8, loading the reference and the desired method table for B and calling IsInstanceOfClass.
It's also capable of optimizing for the negative case where the cast most often fails. Consider this benchmark:
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser(maxDepth: 0)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private object _obj = "hello";
[Benchmark]
public bool IsInstanceOf() => _obj is Tests;
}
On .NET 9, we get this assembly:
Code:
; Tests.IsInstanceOf()
push rbp
mov rbp,rsp
mov rsi,[rdi+8]
mov rcx,rsi
test rcx,rcx
je short M00_L00
mov rax,offset MT_System.String
cmp [rcx],rax
jne short M00_L01
xor ecx,ecx
M00_L00:
test rcx,rcx
setne al
movzx eax,al
pop rbp
ret
M00_L01:
mov rdi,offset MT_Tests
call System.Runtime.CompilerServices.CastHelpers.IsInstanceOfClass(Void*, System.Object)
mov rcx,rax
jmp short M00_L00
; Total bytes of code 64
Here the incoming object is always a string and never the Tests class that's being tested for. The generated code is comparing the incoming object against string, and then, assuming the types match, the JIT knows the object is not a Tests.
dotnet/runtime#96311 also breaks new ground with dynamic PGO, by teaching it how to profile integers and paying attention to their most common values. Then in conjunction with dotnet/runtime#96571, it uses this super power to optimize
Buffer.Memmove
(which is the workhorse behind methods like Span<T>.CopyTo
) and SpanHelpers.SequenceEqual
(which is the implementation behind methods like string.Equals
). Previously, the JIT was taught how to unroll such operations, where if a constant length was provided, the JIT could generate the exact code sequence to implement the operation for that length. Now with this capability, the JIT can track the most common lengths provided to these methods, and if there’s one length that really stands out, it can special-case it, unrolling and vectorizing the operation when the length matches and falling back to calling the original when it doesn’t. While this is expected to improve in the future, for .NET 9 this set of length-profiling optimizations only kicks in when R2R is disabled, as the JIT is otherwise unable to do the exact profiling required. Disabling R2R is something services can do when startup performance isn’t a big concern and they instead care about maximum throughput at run-time.
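As a rough illustration of the kind of length specialization described above, here's what the idea looks like when written by hand for a profiled-hot length of 4. This is a sketch only; the real generated code compares wider chunks at once with vectorized loads rather than comparing characters one by one:

```csharp
static class LengthSpecialized
{
    // Fast path for the single length that profiling showed to dominate;
    // everything else takes the general-purpose comparison.
    public static bool EqualsHot4(string a, string b)
    {
        if (a.Length == 4 && b.Length == 4)
        {
            // Unrolled comparison for the hot length.
            return a[0] == b[0] && a[1] == b[1] &&
                   a[2] == b[2] && a[3] == b[3];
        }
        return string.Equals(a, b); // general path for all other lengths
    }
}
```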
Code:
// dotnet run -c Release -f net8.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;
var config = DefaultConfig.Instance
.AddJob(Job.Default.WithId(".NET 8").WithRuntime(CoreRuntime.Core80).WithEnvironmentVariable("DOTNET_ReadyToRun", "0"))
.AddJob(Job.Default.WithId(".NET 9").WithRuntime(CoreRuntime.Core90).WithEnvironmentVariable("DOTNET_ReadyToRun", "0"));
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);
[DisassemblyDiagnoser(maxDepth: 0)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "a", "b")]
public class Tests
{
[Benchmark]
[Arguments("abcd", "abcg")]
public bool Equals(string a, string b) => a == b;
}
Method | Runtime | Mean | Code Size
---|---|---|---
Equals | .NET 8.0 | 2.8592 ns | 78 B
Equals | .NET 9.0 | 0.6754 ns | 87 B

Tier 0
Tier 0 is all about getting to functioning code quickly, and as such most optimizations are disabled. However, every now and then there’s a reason to do a bit more optimization in tier 0, in situations where the benefits of doing so outweigh the cons. Several of those occurred in .NET 9.
dotnet/runtime#104815 is a simple example. The
ArgumentNullException.ThrowIfNull
method is now used in thousands upon thousands of places for doing argument validation. It’s a non-generic method, accepting an object
argument and checking to see whether it’s null
. That non-genericity causes some friction for folks when it’s used with value types. It’s rare for someone to directly call ThrowIfNull
with a value type (other than maybe with a Nullable<T>
), and in fact if they do, thanks to dotnet/roslyn-analyzers from @CollinAlpert, there's now the CA2264 analyzer that will warn that what's being done is nonsensical.
Instead, the common case is when the argument being validated is an unconstrained generic. In such cases, if the generic argument ends up being a value type, it'll be boxed in the call to ThrowIfNull
. That boxing allocation gets removed in tier 1, because the ThrowIfNull
call gets inlined and the JIT can see at the call site that the boxing was unnecessary. But, because inlining doesn’t happen in tier 0, such boxing has remained in tier 0. As the API is so ubiquitous, this caused developers to fret that there was something bad happening, and it caused enough consternation that the JIT now special-cases ArgumentNullException.ThrowIfNull
and avoids the boxing, even in tier 0. This is easy to see with a little test console app:
Code:
// dotnet run -c Release -f net8.0 --filter "*"
// dotnet run -c Release -f net9.0 --filter "*"
using System.Runtime.CompilerServices;
while (true)
{
Test();
}
[MethodImpl(MethodImplOptions.NoInlining)]
static void Test()
{
long gc = GC.GetAllocatedBytesForCurrentThread();
for (int i = 0; i < 100; i++)
{
ThrowIfNull(i);
}
gc = GC.GetAllocatedBytesForCurrentThread() - gc;
Console.WriteLine(gc);
Thread.Sleep(1000);
}
static void ThrowIfNull<T>(T value) => ArgumentNullException.ThrowIfNull(value);
When I run that on .NET 8, I get results like this:
Code:
2400
2400
2400
0
0
0
The first few iterations are invoking
Test()
at tier 0, such that each call to ArgumentNullException.ThrowIfNull
boxes the input int
. Then when the method gets recompiled at tier 1, the boxing gets elided, and we stabilize at zero allocation. Now on .NET 9, I get results like this:
Code:
0
0
0
0
0
0
With these tweaks to tier 0, the boxing is also elided in tier 0, and so starts out without any allocation.
Another tier 0 boxing example is dotnet/runtime#90496. There’s a hot path method in the
async
/await
machinery: AsyncTaskMethodBuilder<TResult>.AwaitUnsafeOnCompleted
(see How Async/Await Really Works in C# for all the details). It’s really important that this method be optimized well, but it performs various type tests that can end up boxing in tier 0. In a previous release, that boxing was deemed too impactful to startup for async
methods invoked early in an application’s lifetime, so [MethodImpl(MethodImplOptions.AggressiveOptimization)]
was used to opt the method out of tiering, such that it gets optimized from the get-go. But that itself has downsides, because if it skips tiering up, it also skips dynamic PGO, and thus the optimized code isn't as good as it possibly could be. So, this PR specifically addresses those type-test patterns that box, removing the boxing in tier 0, enabling the removal of that AggressiveOptimization from AwaitUnsafeOnCompleted, and thereby enabling better optimized code generation for it.
Optimizations are avoided in tier 0 because they might slow down compilation. If there are really cheap optimizations, though, and they can have a meaningful impact, they can be worth enabling. That's especially true if the optimizations can actually help to make compilation and startup faster, such as by minimizing calls to helpers that may take locks, trigger certain kinds of loading, etc. And that's what dotnet/runtime#105190 does, enabling some constant folding in tier 0 at relatively little cost. Even with the low cost, though, there were still concerns about possible impact to JIT throughput, and the PR was fast-followed by dotnet/runtime#105250, which optimized some JIT code paths to make up for any impact from the former change.
Another similar case is dotnet/runtime#91403 from @MichalPetryka, which allows optimizations around
RuntimeHelpers.CreateSpan
to kick in for tier 0. Without that, the runtime can end up allocating many field stubs, which themselves add overhead to the startup path.
Loops
Applications spend a lot of time iterating through loops, and finding ways to reduce the overheads of loops has been a key focus for .NET 9. It’s also been quite successful.
dotnet/runtime#102261 and dotnet/runtime#103181 help to remove some instructions from even the tightest of loops by converting upward counting loops into downward counting loops. Consider a loop like the following:
Code:
// dotnet run -c Release -f net8.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
[Benchmark]
public int UpwardCounting()
{
int count = 0;
for (int i = 0; i < 100; i++)
{
count++;
}
return count;
}
}
Here’s what the generated assembly code for that core loop looks like on .NET 8:
Code:
M00_L00:
inc eax
inc ecx
cmp ecx,64
jl short M00_L00
It’s incrementing
eax
, which is storing count
. And it’s incrementing ecx
, which is storing i
. It’s then comparing ecx
against 100 (0x64) to see if it's reached the end of the loop, and jumping back up to the beginning of the loop if it hasn't.
Now let's manually rewrite the loop to be downward counting:
Code:
// dotnet run -c Release -f net8.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
[Benchmark]
public int DownwardCounting()
{
int count = 0;
for (int i = 99; i >= 0; i--)
{
count++;
}
return count;
}
}
And here’s what the generated assembly code for that core loop looks like there:
Code:
M00_L00:
inc eax
dec ecx
jns short M00_L00
The key observation here is that by counting down, we can replace a
cmp
/jl
for comparing against a specific bound to instead just be a jns
that jumps if the value isn’t negative. We’ve thus removed an instruction from a tight loop that only had four to begin with. With the aforementioned PRs, the JIT can now do that transformation automatically where it’s applicable and deemed valuable, such that the loop in UpwardCounting
now results in the same assembly code on .NET 9 as does the loop in DownwardCounting
.

Method | Runtime | Mean | Ratio
---|---|---|---
UpwardCounting | .NET 8.0 | 30.27 ns | 1.00
UpwardCounting | .NET 9.0 | 26.52 ns | 0.88

However, the JIT is only able to do this transformation if the iteration variable (
i
) isn't used in the body of the loop, and obviously there are many loops where it is, such as by indexing into an array being iterated over. Thankfully, other optimizations in .NET 9 are able to reduce the actual reliance on the iteration variable, such that this optimization now kicks in frequently.
One such optimization is strength reduction in loops. In compilers, "strength reduction" is the simple idea of taking something relatively expensive and replacing it with something cheaper. In the context of loops, that typically means introducing more "induction variables" (variables whose values change in a predictable pattern on each iteration, such as being incremented by a constant amount). For example, consider a simple loop that sums all of the elements of an array:
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private int[] _array = Enumerable.Range(0, 1000).ToArray();
[Benchmark]
public int Sum()
{
int[] array = _array;
int sum = 0;
for (int i = 0; i < array.Length; i++)
{
sum += array[i];
}
return sum;
}
}
We get the following assembly on .NET 8:
Code:
; Tests.Sum()
push rbp
mov rbp,rsp
mov rax,[rdi+8]
xor ecx,ecx
xor edx,edx
mov edi,[rax+8]
test edi,edi
jle short M00_L01
M00_L00:
mov esi,edx
add ecx,[rax+rsi*4+10]
inc edx
cmp edi,edx
jg short M00_L00
M00_L01:
mov eax,ecx
pop rbp
ret
; Total bytes of code 35
The interesting part is the loop starting at
M00_L00
. i
is being stored in edx
(though it gets copied into esi
), and as part of adding the next element from the array to sum
(which is stored in ecx
), we’re loading that next value from the array with the address rax+rsi*4+10
. A strength reduction view of this would say “rather than re-computing the address on each iteration, we can instead have another induction variable and increment it by 4 on each iteration.” A key benefit of that is it then removes a dependency on i
from inside of the loop, which then means the iteration variable is no longer used in the loop, enabling the aforementioned downward counting optimization to kick in. That leads to the following assembly on .NET 9:
Code:
; Tests.Sum()
push rbp
mov rbp,rsp
mov rax,[rdi+8]
xor ecx,ecx
mov edx,[rax+8]
test edx,edx
jle short M00_L01
add rax,10
M00_L00:
add ecx,[rax]
add rax,4
dec edx
jne short M00_L00
M00_L01:
mov eax,ecx
pop rbp
ret
; Total bytes of code 35
Note the loop at
M00_L00
: it’s now downward counting, reading the next value from the array is simply dereferencing the address in rax
, and the address in rax
is incremented by 4 each time around.
A lot of work went into enabling this strength reduction, including providing the basic implementation (dotnet/runtime#104243), enabling it by default (dotnet/runtime#105131), finding more opportunities to apply it (dotnet/runtime#105169), and using it to enable post-indexed addressing (dotnet/runtime#105181 and dotnet/runtime#105185), which is an Arm addressing mode where the address stored in the base register is used but then that register is updated to point to the next target memory location. A new phase was also added to the JIT to help with optimizing such induction variables (dotnet/runtime#97865), and in particular, to do induction variable widening, where 32-bit induction variables (think of every loop you've ever written that starts with for (int i = ...)) are widened to 64-bit induction variables. This widening can help to avoid zero extensions that might otherwise occur on every iteration of the loop.
These optimizations are all new, but of course there are also many loop optimizations already present in the JIT compiler, from loop unrolling to loop cloning to loop hoisting. In order to apply such loop optimizations, though, the JIT first needs to recognize loops, and that can sometimes be more challenging than it would seem (dotnet/runtime#43713 describes a case where the JIT was failing to do so). Historically, the JIT's loop recognition was based on a relatively simplistic lexical analysis. In .NET 8, as part of the work to improve dynamic PGO, a more powerful graph-based loop analyzer was added that was able to recognize many more loops. For .NET 9, with dotnet/runtime#95251, that analyzer was factored out so that it could be used for generalized loop reasoning. And then with PRs like dotnet/runtime#96756 for loop alignment, dotnet/runtime#96754 and dotnet/runtime#96553 for loop cloning, dotnet/runtime#96752 for loop unrolling, dotnet/runtime#96751 for loop canonicalization, and dotnet/runtime#96753 for loop hoisting, many of these loop-related optimizations have now been moved to the better scheme. All of that means that more loops get optimized.
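In source terms, the strength-reduced, downward-counting form of the earlier Sum benchmark can be approximated by hand. This is a sketch of the shape of the generated code, not something you'd normally write yourself:

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static class StrengthReduced
{
    public static int Sum(int[] array)
    {
        int sum = 0;
        // Induction variable: a managed reference bumped by one element per
        // iteration, instead of re-computing base + i*4 from the index.
        ref int cur = ref MemoryMarshal.GetArrayDataReference(array);
        // Downward-counting loop: no index variable is needed in the body.
        for (int remaining = array.Length; remaining != 0; remaining--)
        {
            sum += cur;
            cur = ref Unsafe.Add(ref cur, 1);
        }
        return sum;
    }
}
```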
Bounds Checks
.NET code is, by default, “memory safe.” Unlike in C, where you can iterate through an array and easily walk off the end of it, by default accesses to arrays, strings, and spans are “bounds checked” to ensure you can’t walk off the end or before the beginning. Of course, such bounds checking adds overhead, and so wherever the JIT can prove that adding such checks would be unnecessary, it’ll elide the bounds check, knowing that it’s impossible for the guarded accesses to be problematic. The quintessential example of this is a loop over an array from
0
to array.Length
. Let’s look at the same benchmark we just looked at, summing all the elements of an integer array:
Code:
// dotnet run -c Release -f net8.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser(maxDepth: 0)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private int[] _array = new int[1000];
[Benchmark]
public int Test()
{
int[] array = _array;
int sum = 0;
for (int i = 0; i < array.Length; i++)
{
sum += array[i];
}
return sum;
}
}
That
Test
benchmark results in this assembly code on .NET 8:
Code:
; Tests.Test()
push rbp
mov rbp,rsp
mov rax,[rdi+8]
xor ecx,ecx
xor edx,edx
mov edi,[rax+8]
test edi,edi
jle short M00_L01
M00_L00:
mov esi,edx
add ecx,[rax+rsi*4+10]
inc edx
cmp edi,edx
jg short M00_L00
M00_L01:
mov eax,ecx
pop rbp
ret
; Total bytes of code 35
The key part to pay attention to is the loop at
M00_L00
, for which the only branch is the one comparing edx
(which is tracking i
) to edi
(which was earlier on initialized to the length of the array, [rax+8]
) as part of knowing when it’s done iterating. There’s no additional check required to make this safe, as the JIT knows the loop started at 0
(and thus isn't walking off the beginning of the array) and the JIT knows iteration ends at the array length, which the JIT is already checking for, so it's safe to index into the array without additional checks.
Now, let's tweak the benchmark ever so slightly. In the above, I was copying the
_array
field to a local array
and then doing all accesses against that array
; this is critical, because there’s nothing else that could be changing that local out from under the loop. But if we instead change the code to refer to the field directly:
Code:
// dotnet run -c Release -f net8.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser(maxDepth: 0)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private int[] _array = new int[1000];
[Benchmark]
public int Test()
{
int sum = 0;
for (int i = 0; i < _array.Length; i++)
{
sum += _array[i];
}
return sum;
}
}
now we get this on .NET 8:
Code:
; Tests.Test()
push rbp
mov rbp,rsp
xor eax,eax
xor ecx,ecx
mov rdx,[rdi+8]
cmp dword ptr [rdx+8],0
jle short M00_L01
nop dword ptr [rax]
nop dword ptr [rax]
M00_L00:
mov rdi,rdx
cmp ecx,[rdi+8]
jae short M00_L02
mov esi,ecx
add eax,[rdi+rsi*4+10]
inc ecx
cmp [rdx+8],ecx
jg short M00_L00
M00_L01:
pop rbp
ret
M00_L02:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 61
That’s a whole lot worse. Note how much that loop starting at
M00_L00
has grown, and in particular note that instead of just having the one cmp
/jg
pair at the end, there’s another cmp
/jae
pair in the middle, just before it accesses the array element. Since the code is reading from the field on every access, the JIT needs to accommodate the fact that the reference could change between any two accesses; thus, even though the JIT is comparing against _array.Length
as part of the loop bounds, it also needs to ensure that the subsequent reference to _array[i]
is still in bounds, since by then _array
may be an entirely different object. That’s a “bounds check,” which is obvious from the tell-tale sign that immediately after the cmp
, there’s a conditional jump to code that unconditionally calls CORINFO_HELP_RNGCHKFAIL
; that’s the helper function that’s called to throw an IndexOutOfRangeException
when you try to walk off the end of one of these data structures.
Every release, the JIT gets better at removing more and more bounds checks where it can prove they're superfluous. One of my favorite such improvements in .NET 9 earns that spot because I've historically expected the optimization to "just work," yet for various reasons it didn't, and now it does (it also shows up in a fair amount of real code, which is why I've bumped up against it). In this benchmark, the function is handed an offset and a span, and its job is to sum all of the numbers from that offset to the end of the span.
Code:
// dotnet run -c Release -f net8.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
[Benchmark]
public int Test() => M(0, "1234567890abcdefghijklmnopqrstuvwxyz");
[MethodImpl(MethodImplOptions.NoInlining)]
public static int M(int i, ReadOnlySpan<char> src)
{
int sum = 0;
while (true)
{
if ((uint)i >= src.Length)
{
break;
}
sum += src[i++];
}
return sum;
}
}
By casting
i
to uint
as part of the comparison to src.Length
, the JIT knows that i
is in bounds of src
by the time i
is used to index into src
, because if i
were negative, the cast to uint
would have made it larger than int.MaxValue
and thus also larger than src.Length
(which can’t possibly be larger than int.MaxValue
). The .NET 8 assembly shows the bounds check has been elided (note the lack of CORINFO_HELP_RNGCHKFAIL
):
Code:
; Tests.M(Int32, System.ReadOnlySpan`1<Char>)
push rbp
mov rbp,rsp
xor eax,eax
M01_L00:
cmp edi,edx
jae short M01_L01
lea ecx,[rdi+1]
mov edi,edi
movzx edi,word ptr [rsi+rdi*2]
add eax,edi
mov edi,ecx
jmp short M01_L00
M01_L01:
pop rbp
ret
; Total bytes of code 27
But, this is a fairly awkward way to write such a condition. A more natural way would be to have that check as part of the loop condition:
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
[Benchmark]
public int Test() => M(0, "1234567890abcdefghijklmnopqrstuvwxyz");
[MethodImpl(MethodImplOptions.NoInlining)]
public static int M(int i, ReadOnlySpan<char> src)
{
int sum = 0;
for (; (uint)i < src.Length; i++)
{
sum += src[i];
}
return sum;
}
}
Unfortunately, as a result of my code cleanup here to make the code more canonical, the JIT in .NET 8 fails to see that the bounds check can be elided… note the
CORINFO_HELP_RNGCHKFAIL
at the end:
Code:
; Tests.M(Int32, System.ReadOnlySpan`1<Char>)
push rbp
mov rbp,rsp
xor eax,eax
cmp edi,edx
jae short M01_L01
M01_L00:
cmp edi,edx
jae short M01_L02
mov ecx,edi
movzx ecx,word ptr [rsi+rcx*2]
add eax,ecx
inc edi
cmp edi,edx
jb short M01_L00
M01_L01:
pop rbp
ret
M01_L02:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 36
But in .NET 9, thanks to dotnet/runtime#100777, the JIT is better able to track the knowledge about guarantees made by the loop condition and is able to elide the bounds check on this variation as well.
Code:
; Tests.M(Int32, System.ReadOnlySpan`1<Char>)
push rbp
mov rbp,rsp
xor eax,eax
cmp edi,edx
jae short M01_L01
mov ecx,edi
M01_L00:
movzx edi,word ptr [rsi+rcx*2]
add eax,edi
inc ecx
cmp ecx,edx
jb short M01_L00
M01_L01:
pop rbp
ret
; Total bytes of code 26
Yay!
Now consider this benchmark:
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser(maxDepth: 0)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
[Benchmark]
[Arguments(3)]
public int Test(int i)
{
ReadOnlySpan<byte> rva = [1, 2, 3, 5, 8, 13, 21, 34];
return rva[7 - (i & 7)];
}
}
The test method here has a span of data initialized in a way where the JIT is able to see how long it is. It’s then indexing into the span, using the supplied index to read not from the start but from the end (the
(i & 7)
is there to ensure the JIT can see that the value will always be in range); if it were reading from the start, this was already optimized, but from the end, the JIT hadn’t previously been taught how to reason about the bounds checks. On .NET 8, it can’t prove the access is always in-bounds, and we can see the bounds check in place:
Code:
; Tests.Test(Int32)
push rax
and esi,7
mov eax,esi
neg eax
add eax,7
cmp eax,8
jae short M00_L00
mov rcx,7FC98A741EC8
movzx eax,byte ptr [rax+rcx]
add rsp,8
ret
M00_L00:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 41
But, now on .NET 9, thanks to dotnet/runtime#96123, the bounds check gets elided.
Code:
; Tests.Test(Int32)
and esi,7
mov eax,esi
neg eax
add eax,7
mov rcx,7F39B8724EC8
movzx eax,byte ptr [rax+rcx]
ret
; Total bytes of code 25
Here’s another case. We’re special-casing spans of lengths less than or equal to 1, returning
string.Empty
if the span is of length 0 or returning the first string if the span is of length 1:
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
[Benchmark]
public string? Test() => M(["123"]);
[MethodImpl(MethodImplOptions.NoInlining)]
private static string? M(ReadOnlySpan<string> values)
{
if (values.Length <= 1)
{
return values.Length == 0 ?
string.Empty :
values[0];
}
return null;
}
}
You and I can see that the access to
values[0]
will always succeed, but on .NET 8 we get this:
Code:
; Tests.M(System.ReadOnlySpan`1<System.String>)
push rbp
mov rbp,rsp
cmp esi,1
jg short M01_L01
test esi,esi
je short M01_L00
test esi,esi
je short M01_L02
mov rax,[rdi]
pop rbp
ret
M01_L00:
mov rax,7FB62147C008
pop rbp
ret
M01_L01:
xor eax,eax
pop rbp
ret
M01_L02:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 44
The JIT keeps track of what it knows about the lengths of various things and what conditions it's proved, but here it's lost track of the fact that, for the else branch of the ternary,
values
is guaranteed to be of length 1
, and thus indexing at index 0
is safe. dotnet/runtime#101323 improves the JIT’s range tracking ability, such that on .NET 9, the bounds check is successfully elided:
Code:
; Tests.M(System.ReadOnlySpan`1<System.String>)
push rbp
mov rbp,rsp
cmp esi,1
jg short M01_L01
test esi,esi
je short M01_L00
mov rax,[rdi]
pop rbp
ret
M01_L00:
mov rax,7F5700FB1008
pop rbp
ret
M01_L01:
xor eax,eax
pop rbp
ret
; Total bytes of code 34
Most if not all of these bounds check elimination improvements come about because someone is optimizing something and sees a bounds check that could have been eliminated but wasn’t. In the case that inspired the improvement in dotnet/runtime#101352, that someone was me, while working on improving
Enum
for .NET 8. Enums can be backed by various numerical types, including ulong
, and there’s a code path in Enum.GetName
that’s effectively this:
Code:
if (ulongValue < (ulong)names.Length)
{
return names[(uint)ulongValue];
}
That bounds check wasn’t previously being removed, but now in .NET 9, it is:
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private readonly string[] _names = Enum.GetNames<MyEnum>();
[Benchmark]
[Arguments(2)]
public string? GetNameOrNull(ulong ulValue)
{
string[] names = _names;
return ulValue < (ulong)names.Length ?
names[(uint)ulValue] :
null;
}
public enum MyEnum : ulong { A, B, C, D }
}
Code:
// .NET 8
; Tests.GetNameOrNull(UInt64)
push rbp
mov rbp,rsp
mov rax,[rdi+8]
mov ecx,[rax+8]
mov edx,ecx
cmp rdx,rsi
jbe short M00_L00
cmp esi,ecx
jae short M00_L01
mov ecx,esi
mov rax,[rax+rcx*8+10]
pop rbp
ret
M00_L00:
xor eax,eax
pop rbp
ret
M00_L01:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 41
// .NET 9
; Tests.GetNameOrNull(UInt64)
mov rax,[rdi+8]
mov ecx,[rax+8]
cmp rcx,rsi
jbe short M00_L00
mov ecx,esi
mov rax,[rax+rcx*8+10]
ret
M00_L00:
xor eax,eax
ret
; Total bytes of code 23
Sometimes eliding bounds checks is about learning new tricks; other times, it’s about fixing old ones. Consider this benchmark:
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private static ReadOnlySpan<int> Lookup => [1, 2, 3, 5, 8, 13, 21];
[Benchmark]
[Arguments(3)]
public int Test1(int i) => (uint)i < 7 ? Lookup[i] : -1;
[Benchmark]
[Arguments(3)]
public int Test2(int i) => (uint)i <= 6 ? Lookup[i] : -1;
}
Test1
and Test2
are effectively the same thing, both guarding a lookup table by a known length and only accessing the table if we know the index to be in bounds. The bounds check will then be elided by the JIT in both cases, right? Wrong. On .NET 8, we get this:
Code:
; Tests.Test1(Int32)
cmp esi,7
jae short M00_L00
mov eax,esi
mov rcx,7F6D40064030
mov eax,[rcx+rax*4]
ret
M00_L00:
mov eax,0FFFFFFFF
ret
; Total bytes of code 27
; Tests.Test2(Int32)
push rbp
mov rbp,rsp
cmp esi,6
ja short M00_L00
cmp esi,7
jae short M00_L01
mov eax,esi
mov rcx,7F8D11621030
mov eax,[rcx+rax*4]
pop rbp
ret
M00_L00:
mov eax,0FFFFFFFF
pop rbp
ret
M00_L01:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 44
Note the bounds check in Test2. dotnet/runtime#97908 fixes this, such that on .NET 9, Test2 now successfully elides the bounds check as well:
Code:
; Tests.Test1(Int32)
cmp esi,7
jae short M00_L00
mov eax,esi
mov rcx,7F5B9DC5E030
mov eax,[rcx+rax*4]
ret
M00_L00:
mov eax,0FFFFFFFF
ret
; Total bytes of code 27
; Tests.Test2(Int32)
cmp esi,6
ja short M00_L00
mov eax,esi
mov rcx,7F7FDE2C9030
mov eax,[rcx+rax*4]
ret
M00_L00:
mov eax,0FFFFFFFF
ret
; Total bytes of code 27
Interestingly, sometimes even if we can’t elide a bounds check, we can learn things from the fact that one occurred, and then use that knowledge to optimize subsequent things. Consider this benchmark:
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private readonly int[] _x = new int[10];
[Benchmark]
[Arguments(2)]
public int Add(int y) => _x[y] + (y % 8);
}
There’s nothing the JIT can do here to elide the bounds check on _x[y]; it has no information about the value of y or the length of _x. As such, as shown in the .NET 8 assembly here, we see a bounds check:
Code:
; Tests.Add(Int32)
push rax
mov rax,[rdi+8]
cmp esi,[rax+8]
jae short M00_L00
mov ecx,esi
mov edx,esi
sar edx,1F
and edx,7
add edx,esi
and edx,0FFFFFFF8
mov edi,esi
sub edi,edx
add edi,[rax+rcx*4+10]
mov eax,edi
add rsp,8
ret
M00_L00:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 46
However, all is not lost. After indexing into the array, we proceed to use y as the numerator of a % operation. C#’s % operator supports both int and uint numerators, but it has to do a little more work for int in case the value is negative. However, by the time we get to that % operation, we know that y is not negative: if it were, the _x[y] would have thrown and we’d never end up here. dotnet/runtime#102089 teaches the JIT how to learn such non-negative information from bounds checks, such that in .NET 9, we get code generation equivalent to what we’d get if we’d explicitly cast y to uint.
Code:
; Tests.Add(Int32)
push rax
mov rax,[rdi+8]
cmp esi,[rax+8]
jae short M00_L00
mov ecx,esi
and esi,7
add esi,[rax+rcx*4+10]
mov eax,esi
add rsp,8
ret
M00_L00:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 32
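To see the equivalence the JIT is exploiting, here’s a hand-written sketch of the same method with the cast made explicit (AddUnsigned is just an illustrative name, not part of the benchmark above):

```csharp
using System;

class NonNegativeDemo
{
    private static readonly int[] _x = new int[10];

    // As written in the benchmark: the JIT keeps the bounds check on _x[y],
    // but on .NET 9 it also learns from that check that y must be non-negative.
    static int Add(int y) => _x[y] + (y % 8);

    // What the .NET 9 codegen is equivalent to: once y is known non-negative,
    // y % 8 can be computed as an unsigned remainder, which is just y & 7.
    static int AddUnsigned(int y) => _x[y] + (int)((uint)y % 8);

    static void Main()
    {
        // For every in-range (and hence non-negative) index, the two agree.
        for (int y = 0; y < _x.Length; y++)
            Console.WriteLine(Add(y) == AddUnsigned(y)); // prints True each iteration
    }
}
```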
Arm64
Making .NET on Arm an awesome and fast experience has been a critical, multi-year investment. You can read about it in Arm64 Performance Improvements in .NET 5, Arm64 Performance Improvements in .NET 7, and Arm64 Performance Improvements in .NET 8. And things continue to improve even further in .NET 9. Here are some examples:
- Better barriers. dotnet/runtime#91553 implements volatile writes using the stlur (Store-Release Register) instruction rather than a dmb (Data Memory Barrier) / str (Store) pair of instructions (stlur is generally cheaper). Similarly, dotnet/runtime#101359 eliminates full memory barriers when dealing with volatile reads and writes of floats. For example, code that would previously have produced an ldr (Load Register) / dmb pair may now produce an ldar (Load-Acquire Register) / fmov (Floating-point Move) pair.
- Better switches. Depending on the shape of a switch statement, the C# compiler may generate a variety of IL patterns, one of which is to use a switch IL instruction. Normally for a switch IL instruction, the JIT will generate a jump table, but for some forms, it has an optimization to instead rely on a bit test. Thus far this optimization only existed for x86/x64, with the bt (Bit Test) instruction. Now with dotnet/runtime#91811, it also exists for Arm, with the tbz (Test bit and Branch if Zero) instruction.
- Better conditionals. Arm has conditional instructions that logically contain a branch, albeit without any branching; e.g. Performance Improvements in .NET 8 talked about the csel (Conditional Select) instruction that “conditionally selects” a value from one of two registers based on some condition. Another such instruction is csinc (Conditional Select Increment), which conditionally selects either the value from one register or the value from another register incremented by one. dotnet/runtime#91262 from @c272 enables the JIT to utilize csinc, so that a statement like x = condition ? x + 1 : y; will be able to compile down to a csinc rather than to a branching construct. dotnet/runtime#92810 also improves the custom comparison operation the JIT emits for some SequenceEqual operations (e.g. "hello, there"u8.SequenceEqual(spanOfBytes)) to be able to use ccmp (Conditional Compare).
- Better multiplies. Arm has single instructions that represent doing a multiply followed by an addition, subtraction, or negation. dotnet/runtime#91886 from @c272 finds such sequences of multiplies followed by one of those operations and consolidates them to use the single combined instruction.
- Better loads. Arm has instructions for loading a value from memory into a single register, but it also has instructions for loading multiple values into multiple registers. When the JIT emits a customized memory copy (such as for byteArray.AsSpan(0, 32).SequenceEqual(otherByteArray)), it may emit multiple ldr instructions for loading a value into a register. dotnet/runtime#92704 enables consolidating pairs of those into ldp (Load Pair of Registers) instructions, which load two values into two registers.
ARM SVE
Bringing up a new instruction set is a huge deal and a huge undertaking. I’ve mentioned in the past my process for gearing up to write one of these “Performance Improvements in .NET X” posts, including that throughout the year I keep a running list of the PRs I might want to talk about when it comes time to actually put pen to paper. Just for “SVE”, I found myself with over 200 links. I’m not going to bore you with such a laundry list; if you’re interested, you can search for SVE PRs, which includes PRs from @a74nh, from @ebepho, from @mikabl-arm, from @snickolls-arm, and from @SwapnilGaikwad. But, we can still talk a bit about what it is and what it means for .NET.
Single instruction, multiple data (SIMD) is a kind of parallel processing where one instruction performs the same operation on multiple pieces of data at the same time, rather than one instruction manipulating just a single piece of data. For example, the add instruction on x86/64 can add together one pair of 32-bit integers, whereas the paddd (Add Packed Doubleword Integers) instruction that’s part of Intel’s SSE2 (Streaming SIMD Extensions 2) instruction set operates on a pair of xmm registers that can each store four 32-bit integer values at once. Many such instructions have been added to many different hardware platforms over the years, coming in groups referred to as instruction set architectures (ISAs), where an ISA defines what the instructions are, what registers they interact with, how memory is accessed, and so on. Even if you’re not steeped in this stuff, you’ve likely heard names of these ISAs mentioned, like Intel’s SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions), or Arm’s Advanced SIMD (also known as Neon). In general, the instructions in all of these ISAs operate on a fixed number of values of a fixed size, e.g. the paddd previously mentioned only works with 128 bits at a time, no more, no less; different instructions exist for 256 bits at a time or 512 bits at a time.

SVE, or “Scalable Vector Extensions,” is an ISA from Arm that’s a bit different. The instructions in SVE don’t operate on a fixed size. Rather, the specification allows for them to operate on sizes from 128 bits up to 2048 bits, and the specific hardware can choose which size to use (allowed sizes are multiples of 128, and with SVE2 further constrained to be powers of 2). The same assembly code using these instructions might operate on 128 bits at a time on one piece of hardware and 256 bits at a time on another piece of hardware.
There are multiple ways such an ISA impacts .NET, and in particular the JIT. The JIT needs to be able to work with the ISA: understand the associated registers and be able to do register allocation, be taught about encoding and emitting the instructions, and so on. The JIT needs to be taught when and where it’s appropriate to use these instructions, so that as part of compiling IL down to assembly, if operating on a machine that supports SVE, the JIT might be able to pick SVE instructions for use in the generated assembly. And the JIT needs to be taught how to represent this data, these vectors, to user code. All of that is a huge amount of work, especially when you consider that there are thousands of operations represented. What makes it even more work is hardware intrinsics.
Hardware intrinsics are a feature of .NET where, effectively, each of these instructions shows up as its own dedicated .NET method, such as Sse2.Add, and the JIT emits use of that method as the underlying instruction to which it maps. If you look at Sve.cs in dotnet/runtime, you’ll see the System.Runtime.Intrinsics.Arm.Sve type, which already exposes more than 1400 public methods (that number is not a typo).

Two interesting things to notice if you open that file (beyond its sheer length):
- The use of Vector&lt;T&gt;. .NET’s foray into SIMD started in 2014 and was accompanied by the Vector&lt;T&gt; type. Vector&lt;T&gt; represents a single vector (list) of the T numeric type. To provide a platform-agnostic representation, since different platforms were capable of different vector widths, Vector&lt;T&gt; was defined to be variable in size, so for example on x86/x64 hardware that supported AVX2, Vector&lt;T&gt; might be 256 bits wide, whereas on an Arm machine that supported Neon, Vector&lt;T&gt; might be 128 bits wide. If the hardware supported both 128 bits and 256 bits, Vector&lt;T&gt; would map to the larger. Since the introduction of Vector&lt;T&gt;, various fixed-width vector types have been introduced, like Vector64&lt;T&gt;, Vector128&lt;T&gt;, Vector256&lt;T&gt;, and Vector512&lt;T&gt;, and the hardware intrinsics for most of the other ISAs are all in terms of those fixed-width vector sizes, since the instructions themselves are fixed width. But SVE is not; its instructions might be 128 bits here and 512 bits there, thus it’s not possible to use those same fixed-width vector types in the Sve definition… but it makes a lot of sense to use the variable-width Vector&lt;T&gt;. What’s old is new again.
- The Sve class is tagged as [Experimental]. The [Experimental] attribute was introduced in .NET 8 and C# 12. The intent is that it can be used to indicate that some functionality in an otherwise stable assembly is not yet stable and may change in the future. If code tries to use such a member, by default the C# compiler will issue an error telling the developer they’re using something that could break in the future. As long as the developer is willing to accept such breaking-change risk, they can then suppress the error. Designing and enabling the SVE support is a monstrous, multi-year effort, and while the support is functional and folks are encouraged to take it for a spin, it’s not yet baked enough for us to be 100% confident the shape won’t need to evolve (for .NET 9, it’s also restricted to hardware with a vector width of 128 bits, but that restriction will be removed subsequently). Hence, [Experimental].
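The variable width is easy to observe directly. This short sketch prints the element counts; the Vector&lt;T&gt; value depends on the machine it runs on, while the fixed-width types are the same everywhere:

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

class VectorWidthDemo
{
    static void Main()
    {
        // Variable width: e.g. 8 ints (256 bits) on an AVX2 machine,
        // 4 ints (128 bits) on a Neon machine.
        Console.WriteLine($"Vector<int>.Count:    {Vector<int>.Count}");

        // Fixed width: identical on every machine.
        Console.WriteLine($"Vector128<int>.Count: {Vector128<int>.Count}"); // 4
        Console.WriteLine($"Vector512<int>.Count: {Vector512<int>.Count}"); // 16
    }
}
```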
AVX10.1
Even with the size of the SVE effort, it’s not the only new ISA available in .NET 9. Thanks in large part to dotnet/runtime#99784 from @Ruihan-Yin and dotnet/runtime#101938 from @khushal1996, .NET 9 now also supports AVX10.1 (AVX10 version 1). AVX10.1 provides everything AVX512 provides (all of the base support, the updated encodings, support for embedded broadcasts, masking, and so on), but it only requires 256-bit support in the hardware (with 512 bits being optional, whereas AVX512 requires 512-bit support), and it does so in a much less incremental manner (AVX512 comprises multiple instruction sets like “F”, “DQ”, “Vbmi”, etc.). That’s modeled in the .NET APIs as well, where you can check Avx10v1.IsSupported as well as Avx10v1.V512.IsSupported, both of which govern more than 500 new APIs available for consumption. (Note that at the time of this writing, there aren’t actually any chips on the market that support AVX10.1, but they’re expected in the foreseeable future.)
AVX512
On the subject of ISAs, it’s worth mentioning AVX512. .NET 8 added broad support for AVX512, including support in the JIT and employment of it throughout the libraries. Both of those improve further in .NET 9. We’ll talk more about places it’s better used in the libraries later. For now, here are some JIT-specific improvements.
One of the things the JIT needs to generate code for is zeroing: e.g. by default all locals in a method need to be set to zero, and even if [SkipLocalsInit] is employed, references still need to be zeroed (otherwise, when the GC does a pass through all of the locals looking for references to objects to see what’s no longer referenced, it could see the references as being whatever garbage happened to be in that location in memory and end up making bad choices). Such zeroing of locals is overhead that occurs on every invocation of that method, so obviously it’s valuable for that to be as efficient as possible. Rather than zeroing out each word with a single instruction, if the current hardware supports the appropriate SIMD instructions, the JIT can instead emit code to use those instructions, so that it can zero out more per instruction. With dotnet/runtime#91166, it’s now able to use AVX512 instructions if available to zero out 512 bits per instruction, rather than “only” 256 bits or 128 bits using other ISAs. As an example, here’s a benchmark that needs to zero out 256 bytes:
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public unsafe class Tests
{
[Benchmark]
public void Sum()
{
Bytes values;
Nop(&values);
}
[SkipLocalsInit]
[Benchmark]
public void SumSkipLocalsInit()
{
Bytes values;
Nop(&values);
}
[MethodImpl(MethodImplOptions.NoInlining)]
private static void Nop(Bytes* value) { }
[StructLayout(LayoutKind.Sequential, Size = 256)]
private struct Bytes { }
}
Here’s the assembly for Sum on .NET 8:
Code:
; Tests.Sum()
sub rsp,108
xor eax,eax
mov [rsp+8],rax
vxorps xmm8,xmm8,xmm8
mov rax,0FFFFFFFFFFFFFF10
M00_L00:
vmovdqa xmmword ptr [rsp+rax+100],xmm8
vmovdqa xmmword ptr [rsp+rax+110],xmm8
vmovdqa xmmword ptr [rsp+rax+120],xmm8
add rax,30
jne short M00_L00
mov [rsp+100],rax
lea rdi,[rsp+8]
call qword ptr [7F6B56B85CB0]; Tests.Nop(Bytes*)
nop
add rsp,108
ret
; Total bytes of code 90
This is on a machine with AVX512 hardware support, but we can see the zeroing is happening using a loop (M00_L00 through to the jne that jumps back to it): with only 256-bit instructions available, this was deemed by the JIT’s heuristics too large to unroll completely. Now, here’s .NET 9:
Code:
; Tests.Sum()
sub rsp,108
xor eax,eax
mov [rsp+8],rax
vxorps xmm8,xmm8,xmm8
vmovdqu32 [rsp+10],zmm8
vmovdqu32 [rsp+50],zmm8
vmovdqu32 [rsp+90],zmm8
vmovdqa xmmword ptr [rsp+0D0],xmm8
vmovdqa xmmword ptr [rsp+0E0],xmm8
vmovdqa xmmword ptr [rsp+0F0],xmm8
mov [rsp+100],rax
lea rdi,[rsp+8]
call qword ptr [7F4D3D3A44C8]; Tests.Nop(Bytes*)
nop
add rsp,108
ret
; Total bytes of code 107
Now there’s no loop, because vmovdqu32 (Move Unaligned Packed Doubleword Integer Values) can be used to zero twice as much at a time (64 bytes) as vmovdqa (Move Aligned Packed Integer Values), and thus the zeroing can be done in few enough instructions that fully unrolling is still considered reasonable.

Zeroing also shows up elsewhere, such as when initializing structs. Those have also previously employed SIMD instructions where relevant, e.g. this:
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
[Benchmark]
public MyStruct Init() => new();
public struct MyStruct
{
public Int128 A, B, C, D;
}
}
produces this assembly today on .NET 8:
Code:
; Tests.Init()
vzeroupper
vxorps ymm0,ymm0,ymm0
vmovdqu32 [rsi],zmm0
mov rax,rsi
ret
; Total bytes of code 17
But, if we tweak MyStruct to add a field of a reference type anywhere in the struct (e.g. add public string Oops; as the first line of the struct above), it knocks the initialization off this optimized path, and we end up with initialization like this on .NET 8:
Code:
; Tests.Init()
xor eax,eax
mov [rsi],rax
mov [rsi+8],rax
mov [rsi+10],rax
mov [rsi+18],rax
mov [rsi+20],rax
mov [rsi+28],rax
mov [rsi+30],rax
mov [rsi+38],rax
mov [rsi+40],rax
mov [rsi+48],rax
mov rax,rsi
ret
; Total bytes of code 45
This is due to alignment requirements in order to provide necessary atomicity guarantees. But rather than giving up wholesale, dotnet/runtime#102132 allows the SIMD zeroing to be used for the contiguous portions that don’t contain GC references, so now on .NET 9 we get this:
Code:
; Tests.Init()
xor eax,eax
mov [rsi],rax
vxorps xmm0,xmm0,xmm0
vmovdqu32 [rsi+8],zmm0
mov [rsi+48],rax
mov rax,rsi
ret
; Total bytes of code 27
This optimization isn’t specific to AVX512, but it includes the ability to use AVX512 instructions when available. (dotnet/runtime#99140 provides similar support for Arm64.)
Other optimizations improve the JIT’s ability to select AVX512 instructions as part of generating code. One neat example of this is dotnet/runtime#91227 from @Ruihan-Yin, which utilizes the cool vpternlog (Bitwise Ternary Logic) instruction. Imagine you have three bools (a, b, and c), and you want to perform a series of Boolean operations on them, e.g. a ? (b ^ c) : (b &amp; c). If you were to naively compile that down, you’d end up with branches. We could make it branchless by distributing the a to both sides of the ternary, e.g. (a &amp; (b ^ c)) | (!a &amp; (b &amp; c)), but now we’ve gone from one branch and one Boolean operation to six Boolean operations. What if instead we could do all of that in a single instruction, and do it for all of the lanes in a vector at the same time so it could be applied to multiple values as part of a SIMD operation? That’d be cool, right? That’s what vpternlog enables. Try running this:
Code:
// dotnet run -c Release -f net9.0
internal class Program
{
private static bool Exp(bool a, bool b, bool c) => (a & (b ^ c)) | (!a & b & c);
private static void Main()
{
Console.WriteLine("a b c result");
Console.WriteLine("------------");
int control = 0;
foreach (var (a, b, c, result) in from a in new[] { true, false }
from b in new[] { true, false }
from c in new[] { true, false }
select (a, b, c, Exp(a, b, c)))
{
Console.WriteLine($"{Convert.ToInt32(a)} {Convert.ToInt32(b)} {Convert.ToInt32(c)} {Convert.ToInt32(result)}");
control = control << 1 | Convert.ToInt32(result);
}
Console.WriteLine("------------");
Console.WriteLine($"Control: {control:b8} == 0x{control:X2}");
}
}
Here we’ve put our Boolean operation into an Exp function, which is then being invoked for all 8 possible combinations of inputs (each of the three bools having two possible values). We’re then printing out the resulting “truth table,” which details the Boolean output for each possible input. With this particular Boolean expression, that yields this truth table being output:
Code:
a b c result
------------
1 1 1 0
1 1 0 1
1 0 1 1
1 0 0 0
0 1 1 1
0 1 0 0
0 0 1 0
0 0 0 0
------------
We then take that last result column and we treat it as a binary number:

Control: 01101000 == 0x68
So the values are 0 1 1 0 1 0 0 0, which we read as the binary 0b01101000, which is 0x68. That byte is used as a “control code” to the vpternlog instruction to encode which of the 256 possible truth tables that exist for any possible (deterministic) Boolean combination of those inputs is being chosen. This PR then teaches the JIT how to analyze the tree structures it produces to recognize such sequences of Boolean operations, compute the control code, and substitute in the use of the better instruction. Of course, the JIT isn’t going to do the enumeration I did above; it turns out there’s a more efficient way to compute the control code, performing the same sequence of operations but on specific byte values instead of Booleans, e.g. this:
Code:
// dotnet run -c Release -f net9.0
Console.WriteLine($"0x{Exp(0xF0, 0xCC, 0xAA):X2}");
static int Exp(int a, int b, int c) => (a & (b ^ c)) | (~a & b & c);
also yields:
0x68
Why those specific three values of 0xF0, 0xCC, and 0xAA? Let’s expand them from hex to binary: 0b11110000, 0b11001100, 0b10101010. Look familiar? They’re the columns for a, b, and c in the earlier truth table, so we’re really just running this expression over each of the 8 rows in the table at the same time. Fun.

Another neat example is in dotnet/runtime#92017 from @Ruihan-Yin, which optimizes 512-bit vector constants via broadcast. “Broadcast” is a fancy way of saying “replicate,” or “copy to each.” The instruction is used to take a single value and duplicate it to be used for each element of a vector. If, for example, I write:

Vector512&lt;int&gt; vector = Vector512.Create(42);

that’s broadcasting the single value 42, replicating it 16 times to fill up the 512-bit vector. Now imagine I have the following C# code, which is creating a Vector512&lt;byte&gt; composed of the byte sequence for the hex digits, but manually replicated four times, to fill up the 64 bytes that compose a 512-bit vector.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.Intrinsics;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
[Benchmark]
public Vector512<byte> HexLookupTable() =>
Vector512.Create("0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"u8);
}
This would result in that whole byte sequence being stored in the assembly data section, and then the JIT would emit the code to load that data into the appropriate registers; no broadcasting. But instead, the JIT should be able to recognize that this is actually the same 16-byte sequence repeated four times, store the sequence once, and then use a broadcast to load and replicate that value to fill out the vector. With this PR, that’s exactly what happens.
Code:
// .NET 8
; Tests.HexLookupTable()
push rax
vzeroupper
vmovups zmm0,[7FB205399700]
vmovups [rsi],zmm0
mov rax,rsi
vzeroupper
add rsp,8
ret
; Total bytes of code 31
// .NET 9
; Tests.HexLookupTable()
push rax
vbroadcasti32x4 zmm0,xmmword ptr [7F78F75290F0]
vmovups [rsi],zmm0
mov rax,rsi
vzeroupper
add rsp,8
ret
; Total bytes of code 28
This is beneficial for a variety of reasons, including less data to store, less data to load, and if the register containing this state needed to be spilled (meaning something else needs to be put into the register, so the value currently in the register is temporarily stored in memory), reloading it is similarly cheaper.
Two of the more far-reaching changes related to AVX512, though, come from dotnet/runtime#97675 and dotnet/runtime#101886, which do the work to enable the JIT to utilize AVX512 “embedded masking.” Masking is a commonly needed technique when writing SIMD code; anywhere you see a ConditionalSelect, that’s masking. Consider again a ternary operation, e.g. a ? (b + c) : (b - c). Here, a would be considered the “mask”: anywhere it’s true, the value of b + c is used, and anywhere it’s false, the value of b - c is used. If each of these were Vector512&lt;byte&gt;, for example, it would look like this in C#:
Code:
public static Vector512<byte> Exp(Vector512<byte> a, Vector512<byte> b, Vector512<byte> c) =>
Vector512.ConditionalSelect(a, b + c, b - c);
And guess what I’d get for assembly? You guessed it, our good friend vpternlogd:
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
[Benchmark]
public Vector512<byte> Test() => Exp(default, default, default);
[MethodImpl(MethodImplOptions.NoInlining)]
public static Vector512<byte> Exp(Vector512<byte> a, Vector512<byte> b, Vector512<byte> c) =>
Vector512.ConditionalSelect(a, b + c, b - c);
}
Code:
; Tests.Exp(System.Runtime.Intrinsics.Vector512`1<Byte>, System.Runtime.Intrinsics.Vector512`1<Byte>, System.Runtime.Intrinsics.Vector512`1<Byte>)
vzeroupper
vmovups zmm0,[rsp+48]
vmovups zmm1,[rsp+88]
vpaddb zmm2,zmm0,zmm1
vpsubb zmm0,zmm0,zmm1
vmovups zmm1,[rsp+8]
vpternlogd zmm1,zmm2,zmm0,0CA
vmovups [rdi],zmm1
mov rax,rdi
vzeroupper
ret
; Total bytes of code 68
We can see it’s computing both the b + c (vpaddb zmm2,zmm0,zmm1) and the b - c (vpsubb zmm0,zmm0,zmm1), and it’s then choosing between them based on the mask ([rsp+8], aka the a parameter). In this example, the mask a was being passed in and computed in a manner unknown to the ConditionalSelect. A more common scheme, however, is that the mask is computed as an argument to the ConditionalSelect. Let’s say, for example, that instead of passing in a as a mask, we pass in Vector512.LessThan(b, c) as the mask:
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
[Benchmark]
public Vector512<byte> Test() => Exp(default, default);
[MethodImpl(MethodImplOptions.NoInlining)]
public static Vector512<byte> Exp(Vector512<byte> b, Vector512<byte> c) =>
Vector512.ConditionalSelect(Vector512.LessThan(b, c), b + c, b - c);
}
AVX512 supports this implicitly via embedded masking, which means that instructions can include the masking operation as part of them rather than performing the operation separately and then doing the masking via vpternlogd. Instructions like the comparison operation employed by LessThan can target storing the result into a new kind of register defined by AVX512, a mask register, and then that mask register can be used as part of other compound operations to incorporate the mask into them. .NET developers don’t need to do anything to take advantage of this support, though: the JIT just uses the specialized masking instructions where it sees an opportunity to do so. For the previous example, on .NET 8, we’d get this:
Code:
; Tests.Exp(System.Runtime.Intrinsics.Vector512`1<Byte>, System.Runtime.Intrinsics.Vector512`1<Byte>)
vzeroupper
vmovups zmm0,[rsp+8]
vmovups zmm1,[rsp+48]
vpcmpltub k1,zmm0,zmm1
vpmovm2b zmm2,k1
vpaddb zmm3,zmm0,zmm1
vpsubb zmm0,zmm0,zmm1
vpternlogd zmm2,zmm3,zmm0,0CA
vmovups [rdi],zmm2
mov rax,rdi
vzeroupper
ret
; Total bytes of code 70
Here we still have a vpternlogd. But, with the aforementioned PRs, here’s what we get on .NET 9:
Code:
; Tests.Exp(System.Runtime.Intrinsics.Vector512`1<Byte>, System.Runtime.Intrinsics.Vector512`1<Byte>)
vmovups zmm0,[rsp+8]
vmovups zmm1,[rsp+48]
vpcmpltub k1,zmm0,zmm1
vpsubb zmm2,zmm0,zmm1
vpaddb zmm2{k1},zmm0,zmm1
vmovups [rdi],zmm2
mov rax,rdi
vzeroupper
ret
; Total bytes of code 54
That vpcmpltub instruction is doing the LessThan between b and c and storing the result as a mask in the k1 masking register. The vpsubb for the b - c still happens as it did before. But now the b + c operation is significantly different, and note there’s no vpternlogd anymore. The vpternlogd and the vpaddb we previously saw have now effectively been folded into a single vpaddb instruction with the mask. The result of the b - c is sitting in the zmm2 register. The vpaddb instruction then performs the addition between zmm0 (b) and zmm1 (c), and uses the mask k1 to decide whether to use that addition result or the existing subtraction result in zmm2. (dotnet/runtime#97468 also enables some such usage of vpternlogd to instead use vblendmps. vblendmps is similar to vpternlogd except that it’s specific to floating point and works with one of the dedicated mask registers.)

dotnet/runtime#97529 also improved casting from double and float to integer types, in particular when AVX512 is available, such that the cast can benefit from dedicated AVX512 instructions for the purpose, e.g. the VCVTTSD2USI (Convert With Truncation Scalar Double Precision Floating-Point Value to Unsigned Integer) instruction.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Linq;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private double[] _doubles = Enumerable.Range(0, 1024).Select(i => (double)i).ToArray();
private ulong[] _ulongs = new ulong[1024];
[Benchmark]
public void DoubleToUlong()
{
ReadOnlySpan<double> doubles = _doubles;
Span<ulong> ulongs = _ulongs;
for (int i = 0; i < doubles.Length; i++)
{
ulongs[i] = (ulong)doubles[i];
}
}
}
Method | Runtime | Mean | Ratio | Code Size
---|---|---|---|---
DoubleToUlong | .NET 8.0 | 1,386.5 ns | 1.00 | 135 B
DoubleToUlong | .NET 9.0 | 461.4 ns | 0.33 | 102 B

Vectorization
In addition to improvements that teach the JIT about entirely new architectures, there have also been a plethora of improvements that simply help the JIT to better employ SIMD in general.
One of my favorites is dotnet/runtime#92852, which merges consecutive stores into a single operation. Consider wanting to implement a method like bool.TryFormat:
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private bool _value;
private char[] _destination = new char[10];
[Benchmark]
public bool TryFormat() => TryFormat(_destination, out _);
private bool TryFormat(char[] destination, out int charsWritten)
{
if (_value)
{
if (destination.Length >= 4)
{
destination[0] = 'T';
destination[1] = 'r';
destination[2] = 'u';
destination[3] = 'e';
charsWritten = 4;
return true;
}
}
else
{
if (destination.Length >= 5)
{
destination[0] = 'F';
destination[1] = 'a';
destination[2] = 'l';
destination[3] = 's';
destination[4] = 'e';
charsWritten = 5;
return true;
}
}
charsWritten = 0;
return false;
}
}
Pretty simple: we’re writing out each individual value. That’s a bit unfortunate, though, in that we’re naively spending several movs to write each character individually, when instead we could pack all of those values into a single value to write. In fact, that’s exactly what the real bool.TryFormat does. Here is its handling of the true case today:
Code:
if (destination.Length > 3)
{
ulong true_val = BitConverter.IsLittleEndian ? 0x65007500720054ul : 0x54007200750065ul; // "True"
MemoryMarshal.Write(MemoryMarshal.AsBytes(destination), in true_val);
charsWritten = 4;
return true;
}
The developer has manually done the work of computing the value of the merged writes, e.g.
Code:
ulong true_val = (((ulong)'e' << 48) | ((ulong)'u' << 32) | ((ulong)'r' << 16) | (ulong)'T')
Assert.Equal(0x65007500720054ul, true_val);
in order to be able to perform a single write rather than doing four individual ones. For this particular case, now in .NET 9, the JIT can automatically do this merging so the developer doesn’t have to. The developer just writes the code that’s natural to write, and the JIT does the heavy lifting of optimizing its output (note below the mov rax, 65007500720054 instruction, loading the same value we manually computed above).
Code:
// .NET 8
; Tests.TryFormat(Char[], Int32 ByRef)
push rbp
mov rbp,rsp
cmp byte ptr [rdi+10],0
jne short M01_L01
mov ecx,[rsi+8]
cmp ecx,5
jl short M01_L00
mov word ptr [rsi+10],46
mov word ptr [rsi+12],61
mov word ptr [rsi+14],6C
mov word ptr [rsi+16],73
mov word ptr [rsi+18],65
mov dword ptr [rdx],5
mov eax,1
pop rbp
ret
M01_L00:
xor eax,eax
mov [rdx],eax
pop rbp
ret
M01_L01:
mov ecx,[rsi+8]
cmp ecx,4
jl short M01_L00
mov word ptr [rsi+10],54
mov word ptr [rsi+12],72
mov word ptr [rsi+14],75
mov word ptr [rsi+16],65
mov dword ptr [rdx],4
mov eax,1
pop rbp
ret
; Total bytes of code 112
// .NET 9
; Tests.TryFormat(Char[], Int32 ByRef)
push rbp
mov rbp,rsp
cmp byte ptr [rdi+10],0
jne short M01_L00
mov ecx,[rsi+8]
cmp ecx,5
jl short M01_L01
mov rax,73006C00610046
mov [rsi+10],rax
mov word ptr [rsi+18],65
mov dword ptr [rdx],5
mov eax,1
pop rbp
ret
M01_L00:
mov ecx,[rsi+8]
cmp ecx,4
jl short M01_L01
mov rax,65007500720054
mov [rsi+10],rax
mov dword ptr [rdx],4
mov eax,1
pop rbp
ret
M01_L01:
xor eax,eax
mov [rdx],eax
pop rbp
ret
; Total bytes of code 92
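As a sanity check (this snippet is my own, not from the PR), we can confirm that the other constant the JIT emits above, 0x73006C00610046 for the false path, is exactly the four UTF-16 code units 'F', 'a', 'l', 's' packed little-endian into a ulong:

```csharp
using System;

class PackedConstantCheck
{
    static void Main()
    {
        // Pack the first four characters of "False" as 16-bit code units,
        // lowest character in the least significant bits (little-endian).
        ulong fals = ((ulong)'s' << 48) | ((ulong)'l' << 32) | ((ulong)'a' << 16) | (ulong)'F';
        Console.WriteLine(fals == 0x73006C00610046ul); // True
    }
}
```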
dotnet/runtime#92939 improves this further by enabling longer sequences to similarly be merged using SIMD instructions.
Of course, you may then wonder: why wasn’t bool.TryFormat reverted to use the simpler code? The unfortunate answer is that this optimization currently applies only to array targets, not span targets. That’s because there are alignment requirements for performing these kinds of writes, and whereas the JIT can make certain assumptions about the alignment of arrays, it can’t make those same assumptions about spans, which can represent slices of something else at unaligned boundaries. This is now one of the few cases where arrays are better than spans; typically a span is as good or better. But I’m hopeful it will be improved in the future.

Another nice improvement is dotnet/runtime#86811 from @BladeWise, which adds SIMD support for multiplying two vectors of bytes or sbytes. Previously this would end up falling back to a software implementation, which is very slow compared to true SIMD operations. Now, the code is much faster and much more compact.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.Intrinsics;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private Vector128<byte> _v1 = Vector128.Create((byte)0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
[Benchmark]
public Vector128<byte> Square() => _v1 * _v1;
}
Method | Runtime | Mean | Ratio
---|---|---|---
Square | .NET 8.0 | 15.4731 ns | 1.000
Square | .NET 9.0 | 0.0284 ns | 0.002

dotnet/runtime#103555 (x64, when AVX512 isn’t available) and dotnet/runtime#104177 (Arm64) also improve vector multiplication, this time for long/ulong. This can be seen with a simple micro-benchmark (and because I’m running on a machine that supports AVX512, the benchmark is explicitly disabling it):
Code:
// dotnet run -c Release -f net8.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Running;
using System.Runtime.Intrinsics;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, DefaultConfig.Instance
.AddJob(Job.Default.WithId(".NET 8").WithRuntime(CoreRuntime.Core80).WithEnvironmentVariable("DOTNET_EnableAVX512F", "0").AsBaseline())
.AddJob(Job.Default.WithId(".NET 9").WithRuntime(CoreRuntime.Core90).WithEnvironmentVariable("DOTNET_EnableAVX512F", "0")));
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private Vector256<long> _a = Vector256.Create(1, 2, 3, 4);
private Vector256<long> _b = Vector256.Create(5, 6, 7, 8);
[Benchmark]
public Vector256<long> Multiply() => Vector256.Multiply(_a, _b);
}
Method | Runtime | Mean | Ratio
---|---|---|---
Multiply | .NET 8.0 | 9.5448 ns | 1.00
Multiply | .NET 9.0 | 0.3868 ns | 0.04

It’s also evident, however, on higher-level benchmarks, for example on this benchmark for XxHash128, an implementation that makes heavy use of multiplication of such vectors.
Code:
// Add a <PackageReference Include="System.IO.Hashing" Version="8.0.0" /> to the csproj.
// dotnet run -c Release -f net8.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Environments;
using System.IO.Hashing;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, DefaultConfig.Instance
.AddJob(Job.Default.WithId(".NET 8").WithRuntime(CoreRuntime.Core80).WithEnvironmentVariable("DOTNET_EnableAVX512F", "0").AsBaseline())
.AddJob(Job.Default.WithId(".NET 9").WithRuntime(CoreRuntime.Core90).WithEnvironmentVariable("DOTNET_EnableAVX512F", "0")));
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private byte[] _data;
[GlobalSetup]
public void Setup()
{
_data = new byte[1024 * 1024];
new Random(42).NextBytes(_data);
}
[Benchmark]
public UInt128 Hash() => XxHash128.HashToUInt128(_data);
}
This benchmark references the System.IO.Hashing NuGet package. Note that we’re explicitly referencing the 8.0.0 version; that means that even when running on .NET 9, we’re using the .NET 8 version of the hashing code, yet it’s still significantly faster because of these runtime improvements.
Method | Runtime | Mean | Ratio
---|---|---|---
Hash | .NET 8.0 | 40.49 us | 1.00
Hash | .NET 9.0 | 26.40 us | 0.65

Some other notable examples:
- Improved SIMD comparisons. dotnet/runtime#104944 and dotnet/runtime#104215 improve how vector comparisons are handled.
- Improved ConditionalSelects. dotnet/runtime#104092 from @ezhevita improves the generated code for ConditionalSelects when the condition is a set of constants.
- Better Const Handling. Certain operations are only optimized when one of their arguments is a constant, otherwise falling back to a much slower software emulation implementation. dotnet/runtime#102827 enables such instructions (like those for shuffling) to continue to be treated as optimized operations if the non-const argument becomes a constant as part of other optimizations (like inlining).
- Unblocking other optimizations. Some changes don’t themselves introduce optimizations, but instead make tweaks that enable other optimizations to do a better job. dotnet/runtime#104517 decomposes some bitwise operations (e.g. replacing a unified “and not” operation with an “and” and a “not”), which in turn enables other existing optimizations like common sub-expression elimination (CSE) to kick in more often. And dotnet/runtime#104214 normalized various negation patterns, which similarly enables other optimizations to apply in more places.
Branching
Just as the JIT tries to elide bounds checks where it can prove they’re unnecessary, it similarly tries to elide redundant branches.
The ability to handle the relationships between branches is improved in .NET 9. Consider this benchmark:
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
[Benchmark]
[Arguments(50)]
public void Test(int x)
{
if (x > 100)
{
Helper(x);
}
}
private void Helper(int x)
{
if (x > 10)
{
Console.WriteLine("Hello!");
}
}
}
The Helper function is simple enough to be inlined, and in .NET 8 we end up with this assembly:
Code:
; Tests.Test(Int32)
push rbp
mov rbp,rsp
cmp esi,64
jg short M00_L01
M00_L00:
pop rbp
ret
M00_L01:
cmp esi,0A
jle short M00_L00
mov rdi,7F35E44C7E18
pop rbp
jmp qword ptr [7F35E914C7C8]
; Total bytes of code 33
We can see in the original code that the branch within the inlined Helper is entirely unnecessary: we’re only there if x is greater than 100, so it’s definitely greater than 10, yet in the assembly code, we have both comparisons happening (notice the two cmps). Now in .NET 9, thanks to dotnet/runtime#95234, which improves the JIT’s ability to reason about the relationship between two ranges and whether one is implied by the other, we get this instead:
Code:
; Tests.Test(Int32)
cmp esi,64
jg short M00_L00
ret
M00_L00:
mov rdi,7F81C120EE20
jmp qword ptr [7F8148626628]
; Total bytes of code 22
Just the one outer cmp. The same thing happens for the negative case: if we tweak the x > 10 to instead be x < 10, we end up with this:
Code:
// .NET 8
; Tests.Test(Int32)
push rbp
mov rbp,rsp
cmp esi,64
jg short M00_L01
M00_L00:
pop rbp
ret
M00_L01:
cmp esi,0A
jge short M00_L00
mov rdi,7F6138428DE0
pop rbp
jmp qword ptr [7FA1DDD4C7C8]
; Total bytes of code 33
// .NET 9
; Tests.Test(Int32)
ret
; Total bytes of code 1
Similar to the x > 10 case, on .NET 8 the JIT retained both branches. But on .NET 9, it recognized not only that the inner conditional was redundant, but that it was redundant in a way that made it always false, which then allowed it to dead-code eliminate the body of that if, leaving the whole method a nop. dotnet/runtime#94689 makes this kind of information flow by enabling the JIT’s support for “cross-block local assertion prop”.

Another PR that eliminated some redundant branches is dotnet/runtime#94563, which feeds information from value numbering (a technique used to eliminate redundant expressions by giving every unique expression its own unique identifier) into the building of PHIs (a kind of node in the JIT’s intermediate representation of the code that aids in determining a variable’s value based on control flow). Consider this benchmark:
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public unsafe class Tests
{
[Benchmark]
[Arguments(50)]
public void Test(int x)
{
byte[] data = new byte[128];
fixed (byte* ptr = data)
{
Nop(ptr);
}
}
[MethodImpl(MethodImplOptions.NoInlining)]
private static void Nop(byte* ptr) { }
}
This is allocating a byte[] and then pinning it in order to use it with a method that requires a pointer. The C# specification for fixed with arrays states “If the array expression is null or if the array has zero elements, the initializer computes an address equal to zero,” and as such if you look at the IL for this code, you’ll see that it’s checking the length and setting the pointer equal to 0 if the length is 0. You can see this same behavior explicitly implemented for spans as well, in Span<T>’s GetPinnableReference implementation:
Code:
public ref T GetPinnableReference()
{
ref T ret = ref Unsafe.NullRef<T>();
if (_length != 0) ret = ref _reference;
return ref ret;
}
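To see that spec behavior concretely, here’s a small standalone demonstration of my own (it requires AllowUnsafeBlocks to compile): fixing an empty array yields a null pointer rather than throwing.

```csharp
using System;

class FixedEmptyArrayDemo
{
    static unsafe void Main()
    {
        // Per the C# spec, a zero-length (or null) array pins to address zero.
        byte[] empty = new byte[0];
        fixed (byte* p = empty)
        {
            Console.WriteLine(p == null); // True
        }

        // A non-empty array pins to the address of its first element.
        byte[] nonEmpty = new byte[1];
        fixed (byte* p = nonEmpty)
        {
            Console.WriteLine(p == null); // False
        }
    }
}
```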
As such, there’s actually an extra branch not visible in the Tests.Test benchmark. But, in this particular case, that branch is also redundant, because we can very clearly see (and the JIT should be able to as well) that the length of the array is non-zero. On .NET 8, we still get that branch:
Code:
; Tests.Test(Int32)
push rbp
sub rsp,10
lea rbp,[rsp+10]
xor eax,eax
mov [rbp-8],rax
mov rdi,offset MT_System.Byte[]
mov esi,80
call CORINFO_HELP_NEWARR_1_VC
mov [rbp-8],rax
mov rdi,[rbp-8]
cmp dword ptr [rdi+8],0
je short M00_L01
mov rdi,[rbp-8]
cmp dword ptr [rdi+8],0
jbe short M00_L02
mov rdi,[rbp-8]
add rdi,10
M00_L00:
call qword ptr [7F3F99B45C98]; Tests.Nop(Byte*)
xor eax,eax
mov [rbp-8],rax
add rsp,10
pop rbp
ret
M00_L01:
xor edi,edi
jmp short M00_L00
M00_L02:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 96
But now on .NET 9, that branch (in fact, multiple redundant branches) is removed:
Code:
; Tests.Test(Int32)
push rax
xor eax,eax
mov [rsp],rax
mov rdi,offset MT_System.Byte[]
mov esi,80
call CORINFO_HELP_NEWARR_1_VC
mov [rsp],rax
add rax,10
mov rdi,rax
call qword ptr [7F22DAC844C8]; Tests.Nop(Byte*)
xor eax,eax
mov [rsp],rax
add rsp,8
ret
; Total bytes of code 55
dotnet/runtime#87656 is another nice example and addition to the JIT’s optimization repertoire. As was discussed earlier, branches have costs associated with them. A hardware’s branch predictor can often do a very good job of mitigating the bulk of those costs, but there’s still some, and even if it were fully mitigated in the common case, a branch prediction failure can be relatively very costly. As such, minimizing branches can be very helpful, and if nothing else, turning branch-based operations into branchless ones leads to more consistent and predictable throughput, as it’s then less subject to the nature of the data being processed. Consider the following function that’s used to determine whether a character is a particular subset of whitespace characters:
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
[Benchmark]
[Arguments('s')]
public bool IsJsonWhitespace(int c)
{
if (c == ' ' || c == '\t' || c == '\r' || c == '\n')
{
return true;
}
return false;
}
}
On .NET 8, we get what you’d probably expect, a series of cmps followed by conditional jumps, one for each character:
Code:
; Tests.IsJsonWhitespace(Int32)
push rbp
mov rbp,rsp
cmp esi,20
je short M00_L00
cmp esi,9
je short M00_L00
cmp esi,0D
je short M00_L00
cmp esi,0A
je short M00_L00
xor eax,eax
pop rbp
ret
M00_L00:
mov eax,1
pop rbp
ret
; Total bytes of code 35
On .NET 9, though, we now get this:
Code:
; Tests.IsJsonWhitespace(Int32)
push rbp
mov rbp,rsp
cmp esi,20
ja short M00_L00
mov eax,0FFFFD9FF
bt rax,rsi
jae short M00_L01
M00_L00:
xor eax,eax
pop rbp
ret
M00_L01:
mov eax,1
pop rbp
ret
; Total bytes of code 31
It’s now using a bt instruction (a bit test) against a pattern that has a bit set for each of the characters being tested against, consolidating most of the branches down to just this one.

Unfortunately, this also highlights that such optimizations, which look for a particular pattern, can get knocked off their golden path, at which point the optimization won’t kick in. In this case, there are several ways it can get knocked off. The most obvious is if there are too many values or if they’re too spread out, such that they can’t fit into the 32-bit or 64-bit mask. More interestingly, if you switch it to instead use C# pattern matching (e.g. c is ' ' or '\t' or '\r' or '\n'), it also doesn’t kick in. Why? Because the C# compiler itself is trying to optimize, and the pattern it ends up generating in the IL is different from what this optimization expects. I expect this will get better in the future, but it’s a good reminder: these kinds of optimizations are useful when they make arbitrary code better, but if you’re coding to the exact nature of the optimization and relying on it happening, you really need to be paying attention.

A related optimization was added in dotnet/runtime#93521. Consider a function like the following, which checks whether a character is a lower-case hexadecimal digit:
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
[Benchmark]
[Arguments('s')]
public bool IsHexLower(char c)
{
if ((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f'))
{
return true;
}
return false;
}
}
On .NET 8, we get a comparison against '0', a comparison against '9', a comparison against 'a', and a comparison against 'f', with a conditional branch for each:
Code:
; Tests.IsHexLower(Char)
push rbp
mov rbp,rsp
movzx eax,si
cmp eax,30
jl short M00_L00
cmp eax,39
jle short M00_L02
M00_L00:
cmp eax,61
jl short M00_L01
cmp eax,66
jle short M00_L02
M00_L01:
xor eax,eax
pop rbp
ret
M00_L02:
mov eax,1
pop rbp
ret
; Total bytes of code 38
But on .NET 9, we instead get this:
Code:
; Tests.IsHexLower(Char)
push rbp
mov rbp,rsp
movzx eax,si
mov ecx,eax
sub ecx,30
cmp ecx,9
jbe short M00_L00
sub eax,61
cmp eax,5
jbe short M00_L00
xor eax,eax
pop rbp
ret
M00_L00:
mov eax,1
pop rbp
ret
; Total bytes of code 36
Effectively the JIT has rewritten the condition as if I’d written it like this:
(((uint)c - '0') <= ('9' - '0')) || (((uint)c - 'a') <= ('f' - 'a'))
which is nice, because it’s replaced two of the conditional branches with two (cheaper) subtractions.
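For illustration, here are manual equivalents of the two tricks just discussed, written by hand (on .NET 9 the JIT derives these forms itself from the straightforward comparisons, so you shouldn’t need to write them): a bit-mask membership test for a small character set, and the unsigned-subtraction range check.

```csharp
using System;

class BranchlessTricks
{
    // One 64-bit mask with a bit set for each member of the set:
    // ' ' (0x20), '\t' (9), '\n' (10), '\r' (13).
    const ulong JsonWhitespaceMask =
        (1UL << ' ') | (1UL << '\t') | (1UL << '\n') | (1UL << '\r');

    // Guard that c fits in the mask, then test its bit: one branch instead of four.
    static bool IsJsonWhitespace(int c) =>
        (uint)c < 64 && ((JsonWhitespaceMask >> c) & 1) != 0;

    // The unsigned trick: (uint)(c - '0') <= 9 covers both c >= '0' and c <= '9'
    // in one comparison, because values below '0' wrap around to huge uints.
    static bool IsHexLower(char c) =>
        (uint)(c - '0') <= ('9' - '0') || (uint)(c - 'a') <= ('f' - 'a');

    static void Main()
    {
        Console.WriteLine(IsJsonWhitespace(' '));  // True
        Console.WriteLine(IsJsonWhitespace('s'));  // False
        Console.WriteLine(IsHexLower('b'));        // True
        Console.WriteLine(IsHexLower('G'));        // False
    }
}
```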
Write Barriers
The .NET garbage collector (GC) is a generational collector. That means it divides the heap up logically by object age, where “generation 0” (or “gen0”) are the newest objects that haven’t been around for very long, “gen2” are the objects that have been around for a while, and “gen1” are in the middle. This approach is based on the theory (that also generally plays out in practice) that most objects end up being very short-lived, created for some task and then quickly dropped, and conversely that if an object has been around for a while, there’s a really good chance it’ll continue to be around for a while. By partitioning up objects like this, the GC can reduce the amount of work it needs to do when it scans for objects to be collected. It can do a scan focused only on gen0 objects, allowing it to ignore anything in gen1 or gen2 and thereby make its scan much faster. Or at least, that’s the goal. If it were to only scan gen0 objects, though, it could easily think a gen0 object wasn’t referenced because it couldn’t find any references to one from other gen0 objects… but there may have been a reference from a gen1 or gen2 object. That would be bad. How does the GC deal with this then, having its cake and eating it, too? It colludes with the rest of the runtime to track any time its generational assumptions might be violated. The GC maintains a table (called the “card table”) that indicates whether an object in a higher generation might contain a reference to a lower generation object, and any time a reference is written such that there could end up being a reference from a higher generation to a lower one, this table is updated. 
Then when the GC does its scan, it only needs to examine higher generation objects if the relevant bit in the table is set (the table doesn’t track individual objects, just ranges of them, so it’s similar to a “Bloom filter”, where the lack of a bit means there’s definitely not a reference but the presence of a bit only means there might be a reference).
The code that’s executed to track the reference write and possibly update the card table is referred to as a GC write barrier. And, obviously, if that code is happening every time a reference is written to an object, you really, really, really want that code to be efficient. There are actually multiple different forms of GC write barriers, all specialized for slightly different purposes.
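To make the card-table idea concrete, here’s a drastically simplified toy model of my own (emphatically not the runtime’s actual implementation; names, card size, and table size are all invented): every reference store marks a coarse-grained card covering the destination address, and a gen0 scan only inspects higher-generation ranges whose card is set.

```csharp
using System;

class CardTableToy
{
    const int CardSizeBytes = 256;                  // each card covers a range of addresses
    static readonly byte[] Cards = new byte[1024];  // the "card table"

    // The "write barrier": invoked on every reference store; marks the card
    // covering the destination address.
    static void WriteBarrier(nuint destinationAddress) =>
        Cards[(int)(destinationAddress / CardSizeBytes) % Cards.Length] = 1;

    // A clear card guarantees no cross-generation reference in that range;
    // a set card only means there might be one (like a Bloom filter).
    static bool MightContainCrossGenReference(nuint address) =>
        Cards[(int)(address / CardSizeBytes) % Cards.Length] != 0;

    static void Main()
    {
        Console.WriteLine(MightContainCrossGenReference(0x1234)); // False
        WriteBarrier(0x1234);                                     // simulate storing a reference
        Console.WriteLine(MightContainCrossGenReference(0x1234)); // True
        Console.WriteLine(MightContainCrossGenReference(0x9999)); // False: different card
    }
}
```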
The standard GC write barrier is CORINFO_HELP_ASSIGN_REF. However, there’s another one called CORINFO_HELP_CHECKED_ASSIGN_REF that needs to do a bit more work. The JIT is the one deciding which of these to use, and it uses the latter when it’s possible the target isn’t on the heap, in which case the barrier needs to do a little more work to figure that out.

dotnet/runtime#98166 helps the JIT do better in a certain case. If you have a static field of a value type:
Code:
static SomeStruct s_someField;
...
struct SomeStruct
{
public object Obj;
}
the runtime implements that by having a box associated with that field for storing that struct. Such static boxes are always on the heap, so if you then do:
static void Store(object o) => s_someField.Obj = o;
the JIT can prove that the cheaper unchecked write barrier may be used, and this PR teaches it that. Previously the JIT would sometimes be able to figure this out on its own; this change effectively ensures it.
Another similar improvement comes from dotnet/runtime#97953. Here’s an example based on ConcurrentQueue<T>, which maintains arrays of elements, each of which is the actual item tagged with a sequence number that’s used by the implementation to maintain correctness in the face of concurrency.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private Slot<object>[] _arr = new Slot<object>[1];
private object _obj = new object();
[Benchmark]
public void Test() => Store(_arr, _obj);
private static void Store<T>(Slot<T>[] arr, T o)
{
arr[0].Item = o;
arr[0].SequenceNumber = 1;
}
private struct Slot<T>
{
public T Item;
public int SequenceNumber;
}
}
Here as well we can see on .NET 8 it’s using the more expensive checked write barrier, but on .NET 9 the JIT has recognized it can use the cheaper unchecked write barrier:
Code:
// .NET 8
; Tests.Test()
push rbx
mov rbx,[rdi+8]
mov rsi,[rdi+10]
cmp dword ptr [rbx+8],0
jbe short M00_L00
add rbx,10
mov rdi,rbx
call CORINFO_HELP_CHECKED_ASSIGN_REF
mov dword ptr [rbx+8],1
pop rbx
ret
M00_L00:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 42
// .NET 9
; Tests.Test()
push rbx
mov rbx,[rdi+8]
mov rsi,[rdi+10]
cmp dword ptr [rbx+8],0
jbe short M00_L00
add rbx,10
mov rdi,rbx
call CORINFO_HELP_ASSIGN_REF
mov dword ptr [rbx+8],1
pop rbx
ret
M00_L00:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 42
dotnet/runtime#101761 actually introduces a new form of write barrier. Consider this:
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private MyStruct _value;
private Wrapper _wrapper = new();
[Benchmark]
public void Store() => _wrapper.Value = _value;
private record struct MyStruct(string a1, string a2, string a3, string a4);
private class Wrapper
{
public MyStruct Value;
}
}
Previously, as part of copying that struct, each of those fields (represented by a1 through a4) would individually incur a write barrier:
Code:
; Tests.Store()
push rax
mov [rsp],rdi
mov rax,[rdi+8]
lea rdi,[rax+8]
mov rsi,[rsp]
add rsi,10
call CORINFO_HELP_ASSIGN_BYREF
call CORINFO_HELP_ASSIGN_BYREF
call CORINFO_HELP_ASSIGN_BYREF
call CORINFO_HELP_ASSIGN_BYREF
nop
add rsp,8
ret
; Total bytes of code 47
Now in .NET 9, this PR added a new bulk write barrier, which can implement the operation more efficiently.
Code:
; Tests.Store()
push rax
mov rsi,[rdi+8]
add rsi,8
cmp [rsi],sil
add rdi,10
mov [rsp],rdi
cmp [rdi],dil
mov rdi,rsi
mov rsi,[rsp]
mov edx,20
call qword ptr [7F5831BC5740]; System.Buffer.BulkMoveWithWriteBarrier(Byte ByRef, Byte ByRef, UIntPtr)
nop
add rsp,8
ret
; Total bytes of code 47
Making GC write barriers faster is good; after all, they’re used a lot. However, switching from the checked write barrier to the non-checked write barrier is a very micro optimization; the extra overhead of the checked variant is often just a couple of comparisons. A better optimization is avoiding the need for a barrier entirely! dotnet/runtime#103503 recognizes that ref structs can’t possibly be on the GC heap by their very nature, and as such, write barriers can be entirely elided when writing into the fields of a ref struct.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
[Benchmark]
public void Store()
{
MyRefStruct s = default;
Test(ref s, new object(), new object());
}
[MethodImpl(MethodImplOptions.NoInlining)]
private void Test(ref MyRefStruct s, object o1, object o2)
{
s.Obj1 = o1;
s.Obj2 = o2;
}
private ref struct MyRefStruct
{
public object Obj1;
public object Obj2;
}
}
On .NET 8, we have two barriers; on .NET 9, zero:
Code:
// .NET 8
; Tests.Test(MyRefStruct ByRef, System.Object, System.Object)
push r15
push rbx
mov rbx,rsi
mov r15,rcx
mov rdi,rbx
mov rsi,rdx
call CORINFO_HELP_CHECKED_ASSIGN_REF
lea rdi,[rbx+8]
mov rsi,r15
call CORINFO_HELP_CHECKED_ASSIGN_REF
nop
pop rbx
pop r15
ret
; Total bytes of code 37
// .NET 9
; Tests.Test(MyRefStruct ByRef, System.Object, System.Object)
mov [rsi],rdx
mov [rsi+8],rcx
ret
; Total bytes of code 8
Similarly, dotnet/runtime#102084 is able to remove some barriers on Arm64 as part of
ref struct
copies.Object Stack Allocation
For years, .NET has explored the possibility of stack-allocating managed objects. It’s something that other managed languages like Java are already capable of doing, but it’s also more critical in Java, which lacks the equivalent of value types (e.g. if you want a list of integers, that’d most likely be a List<Integer>, which will box each integer value added to the list, similar to if List<object> were used in .NET). In .NET 9, object stack allocation starts to happen. Before you get too excited, it’s limited in scope right now, but in the future it’s likely to expand further.

The hardest part of stack allocating objects is ensuring that it’s safe. If a reference to the object were to escape and end up stored somewhere that outlived the stack frame containing the stack-allocated object, that would be very bad: when the method returned, those outstanding references would be pointing to garbage. So, the JIT needs to perform escape analysis to ensure that never happens, and doing that well is extremely challenging. For .NET 9, the support was introduced in dotnet/runtime#103361 (and brought to Native AOT in dotnet/runtime#104411), and it doesn’t do any interprocedural analysis, which means it’s limited to handling cases where it can easily prove the object reference doesn’t leave the current frame. Even so, there are plenty of situations where this will help to eliminate allocations, and I expect it’ll be expanded to handle more and more cases in the future. When the JIT does choose to allocate an object on the stack, it effectively promotes the fields of the object to be individual variables in the stack frame.
Here’s a very simple example of the mechanism in action:
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
[Benchmark]
public int GetValue() => new MyObj(42).Value;
private class MyObj
{
public MyObj(int value) => Value = value;
public int Value { get; }
}
}
On .NET 8, the generated code for GetValue looks like this:
Code:
; Tests.GetValue()
push rax
mov rdi,offset MT_Tests+MyObj
call CORINFO_HELP_NEWSFAST
mov dword ptr [rax+8],2A
mov eax,[rax+8]
add rsp,8
ret
; Total bytes of code 31
The generated code is allocating a new object, populating that object’s Value, and then reading that Value as the value to return. On .NET 9, we instead end up with this picture of simplicity:
Code:
; Tests.GetValue()
mov eax,2A
ret
; Total bytes of code 6
The JIT has inlined the constructor, inlined accesses to the Value property, promoted the field backing that property to be a variable, and in effect optimized the entire operation simply to be return 42;.

Method | Runtime | Mean | Ratio | Code Size | Allocated | Alloc Ratio
---|---|---|---|---|---|---
GetValue | .NET 8.0 | 3.6037 ns | 1.00 | 31 B | 24 B | 1.00
GetValue | .NET 9.0 | 0.0519 ns | 0.01 | 6 B | – | 0.00

Here’s another more impactful example. When it comes to performance optimization, it’s really nice when the right things just happen; otherwise, developers need to learn the minute differences between performing an operation this way or that way. Every programming language and platform has non-trivial amounts of such things, but we really want to drive the number of them down. One interesting case for .NET has had to do with structs and casting. Consider these two Dispose1 and Dispose2 methods:
Code:
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
[Benchmark]
public void Test()
{
Dispose1<MyStruct>(default);
Dispose2<MyStruct>(default);
}
[MethodImpl(MethodImplOptions.NoInlining)]
private bool Dispose1<T>(T o)
{
bool disposed = false;
if (o is IDisposable disposable)
{
disposable.Dispose();
disposed = true;
}
return disposed;
}
[MethodImpl(MethodImplOptions.NoInlining)]
private bool Dispose2<T>(T o)
{
bool disposed = false;
if (o is IDisposable)
{
((IDisposable)o).Dispose();
disposed = true;
}
return disposed;
}
private struct MyStruct : IDisposable
{
public void Dispose() { }
}
}
Ideally, if you call them with a value type T, there wouldn’t be any allocation, but unfortunately, in Dispose1, because of how things line up here, the JIT would end up needing to box o to produce the IDisposable. Interestingly, due to optimizations from several years ago, in Dispose2 the JIT is in fact able to elide the boxing. On .NET 8, we get this:
Code:
; Tests.Dispose1[[Tests+MyStruct, benchmarks]](MyStruct)
push rbx
mov rdi,offset MT_Tests+MyStruct
call CORINFO_HELP_NEWSFAST
add rax,8
mov ebx,[rsp+10]
mov [rax],bl
mov eax,1
pop rbx
ret
; Total bytes of code 33
; Tests.Dispose2[[Tests+MyStruct, benchmarks]](MyStruct)
mov eax,1
ret
; Total bytes of code 6
This is one of those things a developer would have to “just know,” and also fight against tooling like IDE0038 that pushes developers to write this code in the style of my first version, even though for structs the second version ends up being more efficient. This work on stack allocation makes that difference go away, because the box created in the first version is a quintessential example of an allocation the JIT is now able to place on the stack. On .NET 9, we now end up with this:
Code:
; Tests.Dispose1[[Tests+MyStruct, benchmarks]](MyStruct)
mov eax,1
ret
; Total bytes of code 6
; Tests.Dispose2[[Tests+MyStruct, benchmarks]](MyStruct)
mov eax,1
ret
; Total bytes of code 6
Method | Runtime | Mean | Ratio | Code Size | Allocated | Alloc Ratio
---|---|---|---|---|---|---
Test | .NET 8.0 | 5.726 ns | 1.00 | 94 B | 24 B | 1.00
Test | .NET 9.0 | 2.095 ns | 0.37 | 45 B | – | 0.00

Inlining
Improvements in inlining were a major focus of previous releases, and will likely be a major focus again in the future. For .NET 9, there weren’t a ton of changes, but there was one particularly impactful improvement.
As a motivating example, consider ArgumentNullException.ThrowIfNull again. It is defined like this:

public static void ThrowIfNull(object? arg, [CallerArgumentExpression(nameof(arg))] string? paramName = null);
Notably, it’s non-generic, something we’re asked about with some frequency. We chose not to make it generic for three reasons:
- The main benefit of making it generic would be to avoid boxing structs, but the JIT already eliminated said boxing in tier 1, and as was highlighted earlier in this post, it’s possible for it to eliminate it in tier 0 as well (and now does).
- Every generic instantiation (using a generic with a different type) adds runtime overhead. We didn’t want to bloat a process with such additional metadata and runtime data structures just to support argument validation that should rarely if ever fail in production.
- When used with reference types (which is its raison d’etre), it would not play well with inlining, but inlining of such a “throw helper” is critical for performance. Generic methods with coreclr and Native AOT work in one of two ways. For value types, every time a generic is used with a different value type, an entire copy of the generic method is made and specialized for that parameter type; it’s as if you wrote a dedicated version of that generic code that wasn’t generic and was instead customized specifically for that type. For reference types, there’s only one copy of the code that’s then shared across all reference types, and it’s parameterized at run-time based on the actual type being used. When you access such a shared generic, at run-time it ends up looking up in a dictionary the information about the generic argument and using the discovered information to inform the rest of the method. Historically, this has not been conducive to inlining.
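To make the shared-generics behavior concrete, here’s a minimal sketch (my own illustration, not code from the post): the single shared body compiled for all reference types must recover T at run time, which is exactly the kind of dictionary lookup that historically blocked inlining.

```csharp
using System;

public static class SharedGenericsDemo
{
    // For reference types (string, Version, ...), coreclr compiles ONE shared
    // native body for this method; typeof(T) inside it is satisfied by a
    // run-time lookup. For value types (int, Guid, ...), a dedicated copy is
    // compiled per T, and typeof(T) is effectively a constant in that copy.
    public static string Describe<T>() => typeof(T).Name;

    public static void Main()
    {
        Console.WriteLine(Describe<string>()); // shared generic code path
        Console.WriteLine(Describe<int>());    // specialized value-type path
    }
}
```

Both calls produce the same managed behavior; the difference is purely in the machine code the runtime generates for each instantiation.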
So, ThrowIfNull is non-generic. But there are other throw helpers, and many of them are generic. That’s because a) they’re primarily expected to work with value types, and b) we had no choice, given the nature of the methods. So, for example, ArgumentOutOfRangeException.ThrowIfEqual is generic on T, accepting two values of T to compare and throwing if they’re the same. And if T is a reference type, on .NET 8 it may not successfully inline if the caller is a shared generic as well. With this:
Code:
// dotnet run -c Release -f net8.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
namespace Benchmarks;
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public unsafe class Tests
{
private static void Main(string[] args) =>
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[Benchmark]
public void Test() => ThrowOrDispose(new Version(1, 0), new Version(1, 1));
[MethodImpl(MethodImplOptions.NoInlining)]
private static void ThrowOrDispose<T>(T value, T invalid) where T : IEquatable<T>
{
ArgumentOutOfRangeException.ThrowIfEqual(value, invalid);
if (value is IDisposable disposable)
{
disposable.Dispose();
}
}
}
on .NET 8 we get this for the ThrowOrDispose method (this example benchmark has a slightly different shape from previous examples, and this output is from Windows, for reasons that will become clear shortly):
Code:
; Benchmarks.Tests.ThrowOrDispose[[System.__Canon, System.Private.CoreLib]](System.__Canon, System.__Canon)
push rsi
push rbx
sub rsp,28
mov [rsp+20],rcx
mov rbx,rdx
mov rsi,r8
mov rdx,[rcx+10]
mov rax,[rdx+10]
test rax,rax
je short M01_L00
mov rcx,rax
jmp short M01_L01
M01_L00:
mov rdx,7FF996A8B170
call CORINFO_HELP_RUNTIMEHANDLE_METHOD
mov rcx,rax
M01_L01:
mov rdx,rbx
mov r8,rsi
mov r9,1DB81B20390
call qword ptr [7FF996AC5BC0]; System.ArgumentOutOfRangeException.ThrowIfEqual[[System.__Canon, System.Private.CoreLib]](System.__Canon, System.__Canon, System.String)
mov rdx,rbx
mov rcx,offset MT_System.IDisposable
call qword ptr [7FF996664348]; System.Runtime.CompilerServices.CastHelpers.IsInstanceOfInterface(Void*, System.Object)
test rax,rax
jne short M01_L03
M01_L02:
add rsp,28
pop rbx
pop rsi
ret
M01_L03:
mov rcx,rax
mov r11,7FF9965204F8
call qword ptr [r11]
jmp short M01_L02
; Total bytes of code 124
Two things in particular to notice here. First, we see there’s a call to CORINFO_HELP_RUNTIMEHANDLE_METHOD; that’s the helper being used to obtain information about the actual type T being used. Second, ThrowIfEqual is not being inlined; if it were, we wouldn’t see that call to ThrowIfEqual here but would instead see the actual code for ThrowIfEqual. We can confirm why it’s not being inlined via another BenchmarkDotNet diagnoser: [InliningDiagnoser]. The JIT is capable of emitting events for much of its activity, including reporting on any successful or failed inlining operations, and [InliningDiagnoser] listens to those events and reports them as part of the benchmarking results. This particular diagnoser is in a separate BenchmarkDotNet.Diagnostics.Windows package and today only works when running on Windows, because it relies on ETW; that’s why I made the previous benchmark Windows-based as well. When I add:

[InliningDiagnoser(allowedNamespaces: ["Benchmarks"])]

to my Tests class, and run the benchmarks for .NET 8:
Code:
// Add a <PackageReference Include="BenchmarkDotNet.Diagnostics.Windows" Version="0.14.0" /> to the csproj.
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnostics.Windows.Configs;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
namespace Benchmarks;
[InliningDiagnoser(allowedNamespaces: ["Benchmarks"])]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private static void Main(string[] args) =>
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[Benchmark]
public void Test() => ThrowOrDispose(new Version(1, 0), new Version(1, 1));
[MethodImpl(MethodImplOptions.NoInlining)]
private static void ThrowOrDispose<T>(T value, T invalid) where T : IEquatable<T>
{
ArgumentOutOfRangeException.ThrowIfEqual(value, invalid);
if (value is IDisposable disposable)
{
disposable.Dispose();
}
}
}
I see this as part of the output:
Code:
Inliner: Benchmarks.Tests.ThrowOrDispose - generic void (!!0,!!0)
Inlinee: System.ArgumentOutOfRangeException.ThrowIfEqual - generic void (!!0,!!0,class System.String)
Fail Reason: runtime dictionary lookup
In other words, ThrowOrDispose called ThrowIfEqual but couldn’t inline it because ThrowIfEqual contained a “runtime dictionary lookup”; that is, it’s a shared generic method.

Now on .NET 9, thanks to dotnet/runtime#99265, it is inlined! The resulting assembly is too large for me to show here, but we can see the impact in the benchmark results:
Method | Runtime | Mean | Ratio
---|---|---|---
Test | .NET 8.0 | 17.54 ns | 1.00
Test | .NET 9.0 | 12.76 ns | 0.73

and the inlining report now shows it as having been successfully inlined.
GC
Applications end up having different needs when it comes to memory management. Would you be willing to throw more memory at maximizing throughput, or do you care more about minimizing working set? How important is it that unused memory be returned to the system aggressively? Is your expected workload constant or ebbing and flowing in nature? The GC has long had lots of knobs for configuring behavior based on these kinds of questions, but none more apparent than the choice of whether to use the “workstation GC” or “server GC”.
By default, an application uses the workstation GC, though some environments (like ASP.NET) opt-in to using server GC automatically. You can explicitly opt-in in a variety of ways, including by adding <ServerGarbageCollection>true</ServerGarbageCollection> to your project file (as we did in the Benchmarking Setup section of this post). Workstation GC optimizes for reduced memory consumption, while server GC optimizes for maximum throughput. Historically, workstation employs a single heap, whereas server employs a heap per core. That typically represents a tradeoff between the amount of memory consumed and the overhead of accessing a heap, such as the cost of allocating. If a bunch of threads are all trying to allocate at the same time, with server GC they’re very likely to all be accessing different heaps, thereby reducing contention, whereas with workstation GC, they’re all going to be fighting for access. Conversely, more heaps generally means more memory consumed (even though each heap could be smaller than the single one), especially in lull periods where the system might not be fully loaded, yet is paying in working set for those extra heaps.

The decision for which to use isn’t always so clear. Especially in the presence of containers, you frequently still care about really good throughput, but also don’t want to be spending memory uselessly. Enter “DATAS,” or “Dynamically Adapting To Application Sizes.” DATAS was introduced in .NET 8 and serves to narrow the gap between workstation and server GC, bringing server GC closer to workstation memory consumption. It dynamically scales how much memory is consumed by server GC, such that in times of less load, less memory is used. While DATAS shipped in .NET 8, it was only on by default for Native AOT-based projects, and even there it still had some issues to be sorted. Those issues have now been sorted (e.g. dotnet/runtime#98743, dotnet/runtime#100390, dotnet/runtime#102368, and dotnet/runtime#105545), such that in .NET 9, as of dotnet/runtime#103374, DATAS is now enabled by default for server GC.
If you have a workload where absolute best possible throughput is paramount and you’re ok with additional memory being consumed to enable that, you should feel free to disable DATAS, e.g. by adding this to your project file:
<GarbageCollectionAdaptationMode>0</GarbageCollectionAdaptationMode>
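As a side note (my addition, not from the post), when experimenting with these project-file knobs you can observe at run time which GC flavor a process actually ended up with via the long-standing System.Runtime.GCSettings API:

```csharp
using System;
using System.Runtime;

public static class GcModeCheck
{
    public static void Main()
    {
        // True when the process is using server GC (e.g. via
        // <ServerGarbageCollection>true</ServerGarbageCollection>).
        Console.WriteLine($"Server GC:    {GCSettings.IsServerGC}");

        // The current latency mode (Interactive, Batch, SustainedLowLatency, ...).
        Console.WriteLine($"Latency mode: {GCSettings.LatencyMode}");
    }
}
```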
While DATAS being enabled by default is a very impactful improvement for .NET 9, there are other GC-related improvements in the release as well. For example, when compacting heaps, the GC may end up sorting objects by address. For large numbers of objects, this sort can be relatively expensive, and it behooves the GC to speed up the sorting operation. For this purpose, several releases ago the GC incorporated a vectorized sorting algorithm called vxsort, effectively a quicksort with a vectorized partitioning step. However, it was only enabled on Windows (and only on x64). In .NET 9, as part of dotnet/runtime#98712, it’s enabled on Linux as well, which helps to reduce GC pause times.
VM
The .NET runtime provides many services to managed code. There’s the GC, of course, and the JIT compiler, and then there’s a whole bunch of functionality around things like assembly and type loading, exception handling, configuration management, virtual dispatch, interop infrastructure, stub management, and so on. All of that functionality is generally referred to as being part of the coreclr virtual machine (VM).
Many performance changes in this area are hard to demonstrate, but they’re still worth mentioning. dotnet/runtime#101580 lazily-allocates some information related to method entrypoints, resulting in smaller heap sizes and less work on startup. dotnet/runtime#96857 also removed some unnecessary allocation happening related to data structures around methods. dotnet/runtime#96703 reduced the algorithmic complexity of some key functions involved in building up method tables, while dotnet/runtime#96466 streamlined access to those tables, minimizing the number of indirections involved.
Another set of changes went into improving various calls from managed code into the VM. When managed code needs to call into the runtime, there are a couple of mechanisms it can employ. One is a “QCALL,” which is effectively just a P/Invoke / DllImport into functions declared in the runtime. The other is an “FCALL,” a much more specialized and complicated mechanism for invoking runtime code that’s capable of accessing managed objects. FCALL used to be the dominant mechanism, but each release more and more such calls are transitioned over to being QCALLs, which helps with both correctness (FCALLs can be hard to “get right”) and in some cases performance (some FCALLs need helper method frames that in turn typically make them more expensive than QCALLs). dotnet/runtime#96860 converted over members of Marshal, dotnet/runtime#96916 did so for Interlocked, dotnet/runtime#96926 handled several more threading-related members, dotnet/runtime#97432 converted some of the built-in marshaling support, dotnet/runtime#97469 and dotnet/runtime#100939 handled methods from GC and throughout reflection, dotnet/runtime#103211 from @AustinWise converted GC.ReRegisterForFinalize, and dotnet/runtime#105584 converted Delegate.GetMulticastInvoke (which is used in APIs like Delegate.Combine and Delegate.Remove). dotnet/runtime#97590 did the same for the slow path in ValueType.GetHashCode, while also converting the fast path to managed code to avoid the transition entirely.

But arguably the most impactful change in this area for .NET 9 is around exceptions. Exceptions are expensive and should be avoided where performance matters. But just because they’re expensive doesn’t mean it’s not valuable to make them less expensive. And in fact, there are cases where it’s really worthwhile to do so. One of the things we sporadically observe in the wild is an “exception storm”: some failure happens, which causes another failure, which causes another. Each of those incurs exceptions. CPU consumption starts to spike as the overhead of those exceptions is incurred. Then other things start to time out because they’re getting starved, and they throw exceptions, which in turn causes more failures. You get the idea.
In Performance Improvements in .NET 8, I highlighted that in my opinion the single most important performance improvement in the release was a single character change, enabling dynamic PGO by default. Now in .NET 9, dotnet/runtime#98570 is similar, a super small and simple PR that belies all the work that came before it. Earlier on, dotnet/runtime#88034 had ported the Native AOT exception handling implementation over to coreclr, but it was disabled by default due to still needing bake time. It’s now had that bake time, and the new implementation is now on by default in .NET 9. And it’s much faster. Things get better still with dotnet/runtime#103076, which removes a global spinlock involved in the handling of exceptions.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
[Benchmark]
public async Task ExceptionThrowCatch()
{
for (int i = 0; i < 1000; i++)
{
try { await Recur(10); } catch { }
}
}
private async Task Recur(int depth)
{
if (depth <= 0)
{
await Task.Yield();
throw new Exception();
}
await Recur(depth - 1);
}
}
Method | Runtime | Mean | Ratio
---|---|---|---
ExceptionThrowCatch | .NET 8.0 | 123.03 ms | 1.00
ExceptionThrowCatch | .NET 9.0 | 54.68 ms | 0.44

Mono
We frequently say “the runtime,” but in reality there are currently multiple runtime implementations in .NET. “coreclr” is the runtime thus far referred to, which is the default runtime used on Windows, Linux, and macOS, and for services and desktop applications, but there’s also “mono,” which is mainly used when the form factor of the target application requires a small runtime: by default, it’s the runtime that’s used when building mobile apps for Android and iOS today, as well as the runtime used for Blazor WASM apps. mono has also seen a multitude of performance improvements in .NET 9:
- Save/restoring of profile data. One of the features provided by mono is an interpreter, which enables .NET code to execute in environments where JIT’ing isn’t permitted, as well as to enable faster startup. Specifically for when targeting WASM, the interpreter has a form of PGO where after methods have been invoked some number of times and are deemed important, it’ll generate WASM on-the-fly to optimize those methods. This tiering gets better in .NET 9 with dotnet/runtime#92981, which enables keeping track of which methods tiered up, and if the code is running in a browser, storing that information in the browser’s cache for subsequent runs. When the code then runs subsequently, it can incorporate the previous learnings to tier up better and more quickly.
- SSA-based Optimization. The compiler that generates that WASM applied optimizations primarily at the basic block level. dotnet/runtime#96315 overhauls the implementation to employ Static Single Assignment (SSA) form, which is commonly used by optimizing compilers and which ensures that every variable is assigned in exactly one location. That form simplifies many resulting analyses and thus helps to better optimize the code.
- Vector improvements. More and more vectorization is being done by the core libraries, utilizing hardware intrinsics and the various Vector types. To enable such library code to execute well on mono, the various mono backends need to also handle those operations efficiently. One of the most impactful changes here is dotnet/runtime#105299, which updated mono to accelerate Shuffle for types other than byte and sbyte (which were already handled). This is very impactful to functionality in the core libraries, many of which use Shuffle as part of core algorithms, like throughout IndexOfAny, hex encoding and decoding, Base64 encoding and decoding, Guid, and more. dotnet/runtime#92714 and dotnet/runtime#98037 also improved vector construction, such as by enabling the mono JIT to utilize the Arm64 ins (Insert) instruction when creating one float or double vector from the values in another.
- More intrinsics. dotnet/runtime#98077, dotnet/runtime#98514, and dotnet/runtime#98710 implemented various AdvSimd.Load* and AdvSimd.Store* APIs. dotnet/runtime#99115 and dotnet/runtime#101622 intrinsified several clearing and filling methods that back Span<T>.Clear/Fill. And dotnet/runtime#105150 and dotnet/runtime#104698 optimized various Unsafe methods, such as BitCast. dotnet/runtime#91813 also significantly improved unaligned access on a variety of CPUs by not forcing the implementation down a slow path if the CPU is able to handle such reads and writes.
- Startup. dotnet/runtime#100146 is a fun one, as it had accidentally positive benefits for mono startup. The change updated dotnet/runtime’s configuration to enable more static analysis, in particular enforcing CA1865, CA1866, and CA1867, which we hadn’t yet gotten around to enabling for the repo. The change included fixing all of the violations of the rules, which mostly meant fixing call sites like IndexOf("!") (IndexOf taking a single-character string) and replacing them with IndexOf('!'). The intent of the rule is that doing so is a little bit faster and the call site ends up being a little bit cleaner. But IndexOf(string) is culture-aware, which means using it can force the globalization library ICU to be loaded and initialized. As it turns out, some of these uses were on mono’s startup path and were forcing ICU to be loaded when it wasn’t actually necessary. Fixing those meant the loading could be delayed, and startup performance improved as a result. dotnet/runtime#101312 also improved startup with the interpreter by adding a cache to the code that does vtable setups. This uses a custom hash table implementation added in dotnet/runtime#100386, which is then also used elsewhere, such as in dotnet/runtime#101460 and dotnet/runtime#102476. That hash table is itself interesting, as its lookups are vectorized for x64, Arm, and WASM, and it’s generally optimized for cache locality.
- Variance check removal. When storing objects into an array, that operation needs to be validated to ensure compatibility between the type being stored and the concrete type of the array. Given a base type B and two derived types D1 : B and D2 : B, you could have an array B[] array = new D1[42];, and then the code array[0] = new D2() would successfully compile, because D2 is a B, but at run-time this must fail, as D2 is not a D1, and so the runtime needs a check to ensure correctness. If the array’s type is sealed, though, this check can be avoided, since then you can’t end up with this discrepancy. coreclr already does that optimization; now as part of dotnet/runtime#99829, the mono interpreter does so as well.
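The variance check described in that last bullet is easy to see for yourself; here’s a minimal sketch (type names are mine, mirroring the B/D1/D2 example above):

```csharp
using System;

public class B { }
public class D1 : B { }
public class D2 : B { }
public sealed class S : B { }

public static class VarianceDemo
{
    public static void Main()
    {
        B[] array = new D1[1];
        try
        {
            array[0] = new D2(); // compiles (D2 is a B), but the runtime's
                                 // covariance check rejects it: the underlying
                                 // array is a D1[], and a D2 is not a D1.
        }
        catch (ArrayTypeMismatchException)
        {
            Console.WriteLine("store rejected at run time");
        }

        // With a sealed element type there can be no such mismatch, so the
        // runtime (and now the mono interpreter) can elide the check.
        S[] sealedArray = new S[1];
        sealedArray[0] = new S();
    }
}
```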
Native AOT
Native AOT is a solution for generating native executables directly from .NET applications. The resulting binary doesn’t require .NET to be installed and does not require JIT’ing; instead it contains in it all of the assembly code for the whole app, inclusive of the code for any core library functionality accessed, the assembly for the garbage collector, and so on. Native AOT first shipped in .NET 7 and was then significantly improved for .NET 8, in particular around reducing the size of the resulting applications. Now in .NET 9, investment continues in Native AOT, with some very nice fruits of the labor on it. (Note that the Native AOT tool chain uses the JIT to generate assembly code, so most of the code generation improvements discussed in the JIT section and elsewhere in this post accrue to Native AOT as well.)
One of the biggest concerns for Native AOT is size and trimming. Native AOT-based applications and libraries compile everything, all user code, all the library code, the runtime, everything, into the single native binary. It’s thus imperative that the tool chain goes to extra lengths to get rid of as much as possible in order to keep that size down. This can include being more clever about how the runtime stores the state necessary for execution. It can include being more thoughtful about generics in order to reduce the possible code size explosion that can result from lots of generic instantiations (effectively multiple copies of the exact same code all specialized for different type arguments). And it can include being very diligent about avoiding dependencies that can bring in lots of code unexpectedly and that the trimming tools are unable to reason about enough to remove. Here are some examples of all of these in .NET 9:
- Refactoring choke points. Think through your code: how many times have you written a method that takes some input and then dispatches to one of many different kinds of things based on the input provided? That’s reasonably common. Unfortunately, it can also be problematic for Native AOT code size. A good example is fixed by dotnet/runtime#91185 in System.Security.Cryptography. There are a bunch of hashing-related types, like SHA256 or SHA3_384, that all offer a HashData method. Then there are places where the exact hashing algorithm to be used is specified via a HashAlgorithmName. You can likely envision the large switch statement that results (or don’t imagine and instead just look at the code), where based on the exact HashAlgorithmName specified, the implementation selects the right type’s HashData method to call. That is what’s often referred to as a “choke point,” where all callers end up coming through this one method, which then fans out to the relevant implementations, but that also then causes this size problem for Native AOT: if that choke point is referenced, it typically ends up needing to generate the code for all of the referenced methods, even if only a subset are actually used. Some of these cases are really challenging to solve. In this particular case, though, thankfully all of those HashData methods turned around and called a parameterized, shared implementation. So the fix was to just skip the middle tier and have the HashAlgorithmName layer go directly to the workhorse implementation, without naming the intermediate layer methods.
- Less LINQ. LINQ is an amazing productivity tool. We love LINQ and invest in it every release of .NET (see the later section in this post on tons of performance improvements in LINQ in .NET 9). With Native AOT, however, significant use of LINQ can also measurably increase code size, in particular when value types are involved. As will be discussed later when talking about LINQ optimizations, one of the optimizations LINQ employs is to special-case, based on the inputs, what kind of IEnumerable<T> its methods give back. So, for example, if you call Select with an array input, the IEnumerable<T> you get back might actually be an instance of the internal ArraySelectIterator<T>, and if you call Select with a List<T>, the IEnumerable<T> you get back might actually be an instance of the internal ListSelectIterator<T>. The Native AOT trimmer can’t readily determine which of those paths might be used, so the Native AOT compiler needs to generate code for all such types when you call Select<T>. If the T is a reference type, there will just be a single copy of the generated code shared for all reference types. But if the T is a value type, there will be a custom stamp of the code generated for and optimized for each unique T. That means if such LINQ APIs (and other similar APIs) are used a lot, they can disproportionately increase the size of a Native AOT binary. dotnet/runtime#98109 is an example of a PR that replaced a bit of LINQ code in order to measurably reduce the size of ASP.NET applications compiled with Native AOT. But you can also see that PR being thoughtful about which LINQ usage was removed, citing these few specific instances making a measurable difference and leaving the rest of the LINQ usage in the library intact.
- Avoiding unnecessary array types. The SharedArrayPool<T> that backs ArrayPool<T>.Shared was storing lots of state, including several fields with types along the lines of T[][]. This makes sense; it’s pooling arrays, so it needs an array of arrays. From a Native AOT perspective, though, if T is a value type (as is very common with ArrayPool<T>), T[][] as its own unique array type needs its own code generated for it, distinct from the code for, for example, T[]. As it turns out, ArrayPool<T> doesn’t actually need to work with the array instances in these cases, so it doesn’t actually need the strongly-typed nature of the arrays; this could just as well be object[] or Array[]. And that’s one of the main things that dotnet/runtime#97058 does: with that, the compiled binary can carry the code generated for just Array[] rather than needing code for byte[][] and char[][] and object[][] and for whatever other type Ts are used with ArrayPool<T> in the application.
- Avoiding unnecessarily generic code. The Native AOT compiler doesn’t do any kind of “outlining” today (the opposite of inlining, where, rather than moving code from a called method into the caller, the compiler would extract code from a method out into a separate method). If you have a large method, the compiler will need to generate code for the whole method, and if that method is generic and multiple generic specializations are compiled, the whole method will be compiled and optimized for each. But, if you have any meaningful amount of code in that method that doesn’t actually depend on the generic types in question, you can avoid that duplication by refactoring the code into separate non-generic methods that are invoked by the generic one. That’s what dotnet/runtime#101474 does in some of the types in Microsoft.Extensions.Logging.Console, like SimpleConsoleFormatter and JsonConsoleFormatter. There’s a generic Write<TState> method, but the TState is only used in the very first line of the method, which formats the arguments into a string. After that, there’s a lot of logic about doing the actual writing, but all of it only needs the output of that formatting operation, not the input. So, this PR simply refactors that Write<TState> to do the formatting and then delegates the bulk of the work to a separate non-generic method.
- Cutting out unnecessary dependencies. There are many small but meaningful dependencies one doesn’t think about until they start focusing on generated code size and zooming in on exactly where all that code size is coming from. dotnet/runtime#95710 is a good example. The AppContext.OnProcessExit method is rooted (never trimmed) by the runtime because it’s invoked when the process is exiting. That OnProcessExit was accessing AppDomain.CurrentDomain, which returns an AppDomain. AppDomain’s ToString override depends on a bunch of stuff. And ToString on a type that’s not trimmed away is itself basically never trimmed, because if anything anywhere calls to the base object.ToString, the system needs to know that any possible derived type that might find its way to that call site will be invokable. That all means that all of that stuff used by AppDomain.ToString was never being trimmed. This small refactoring made it so that all that stuff would only need to be kept if AppDomain.CurrentDomain is ever actually accessed by user code. Another example of this comes in dotnet/runtime#101858, which removes a dependency on some of the Convert methods.
- Using a better tool for the job. Sometimes there’s just a better, simpler answer. dotnet/runtime#100916 highlights one such case. Some code in Microsoft.Extensions.DependencyInjection needed a MethodInfo for a particular method, and it was using System.Linq.Expressions to extract one, when it could instead just use a delegate. That’s not only cheaper in terms of allocation and overhead, it removes a dependency on the Expressions library.
- Compile time instead of run time. Source generators are a great boon for Native AOT, as they enable computing things at build time and baking the results into the assembly rather than computing those same things at run time (which, in the relevant situations, is typically done once and then cached). That’s useful for startup performance, as you’re not having to do that work just to get going. It’s useful for steady-state throughput, as you can often take more time to do a better job when the work is being done at build time. But it’s also useful for size, because it removes a dependency on anything that might have been used as part of the computation. And often that dependency is reflection, which brings with it a lot of size. As it turns out, System.Private.CoreLib has its own source generator that’s used when building CoreLib, and dotnet/runtime#102164 augmented that source generator to generate a dedicated implementation of Environment.Version and RuntimeInformation.FrameworkDescription. Previously, both of these methods would use reflection to look up attributes also in CoreLib, but that’s something the source generator can instead do at build time, baking the answer into the implementation of these methods.
- Avoiding duplication. It’s not uncommon to have two methods somewhere in your app that have the same implementations, especially for small helper methods, like property accessors. dotnet/runtime#101969 teaches the Native AOT tool chain to deduplicate those, such that the code is only stored once.
- Interfaces be gone. Previously, unused interface methods could be trimmed away (effectively removing them from the interface type and removing all implementations of that method), but the compiler wasn’t able to fully remove the actual interface types themselves. Now with dotnet/runtime#100000, it can.
- Unnecessary static constructors. The trimmer was keeping the static constructor of a type if any field was accessed. This is unnecessarily broad: those static constructors only need to be kept if a static field was accessed. dotnet/runtime#96656 improves that.
Previous releases saw a considerable amount of time spent on driving down binary sizes, but these kinds of changes chip away at them even further. Let’s create a new ASP.NET minimal APIs application using Native AOT. This command uses the webapiaot template and creates the new project in a new myapp directory:
Code:
dotnet new webapiaot -o myapp
Replace the contents of the generated myapp.csproj with this:
Code:
<Project Sdk="Microsoft.NET.Sdk.Web">
<PropertyGroup>
<TargetFrameworks>net9.0;net8.0</TargetFrameworks>
<Nullable>enable</Nullable>
<ImplicitUsings>enable</ImplicitUsings>
<InvariantGlobalization>true</InvariantGlobalization>
<PublishAot>true</PublishAot>
<OptimizationPreference>Size</OptimizationPreference>
<StackTraceSupport>false</StackTraceSupport>
</PropertyGroup>
</Project>
All I’ve done on top of the template’s defaults is have both net9.0 and net8.0 as target frameworks and then add a couple of settings (at the bottom) focused on driving down the size of Native AOT apps. The app is a simple site that exposes a /todos list as JSON. We can publish this app with Native AOT:
Code:
dotnet publish -f net8.0 -r linux-x64 -c Release
ls -hs bin/Release/net8.0/linux-x64/publish/myapp
which yields:
9.4M bin/Release/net8.0/linux-x64/publish/myapp
We can see here that the whole site (web server, garbage collector, everything) is contained in the single myapp binary, which on .NET 8 weighs in at 9.4 megabytes. Now, let’s do the same thing for .NET 9:
Code:
dotnet publish -f net9.0 -r linux-x64 -c Release
ls -hs bin/Release/net9.0/linux-x64/publish/myapp
which results in:
8.5M bin/Release/net9.0/linux-x64/publish/myapp
Now, just by moving to the new version, that same myapp has shrunk to 8.5 megabytes, an ~10% reduction in binary size.
Beyond a focus on size, ahead-of-time compilation also differs from just-in-time compilation in that each has its own opportunities for unique optimizations. The JIT can see the exact details of the current machine and employ the best possible instructions based on what’s available (e.g. using AVX512 instructions on hardware that supports it), and the JIT can use dynamic PGO to evolve the code based on execution characteristics. But Native AOT is capable of doing whole program optimization, where it can look at everything in the program and optimize based on the totality of everything involved (in contrast, a JIT’d .NET application may load additional .NET libraries at any point). For example, dotnet/runtime#92923 enables automatically making fields readonly based on looking at the whole program to see if anything could possibly write to the field from outside of the constructor; this can in turn help things like improving pre-initialization.
dotnet/runtime#99761 provides a nice example where, based on whole program analysis, the compiler can see that a particular type is never instantiated. If it’s never instantiated, then type checks for that type will never succeed. And thus if a program has a check like if (variable is SomethingNeverInstantiated), that can be turned into a constant false, and all of the code associated with that if block then eliminated. dotnet/runtime#102248 is similar, but for types; if code is doing if (someType == typeof(X)) and the compiler never had to construct a method table for X, it can similarly turn this into a constant result.
Whole program analysis is also applicable to devirtualization in really cool ways. With dotnet/runtime#92440, the compiler can now devirtualize all calls to a virtual method C.M if the compiler doesn’t see any instantiations of types that derive from C. And with dotnet/runtime#97812 and dotnet/runtime#97867, the compiler can now treat virtual methods as instead being non-virtual and sealed when there are no overrides of those methods anywhere in the program.
Native AOT also has a super power in its ability to do pre-initialization. The compiler contains an interpreter that’s able to evaluate code at build time and replace that code with just the result; for some objects, it’s then also able to blit the object’s data into the binary in a way that it can be cheaply rehydrated at execution time. The interpreter is limited in what it’s able and allowed to do, but over time its capabilities are improving. dotnet/runtime#92470 extends it to support more type checks, static interface method calls, constrained method calls, and various operations on spans, while dotnet/runtime#92666 expands it to have some support for hardware intrinsics and the various IsSupported methods. dotnet/runtime#92739 further rounds it out with support for stackalloc‘ing spans, IntPtr/nint math, and Unsafe.Add.
Threading
Since the beginning of .NET, general wisdom has been that the vast majority of code that needs to synchronize access to shared state should just use Monitor, either directly or, more likely, via the C# language syntax for it with lock(...). There are a plethora of other synchronization primitives available, at various levels of complexity and with varying purposes, but lock(...) is the workhorse and the thing that everyone should reach for by default.
Over 20 years after the introduction of .NET, that’s evolving, just a bit. lock(...) is still the go-to syntax, but in .NET 9, as of dotnet/runtime#87672 and dotnet/runtime#102222, there is now a dedicated System.Threading.Lock type. Anywhere you were previously allocating an object just to use that object with lock(...), you should consider using the new Lock type. You can absolutely still just use object, and you’ll still need to do so in certain situations, like if you’re using the “condition variable” aspects of Monitor (such as Pulse and Wait), and you’ll still want to in others (such as if you’re trying to reduce managed allocation and you have another existing object that can serve double-duty as the monitor). But locking a Lock can be a more efficient answer. It can also help to be self-documenting, making the code cleaner and more maintainable. As is evident from this benchmark, the syntax for using both can be identical.
Code:
// dotnet run -c Release -f net9.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private readonly object _monitor = new();
private readonly Lock _lock = new();
private int _value;
[Benchmark]
public void WithMonitor()
{
lock (_monitor)
{
_value++;
}
}
[Benchmark]
public void WithLock()
{
lock (_lock)
{
_value++;
}
}
}
Lock, however, will generally be a tad cheaper (and in the future, as most locking shifts to use the new type, we may be able to make most objects lighter weight by not optimizing for direct locking on arbitrary objects):

Method | Mean |
---|---|
WithMonitor | 14.30 ns |
WithLock | 13.86 ns |

Note that C# 13 has special recognition of System.Threading.Lock. If you look at the code that’s generated for WithMonitor above, it’s equivalent to this:
Code:
public void WithMonitor()
{
object monitor = _monitor;
bool lockTaken = false;
try
{
Monitor.Enter(monitor, ref lockTaken);
_value++;
}
finally
{
if (lockTaken)
{
Monitor.Exit(monitor);
}
}
}
but even though the syntax is identical, here’s an equivalent of what’s generated for WithLock:
Code:
Lock.Scope scope = _lock.EnterScope();
try
{
_value++;
}
finally
{
scope.Dispose();
}
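C# lowers lock on a Lock into that scope pattern, but you can also write it by hand via EnterScope, e.g. when a using scope is more convenient than a lock block. A minimal sketch, assuming .NET 9’s System.Threading.Lock (the Counter type and its members are illustrative, not from the post):

```csharp
using System.Threading;

// Illustrative type: a counter guarded by the new Lock.
public sealed class Counter
{
    private readonly Lock _lock = new();
    private int _value;

    public int Increment()
    {
        // EnterScope returns a ref struct whose Dispose exits the lock;
        // this is equivalent to what the compiler emits for lock (_lock).
        using (_lock.EnterScope())
        {
            return ++_value;
        }
    }
}
```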
We’ve also started using Lock internally. dotnet/runtime#103085 and dotnet/runtime#103104 used it instead of object locks in Timer, ThreadLocal, and RegisteredWaitHandle. In time, I expect to see more and more use switched over.
Of course, while locks are the default recommendation for synchronization, there’s still a lot of code that demands the higher throughput and scalability that comes from more lock-free programming, and the workhorse for such implementations is Interlocked. In .NET 9, Interlocked.Exchange and Interlocked.CompareExchange gain some very welcome capabilities. First, dotnet/runtime#92974 from @MichalPetryka, dotnet/runtime#97588 from @filipnavara, and dotnet/runtime#106660 grant Interlocked some new powers: the ability to operate over types smaller than int. They introduce new overloads of Exchange and CompareExchange that can work on byte, sbyte, ushort, and short. These overloads are public and available for anyone to call, but they’re also then consumed by dotnet/runtime#97528 from @MichalPetryka to improve Parallel.ForAsync<T>. ForAsync is given a range of T to be processed, and schedules multiple workers that all need to repeatedly get the next item from the range, until the range is exhausted. For arbitrary types, that means ForAsync needs to lock to protect the increment while iterating through the range. But for types where an Interlocked operation is available, we can use that with low-lock techniques to avoid the lock entirely (both the need to access it and the need to allocate it).
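The claim-the-next-item idea can be sketched for an int range; this is an illustrative reduction (names are hypothetical), not ForAsync’s actual implementation:

```csharp
using System.Threading;

public static class RangeWorker
{
    // Each worker calls this to atomically claim the next index in
    // [next, toExclusive); returns false once the range is exhausted.
    public static bool TryClaimNext(ref int next, int toExclusive, out int claimed)
    {
        int observed = Volatile.Read(ref next);
        while (observed < toExclusive)
        {
            // If no other worker advanced 'next' since we read it, we own 'observed'.
            int original = Interlocked.CompareExchange(ref next, observed + 1, observed);
            if (original == observed)
            {
                claimed = observed;
                return true;
            }
            observed = original; // another worker won the race; retry from its value
        }

        claimed = 0;
        return false;
    }
}
```

Each worker just loops on TryClaimNext, so no lock is ever taken or allocated.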
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
[Benchmark]
public async Task ParallelForAsync()
{
await Parallel.ForAsync('\0', '\uFFFF', async (c, _) =>
{
});
}
}
Method | Runtime | Mean | Ratio |
---|---|---|---|
ParallelForAsync | .NET 8.0 | 42.807 ms | 1.00 |
ParallelForAsync | .NET 9.0 | 7.184 ms | 0.17 |

Even with those new overloads, though, there are still places it’s desirable to use Interlocked.Exchange or Interlocked.CompareExchange where they can’t be used easily. Consider the aforementioned Parallel.ForAsync. It’d be really nice if we could just call Interlocked.CompareExchange<T>, but CompareExchange<T> only works with reference types. So we’re instead left with unsafe code:
Code:
static unsafe bool CompareExchange<T>(ref T location, T value, T comparand) where T : unmanaged =>
sizeof(T) == sizeof(byte) ? Interlocked.CompareExchange(ref Unsafe.As<T, byte>(ref location), Unsafe.As<T, byte>(ref value), Unsafe.As<T, byte>(ref comparand)) == Unsafe.As<T, byte>(ref comparand) :
sizeof(T) == sizeof(ushort) ? Interlocked.CompareExchange(ref Unsafe.As<T, ushort>(ref location), Unsafe.As<T, ushort>(ref value), Unsafe.As<T, ushort>(ref comparand)) == Unsafe.As<T, ushort>(ref comparand) :
sizeof(T) == sizeof(uint) ? Interlocked.CompareExchange(ref Unsafe.As<T, uint>(ref location), Unsafe.As<T, uint>(ref value), Unsafe.As<T, uint>(ref comparand)) == Unsafe.As<T, uint>(ref comparand) :
sizeof(T) == sizeof(ulong) ? Interlocked.CompareExchange(ref Unsafe.As<T, ulong>(ref location), Unsafe.As<T, ulong>(ref value), Unsafe.As<T, ulong>(ref comparand)) == Unsafe.As<T, ulong>(ref comparand) :
throw new UnreachableException();
Another place it’d be really nice to use Interlocked.Exchange and Interlocked.CompareExchange is with enums. It’s reasonably common to use these APIs to transition between states in some algorithm, and often the ideal is for those states to be represented as an enum. However, there were no overloads of {Compare}Exchange that worked with enums, so developers have been forced to use integers instead, often with comments stating something along the lines of “This should be an enum, but enums can’t work with CompareExchange.” Or, at least, they couldn’t, until .NET 9.
Now in .NET 9, as of dotnet/runtime#104558, the generic Exchange and CompareExchange have had their class constraint removed. This means use of Exchange<T> and CompareExchange<T> will compile for any T. Then at runtime, the T is checked to ensure it’s a reference type, a primitive type, or an enum type; anything else, and it’ll throw. When it is one of those, it delegates to the correspondingly-sized overload. For example, this now compiles and runs successfully:
Code:
static DayOfWeek UpdateIfEqual(ref DayOfWeek location, DayOfWeek newValue, DayOfWeek expectedValue) =>
Interlocked.CompareExchange(ref location, newValue, expectedValue);
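For contrast, here’s a sketch of the pre-.NET 9 workaround the text describes, with an int standing in for the enum and casts at the boundaries (the type and member names are illustrative):

```csharp
using System.Threading;

public static class LegacyStateMachine
{
    // "This should be an enum, but enums can't work with CompareExchange":
    // the state is stored as an int and cast to/from DayOfWeek at the edges.
    private static int s_day = (int)DayOfWeek.Monday;

    public static DayOfWeek Current => (DayOfWeek)Volatile.Read(ref s_day);

    // Returns the value the field held before the attempted transition.
    public static DayOfWeek UpdateIfEqual(DayOfWeek newValue, DayOfWeek expectedValue) =>
        (DayOfWeek)Interlocked.CompareExchange(ref s_day, (int)newValue, (int)expectedValue);
}
```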
This is not only good for usability, it’s good for performance in a few ways. First, it enables performance improvements like the Parallel.ForAsync one described without needing to resort to Unsafe tricks. But second, it enables smaller objects. The previously listed change not only updated CompareExchange to remove the constraint but also then employed the overload in dozens of places. In Http3Connection, for example, the object previously had these three fields which were updated with Interlocked.Exchange:
Code:
private int _haveServerControlStream;
private int _haveServerQpackDecodeStream;
private int _haveServerQpackEncodeStream;
but these are really just bools masquerading as ints, exactly because they needed to be updated atomically. Now with Interlocked.Exchange<T> and Interlocked.CompareExchange<T> supporting bool, these have been updated to just be:
Code:
private bool _haveServerControlStream;
private bool _haveServerQpackDecodeStream;
private bool _haveServerQpackEncodeStream;
Any additional padding aside, that reduces 12 bytes down to 3 bytes on the object.
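To show how such flag fields are typically consumed, here’s a hedged sketch of the run-once pattern (hypothetical names; the int-backed flag shown compiles on any .NET version, while on .NET 9 the field could now simply be a bool):

```csharp
using System.Threading;

public sealed class ConnectionState
{
    // 0 = work not yet done, 1 = done; an int masquerading as a bool so it
    // can be updated atomically with Interlocked on any .NET version.
    private int _haveControlStream;

    public int TimesOpened; // counts how many times the one-time work ran

    public void EnsureControlStream()
    {
        // Exchange returns the previous value, so only the first caller
        // observes 0 and performs the work; all later callers see 1.
        if (Interlocked.Exchange(ref _haveControlStream, 1) == 0)
        {
            TimesOpened++;
        }
    }
}
```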
Also related to Interlocked, dotnet/runtime#96258 intrinsifies the Interlocked.And and Interlocked.Or methods for additional platforms; previously they were specially handled on Arm, but now they’re also specially handled on x86/64. As an example, the implementation of the And method is a fairly typical CompareExchange loop:
Code:
public static int And(ref int location1, int value)
{
int current = location1;
while (true)
{
int newValue = current & value;
int oldValue = CompareExchange(ref location1, newValue, current);
if (oldValue == current)
{
return oldValue;
}
current = oldValue;
}
}
You’ll see a very similar loop any time you want to use optimistic concurrency to create a new value and substitute it for the original in an atomic manner. The actual & operation is just one line here, and to highlight that this is broadly applicable, you could create a generalized version of this method for any operation using a delegate, like this:
Code:
public static int CompareExchange(ref int location1, int value, Func<int, int, int> update)
{
int current = location1;
while (true)
{
int newValue = update(current, value);
int oldValue = Interlocked.CompareExchange(ref location1, newValue, current);
if (oldValue == current)
{
return oldValue;
}
current = oldValue;
}
}
such that And could then be implemented like:
Code:
public static int And(ref int location1, int value) =>
CompareExchange(ref location1, value, static (current, value) => current & value);
The approach employed by And is reasonable when there’s nothing better you can do, but as it turns out, modern hardware platforms have single instructions capable of performing such an interlocked “and” or “or” in a much more efficient manner. The JIT already handled this for Arm because the instructions on Arm have semantics that very closely align with the semantics of Interlocked.And and Interlocked.Or. On x86/64, however, the relevant instruction sequence (lock and or lock or) doesn’t enable accessing the original value atomically replaced, whereas And/Or require that as part of their definition. Luckily, most uses of Interlocked.And/Interlocked.Or don’t actually need the return value. For example, SafeHandle.SetHandleAsInvalid simply wants to atomically OR an additional flag into some bit flags, ignoring the result of Or:
Code:
public void SetHandleAsInvalid()
{
Interlocked.Or(ref _state, StateBits.Closed);
GC.SuppressFinalize(this);
}
And luckily, the JIT can see that it’s ignoring the result. As such, on x86/64, the JIT can use the optimal sequence when it can see that the result isn’t being used, and even if it is being used, it can still emit a slightly more concise instruction sequence than would have naturally resulted from our open-coded implementation:
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private int _location;
[Benchmark] public void Test_ResultNotUsed() => Interlocked.And(ref _location, 42);
[Benchmark] public int Test_ResultUsed() => Interlocked.And(ref _location, 42);
}
Code:
// .NET 8
; Tests.Test_ResultNotUsed()
push rbp
sub rsp,10
lea rbp,[rsp+10]
add rdi,8
mov eax,[rdi]
M00_L00:
mov ecx,eax
and ecx,2A
mov [rbp-4],eax
lock cmpxchg [rdi],ecx
mov ecx,[rbp-4]
cmp eax,ecx
je short M00_L01
mov ecx,eax
mov eax,ecx
jmp short M00_L00
M00_L01:
add rsp,10
pop rbp
ret
; Total bytes of code 47
; Tests.Test_ResultUsed()
push rbp
sub rsp,10
lea rbp,[rsp+10]
add rdi,8
mov eax,[rdi]
M00_L00:
mov ecx,eax
and ecx,2A
mov [rbp-4],eax
lock cmpxchg [rdi],ecx
mov ecx,[rbp-4]
cmp eax,ecx
je short M00_L01
mov ecx,eax
mov eax,ecx
jmp short M00_L00
M00_L01:
add rsp,10
pop rbp
ret
; Total bytes of code 47
// .NET 9
; Tests.Test_ResultNotUsed()
add rdi,8
mov eax,2A
lock and [rdi],eax
ret
; Total bytes of code 13
; Tests.Test_ResultUsed()
add rdi,8
mov ecx,2A
mov eax,[rdi]
M00_L00:
mov edx,eax
and edx,ecx
lock cmpxchg [rdi],edx
jne short M00_L00
ret
; Total bytes of code 22
Method | Runtime | Mean | Ratio | Code Size |
---|---|---|---|---|
Test_ResultNotUsed | .NET 8.0 | 6.630 ns | 1.00 | 47 B |
Test_ResultNotUsed | .NET 9.0 | 3.132 ns | 0.47 | 13 B |
Test_ResultUsed | .NET 8.0 | 6.853 ns | 1.00 | 47 B |
Test_ResultUsed | .NET 9.0 | 6.435 ns | 0.94 | 22 B |

Locks and interlocked operations are about coordinating between operations, at a relatively low level. There are higher level coordination constructs as well; that’s effectively what Task is, providing a representation for a piece of work with which you can later join. Such joining can be accomplished with await along with a myriad of APIs that facilitate joining with tasks in various ways. In that regard, one of my favorite new APIs in .NET 9 is on Task: Task.WhenEach. I like it because it utilizes newer language features to cleanly solve a problem that we wanted to solve over a decade ago when Task was originally introduced, and the lack of it has led to folks writing code with known pits of failure.
Task.WhenAll is fairly easy to understand: you give it a collection of tasks, and the task it returns will complete only when all of the constituent tasks have completed:
Code:
await Task.WhenAll([t1, t2, t3]);
... // only get here when t1, t2, and t3 have all completed successfully
Task.WhenAny is a bit more complex, in that it returns when any of the constituent tasks has completed, and it gives you back a reference to that task:
Code:
Task tCompleted = await Task.WhenAny([t1, t2, t3]);
... // tCompleted is either t1, t2, or t3, and will be completed here
and you can then explicitly join with that returned task to observe any exceptions it may have incurred or consume its result value if it has one. But what do you then do to join with the remaining two tasks? You might end up writing code something like this:
Code:
List<Task> tasks = new() { t1, t2, t3 };
while (tasks.Count > 0)
{
Task completed = await Task.WhenAny(tasks);
Handle(completed);
tasks.Remove(completed);
}
That’s not terribly hard, but it’s also not terribly efficient. Or, rather, for larger numbers of tasks, it’s terribly inefficient, as it’s an O(N^2) algorithm. Some of the complexity is likely obvious: you’ve got a loop, and inside that loop you’ve got a List<T>.Remove call, which will do an O(N) walk of the list looking for the target element to remove: there’s the O(N^2). But there’s actually another less obvious O(N) operation in the loop: the WhenAny itself. Every call to WhenAny needs to hook a continuation up to each of the constituent Task objects. (There are of course cheaper ways to implement this functionality than using such a WhenAny, but they’re all more complicated and thus not the answers towards which folks have gravitated.)
Enter Task.WhenEach. WhenEach‘s purpose is to make consuming tasks as they complete both simple and efficient. To do so, rather than returning a Task<Task> as does WhenAny, it returns an IAsyncEnumerable<Task>, so one can simply iterate through the completing tasks as they complete.
Code:
await foreach (Task completed in Task.WhenEach([t1, t2, t3]))
{
Debug.Assert(completed.IsCompleted);
Handle(completed);
}
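WhenEach composes with Task<TResult> as well; here’s a hedged sketch (requires the .NET 9 API; the delays and values are illustrative) that consumes results in completion order:

```csharp
// Requires .NET 9's Task.WhenEach. Each yielded task is already completed,
// so awaiting it just retrieves the result (or rethrows its exception).
static async Task<int> SumAsCompleted()
{
    Task<int>[] tasks =
    [
        Task.Delay(30).ContinueWith(_ => 1),
        Task.Delay(10).ContinueWith(_ => 2),
        Task.Delay(20).ContinueWith(_ => 3),
    ];

    int sum = 0;
    await foreach (Task<int> completed in Task.WhenEach(tasks))
    {
        sum += await completed;
    }
    return sum;
}
```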
It’s a little hard to get a good apples-to-apples comparison of the overhead here, but this benchmark is a reasonable approximation:
Code:
// dotnet run -c Release -f net9.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
[Params(10, 1_000)]
public int Count { get; set; }
[Benchmark]
public async Task WithWhenAny()
{
var tcs = Enumerable.Range(0, Count).Select(_ => new TaskCompletionSource()).ToList();
List<Task> tasks = tcs.Select(t => t.Task).ToList();
tcs[^1].SetResult();
while (tasks.Count > 0)
{
Task completed = await Task.WhenAny(tasks);
tasks.Remove(completed);
tcs.RemoveAt(tcs.Count - 1);
if (tasks.Count == 0) break;
tcs[^1].SetResult();
}
}
[Benchmark]
public async Task WithWhenEach()
{
var tcs = Enumerable.Range(0, Count).Select(_ => new TaskCompletionSource()).ToList();
int remaining = tcs.Count - 1;
tcs[remaining].SetResult();
await foreach (Task completed in Task.WhenEach(tcs.Select(t => t.Task)))
{
if (remaining == 0) break;
tcs[--remaining].SetResult();
}
}
}
Method | Count | Mean | Allocated |
---|---|---|---|
WithWhenAny | 10 | 3.232 us | 3.47 KB |
WithWhenEach | 10 | 1.223 us | 1.43 KB |
WithWhenAny | 1000 | 20,082.683 us | 4207.12 KB |
WithWhenEach | 1000 | 102.759 us | 94.24 KB |

WhenAll also gets a bit cheaper, in a couple of ways. dotnet/runtime#93953 utilizes a trick employed elsewhere in Task in .NET 8, which is to use its unused m_stateObject field (unused because there’s no way to set it with WhenAll) to store some of the state that previously had a dedicated field (a field for storing information about constituent tasks that failed or were canceled). That means the Task object WhenAll returns gets 8 bytes smaller (on 64-bit). On top of that, dotnet/runtime#101308 adds new ReadOnlySpan<T>-based overloads to a bunch of methods, including Task.WhenAll. This enables passing in any number of tasks without needing to allocate.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
[Benchmark]
public async Task WhenAll()
{
var atmb1 = new AsyncTaskMethodBuilder();
var atmb2 = new AsyncTaskMethodBuilder();
Task whenAll = Task.WhenAll([atmb1.Task, atmb2.Task]);
atmb1.SetResult();
atmb2.SetResult();
await whenAll;
}
}
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
---|---|---|---|---|---|
WhenAll | .NET 8.0 | 123.8 ns | 1.00 | 264 B | 1.00 |
WhenAll | .NET 9.0 | 103.8 ns | 0.86 | 216 B | 0.82 |

There are some other interesting performance improvements in threading in .NET 9 as well.
- Debugger.NotifyOfCrossThreadDependency. This is a big deal. When you’re debugging a .NET process and you break in the debugger, it pauses all threads in the debuggee process so that nothing is making forward progress while you examine state. However, .NET debuggers, like the one in Visual Studio, support invoking properties and methods in the debuggee while debugging. That can be a big problem if the functionality being invoked relies on one of those paused threads to do something, e.g. if the property you access tries to take a lock that’s held by another thread or tries to Wait on a Task. To mitigate problems here, the Debugger.NotifyOfCrossThreadDependency method exists. Functionality that relies on another thread to do something can call NotifyOfCrossThreadDependency; if there’s no debugger attached, it’s a nop, but if there is a debugger attached, this signals the problem to the debugger, which can then react accordingly. The Visual Studio debugger reacts by stopping the evaluation but then by offering an opt-in option of “slipping” all threads, unpausing all threads until the evaluated operation completes, at which point all threads will be paused again, thereby again trying to mitigate any problems that might occur from the cross-thread dependency. NotifyOfCrossThreadDependency is generally not used by application code, but it’s used in a few critical choke points in the core libraries, in particular throughout System.Threading and the infrastructure for async/await. That means, for example, that this method is being called any time you await a Task that’s not yet completed. And, unfortunately, while the method is a cheap nop when the debugger isn’t attached, historically it’s been fairly expensive when the debugger is attached, to the point where it can meaningfully impact a developer’s experience in the tool. Thankfully, .NET 9 addresses this with dotnet/runtime#101864, which significantly improves the performance of NotifyOfCrossThreadDependency when a debugger is attached. We can see this with a low-tech benchmark. Replace the contents of your Program.cs with this:
Code:
using System.Diagnostics;
const int Iters = 100_000;
Stopwatch sw = new();
while (true)
{
sw.Restart();
for (int i = 0; i < Iters; i++)
{
Debugger.NotifyOfCrossThreadDependency();
}
sw.Stop();
Console.WriteLine($"{sw.Elapsed.TotalMicroseconds / Iters:N3}us");
}
open the project in Visual Studio, ensure that .NET 8 is selected as the target framework and that you’re targeting Release, and run with the debugger attached (F5, not ctrl-F5). When I do that, I see numbers like this (on Windows):
Code:
48.360us
45.281us
46.714us
46.945us
46.525us
Then change the target framework to be .NET 9, and run with the debugger attached again. I then see numbers like this:
Code:
1.973us
1.713us
1.714us
1.871us
1.963us
While such an improvement shouldn’t impact your production workloads, it can make an impactful difference to your productivity as a developer.
- Volatile. A “memory model” is a description of how threads interact with memory and what guarantees are made about how different threads produce and consume changes in shared memory. Reads and writes to memory from a single thread are guaranteed to be observed by that thread in the order they occurred, but once multiple threads enter the picture, it’s up to the memory model to define what behaviors can be relied on and which can’t. For example, if there are two fields, _a and _b, both of which start as 0, and if one thread does:
Code:
_a = 1;
_b = 2;
and then another does:
Code:
while (_b != 2);
Assert(_a == 1);
is that assert guaranteed to always pass? It depends on the memory model, and whether the writes from thread 1 might get reordered (by any of the involved compilers or even by the hardware) such that the write to _b became visible to thread 2 before the write to _a. For the longest time, the only official memory model for .NET was the one defined by the ECMA 335 specification, but real implementations, including coreclr, generally had stronger guarantees than what ECMA detailed. Thankfully, the official .NET memory model has now been documented. However, some of the practices that were being employed in the core libraries (due to defensive coding or uncertainty of the memory model or out-of-date requirements) are no longer necessary. One of the main tools available for folks coding at a level where the memory model is relevant is the volatile keyword / the Volatile class. Marking a field as volatile causes any reads or writes of that field to be considered “volatile,” just as does using Volatile.Read/Volatile.Write to perform that read or write. Making the read or write volatile means it prevents certain kinds of “movement,” e.g. if both _a and _b in the previous example were marked as volatile, the assert would always pass. Marking fields or operations as volatile can come with an expense, depending on the circumstance and the target platform. For example, it can restrict the C# compiler and the JIT compiler from performing certain optimizations. Let’s take a simple example. This code:
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private volatile int _volatile;
private int _nonVolatile;
[Benchmark] public int UsingVolatile() => _volatile + _volatile;
[Benchmark] public int UsingNonVolatile() => _nonVolatile + _nonVolatile;
}
results in this assembly on .NET 9:
Code:
; Tests.UsingVolatile()
mov eax,[rdi+8]
add eax,[rdi+8]
ret
; Total bytes of code 7
; Tests.UsingNonVolatile()
mov eax,[rdi+0C]
add eax,eax
ret
; Total bytes of code 6
The important difference between the two assembly blocks is in the add instruction. In the UsingVolatile method, the first instruction is loading the value from memory stored at address rdi+8, and then re-reading that same rdi+8 memory location again to add whatever is there with what it just read. In UsingNonVolatile, it starts the same way, reading the value stored at rdi+0xc, but then the add isn’t doing another memory read and is instead just doubling the value stored in the register. One of the effects of volatile requiring that reads can’t be moved is also that they can’t be removed, which means both reads in the code are required to stay. Here’s another example:
Code:
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private volatile bool _volatile;
private bool _nonVolatile;
[Benchmark]
public int CountVolatile()
{
int count = 0;
while (_volatile) count++;
return count;
}
[Benchmark]
public int CountNonVolatile()
{
int count = 0;
while (_nonVolatile) count++;
return count;
}
}
which on .NET 9 produces this assembly:
Code:
; Tests.CountVolatile()
push rbp
mov rbp,rsp
xor eax,eax
cmp byte ptr [rdi+8],0
jne short M00_L01
M00_L00:
pop rbp
ret
M00_L01:
inc eax
cmp byte ptr [rdi+8],0
jne short M00_L01
jmp short M00_L00
; Total bytes of code 24
; Tests.CountNonVolatile()
push rbp
mov rbp,rsp
xor eax,eax
cmp byte ptr [rdi+9],0
jne short M00_L00
pop rbp
ret
M00_L00:
jmp short M00_L00
; Total bytes of code 16
They look somewhat similar; in fact, the first five instructions are almost identical. But there's a critical difference. In both cases, the `bool` value is being loaded and checked to see if it's `false` (the `cmp` against `0` followed by a conditional jump), in which case both implementations fall through to the ending `ret` to exit the method. The compiler is rewriting the `while (cond) { ... }` loop to instead be more like `if (cond) { do { ... } while (cond); }`, so this initial test is the one for that `if (cond)`. But then things diverge meaningfully. `CountVolatile` proceeds to do the `do while` equivalent: incrementing the `count` (stored in `eax`), reading `_volatile` and comparing it to `0` (`false`), and, if it's still `true`, jumping back up to loop again. So, basically what you'd expect. But now look at `CountNonVolatile`. The loop is just this:
Code:
M00_L00:
       jmp       short M00_L00
It's now sitting in an infinite loop, with an unconditional jump back to the same `jmp` instruction, looping forever. That's because the JIT was able to hoist the read of `_nonVolatile` out of the loop. It then also sees that no one will ever observe `count`'s changed value, so it elides the increment as well. At that point, it's as if I'd written this C#:
Code:
public int CountNonVolatile()
{
    int count = 0;
    if (_nonVolatile)
    {
        while (true);
    }
    return count;
}
That hoisting can't be done when the field is `volatile`, because the JIT can't reorder or remove reads associated with the field. But with `_nonVolatile`, nothing prevents it. On multiple occasions I've seen folks trying to engage in low-lock programming experience the ramifications of this latter example: they'll be using some `bool` to signal to a consumer that it should break out of a loop, but the `bool` isn't volatile, and the consumer then never notices when the producer eventually sets it.
Those are examples of the ramifications of `volatile` in terms of what the C# or JIT compiler is constrained from doing. But there are also things the JIT needs to do (rather than avoid) in order to ensure the hardware respects the requirements put in place by the developer. On some hardware, like x64, the memory model of the hardware is relatively "strong," meaning it doesn't perform most of the kinds of reorderings that `volatile` inhibits, and therefore you won't see anything emitted into the assembly code by the JIT to help the hardware enforce the constraints. On other hardware, like Arm64, the hardware has a relatively "weaker" model, meaning it allows more of these kinds of reorderings, and as a result the JIT needs to actively inhibit such reorderings by inserting appropriate "memory barriers" into the code. On Arm, this shows up as instructions like `dmb` ("data memory barrier"). Such barriers have some overhead associated with them.
For all of these reasons, fewer `volatile`s is good for performance, but of course you need enough `volatile`s to actually achieve a correct application (with the best answer being to avoid writing low-lock code in the first place, and then you never need to know or think about `volatile`). It's a balance. Luckily, and bringing us full circle to why we're talking about this, there is a set of common cases where `volatile` used to be recommended but, now that we have a well-defined memory model, those uses are obsolete. Removing them can help to avoid a thin layer of cost across the code. So dotnet/runtime#100969 and dotnet/runtime#101346 removed a bunch of `volatile` usage where it was no longer necessary. Almost all of these uses were part of lazy initialization of reference types, e.g.
Code:
private volatile MyReferenceType? _instance;

public MyReferenceType Instance => _instance ??= new MyReferenceType();
which, if we expand it out to not use `??=`, looks something like this:
Code:
private MyReferenceType? _instance;

public MyReferenceType Instance
{
    get
    {
        MyReferenceType? instance = _instance;
        if (instance is null)
        {
            _instance = instance = new MyReferenceType();
        }
        return instance;
    }
}
The reason for the `volatile` here would be two-fold: one for the part of the operation that reads, and one for the part that writes. Without the `volatile`, the concern would be that one of the compilers or the hardware could "introduce a read," effectively making the code equivalent to this:
Code:
private MyReferenceType? _instance;

public MyReferenceType Instance
{
    get
    {
        MyReferenceType? instance = _instance;
        if (_instance is null)
        {
            _instance = instance = new MyReferenceType();
        }
        return instance;
    }
}
If that were to happen, there's a problem: between the two reads, the value of `_instance` could go from `null` to non-`null`, in which case `instance` could be assigned `null`, `_instance is null` might then be false, and `return instance` would return `null`. However, the .NET memory model explicitly states "Reads cannot be introduced." Then there's the concern about the write. The concern there that leads to `volatile` being used is the initialization that happens inside of `MyReferenceType`. Imagine if `MyReferenceType` were defined like this:
Code:
class MyReferenceType
{
    internal int _value;
    public MyReferenceType() => _value = 42;
}
The question then becomes: is it possible for the write to `_value` inside the constructor to be observed by another thread after the write of the instance to `_instance`? In other words, could the code logically become the equivalent of this:
Code:
private MyReferenceType? _instance;

public MyReferenceType Instance
{
    get
    {
        MyReferenceType? instance = _instance;
        if (_instance is null)
        {
            _instance = instance = (MyReferenceType)RuntimeHelpers.GetUninitializedObject(typeof(MyReferenceType));
            instance._value = 42;
        }
        return instance;
    }
}
If that could happen, then two threads could race to access `Instance`: one of them could get as far as setting `_instance` (but `_value` hasn't been set yet), then another thread could access `Instance`, see `_instance` as non-`null`, and start using the instance even though `_value` hasn't yet been initialized. Thankfully here as well, the .NET memory model doesn't allow such transformations, explicitly covering this point:
“Object assignment to a location potentially accessible by other threads is a release with respect to accesses to the instance’s fields/elements and metadata. An optimizing compiler must preserve the order of object assignment and data-dependent memory accesses. The motivation is to ensure that storing an object reference to shared memory acts as a “committing point” to all modifications that are reachable through the instance reference.”
Phew!
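As an aside (my own sketch, not from the post): if you'd rather not reason about these publication subtleties at all, helpers like `LazyInitializer.EnsureInitialized` encapsulate the lazy-initialization pattern with the correct semantics:

```csharp
using System;
using System.Threading;

public class MyReferenceType
{
    internal int _value;
    public MyReferenceType() => _value = 42;
}

public class Holder
{
    private MyReferenceType? _instance;

    // EnsureInitialized allows racing threads to each construct an instance,
    // but publishes exactly one winner that all callers then observe fully
    // constructed, with no volatile field required in this class.
    public MyReferenceType Instance =>
        LazyInitializer.EnsureInitialized(ref _instance);
}

class Program
{
    static void Main()
    {
        var holder = new Holder();
        Console.WriteLine(holder.Instance._value);
        Console.WriteLine(ReferenceEquals(holder.Instance, holder.Instance));
    }
}
```

`Lazy<T>` is the other standard option when you additionally want at-most-once construction rather than racy-but-safe publication.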
- ManagedThreadId. dotnet/runtime#91232 is fun, in a "why didn't we already do this" sort of way. `Thread.ManagedThreadId` was implemented as an internal call (an FCALL) into the runtime, resulting in a call to `ThreadNative::GetManagedThreadId`, which in turn read the thread object's `m_ManagedThreadId` field. At least, that's the field in the native (C++) definition of the object. The managed `Thread` object has corresponding fields at the same locations that are available for the C# code to use, in this case `_managedThreadId`. So what did this PR do? It removed those complicated gymnastics and just made the whole implementation be `public int ManagedThreadId => _managedThreadId`. (It's worth noting, though, that `Thread.CurrentThread.ManagedThreadId` was already recognized specially by the JIT, so this is only relevant when accessing `ManagedThreadId` on some other `Thread` instance.) The main benefit of this is avoiding the extra function call, as the FCALL can't be inlined.
Code:
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private Thread _thread = Thread.CurrentThread;

    [Benchmark]
    public int GetID() => _thread.ManagedThreadId;
}
Code:
// .NET 8
; Tests.GetID()
       mov       rdi,[rdi+8]
       cmp       [rdi],edi
       jmp       near ptr System.Threading.Thread.get_ManagedThreadId()
; Total bytes of code 11
; Extern method: System.Threading.Thread.get_ManagedThreadId()

// .NET 9
; Tests.GetID()
       mov       rax,[rdi+8]
       mov       eax,[rax+34]
       ret
; Total bytes of code 8
- Ports to NativeAOT. Previous releases of .NET enabled inlining the fast path of thread-local state (TLS) access on coreclr. With dotnet/runtime#104282, dotnet/runtime#89472, and dotnet/runtime#97910, this improvement comes to NativeAOT as well. Similarly, dotnet/runtime#103675 ports coreclr's "yield normalization" implementation to NativeAOT; this is in support of enabling the runtime to measure the cost of various `pause` instructions, which can then be used as part of tuning spinning and spin waiting.
- Startup time. Performance improvements related to threading are generally about steady-state throughput, e.g. reducing synchronization costs while processing requests. That's what makes dotnet/runtime#106724 from @harisokanovic so interesting: it's instead about reducing the startup overhead of a process using .NET on Linux. The GC uses the equivalent of a process-wide memory barrier (also exposed publicly as `Interlocked.MemoryBarrierProcessorWide`) to ensure that all threads involved in a collection see a consistent state. On Linux, implementing this method efficiently involves using the `membarrier` system call with `MEMBARRIER_CMD_PRIVATE_EXPEDITED`, and using that requires the same syscall to have been made earlier with `MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED`, which really means doing so at startup. However, the Linux kernel has some optimizations that make this registration a lot cheaper when there's only one thread in the process, and the way it was previously used in .NET guaranteed there would be multiple. This PR changed where this initialization was performed in order to maximize the chance of there being only a single thread in the process, which in turn makes startup faster. The improvement was upwards of 10ms on various systems on which it was measured, which is a large percentage of a .NET process's startup overhead on Linux.
- Volatile. A "memory model" is a description of how threads interact with memory and what guarantees are made about how different threads produce and consume changes in shared memory. Reads and writes of memory from a single thread are guaranteed to be observed by that thread in the order they occurred, but once multiple threads enter the picture, it's up to the memory model to define which behaviors can be relied on and which can't.
Reflection
Reflection is a very powerful (though sometimes overused) capability of .NET that enables code to load and introspect .NET assemblies and invoke their functionality. It is used in all manner of library and application, including by the core .NET libraries themselves, and it’s important that we continue to find ways to reduce the overheads associated with reflection.
Several PRs in .NET 9 whittle away at some of the allocation overheads in reflection. dotnet/runtime#92310 and dotnet/runtime#93115 avoid some defensive array copies by instead handing around `ReadOnlySpan<T>` instances, while dotnet/runtime#95952 removed a use of `string.Split` that turned out to only be used with constants and thus could be replaced by manually splitting those constants. But a more interesting and impactful addition comes from dotnet/runtime#97683, which added an allocation-free way to get the invocation list of a delegate. Delegates in .NET are "multicast," meaning a single delegate instance might actually represent multiple methods to be invoked; this is how .NET events are implemented. If I invoke a delegate, the delegate implementation handles invoking each constituent method, sequentially, in turn. But what if I want to customize the invocation logic? Maybe I want to wrap each individual method in a try/catch, or maybe I want to track the return values from all of the methods rather than just the last, or some such behavior. To achieve that, delegates expose a way to get an array of delegates, one for each method that's part of the original. So, if I have:
Code:
Action action = () => Console.Write("A ");
action += () => Console.Write("B ");
action += () => Console.Write("C ");
action();
that'll print out `"A B C "`, and if I have:
Code:
Action action = () => Console.Write("A ");
action += () => Console.Write("B ");
action += () => Console.Write("C ");
Delegate[] actions = action.GetInvocationList();
for (int i = 0; i < actions.Length; ++i)
{
Console.Write($"{i}: ");
((Action)actions[i])();
Console.WriteLine();
}
that’ll print out:
Code:
0: A
1: B
2: C
However, that `GetInvocationList` needs to allocate. Now in .NET 9, there's the new `Delegate.EnumerateInvocationList<TDelegate>` method, which returns a `struct`-based enumerable for iterating through the delegates rather than needing to allocate a new array to store them all.
Code:
// dotnet run -c Release -f net9.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private Action _action;
private int _count;
[GlobalSetup]
public void Setup()
{
_action = () => _count++;
_action += () => _count += 2;
_action += () => _count += 3;
}
[Benchmark(Baseline = true)]
public void GetInvocationList()
{
foreach (Action action in _action.GetInvocationList())
{
action();
}
}
[Benchmark]
public void EnumerateInvocationList()
{
foreach (Action action in Delegate.EnumerateInvocationList(_action))
{
action();
}
}
}
| Method | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|
| GetInvocationList | 32.11 ns | 1.00 | 48 B | 1.00 |
| EnumerateInvocationList | 11.07 ns | 0.34 | – | 0.00 |

Reflection is particularly important with libraries involved in dependency injection, where object construction is frequently done in a more dynamic fashion. `ActivatorUtilities.CreateInstance` plays a key role there, and it has also seen allocation reductions. dotnet/runtime#99383, in particular, helped to significantly reduce allocation by employing the `ConstructorInvoker` type introduced in .NET 8, and by piggybacking on the changes from dotnet/runtime#99175 to cut down on the number of constructors it needs to examine.
Code:
// Add a <PackageReference Include="Microsoft.Extensions.DependencyInjection" Version="8.0.0" /> to the csproj.
// dotnet run -c Release -f net8.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;
using Microsoft.Extensions.DependencyInjection;
var config = DefaultConfig.Instance
.AddJob(Job.Default.WithRuntime(CoreRuntime.Core80)
.WithNuGet("Microsoft.Extensions.DependencyInjection", "8.0.0")
.WithNuGet("Microsoft.Extensions.DependencyInjection.Abstractions", "8.0.1").AsBaseline())
.AddJob(Job.Default.WithRuntime(CoreRuntime.Core90)
.WithNuGet("Microsoft.Extensions.DependencyInjection", "9.0.0-rc.1.24431.7")
.WithNuGet("Microsoft.Extensions.DependencyInjection.Abstractions", "9.0.0-rc.1.24431.7"));
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "NuGetReferences")]
public class Tests
{
private IServiceProvider _serviceProvider = new ServiceCollection().BuildServiceProvider();
[Benchmark]
public MyClass Create() => ActivatorUtilities.CreateInstance<MyClass>(_serviceProvider, 1, 2, 3);
public class MyClass
{
public MyClass() { }
public MyClass(int a) { }
public MyClass(int a, int b) { }
[ActivatorUtilitiesConstructor]
public MyClass(int a, int b, int c) { }
}
}
| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|
| Create | .NET 8.0 | 163.60 ns | 1.00 | 288 B | 1.00 |
| Create | .NET 9.0 | 83.46 ns | 0.51 | 144 B | 0.50 |

The aforementioned `ConstructorInvoker`, along with `MethodInvoker`, was introduced in .NET 8 as a way to cache first-use information so that all subsequent operations can be much faster. Without introducing a new public `FieldInvoker`, dotnet/runtime#98199 achieves similar levels of speedup for field access via a `FieldInfo` by employing an internal `FieldAccessor` that's cached onto the `FieldInfo` object (dotnet/runtime#92512 also helped here by moving some native runtime implementations back up into C#). Varying levels of large speedups are achieved depending on the exact nature of the field being accessed.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Reflection;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "NuGetReferences")]
public class Tests
{
private static object s_staticReferenceField = new object();
private object _instanceReferenceField = new object();
private static int s_staticValueField = 1;
private int _instanceValueField = 2;
private object _obj = new();
private FieldInfo _staticReferenceFieldInfo = typeof(Tests).GetField(nameof(s_staticReferenceField), BindingFlags.NonPublic | BindingFlags.Static)!;
private FieldInfo _instanceReferenceFieldInfo = typeof(Tests).GetField(nameof(_instanceReferenceField), BindingFlags.NonPublic | BindingFlags.Instance)!;
private FieldInfo _staticValueFieldInfo = typeof(Tests).GetField(nameof(s_staticValueField), BindingFlags.NonPublic | BindingFlags.Static)!;
private FieldInfo _instanceValueFieldInfo = typeof(Tests).GetField(nameof(_instanceValueField), BindingFlags.NonPublic | BindingFlags.Instance)!;
[Benchmark] public object? GetStaticReferenceField() => _staticReferenceFieldInfo.GetValue(null);
[Benchmark] public void SetStaticReferenceField() => _staticReferenceFieldInfo.SetValue(null, _obj);
[Benchmark] public object? GetInstanceReferenceField() => _instanceReferenceFieldInfo.GetValue(this);
[Benchmark] public void SetInstanceReferenceField() => _instanceReferenceFieldInfo.SetValue(this, _obj);
[Benchmark] public int GetStaticValueField() => (int)_staticValueFieldInfo.GetValue(null)!;
[Benchmark] public void SetStaticValueField() => _staticValueFieldInfo.SetValue(null, 3);
[Benchmark] public int GetInstanceValueField() => (int)_instanceValueFieldInfo.GetValue(this)!;
[Benchmark] public void SetInstanceValueField() => _instanceValueFieldInfo.SetValue(this, 4);
}
| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| GetStaticReferenceField | .NET 8.0 | 24.839 ns | 1.00 |
| GetStaticReferenceField | .NET 9.0 | 1.720 ns | 0.07 |
| SetStaticReferenceField | .NET 8.0 | 41.025 ns | 1.00 |
| SetStaticReferenceField | .NET 9.0 | 6.964 ns | 0.17 |
| GetInstanceReferenceField | .NET 8.0 | 29.595 ns | 1.00 |
| GetInstanceReferenceField | .NET 9.0 | 5.960 ns | 0.20 |
| SetInstanceReferenceField | .NET 8.0 | 31.753 ns | 1.00 |
| SetInstanceReferenceField | .NET 9.0 | 9.577 ns | 0.30 |
| GetStaticValueField | .NET 8.0 | 43.847 ns | 1.00 |
| GetStaticValueField | .NET 9.0 | 36.011 ns | 0.82 |
| SetStaticValueField | .NET 8.0 | 39.462 ns | 1.00 |
| SetStaticValueField | .NET 9.0 | 10.396 ns | 0.26 |
| GetInstanceValueField | .NET 8.0 | 45.125 ns | 1.00 |
| GetInstanceValueField | .NET 9.0 | 39.104 ns | 0.87 |
| SetInstanceValueField | .NET 8.0 | 36.664 ns | 1.00 |
| SetInstanceValueField | .NET 9.0 | 13.571 ns | 0.37 |

Of course, if you can avoid using these more expensive reflection approaches in the first place, that's very desirable. One reason for using reflection is to access private members of other types, and while that's a scary thing to do and generally something to be avoided, there are valid cases for it where having an efficient solution is highly desirable. .NET 8 added such a mechanism in `[UnsafeAccessor]`, which enables a type to declare a method that effectively serves as direct access to a member of another type. So, for example, with this:
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Reflection;
using System.Runtime.CompilerServices;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "NuGetReferences")]
public class Tests
{
private MyClass _myClass = new MyClass(new List<int>() { 1, 2, 3 });
private FieldInfo _fieldInfo = typeof(MyClass).GetField("_list", BindingFlags.NonPublic | BindingFlags.Instance)!;
private static class Accessors
{
[UnsafeAccessor(UnsafeAccessorKind.Field, Name = "_list")]
public static extern ref object GetList(MyClass myClass);
}
[Benchmark(Baseline = true)]
public object WithFieldInfo() => _fieldInfo.GetValue(_myClass)!;
[Benchmark]
public object WithUnsafeAccessor() => Accessors.GetList(_myClass);
}
public class MyClass(object list)
{
private object _list = list;
}
I get this:
| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| WithFieldInfo | .NET 8.0 | 27.5299 ns | 1.00 |
| WithFieldInfo | .NET 9.0 | 4.0789 ns | 0.15 |
| WithUnsafeAccessor | .NET 8.0 | 0.5005 ns | 0.02 |
| WithUnsafeAccessor | .NET 9.0 | 0.5499 ns | 0.02 |

However, in .NET 8, this mechanism could only be used with non-generic members. In .NET 9, thanks to dotnet/runtime#99468 and dotnet/runtime#99830, the capability extends to generics as well.
Code:
// dotnet run -c Release -f net9.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Reflection;
using System.Runtime.CompilerServices;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "NuGetReferences")]
public class Tests
{
private MyClass<int> _myClass = new MyClass<int>(new List<int>() { 1, 2, 3 });
private FieldInfo _fieldInfo = typeof(MyClass<int>).GetField("_list", BindingFlags.NonPublic | BindingFlags.Instance)!;
private static class Accessors<T>
{
[UnsafeAccessor(UnsafeAccessorKind.Field, Name = "_list")]
public static extern ref List<T> GetList(MyClass<T> myClass);
}
[Benchmark(Baseline = true)]
public List<int> WithFieldInfo() => (List<int>)_fieldInfo.GetValue(_myClass)!;
[Benchmark]
public List<int> WithUnsafeAccessor() => Accessors<int>.GetList(_myClass);
}
public class MyClass<T>(List<T> list)
{
private List<T> _list = list;
}
| Method | Mean | Ratio |
|---|---|---|
| WithFieldInfo | 4.4251 ns | 1.00 |
| WithUnsafeAccessor | 0.5147 ns | 0.12 |

Parsing that occurs as part of reflection, in particular of type names, was also improved as part of work to consolidate type name parsing into a reusable component. dotnet/runtime#100094's primary purpose wasn't to improve performance, but it ended up doing so anyway.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
[Benchmark]
public Type? Parse() =>
Type.GetType("System.Collections.Generic.Dictionary`2[" +
"[System.Collections.Generic.List`1[" +
"[System.Int32, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]], " +
"System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]," +
"[System.Collections.Generic.List`1[" +
"[System.String, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]], " +
"System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]], " +
"System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e");
}
| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|
| Parse | .NET 8.0 | 7.590 us | 1.00 | 5.03 KB | 1.00 |
| Parse | .NET 9.0 | 6.361 us | 0.84 | 4.73 KB | 0.94 |

And then there are more intrinsics. In compiler speak, an "intrinsic" is something the compiler has "intrinsic" knowledge of, a fancy way of saying it's something the compiler implicitly knows about. This typically manifests as a method whose implementation is provided by the compiler, sometimes always or sometimes based on certain conditions. For example, `string.Equals` is attributed as `[Intrinsic]`: it has its own fully-functional implementation, but if the JIT sees that at least one of the inputs is a constant string, it may emit its own optimized implementation of `Equals` that unrolls and vectorizes the comparison based on the exact value being compared.

Several new members became intrinsics in .NET 9. dotnet/runtime#96226 turns `typeof(T).IsPrimitive` into an intrinsic, which allows the JIT to supply a constant replacement for the expression, which in turn allows branches to be eliminated and possibly whole swaths of then-dead code along with them. For example, as part of its code path for moving to the next value, `Parallel.ForAsync` has a code path that looks like this:
Code:
if (typeof(T).IsPrimitive)
{
UseInterlockedCompareExchangeToAdvance();
}
else
{
UseALockAroundAReadIncrementStoreToAdvance();
}
With `IsPrimitive` as an intrinsic, that `if`/`else` will reduce entirely to either `UseInterlockedCompareExchangeToAdvance();` or `UseALockAroundAReadIncrementStoreToAdvance();`, based on the nature of `T`.
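To see that branch-folding effect outside of `Parallel.ForAsync`, here's a small standalone sketch (mine, not from the post); with `IsPrimitive` as an intrinsic, each generic instantiation of `Describe` can compile down to just returning the constant for that `T`:

```csharp
using System;

class IsPrimitiveDemo
{
    // With typeof(T).IsPrimitive treated as a JIT intrinsic, this condition
    // becomes a constant per instantiation, and the untaken arm is dead code
    // the JIT can eliminate entirely from the generated method.
    static string Describe<T>() =>
        typeof(T).IsPrimitive ? "primitive" : "not primitive";

    static void Main()
    {
        Console.WriteLine(Describe<int>());    // int is primitive
        Console.WriteLine(Describe<string>()); // string is not
    }
}
```

Inspecting the disassembly of `Describe<int>` with `[DisassemblyDiagnoser]` should show no comparison at all, just the constant result being returned.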
`typeof(T).IsGenericType` and `typeof(T).GetGenericTypeDefinition` were also made into intrinsics, by dotnet/runtime#99555 and dotnet/runtime#103528, respectively. Imagine code like that in ASP.NET, where it wants to special-case APIs that return `Task<T>` vs `ValueTask<T>` vs `IAsyncEnumerable<T>` vs `T` vs other types; it'll often use members like `IsGenericType` and `GetGenericTypeDefinition` (which throws an exception if `IsGenericType` is `false`) to determine whether a concrete instantiation of a generic type is the one in question. With this benchmark:
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
[Benchmark]
public bool Test() => IsTaskT<Task<string>>();
private static bool IsTaskT<T>() =>
typeof(T).IsGenericType &&
typeof(T).GetGenericTypeDefinition() == typeof(Task<>);
}
on .NET 8 we end up with over 250 bytes of assembly code for implementing this operation. On .NET 9, we get just this:
Code:
; Tests.Test()
mov eax,1
ret
; Total bytes of code 6
The magic of intrinsics.
Numerics
Primitive Types
The core data types in .NET sit at the very bottom of the stack and are used everywhere. It’s thus a desire every release to whittle away at any overheads we can avoid. .NET 9 is no exception, where a multitude of PRs have gone into reducing overheads of various operations on these core types.
Consider `DateTime`. When it comes to performance optimization, we typically focus on the happy path, the "hot path," the successful path. Exceptions already add significant expense to error paths, and they're intended to be "exceptional" and relatively rare, so we generally don't worry about an extra operation here or an extra allocation there. But, sometimes, one type's error path is another type's success path. This is especially true with `Try` methods, where failure is conveyed via a `bool` rather than with an expensive exception. As part of profiling a commonly-used .NET library, the profiler highlighted some unexpected allocations coming from `DateTime` handling, unexpected because we've spent a lot of time over the years eliminating allocations in this area of the code. The allocation, it turned out, was occurring on an error path, both with `DateTime.Parse` when an exception would be thrown, and with `DateTime.TryParse` when `false` would be returned. As it happened, deep in the call tree where the parsing work goes on, if an error is encountered, the code stores information about the failure (e.g. a `ParseFailureKind` enum value); after unwinding the call stack back to the public method, `Parse` uses that to throw an appropriately-detailed exception, while `TryParse` just ignores it and returns `false`. But the way the code was written, that enum value would end up getting boxed when it was stored, resulting in an allocation as part of `TryParse` returning `false`. The consuming library was using `TryParse` on a bunch of different primitive data types as part of interpreting the data, e.g.
Code:
if (int.TryParse(value, out int parsedInt32)) { ... }
else if (DateTime.TryParse(value, out DateTime parsedDateTime)) { ... }
else if (double.TryParse(value, out double parsedDouble)) { ... }
else if ...
such that its success path might include the failure path of some number of these primitives' `TryParse` methods. dotnet/runtime#91303 tweaked how the information is stored to avoid that boxing, while also shaving off a bit of additional overhead along the way.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "input")]
public class Tests
{
[Benchmark]
[Arguments("hello")]
public bool TryParse(string input) => DateTime.TryParse(input, out _);
}
| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|
| TryParse | .NET 8.0 | 31.95 ns | 1.00 | 24 B | 1.00 |
| TryParse | .NET 9.0 | 25.96 ns | 0.81 | – | 0.00 |

Both `DateTime` and `TimeSpan` also saw parsing and formatting gains from dotnet/runtime#101640 from @lilinus. The PR takes advantage of an existing internal `CountDigits` helper that was optimized in .NET 8 as part of integer parsing; it employs a lookup table to compute the number of digits required for a number, doing so in just a few instructions. And it replaces a `switch` with a lookup table as part of computing powers of ten, replacing a method like `Pow10_Old` in this benchmark with one more like `Pow10_New`:
Code:
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "input")]
public class Tests
{
private int _pow = 3;
[Benchmark(Baseline = true)]
public long Pow10_Old() =>
_pow switch
{
0 => 1,
1 => 10,
2 => 100,
3 => 1000,
4 => 10000,
5 => 100000,
6 => 1000000,
_ => 10000000, // _pow will never be greater than 7
};
[Benchmark]
public long Pow10_New()
{
ReadOnlySpan<int> powersOfTen =
[
1,
10,
100,
1000,
10000,
100000,
1000000,
10000000, // _pow will never be greater than 7
];
return powersOfTen[_pow];
}
}
The JIT is able to do a noticeably better job with the latter. For the former, it produces:
Code:
; Tests.Pow10_Old()
push rbp
mov rbp,rsp
M00_L00:
mov ecx,[rdi+8]
cmp ecx,3
jne short M00_L02
mov edx,3E8
M00_L01:
movsxd rax,edx
pop rbp
ret
M00_L02:
cmp ecx,6
ja short M00_L03
mov edx,ecx
lea rax,[7F3D29A690E8]
mov eax,[rax+rdx*4]
lea rcx,[M00_L00]
add rax,rcx
jmp rax
M00_L03:
mov edx,989680
jmp short M00_L01
mov edx,0F4240
jmp short M00_L01
mov edx,186A0
jmp short M00_L01
mov edx,2710
jmp short M00_L01
mov edx,64
jmp short M00_L01
mov edx,0A
jmp short M00_L01
mov edx,1
jmp short M00_L01
; Total bytes of code 100
but for the latter producing:
Code:
; Tests.Pow10_New()
push rax
mov eax,[rdi+8]
cmp eax,8
jae short M00_L00
mov rcx,7F3CC0AE6018
movsxd rax,dword ptr [rcx+rax*4]
add rsp,8
ret
M00_L00:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 34
The net result is a nice improvement to these operations, e.g.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private string _input = TimeSpan.FromMilliseconds(12345.6789).ToString();
[Benchmark]
public TimeSpan Parse() => TimeSpan.Parse(_input);
}
| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| Parse | .NET 8.0 | 137.55 ns | 1.00 |
| Parse | .NET 9.0 | 117.78 ns | 0.86 |

Various operations on the primitive types were also improved across a plethora of PRs:
- Round. dotnet/runtime#98186 from @MichalPetryka optimized the various Math.Round and MathF.Round overloads (which are the same implementations as double.Round and float.Round).
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private double _value = 12345.6789;
    [Benchmark]
    public double RoundDigits() => Math.Round(_value, 2);
}
| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| RoundDigits | .NET 8.0 | 1.6930 ns | 1.00 |
| RoundDigits | .NET 9.0 | 0.3496 ns | 0.21 |

- SinCos. dotnet/runtime#103724 updated Math.SinCos and MathF.SinCos to use the internal RuntimeHelpers.IsKnownConstant intrinsic. This method enables code in CoreLib to check whether the argument to the method is coming in as a constant the JIT can see, at which point the implementation might choose to do something special. In this case, the Sin and Cos functions are already capable of producing constant results for constant input, so rather than using the normal implementation, which tries to reuse most of the computation shared between Sin and Cos, it instead just calls each, knowing that a constant input to each will result in a constant output overall.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser(maxDepth: 0)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    [Benchmark]
    public float Sum() => SumSinCos(123.456f);
    private float SumSinCos(float f)
    {
        (float sin, float cos) = MathF.SinCos(f);
        return sin + cos;
    }
}

| Method | Runtime | Mean | Ratio | Code Size |
|---|---|---|---|---|
| Sum | .NET 8.0 | 5.2719 ns | 1.000 | 46 B |
| Sum | .NET 9.0 | 0.0177 ns | 0.003 | 9 B |
In cases like this, it’s helpful to pay attention to the warnings BenchmarkDotNet issues:
Code:
// * Warnings *
ZeroMeasurement
  Tests.Sum: Runtime=.NET 9.0, Toolchain=net9.0 -> The method duration is indistinguishable from the empty method duration
The .NET 9 run is indistinguishable from an empty method because it is an empty method, or at least a method that just returns a constant. We can see that by looking at the disassembly. The .NET 8 code has a few moves and loads and then calls to SinCos:
Code:
; Tests.Sum()
push rax
vzeroupper
vmovss xmm0,dword ptr [7F7686979610]
lea rdi,[rsp+4]
lea rsi,[rsp]
call System.MathF.SinCos(Single, Single*, Single*)
vmovss xmm0,dword ptr [rsp+4]
vmovss xmm1,dword ptr [rsp]
vaddss xmm0,xmm0,xmm1
add rsp,8
ret
; Total bytes of code 46
In contrast, here’s .NET 9:
Code:
; Tests.Sum()
vmovss xmm0,dword ptr [7F4D10FD9080]
ret
; Total bytes of code 9
It’s simply loading a value and returning it, since the whole operation compiled down to a constant.
- Enum.{Try}Parse. Interop scenarios drove the introduction of two new RuntimeHelpers APIs, SizeOf in dotnet/runtime#100618 and Box in dotnet/runtime#100561. But dotnet/runtime#100846 was then able to utilize these APIs to optimize the implementation of the non-generic Enum.Parse and Enum.TryParse overloads, which give back the parsed enum value as object. This is a special kind of boxing, because the parse methods internally extract a numerical value but then need the boxed object to be of the enum type (rather than the numerical type) specified via the Type argument.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private string _input = "Monday";
    [Benchmark]
    public object Parse() => Enum.Parse(typeof(DayOfWeek), _input);
}

| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| Parse | .NET 8.0 | 62.01 ns | 1.00 |
| Parse | .NET 9.0 | 28.13 ns | 0.45 |
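As a usage-level aside, the generic Enum.Parse&lt;TEnum&gt; overloads sidestep the object box entirely, which is worth reaching for when you know the enum type at compile time. A small sketch (my own illustration, mirroring the DayOfWeek benchmark above):

```csharp
using System;

// Non-generic overload: the result comes back boxed as object, and the
// box must be of the enum type specified via the Type argument.
object boxed = Enum.Parse(typeof(DayOfWeek), "Monday");
Console.WriteLine(boxed.GetType()); // the box is a DayOfWeek, not an Int32

// Generic overload: strongly typed, no boxing of the result.
DayOfWeek day = Enum.Parse<DayOfWeek>("Monday");

// The TryParse variants follow the same split.
Enum.TryParse(typeof(DayOfWeek), "Friday", out object? boxedFriday);
Enum.TryParse("Friday", out DayOfWeek friday);
```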
- Integer Division. Consider this benchmark:
Code:
// dotnet run -c Release -f net8.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    [Benchmark]
    [Arguments(5)]
    public uint DivideBy4_UInt32(uint value) => value / 4;
    [Benchmark]
    [Arguments(5)]
    public int DivideBy4_Int32(int value) => value / 4;
}
With the uint-based example, dividing by 4 is already optimized into a simple right shift, since for a uint, value / 4 and value >> 2 are functionally equivalent. However, that’s not the case for an int, or at least, not always. For a non-negative int, the same optimization could be employed, but if the int is negative, for some values switching from value / 4 to value >> 2 would be functionally incorrect. Consider -5 / 4… the answer is -1. But -5 >> 2 is -2. Oops. So when you look at the assembly code for the int case (here on .NET 8), it’s more complex:
Code:
; Tests.DivideBy4_UInt32(UInt32)
mov eax,esi
shr eax,2
ret
; Total bytes of code 6

; Tests.DivideBy4_Int32(Int32)
mov eax,esi
sar eax,1F
and eax,3
add eax,esi
sar eax,2
ret
; Total bytes of code 14
Given that, you might hope that if the compiler could prove the int was non-negative, it could still employ the simpler shifting:
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    [Benchmark]
    [Arguments(5)]
    public int DivideBy4_Int32(int value) => value < 4 ? 0 : value / 4;
}
But alas, on .NET 8, we still get:
Code:
; Tests.DivideBy4_Int32(Int32)
cmp esi,4
jl short M00_L00
mov eax,esi
sar eax,1F
and eax,3
add eax,esi
sar eax,2
ret
M00_L00:
xor eax,eax
ret
; Total bytes of code 22
On .NET 9, dotnet/runtime#94347 updates the JIT for exactly that, replacing signed division with unsigned division if it can prove that both the numerator and denominator are non-negative.
Code:
; Tests.DivideBy4_Int32(Int32)
cmp esi,4
jl short M00_L00
mov eax,esi
shr eax,2
ret
M00_L00:
xor eax,eax
ret
; Total bytes of code 14
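The correctness constraint behind all of this is easy to verify for yourself. This standalone sketch (my own illustration, not code from the JIT change) contrasts the two operations:

```csharp
using System;

// For unsigned values, dividing by a power of two and shifting agree.
uint u = 5;
Console.WriteLine(u / 4 == u >> 2); // True

// For negative signed values, they do not: division truncates toward zero,
// while an arithmetic right shift rounds toward negative infinity.
int n = -5;
Console.WriteLine(n / 4);  // -1
Console.WriteLine(n >> 2); // -2

// Once a value is known to be non-negative, the unsigned form (and thus
// the cheap shift) is safe, which is effectively what the JIT now proves.
int v = 5;
Console.WriteLine((int)((uint)v / 4)); // 1
```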
- Nullable. Several optimizations went into making Nullable&lt;T&gt; cheaper, in particular when used with generics. Consider this benchmark:
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    [Benchmark]
    public void TestStruct() => Dispose<List<int>.Enumerator>(default);
    [Benchmark]
    public void TestNullableStruct() => Dispose<List<int>.Enumerator?>(default);
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static void Dispose<T>(T t)
    {
        if (t is IDisposable)
        {
            ((IDisposable)t).Dispose();
        }
    }
}
We have an unconstrained generic method Dispose whose job is to cast the argument to IDisposable and invoke its Dispose. While such an operation would seemingly box if T were a value type, for a long time now the JIT has had optimizations that end up eliminating that boxing. In the case of List&lt;T&gt;.Enumerator, its Dispose implementation is a nop, so with Dispose&lt;T&gt; getting inlined, no boxing occurring, and the IDisposable.Dispose implementation nop’ing, this whole method is a nop (on both .NET 8 and .NET 9):
Code:
; Tests.TestStruct()
ret
; Total bytes of code 1
That’s unfortunately not the case for TestNullableStruct. The only difference between TestStruct and TestNullableStruct is that pesky ? in the generic type argument, which means T will be a Nullable&lt;List&lt;int&gt;.Enumerator&gt; rather than List&lt;int&gt;.Enumerator. That complicates things. Nullable&lt;T&gt; is very special, with a boxed nullable implementing the same interfaces as the underlying struct does, but it ends up being very hard for the JIT to deal with. On .NET 8, we end up with this assembly:
Code:
; Tests.TestNullableStruct()
push rbp
sub rsp,20
lea rbp,[rsp+20]
vxorps xmm8,xmm8,xmm8
vmovdqa xmmword ptr [rbp-20],xmm8
vmovdqa xmmword ptr [rbp-10],xmm8
cmp byte ptr [rbp-20],0
jne short M00_L01
M00_L00:
add rsp,20
pop rbp
ret
M00_L01:
lea rsi,[rbp-20]
mov rdi,offset MT_System.Nullable`1[[System.Collections.Generic.List`1+Enumerator[[System.Int32, System.Private.CoreLib]], System.Private.CoreLib]]
call CORINFO_HELP_BOX_NULLABLE
mov rsi,rax
mov rdi,offset MT_System.IDisposable
call qword ptr [7F178F2543C0]; System.Runtime.CompilerServices.CastHelpers.ChkCastInterface(Void*, System.Object)
mov rdi,rax
mov r11,7F178E5904C0
call qword ptr [r11]
jmp short M00_L00
; Total bytes of code 93
That’s a whole lot more than a ret. Thankfully, for .NET 9, dotnet/runtime#95764 makes this better by optimizing castclass for Nullable&lt;T&gt;:
Code:
; Tests.TestNullableStruct()
sub rsp,28
xor eax,eax
mov [rsp+8],rax
vxorps xmm8,xmm8,xmm8
vmovdqa xmmword ptr [rsp+10],xmm8
mov [rsp+20],rax
cmp byte ptr [rsp+8],0
jne short M00_L01
M00_L00:
add rsp,28
ret
M00_L01:
lea rsi,[rsp+8]
mov rdi,offset MT_System.Nullable`1[[System.Collections.Generic.List`1+Enumerator[[System.Int32, System.Private.CoreLib]], System.Private.CoreLib]]
call CORINFO_HELP_BOX_NULLABLE
movsx rax,byte ptr [rax+8]
jmp short M00_L00
; Total bytes of code 66
We still have the call to CORINFO_HELP_BOX_NULLABLE, but the relatively expensive call to ChkCastInterface is now gone. While this may seem like a corner case, it actually shows up in well-known places. For example:
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private int? _value = 42;
    [Benchmark]
    public string Interpolate() => $"{_value}";
}
Here we’re just doing string interpolation, using a nullable value type as one of the arguments. The DefaultInterpolatedStringHandler has a generic AppendFormatted method which this will end up using, passing the Nullable&lt;int&gt; as its argument, and that method employs similar patterns of type testing for an interface and using it if it’s available. As a result, this optimization can have a measurable impact on such interpolated string use:

| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| Interpolate | .NET 8.0 | 78.12 ns | 1.00 |
| Interpolate | .NET 9.0 | 62.95 ns | 0.81 |
Another Nullable&lt;T&gt;-related optimization is dotnet/runtime#95711, which ends up avoiding boxing for some forms of type testing. Consider this:
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private int? _value = 42;
    [Benchmark]
    public bool Test() => IsInt(_value);
    private static bool IsInt<T>(T value) => value is int;
}
This should be relatively straightforward: the JIT can see that T is a Nullable&lt;int&gt;, and then whether it satisfies the type test is a question of whether the value is null or not; if it’s null, it’s not an int, and if it’s not null, it is an int. Unfortunately, on .NET 8, not so much:
Code:
; Tests.Test()
push rbp
sub rsp,10
lea rbp,[rsp+10]
mov rsi,[rdi+8]
mov [rbp-8],rsi
lea rsi,[rbp-8]
mov rdi,offset MT_System.Nullable`1[[System.Int32, System.Private.CoreLib]]
call CORINFO_HELP_BOX_NULLABLE
test rax,rax
je short M00_L00
mov rcx,offset MT_System.Int32
xor edx,edx
cmp [rax],rcx
cmovne rax,rdx
M00_L00:
test rax,rax
setne al
movzx eax,al
add rsp,10
pop rbp
ret
; Total bytes of code 76
In fact, we can see it’s using CORINFO_HELP_BOX_NULLABLE to box the Nullable&lt;int&gt;, which means we actually end up with an allocation as part of this type test. And that’s visible in the benchmark results:

| Method | Runtime | Mean | Ratio | Code Size | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|
| Test | .NET 8.0 | 39.1567 ns | 1.000 | 76 B | 24 B | 1.00 |
| Test | .NET 9.0 | 0.0006 ns | 0.000 | 5 B | – | 0.00 |
On .NET 9, it ends up being what we thought it should be, a simple null check:
Code:
; Tests.Test()
movzx eax,byte ptr [rdi+8]
ret
; Total bytes of code 5
where the result of the method is simply Nullable&lt;T&gt;.HasValue.
As a small tangent, since we’re talking about optimizing casting, dotnet/runtime#98284 improves code generation for casts where the JIT can end up seeing that the object being cast is null. While you’d probably never explicitly write if (null is SomeClass), you might very well write if (GetObject() is SomeClass) where GetObject() might get inlined and return null, especially if GetObject() is virtual and, thanks to dynamic PGO, a null-returning override gets inlined.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    [Benchmark]
    public Tests? NullCast() => GetObj() as Tests;
    private object? GetObj() => null;
}
On .NET 8, it doesn’t pay attention to whether it knows that the source will be null, but now in .NET 9, it does:
Code:
// .NET 8
; Tests.NullCast()
push rax
mov rdi,offset MT_Tests
xor esi,esi
call qword ptr [7F0457E24360]; System.Runtime.CompilerServices.CastHelpers.IsInstanceOfClass(Void*, System.Object)
nop
add rsp,8
ret
; Total bytes of code 25

// .NET 9
; Tests.NullCast()
push rax
xor eax,eax
add rsp,8
ret
; Total bytes of code 8
Back to Nullable&lt;T&gt;, dotnet/runtime#105073 enables the JIT to inline the fast path of the unboxing helper that’s used when extracting a Nullable&lt;T&gt; from an object. A CORINFO_HELP_UNBOX_NULLABLE helper function is called to perform the unboxing (e.g. (int?)o for some object o), but the success path (where the object is either null or the boxed target type) is small, and it’s worth inlining.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private object _o = 42;
    [Benchmark]
    public int? Unbox() => (int?)_o;
}
On .NET 8, we get the following, effectively just a call to CORINFO_HELP_UNBOX_NULLABLE:
Code:
; Tests.Unbox()
push rax
mov rdx,[rdi+8]
lea rdi,[rsp]
mov rsi,offset MT_System.Nullable`1[[System.Int32, System.Private.CoreLib]]
call CORINFO_HELP_UNBOX_NULLABLE
mov rax,[rsp]
add rsp,8
ret
; Total bytes of code 33
whereas on .NET 9, we get the following, which creates a default Nullable&lt;int&gt; if the object is null, creates a Nullable&lt;int&gt; with the value from the object if it’s a boxed int, or calls CORINFO_HELP_UNBOX_NULLABLE if it’s something else (in which case we’ll be throwing an exception shortly):
Code:
; Tests.Unbox()
push rbp
sub rsp,10
lea rbp,[rsp+10]
mov rdx,[rdi+8]
test rdx,rdx
jne short M00_L00
xor edx,edx
mov [rbp-8],rdx
jmp short M00_L01
M00_L00:
mov rax,offset MT_System.Int32
cmp [rdx],rax
jne short M00_L02
mov byte ptr [rbp-8],1
mov eax,[rdx+8]
mov [rbp-4],eax
M00_L01:
mov rax,[rbp-8]
add rsp,10
pop rbp
ret
M00_L02:
lea rdi,[rbp-8]
mov rsi,offset MT_System.Nullable`1[[System.Int32, System.Private.CoreLib]]
call CORINFO_HELP_UNBOX_NULLABLE
jmp short M00_L01
; Total bytes of code 83
This is one of those cases where you actually want the code to be larger, at least for the micro-benchmark, because the inlining is the purpose and is bringing in more code.

| Method | Runtime | Mean | Ratio | Code Size |
|---|---|---|---|---|
| Unbox | .NET 8.0 | 6.014 ns | 1.00 | 33 B |
| Unbox | .NET 9.0 | 2.854 ns | 0.47 | 83 B |
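The semantics that the inlined fast path must preserve can be seen with a small standalone example (my own illustration, not code from the PR):

```csharp
using System;

object boxedInt = 42;   // a boxed Int32
object? nothing = null;

// Unboxing a boxed int to int? produces a nullable holding that value...
int? a = (int?)boxedInt;
Console.WriteLine(a.HasValue); // True

// ...while unboxing null produces an empty nullable rather than throwing.
int? b = (int?)nothing;
Console.WriteLine(b.HasValue); // False

// Anything else still falls through to the helper, which throws.
object boxedLong = 42L;
try { int? c = (int?)boxedLong; }
catch (InvalidCastException) { Console.WriteLine("InvalidCastException"); }
```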
BigInteger
Not exactly a “primitive” type, but in the same ballpark, is
BigInteger
. As with sbyte
, short
, int
, and long
, System.Numerics.BigInteger
is an IBinaryInteger<>
and ISignedNumber<>
. Unlike those types, which are all of a fixed bit size (8, 16, 32, and 64 bits, respectively), BigInteger
can represent signed integers with any number of bits (within reason… the current representation allows up to Array.MaxLength / 64
bits, which means representing numbers up to 2^33,554,432… that’s… big). Such large sizes bring with them performance complexities, and historically BigInteger
hasn’t been a beacon of high throughput. While there’s still more that can be done (and in fact there are several pending PRs even as I write this), a bunch of nice changes have landed for .NET 9.

dotnet/runtime#91176 from @Rob-Hague improved
BigInteger
‘s byte
-based constructors (e.g. public BigInteger(byte[] value)
) by utilizing vectorized operations from MemoryMarshal
and BinaryPrimitives
. In particular, a lot of the time spent in these BigInteger
constructors is in walking the list of bytes, building up integers out of each grouping of four, and storing those into a destination uint[]
. With spans, however, that whole operation is unnecessary and can be achieved with an optimized CopyTo
operation (effectively a memcpy
) with the destination just being that uint[]
reinterpreted as a span of bytes.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Numerics;
using System.Security.Cryptography;
BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private byte[] _bytes;
[GlobalSetup]
public void Setup()
{
_bytes = new byte[10_000];
new Random(42).NextBytes(_bytes);
}
[Benchmark]
public BigInteger NewBigInteger() => new BigInteger(_bytes);
}
| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| NewBigInteger | .NET 8.0 | 5.886 us | 1.00 |
| NewBigInteger | .NET 9.0 | 1.434 us | 0.24 |

Parsing is another common way of creating BigIntegers. dotnet/runtime#95543 improved the performance of parsing hex- and binary-formatted values (this is on top of the .NET 9 addition in dotnet/runtime#85392 from @lateapexearlyspeed that added support for the "b" format specifier for formatting and parsing BigInteger as binary). Previously, parsing would go digit by digit, but the new algorithm parses multiple chars at a time, using a vectorized implementation for larger inputs.
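One subtlety worth keeping in mind when exercising the hex parsing path (a small sketch of my own, not from the PR): BigInteger interprets hex input as two's-complement, so a leading digit with its high bit set yields a negative value.

```csharp
using System;
using System.Globalization;
using System.Numerics;

// "ff" has the high bit of the leading nibble set, so it parses as -1.
Console.WriteLine(BigInteger.Parse("ff", NumberStyles.HexNumber));  // -1

// Prepending a 0 keeps the value positive.
Console.WriteLine(BigInteger.Parse("0ff", NumberStyles.HexNumber)); // 255
```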
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Globalization;
using System.Numerics;
BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private string _hex = string.Create(1024, 0, (dest, _) => new Random(42).GetItems<char>("0123456789abcdef", dest));
[Benchmark]
public BigInteger ParseHex() => BigInteger.Parse(_hex, NumberStyles.HexNumber);
}
| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|
| ParseHex | .NET 8.0 | 5,155.5 ns | 1.00 | 5208 B | 1.00 |
| ParseHex | .NET 9.0 | 236.8 ns | 0.05 | 536 B | 0.10 |

This isn’t the first time efforts have been made to improve BigInteger parsing. .NET 7, for example, included a change that introduced a new parsing algorithm. The previous algorithm was O(N^2) in the number of digits, and the new algorithm had a lower algorithmic complexity, but due to the constants involved it was only worthwhile with a larger number of digits. Both algorithms were included, switching between them based on a cut-off of 20,000 digits. As it turns out, with more analysis, that threshold was significantly higher than it needed to be, and dotnet/runtime#97101 from @kzrnm lowered it to a much smaller value (1233). On top of this, dotnet/runtime#97589 from @kzrnm improves parsing further by a) recognizing that the multiplier used during parsing (shifting down digits to make room for adding in the next set) includes many leading zeros that can be ignored during the operation, and b) calculating trailing zeros more efficiently when parsing powers of 10.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Numerics;
BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private string _digits = string.Create(2000, 0, (dest, _) => new Random(42).GetItems<char>("0123456789", dest));
[Benchmark]
public BigInteger ParseDecimal() => BigInteger.Parse(_digits);
}
| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|
| ParseDecimal | .NET 8.0 | 24.60 us | 1.00 | 5528 B | 1.00 |
| ParseDecimal | .NET 9.0 | 18.95 us | 0.77 | 856 B | 0.15 |

Once you have a BigInteger, there are of course various operations you can perform with it. BigInteger.Equals was improved by dotnet/runtime#91416 from @Rob-Hague, which changed the implementation to use the optimized MemoryExtensions.SequenceEqual rather than walking the arrays backing each BigInteger element by element. dotnet/runtime#104513 from @Rob-Hague improved BigInteger.IsPowerOfTwo by similarly replacing a manual walk of the elements with a call to ContainsAnyExcept, checking whether all elements after a certain point are 0.
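The ContainsAnyExcept technique generalizes nicely. Here's a simplified sketch of my own (not the actual BigInteger internals) of testing whether a little-endian array of 32-bit limbs represents a power of two:

```csharp
using System;
using System.Numerics;

// A value is a power of two when its most significant nonzero limb is a
// power of two and every limb below it is zero. A single vectorized
// ContainsAnyExcept(0) call replaces a manual loop over those lower limbs.
static bool IsPowerOfTwo(ReadOnlySpan<uint> limbs)
{
    int msb = limbs.Length - 1;
    while (msb > 0 && limbs[msb] == 0) msb--;
    return BitOperations.IsPow2(limbs[msb]) &&
           !limbs[..msb].ContainsAnyExcept(0u);
}

Console.WriteLine(IsPowerOfTwo([0u, 0u, 1u])); // True  (2^64)
Console.WriteLine(IsPowerOfTwo([4u, 0u, 0u])); // True  (2^2)
Console.WriteLine(IsPowerOfTwo([4u, 1u, 0u])); // False
```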
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Numerics;
BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private BigInteger _value1, _value2;
[GlobalSetup]
public void Setup()
{
var value1 = new byte[10_000];
new Random(42).NextBytes(value1);
_value1 = new BigInteger(value1);
_value2 = new BigInteger(value1.AsSpan().ToArray());
}
[Benchmark]
public bool Equals() => _value1 == _value2;
}
| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| Equals | .NET 8.0 | 1,110.38 ns | 1.00 |
| Equals | .NET 9.0 | 79.80 ns | 0.07 |

dotnet/runtime#92208 from @kzrnm also improved BigInteger.Multiply, in particular when multiplying a first value that’s much larger than the second.
Code:
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Numerics;
BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private BigInteger _value1 = BigInteger.Parse(string.Concat(Enumerable.Repeat("1234567890", 1000)));
private BigInteger _value2 = BigInteger.Parse(string.Concat(Enumerable.Repeat("1234567890", 300)));
[Benchmark]
public BigInteger MultiplyLargeSmall() => _value1 * _value2;
}
| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| MultiplyLargeSmall | .NET 8.0 | 231.0 us | 1.00 |
| MultiplyLargeSmall | .NET 9.0 | 118.8 us | 0.51 |

Lastly, in addition to parsing, BigInteger formatting also saw some improvements. dotnet/runtime#100181 removed various temporary buffer allocations that were occurring as part of formatting and optimized various calculations in order to reduce formatting overheads.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Numerics;
BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private BigInteger _value = BigInteger.Parse(string.Concat(Enumerable.Repeat("1234567890", 300)));
private char[] _dest = new char[10_000];
[Benchmark]
public bool TryFormat() => _value.TryFormat(_dest, out _);
}
| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|
| TryFormat | .NET 8.0 | 102.49 us | 1.00 | 7456 B | 1.00 |
| TryFormat | .NET 9.0 | 94.52 us | 0.92 | – | 0.00 |

TensorPrimitives
Numerics has been a big focus for .NET over the last several releases. A large stable of numerical operations is now exposed on every numerical type as well as on a set of generic interfaces those types implement. But sometimes you want to perform the same operation on a set of values rather than on an individual value, and for that, we have TensorPrimitives. .NET 8 introduced the TensorPrimitives type, which provides a plethora of numerical APIs, but for spans of values rather than for individual values. For example, float has a Cosh method:
Code:
public static float Cosh(float x);
which provides the hyperbolic cosine of one float, and a corresponding method shows up on the IHyperbolicFunctions&lt;TSelf&gt; interface:
Code:
static abstract TSelf Cosh(TSelf x);
TensorPrimitives then has a corresponding method, but rather than accepting one float, it accepts a span of them, and rather than returning the results, it writes them into a provided destination span:
Code:
public static void Cosh(ReadOnlySpan<float> x, Span<float> destination);
In .NET 8, TensorPrimitives provided approximately 40 such methods, and only for float. Now in .NET 9, this has been significantly expanded. There are now over 200 overloads on TensorPrimitives, covering most of the numerical operations that are also exposed on the generic math interfaces (and some that aren’t), and they’re exposed using generics, such that they can work with many more data types than just float. For example, while it maintains its float-specific overload of Cosh for backwards binary compatibility, TensorPrimitives now also sports this overload:
Code:
public static void Cosh<T>(ReadOnlySpan<T> x, Span<T> destination)
where T : IHyperbolicFunctions<T>
such that it can be used with Half, float, double, NFloat, or any custom floating-point type you might have, as long as it implements the relevant interface. Most of these operations are also vectorized, such that they're more than just a simple loop around the corresponding scalar function.
Code:
// Add a <PackageReference Include="System.Numerics.Tensors" Version="9.0.0" /> to the csproj.
// dotnet run -c Release -f net9.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Numerics.Tensors;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private float[] _source, _destination;
[GlobalSetup]
public void Setup()
{
var r = new Random(42);
_source = Enumerable.Range(0, 1024).Select(_ => (float)r.NextSingle()).ToArray();
_destination = new float[1024];
}
[Benchmark(Baseline = true)]
public void ManualLoop()
{
ReadOnlySpan<float> source = _source;
Span<float> destination = _destination;
for (int i = 0; i < source.Length; i++)
{
destination[i] = float.Cosh(source[i]);
}
}
[Benchmark]
public void BuiltIn()
{
TensorPrimitives.Cosh<float>(_source, _destination);
}
}
| Method | Mean | Ratio |
|---|---|---|
| ManualLoop | 7,804.4 ns | 1.00 |
| BuiltIn | 621.6 ns | 0.08 |

A huge number of APIs is available, most of which see similar or better gains over the simple loop. Here’s what’s currently available in .NET 9, all as generic methods, and with multiple overloads available for most:
Abs, Acosh, AcosPi, Acos, AddMultiply, Add, Asinh, AsinPi, Asin, Atan2Pi, Atan2, Atanh, AtanPi, Atan, BitwiseAnd, BitwiseOr, Cbrt, Ceiling, ConvertChecked, ConvertSaturating, ConvertTruncating, ConvertToHalf, ConvertToSingle, CopySign, CosPi, Cos, Cosh, CosineSimilarity, DegreesToRadians, Distance, Divide, Dot, Exp, Exp10M1, Exp10, Exp2M1, Exp2, ExpM1, Floor, FusedMultiplyAdd, HammingDistance, HammingBitDistance, Hypot, Ieee754Remainder, ILogB, IndexOfMaxMagnitude, IndexOfMax, IndexOfMinMagnitude, IndexOfMin, LeadingZeroCount, Lerp, Log2, Log2P1, LogP1, Log, Log10P1, Log10, MaxMagnitude, MaxMagnitudeNumber, Max, MaxNumber, MinMagnitude, MinMagnitudeNumber, Min, MinNumber, MultiplyAdd, MultiplyAddEstimate, Multiply, Negate, Norm, OnesComplement, PopCount, Pow, ProductOfDifferences, ProductOfSums, Product, RadiansToDegrees, ReciprocalEstimate, ReciprocalSqrtEstimate, ReciprocalSqrt, Reciprocal, RootN, RotateLeft, RotateRight, Round, ScaleB, ShiftLeft, ShiftRightArithmetic, ShiftRightLogical, Sigmoid, SinCosPi, SinCos, Sinh, SinPi, Sin, SoftMax, Sqrt, Subtract, SumOfMagnitudes, SumOfSquares, Sum, Tanh, TanPi, Tan, TrailingZeroCount, Truncate, Xor
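As a quick taste of that surface area, the reductions and element-wise operations can be called directly on any span of numbers. A small sketch of my own using a few of the listed methods:

```csharp
// Add a <PackageReference Include="System.Numerics.Tensors" Version="9.0.0" /> to the csproj.
using System;
using System.Numerics.Tensors;

float[] values = [1f, 2f, 3f, 4f];

// Horizontal reductions over the whole span:
Console.WriteLine(TensorPrimitives.Sum<float>(values));     // 10
Console.WriteLine(TensorPrimitives.Max<float>(values));     // 4
Console.WriteLine(TensorPrimitives.Product<float>(values)); // 24

// Element-wise operation writing into a destination span:
float[] destination = new float[values.Length];
TensorPrimitives.Add<float>(values, 1f, destination);       // [2, 3, 4, 5]
```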
The possible speedups are even more pronounced on other operations and data types; for example, here is a manual implementation of hamming distance on two input
byte
arrays (hamming distance is the number of elements that differ between the two inputs), and an implementation using TensorPrimitives.HammingDistance<byte>
:
Code:
// Add a <PackageReference Include="System.Numerics.Tensors" Version="9.0.0" /> to the csproj.
// dotnet run -c Release -f net9.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Numerics.Tensors;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private byte[] _x, _y;
[GlobalSetup]
public void Setup()
{
var r = new Random(42);
_x = Enumerable.Range(0, 1024).Select(_ => (byte)r.Next(0, 256)).ToArray();
_y = Enumerable.Range(0, 1024).Select(_ => (byte)r.Next(0, 256)).ToArray();
}
[Benchmark(Baseline = true)]
public int ManualLoop()
{
ReadOnlySpan<byte> source = _x;
Span<byte> destination = _y;
int count = 0;
for (int i = 0; i < source.Length; i++)
{
if (source[i] != destination[i])
{
count++;
}
}
return count;
}
[Benchmark]
public int BuiltIn() => TensorPrimitives.HammingDistance<byte>(_x, _y);
}
| Method | Mean | Ratio |
|---|---|---|
| ManualLoop | 484.61 ns | 1.00 |
| BuiltIn | 15.76 ns | 0.03 |

A slew of PRs went into making this happen. The generic method surface area was added via dotnet/runtime#94555, dotnet/runtime#97192, dotnet/runtime#97572, dotnet/runtime#101435, dotnet/runtime#103305, and dotnet/runtime#104651. And then many more PRs added or improved vectorization, including dotnet/runtime#97361, dotnet/runtime#97623, dotnet/runtime#97682, dotnet/runtime#98281, dotnet/runtime#97835, dotnet/runtime#97846, dotnet/runtime#97874, dotnet/runtime#97999, dotnet/runtime#98877, dotnet/runtime#103214 from @neon-sunset, and dotnet/runtime#103820.
As part of all of this work, there was also a recognition that we had the scalar operations and we had the operations on an unbounded number of elements as part of spans, but doing the latter efficiently required effectively having the same set of operations on the various Vector128<T>, Vector256<T>, and Vector512<T> types, since the typical structure of one of these operations will process vectors of elements at a time. As such, progress has been made towards also exposing the same set of operations on these vector types. That's been done in dotnet/runtime#104848, dotnet/runtime#102181, dotnet/runtime#103837, dotnet/runtime#97114, and dotnet/runtime#96455. More to come.

Other related numerical types have also seen improvements. Quaternion multiplication was vectorized in dotnet/runtime#96624 by @TJHeuvel, and dotnet/runtime#103527 accelerated a variety of operations on Quaternion, Plane, Vector2, Vector3, Vector4, Matrix4x4, and Matrix3x2.
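To see why quaternion multiplication benefits from vectorization, it helps to write it out: the Hamilton product expands to 16 multiplications plus a pile of additions and subtractions across the four components, which SIMD can collapse into a handful of vector operations. A scalar sketch (the component formulas below are the standard Hamilton product; the built-in operator is used only to check against):

```csharp
using System;
using System.Numerics;

// Scalar Hamilton product: scalar part is w1*w2 - v1·v2;
// vector part is w1*v2 + w2*v1 + v1×v2.
static Quaternion MultiplyScalar(Quaternion a, Quaternion b)
{
    return new Quaternion(
        a.W * b.X + b.W * a.X + (a.Y * b.Z - a.Z * b.Y),
        a.W * b.Y + b.W * a.Y + (a.Z * b.X - a.X * b.Z),
        a.W * b.Z + b.W * a.Z + (a.X * b.Y - a.Y * b.X),
        a.W * b.W - (a.X * b.X + a.Y * b.Y + a.Z * b.Z));
}

var q1 = Quaternion.CreateFromYawPitchRoll(0.5f, 0.3f, 0.2f);
var q2 = Quaternion.CreateFromYawPitchRoll(0.1f, 0.2f, 0.3f);
Quaternion expected = q1 * q2;
Quaternion actual = MultiplyScalar(q1, q2);
Console.WriteLine(expected);
Console.WriteLine(actual); // should match the built-in result, modulo float rounding
```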
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Numerics;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private Quaternion _value1 = Quaternion.CreateFromYawPitchRoll(0.5f, 0.3f, 0.2f);
private Quaternion _value2 = Quaternion.CreateFromYawPitchRoll(0.1f, 0.2f, 0.3f);
[Benchmark]
public Quaternion Multiply() => _value1 * _value2;
}
| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| Multiply | .NET 8.0 | 3.064 ns | 1.00 |
| Multiply | .NET 9.0 | 1.086 ns | 0.35 |

dotnet/runtime#102301 also moves a lot of the implementation for types like Quaternion out of the JIT / native code into C#, something that's only possible now because of many of the other improvements discussed elsewhere in this post.

Strings, Arrays, Spans

IndexOf
As previously noted in Performance Improvements in .NET 8 and earlier in this post, my single favorite performance improvement in .NET 8 came from enabling dynamic PGO. But my second favorite improvement came from the introduction of SearchValues<T>. SearchValues<T> enables optimizing searches, by pre-computing an algorithm to use when searching for a specific set of values (or for anything other than those specific values) and storing that information for later repeated use. Internally, .NET 8 included upwards of 15 different implementations that might be chosen based on the nature of the supplied data. The type was so good at what it did that it was used in over 60 places as part of the .NET 8 release. In .NET 9, it's used even more, and it gets even better, in a multitude of ways.

The SearchValues<T> type is generic, so in theory it can be used for any T, but in practice, the algorithms involved need to special-case the nature of the data, and so the SearchValues.Create factory methods only enabled creating SearchValues<byte> and SearchValues<char> instances, for which dedicated implementations were provided. For example, many of the previously noted uses of SearchValues<T> are searching for a subset of ASCII, such as this use from Regex.Escape, which enables quickly searching for all characters that require escaping:

private static readonly SearchValues<char> s_metachars = SearchValues.Create("\t\n\f\r #$()*+.?[\\^{|");

If you print out the name of the type of the instance returned by that Create call, as an implementation detail today you'll see something like this:

System.Buffers.AsciiCharSearchValues`1[System.Buffers.IndexOfAnyAsciiSearcher+Default]

That type provides a specialization of SearchValues<char> optimized for searching for any ASCII subset, doing so with an implementation based on the "Universal algorithm" described at http://0x80.pl/articles/simd-byte-lookup.html. Essentially, the algorithm maintains an 8 by 16 bitmap, which not coincidentally is the size of ASCII (0 through 127). Each of the 128 bits in the bitmap represents whether the corresponding ASCII value is in the set. The input chars are mapped down to bytes in a way where chars greater than 127 are mapped to a value meaning no match. The lower nibble (4 bits) of the ASCII value is used to select one of the 16 bitmap rows, and the upper nibble is used to select one of the 8 bitmap columns. And the beauty of this algorithm is, on most supported platforms, there exist SIMD instructions that enable the processing of many characters concurrently as part of just a few instructions.

So, in .NET 8, SearchValues<T> was only for byte and char. But, now in .NET 9, thanks to dotnet/runtime#88394, dotnet/runtime#96429, dotnet/runtime#96928, dotnet/runtime#98901, and dotnet/runtime#98902, you can also create SearchValues<string> instances. The string handling is different from byte and char, however. With byte, you're searching for one of a set of bytes within a span of bytes. With char, you're searching for one of a set of chars within a span of chars. But with string, SearchValues<string> doesn't search for one of a set of strings within a span of strings, but rather it enables searching for one of a set of strings within a span of chars. In other words, it's a multi-string search. For example, let's say you want to search some text for the ISO 8601 days of the week, and to do so in an ordinal case-insensitive manner (such that, for example, both "Monday" and "MONDAY" would match). That can now be expressed like this:
Code:
private static readonly SearchValues<string> s_daysOfWeek = SearchValues.Create(
["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"],
StringComparison.OrdinalIgnoreCase);
...
ReadOnlySpan<char> textToSearch = ...;
int i = textToSearch.IndexOfAny(s_daysOfWeek);
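A quick runnable example of the semantics (the sample text here is my own; the returned index is where the first matching day name begins, regardless of casing):

```csharp
using System;
using System.Buffers;

SearchValues<string> daysOfWeek = SearchValues.Create(
    ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"],
    StringComparison.OrdinalIgnoreCase);

ReadOnlySpan<char> text = "See you next TUESDAY, or maybe Friday.";
Console.WriteLine(text.IndexOfAny(daysOfWeek));  // 13, the start of "TUESDAY"
Console.WriteLine(text.ContainsAny(daysOfWeek)); // True
```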
This also highlights another interesting difference from the existing byte and char support. For those types, SearchValues is purely an optimization: IndexOfAny overloads have long existed for searching for sets of T values within a larger collection of Ts (e.g. string.IndexOfAny(char[] anyOf) was introduced over two decades ago), and the SearchValues support simply makes those use cases faster (often much faster). In contrast, until .NET 9 there have not been any built-in methods for doing multi-string search, so this new support both adds such functionality and adds it in a way that is highly efficient.

But, let's say we did want to perform such a search, without that functionality existing in the core libraries. One approach is to simply walk through the input, position by position, comparing each of the target values at that location:
Code:
// dotnet run -c Release -f net9.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private static readonly string s_input = new HttpClient().GetStringAsync("https://gutenberg.org/cache/epub/2600/pg2600.txt").Result;
private static readonly string[] s_daysOfWeek = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"];
[Benchmark(Baseline = true)]
public bool Contains_Iterate()
{
ReadOnlySpan<char> input = s_input;
for (int i = 0; i < input.Length; i++)
{
foreach (string dow in s_daysOfWeek)
{
if (input.Slice(i).StartsWith(dow, StringComparison.OrdinalIgnoreCase))
{
return true;
}
}
}
return false;
}
}
| Method | Mean | Ratio |
|---|---|---|
| Contains_Iterate | 227.526 us | 1.000 |

Classic. Functional. And slow. This is doing a fair amount of work for every single character in the input, for each character looping over every day name and doing a comparison. How can we do better? First, we could try making the inner loop more efficient. Rather than iterating through the strings, we could hardcode our own switch:
Code:
// dotnet run -c Release -f net9.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private static readonly string s_input = new HttpClient().GetStringAsync("https://gutenberg.org/cache/epub/2600/pg2600.txt").Result;
[Benchmark]
public bool Contains_Iterate_Switch()
{
ReadOnlySpan<char> input = s_input;
for (int i = 0; i < input.Length; i++)
{
ReadOnlySpan<char> slice = input.Slice(i);
switch ((char)(input[i] | 0x20))
{
case 's' when slice.StartsWith("Sunday", StringComparison.OrdinalIgnoreCase) || slice.StartsWith("Saturday", StringComparison.OrdinalIgnoreCase):
case 'm' when slice.StartsWith("Monday", StringComparison.OrdinalIgnoreCase):
case 't' when slice.StartsWith("Tuesday", StringComparison.OrdinalIgnoreCase) || slice.StartsWith("Thursday", StringComparison.OrdinalIgnoreCase):
case 'w' when slice.StartsWith("Wednesday", StringComparison.OrdinalIgnoreCase):
case 'f' when slice.StartsWith("Friday", StringComparison.OrdinalIgnoreCase):
return true;
}
}
return false;
}
}
The main benefit of this is it makes the StartsWith calls much more efficient. Because each call is dedicated to a specific needle that the JIT can see, it can emit customized code to optimize that comparison (for context on my choice of language, "needle" is often used when describing a thing being searched for, a reference to the proverbial "needle in a haystack," and thus "haystack" is used to describe the thing being searched). We're also reducing the number of cases in the switch by employing an ASCII casing trick: the upper-case ASCII letters differ in numerical value from the lower-case ASCII letters by a single bit, so we simply ensure that bit is set and then compare against only the lower-case letters.

| Method | Mean | Ratio |
|---|---|---|
| Contains_Iterate | 227.526 us | 1.000 |
| Contains_Iterate_Switch | 13.885 us | 0.061 |

Much better, more than a 16x improvement. What if we instead just kept things simple and searched for each individual string using the already-optimized IndexOf?
IndexOf
?
Code:
// dotnet run -c Release -f net9.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private static readonly string s_input = new HttpClient().GetStringAsync("https://gutenberg.org/cache/epub/2600/pg2600.txt").Result;
private static readonly string[] s_daysOfWeek = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"];
[Benchmark]
public bool Contains_ContainsEachNeedle()
{
ReadOnlySpan<char> input = s_input;
foreach (string dow in s_daysOfWeek)
{
if (input.Contains(dow, StringComparison.OrdinalIgnoreCase))
{
return true;
}
}
return false;
}
}
Nice and simple, but…
| Method | Mean | Ratio |
|---|---|---|
| Contains_Iterate | 227.526 us | 1.000 |
| Contains_Iterate_Switch | 13.885 us | 0.061 |
| Contains_ContainsEachNeedle | 302.330 us | 1.329 |

Ouch. On the positive side, this approach benefits from vectorization, as the Contains operation itself is vectorized to efficiently check multiple locations at once using SIMD. Unfortunately, this case is heavily impacted by the order in which we perform the search. As it turns out, most of the days of the week show up in the input text (in this case, "War and Peace"), but at very different positions, and Monday doesn't show up at all. This:
Code:
using var hc = new HttpClient();
var s = await hc.GetStringAsync("https://gutenberg.org/cache/epub/2600/pg2600.txt");
Console.WriteLine($"Length: {s.Length}");
Console.WriteLine($"Monday: {s.IndexOf("Monday", StringComparison.OrdinalIgnoreCase)}");
Console.WriteLine($"Tuesday: {s.IndexOf("Tuesday", StringComparison.OrdinalIgnoreCase)}");
Console.WriteLine($"Wednesday: {s.IndexOf("Wednesday", StringComparison.OrdinalIgnoreCase)}");
Console.WriteLine($"Thursday: {s.IndexOf("Thursday", StringComparison.OrdinalIgnoreCase)}");
Console.WriteLine($"Friday: {s.IndexOf("Friday", StringComparison.OrdinalIgnoreCase)}");
Console.WriteLine($"Saturday: {s.IndexOf("Saturday", StringComparison.OrdinalIgnoreCase)}");
Console.WriteLine($"Sunday: {s.IndexOf("Sunday", StringComparison.OrdinalIgnoreCase)}");
yields this:
Code:
Length: 3293614
Monday: -1
Tuesday: 971396
Wednesday: 10652
Thursday: 107470
Friday: 640801
Saturday: 1529549
Sunday: 891753
That means that whereas Contains_Iterate_Switch only needs to examine 10,652 positions (the position of the first "Wednesday") before it finds a match, Contains_ContainsEachNeedle needs to examine 3,293,614 (no match found for "Monday" so it'll look at everything) + 971,396 (the index of "Tuesday") == 4,265,010 positions before it finds a match. That's 400x as many positions to be examined as the iterative approach. Even the SIMD vectorization gains can't make up for that gap in the amount of work to be performed.

Ok, so what if we changed approach, and instead searched for the first letter in each word, in order to quickly skip past the locations that couldn't possibly match? We could even use SearchValues<char> to perform that search.
Code:
// dotnet run -c Release -f net9.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Buffers;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private static readonly string s_input = new HttpClient().GetStringAsync("https://gutenberg.org/cache/epub/2600/pg2600.txt").Result;
private static readonly SearchValues<char> s_daysOfWeekFCSV = SearchValues.Create(['S', 's', 'M', 'm', 'T', 't', 'W', 'w', 'F', 'f']);
[Benchmark]
public bool Contains_IndexOfAnyFirstChars_SearchValues()
{
ReadOnlySpan<char> input = s_input;
int i;
while ((i = input.IndexOfAny(s_daysOfWeekFCSV)) >= 0)
{
ReadOnlySpan<char> slice = input.Slice(i);
switch ((char)(input[i] | 0x20))
{
case 's' when slice.StartsWith("Sunday", StringComparison.OrdinalIgnoreCase) || slice.StartsWith("Saturday", StringComparison.OrdinalIgnoreCase):
case 'm' when slice.StartsWith("Monday", StringComparison.OrdinalIgnoreCase):
case 't' when slice.StartsWith("Tuesday", StringComparison.OrdinalIgnoreCase) || slice.StartsWith("Thursday", StringComparison.OrdinalIgnoreCase):
case 'w' when slice.StartsWith("Wednesday", StringComparison.OrdinalIgnoreCase):
case 'f' when slice.StartsWith("Friday", StringComparison.OrdinalIgnoreCase):
return true;
}
input = input.Slice(i + 1);
}
return false;
}
}
In some situations, this is a very viable strategy; in fact, it's a technique often employed by Regex. In other situations, it's less appropriate. The potential problem is that letters like 's' and 't' are incredibly common. The characters here ('s', 'm', 't', 'w', and 'f'), both upper- and lower-case variants, make up ~17% of the input text (in contrast to just the capital subset, which makes up only ~0.54%). That means that, on average, this IndexOfAny call needs to break out of its inner vectorized processing loop every six characters, which decreases the possible efficiency gains from said vectorization. Even so, this is still our best so far:

| Method | Mean | Ratio |
|---|---|---|
| Contains_Iterate | 227.526 us | 1.000 |
| Contains_Iterate_Switch | 13.885 us | 0.061 |
| Contains_ContainsEachNeedle | 302.330 us | 1.329 |
| Contains_IndexOfAnyFirstChars_SearchValues | 7.151 us | 0.031 |

Now, let's try with SearchValues<string>:
Code:
// dotnet run -c Release -f net9.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Buffers;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private static readonly string s_input = new HttpClient().GetStringAsync("https://gutenberg.org/cache/epub/2600/pg2600.txt").Result;
private static readonly SearchValues<string> s_daysOfWeekSV = SearchValues.Create(
["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"],
StringComparison.OrdinalIgnoreCase);
[Benchmark]
public bool Contains_StringSearchValues() =>
s_input.AsSpan().ContainsAny(s_daysOfWeekSV);
}
The functionality is built-in, so we haven't had to write any custom logic other than the call to ContainsAny. And the results:

| Method | Mean | Ratio |
|---|---|---|
| Contains_Iterate | 227.526 us | 1.000 |
| Contains_Iterate_Switch | 13.885 us | 0.061 |
| Contains_ContainsEachNeedle | 302.330 us | 1.329 |
| Contains_IndexOfAnyFirstChars_SearchValues | 7.151 us | 0.031 |
| Contains_StringSearchValues | 2.153 us | 0.009 |

Not only simpler, then, but also several times faster than the fastest result we'd previously managed, and ~105x faster than our original attempt. Sweet!
How does this all work? The algorithms behind it are quite fascinating. As with byte and char, there are multiple concrete implementations that might be employed, selected based on the exact needle values passed to Create. The simplest implementations are those for handling degenerate cases, like zero inputs (in which case all of the methods can just return hard-coded "not found" results). There's also a dedicated implementation for a single input, in which case it can perform the same search as IndexOf(needle) would have done, but lifting out the choice of characters within the needle for which to perform a vectorized search. IndexOf(string) chooses a couple of characters from the needle (typically just the first and last character in the needle), creates a vector for each of those, and then, with appropriate offsets based on the distance between the chosen characters, iterates through the input, comparing against those vectors and doing a full string comparison only if both vectors match at a particular location. SearchValues<string> does the same thing (in an internal implementation today called SingleStringSearchValuesThreeChars), except it uses three characters instead of two, and it employs frequency analysis to choose those characters rather than simply picking the first and last, trying to use characters that are less likely to appear in general (e.g. given the string "amazing", it'd likely pick the 'm', 'z', and 'g', as it deems those statistically less likely in average inputs than 'a', 'i', or 'n'). It can afford to spend more time on this choice, given that it can perform the computation once and then cache it for all subsequent searches. We'll refer back to this in a bit.

Beyond those special cases, it starts to get really interesting. There's been a lot of research done over the last 50 years into the most efficient ways to perform a multi-string search. One popular algorithm is Rabin-Karp, which was created by Richard Karp and Michael Rabin in the 1980s, and which works via a "rolling hash." Imagine creating a hash of the first N characters in the haystack (input) text, where N is the length of the needle (the substring) for which you're searching, and comparing the haystack hash against the needle hash; if they match, do the actual full comparison at that location, otherwise continue. Then update the hash by removing the first character and adding the next character, and repeat the check. And then repeat, and repeat, and so on. Each time you move forward, you're just updating the hash via a fixed number of operations, meaning that all of the updates to the hash function for the whole operation are only O(Haystack). Best case, you only find a single location that the substring could match, and you've got O(Haystack + Needle) algorithmic complexity. Worst case (but generally unlikely), every location is a possible match, and you've got O(Haystack * Needle) algorithmic complexity. A simple implementation might look like this (for pedagogical purposes, this uses a terrible hash function that just sums the characters' numerical values; the real algorithm recommends a better one):
Code:
private static bool RabinKarpContains(ReadOnlySpan<char> haystack, ReadOnlySpan<char> needle)
{
if (haystack.Length >= needle.Length)
{
// Hash the needle and the first needle.Length chars of the haystack.
// Super simple hash for pedagogical purposes: just sum the chars.
int i, rollingHash = 0, needleHash = 0;
for (i = 0; i < needle.Length; i++)
{
rollingHash += haystack[i];
needleHash += needle[i];
}
while (true)
{
// If the hashes match, compare the strings.
if (needleHash == rollingHash && haystack.Slice(i - needle.Length).StartsWith(needle))
{
return true;
}
// If we've reached the end of the haystack, break.
if (i == haystack.Length)
{
break;
}
// Update the rolling hash.
rollingHash += haystack[i] - haystack[i - needle.Length];
i++;
}
}
return needle.IsEmpty;
}
This supports one needle, but extending it to support multiple needles can be accomplished in a variety of ways, such as by bucketing needles by their hash codes (ala what a hash map does) and then either checking all needles in the corresponding bucket when there's a hit, or further reducing what needs to be checked by using a Bloom filter or similar technique.
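As a minimal sketch of that bucketing idea (assuming, for simplicity, that all needles share the same length; a real implementation would handle varying lengths, e.g. by hashing a fixed-length prefix), the single-needle version above extends to multiple needles like this:

```csharp
using System;
using System.Collections.Generic;

// Simplified multi-needle Rabin-Karp. Assumes all needles have the same length,
// and reuses the same pedagogically-terrible "sum the chars" hash as above.
static bool RabinKarpContainsAny(ReadOnlySpan<char> haystack, string[] needles)
{
    int needleLength = needles[0].Length;
    if (haystack.Length < needleLength) return false;

    // Bucket the needles by their hash; on a hash hit, only that bucket is checked.
    var buckets = new Dictionary<int, List<string>>();
    foreach (string needle in needles)
    {
        int needleHash = 0;
        foreach (char c in needle) needleHash += c;
        if (!buckets.TryGetValue(needleHash, out List<string>? bucket))
        {
            buckets[needleHash] = bucket = new List<string>();
        }
        bucket.Add(needle);
    }

    // Roll a hash across the haystack, exactly as in the single-needle version.
    int rollingHash = 0;
    for (int i = 0; i < needleLength; i++) rollingHash += haystack[i];

    for (int i = needleLength; ; i++)
    {
        // On a hash hit, compare only against the needles in that bucket.
        if (buckets.TryGetValue(rollingHash, out List<string>? candidates))
        {
            foreach (string candidate in candidates)
            {
                if (haystack.Slice(i - needleLength).StartsWith(candidate))
                {
                    return true;
                }
            }
        }

        if (i == haystack.Length) return false;
        rollingHash += haystack[i] - haystack[i - needleLength];
    }
}

Console.WriteLine(RabinKarpContainsAny("the quick brown fox", ["brawn", "brown"])); // True
Console.WriteLine(RabinKarpContainsAny("the quick brown fox", ["bravo", "crown"])); // False
```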
SearchValues<string> will utilize Rabin-Karp, but only for very short inputs, as for longer inputs there are more efficient approaches.

Another popular algorithm is Aho-Corasick, which was designed by Alfred Aho and Margaret Corasick even earlier, in the 1970s. Its primary purpose is multi-string search, enabling a match to be found in time linear in the length of the input, assuming a fixed set of needles. It works by building up a form of a trie, a finite automaton where you start at the root of the graph and transition to children based on matching the character associated with the edge to that child. But it extends a typical trie with additional edges between nodes that can be used as fallbacks. For example, given the automaton for the days of the week discussed earlier and the input text "wednesunday", the search will start at the root, progress through the "w", "we", "wed", "wedn", "wedne", and "wednes" nodes, but then, upon encountering the subsequent 'u' and not being able to progress down that path, it'll employ the fallback link over to the "s" node, at which point it'll be able to traverse down through "s", "su", and so on, until it hits the leaf "sunday" node and can declare success. Aho-Corasick efficiently supports larger sets of strings, and is the go-to implementation SearchValues<string> uses as a general fallback. However, in many situations, it can do even better…

The real workhorse of SearchValues<string> that's chosen whenever possible is a vectorized implementation of the "Teddy" algorithm. This algorithm originated in Intel's Hyperscan library, was later adopted by the Rust aho_corasick crate, and is now employed as part of SearchValues<string> in .NET 9. It is super cool, and super efficient.

Earlier, I gave a rough summary of how the SingleStringSearchValuesThreeChars and IndexOfAnyAsciiSearcher implementations work. SingleStringSearchValuesThreeChars optimizes finding likely positions where a substring might start, reducing the number of false positives by checking for multiple contained characters, and then likely positions are validated by doing the full string comparison at that location. And IndexOfAnyAsciiSearcher optimizes finding the next position of any character in a large-ish set. You can think of Teddy as a combination of those. There's a really nice description of the algorithm in the source, so I won't go into much detail here. In summary, though, it maintains a similar bitmap as with IndexOfAnyAsciiSearcher, but instead of a single bit per ASCII character, it maintains an 8-bit bitmap for each nibble, and instead of just one bitmap, it maintains two or three, each of which corresponds to a location in the substrings (e.g. one bitmap for the 0th character and one bitmap for the 1st character). Those 8 bits in the bitmap are used to indicate which of up to 8 needles contain that nibble at that location. If there are fewer than 8 needles being searched for, then each of these bits individually identifies one of them, and if there are more than 8 needles, just as with Rabin-Karp, we can create buckets of the needle substrings, with a bit in the bitmap referring to one of the buckets. If the comparisons against the bitmaps indicate a likely match, the full match is performed against the relevant needle (or needles, in the case of matching a bucket). And as with IndexOfAnyAsciiSearcher, all of this support employs SIMD instructions to perform the lookups on chunks of input text of between 16 and 64 characters at a time, yielding significant speedups.

SearchValues<string> is great for larger numbers of strings, but it's relevant even for just a few. Consider, for example, this code from MSBuild that's part of parsing build output looking for warnings and errors:
Code:
if (message.IndexOf("warning", StringComparison.OrdinalIgnoreCase) == -1 &&
message.IndexOf("error", StringComparison.OrdinalIgnoreCase) == -1)
{
return null;
}
Rather than doing two individual searches, we can perform a single search:
Code:
// dotnet run -c Release -f net9.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Buffers;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private static readonly string s_input = new HttpClient().GetStringAsync("https://gutenberg.org/cache/epub/2600/pg2600.txt").Result;
private static readonly SearchValues<string> s_warningError = SearchValues.Create(["warning", "error"], StringComparison.OrdinalIgnoreCase);
[Benchmark(Baseline = true)]
public bool TwoContains() =>
s_input.Contains("warning", StringComparison.OrdinalIgnoreCase) ||
s_input.Contains("error", StringComparison.OrdinalIgnoreCase);
[Benchmark]
public bool ContainsAny() =>
s_input.AsSpan().ContainsAny(s_warningError);
}
This is searching "War and Peace" for both "warning" and "error", but even though both appear in the text, such that the second search for "error" in the original code will never happen, the SearchValues<string> search ends up being faster because "error" appears much earlier in the text than does "warning".

| Method | Mean | Ratio |
|---|---|---|
| TwoContains | 70.03 us | 1.00 |
| ContainsAny | 14.05 us | 0.20 |

Beyond SearchValues<string>, the existing SearchValues<byte> and SearchValues<char> support also gets a variety of boosts in .NET 9. dotnet/runtime#96588, for example, makes some common SearchValues<char> searches faster, specifically when there are 2 or 4 characters being searched for that represent 1 or 2 ASCII case-insensitive characters, such as ['A', 'a'] or ['A', 'a', 'B', 'b']. In .NET 8, for ['A', 'a'], for example, SearchValues.Create will end up picking an implementation that will create a vector for each of 'A' and 'a', and then in the inner loop of the search, it'll compare each vector against the input haystack text. This PR teaches it to do a similar ASCII trick to the one we discussed earlier: rather than having two separate vectors, it can have a single vector for 'a', and then do a single comparison against the input vector that's been OR'd with 0x20, such that any 'A's become 'a's. The OR plus a single comparison is cheaper than the two comparisons plus the OR of the resulting comparisons. Funnily enough, this needn't even be about casing: since all we're doing is OR'ing in 0x20, it applies to any two characters that differ by that same single bit.
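The identity is easy to see in the bit patterns (a tiny illustrative snippet; nothing here is specific to SearchValues):

```csharp
using System;

// 'A' is 0x41 and 'a' is 0x61: they differ only in the 0x20 bit, so OR'ing in
// 0x20 normalizes both to 'a'. The same holds for any pair differing only in
// that bit, casing-related or not, e.g. '@' (0x40) / '`' (0x60) and '^' (0x5E) / '~' (0x7E).
Console.WriteLine((char)('A' | 0x20)); // a
Console.WriteLine((char)('a' | 0x20)); // a
Console.WriteLine((char)('@' | 0x20)); // `
Console.WriteLine((char)('^' | 0x20)); // ~
```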
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Buffers;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private static readonly string s_input = new HttpClient().GetStringAsync("https://www.gutenberg.org/cache/epub/100/pg100.txt").Result;
private static readonly SearchValues<char> s_symbols = SearchValues.Create("@`");
[Benchmark]
public bool ContainsAny() => s_input.AsSpan().ContainsAny(s_symbols);
}
| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| ContainsAny | .NET 8.0 | 262.7 us | 1.02 |
| ContainsAny | .NET 9.0 | 232.3 us | 0.90 |

The same thing applies with four characters: instead of doing four vector comparisons and three OR operations to combine them, we can do a single OR on the input to mix in 0x20, two vector comparisons, and a single OR to combine those results. In fact, the four-vector approach was already more expensive than the IndexOfAnyAsciiSearcher implementation previously described, and since that supports any number of ASCII characters, SearchValues.Create would have preferred it when applicable, even for just four-character needles. But now in .NET 9, with this optimization, SearchValues.Create will prefer to use this specialized two-comparison path.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Buffers;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private static readonly string s_input = new HttpClient().GetStringAsync("https://www.gutenberg.org/cache/epub/100/pg100.txt").Result;
private static readonly SearchValues<char> s_symbols = SearchValues.Create("@`^~");
[Benchmark]
public bool ContainsAny() => s_input.AsSpan().ContainsAny(s_symbols);
}
| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| ContainsAny | .NET 8.0 | 247.5 us | 1.01 |
| ContainsAny | .NET 9.0 | 196.2 us | 0.80 |

Other SearchValues implementations also improve in .NET 9, notably the "ProbabilisticMap" implementations. These implementations are preferred by SearchValues<char> as a fallback when the faster vectorized implementations aren't applicable but the number of characters in the needle isn't exorbitant (the current limit is 256). They work via a form of Bloom filter. Effectively, the implementation maintains a 256-bit bitmap, with needle characters mapping to one or two bits, depending on the char. If any of the bits for a given char isn't 1, then the char is definitively not in the set. If all of the bits for a given char are 1, then the char may be in the set, and a more expensive check needs to be performed to determine inclusion. Checking whether those bits are set is a vectorizable operation, and so as long as false positives are relatively rare (which is why there's a limit on the number of characters; the more characters are represented, the more false positives there are likely to be), it's an efficient means for doing the search. However, this vectorization only applies to positive cases (e.g. IndexOfAny / ContainsAny) and not negative cases (e.g. IndexOfAnyExcept / ContainsAnyExcept); for those "Except" methods, the implementation still walks character by character, and the check it employed per character was O(Needle). Thanks to dotnet/runtime#101001, which replaces that linear search with a "perfect hash," the O(Needle) drops to O(1), making such "Except" calls much more efficient.
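To make the idea concrete, here's a drastically simplified, scalar sketch of a probabilistic char set. This is not the real ProbabilisticMap, which is vectorized and chooses its bit mappings more carefully; this sketch just maps a char's low byte to one bit and, for non-Latin1 chars, its high byte to a second, with a HashSet standing in for the "more expensive check":

```csharp
using System;
using System.Collections.Generic;

var set = new ProbabilisticCharSet("αβγ123");
Console.WriteLine(set.Contains('α')); // True
Console.WriteLine(set.Contains('1')); // True
Console.WriteLine(set.Contains('z')); // False (its bit is clear: definitively not in the set)

sealed class ProbabilisticCharSet
{
    private readonly uint[] _bitmap = new uint[8]; // 256 bits
    private readonly HashSet<char> _exact;

    public ProbabilisticCharSet(ReadOnlySpan<char> needle)
    {
        _exact = new HashSet<char>();
        foreach (char c in needle)
        {
            _exact.Add(c);
            SetBit((byte)c);                      // the low byte always contributes a bit
            if (c > 0xFF) SetBit((byte)(c >> 8)); // non-Latin1 chars contribute a second bit
        }
    }

    private void SetBit(byte b) => _bitmap[b >> 5] |= 1u << (b & 31);
    private bool TestBit(byte b) => (_bitmap[b >> 5] & (1u << (b & 31))) != 0;

    public bool Contains(char c)
    {
        // If any required bit is clear, c is definitively not in the set.
        if (!TestBit((byte)c)) return false;
        if (c > 0xFF && !TestBit((byte)(c >> 8))) return false;

        // All bits set: c *may* be in the set (or may be a false positive),
        // so fall back to an exact membership check.
        return _exact.Contains(c);
    }
}
```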
```csharp
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Buffers;
using System.Text.RegularExpressions;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    public static readonly string s_aristotle = new HttpClient().GetStringAsync("https://www.gutenberg.org/cache/epub/39963/pg39963.txt").Result;

    public static readonly SearchValues<char> s_greekOrAsciiDigits = SearchValues.Create(
        Enumerable.Range(0, char.MaxValue + 1)
            .Where(i => Regex.IsMatch(((char)i).ToString(), @"[\p{IsGreek}0-9]"))
            .Select(i => (char)i)
            .ToArray());

    [Benchmark]
    public int CountNonGreekOrAsciiDigitsChars()
    {
        int count = 0;
        ReadOnlySpan<char> text = s_aristotle;
        int index;
        while ((index = text.IndexOfAnyExcept(s_greekOrAsciiDigits)) >= 0)
        {
            count++;
            text = text.Slice(index + 1);
        }
        return count;
    }
}
```
| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| CountNonGreekOrAsciiDigitsChars | .NET 8.0 | 1,814.7 us | 1.00 |
| CountNonGreekOrAsciiDigitsChars | .NET 9.0 | 881.7 us | 0.49 |

That same PR also made another significant improvement related to the probabilistic map: not using it as much. It's a terrific implementation for some sets of inputs, but for others it can end up performing poorly. .NET 8 included a `Latin1CharSearchValues`, which was used when some of the needle characters were non-ASCII but all were less than 256. In such cases, if the probabilistic map couldn't vectorize, `SearchValues.Create` would return a `Latin1CharSearchValues` instance, which maintained a simple 256-bit bitmap that detailed whether each character is in the needle. For .NET 9, that type has been replaced by a more general one that supports arbitrarily large bitmaps, used when there are simply too many characters for the probabilistic map implementation to handle well or when the values are sufficiently dense. Consider a case like this:
```csharp
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Buffers;
using System.Text.RegularExpressions;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    public static readonly string s_markTwain = new HttpClient().GetStringAsync("https://www.gutenberg.org/cache/epub/3200/pg3200.txt").Result;

    public static readonly SearchValues<char> s_greekChars = SearchValues.Create(
        Enumerable.Range(0, char.MaxValue + 1)
            .Where(i => Regex.IsMatch(((char)i).ToString(), @"[\p{IsGreek}\p{IsGreekExtended}]"))
            .Select(i => (char)i)
            .ToArray());

    [Benchmark]
    public int CountGreekChars()
    {
        int count = 0;
        ReadOnlySpan<char> text = s_markTwain;
        int index;
        while ((index = text.IndexOfAny(s_greekChars)) >= 0)
        {
            count++;
            text = text.Slice(index + 1);
        }
        return count;
    }
}
```
The needle here includes all of the characters in the Greek and Greek Extended Unicode blocks, approximately 400 characters. With the way the probabilistic map builds up its filter bitmap, every single bit in the bitmap ends up being set, which means every examined character will fall back to the expensive path. Now in .NET 9, it’ll use a simpler, non-probabilistic bitmap, and even though it’s not vectorized, it yields significantly faster throughput.
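The idea of that simpler, exact bitmap can be sketched in a few lines of Python (a hypothetical illustration, not the actual .NET code): one bit per character value, so a set bit means the character definitively is in the needle, and there is no expensive fallback path.

```python
# Exact (non-probabilistic) bitmap: one bit per character value.
def make_bitmap(needle):
    bits = 0
    for ch in needle:
        bits |= 1 << ord(ch)
    return bits

def contains(bits, ch):
    # A single bit test; no false positives, so no second confirmation check.
    return (bits >> ord(ch)) & 1 == 1
```

The trade-off is memory (the bitmap must span the needle's character range) and the loss of the vectorized prefilter, but for dense needles like the Greek blocks it avoids the probabilistic map's worst case entirely.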
| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| CountGreekChars | .NET 8.0 | 126.454 ms | 1.00 |
| CountGreekChars | .NET 9.0 | 8.956 ms | 0.07 |

dotnet/runtime#96931 also extended this probabilistic map support to benefit from AVX512, such that when the probabilistic map implementation is used, it can be significantly faster. Previously, its implementation would utilize 128-bit or 256-bit vectors, depending on hardware support; now in .NET 9, it can also use 512-bit vectors. Not only does this potentially double throughput due to vector width, but AVX512 also includes some applicable instructions that the older instruction sets don't have (e.g. `VPERMB`, which is exposed as `Avx512Vbmi.PermuteVar64x8`), enabling even faster processing by employing those more sophisticated instructions where relevant. This ends up being particularly impactful when searching for a reasonably small number of non-ASCII characters.
```csharp
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Buffers;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    public static readonly string s_aristotle = new HttpClient().GetStringAsync("https://www.gutenberg.org/cache/epub/39963/pg39963.txt").Result;

    public static readonly SearchValues<char> s_setSymbols = SearchValues.Create("⊂⊃⊆⊇⊄∩∪∈∊∉∋∍∌∅");

    [Benchmark]
    public int Count()
    {
        int count = 0;
        ReadOnlySpan<char> text = s_aristotle;
        int index;
        while ((index = text.IndexOfAny(s_setSymbols)) >= 0)
        {
            count++;
            text = text.Slice(index + 1);
        }
        return count;
    }
}
```
| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| Count | .NET 8.0 | 28.35 us | 1.00 |
| Count | .NET 9.0 | 13.19 us | 0.47 |

Further, while the probabilistic map implementations were previously vectorized for `IndexOfAny` (and therefore implicitly for `ContainsAny`), they weren't for `LastIndexOfAny`, which meant that merely changing whether you searched from start to end or from end to start could have a significant impact on throughput. dotnet/runtime#102331 improves that as well, enabling the `LastIndexOfAny` path to also take advantage of SIMD.
```csharp
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Buffers;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    public static readonly string s_markTwain = new HttpClient().GetStringAsync("https://www.gutenberg.org/cache/epub/3200/pg3200.txt").Result;

    public static readonly SearchValues<char> s_accentedChars = SearchValues.Create(
        ['À', 'È', 'Ì', 'Ò', 'Ù', 'Á', 'É', 'Í', 'Ó', 'Ú',
         'Â', 'Ê', 'Î', 'Ô', 'Û', 'Ã', 'Ẽ', 'Ĩ', 'Õ', 'Ũ',
         'Ä', 'Ë', 'Ï', 'Ö', 'Ü', 'Ÿ']);

    [Benchmark]
    public bool HasAnyAccented_IndexOfAny() => s_markTwain.AsSpan().IndexOfAny(s_accentedChars) >= 0;

    [Benchmark]
    public bool HasAnyAccented_LastIndexOfAny() => s_markTwain.AsSpan().LastIndexOfAny(s_accentedChars) >= 0;
}
```
| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| HasAnyAccented_IndexOfAny | .NET 8.0 | 7.910 ms | 1.00 |
| HasAnyAccented_IndexOfAny | .NET 9.0 | 4.476 ms | 0.57 |
| HasAnyAccented_LastIndexOfAny | .NET 8.0 | 17.491 ms | 1.00 |
| HasAnyAccented_LastIndexOfAny | .NET 9.0 | 5.253 ms | 0.30 |

In many of my examples, I've used `ContainsAny` rather than `IndexOfAny`. The former is functionally equivalent to `input.IndexOfAny(searchValues) >= 0`, and in fact that was the entirety of its implementation in .NET 8. However, as `IndexOfAny` employs vectorization and compares multiple elements as part of the same instruction, when a match is found, there is a bit of overhead involved in then determining exactly which element matched (or, if multiple matched, which match had the lowest index). `ContainsAny` doesn't actually need to care about the exact index: as exemplified by its implementation, it only cares about whether there was a match rather than where one was. As such, we can shave off some cycles by customizing the implementation for `ContainsAny` to avoid that unnecessary computation, and that's exactly what dotnet/runtime#96924 does. The effects of this are most notable where that overhead would be measurable, which is when a match is found very early in the input.
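As a rough illustration of why skipping the index computation helps, consider the match bitmask a vectorized comparison produces (a hypothetical Python sketch, not the actual .NET code): an `IndexOfAny`-style result must locate the lowest set bit, while a `ContainsAny`-style result only has to test the mask against zero.

```python
# After comparing a vector of characters against the needle, matches are a
# bitmask: bit i set means element i matched.

def first_match_index(mask):
    # IndexOfAny-style: locate the lowest set bit (a trailing-zero count).
    return (mask & -mask).bit_length() - 1 if mask else -1

def any_match(mask):
    # ContainsAny-style: a plain zero test, with no bit-position computation.
    return mask != 0
```

On real hardware the difference is a few instructions per found match, which is why the savings show up most when matches occur very early.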
```csharp
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Buffers;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private string _haystack = "Hello, world! How are you today?";
    private static readonly SearchValues<char> s_vowels = SearchValues.Create("aeiou");

    [Benchmark]
    public bool ContainsAny() => _haystack.AsSpan().ContainsAny(s_vowels);
}
```
| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| ContainsAny | .NET 8.0 | 3.640 ns | 1.00 |
| ContainsAny | .NET 9.0 | 2.382 ns | 0.65 |

Improvements around `SearchValues` aren't limited just to new APIs or the implementation of those APIs; there's also been work to help developers better consume `SearchValues`. dotnet/roslyn-analyzers#6898 and dotnet/roslyn-analyzers#7252 added a new analyzer (CA1870) that will find opportunities to use `SearchValues` and automatically fix the call sites to do so.

It's also worth highlighting that there have been improvements around `IndexOf` / `Contains` in .NET 9 beyond `SearchValues`. One simple but interesting change is dotnet/runtime#97632, which simply added an `if` block to `string.Contains(string)`:
```csharp
public bool Contains(string value)
{
    if (value == null)
        ThrowHelper.ThrowArgumentNullException(ExceptionArgument.value);

    // PR added this if block
    if (RuntimeHelpers.IsKnownConstant(value) && value.Length == 1)
        return Contains(value[0]);

    return SpanHelpers.IndexOf(
        ref _firstChar,
        Length,
        ref value._firstChar,
        value.Length) >= 0;
}
```
What's interesting about this is that the `SpanHelpers.IndexOf` it delegates to already contains a fast path that special-cases single-character strings:
```csharp
if (valueTailLength == 0)
{
    // for single-char values use plain IndexOf
    return IndexOfChar(ref searchSpace, value, searchSpaceLength);
}
```
Why then is this extra `if` block helpful? It's taking advantage of that same internal `IsKnownConstant` intrinsic we saw earlier. The JIT will always compile that check down to a `true`/`false` constant, so it ends up adding no runtime overhead. If the value is `false`, the whole `if` block evaporates. But if it's `true`, that necessarily means the argument passed to the method is recognized by the JIT as being a constant, e.g. a developer called `someString.Contains("-")` such that the JIT can see that `value` is `"-"`. In such a case, the JIT also knows `value.Length`, meaning it can see at compile time whether the length is `1` or not. And that in turn means this whole method becomes:
```csharp
public bool Contains(string value)
{
    if (value == null)
        ThrowHelper.ThrowArgumentNullException(ExceptionArgument.value);

    return SpanHelpers.IndexOf(
        ref _firstChar,
        Length,
        ref value._firstChar,
        value.Length) >= 0;
}
```
if the JIT can’t prove the argument is a constant or if it’s not exactly one character in length, or:
```csharp
public bool Contains(string value)
{
    return Contains('the constant char');
}
```
if it can. This eliminates a bit of overhead from the call.
```csharp
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private string _input = "!@#$%^&";

    [Benchmark]
    public bool Contains() => _input.Contains("$");
}
```
| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| Contains | .NET 8.0 | 3.7649 ns | 1.00 |
| Contains | .NET 9.0 | 0.9614 ns | 0.26 |

Regex

Regular expression support in .NET has received a lot of love over the past few years. The implementation was overhauled in .NET 5 to yield significant performance gains, and then in .NET 7 it not only saw another round of huge performance gains, it also gained a source generator, a new non-backtracking implementation, and more. In .NET 8, it saw additional performance improvements, in part because of using `SearchValues`.

Now in .NET 9, the trend continues. First and foremost, it's important to recognize that many of the changes discussed thus far implicitly accrue to `Regex`. `Regex` already uses `SearchValues`, and so improvements to `SearchValues` benefit `Regex` (it's one of my favorite things about working at the lowest levels of the stack: improvements there have a multiplicative effect, in that direct use of them improves, but so too does indirect use via intermediate components that instantly get better as the lower level does). Beyond that, though, `Regex` has increased its reliance on `SearchValues`.

There are multiple engines backing `Regex` today:
- An interpreter, which is what you get when you don't explicitly ask for one of the other engines.
- A reflection-emit-based compiler, which at run-time emits custom IL for the specific regular expression and options. This is what you get when you specify `RegexOptions.Compiled`.
- A non-backtracking engine, which doesn't support all of `Regex`'s features but which guarantees O(N) throughput in the length of the input. This is what you get when you specify `RegexOptions.NonBacktracking`.
- And a source generator, which is very similar to the compiler, except it emits C# at build-time rather than emitting IL at run-time. This is what you get when you use `[GeneratedRegex(...)]`.

As of dotnet/runtime#98791, dotnet/runtime#103496, and dotnet/runtime#98880, all of the engines other than the interpreter avail themselves of the new `SearchValues<string>` support (the interpreter could as well, but we assume someone using the interpreter is optimizing for the speed of `Regex` construction, and the analyses involved in choosing to use `SearchValues<string>` can take measurable time). The best way to see what this looks like is via the source generator, as we can easily examine the code it outputs in both .NET 8 and .NET 9. Consider this code:
```csharp
using System.Text.RegularExpressions;

internal partial class Example
{
    [GeneratedRegex("(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday): (.*)", RegexOptions.IgnoreCase)]
    public static partial Regex ParseEntry();
}
```
In Visual Studio, you can right-click on `ParseEntry`, select "Go To Definition," and the tool will take you to the C# code for this pattern as generated by the regular expression source generator (the pattern is looking for a day of the week, followed by a colon, followed by any text, and is capturing both the day and that subsequent text into capture groups for subsequent examination). The generated code contains two relevant methods: a `TryFindNextPossibleStartingPosition` method, which is used to skip ahead as quickly as possible to the first location that might possibly match, and a `TryMatchAtCurrentPosition` method, which performs the full match attempt at that location. For our purposes here, we care about `TryFindNextPossibleStartingPosition`, as that's the most impactful place `SearchValues` shows up. On .NET 8, we see this code:
```csharp
private bool TryFindNextPossibleStartingPosition(ReadOnlySpan<char> inputSpan)
{
    int pos = base.runtextpos;
    ulong charMinusLowUInt64;

    // Any possible match is at least 8 characters.
    if (pos <= inputSpan.Length - 8)
    {
        // The pattern matches a character in the set [ADSYadsy] at index 5.
        // Find the next occurrence. If it can't be found, there's no match.
        ReadOnlySpan<char> span = inputSpan.Slice(pos);
        for (int i = 0; i < span.Length - 7; i++)
        {
            int indexOfPos = span.Slice(i + 5).IndexOfAny(Utilities.s_ascii_1200080212000802);
            if (indexOfPos < 0)
            {
                goto NoMatchFound;
            }
            i += indexOfPos;

            if (((long)((0x8106400081064000UL << (int)(charMinusLowUInt64 = (uint)span[i] - 'F')) & (charMinusLowUInt64 - 64)) < 0) &&
                ((long)((0x8023400080234000UL << (int)(charMinusLowUInt64 = (uint)span[i + 3] - 'D')) & (charMinusLowUInt64 - 64)) < 0))
            {
                base.runtextpos = pos + i;
                return true;
            }
        }
    }

    // No match found.
    NoMatchFound:
    base.runtextpos = inputSpan.Length;
    return false;
}
```
The code is using an `IndexOfAny` with `Utilities.s_ascii_1200080212000802`; what is that? It's a `SearchValues<char>`:
```csharp
/// <summary>Supports searching for characters in or not in "ADSYadsy".</summary>
internal static readonly SearchValues<char> s_ascii_1200080212000802 = SearchValues.Create("ADSYadsy");
```
The source generator is employing the approach we looked at earlier, searching for a single character from each string. Here it's decided that its best chance for an optimal search is to look for the character at offset 5 in each string, so 'y' for "Monday", 'a' for "Tuesday", etc., plus the upper-case variants since `RegexOptions.IgnoreCase` was specified. Then, after the single-character search, it does a quick test of a couple of other positions in the string to weed out false positives, looking at the 0th offset to ensure the character is in the set `[MTWFSmtwfs]` and at the 3rd offset to ensure the character is in the set `[DSNRUdsnru]`. (The check for those is obscured by its use of a branchless technique to query a bitmap stored in a 64-bit `ulong`.)

Now, here's what we get in .NET 9:
```csharp
private bool TryFindNextPossibleStartingPosition(ReadOnlySpan<char> inputSpan)
{
    int pos = base.runtextpos;

    // Any possible match is at least 8 characters.
    if (pos <= inputSpan.Length - 8)
    {
        // The pattern has multiple strings that could begin the match. Search for any of them.
        // If none can be found, there's no match.
        int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_OrdinalIgnoreCase_B7E3C0B8368AC400913BEA56D1872F43698FDA2C54D1AD4886F6734244613374);
        if (i >= 0)
        {
            base.runtextpos = pos + i;
            return true;
        }
    }

    // No match found.
    base.runtextpos = inputSpan.Length;
    return false;
}
```
Again, we see an `IndexOfAny`, but notice that the subsequent checks for other positions are gone. Why? Because the `SearchValues` being passed to the `IndexOfAny` now is a `SearchValues<string>`, and thus already confirms that one of the provided strings matches:
```csharp
/// <summary>Supports searching for the specified strings.</summary>
internal static readonly SearchValues<string> s_indexOfAnyStrings_OrdinalIgnoreCase_B7E3C0B8368AC400913BEA56D1872F43698FDA2C54D1AD4886F6734244613374 =
    SearchValues.Create(
        ["monday", "tuesday", "wednesda", "thursday", "friday", "saturday", "sunday"],
        StringComparison.OrdinalIgnoreCase);
```
The sharp-eyed amongst you might notice that there's no 'y' at the end of "Wednesday"; that's simply due to a heuristic in the `Regex` implementation: when it searches for strings to use as part of such a `SearchValues<string>`, it limits itself to strings of no more than 8 characters. And "searches" is an appropriate word here, as the implementation isn't limited just to clean alternations as in the previous example. If I instead change the program to be:
```csharp
using System.Text.RegularExpressions;

internal partial class Example
{
    [GeneratedRegex("[Aa]([Bb][Cc]|[Dd][Ee])")]
    public static partial Regex ParseEntry();
}
```
we still end up with a `SearchValues<string>`, now for this:
```csharp
/// <summary>Supports searching for the specified strings.</summary>
internal static readonly SearchValues<string> s_indexOfAnyStrings_OrdinalIgnoreCase_33A76C255741CD9630059173F803FB92EBDDFBF62328261428CF8838D6379CE9 =
    SearchValues.Create(["abc", "ade"], StringComparison.OrdinalIgnoreCase);
```
Interestingly, as of dotnet/runtime#96402, `SearchValues<string>` will also be used when doing a single-string search. As previously noted, `IndexOf(string)` will try to pick two characters and do a vectorized search for both, whereas `SearchValues<string>` for that same input can spend a bit more time trying to pick more characters, and characters that will be better for the search. As such, `Regex` now opts to use `SearchValues<string>` as part of `TryFindNextPossibleStartingPosition`. We can see this with the following benchmarks that count the number of occurrences of the words "Hello" or "earth":
```csharp
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.RegularExpressions;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public partial class Tests
{
    public static readonly string s_markTwain = new HttpClient().GetStringAsync("https://www.gutenberg.org/cache/epub/3200/pg3200.txt").Result;

    [GeneratedRegex(@"\bHello\b")]
    public static partial Regex FindHello();

    [GeneratedRegex(@"\bearth\b")]
    public static partial Regex FindEarth();

    [Benchmark]
    public int CountHello() => FindHello().Count(s_markTwain);

    [Benchmark]
    public int CountEarth() => FindEarth().Count(s_markTwain);
}
```
On .NET 8, the code generated for `TryFindNextPossibleStartingPosition` for `FindEarth` includes:

```csharp
int i = inputSpan.Slice(pos).IndexOf("earth");
```

whereas on .NET 9, the generated code is:

```csharp
int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfString_earth_Ordinal);
...
internal static readonly SearchValues<string> s_indexOfString_earth_Ordinal =
    SearchValues.Create(["earth"], StringComparison.Ordinal);
```
And the results:
| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| CountHello | .NET 8.0 | 2.020 ms | 1.00 |
| CountHello | .NET 9.0 | 2.042 ms | 1.01 |
| CountEarth | .NET 8.0 | 2.738 ms | 1.00 |
| CountEarth | .NET 9.0 | 2.339 ms | 0.85 |
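The two-character anchoring that `IndexOf(string)` performs can be sketched as follows (a hypothetical scalar Python illustration; the real implementation compares both anchor characters across entire vectors at a time and picks the anchor pair heuristically):

```python
# Anchor a substring search on the needle's first and last characters: only
# positions where both match at the right distance get the full (more
# expensive) comparison.
def index_of(haystack, needle):
    first, last = needle[0], needle[-1]
    offset = len(needle) - 1
    for i in range(len(haystack) - offset):
        if haystack[i] == first and haystack[i + offset] == last:
            if haystack[i:i + len(needle)] == needle:
                return i
    return -1
```

Choosing rarer anchor characters means fewer candidate positions survive to the full comparison, which is exactly the extra analysis `SearchValues.Create` can afford to do up front.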
This highlights that using `SearchValues<string>` in this one-string case doesn't always help, but it can improve things, in particular in situations where the extra one-time work done by `SearchValues.Create` enables it to find meaningfully better characters for which to search.

My seeming obsession with `SearchValues<T>` might lead one to believe that it's the only source of improvements in `Regex`, but that's far from the truth. There are many other PRs in .NET 9 focused on different aspects of the area.

dotnet/runtime#93190 is a nice addition. One of the optimizations introduced to `Regex` in .NET 7 was a "literal-after-loop" search. A lot of effort goes into finding ways to help `Regex`'s `TryFindNextPossibleStartingPosition` be as efficient as possible at skipping unnecessary locations, and this "literal-after-loop" search is one such mechanism. It looks for a particular shape of pattern, where the pattern starts with a loop that's then followed by a literal. For example, the industry regex benchmarks in mariomka/regex-benchmark include this pattern for finding URIs: `[\w]+://[^/\s?#]+[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?`
The pattern starts with a word character loop. We don't have a good way to vectorize a search for any word character, nor would we really want to; there are over 50,000 word characters in the `\w` set, and in most inputs we'd typically find an occurrence so quickly that the vectorization wouldn't be worthwhile. However, the `"://"` that follows is easily searchable and much less likely to occur, making it a good candidate for `TryFindNextPossibleStartingPosition`. But we can't just search for the `"://"`, because it doesn't start the pattern, nor is it at a fixed offset from the beginning of the pattern that would enable us to find the `"://"` and then jump backwards a known number of positions. Instead, with the "literal-after-loop" optimization, we can find the `"://"` and then iterate backwards to the beginning of the loop in order to find the actual starting position for the match attempt (we can also keep track of where the loop ends so that we don't need to re-match it).

There were, however, a number of gaps in this optimization. Most notably, the implementation needs to examine the pattern to determine whether the optimization is applicable, and if the starting loop was wrapped in a capture or an atomic group, it was unnecessarily giving up, failing to discover the loop for the purposes of enabling the "literal-after-loop" mechanism. The search would also give up if the literal after the loop was a set inside various grouping constructs, like a concatenation.
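The core literal-after-loop idea, finding the cheap-to-search literal and then walking backwards over the preceding loop, can be sketched like this (a hypothetical Python illustration simplified to an ASCII `[\w]+://` prefix; the real engine handles arbitrary loops and literals):

```python
# Find the literal "://", then scan backwards over the preceding
# word-character loop to locate where the match attempt should begin.
def find_uri_start(text):
    i = text.find("://")
    if i < 0:
        return -1  # the literal is absent, so the pattern can't match at all
    start = i
    while start > 0 and (text[start - 1].isalnum() or text[start - 1] == "_"):
        start -= 1
    return start
```

The literal search is vectorizable and rarely hits, so the backwards scan runs infrequently, which is what makes this shape of prefix search profitable.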
This PR fixed those gaps. The impact of this can be seen by looking at another industry benchmark, this time from the BurntSushi/rebar repo:
```csharp
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.RegularExpressions;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public partial class Tests
{
    private static string s_haystack = new HttpClient().GetStringAsync("https://raw.githubusercontent.com/BurntSushi/rebar/master/benchmarks/haystacks/wild/cpython-226484e4.py").Result;

    [GeneratedRegex(@"(\s*)((?:# [Nn][Oo][Qq][Aa])(?::\s?(([A-Z]+[0-9]+(?:[,\s]+)?)+))?)")]
    private static partial Regex RuffNoQA();

    [Benchmark]
    public int Count() => RuffNoQA().Count(s_haystack);
}
```
The impact of the literal-after-loop optimization ends up being obvious in the resulting numbers:
| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| Count | .NET 8.0 | 197.47 ms | 1.00 |
| Count | .NET 9.0 | 18.67 ms | 0.09 |
Improvements in `Regex` go beyond just the initial searching, as well. An interesting change comes in dotnet/runtime#98723, not because it results in massive performance improvements (though it does yield some nice benefits), but rather because it highlights how improvement possibilities can be found in all manner of places. One of the areas we (and pretty much everyone else in the world, it seems) have been investing a lot of energy into lately is AI, including tokenizers, which are components that take text and translate it into a series of numbers that are meaningful to the model into which they'll be fed. Each model is trained on a set of tokens from a specific tokenizer, and in the case of OpenAI's models, that tokenizer algorithm is "tiktoken." The official .NET implementation of tiktoken lives in the Microsoft.ML.Tokenizers library, and as part of implementing tiktoken, it follows the reference implementation provided by OpenAI, which uses a regular expression as part of parsing; for consistency and to help ensure correctness, the .NET implementation does as well. This regex includes the following pattern: `(?i:'s|'t|'re|'ve|'m|'ll|'d)`

What jumped out about this pattern is that it should trigger an optimization in the regex source generator that emits alternations like this as a C# `switch` statement, as the analyzer should be able to determine that all branches of the alternation are distinct, such that picking one branch because the first character matches necessarily means that no other branch could match. The benefit of a `switch` here is that it allows the C# compiler to emit a jump table, which means we're not stuck exploring each branch when we could instead jump right to the correct one. But that optimization wasn't kicking in. Why? A series of unfortunate events. An earlier optimization was seeing the `ll` and rewriting it into a repeater (`l{2}`), which then defeated this alternation optimization, because the implementation wasn't written to examine loops. Loops were explicitly being skipped because a loop could be empty, and if empty, it wouldn't have a first character required by the `switch`. However, we can check whether a loop has a non-zero minimum bound on its number of iterations, as it does in this case, and in such cases we can still factor in up to that minimum number of iterations, which are all guaranteed. This PR improved the analysis to handle loops well, as evidenced by this micro-benchmark (which has been crafted to accentuate this aspect of the pattern):
```csharp
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.RegularExpressions;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public partial class Tests
{
    private static string s_haystack = new string('y', 10_000);

    [GeneratedRegex("(?i:a|bb|c|dd|e|ff|g|hh|i|jj|k|ll|m|nn|o|pp|q|rr|s|tt|u|vv|w|xx|y|zz)")]
    public static partial Regex Parse();

    [Benchmark]
    public int Count() => Parse().Count(s_haystack);
}
```
| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| Count | .NET 8.0 | 315.7 us | 1.00 |
| Count | .NET 9.0 | 211.6 us | 0.67 |
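The distinct-alternation idea can be approximated with a first-character dispatch table (a hypothetical Python sketch, and one that assumes the common `'` prefix of the tiktoken alternation has already been matched; the generated C# uses an actual `switch` jump table rather than a dictionary):

```python
# The branches s|t|re|ve|m|ll|d are distinct by first character, so one
# lookup selects the only branch that could possibly match, instead of
# trying each branch in turn.
branches = ["s", "t", "re", "ve", "m", "ll", "d"]
dispatch = {}
for b in branches:
    dispatch.setdefault(b[0], []).append(b)

def match_suffix(text):
    for b in dispatch.get(text[:1], []):  # O(1) selection, not O(#branches)
        if text.startswith(b):
            return b
    return None
```

The loop-minimum fix described above is what lets `ll`, after being rewritten to `l{2}`, still contribute its guaranteed first character `l` to this dispatch.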
[/td]The non-backtracking engine also got some nice attention in .NET 9. dotnet/runtime#102655 from @ieviev (who submitting a small subset of the changes they’d made as part of some exciting regex research being done in a fork of the library), followed by dotnet/runtime#104766 and dotnet/runtime#105668 made a variety of changes to the non-backtracking implementation, including:
- DFA limits. The non-backtracking implementation works by constructing a finite automata, which can be thought of as a graph, with the implementation walking around the graph as it consumes additional characters from the input and uses those to guide what node(s) it transitions to next. The graph is built out lazily, such that nodes are only added as those states are explored, and the nodes can be one of two kinds: DFA (deterministic) or NFA (non-deterministic). DFA nodes ensure that for any given character that comes next in the input, there’s only ever one possible node to which to transition. Not so for NFA, where at any point in time there’s a list of all the possible nodes the system could be in, and moving to the next state means examining each of the current states, finding all possible transitions out of each, and treating the union of all of those new positions as the next state. DFA is thus much cheaper than NFA in terms of the overheads involved in walking around the graph, and we want to fall back to NFA only when we absolutely have to, which is when the DFA graph would be too large: some patterns have the potential to create massive numbers of DFA nodes. Thus, there’s a threshold where once that number of constructed nodes in the graph is hit, new nodes are constructed as NFA rather than DFA. In .NET 8 and earlier, that limit was somewhat arbitrarily set at 10,000. For .NET 9 as part of this PR, analysis was done to show that a much higher limit was worth the memory trade-offs, and the limit was raised to 125,000, which means many more patterns can fully execute as DFA.
- Minterm mappings. The implementation works in terms of “minterms,” which are equivalence classes for all characters that behave the same in the pattern. For example, with the pattern
"[a-z]*"
, the lowercase ASCII letters are all treated the same and are all treated differently from every other character, so there are two minterms here, one for the 26 lowercase ASCII letters, and the other for the remaining 65,510 characters. This is used as a compression mechanism, as rather than needing to describe the transitions between nodes for every character, the system can instead do so for every minterm. Of course, that means during matching there’s a step where a character needs to be mapped to its minterm in order to know which edge to follow to the next state. Previously, that mapping was cached for all ASCII characters but recomputed each time for non-ASCII (recomputing it amounts to a binary search on a tree data structure). As you can imagine, this can lead to significant overhead when non-ASCII is encountered. Now in .NET 9, mappings for all characters represented in the pattern are stored. In degenerate cases, this can measurably increase memory consumption for theRegex
instance, but on average it doesn’t; in fact, for common cases the new scheme actually reduces memory consumption, as it takes into account the fact that all but the most niche patterns have fewer than 256 minterms, and the per-character mapping can thus be stored in abyte
rather than aushort
oruint
. Additionally, for cases where only a subset of ASCII is used in the pattern (which is common), theRegex
instance needn’t allocate an array to represent all 128 ASCII characters, but can instead be shrunk to only support those characters that need be represented. - Timeout checks.
Regex
has long supported a timeout mechanism, where if a match operation takes longer than a specified limit, an exception is thrown. This mechanism exists to help mitigate possible regex denial of service (ReDOS) attacks, where a maliciously-constructed pattern when fed to a backtracking engine could lead to “catastrophic backtracking” (you can see an example of this in my Deep .NET discussion on Regex with Scott Hanselman). These timeouts are thus enabled in the interpreter, the compiler, and the source generator. For the non-backtracking engine, timeouts aren’t necessary to avoid catastrophic backtracking, as there is no backtracking. The engine still pays some attention to timeouts, though, purely for consistency with the other engines, yet the frequency of the checks was actually adding measurable overhead in some cases. The PR reduced the frequency of the checks to mitigate that overhead while not meaningfully affecting the effectiveness of the checks. - Hot path inner loop. The inner matching loop is the hot path for a matching operation: read the next character, look up its minterm, follow the corresponding edge to the next node in the graph, rinse and repeat. Performance of the engine is tied to efficiency of this loop. These PRs recognized that there were some checks being performed in that inner loop which were only relevant to a minority of patterns. For the majority, the code could be specialized such that those checks wouldn’t be needed in the hot path.
- General good hygiene. Care was taken to remove unnecessary overheads, such as duplicate array lookups that could be removed, bounds checks that could be avoided, indirect reads via
ref
s that could instead be done against locals, and so on.
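For reference, the timeout the engines check is the one supplied per Regex instance. This is a minimal sketch of opting in (the pattern and input here are purely illustrative):

```csharp
using System;
using System.Text.RegularExpressions;

// A match timeout is supplied via the Regex constructor's matchTimeout parameter;
// if a single match operation exceeds it, RegexMatchTimeoutException is thrown.
// With RegexOptions.NonBacktracking, catastrophic backtracking can't occur, so the
// timeout is honored mainly for consistency with the other engines.
var word = new Regex(@"\b\w+\b", RegexOptions.NonBacktracking, TimeSpan.FromMilliseconds(250));
Console.WriteLine(word.Count("an example input")); // 3
```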
The net result of these changes is that most patterns get faster, some significantly, especially on non-ASCII inputs.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.RegularExpressions;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
public static readonly string s_aristotle = new HttpClient().GetStringAsync("https://www.gutenberg.org/cache/epub/39963/pg39963.txt").Result;
private readonly Regex _word = new Regex(@"\b[\p{IsGreek}]+\b", RegexOptions.NonBacktracking);
[Benchmark]
public int CountWords() => _word.Matches(s_aristotle).Count;
}
Method | Runtime | Mean | Ratio
---|---|---|---
CountWords | .NET 8.0 | 14.808 ms | 1.00
CountWords | .NET 9.0 | 9.673 ms | 0.65

The dotnet/runtime repository employs an automated performance regression testing system, with tests in dotnet/performance constantly running on various operating systems and hardware, with the goal of detecting regressions. When a possible regression is noticed, an issue is opened containing the details. However, the system also notices statistically-significant improvements and opens issues on those as well, just to ensure that we’re all aware of when and how things change in a meaningful way. When possible, the issues reference the PR known to have caused the regression or improvement, so it’s always a treat to see a list of references like this on a PR, as was the case with dotnet/runtime#102655:
Both the non-backtracking engine and the interpreter now also gain additional optimized searching for certain classes of prefixes they didn’t previously support. With dotnet/runtime#100315, patterns that begin with ranges can now be optimized with an
IndexOfAny{Except}InRange
call, whereas previously such patterns would likely result in walking character by character.
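Outside of Regex, that corresponds to the range-based span helper introduced in .NET 8; a pattern with a leading [0-9] can now turn into a single vectorized call along these lines (the text here is purely illustrative):

```csharp
using System;

ReadOnlySpan<char> text = "price: 1234 units";

// One vectorized scan for the first character in the inclusive range '0'-'9',
// instead of walking the input character by character.
int firstDigit = text.IndexOfAnyInRange('0', '9');
Console.WriteLine(firstDigit); // 7
```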
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.RegularExpressions;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
public static readonly string s_markTwain = new HttpClient().GetStringAsync("https://www.gutenberg.org/cache/epub/3200/pg3200.txt").Result;
private readonly Regex _interpreter = new Regex(@"\b[0-9]+\b");
private readonly Regex _nonBacktracking = new Regex(@"\b[0-9]+\b", RegexOptions.NonBacktracking);
[Benchmark]
public int Interpreter() => _interpreter.Count(s_markTwain);
[Benchmark]
public int NonBacktracking() => _nonBacktracking.Count(s_markTwain);
}
Method | Runtime | Mean | Ratio
---|---|---|---
Interpreter | .NET 8.0 | 21.223 ms | 1.00
Interpreter | .NET 9.0 | 1.726 ms | 0.08
NonBacktracking | .NET 8.0 | 21.945 ms | 1.00
NonBacktracking | .NET 9.0 | 1.749 ms | 0.08

Finally,
Regex
gains some new APIs in .NET 9, focused on performance. Regex
currently has a set of Split
overloads; these logically behave like Match
, except instead of returning what matched, they effectively return what’s between the matches, treating the match as a split separator. As with string.Split
, these Regex.Split
methods return a string[]
, which means allocating the array to store all the results and allocating each of the individual string
s. There was also no overload for supporting span inputs, which meant if one had a span to search, that span would first need to be converted into a string, yet another allocation. .NET 7 saw a similar predicament fixed with the introduction of the EnumerateMatches
method, which provided an allocation-free alternative to Match
or Matches
. Now in .NET 9, thanks to dotnet/runtime#103307, Regex
gets new EnumerateSplits
methods that similarly provide an allocation-free way to access the same splits. The method accepts a ReadOnlySpan<char>
, and then rather than returning an array of strings, it returns an enumerator of Range
s pointing into the original.
Code:
// dotnet run -c Release -f net9.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.RegularExpressions;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
public static readonly string s_markTwain = new HttpClient().GetStringAsync("https://www.gutenberg.org/cache/epub/3200/pg3200.txt").Result;
private readonly Regex _whitespace = new Regex(@"\s+", RegexOptions.Compiled);
[Benchmark(Baseline = true)]
public int SplitOnWhitespace_Split()
{
int lengths = 0;
foreach (string split in _whitespace.Split(s_markTwain))
{
lengths += split.Length;
}
return lengths;
}
[Benchmark]
public int SplitOnWhitespace_EnumerateSplits()
{
int lengths = 0;
foreach (Range range in _whitespace.EnumerateSplits(s_markTwain))
{
ReadOnlySpan<char> split = s_markTwain.AsSpan(range);
lengths += split.Length;
}
return lengths;
}
}
Method | Mean | Ratio | Allocated | Alloc Ratio
---|---|---|---|---
SplitOnWhitespace_Split | 189.1 ms | 1.00 | 185305389 B | 1.000
SplitOnWhitespace_EnumerateSplits | 116.6 ms | 0.62 | 272 B | 0.000

Encoding
Base64 encoding has been supported in .NET since the beginning, with methods like
Convert.ToBase64String
and Convert.FromBase64CharArray
. More recently, a plethora of Base64-related APIs have been added, including span-based APIs on Convert
but also a dedicated System.Buffers.Text.Base64
with methods for encoding and decoding between arbitrary bytes and UTF8 text, and most recently for very efficiently checking whether UTF8 and UTF16 text represents a valid Base64 payload.
Base64 is a fairly simple encoding scheme for taking arbitrary binary data and converting it to ASCII text, splitting the input up into groups of 6 bits (2^6 == 64 possible values) and mapping each of those values to a specific character in the Base64 alphabet: the 26 upper-case ASCII letters, the 26 lower-case ASCII letters, the 10 ASCII digits,
'+'
, and '/'
. While this is an incredibly popular encoding mechanism, it runs into problems for some use cases because of the exact choice of alphabet. Including Base64 data in a URI is possibly problematic, as '+'
and '/'
both have special meaning in URIs, as does the special '='
symbol used for padding Base64 data out to a specific length. That means that in addition to Base64-encoding data, the resulting data might also need to be URL-encoded for such a use, both taking additional time and further increasing the size of the payload. To address this, a variant was introduced, Base64Url, which does away with the need for padding, and which uses a slightly different alphabet, '-'
instead of '+'
and '_'
instead of '/'
. Base64Url is used in a variety of domains, including as part of JSON Web Tokens (JWT), where it’s used to encode each segment of the token.
While .NET has had Base64 support for a long time, it hasn’t had Base64Url support, and as such, developers have had to craft their own. Many have done so by layering on top of the Base64 implementations in
Convert
or Base64
. For example, here’s what the core part of ASP.NET’s implementation for WebEncoders.Base64UrlEncode
looked like in .NET 8:
Code:
private static int Base64UrlEncode(ReadOnlySpan<byte> input, Span<char> output)
{
if (input.IsEmpty)
return 0;
Convert.TryToBase64Chars(input, output, out int charsWritten);
for (var i = 0; i < charsWritten; i++)
{
var ch = output[i];
if (ch == '+') output[i] = '-';
else if (ch == '/') output[i] = '_';
else if (ch == '=') return i;
}
return charsWritten;
}
We can obviously write more code to do that more efficiently, but with .NET 9 we don’t have to. With dotnet/runtime#102364, .NET now has a fully-featured
Base64Url
type that is also very efficient. It actually shares almost all of its implementation with the same functionality on Base64
and Convert
, using generic tricks to substitute the different alphabets in an optimized manner. (The ASP.NET implementation has also been updated to use Base64Url
with dotnet/aspnetcore#56959 and dotnet/aspnetcore#57050.)
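As a quick illustration of the new type’s surface area, here’s a small sketch (the input bytes are chosen so the two alphabets visibly differ):

```csharp
using System;
using System.Buffers.Text;

byte[] data = { 0xFB, 0xEF, 0xBE };

// Standard Base64 uses '+' and '/', and pads with '='.
Console.WriteLine(Convert.ToBase64String(data));      // "++++"
Console.WriteLine(Convert.ToBase64String(data[..1])); // "+w=="

// Base64Url swaps in '-' and '_' and omits the padding.
Console.WriteLine(Base64Url.EncodeToString(data));      // "----"
Console.WriteLine(Base64Url.EncodeToString(data[..1])); // "-w"

// And it round-trips.
byte[] decoded = Base64Url.DecodeFromChars(Base64Url.EncodeToString(data));
Console.WriteLine(decoded.AsSpan().SequenceEqual(data)); // True
```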
Code:
// dotnet run -c Release -f net9.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Buffers.Text;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private byte[] _data;
private char[] _destination = new char[Base64.GetMaxEncodedToUtf8Length(1024 * 1024)];
[GlobalSetup]
public void Setup()
{
_data = new byte[1024 * 1024];
new Random(42).NextBytes(_data);
}
[Benchmark(Baseline = true)]
public int Old() => Base64UrlOld(_data, _destination);
[Benchmark]
public int New() => Base64Url.EncodeToChars(_data, _destination);
static int Base64UrlOld(ReadOnlySpan<byte> input, Span<char> output)
{
if (input.IsEmpty)
return 0;
Convert.TryToBase64Chars(input, output, out int charsWritten);
for (var i = 0; i < charsWritten; i++)
{
var ch = output[i];
if (ch == '+')
{
output[i] = '-';
}
else if (ch == '/')
{
output[i] = '_';
}
else if (ch == '=')
{
return i;
}
}
return charsWritten;
}
}
Method | Mean | Ratio
---|---|---
Old | 1,314.20 us | 1.00
New | 81.36 us | 0.06

This also benefits from a set of changes that improved the performance of
Base64
, and thus also Base64Url
, since they now share the same code. dotnet/runtime#92241 from @DeepakRajendrakumaran added an AVX512-optimized Base64 encoding/decoding implementation, and dotnet/runtime#95513 from @SwapnilGaikwad and dotnet/runtime#100589 from @SwapnilGaikwad optimized Base64 encoding and decoding for Arm64.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private byte[] _toEncode;
private char[] _encoded;
[GlobalSetup]
public void Setup()
{
_toEncode = new byte[1000];
new Random(42).NextBytes(_toEncode);
_encoded = new char[Convert.ToBase64String(_toEncode).Length];
}
[Benchmark]
public void ConvertToBase64() => Convert.ToBase64CharArray(_toEncode, 0, _toEncode.Length, _encoded, 0);
}
Method | Runtime | Mean | Ratio
---|---|---|---
ConvertToBase64 | .NET 8.0 | 104.55 ns | 1.00
ConvertToBase64 | .NET 9.0 | 60.19 ns | 0.58

Another simpler form of encoding is hex, effectively employing an alphabet of 16 characters (for each group of 4 bits) rather than 64 (for each group of 6 bits). .NET 5 introduced the
Convert.ToHexString
set of methods, which take an input ReadOnlySpan<byte>
or byte[]
and produce an output string
with two hex chars per input byte. The alphabet selected for that encoding is the hexadecimal characters of ‘0’ through ‘9’ and then upper-case ‘A’ through ‘F’. That’s great when you want upper-case, but sometimes you want the lower-case ‘a’ through ‘f’ instead. As a result, it’s not uncommon now to see calls like this:
string result = Convert.ToHexString(bytes).ToLowerInvariant();
where
ToHexString
produces one string and then ToLowerInvariant
possibly produces another (“possibly” because it’ll only need to create a new string if the data contained any letters). With .NET 9 and dotnet/runtime#92483 from @determ1ne, the new Convert.ToHexStringLower
methods may be used to go directly to the lower-case version; that PR also introduced the TryToHexString
and TryToHexStringLower
methods, which format directly into a provided destination span rather than allocating anything.
Code:
// dotnet run -c Release -f net9.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Buffers.Text;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private byte[] _data = new byte[100];
private char[] _dest = new char[200];
[GlobalSetup]
public void Setup() => new Random(42).NextBytes(_data);
[Benchmark(Baseline = true)]
public string Old() => Convert.ToHexString(_data).ToLowerInvariant();
[Benchmark]
public string New() => Convert.ToHexStringLower(_data);
[Benchmark]
public bool NewTry() => Convert.TryToHexStringLower(_data, _dest, out int charsWritten);
}
Method | Mean | Ratio | Allocated | Alloc Ratio
---|---|---|---|---
Old | 136.69 ns | 1.00 | 848 B | 1.00
New | 119.09 ns | 0.87 | 424 B | 0.50
NewTry | 21.97 ns | 0.16 | – | 0.00

Before .NET 5 introduced
Convert.ToHexString
, there actually already was some functionality in .NET for converting bytes to hex: BitConverter.ToString
. BitConverter.ToString
does the same thing Convert.ToHexString
now does, except inserting dashes between every two hex characters (i.e. between every byte). As a result, it became fairly common for folks that wanted the equivalent of ToHexString
to instead write BitConverter.ToString(bytes).Replace("-", "")
. It’s so common to want the dashes removed, in fact, that it’s what GitHub Copilot suggests for me just by typing BitConverter.ToString.
Of course, that operation is much more expensive (and more complicated) than just using Convert.ToHexString
, so it’d be nice to help developers switch over to ToHexString{Lower}
. That’s exactly what dotnet/roslyn-analyzers#6967 from @mpidash does. CA1872 will now flag both cases that can be converted to Convert.ToHexString
and cases that can be converted to Convert.ToHexStringLower
. And that’s good for performance, as the difference is quite stark:
Code:
// dotnet run -c Release -f net9.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private byte[] _bytes = Enumerable.Range(0, 100).Select(i => (byte) i).ToArray();
[Benchmark(Baseline = true)]
public string WithBitConverter() => BitConverter.ToString(_bytes).Replace("-", "").ToLowerInvariant();
[Benchmark]
public string WithConvert() => Convert.ToHexStringLower(_bytes);
}
Method | Mean | Ratio | Allocated | Alloc Ratio
---|---|---|---|---
WithBitConverter | 1,707.46 ns | 1.00 | 1472 B | 1.00
WithConvert | 61.66 ns | 0.04 | 424 B | 0.29

There are a variety of reasons for that difference, including the obvious one that the
Replace
is having to search the input, find all the dashes, and allocate a brand new string without them. However, BitConverter.ToString
is also slower in general as it’s not as easily vectorized, due to needing to insert dashes between the resulting characters.
In the other direction,
Convert.FromHexString
decodes a string of hex back into a new byte[]
. dotnet/runtime#86556 from @hrrrrustic adds overloads of FromHexString
that write into a destination span rather than allocating a new byte[]
each time.
Code:
// dotnet run -c Release -f net9.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Buffers;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private string _hex = string.Concat(Enumerable.Repeat("0123456789abcdef", 10));
private byte[] _dest = new byte[100];
[Benchmark(Baseline = true)]
public byte[] FromHexString() => Convert.FromHexString(_hex);
[Benchmark]
public OperationStatus FromHexStringSpan() => Convert.FromHexString(_hex.AsSpan(), _dest, out int charsWritten, out int bytesWritten);
}
Method | Mean | Ratio | Allocated | Alloc Ratio
---|---|---|---|---
FromHexString | 33.78 ns | 1.00 | 104 B | 1.00
FromHexStringSpan | 18.22 ns | 0.54 | – | 0.00

Span, Span, and more Span
The introduction of
Span<T>
and ReadOnlySpan<T>
back in .NET Core 2.1 has revolutionized how we write .NET code (especially in the core libraries) and what APIs we expose (see A Complete .NET Developer’s Guide to Span if you’re interested in a deeper dive). .NET 9 has continued the trend of doubling-down on spans as a great way to both implicitly provide performance boosts and also expose APIs that enable developers to do more for performance in their own code.
One great example of this is the new C# 13 support for “params collections,” which merged into the C# compiler’s main branch in dotnet/roslyn#72511. This feature enables the C#
params
keyword to be used with more than just array parameters, but rather any collection type that’s usable with collection expressions… that includes span. In fact, the feature makes it so that if there are two overloads, one taking a params T[]
and one taking a params ReadOnlySpan<T>
, the latter overload will win overload resolution. Moreover, the code generated for a call site for a params ReadOnlySpan<T>
is the same non-allocating approach you get for collection expressions, e.g. given code like this:
Code:
using System;
public class C
{
public void M()
{
Helpers.DoAwesomeStuff("Hello", "World");
}
}
public static class Helpers
{
public static void DoAwesomeStuff<T>(params T[] values) { }
public static void DoAwesomeStuff<T>(params ReadOnlySpan<T> values) { }
}
the IL the C# compiler generates for
C.M
will be equivalent to something like the following C#:
Code:
<>y__InlineArray2<string> buffer = default;
<PrivateImplementationDetails>.InlineArrayElementRef<<>y__InlineArray2<string>, string>(ref buffer, 0) = "Hello";
<PrivateImplementationDetails>.InlineArrayElementRef<<>y__InlineArray2<string>, string>(ref buffer, 1) = "World";
Helpers.DoAwesomeStuff(<PrivateImplementationDetails>.InlineArrayAsReadOnlySpan<<>y__InlineArray2<string>, string>(ref buffer, 2));
This is using the
[InlineArray]
feature introduced in .NET 8 to stack-allocate a span of strings, and then pass that span into the method. No heap allocation. This is awesome for library developers, because it means any place where you have a method taking a params T[]
, you can add a params ReadOnlySpan<T>
overload, and when consuming code calling that method recompiles, it just gets better. dotnet/runtime#101308 and dotnet/runtime#101499 rely on that to add ~40 new overloads for methods that didn’t previously accept spans and now do, and added params
to over 20 existing overloads that were already taking spans. For example, if code had been using Path.Join
to build up a path comprised of five or more segments, it previously would have been using the params string[]
overload, but now upon recompilation it’ll switch to using the params ReadOnlySpan<string>
overload, and won’t need to allocate a string[]
for the inputs.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Buffers;
using System.Numerics;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
[Benchmark]
public string Join() => Path.Join("a", "b", "c", "d", "e");
}
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio
---|---|---|---|---|---
Join | .NET 8.0 | 30.83 ns | 1.00 | 104 B | 1.00
Join | .NET 9.0 | 24.85 ns | 0.81 | 40 B | 0.38

The C# compiler has also improved around spans in other ways. For example, dotnet/roslyn#71261 extends the assembly data support for initializing arrays and
ReadOnlySpan<T>
to also apply to stackalloc
. If you have code like this:var array = new char[] { 'a', 'b', 'c', 'd', 'e', 'f', 'g' };
the compiler will generate code along the lines of the following:
Code:
char[] array = new char[7];
RuntimeHelpers.InitializeArray(array, (RuntimeFieldHandle)&<PrivateImplementationDetails>.FD43C34A357FF620C00C04D0247059F8628CBB3DB349DF05DFA15EF6C7AC514C2);
The compiler has taken that char data and blit it into the assembly; then when it creates the array, rather than setting each individual value into the array, it just copies that data directly from the assembly into the array. Similarly, if you have:
ReadOnlySpan<char> span = new char[] { 'a', 'b', 'c', 'd', 'e', 'f', 'g' };
the compiler recognizes that all of the data is constant and is being stored into a “read-only” location, so it doesn’t actually need to allocate an array. Instead, it emits code like:
Code:
ReadOnlySpan<char> span =
RuntimeHelpers.CreateSpan<char>((RuntimeFieldHandle)&<PrivateImplementationDetails>.FD43C34A357FF620C00C04D0247059F8628CBB3DB349DF05DFA15EF6C7AC514C2);
which effectively creates a span that points directly into the assembly data; no allocation and no copy needed. However, if you have this:
ReadOnlySpan<char> span = stackalloc char[] { 'a', 'b', 'c', 'd', 'e', 'f', 'g' };
or this:
Span<char> span = stackalloc char[] { 'a', 'b', 'c', 'd', 'e', 'f', 'g' };
you’d get codegen more like this:
Code:
char* ptr = stackalloc char[7];
*(char*)ptr = 97;
*(char*)(ptr + 1) = 98;
*(char*)(ptr + 2) = 99;
*(char*)(ptr + 3) = 100;
*(char*)(ptr + 4) = 101;
*(char*)(ptr + 5) = 102;
*(char*)(ptr + 6) = 103;
Span<char> span = new Span<char>(ptr, 7);
But now, thanks to dotnet/roslyn#71261, that last example will also be unified with the same approach for the other constructions, resulting in code more like this:
Code:
char* ptr = stackalloc char[7];
Unsafe.CopyBlockUnaligned(ptr, &<PrivateImplementationDetails>.FD43C34A357FF620C00C04D0247059F8628CBB3DB349DF05DFA15EF6C7AC514C2, 14);
Span<char> span = new Span<char>(ptr, 7);
(the compiler will actually generate a
cpblk
IL instruction rather than a call to Unsafe.CopyBlockUnaligned
).
The C# compiler has also improved its ability to avoid allocations when creating
ReadOnlySpan<T>
from some expressed array constructions or collection expressions. One of the really nice optimizations the C# compiler added several years back was the ability to recognize when a new byte
/sbyte
/bool
array was being constructed, filled with only constants, and directly assigned to a ReadOnlySpan<T>
. In such a case, it would recognize that the data was all blittable and could never be modified, so rather than allocating an array and wrapping a span around it, it would blit the data into the assembly and then just construct a span around a pointer into the assembly data with the appropriate length. So this:
ReadOnlySpan<byte> Values => new[] { (byte)0, (byte)1, (byte)2 };
got lowered into something more like this:
Code:
ReadOnlySpan<byte> Values => new ReadOnlySpan<byte>(
&<PrivateImplementationDetails>.AE4B3280E56E2FAF83F414A6E3DABE9D5FBE18976544C05FED121ACCB85B53FC,
3);
The optimization at the time was limited to only single-byte primitive types because of endianness concerns, but .NET 7 added a
RuntimeHelpers.CreateSpan
method which handled such endianness concerns, so then this was expanded to all such primitive types regardless of size. So this:
Code:
ReadOnlySpan<char> Values1 => new[] { 'a', 'b', 'c' };
ReadOnlySpan<int> Values2 => new[] { 1, 2, 3 };
ReadOnlySpan<long> Values3 => new[] { 1L, 2, 3 };
ReadOnlySpan<DayOfWeek> Values4 => new[] { DayOfWeek.Monday, DayOfWeek.Friday };
gets lowered into something more like this:
Code:
ReadOnlySpan<byte> Values1 => new ReadOnlySpan<byte>(
&<PrivateImplementationDetails>.13E228567E8249FCE53337F25D7970DE3BD68AB2653424C7B8F9FD05E33CAEDF2,
3);
ReadOnlySpan<byte> Values2 => new ReadOnlySpan<byte>(
&<PrivateImplementationDetails>.4636993D3E1DA4E9D6B8F87B79E8F7C6D018580D52661950EABC3845C5897A4D4,
3);
ReadOnlySpan<byte> Values3 => new ReadOnlySpan<byte>(
&<PrivateImplementationDetails>.E2E2033AE7E19D680599D4EB0A1359A2B48EC5BAAC75066C317FBF85159C54EF8,
3);
ReadOnlySpan<byte> Values4 => new ReadOnlySpan<byte>(
&<PrivateImplementationDetails>.ECA75F8497701D6223817CDE38BF42CDD1124E01EF6B705BCFE9A584F7B42F0F4,
2);
Lovely. But… what about types that are supported as constants at the C# level but that aren’t blittable in this fashion? That includes
nint
and nuint
(which vary in size based on the bitness of the process), decimal
(for which a constant is actually represented in metadata via a [DecimalConstant(...)]
attribute), and string
(which is a reference type). In those cases, even though we’re still targeting something that can’t be mutated and we’re still using constants, we still get the array allocation:
Code:
ReadOnlySpan<nint> Values1 => new nint[] { 1, 2, 3 };
ReadOnlySpan<nuint> Values2 => new nuint[] { 1, 2, 3 };
ReadOnlySpan<decimal> Values3 => new[] { 1m, 2m, 3m };
ReadOnlySpan<string> Values4 => new[] { "a", "b", "c" };
which are lowered to, well, themselves, such that there’s still an allocation. Or, at least there was. Thanks to dotnet/roslyn#69820, these cases are now handled as well. They’re addressed by lazily allocating an array that’s then cached for all subsequent use. So now, that same example gets lowered into the equivalent of something more like this:
Code:
ReadOnlySpan<nint> Values1 =>
<PrivateImplementationDetails>.4636993D3E1DA4E9D6B8F87B79E8F7C6D018580D52661950EABC3845C5897A4D_B8 ??=
new nint[] { 1, 2, 3 };
ReadOnlySpan<nuint> Values2 =>
<PrivateImplementationDetails>.4636993D3E1DA4E9D6B8F87B79E8F7C6D018580D52661950EABC3845C5897A4D_B16 ??=
new nuint[] { 1, 2, 3 };
ReadOnlySpan<decimal> Values3 =>
<PrivateImplementationDetails>.04B64E80BCEFE521678C4D6565B6EEBCE2791130A600CCB5D23E1B5538155110_B18 ??=
new[] { 1m, 2m, 3m };
ReadOnlySpan<string> Values4 =>
<PrivateImplementationDetails>.13E228567E8249FCE53337F25D7970DE3BD68AB2653424C7B8F9FD05E33CAEDF_B11 ??=
new[] { "a", "b", "c" };
There are, of course, many more span-related improvements in the libraries, too. One improvement for an existing span-related method is dotnet/runtime#103728, which further optimizes
MemoryExtensions.Count
used to count the number of occurrences of an element in a span. The implementation is vectorized, processing a vector’s worth of data at a time, e.g. if 256-bit vectors are hardware accelerated, and it’s searching chars, it’ll process 16 chars at a time (16 chars * 2 bytes per char * 8 bits per byte == 256 bits). What happens if the number of elements isn’t an even multiple of 16? Then we’re left with some remaining elements after processing the last full vector. Previously the implementation would fall back to processing those remaining elements one at a time; now, it’ll process one last vector at the end of the input. Doing so means we’ll end up re-examining one or more elements we already examined, but that doesn’t really matter, as we can examine all of the elements in approximately the same number of instructions as processing just a single element.
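The end-aligned “one last vector” trick is easiest to see in a simpler setting. This sketch is not the Count implementation itself (Count additionally has to avoid double-counting the overlapped elements); it applies the same idea to a containment test, where re-examining elements is harmless:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

short[] data = new short[37];
for (int i = 0; i < data.Length; i++) data[i] = (short)i;
Console.WriteLine(ContainsValue(data, 36));  // True
Console.WriteLine(ContainsValue(data, 100)); // False

static bool ContainsValue(ReadOnlySpan<short> span, short value)
{
    if (!Vector128.IsHardwareAccelerated || span.Length < Vector128<short>.Count)
    {
        // Scalar fallback for tiny inputs or hardware without vector support.
        foreach (short s in span)
        {
            if (s == value) return true;
        }
        return false;
    }

    ref short start = ref MemoryMarshal.GetReference(span);
    Vector128<short> target = Vector128.Create(value);
    int lastVector = span.Length - Vector128<short>.Count;

    // Process full vectors from the front of the input.
    int i = 0;
    for (; i < lastVector; i += Vector128<short>.Count)
    {
        if (Vector128.EqualsAny(Vector128.LoadUnsafe(ref start, (nuint)i), target))
        {
            return true;
        }
    }

    // One final vector aligned to the end of the input. It may re-examine
    // elements already checked, but that costs no more than examining one.
    return Vector128.EqualsAny(Vector128.LoadUnsafe(ref start, (nuint)lastVector), target);
}
```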
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Buffers;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private char[][] _values = new char[10_000][];
[GlobalSetup]
public void Setup()
{
var rng = new Random(42);
for (int i = 0; i < _values.Length; i++)
{
_values[i] = new char[rng.Next(0, 128)];
rng.NextBytes(MemoryMarshal.AsBytes(_values[i].AsSpan()));
}
}
[Benchmark]
public int Count()
{
int count = 0;
foreach (char[] numbers in _values)
{
count += numbers.AsSpan().Count('a');
}
return count;
}
}
Method | Runtime | Mean | Ratio
---|---|---|---
Count | .NET 8.0 | 133.25 us | 1.00
Count | .NET 9.0 | 74.30 us | 0.56

New span-related functionality also shows up in .NET 9. String splitting is an operation that’s used all over the place; a search for “.Split(” in C# code on GitHub yields millions of hits, and data from a variety of sources suggests that just the simplest overload
Split(params char[]? separator)
is used by upwards of 90% of applications and 20% of nuget packages. So it should come as no surprise that a request to have this functionality for spans is very popular.
The devil is in the details, of course, and it’s taken a long time to figure out exactly how it should be exposed. There are largely two different use cases for splitting we see in the wild. One case is where the content being split has an expected or max number of segments, and splitting is used to extract them. For example,
FileVersionInfo
needs to be able to take a version string and parse from it up to 4 components separated by periods. .NET 8 introduced new Split
extension methods on MemoryExtensions
to address this use case, by having Split
take a destination Span<Range>
to write the bounds of each segment into. That, however, still leaves the second major category of usage, which is for iterating through an unbounded number of segments. A representative example there is this snippet from HttpListener
‘s web sockets implementation:
Code:
string[] requestProtocols = clientSecWebSocketProtocol.Split(',', StringSplitOptions.TrimEntries | StringSplitOptions.RemoveEmptyEntries);
for (int i = 0; i < requestProtocols.Length; i++)
{
if (string.Equals(acceptProtocol, requestProtocols[i], StringComparison.OrdinalIgnoreCase))
{
return true;
}
}
The
clientSecWebSocketProtocol
string is composed of comma-separated values, and this is iterating through them to see if any is equal to the target acceptProtocol
. It’s doing that, though, with a relatively expensive operation. That Split
call needs to allocate the string[]
that’s returned and that holds all the constituent strings, and then each segment results in a string
being allocated. We can do better, and dotnet/runtime#104534 from @bbartels enables that. It adds four new overloads of MemoryExtensions.Split
and MemoryExtensions.SplitAny
:
Code:
public static SpanSplitEnumerator<T> Split<T>(this ReadOnlySpan<T> source, T separator) where T : IEquatable<T>;
public static SpanSplitEnumerator<T> Split<T>(this ReadOnlySpan<T> source, ReadOnlySpan<T> separator) where T : IEquatable<T>;
public static SpanSplitEnumerator<T> SplitAny<T>(this ReadOnlySpan<T> source, params ReadOnlySpan<T> separators) where T : IEquatable<T>;
public static SpanSplitEnumerator<T> SplitAny<T>(this ReadOnlySpan<T> source, SearchValues<T> separators) where T : IEquatable<T>;
With that, this same operation can be written as:
Code:
foreach (Range r in clientSecWebSocketProtocol.AsSpan().Split(','))
{
if (clientSecWebSocketProtocol.AsSpan(r).Trim().Equals(acceptProtocol, StringComparison.OrdinalIgnoreCase))
{
return true;
}
}
In doing so, it becomes allocation-free, as this
Split
doesn’t need to allocate a string[]
to hold results and doesn’t need to allocate a string
for each segment: instead, it’s returning a ref struct
enumerator that yields a Range
representing each segment. The caller can then use that Range
to slice the input. It’s yielding a Range
rather than, say, a ReadOnlySpan<T>
, to enable the splitting to be used with original sources other than spans and be able to get the segments in the original form. For example, if I had a ReadOnlyMemory<T>
and wanted to add segments from it into a list, I could do:
Code:
ReadOnlyMemory<T> source = ...;
List<ReadOnlyMemory<T>> list = ...;
foreach (Range r in source.Span.Split(separator))
{
list.Add(source[r]);
}
whereas that wouldn’t be possible if
Split
forced all yielded results to be spans. You might notice that there’s no
StringSplitOptions
on these overloads. That’s because it’s both not applicable and not necessary. It’s not applicable because we’re working here with T
, which might be something other than char
, but an option like StringSplitOptions.TrimEntries
implies a notion of whitespace, and that’s only relevant for text. And it’s not necessary, because the main benefit of StringSplitOptions
, both TrimEntries
and RemoveEmptyEntries
, is reducing allocation overheads. If these options didn’t exist with the string
overloads, and you wanted to simulate them with our original example (and spans didn’t exist), it would end up looking like this:
Code:
string[] requestProtocols = clientSecWebSocketProtocol.Split(',');
for (int i = 0; i < requestProtocols.Length; i++)
{
if (string.Equals(acceptProtocol, requestProtocols[i].Trim(), StringComparison.OrdinalIgnoreCase))
{
return true;
}
}
There are several possible performance problems here. Imagine the
clientSecWebSocketProtocol
input was "a , b, , , , , , c"
. There are only three entries we care about here ("a"
, "b"
, and "c"
), but the returned array is going to be a string[8]
instead of a string[3]
, because it’s going to have entries for each of those whitespace-only segments. That’s a larger allocation than is necessary. Then, we’ll be producing string
s for all eight of those segments, even though only three of the string
s were necessary. And, all of "a "
, " b"
, and " c"
have some extraneous whitespace that needs to be trimmed, such that the following Trim()
call will allocate a new string for each. The StringSplitOptions
enables the implementation of Split
to avoid all of that overhead, by only allocating what’s desired. But with the span version, none of that allocation exists anyway. The consuming loop can trim the spans itself without incurring more overhead than would the Split
implementation, and the consuming loop can choose to ignore empty entries without increasing the size of a string[]
allocation. The net result is that such operations can be significantly more efficient while sacrificing little, if anything, in the way of maintainability.
Code:
// dotnet run -c Release -f net9.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private string _input = "a , b, , , , , , c";
private string _target = "d";
[Benchmark(Baseline = true)]
public bool ContainsString()
{
foreach (string item in _input.Split(',', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries))
{
if (item.Equals(_target, StringComparison.OrdinalIgnoreCase))
{
return true;
}
}
return false;
}
[Benchmark]
public bool ContainsSpan()
{
foreach (Range r in _input.AsSpan().Split(','))
{
if (_input.AsSpan(r).Trim().Equals(_target, StringComparison.OrdinalIgnoreCase))
{
return true;
}
}
return false;
}
}
| Method         | Mean      | Ratio | Allocated | Alloc Ratio |
|----------------|----------:|------:|----------:|------------:|
| ContainsString | 127.26 ns |  1.00 |     208 B |        1.00 |
| ContainsSpan   |  61.89 ns |  0.49 |         – |        0.00 |

The nature of this new set of splitting APIs is that they find just the next separator/segment; that’s both practical and possibly a performance improvement by itself. It’s practical because we’re only yielding a single segment at a time, and we don’t have anywhere to store all possible found separator positions (nor do we want to allocate space to do so). And it’s a possible performance improvement because the consumer may early-exit from the consuming loop, in which case we don’t want to have spent time unnecessarily searching for additional segments that are going to be ignored. The existing set of splitting APIs, however, hands back all found segments in one go, either via a returned
string[]
or via ranges being written to a destination span. As such, it makes more sense for those overloads to find all separators at once, because that operation can be vectorized. In fact, previous versions have done so. But that vectorization only benefited from 128-bit vectors. With dotnet/runtime#93043 from @khushal1996 in .NET 9, that vectorization will now light up with 512-bit or 256-bit vectors if they’re available, enabling the separator searching that happens as part of splitting to run up to four times faster.

Spans show up in other new methods as well. dotnet/runtime#93938 from @TheMaximum added new overloads of
StringBuilder.Replace
that accept ReadOnlySpan<char>
instead of string
. As is the case with most such overloads, they share the same implementation, with the string
-based overloads just creating a span from the string
and using a span-based implementation. In practice, the majority of use of StringBuilder.Replace
uses constant strings as arguments, for example to escape some known delimiter (Replace("$", "\\$")
), or use previously-created string
instances, such as to remove some substring from text (Replace(substring, "")
). But, there are a minority of cases where Replace
is used with something that’s created on the spot, and for that, these new overloads can help to avoid allocation for creating the arguments. For example, here’s some escaping code used today by MSBuild:
Code:
char[] charsToEscape = ...;
StringBuilder escapedString = ...;
foreach (char unescapedChar in charsToEscape)
{
string escapedCharacterCode = string.Format(CultureInfo.InvariantCulture, "%{0:x00}", (int)unescapedChar);
escapedString.Replace(unescapedChar.ToString(CultureInfo.InvariantCulture), escapedCharacterCode);
}
This is having to perform two
string
allocations to create the input to this Replace
, which is going to be invoked for each char
in charsToEscape
. If charsToEscape
is something fixed, it could be better to avoid these formatting operations per iteration, and instead just cache the necessary strings for all uses, e.g.
Code:
private static readonly char[] charsToEscape = ...;
private static readonly string[] escapedCharsToEscape = charsToEscape.Select(c => $"%{(uint)c:x00}").ToArray();
private static readonly string[] stringsToEscape = charsToEscape.Select(c => c.ToString()).ToArray();
...
for (int i = 0; i < charsToEscape.Length; i++)
{
escapedString.Replace(stringsToEscape[i], escapedCharsToEscape[i]);
}
but if
charsToEscape
isn’t predictable, then we can at least avoid the allocation by employing the new overloads, e.g.
Code:
char[] charsToEscape = ...;
StringBuilder escapedString = ...;
Span<char> escapedSpan = stackalloc char[5];
foreach (char unescapedChar in charsToEscape)
{
escapedSpan.TryWrite($"%{(uint)unescapedChar:x00}", out int charsWritten);
escapedString.Replace(new ReadOnlySpan<char>(in unescapedChar), escapedSpan.Slice(0, charsWritten));
}
and, boom, no more allocation for the arguments.
A variety of other improvements were made to
string
manipulation, mainly around better employing vectorization. StringComparison.OrdinalIgnoreCase
operations were previously vectorized, but only with 128-bit vectors, which means handling up to 8 char
s at a time. Thanks to dotnet/runtime#93116, those code paths have been updated to support 256-bit and 512-bit vectors, which means handling up to 16 or 32 char
s at a time on hardware that supports those vector widths.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private static string s_s1 = """
Let me not to the marriage of true minds
Admit impediments; love is not love
Which alters when it alteration finds,
Or bends with the remover to remove.
O no, it is an ever-fixed mark
That looks on tempests and is never shaken;
It is the star to every wand'ring bark
Whose worth's unknown, although his height be taken.
Love's not time's fool, though rosy lips and cheeks
Within his bending sickle's compass come.
Love alters not with his brief hours and weeks,
But bears it out even to the edge of doom:
If this be error and upon me proved,
I never writ, nor no man ever loved.
""";
private static string s_s2 = s_s1[0..^1] + "!";
[Benchmark]
public bool EqualsIgnoreCase() => s_s1.Equals(s_s2, StringComparison.OrdinalIgnoreCase);
}
| Method           | Runtime  | Mean     | Ratio |
|------------------|----------|---------:|------:|
| EqualsIgnoreCase | .NET 8.0 | 86.79 ns |  1.00 |
| EqualsIgnoreCase | .NET 9.0 | 20.97 ns |  0.24 |

EndsWith
also gets better, for both strings and spans. Previous releases saw StartsWith
become a JIT intrinsic, enabling the JIT to generate dedicated SIMD code for StartsWith
in the case where it’s passed a constant. Now with dotnet/runtime#98593, the same thing is done for EndsWith
.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[DisassemblyDiagnoser(maxDepth: 0)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
[Benchmark]
[Arguments("helloworld.txt")]
public bool EndsWith(string path) => path.EndsWith(".txt", StringComparison.OrdinalIgnoreCase);
}
| Method   | Runtime  | path           | Mean      | Ratio | Code Size |
|----------|----------|----------------|----------:|------:|----------:|
| EndsWith | .NET 8.0 | helloworld.txt | 3.5006 ns |  1.00 |      26 B |
| EndsWith | .NET 9.0 | helloworld.txt | 0.6653 ns |  0.19 |      61 B |

More interesting to me than these nice gains is the code that was generated to achieve them. This is what the assembly for this benchmark looked like with .NET 8:
Code:
; Tests.EndsWith(System.String)
mov rdi,rsi
mov rsi,7EE3C2D25E38
mov edx,5
cmp [rdi],edi
jmp qword ptr [7F24678663A0]; System.String.EndsWith(System.String, System.StringComparison)
; Total bytes of code 26
Pretty straightforward, a bit of argument manipulation and then jumping to the actual
string.EndsWith
implementation. Now here’s .NET 9:
Code:
; Tests.EndsWith(System.String)
push rbp
mov rbp,rsp
mov eax,[rsi+8]
cmp eax,4
jge short M00_L00
xor ecx,ecx
jmp short M00_L01
M00_L00:
mov ecx,eax
lea rax,[rsi+rcx*2-8]
mov rcx,20002000200000
or rcx,[rax+0C]
mov rax,7400780074002E
cmp rcx,rax
sete cl
movzx ecx,cl
M00_L01:
movzx eax,cl
pop rbp
ret
; Total bytes of code 61
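Before walking through that assembly instruction by instruction, it can help to see the same trick sketched in C#. This is an illustrative approximation of the logic the JIT generates for this particular call site, not the actual generated code, and `EndsWithTxtIgnoreCase` is a hypothetical helper name:

```csharp
using System;

static class AsciiTrickSketch
{
    // Approximates what the JIT emits for path.EndsWith(".txt", OrdinalIgnoreCase):
    // a length check, a 64-bit load of the last four chars, an OR with a case
    // mask, and a single 64-bit comparison.
    public static bool EndsWithTxtIgnoreCase(string path)
    {
        if (path.Length < 4) return false;

        // Pack the last four UTF-16 chars into one 64-bit value, lowest char first.
        ulong last4 = 0;
        for (int i = 0; i < 4; i++)
        {
            last4 |= (ulong)path[path.Length - 4 + i] << (16 * i);
        }

        // OR each letter's 16-bit lane with 0x20 to normalize ASCII letters to
        // lower case; the '.' lane's mask is 0 because '.' has no casing.
        const ulong CaseMask = 0x0020_0020_0020_0000;
        const ulong DotTxt   = 0x0074_0078_0074_002E; // 't','x','t','.' in lanes 3..0

        return (last4 | CaseMask) == DotTxt;
    }
}
```

With this sketch, `EndsWithTxtIgnoreCase("helloworld.txt")` and `EndsWithTxtIgnoreCase("HELLO.TXT")` both return true, while `EndsWithTxtIgnoreCase("file.png")` returns false.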
Notice there’s no call to
string.EndsWith
in sight. That’s because the JIT has implemented the EndsWith
functionality here, specific to ".txt"
and OrdinalIgnoreCase
, in just a few instructions. The address of the string is being passed into this method in the rsi
register, and the second mov
instruction is grabbing its Length
(which is stored 8 bytes from the start of the string object) and storing that into the eax
register. It’s then checking whether the string is at least 4 characters long; if it’s not, it can’t possibly end with ".txt"
and thus it jumps to the end to return false
. If it was at least 4 characters long, it then proceeds to load the last four characters of the string as a 64-bit value into rcx
and OR it with the value 20002000200000
. Why? It’s playing the same ASCII trick we’ve seen before. The '.'
is not subject to casing, so we don’t need to manipulate its value, and hence the 16 bits that align with the '.'
are 0. But the other three characters all need to be comparable with both their lower-case and upper-case forms, so this is OR’ing each of the three 16-bit characters with 0x20
to produce the lower-case form. At that point, the 64-bit value can be compared against the 64-bit representation of ".txt"
, which is 7400780074002E
(the ASCII value for '.'
is 0x2E, for 't'
is 0x74, and for 'x'
is 0x78). Then it’s just a simple matter of whether that compared equally or not.

Finally, we’ve not talked much about arrays separate from spans, but there have been improvements there as well. dotnet/runtime#102739 and dotnet/runtime#104103 move more logic for array handling from native code in the runtime up into C# code in
CoreLib
. For example, Array.Copy
has to handle a wide array of cases (pun intended), some of which can be implemented very efficiently and some of which are more laborious, and it tries to optimize the “simple” cases, such as whether the bits from one array can simply be memcpy’d over to the other, with as little overhead as possible. Some of those cases are easy to determine, such as single-dimensional arrays having the exact same type, but other cases require more introspection, such as if one array is enums and the other array is of the underlying type of that enum. The checks to make those determinations previously lived in native code in the runtime, but as of this PR they’re now implemented in C#, and in doing so, some of the overhead associated with the checks has been removed.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private DayOfWeek[] _enums = Enum.GetValues<DayOfWeek>();
private int[] _ints = new int[7];
[Benchmark]
public void Copy() => Array.Copy(_enums, _ints, _enums.Length);
}
| Method | Runtime  | Mean     | Ratio |
|--------|----------|---------:|------:|
| Copy   | .NET 8.0 | 16.25 ns |  1.00 |
| Copy   | .NET 9.0 | 11.05 ns |  0.68 |

In addition to other benefits that come from moving such logic into managed code (better maintainability, more implicitly safe code, reduced overhead from transitioning between managed and native, etc.), there’s another less obvious benefit: impact on GC pause times. And that’s nowhere more obvious than with dotnet/runtime#98623, which moved the implementations of
memset
/memcpy
helpers used for core operations like Span<T>.Fill
and Array.Copy
from native to managed. Consider this C# console app:
Code:
using System.Diagnostics;
new Thread(() =>
{
var a = new object[1000];
while (true) a.AsSpan().Fill(a);
})
{ IsBackground = true }.Start();
var sw = new Stopwatch();
while (true)
{
sw.Restart();
for (int i = 0; i < 10; i++)
{
GC.Collect();
Thread.Sleep(15);
}
Console.WriteLine(sw.Elapsed.TotalSeconds);
}
This is sitting in a loop that simply times how long it takes to perform 10 gen2 collections, each spaced out by ~15 ms. If each collection were free, then this loop should take ~150 ms. Since it’s not free, let’s round up and estimate that the loop should take around ~200 ms. Before we run the loop, though, we launch a thread that just sits in an infinite loop filling a span. That shouldn’t mess with our timing loop… or should it? When I run this on .NET 8, I get values like this:
Code:
1.0683524
0.8884759
0.8420748
1.1101804
1.2730635
Those values are in seconds, and that’s approximately 5x larger than we’d predicted. Now I try on .NET 9, and I get results like this:
Code:
0.1638237
0.2129748
0.2859566
0.3020449
0.2871952
What happened? In order to do some of its work, the GC needs to be able to get a consistent view of the world, which is violated if things are concurrently changing out from under it. As such, it may need to temporarily suspend all threads in the process, but to do that, it needs to wait for each thread to get to a safe point, and if a thread is executing code in the runtime, that can be hard to do. In this particular case, there’s a thread spending almost all of its time sitting in a call to
Span<T>.Fill
, aka memset
, which was implemented in .NET 8 as a native function in the runtime; this couldn’t be interrupted, and the GC would need to wait until the call returned before it could interrupt that thread. In .NET 9, these implementations are all in managed code, and the GC can trivially get the threads to a safe point.

Collections

LINQ
Language Integrated Query, or LINQ, is a mainstay of .NET. At its heart, LINQ is a specification for hundreds of overloads of methods that manipulate data, and then implementations of that specification for different types. One of the most prominent implementations comes from
System.Linq.Enumerable
, sometimes referred to as “LINQ to Objects,” which provides an implementation of these operations as methods for working with IEnumerable<T>
. It’s an incredibly useful set of operations, used ubiquitously, and thus it’s a common target for performance optimization. In many .NET releases, it’ll get an additional method here or an optimized method there, a trickle of focused improvements. But in .NET 9, it’s received a huge amount of attention, with some improvements localized to particular methods and others broadly applicable across much of the surface area.

One of the more sweeping LINQ changes in .NET 9 has to do with how various optimizations are implemented. In the original implementation of LINQ circa 2007, almost every method was logically independent from every other. A method like
SelectMany
took in an IEnumerable<TSource>
and didn’t know anything about where that input came from; every enumerable was processed the same as every other. Some methods would special-case more optimizable data types, though, for example ToArray
would check whether the incoming IEnumerable<TSource>
implemented ICollection<TSource>
, preferring if it did to use the collection’s Count
and CopyTo
in order to avoid having to MoveNext
/Current
through the whole input. But a couple of methods, in particular some overloads of Select
and Where
, did something more interesting. Much of LINQ was implemented using the C# compiler’s support for iterators, where a method that returns IEnumerable<T>
can use yield return t;
to produce instances of T
, and the compiler handles rewriting that method into a class that implements IEnumerable<T>
and handles all the gnarly state-machine details for you. These few Select
and Where
overloads, however, didn’t use iterators, with the developer that authored them instead preferring to write a custom enumerable by hand. Why? It’s possible to hand-author an ever-so-slightly more efficient implementation in some cases, but the compiler is actually really good at doing it well, so that’s not the reason. The reason is that it a) gives the type a name that can be referred to elsewhere in the code, and b) allows that type to expose state that other code can interrogate. This enables information to flow from one query operator to the next. So, for example, Where
could return a WhereEnumerableIterator
instance:
Code:
class WhereEnumerableIterator<TSource> : Iterator<TSource>
{
IEnumerable<TSource> source;
Func<TSource, bool> predicate;
...
}
And then
Select
can look for that type, or, rather, its base type, Iterator<TSource>
:
Code:
public static IEnumerable<TResult> Select<TSource, TResult>(this IEnumerable<TSource> source, Func<TSource, TResult> selector) {
if (source == null) throw Error.ArgumentNull("source");
if (selector == null) throw Error.ArgumentNull("selector");
if (source is Iterator<TSource>) return ((Iterator<TSource>)source).Select(selector);
...
}
and that
WhereEnumerableIterator<TSource>
can override that virtual Select
method on Iterator<TSource>
to specialize what happens when a Where
is followed by Select
:
Code:
public override IEnumerable<TResult> Select<TResult>(Func<TSource, TResult> selector) {
return new WhereSelectEnumerableIterator<TSource, TResult>(source, predicate, selector);
}
This is useful because it allows for avoiding one of the major sources of overhead with enumerables. Without this optimization, if I had
source.Where(x => true).Select(x => x)
, the resulting enumerable would be for the Select
, which would in turn wrap the enumerable for the Where
, which would in turn wrap the source
enumerable. That means that when I call MoveNext
on the select iterator, it in turn needs to call MoveNext
on the Where
, which will in turn call MoveNext
on the source, and then the same for Current
. That means for each element in the source, we end up making 6 interface calls. With the cited optimization, we no longer have separate iterators for the Select
and Where
. Those end up being combined into a single iterator that does the work of both, eliminating one level from the call chain, so instead of 6 interface calls per element, there are only 4. (See Deep Dive on LINQ and An even DEEPER Dive into LINQ for a more in-depth exploration of how exactly this works.)

Over the last decade with .NET, those optimizations have been significantly extended, and in some cases to much greater benefit than just saving a few interface calls. For example, in a previous .NET release, a similar mechanism was used to special-case
OrderBy
followed by First
. Without special-casing, the OrderBy
would need to do both a full copy of the input source and an O(N log N)
sort of the data, all as part of the first call to MoveNext
from First
. But with the optimization, First
is able to see that its source is that OrderBy
, in which case it doesn’t need a copy or sort at all, and can instead simply do an O(N)
search of OrderBy
‘s source for its minimum value. That difference can yield monstrous wins.

This additional special-casing was achieved with internal interfaces in the library. An
IIListProvider<TElement>
provided ToArray
, ToList
, and GetCount
methods, and an IPartition<TElement>
interface (which inherited IIListProvider<TElement>
) provided additional methods like Skip
, Take
, and TryGetFirst
. Custom iterators used to back various LINQ methods could then also implement one or more of these interfaces to specialize being followed by a call like ToArray
or Count()
. For example, it’s very common (e.g. as part of “paging”) to see call sequences like .Skip(...).Take(...)
; with these optimizations, those two operations can be consolidated down into a single iterator, and if it were followed by an operation like Last()
or ToList()
, those could see through both operators to the source in order to possibly optimize based on it (e.g. if the source were an array, Last()
could calculate the exact element to return without needing to do any iteration at all).

dotnet/runtime#98969 and dotnet/runtime#99344 remove those internal interfaces and consolidate all of their members down to the base
Iterator<TSource>
type. This has a variety of benefits. Not directly related to performance, it simplifies the code base, making it easier to maintain (and easier to maintain code is also generally easier to optimize); the interface members of IPartition<TElement>
became virtual
methods on the base class, which also resulted in some code reduction due to being able to share the same default implementation (though with the introduction of default interface methods a few releases ago, this could have been done separately without this consolidation). On the performance front, though, there are three main benefits of this PR:
- Virtual dispatch is generally a bit cheaper than interface dispatch. All of those interface methods became virtual methods, enabling all call sites to them to be a bit cheaper.
- In various places, type tests were being done for multiple targets, and those could now be consolidated to reduce type checks. For example, Select looked something like this:
Code:
if (source is Iterator<TSource> iterator) { ... }
if (source is IPartition<TSource> partition) { ... }
That means for non-specialized iterators, Select was incurring a type check for Iterator<TSource> and an interface check for IPartition<TSource>. With this change, the latter check is now removed.
- Some types were only inheriting from the base class but not implementing any of the interfaces, some were implementing one interface but not the other, and some were even implementing one of the interfaces but not deriving from the base class. The new approach makes it such that all of the provided virtual methods are implemented by any iterator deriving from the base class.
dotnet/runtime#97905, dotnet/runtime#97956, dotnet/runtime#98874, and dotnet/runtime#99216 also added specialized implementations of these virtual methods to more of LINQ’s iterators.
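To make the consolidated shape concrete, here’s a highly simplified sketch. The class and member names mirror the ones discussed above, but the bodies are illustrative only, not the actual System.Linq source:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Highly simplified sketch of the consolidated design (not the real LINQ code):
// every optional fast path lives as a virtual method on one base class.
internal abstract class Iterator<TSource> : IEnumerable<TSource>
{
    public abstract IEnumerator<TSource> GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();

    // Formerly an IPartition<TSource> interface method; now a virtual with a
    // shared default implementation that derived iterators can override.
    public virtual bool TryGetFirst(out TSource? first)
    {
        using IEnumerator<TSource> e = GetEnumerator();
        if (e.MoveNext())
        {
            first = e.Current;
            return true;
        }
        first = default;
        return false;
    }
}

// A consumer then needs a single type test plus a cheap virtual call, rather
// than a base-class check followed by a separate interface check.
internal static class SketchedConsumers
{
    public static bool Any<TSource>(IEnumerable<TSource> source)
    {
        if (source is Iterator<TSource> iterator)
        {
            return iterator.TryGetFirst(out _);
        }

        using IEnumerator<TSource> e = source.GetEnumerator();
        return e.MoveNext();
    }
}
```

With the interfaces gone, a specialized iterator overrides TryGetFirst (and its siblings) directly, and every consumer pays for exactly one type test plus virtual dispatch.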
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private IEnumerable<int> _arrayDistinct = Enumerable.Range(0, 1000).ToArray().Distinct();
private IEnumerable<int> _appendSelect = Enumerable.Range(0, 1000).ToArray().Append(42).Select(i => i * 2);
private IEnumerable<int> _rangeReverse = Enumerable.Range(0, 1000).Reverse();
private IEnumerable<int> _listDefaultIfEmptySelect = Enumerable.Range(0, 1000).ToList().DefaultIfEmpty().Select(i => i * 2);
private IEnumerable<int> _listSkipTake = Enumerable.Range(0, 1000).ToList().Skip(500).Take(100);
private IEnumerable<int> _rangeUnion = Enumerable.Range(0, 1000).Union(Enumerable.Range(500, 1000));
[Benchmark] public int DistinctFirst() => _arrayDistinct.First();
[Benchmark] public int AppendSelectLast() => _appendSelect.Last();
[Benchmark] public int RangeReverseCount() => _rangeReverse.Count();
[Benchmark] public int DefaultIfEmptySelectElementAt() => _listDefaultIfEmptySelect.ElementAt(999);
[Benchmark] public int ListSkipTakeElementAt() => _listSkipTake.ElementAt(99);
[Benchmark] public int RangeUnionFirst() => _rangeUnion.First();
}
| Method                        | Runtime  | Mean         | Ratio | Allocated | Alloc Ratio |
|-------------------------------|----------|-------------:|------:|----------:|------------:|
| DistinctFirst                 | .NET 8.0 |    49.844 ns |  1.00 |     328 B |        1.00 |
| DistinctFirst                 | .NET 9.0 |     7.928 ns |  0.16 |         – |        0.00 |
| AppendSelectLast              | .NET 8.0 | 3,668.347 ns | 1.000 |     144 B |        1.00 |
| AppendSelectLast              | .NET 9.0 |     2.222 ns | 0.001 |         – |        0.00 |
| RangeReverseCount             | .NET 8.0 |     8.703 ns |  1.00 |         – |          NA |
| RangeReverseCount             | .NET 9.0 |     3.465 ns |  0.40 |         – |          NA |
| DefaultIfEmptySelectElementAt | .NET 8.0 | 2,772.283 ns | 1.000 |     144 B |        1.00 |
| DefaultIfEmptySelectElementAt | .NET 9.0 |     4.399 ns | 0.002 |         – |        0.00 |
| ListSkipTakeElementAt         | .NET 8.0 |     3.699 ns |  1.00 |         – |          NA |
| ListSkipTakeElementAt         | .NET 9.0 |     2.103 ns |  0.57 |         – |          NA |
| RangeUnionFirst               | .NET 8.0 |    53.670 ns |  1.00 |     344 B |        1.00 |
| RangeUnionFirst               | .NET 9.0 |     5.181 ns |  0.10 |         – |        0.00 |

Subsequent PRs also further benefited from this consolidation. dotnet/runtime#99218, for example, uses it to improve
Enumerable.Any(IEnumerable<T>)
. Any
just needs to say whether the source has any elements, and it tries hard to determine that without having to get an enumerator from the source, which allocates, and call MoveNext
(an interface call) to see if it returns true. In .NET 8, it was doing this using Enumerable.TryGetNonEnumeratedCount
, which uses Iterator<T>.GetCount(onlyIfCheap: true)
(the “onlyIfCheap” part basically means “don’t enumerate to compute the count”). However, for iterators where it’s not “cheap”, TryGetNonEnumeratedCount
would return false
, and Any
would still be forced to get an enumerator. However, now that every Iterator<T>
provides a TryGetFirst
, Any
can use that in the case where the source is an Iterator<T>
but GetCount
isn’t successful. Worst case, TryGetFirst
will itself end up calling GetEnumerator
, but best case, the iterator will have provided a more efficient implementation of TryGetFirst
. And either way, it’s still a win, because enumerating would require not only a GetEnumerator
call on the Iterator<T>
, but that in turn would need to call GetEnumerator
on whatever source it was wrapping, whereas this ends up saving one layer.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private IEnumerable<int> _data1 = Iterations(100).Where(i => i % 2 == 0).Select(i => i);
private IEnumerable<int> _data2 = Enumerable.Range(0, 100).ToArray().Where(i => i % 2 == 0).Select(i => i);
[Benchmark] public bool Any1() => _data1.Any();
[Benchmark] public bool Any2() => _data2.Any();
private static IEnumerable<int> Iterations(int count)
{
for (int i = 0; i < count; i++) yield return i;
}
}
| Method | Runtime  | Mean      | Ratio | Allocated | Alloc Ratio |
|--------|----------|----------:|------:|----------:|------------:|
| Any1   | .NET 8.0 | 31.967 ns |  1.00 |     104 B |        1.00 |
| Any1   | .NET 9.0 | 15.818 ns |  0.49 |      40 B |        0.38 |
| Any2   | .NET 8.0 | 21.062 ns |  1.00 |      56 B |        1.00 |
| Any2   | .NET 9.0 |  3.780 ns |  0.18 |         – |        0.00 |

Another cross-cutting improvement across LINQ comes in dotnet/runtime#96602 and has to do with empty inputs. It’s also a nice example of how what’s considered an optimization ebbs and flows. In the beginning of LINQ,
Enumerable.Empty<T>()
, which is strongly-typed to return IEnumerable<T>
, returned an empty T[]
as the actual instance. When Array.Empty<T>()
was introduced, it used that. Then, however, the aforementioned IPartition<T>
was introduced internally in LINQ, and Enumerable.Empty<T>()
was changed to return a singleton EmptyPartition<T>
, an implementation of the interface with all of the methods dedicated to being efficient for empty inputs. This was helpful internally as an implementation detail, as methods that were typed to return IPartition<T>
could return that EmptyPartition<T>
instance, whereas they couldn’t return a T[]
, since it doesn’t implement that interface. However, it had a downside. A variety of APIs can optimize very well if they know the input is empty, e.g. a Take
call can immediately return an empty singleton if it knows the input is empty. But that decision can’t be based solely on whether the input is empty now; it must be empty now and forever. Otherwise, you could call Take
, it would see it was empty, then elements get added to the source, and then you call GetEnumerator
on the enumerable returned from Take
… according to the rules for how all of this behaves, that should yield the newly-added elements, but if Take
had returned an empty singleton, it wouldn’t. There are a variety of types that we know will always be empty once seen as empty (e.g. ImmutableArray<T>
, T[]
, FrozenSet<T>
, etc.), but it’d be too costly to check for each of them individually. Instead, the implementation just picked the same type as Enumerable.Empty<T>()
returned as the one to check for. That’s fairly reasonable, but as it turns out, when that type is EmptyPartition<T>
, there are a lot of empty arrays that are no longer noticed as being a special empty input. This gets even worse with collection expressions in the picture, as initializing an IEnumerable<T>
with []
will, as an implementation detail, produce Array.Empty<T>()
. So, this PR put everything back on a plan of Enumerable.Empty<T>()
being Array.Empty<T>()
and a T[0]
being what’s checked for when special-casing empty inputs. The PR also included new checks for empty in many different places that warranted it.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private string[] _values = [];
[Benchmark] public object Chunk() => _values.Chunk(10);
[Benchmark] public object Distinct() => _values.Distinct();
[Benchmark] public object GroupJoin() => _values.GroupJoin(_values, i => i, i => i, (i, j) => i);
[Benchmark] public object Join() => _values.Join(_values, i => i, i => i, (i, j) => i);
[Benchmark] public object ToLookup() => _values.ToLookup(i => i);
[Benchmark] public object Reverse() => _values.Reverse();
[Benchmark] public object SelectIndex() => _values.Select((s, i) => i);
[Benchmark] public object SelectMany() => _values.SelectMany(i => i);
[Benchmark] public object SkipWhile() => _values.SkipWhile(i => true);
[Benchmark] public object TakeWhile() => _values.TakeWhile(i => true);
[Benchmark] public object WhereIndex() => _values.Where((s, i) => true);
}
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio
---|---|---|---|---|---
Chunk | .NET 8.0 | 10.7213 ns | 1.00 | 72 B | 1.00
Chunk | .NET 9.0 | 4.1320 ns | 0.39 | – | 0.00
Distinct | .NET 8.0 | 9.4410 ns | 1.00 | 64 B | 1.00
Distinct | .NET 9.0 | 0.7162 ns | 0.08 | – | 0.00
GroupJoin | .NET 8.0 | 22.4746 ns | 1.00 | 144 B | 1.00
GroupJoin | .NET 9.0 | 1.1356 ns | 0.05 | – | 0.00
Join | .NET 8.0 | 18.6332 ns | 1.00 | 168 B | 1.00
Join | .NET 9.0 | 1.3585 ns | 0.07 | – | 0.00
ToLookup | .NET 8.0 | 23.3518 ns | 1.00 | 128 B | 1.00
ToLookup | .NET 9.0 | 0.9539 ns | 0.04 | – | 0.00
Reverse | .NET 8.0 | 9.5791 ns | 1.00 | 48 B | 1.00
Reverse | .NET 9.0 | 0.9947 ns | 0.10 | – | 0.00
SelectIndex | .NET 8.0 | 11.1235 ns | 1.00 | 72 B | 1.00
SelectIndex | .NET 9.0 | 0.5603 ns | 0.05 | – | 0.00
SelectMany | .NET 8.0 | 10.7537 ns | 1.00 | 64 B | 1.00
SelectMany | .NET 9.0 | 0.9906 ns | 0.09 | – | 0.00
SkipWhile | .NET 8.0 | 11.2900 ns | 1.00 | 72 B | 1.00
SkipWhile | .NET 9.0 | 1.0988 ns | 0.10 | – | 0.00
TakeWhile | .NET 8.0 | 11.8818 ns | 1.00 | 72 B | 1.00
TakeWhile | .NET 9.0 | 1.0381 ns | 0.09 | – | 0.00
WhereIndex | .NET 8.0 | 11.1751 ns | 1.00 | 80 B | 1.00
WhereIndex | .NET 9.0 | 1.2185 ns | 0.11 | – | 0.00

dotnet/runtime#98963 also has to do with emptiness, but actually improves non-empty cases.
DefaultIfEmpty
needs to produce an IEnumerable<T>
containing all of the elements from the source, or if the source is empty, an enumerable with a single default(T)
value. In most cases, that means it has to allocate a new enumerable, because for the same reasons as just described, it can’t know until GetEnumerator
is called whether the source is empty. Except, it can if the source is a T[]
, which has an immutable length. This PR thus special-cases arrays, which are very common, such that if the array isn’t empty, it’s just returned directly rather than allocating a wrapper enumerable for it. That’s more than just about avoiding an allocation: that wrapper object would be in the middle of all subsequent iterations of the object, so avoiding it not only avoids an allocation but also a layer of interface calls. And for subsequent code paths that special-case arrays, the result of DefaultIfEmpty
will still be seen as a T[]
and thus now special-cased, whereas it wouldn’t if it were wrapped.
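That array-identity behavior can be observed with a few lines of code; this sketch relies on the .NET 9 implementation detail just described, and earlier runtimes that allocate the wrapper will report the opposite result:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int[] values = [1, 2, 3];

// On .NET 9, DefaultIfEmpty over a non-empty array hands back the array itself
// (an implementation detail), so no wrapper enumerable sits between the caller
// and the data, and later array special-cases still kick in.
IEnumerable<int> result = values.DefaultIfEmpty();
Console.WriteLine(ReferenceEquals(values, result)); // True on .NET 9, False on .NET 8
```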
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private int[] _data = Enumerable.Range(0, 1000).ToArray();
[Benchmark]
public double Average() => _data.DefaultIfEmpty().Average();
}
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio
---|---|---|---|---|---
Average | .NET 8.0 | 1,915.4 ns | 1.00 | 80 B | 1.00
Average | .NET 9.0 | 117.6 ns | 0.06 | – | 0.00

Another change taking advantage of emptiness is dotnet/runtime#99256, this time for
Enumerable.Chunk
. Chunk(int size)
creates an IEnumerable<T[]>
that pages through the input size
elements at a time. Normally, this requires iterating through the source and buffering until size
elements have been reached, then yielding an array with those elements, and then rinsing and repeating. With an array input, we could do this much more efficiently, as we could just do math to compute the right bounds for each set to be yielded and do an efficient copy of the elements, rather than iterating through each element one by one. And while it might not be worth adding a specialized check for array here (Chunk
isn’t exactly a high-performance method to begin with, given it’s allocating a new array for each set), as it turns out we now have a check for array, as part of determining whether the source is permanently empty. This PR just leverages that check to take advantage of both answers. If the array is empty, then it still just returns an empty array. But if it’s not empty, rather than falling back to the normal iteration path, it employs a 7-line alternative that’s specialized to arrays and much more efficient.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private int[] _values = Enumerable.Range(0, 1000).ToArray();
[Benchmark]
public int Count()
{
int count = 0;
foreach (var chunk in _values.Chunk(10)) count += chunk.Length;
return count;
}
}
Method | Runtime | Mean | Ratio
---|---|---|---
Count | .NET 8.0 | 3.612 us | 1.00
Count | .NET 9.0 | 1.334 us | 0.37

The statement about checking for empty now and permanently applies in particular to methods that accept and return enumerables. It’s the laziness of these methods that makes that relevant. There is, however, a set of LINQ methods that are not lazy because they produce things that aren’t enumerables, such as
ToArray
returning an array, Sum
returning a single value, Count
returning an int
, and so on. These methods also received attention, thanks to dotnet/runtime#102884 from @neon-sunset. One of the optimizations applied in various LINQ methods is to special-case input types that are super common, in particular T[]
and List<T>
. These can be special-cased not just as IList<T>
, which would generally be more efficient than enumerating an input via an IEnumerator<T>
, but rather as a ReadOnlySpan<T>
, which can be iterated through very efficiently. This PR extended that optimization to apply to most of these other non-enumerable producing methods, in particular overloads of Any
, All
, Count
, First
, and Single
that take predicates. This is particularly helpful because recent additions to analyzers have resulted in developers being told about opportunities to simplify their LINQ usage. IDE0120 flags code like source.Where(predicate).First()
and instead recommends simplifying it to source.First(predicate)
. And while that is a nice simplification and is likely to reduce allocation, Where
is considerably more optimized than First(predicate)
has been, with the former having special-casing for T[]
and List<T>
but the latter historically not. That difference is now addressed in .NET 9.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private IEnumerable<int> _list = Enumerable.Range(0, 1000).ToList();
[Benchmark] public bool Any() => _list.Any(i => i == 1000);
[Benchmark] public bool All() => _list.All(i => i >= 0);
[Benchmark] public int Count() => _list.Count(i => i == 0);
[Benchmark] public int First() => _list.First(i => i == 999);
[Benchmark] public int Single() => _list.Single(i => i == 0);
}
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio
---|---|---|---|---|---
Any | .NET 8.0 | 1,553.3 ns | 1.00 | 40 B | 1.00
Any | .NET 9.0 | 222.2 ns | 0.14 | – | 0.00
All | .NET 8.0 | 1,586.0 ns | 1.00 | 40 B | 1.00
All | .NET 9.0 | 224.9 ns | 0.14 | – | 0.00
Count | .NET 8.0 | 1,535.6 ns | 1.00 | 40 B | 1.00
Count | .NET 9.0 | 244.6 ns | 0.16 | – | 0.00
First | .NET 8.0 | 1,600.7 ns | 1.00 | 40 B | 1.00
First | .NET 9.0 | 245.4 ns | 0.15 | – | 0.00
Single | .NET 8.0 | 1,550.6 ns | 1.00 | 40 B | 1.00
Single | .NET 9.0 | 239.4 ns | 0.15 | – | 0.00

dotnet/runtime#97004 from @neon-sunset uses that same mechanism to improve performance for
List<T>
inputs inside of Enumerable.SequenceEqual
. Enumerable.SequenceEqual
already had a special-case that checked whether both inputs were arrays, and if they were, it created spans from those arrays and delegated to MemoryExtensions.SequenceEqual
, which will efficiently iterate through the spans, vectorizing if possible. This PR just tweaked that special-case to use the same helper that’s used elsewhere to try to get a span from the source, and that gives this super power to List<T>
as well.
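The effect of that helper is roughly what this sketch does by hand: pull spans out of both the array and the List&lt;T&gt; (CollectionsMarshal.AsSpan is how code can see a list’s backing storage) and compare them with the vectorized span-based MemoryExtensions.SequenceEqual. This is an illustration of the idea, not the actual internal code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.InteropServices;

int[] array = Enumerable.Range(0, 10_000).ToArray();
List<int> list = array.ToList();

// Compare the raw storage of both collections with the vectorized span overload,
// which is effectively what Enumerable.SequenceEqual now does for these inputs.
ReadOnlySpan<int> left = array;
ReadOnlySpan<int> right = CollectionsMarshal.AsSpan(list);
Console.WriteLine(left.SequenceEqual(right)); // True
```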
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private IEnumerable<int> _source1, _source2;
[GlobalSetup]
public void Setup()
{
_source1 = Enumerable.Range(0, 10_000).ToArray();
_source2 = _source1.ToList();
}
[Benchmark]
public bool SequenceEqual() => _source1.SequenceEqual(_source2);
}
Method | Runtime | Mean | Ratio
---|---|---|---
SequenceEqual | .NET 8.0 | 26,623.3 ns | 1.00
SequenceEqual | .NET 9.0 | 913.4 ns | 0.03

ToArray
and ToList
were generally improved via a variety of PRs, in particular by dotnet/runtime#96570. ToArray
in particular is used so ubiquitously that over the years, many folks have attempted to optimize it. In doing so, however, it’s gotten too complex for its own good. This PR takes advantage of newer runtime capabilities to significantly simplify the implementation, while also improving common case performance. The easy cases were already handled well and continue to be: if the source is an ICollection<T>
, its Count
/ CopyTo
methods can be used to provide a very efficient ToArray
, and if the source is an Iterator<TSource>
, ToArray
just delegates to the iterator’s ToArray
implementation. The challenge, instead, is in efficiently handling the case where we’re dealing with an IEnumerable<T>
of unknown length, needing to handle both short and long inputs, and doing so in a way that minimizes allocation and maximizes throughput. To achieve that, this PR deleted the internal ArrayBuilder
, LargeArrayBuilder
, and SparseArrayBuilder
types that were previously being used and replaced them all with a simpler internal SegmentedArrayBuilder
. The builder is seeded with an [InlineArray]
-based struct that’s large enough to hold eight T
instances. For up to eight items, the builder can simply use that stack-based space to store the elements. For more than eight items, the builder contains another [InlineArray]
of up to 27 T[]
s. The arrays stored in there are rented from the ArrayPool<T>
, and based on the starting size and standard doubling growth algorithm, 27 arrays is enough to store Array.MaxLength
elements. This approach means that small inputs never need to allocate (other than for the final T[]
, which is unavoidable as it’s the whole purpose of the method), and larger inputs can use ArrayPool<T>
arrays without having to copy while growing, leading to on average significantly less allocation than before and generally faster throughput. There are trade-offs to this approach when compared to the previous one, with a few niche corner cases it doesn’t handle quite as efficiently, but on the whole it’s an improvement in both performance and maintainability.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
[Benchmark]
[Arguments(1)]
[Arguments(8)]
[Arguments(500)]
public string[] IteratorToArray(int count) => GetItems(count).ToArray();
private IEnumerable<string> GetItems(int count)
{
for (int i = 0; i < count; i++)
{
yield return ".NET 9";
}
}
}
Method | Runtime | count | Mean | Ratio | Allocated | Alloc Ratio
---|---|---|---|---|---|---
IteratorToArray | .NET 8.0 | 1 | 65.51 ns | 1.00 | 136 B | 1.00
IteratorToArray | .NET 9.0 | 1 | 41.39 ns | 0.63 | 80 B | 0.59
IteratorToArray | .NET 8.0 | 8 | 103.30 ns | 1.00 | 192 B | 1.00
IteratorToArray | .NET 9.0 | 8 | 74.66 ns | 0.72 | 136 B | 0.71
IteratorToArray | .NET 8.0 | 500 | 3,100.69 ns | 1.00 | 8536 B | 1.00
IteratorToArray | .NET 9.0 | 500 | 3,080.31 ns | 0.99 | 4072 B | 0.48

dotnet/runtime#104365 from @andrewjsaid followed up on this to use that same
SegmentedArrayBuilder
to improve ToList
. Everything stays the same, except for the last step of constructing the final collection to be returned: rather than allocating an array, it allocates a List<T>
and uses the CollectionsMarshal.SetCount
method to set both the Capacity
and Count
of the list to the desired size, then copies the elements directly into the backing array for the list, thanks to CollectionsMarshal.AsSpan
. ToList
was also improved in dotnet/runtime#86796 from @brantburnett. In various Iterator<T>.ToList
specializations, the common pattern is to use List<T>.Add
to fill in the resulting collection. This PR used a similar approach as with the previous PR, using a combination of CollectionsMarshal.SetCount
and CollectionsMarshal.AsSpan
to get a span for the list and directly write into the span. This saves some of the overhead from List<T>.Add
, including bounds checks that would otherwise occur when writing to its backing array.
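That CollectionsMarshal pattern isn’t limited to LINQ’s internals; any code that knows the final element count up front can use it. Here’s a minimal sketch of the technique (the method name and values are illustrative, not from the actual implementation):

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

static List<int> FillSquares(int count)
{
    List<int> list = new(count);

    // Set the list's Count (growing Capacity if needed) in one step, then write
    // directly into its backing array via a span, skipping the growth and
    // bounds checks that a per-element Add loop would incur.
    CollectionsMarshal.SetCount(list, count);
    Span<int> span = CollectionsMarshal.AsSpan(list);
    for (int i = 0; i < span.Length; i++)
    {
        span[i] = i * i;
    }

    return list;
}

Console.WriteLine(string.Join(", ", FillSquares(5))); // 0, 1, 4, 9, 16
```

The usual caveat applies: after SetCount, the new elements are uninitialized from the list’s perspective, so the caller must fill every slot before handing the list to anyone else.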
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
[Benchmark]
public List<int> IteratorSelectToList() => GetItems(8).Select(i => i).ToList();
[Benchmark]
public List<int> IteratorWhereSelectToList() => GetItems(8).Where(i => true).Select(i => i).ToList();
private IEnumerable<int> GetItems(int count)
{
for (int i = 0; i < count; i++)
{
yield return i;
}
}
}
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio
---|---|---|---|---|---
IteratorSelectToList | .NET 8.0 | 75.14 ns | 1.00 | 224 B | 1.00
IteratorSelectToList | .NET 9.0 | 67.50 ns | 0.90 | 184 B | 0.82
IteratorWhereSelectToList | .NET 8.0 | 94.84 ns | 1.00 | 288 B | 1.00
IteratorWhereSelectToList | .NET 9.0 | 89.42 ns | 0.94 | 248 B | 0.86

A few more tweaks were made to
ToList
and ToArray
in dotnet/runtime#95224 from @Windows10CE and dotnet/runtime#100218. The former improved ToList
on the result of a Distinct
or Union
by enabling HashSet<T>
‘s CopyTo
implementation to be used; previously a custom function was manually iterating through the set, and this PR deleted that code (yay!) and just used List<T>
‘s constructor directly. The latter PR also improved Distinct
and Union
, but for ToArray
, and specifically in the case where it would have allocated a 0-length array when the source was empty. dotnet/runtime#99639 also improved ToArray
and ToList
on the result of an OrderBy
; OrderBy
‘s iterator already special-cased empty sources, but with small tweaks it could also be made to optimize sources with only a single element, in which case no additional work needs to be done (a length-1 array is inherently sorted).
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
[Benchmark]
public int[] OrderByToArray() => GetItems(1).OrderBy(x => x).ToArray();
private IEnumerable<int> GetItems(int count)
{
for (int i = 0; i < count; i++)
{
yield return i;
}
}
}
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio
---|---|---|---|---|---
OrderByToArray | .NET 8.0 | 66.99 ns | 1.00 | 352 B | 1.00
OrderByToArray | .NET 9.0 | 53.92 ns | 0.81 | 160 B | 0.45

Not to be left out of the fun its
To
cousins are having, ToDictionary
also sees improvements from dotnet/runtime#96574 from @xin9le. The PR changes the code to do a better job setting the capacity of the Dictionary<TKey, TValue>
prior to filling it, and also uses CollectionsMarshal.AsSpan
to share code for handling sources that are arrays and lists, while also shaving off some overhead by enumerating the span instead of the list directly.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private readonly IEnumerable<KeyValuePair<int, int>> _enumerable = Enumerable.Range(0, 10_000).Select(x => new KeyValuePair<int, int>(x, x));
[Benchmark]
public Dictionary<int, KeyValuePair<int, int>> EnumerableToDictionary() => _enumerable.ToDictionary(x => x.Key);
}
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio
---|---|---|---|---|---
EnumerableToDictionary | .NET 8.0 | 284.3 us | 1.00 | 788.73 KB | 1.00
EnumerableToDictionary | .NET 9.0 | 149.9 us | 0.53 | 237.01 KB | 0.30

dotnet/runtime#96605 updated
Enumerable.Min
and Enumerable.Max
to specialize for char
, Int128
, and UInt128
(previous changes specialized other numerical primitives, but these had been left out). By taking advantage of the existing code paths for handling those other primitives, with just a few lines added/changed, these types can now utilize those faster paths, which in particular special-case arrays and lists (which means it can then avoid an enumerator allocation in addition to faster access to each element).
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private Int128[] _values = Enumerable.Range(0, 1000).Select(x => (Int128)x).ToArray();
[Benchmark]
public Int128 Max() => _values.Max();
}
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio
---|---|---|---|---|---
Max | .NET 8.0 | 1,882.0 ns | 1.00 | 32 B | 1.00
Max | .NET 9.0 | 624.7 ns | 0.33 | – | 0.00

The aforementioned special code paths for the primitive types also support vectorization. Previously that vectorization only supported 128-bit and 256-bit vector widths, but as of dotnet/runtime#93369 from @Spacefish, it now also supports 512-bit vector widths, possibly doubling the throughput of
Enumerable.Min
and Enumerable.Max
on supported hardware with the core numerical primitive types.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private int[] _values = Enumerable.Range(0, 10_000).ToArray();
[Benchmark]
public int Max() => _values.Max();
}
Method | Runtime | Mean | Ratio
---|---|---|---
Max | .NET 8.0 | 327.6 ns | 1.00
Max | .NET 9.0 | 166.3 ns | 0.51

One caveat here about AVX512. Some AVX512 hardware, even on recent chips, can take a measurable amount of time to “power up,” such that it might be tens, hundreds, or even thousands of cycles before AVX512 processing ends up actually dispatching 512-bit vectors. Until then, the hardware might end up doing the equivalent of dispatching two 256-bit vectors. On my machine, for example, if I lower the size in the previous benchmark from 10,000 elements to 1,000 elements, the .NET 9 improvement disappears and it ends up running at exactly the same throughput as on .NET 8; on a colleague’s machine with a different processor, even at 1,000 elements the .NET 9 throughput is almost twice that of .NET 8.

This is all to say, your mileage may vary. In some of the micro-benchmarks discussed in this post, small improvements are made to already very fast operations, and the gains then come from those operations being done many, many, many times on hot paths. In others, the gains come from taking an expensive operation and making it measurably cheaper. In general the benefits with using AVX512 in these kinds of vectorized implementations come for the latter case, where large data sizes lead to operations taking significant amounts of time, and the use of 512-bit vectors instead of 256-bit vectors measurably speeds up those longer operations.
The OrderBy family of methods on Enumerable was also improved in several ways:

- Ordering operations followed by a First() or Last() call were already specialized to completely avoid the O(N log N) sort and instead do an O(N) search for the min or max. However, OrderBy in LINQ is fairly complicated because it needs to account for the possibility of one or more subsequent ThenBy operations that impact the sort order, and thus it uses a custom comparison mechanism that factors in the possibility of such refinement. That custom comparer mechanism was being used as part of those First/Last specializations. dotnet/runtime#97483 detects whether there are any ThenBys in play, and if there aren’t, it bypasses that customization and, in doing so, avoids its overhead. That can be very measurable, but in certain cases, it can be enormous, as it can enable other optimizations to kick in, e.g. an Order().Last() on an int[] can just end up doing a vectorized search for the max.

Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.InteropServices;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private List<int> _ints;
[GlobalSetup]
public void Setup()
{
_ints = new(Enumerable.Range(-8000, 8000 * 2));
new Random(42).Shuffle(CollectionsMarshal.AsSpan(_ints));
}
[Benchmark] public int OrderByLast_Int32() => _ints.OrderBy(x => x).Last();
[Benchmark] public int OrderLast_Int32() => _ints.Order().Last();
}

Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio
---|---|---|---|---|---
OrderByLast_Int32 | .NET 8.0 | 34,715.6 ns | 1.00 | 136 B | 1.00
OrderByLast_Int32 | .NET 9.0 | 25,001.1 ns | 0.72 | 128 B | 0.94
OrderLast_Int32 | .NET 8.0 | 36,064.9 ns | 1.00 | 112 B | 1.00
OrderLast_Int32 | .NET 9.0 | 693.8 ns | 0.02 | 56 B | 0.50

- In .NET 8, Enumerable.Order was updated to recognize that sorting of certain primitive types is implicitly stable even if an unstable sorting algorithm is used, because any two values of such types that compare equally are indistinguishable in memory (e.g. the only Int32 values that compare equally are those with the exact same bit patterns in memory). dotnet/runtime#99533 improves this logic to also handle enums whose underlying type counts as one of those primitives.

Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private IEnumerable<DayOfWeek> _days = new Random(42).GetItems(Enum.GetValues<DayOfWeek>(), 100);
[Benchmark]
public int Order()
{
int sum = 0;
foreach (DayOfWeek dow in _days.Order())
{
sum += (int)dow;
}
return sum;
}
}

Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio
---|---|---|---|---|---
Order | .NET 8.0 | 1,652.9 ns | 1.00 | 1088 B | 1.00
Order | .NET 9.0 | 873.0 ns | 0.53 | 544 B | 0.50

- In .NET 8, a change was submitted to Enumerable.Range to vectorize its operation when followed by methods like ToArray. At the time, we had some debate about whether to merge the change, with me asking questions like “Who would actually use Enumerable.Range(...).ToArray() on code paths that care about performance?” As it turns out, we do! As part of OrderBy’s stable sort implementation, it had code like this:

Code:
int[] map = new int[count];
for (int i = 0; i < map.Length; i++)
{
map[i] = i;
}

For all intents and purposes, that’s Enumerable.Range(0, count).ToArray(). dotnet/runtime#99538 recognizes this and uses that same vectorized helper to fill this array, and that overhead can actually be measurable in some cases.

Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private IEnumerable<int> _data = Enumerable.Range(0, 1000).ToArray();
[Benchmark]
public int OrderBy()
{
int sum = 0;
foreach (int value in _data.OrderBy(i => i))
{
sum += value;
}
return sum;
}
}

Method | Runtime | Mean | Ratio
---|---|---|---
OrderBy | .NET 8.0 | 14.83 us | 1.00
OrderBy | .NET 9.0 | 13.48 us | 0.91
GroupBy
and ToLookup
also get some dedicated improvements in .NET 9, thanks to dotnet/runtime#99365. GetEnumerator
on the grouping object returned by these methods was implemented using a simple C# iterator:
Code:
public IEnumerator<TElement> GetEnumerator()
{
for (int i = 0; i < _count; i++)
{
yield return _elements[i];
}
}
In general we favor using C# iterators over manual implementations (unless we’re going to go all out and implement all of the
Iterator<TSource>
logic) because C# iterators make the code so simple and maintainable. In this particular case, however, this is a reasonably common hot path and we can actually do meaningfully better by hand than the compiler is able to do today. When the compiler generates a state machine for the previous iterator, it does so with a dedicated state field, but with a manual implementation, we can use the same field for state as we use for the iteration variable, which also means we only need to update one thing per loop iteration.
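To illustrate the shape of that hand-written approach, here’s a sketch of a manual enumerator (this is not the actual grouping implementation, just a demonstration of folding the compiler’s separate state field into the iteration index):

```csharp
using System.Collections;
using System.Collections.Generic;

// Illustration only: the position field doubles as the enumerator's state,
// so MoveNext performs a single field update per iteration, whereas a
// compiler-generated iterator maintains a distinct state field as well.
public sealed class ArraySegmentEnumerator<T>(T[] elements, int count) : IEnumerator<T>
{
    private int _index = -1; // -1 means "not started"; >= 0 is the current position

    public T Current => elements[_index];
    object IEnumerator.Current => Current;

    public bool MoveNext()
    {
        int index = _index + 1;
        if (index < count)
        {
            _index = index; // one write per iteration, no separate state bookkeeping
            return true;
        }
        return false;
    }

    public void Reset() => _index = -1;
    public void Dispose() { }
}
```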
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private ILookup<int, string> _stringsByLength =
(from i in Enumerable.Range(0, 10)
from item in Enumerable.Range(0, 8)
select new string((char)('a' + item), i + 1)).ToLookup(s => s.Length);
[Benchmark]
public int Sum()
{
int sum = 0;
foreach (IGrouping<int, string> group in _stringsByLength)
{
foreach (string item in group)
{
sum += item.Length;
}
}
return sum;
}
}
Method | Runtime | Mean | Ratio
---|---|---|---
Sum | .NET 8.0 | 290.3 ns | 1.00
Sum | .NET 9.0 | 267.4 ns | 0.92

Core Collections
As shared in Performance Improvements in .NET 8,
Dictionary<TKey, TValue>
is one of the most popular collections in all of .NET, by far (probably not surprising to anyone). And in .NET 9, it gets a performance-focused feature I’ve been wanting for years.

One of the most common uses for a dictionary is as a cache, often indexed by a
string
key. And for high-performance scenarios, such caches are frequently used in situations where an actual string
object may not be available, but where the text is available, just in a different form, like a ReadOnlySpan<char>
(or for caches indexed by UTF8 data, the key might be a byte[]
yet the data to perform the lookup is only available as a ReadOnlySpan<byte>
). Performing the lookup on the dictionary then would either require materializing a string from the data, which makes the lookup more costly (and in some cases can entirely defeat the purpose of the cache), or require using a custom key type that’s capable of handling multiple forms of the data, which then also generally requires a custom comparer.

This has been addressed in .NET 9 with the introduction of
IAlternateEqualityComparer<TAlternate, T>
. A comparer that implements IEqualityComparer<T>
may now also implement this additional interface one or more times for other TAlternate
types, making it possible for that comparer to treat alternate types the same as the T
. Then a type like Dictionary<TKey, TValue>
can expose additional methods that work in terms of a TAlternateKey
and allow them to work if the comparer in that Dictionary<TKey, TValue>
implements IAlternateEqualityComparer<TAlternateKey, TKey>
. In .NET 9 with dotnet/runtime#102907 and dotnet/runtime#103191, Dictionary<TKey, TValue>
, ConcurrentDictionary<TKey, TValue>
, FrozenDictionary<TKey, TValue>
, HashSet<T>
, and FrozenSet<T>
all do exactly that. For example, here I have a Dictionary<string, int>
I’m using to count the number of occurrences of each word in a span:
Code:
static Dictionary<string, int> CountWords1(ReadOnlySpan<char> input)
{
Dictionary<string, int> result = new(StringComparer.OrdinalIgnoreCase);
foreach (ValueMatch match in Regex.EnumerateMatches(input, @"\b\w+\b"))
{
ReadOnlySpan<char> word = input.Slice(match.Index, match.Length);
string key = word.ToString();
result[key] = result.TryGetValue(key, out int count) ? count + 1 : 1;
}
return result;
}
I’m returning a
Dictionary<string, int>
, so I certainly need to materialize the string
for each ReadOnlySpan<char>
in order to store it in the dictionary, but I should only need to do so once, the first time the word is found. I shouldn’t need to create a new string each time, yet I’m having to in order to do the TryGetValue
call. Now with .NET 9, a new GetAlternateLookup
method (and a corresponding TryGetAlternateLookup
) exists to produce a separate value type wrapper that enables using an alternate key type for all the relevant operations, which means I can now write this:
Code:
static Dictionary<string, int> CountWords2(ReadOnlySpan<char> input)
{
Dictionary<string, int> result = new(StringComparer.OrdinalIgnoreCase);
Dictionary<string, int>.AlternateLookup<ReadOnlySpan<char>> alternate = result.GetAlternateLookup<ReadOnlySpan<char>>();
foreach (ValueMatch match in Regex.EnumerateMatches(input, @"\b\w+\b"))
{
ReadOnlySpan<char> word = input.Slice(match.Index, match.Length);
alternate[word] = alternate.TryGetValue(word, out int count) ? count + 1 : 1;
}
return result;
}
Note the distinct lack of a
ToString()
, which means no allocation will occur here for words already seen. How then does the alternate[word] = ...
part work? Surely this isn’t storing a ReadOnlySpan<char>
into the dictionary? Nope. Rather, IAlternateEqualityComparer<TAlternate, T>
looks like this:
Code:
public interface IAlternateEqualityComparer<in TAlternate, T>
where TAlternate : allows ref struct
where T : allows ref struct
{
bool Equals(TAlternate alternate, T other);
int GetHashCode(TAlternate alternate);
T Create(TAlternate alternate);
}
The
Equals
and GetHashCode
should look familiar, the main difference from the corresponding members of IEqualityComparer<T>
just being the type of the first parameter. But then there’s this additional Create
method. That method accepts a TAlternate
and returns a T
, giving the comparer the ability to map from one to the other. That setter we saw previously (and other methods like TryAdd
) are able to use this to only create the TKey
from the TAlternateKey
when they have to, so the setter here will only allocate the string for the word if the word doesn’t already exist in the collection.

Another possibly perplexing thing for anyone well versed in the ways of
ReadOnlySpan<T>
: how in the world is Dictionary<string, int>.AlternateLookup<ReadOnlySpan<char>>
valid? ref struct
s like span can’t be used as generic parameters, right? Right… until now. C# 13 and .NET 9 now permit ref struct
s as generic parameters, but the generic parameter needs to opt-in to it via the new allows ref struct
constraint (or “anti-constraint” as some of us frequently refer to it). There are things a method can do with an instance of an unconstrained generic parameter, like cast it to object
or store it into a field of a class, that can’t be done with ref struct
. By adding allows ref struct
to a generic parameter, it tells the compiler compiling the consumer that it may specify a ref struct
, and it tells the compiler compiling the type or method with the constraint that the generic instantiation might be a ref struct
and thus the generic parameter can only be used in situations where a ref struct
would be legal.

Of course, all of this working hinges on the supplied comparer sporting the appropriate
IAlternateEqualityComparer<TAlternate, T>
implementation; if it doesn’t, attempts to call GetAlternateLookup
will throw an exception, and attempts to call TryGetAlternateLookup
will return false
. You can use these collection types with whatever comparer you want, and that comparer can provide implementations of this interface for whatever alternate key types you want. But with string
and ReadOnlySpan<char>
being so common, it’d be a shame if there wasn’t built-in support for this combination. And indeed, with the aforementioned PRs, all of the built-in StringComparer
types implement IAlternateEqualityComparer<ReadOnlySpan<char>, string>
. That’s why the Dictionary<string, int> result = new(StringComparer.OrdinalIgnoreCase);
line is successful in the previous code example, as the subsequent call to result.GetAlternateLookup<ReadOnlySpan<char>>()
will successfully find the interface on the supplied comparer.
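To make the shape of such an implementation concrete, here’s a hypothetical comparer for byte[] keys (say, UTF8 text) that also supports ReadOnlySpan&lt;byte&gt; as an alternate key. The type and its name are mine for illustration; it’s a sketch of the pattern, not anything in the BCL:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer: byte[] is the real key type, ReadOnlySpan<byte> the alternate.
public sealed class Utf8BytesComparer :
    IEqualityComparer<byte[]>,
    IAlternateEqualityComparer<ReadOnlySpan<byte>, byte[]>
{
    public bool Equals(byte[]? x, byte[]? y) =>
        x is null ? y is null : y is not null && x.AsSpan().SequenceEqual(y);

    public int GetHashCode(byte[] obj) => GetHashCode((ReadOnlySpan<byte>)obj);

    public bool Equals(ReadOnlySpan<byte> alternate, byte[] other) =>
        alternate.SequenceEqual(other);

    // Must hash identically to GetHashCode(byte[]) for the same bytes.
    public int GetHashCode(ReadOnlySpan<byte> alternate)
    {
        HashCode hc = default;
        hc.AddBytes(alternate);
        return hc.ToHashCode();
    }

    // Invoked only when a real key must be materialized, e.g. on insert.
    public byte[] Create(ReadOnlySpan<byte> alternate) => alternate.ToArray();
}
```

A Dictionary&lt;byte[], int&gt; constructed with this comparer can then hand out a GetAlternateLookup&lt;ReadOnlySpan&lt;byte&gt;&gt;() and be queried with spans over received buffers, with a byte[] allocated only when a new entry is actually added.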
Code:
// dotnet run -c Release -f net9.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.RegularExpressions;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public partial class Tests
{
private static readonly string s_input = new HttpClient().GetStringAsync("https://gutenberg.org/cache/epub/2600/pg2600.txt").Result;
[GeneratedRegex(@"\b\w+\b")]
private static partial Regex WordParser();
[Benchmark(Baseline = true)]
public Dictionary<string, int> CountWords1()
{
ReadOnlySpan<char> input = s_input;
Dictionary<string, int> result = new(StringComparer.OrdinalIgnoreCase);
foreach (ValueMatch match in WordParser().EnumerateMatches(input))
{
ReadOnlySpan<char> word = input.Slice(match.Index, match.Length);
string key = word.ToString();
result[key] = result.TryGetValue(key, out int count) ? count + 1 : 1;
}
return result;
}
[Benchmark]
public Dictionary<string, int> CountWords2()
{
ReadOnlySpan<char> input = s_input;
Dictionary<string, int> result = new(StringComparer.OrdinalIgnoreCase);
Dictionary<string, int>.AlternateLookup<ReadOnlySpan<char>> alternate = result.GetAlternateLookup<ReadOnlySpan<char>>();
foreach (ValueMatch match in WordParser().EnumerateMatches(input))
{
ReadOnlySpan<char> word = input.Slice(match.Index, match.Length);
alternate[word] = alternate.TryGetValue(word, out int count) ? count + 1 : 1;
}
return result;
}
}
| Method | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|
| CountWords1 | 60.35 ms | 1.00 | 20.67 MB | 1.00 |
| CountWords2 | 57.40 ms | 0.95 | 2.54 MB | 0.12 |

Note the huge reduction in allocation.
For fun, we can also take this example one step further. .NET 6 introduced the
CollectionsMarshal.GetValueRefOrAddDefault
method, which returns a writable ref
to the actual location where the TValue
for a given TKey
is stored, creating the entry in the dictionary if it doesn’t exist. This is very handy for operations like the one used above, as it helps to avoid an extra dictionary lookup. Without it, we’re doing one lookup as part of the TryGetValue
and then another lookup as part of the setter, but with it, we just do the single lookup as part of GetValueRefOrAddDefault
and then no additional lookup is necessary because we already have the location into which we can directly write. And as the lookups in this benchmark are one of the more costly elements, eliminating half of them can significantly reduce the cost of the operation. As part of this alternate key effort, a new overload of GetValueRefOrAddDefault
was added that works with it, such that the same operation can be performed with a TAlternateKey
.
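Stripped of the benchmark harness, the counting pattern with the .NET 6 dictionary-keyed overload looks like this (a minimal, self-contained sketch):

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

var counts = new Dictionary<string, int>();
foreach (string word in new[] { "the", "cat", "the" })
{
    // One hash probe: returns a writable ref to the value slot,
    // adding the entry (with default(int), i.e. 0) if the key is new.
    CollectionsMarshal.GetValueRefOrAddDefault(counts, word, out _)++;
}

Console.WriteLine(counts["the"]); // 2
```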
Code:
// dotnet run -c Release -f net9.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.InteropServices;
using System.Text.RegularExpressions;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public partial class Tests
{
private static readonly string s_input = new HttpClient().GetStringAsync("https://gutenberg.org/cache/epub/2600/pg2600.txt").Result;
[GeneratedRegex(@"\b\w+\b")]
private static partial Regex WordParser();
[Benchmark(Baseline = true)]
public Dictionary<string, int> CountWords1()
{
ReadOnlySpan<char> input = s_input;
Dictionary<string, int> result = new(StringComparer.OrdinalIgnoreCase);
foreach (ValueMatch match in WordParser().EnumerateMatches(input))
{
ReadOnlySpan<char> word = input.Slice(match.Index, match.Length);
string key = word.ToString();
result[key] = result.TryGetValue(key, out int count) ? count + 1 : 1;
}
return result;
}
[Benchmark]
public Dictionary<string, int> CountWords2()
{
ReadOnlySpan<char> input = s_input;
Dictionary<string, int> result = new(StringComparer.OrdinalIgnoreCase);
Dictionary<string, int>.AlternateLookup<ReadOnlySpan<char>> alternate = result.GetAlternateLookup<ReadOnlySpan<char>>();
foreach (ValueMatch match in WordParser().EnumerateMatches(input))
{
ReadOnlySpan<char> word = input.Slice(match.Index, match.Length);
alternate[word] = alternate.TryGetValue(word, out int count) ? count + 1 : 1;
}
return result;
}
[Benchmark]
public Dictionary<string, int> CountWords3()
{
ReadOnlySpan<char> input = s_input;
Dictionary<string, int> result = new(StringComparer.OrdinalIgnoreCase);
Dictionary<string, int>.AlternateLookup<ReadOnlySpan<char>> alternate = result.GetAlternateLookup<ReadOnlySpan<char>>();
foreach (ValueMatch match in WordParser().EnumerateMatches(input))
{
ReadOnlySpan<char> word = input.Slice(match.Index, match.Length);
CollectionsMarshal.GetValueRefOrAddDefault(alternate, word, out _)++;
}
return result;
}
}
| Method | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|
| CountWords1 | 60.73 ms | 1.00 | 20.67 MB | 1.00 |
| CountWords2 | 54.01 ms | 0.89 | 2.54 MB | 0.12 |
| CountWords3 | 44.38 ms | 0.73 | 2.54 MB | 0.12 |

“But wait, there’s more!” dotnet/runtime#104202 extends the alternate comparer implementation for
string
/ReadOnlySpan<char>
further to also apply to EqualityComparer<string>.Default
, which means that if you don’t supply a comparer at all, these collection types will still support ReadOnlySpan<char>
lookups. That change not only then improves the usability of these new APIs, but it actually had an additional unintended but welcome performance benefit. Previously, EqualityComparer<string>.Default
would return an internal GenericEqualityComparer<string>
type, derived from EqualityComparer<string>
. It wouldn’t be possible to implement IAlternateEqualityComparer<ReadOnlySpan<char>, string>
on GenericEqualityComparer<string>
because doing so would actually have to be done on GenericEqualityComparer<T>
, which would mean every EqualityComparer<T>.Default
would support IAlternateEqualityComparer<ReadOnlySpan<char>, T>
, and we have no correct way of providing such an implementation for all T
s. Instead, we introduced a new internal non-generic StringEqualityComparer
type and made EqualityComparer<T>.Default
return an instance of that when T
is string
(the implementation of Default
already knows about and returns a bunch of specialized comparers, this is just one more). In doing so, it made the type that’s used non-generic, which in turn means that in some situations, it eliminates some of the overhead associated with generics.
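Concretely, that means span-based lookups now work even when no comparer was supplied at all (a small sketch; requires .NET 9):

```csharp
using System;
using System.Collections.Generic;

// No comparer passed: EqualityComparer<string>.Default is used, and in .NET 9
// it implements IAlternateEqualityComparer<ReadOnlySpan<char>, string>.
var populations = new Dictionary<string, int> { ["tokyo"] = 37_000_000 };

var lookup = populations.GetAlternateLookup<ReadOnlySpan<char>>();

ReadOnlySpan<char> key = "tokyo".AsSpan();
Console.WriteLine(lookup.TryGetValue(key, out int pop)); // True
```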
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private IEqualityComparer<string> _comparer = EqualityComparer<string>.Default;
private string[] _values = Enumerable.Range(0, 1000).Select(i => i.ToString()).ToArray();
[Benchmark]
public int Count() => CountEquals(_values, "500", _comparer);
[MethodImpl(MethodImplOptions.NoInlining)]
private static int CountEquals<T>(T[] haystack, T needle, IEqualityComparer<T> comparer)
{
int count = 0;
foreach (T value in haystack)
{
if (comparer.Equals(value, needle))
{
count++;
}
}
return count;
}
}
| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| Count | .NET 8.0 | 4.477 us | 1.00 |
| Count | .NET 9.0 | 2.808 us | 0.63 |

HashSet<T>
also gains all of these super powers, but several additional PRs went into making other performance improvements to it. dotnet/runtime#85877 from @hrrrrustic added a TrimExcess(int capacity)
method to HashSet<T>
(as well as to Queue<T>
and Stack<T>
), enabling more fine-grained control over how much memory to cull from a set that might have grown larger than is now required. And dotnet/runtime#102758 from @lilinus improved its IsSubsetOf
, IsProperSubsetOf
, and SetEquals
methods by tweaking the fast paths already present. The methods were attempting to early-exit as soon as the condition could be proved true
or false
, but some common conditions were being missed, and this rectified those.
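As a quick illustration of the new TrimExcess overload (a sketch; the exact capacity retained after trimming is an implementation detail, and this requires .NET 9):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Create a set with a large backing capacity, but only a few elements.
var set = new HashSet<int>(capacity: 100_000);
set.UnionWith(Enumerable.Range(0, 50));

// Previously TrimExcess() could only trim all the way down toward Count;
// the new overload trims to a chosen capacity, keeping headroom for growth.
set.TrimExcess(1_000);

Console.WriteLine(set.Count); // 50: the contents are unaffected, only capacity shrinks
```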
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private HashSet<int> _set = new(Enumerable.Range(0, 1000));
private List<int> _list = Enumerable.Range(0, 999).ToList();
[Benchmark]
public bool IsSubsetOf() => _set.IsSubsetOf(_list);
}
| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|
| IsSubsetOf | .NET 8.0 | 7,351.373 ns | 1.000 | 40 B | 1.00 |
| IsSubsetOf | .NET 9.0 | 1.216 ns | 0.000 | – | 0.00 |

dotnet/runtime#96573 from @ndsvw also identified a few places in various libraries where a
Dictionary<T, T>
was being used as a set and replaced them with HashSet<T>
. The implementations of Dictionary<>
and HashSet<>
are very close in nature, but the latter consumes less memory because it doesn’t need to store separate values. Using a Dictionary<T, T>
effectively doubles the required storage, so if a HashSet<T>
suffices, it’s preferable.

A variety of other collection types have also seen improvements in .NET 9:
PriorityQueue<TElement, TPriority>. The EnqueueRange(IEnumerable<TElement>, TPriority) method enables multiple elements to all be inserted at the same priority. If there are already elements in the collection, this is akin to just calling Enqueue
for each. However, if the collection is currently empty, then it can skip the per-element addition costs and instead just store the elements directly into the backing array. After doing so, it was then performing a heapify operation. But dotnet/runtime#99139 from @skyoxZ recognized that this heapify was entirely unnecessary, because all of the elements were inserted at the same priority, and there were no elements of any other priority already in the collection. Many performance optimizations come with trade-offs, making one common thing much faster at the expense of making some less common thing a little slower. This, however, is my favorite kind of optimization: elimination of unnecessary work with effectively zero downside.
Code:
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private PriorityQueue<int, int> _pq = new();
    private int[] _elements = Enumerable.Range(0, 100).ToArray();

    [Benchmark]
    public void EnqueueRange()
    {
        _pq.Clear();
        _pq.EnqueueRange(_elements, priority: 42);
    }
}
| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| EnqueueRange | .NET 8.0 | 239.3 ns | 1.00 |
| EnqueueRange | .NET 9.0 | 206.7 ns | 0.86 |

BitArray. Multiple methods on BitArray are already accelerated using Vector128 and Vector256, enabling much faster throughput for methods like And, Or, and Not. dotnet/runtime#91903 from @khushal1996 adds Vector512 support to all of these as well, enabling hardware with AVX512 support to process these operations upwards of twice as fast as before.
Code:
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Collections;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private BitArray _first = new BitArray(1024 * 1024);
    private BitArray _second = new BitArray(1024 * 1024);

    [Benchmark]
    public void Or() => _first.Or(_second);
}
| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| Or | .NET 8.0 | 2.894 us | 1.00 |
| Or | .NET 9.0 | 2.354 us | 0.81 |

List<T>. dotnet/runtime#90089 from @karakasa avoided an extra Array.Copy call as part of Insert. Previously the implementation may have done up to three Array.Copy operations, and as part of this change that can drop to just two.

FrozenDictionary<TKey, TValue> and FrozenSet<T>. These frozen collections introduced in .NET 8 have also received some attention in .NET 9. As a reminder, FrozenDictionary<TKey, TValue> and FrozenSet<T> are immutable collections optimized for reading, willing to spend more time and effort during construction to make subsequent operations on the collections as fast as possible. When the TKey/T is a string, one optimization employed is to track the minimum and maximum lengths of all strings in the collection; if a string that’s shorter or longer than that is used in a lookup, the collection can immediately report that it’s not in the collection without having to actually perform any lookup, instead just comparing against the min and max. dotnet/runtime#92546 from @andrewjsaid extends this further by employing a bitmap of up to 64 bits corresponding to lengths of strings contained in the collection. On lookup, rather than only comparing against min/max, the implementation can test whether the corresponding bit for the string’s length is set, bailing immediately if it’s not. dotnet/runtime#100998 also reduced creation overheads with frozen collections created with string keys and StringComparer.OrdinalIgnoreCase. The implementation had been using its own custom comparison logic for hash code generation, in order to support building for netstandard2.0 in addition to .NET Core, but this PR specialized the code for .NET Core to use string.GetHashCode(ReadOnlySpan<char>, StringComparison), which is more efficient than the custom implementation.
Code:
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Collections.Frozen;
using System.Text.RegularExpressions;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private static readonly FrozenSet<string> s_words = Regex.Matches("""
        Let me not to the marriage of true minds
        Admit impediments; love is not love
        Which alters when it alteration finds,
        Or bends with the remover to remove.
        O no, it is an ever-fixed mark
        That looks on tempests and is never shaken;
        It is the star to every wand'ring bark
        Whose worth's unknown, although his height be taken.
        Love's not time's fool, though rosy lips and cheeks
        Within his bending sickle's compass come.
        Love alters not with his brief hours and weeks,
        But bears it out even to the edge of doom:
        If this be error and upon me proved,
        I never writ, nor no man ever loved.
        """, @"\b\w+\b").Cast<Match>().Select(w => w.Value).ToFrozenSet();

    private string _word = "quickness";

    [GlobalSetup]
    public void Setup() => Console.WriteLine(s_words);

    [Benchmark]
    public bool Contains() => s_words.Contains(_word);
}
| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| Contains | .NET 8.0 | 4.373 ns | 1.00 |
| Contains | .NET 9.0 | 1.154 ns | 0.26 |
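As an aside, the length-bitmap trick described above for the frozen collections is worth knowing in its own right. Here’s a standalone sketch of the idea, with names and shape of my own choosing rather than the actual FrozenSet internals:

```csharp
using System;
using System.Linq;

string[] keys = { "cat", "horse", "elephant" };
int minLength = keys.Min(k => k.Length);

// Bit i records whether any key has length (minLength + i), for offsets < 64.
// (This sketch assumes all key lengths fit in the 64-bit window; a real
// implementation would need to handle lengths outside it conservatively.)
ulong lengthFilter = 0;
foreach (string key in keys)
{
    int offset = key.Length - minLength;
    if (offset < 64) lengthFilter |= 1UL << offset;
}

// A lookup can now bail before doing any hashing if no key of this length exists.
bool MayContain(string candidate)
{
    int offset = candidate.Length - minLength;
    return (uint)offset < 64 && (lengthFilter & (1UL << offset)) != 0;
}

Console.WriteLine(MayContain("dog"));   // True: some key has length 3
Console.WriteLine(MayContain("mouse")); // True: length 5 is present
Console.WriteLine(MayContain("ox"));    // False: no key of length 2
```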
Compression
It’s an important goal of the core .NET libraries to be as platform-agnostic as possible. Things should generally behave the same way regardless of which operating system or which hardware is being used, excepting things that really are operating system or hardware specific (e.g. we purposefully don’t try to paper over casing differences of different file systems). To that end, we generally implement as much as possible in C#, deferring down to the operating system and native platform libraries only when necessary; for example, the default .NET HTTP implementation,
System.Net.Http.SocketsHttpHandler
, is written in C# on top of System.Net.Sockets
, System.Net.Dns
, etc., and subject to the implementation of sockets on each platform (where behaviors are implemented by the operating system), generally behaves the same wherever you’re running.

There are, however, just a few specific places where we’ve actively made the choice to defer more to something in the platform. The most important case here is cryptography, where we want to rely on the operating system for such security-related functionality; so on Windows, for example, TLS is implemented in terms of components like
SChannel
, on Linux in terms of OpenSSL
, and on macOS in terms of SecureTransport
. The other notable case has been compression, and in particular zlib
. We decided long ago to simply use whatever zlib
was distributed with the operating system. That has had various implications, however. Starting with the fact that Windows doesn’t ship with zlib
as a library exposed for consumption, so the .NET build targeting Windows still had to include its own copy of zlib
. That was then improved but also complicated by a decision to switch to distribute a variant of zlib
produced by Intel, which was nicely optimized further for x64, but which didn’t have as much attention paid to other hardware, like Arm64. And very recently, the intel/zlib
repository was archived and is not actively being maintained by Intel.

To simplify things, to improve consistency and performance across more platforms, and to move to an actively supported and evolving implementation, this changes for .NET 9. Thanks to a stream of PRs, and in particular dotnet/runtime#104454 and dotnet/runtime#105771, .NET 9 now includes the
zlib
functionality built-in across Windows, Linux, and macOS, based on the newer zlib-ng/zlib-ng
. zlib-ng
is a zlib
-compatible API that is actively maintained, includes improvements previously made to both the Intel and Cloudflare forks, and has received improvements across many different CPU intrinsics.

Benchmarking just throughput is easy with BenchmarkDotNet. Unfortunately, while I love the tool, the lack of dotnet/BenchmarkDotNet#784 makes it very challenging to appropriately benchmark compression, because throughput is only one part of the equation. Compression ratio is also key (you can make “compression” super fast by just outputting the input without actually manipulating it at all), so we also need to know about compressed output size when discussing compression speeds. To do that for this post, I’ve hacked up just enough in this benchmark to make it work for this example, implementing a custom column for BenchmarkDotNet, but please note this is not a general-purpose implementation.
Code:
// dotnet run -c Release -f net8.0 --filter "*"
// dotnet run -c Release -f net9.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Columns;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Reports;
using BenchmarkDotNet.Running;
using System.IO.Compression;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(
args,
DefaultConfig.Instance.AddColumn(new CompressedSizeColumn()));
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private byte[] _uncompressed = new HttpClient().GetByteArrayAsync(@"https://www.gutenberg.org/cache/epub/3200/pg3200.txt").Result;
[Params(CompressionLevel.NoCompression, CompressionLevel.Fastest, CompressionLevel.Optimal, CompressionLevel.SmallestSize)]
public CompressionLevel Level { get; set; }
private MemoryStream _compressed = new MemoryStream();
private long _compressedSize;
[Benchmark]
public void Compress()
{
_compressed.Position = 0;
_compressed.SetLength(0);
using (var ds = new DeflateStream(_compressed, Level, leaveOpen: true))
{
ds.Write(_uncompressed, 0, _uncompressed.Length);
}
_compressedSize = _compressed.Length;
}
[GlobalCleanup]
public void SaveSize()
{
File.WriteAllText(Path.Combine(Path.GetTempPath(), $"Compress_{Level}"), _compressedSize.ToString());
}
}
public class CompressedSizeColumn : IColumn
{
public string Id => nameof(CompressedSizeColumn);
public string ColumnName { get; } = "CompressedSize";
public bool AlwaysShow => true;
public ColumnCategory Category => ColumnCategory.Custom;
public int PriorityInCategory => 1;
public bool IsNumeric => true;
public UnitType UnitType { get; } = UnitType.Size;
public string Legend => "CompressedSize Bytes";
public bool IsAvailable(Summary summary) => true;
public bool IsDefault(Summary summary, BenchmarkCase benchmarkCase) => true;
public string GetValue(Summary summary, BenchmarkCase benchmarkCase, SummaryStyle style) =>
GetValue(summary, benchmarkCase);
public string GetValue(Summary summary, BenchmarkCase benchmarkCase) =>
File.ReadAllText(Path.Combine(Path.GetTempPath(), $"Compress_{benchmarkCase.Parameters.Items[0].Value}")).Trim();
}
Running that for .NET 8, I get this:
| Method | Level | Mean | CompressedSize |
|---|---|---|---|
| Compress | NoCompression | 1.783 ms | 16015049 |
| Compress | Fastest | 164.495 ms | 7312367 |
| Compress | Optimal | 620.987 ms | 6235314 |
| Compress | SmallestSize | 867.076 ms | 6208245 |

and for .NET 9, I get this:
| Method | Level | Mean | CompressedSize |
|---|---|---|---|
| Compress | NoCompression | 1.814 ms | 16015049 |
| Compress | Fastest | 64.345 ms | 9578398 |
| Compress | Optimal | 230.646 ms | 6276158 |
| Compress | SmallestSize | 567.579 ms | 6215048 |

A few interesting things to note here:
- On both .NET 8 and .NET 9, there’s an obvious correlation: the more compression is requested, the slower it gets and the smaller the file size becomes.
NoCompression
, which really just echoes the input bytes back as output, produces the exact same compressed size across .NET 8 and .NET 9, as one would hope; the compressed size should be identical to the input size.
- The compressed size for
SmallestSize
is almost the same between .NET 8 and .NET 9; they differ by only ~0.1%, but for that small increase, the SmallestSize
throughput ends up being ~35% faster. In both cases, the .NET layer is just passing down a zlib compression level of 9, which is the largest value possible and denotes best-possible compression. It just happens that with zlib-ng
, that best possible compression is significantly faster and just a tad bit worse compression-ratio-wise than with zlib
.
- For
Optimal
, which is also the default and represents a balanced tradeoff between speed and compression ratio (with 20/20 hindsight, the name for this member should have been Balanced
), the .NET 9 version using zlib-ng
is 60% faster while only sacrificing ~0.6% on compression ratio. Fastest
is interesting. The .NET implementation is just passing down a compression level of 1 to thezlib-ng
native code, indicating to choose the fastest speed while still doing some compression (0 means don’t compress at all). But thezlib-ng
implementation is obviously making different trade-offs than did the olderzlib
code, as it’s truer to its name: it’s running more than 2x as fast and still compressing, but the compressed output is ~30% larger than the output on .NET 8.
The net effect of this is, especially if you’re using
Fastest
, you might want to re-evaluate to see whether the throughput / compression ratios meet your needs. If you want to tweak it further, though, you’re no longer limited to just these options. dotnet/runtime#105430 adds new constructors to DeflateStream
, GZipStream
, ZLibStream
, and also the unrelated BrotliStream
, enabling more fine-grained control over the parameters passed to the native implementations, e.g.
Code:
private static readonly ZLibCompressionOptions s_options = new ZLibCompressionOptions()
{
CompressionLevel = 2,
};
...
Stream sourceStream = ...;
using (var ds = new DeflateStream(compressed, s_options, leaveOpen: true))
{
    sourceStream.CopyTo(ds);
}
Cryptography
Investments in
System.Security.Cryptography
are generally focused on improving the security of a system, supporting new cryptographic primitives, better integrating with security capabilities of the underlying operating system, and so on. But as cryptography is ever present in most modern systems, it’s also impactful to make the existing functionality more efficient, and a variety of PRs in .NET 9 have done so.

Let’s start with random number generation. .NET 8 added a new
GetItems
method to both Random
(the core non-cryptographically-secure random number generator) and RandomNumberGenerator
(the core cryptographically-secure random number generator). This method is very handy when you need to randomly generate N elements sourced from a specific set of values. For example, if you wanted to write 100 random hex characters to a destination Span<char>
, you could do:
Code:
Span<char> dest = stackalloc char[100];
Random.Shared.GetItems("0123456789abcdef", dest);
The core implementation is very simple, just a convenience for something you could easily do yourself:
Code:
for (int i = 0; i < dest.Length; i++)
{
dest[i] = choices[Next(choices.Length)];
}
Easy peasy. However, in some situations we can do better. This implementation ends up making a call to the random number generator for each element, and that roundtrip adds measurable overhead. If we could instead make fewer calls, we could amortize that overhead across however many elements could be filled by that single call. That’s exactly what dotnet/runtime#92229 does. If the number of choices is less than or equal to 256 and a power of two, rather than asking for a random integer for each element, we can instead get a byte for each element, and we can do that in bulk with a single call to NextBytes. The max of 256 choices is because that’s the number of values a byte can represent, and the power of two is so that we can simply mask off unnecessary bits from the byte, which helps to avoid bias. This makes a measurable impact for
Random
, but even more so for RandomNumberGenerator
, where each call to get random bytes requires a transition into the operating system.
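The bulk-fill idea can be sketched like this (an illustrative re-implementation, not the actual BCL code; FillFromChoices is a name of my own choosing):

```csharp
using System;
using System.Diagnostics;

static void FillFromChoices(Random rng, ReadOnlySpan<char> choices, Span<char> destination)
{
    // Precondition for this fast path: 1-256 choices, and the count is a power of two.
    Debug.Assert(choices.Length is > 0 and <= 256 && int.IsPow2(choices.Length));

    int mask = choices.Length - 1;

    // One call to the generator instead of one call per element.
    Span<byte> random = destination.Length <= 512
        ? stackalloc byte[destination.Length]
        : new byte[destination.Length];
    rng.NextBytes(random);

    for (int i = 0; i < destination.Length; i++)
    {
        // Masking the high bits keeps the index in range without introducing
        // modulo bias, because the choice count evenly divides 256.
        destination[i] = choices[random[i] & mask];
    }
}

Span<char> dest = stackalloc char[100];
FillFromChoices(Random.Shared, "0123456789abcdef", dest);
Console.WriteLine(dest.ToString());
```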
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Security.Cryptography;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private char[] _dest = new char[100];
[Benchmark]
public void GetRandomHex() => RandomNumberGenerator.GetItems<char>("0123456789abcdef", _dest);
}
| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| GetRandomHex | .NET 8.0 | 58,659.2 ns | 1.00 |
| GetRandomHex | .NET 9.0 | 746.5 ns | 0.01 |

Sometimes performance improvements are about revisiting past assumptions. .NET 5 added a new
GC.AllocateArray
method which optionally allows that array to be created on the “pinned object heap,” or POH. Allocating on the POH is the same as allocating normally, except that the GC guarantees that objects on the POH won’t be moved (normally the GC is free to compact the heap, moving objects around in order to reduce fragmentation). This is a useful guarantee for cryptography, which employs defense-in-depth measures like zeroing out buffers to reduce the chances of an attacker being able to find sensitive information in the memory (or memory dump) of a process. The crypto library wants to be able to allocate some memory, use it to temporarily contain some sensitive information, and then zero out the memory before stopping using it, but if the GC is able to move the object around in the interim, it could end up leaving shadows of the data on the heap. When the POH was introduced, then, System.Security.Cryptography
started using it, including for relatively short-lived objects. This is potentially problematic, however. Because the nature of the POH is that objects can’t be moved around, creating short-lived objects on the POH can significantly increase fragmentation, which can in turn increase memory consumption, GC costs, and so on. And as a result, the POH is really only recommended for long-lived objects, ideally ones that you create and then hold onto for the remainder of the process’ lifetime. dotnet/runtime#99168 undid System.Security.Cryptography
‘s reliance on the POH, instead preferring to use native memory (e.g. via NativeMemory.Alloc
and NativeMemory.Free
) for such needs.On the subject of memory, multiple PRs went into the crypto libraries to reduce allocation. Here are some examples:
- Marshaling pointers instead of temporary arrays. The `CngKey` type exposes properties like `ExportPolicy`, `IsMachineKey`, and `KeyUsage`, all of which utilize an internal `GetPropertyAsDword` method that P/Invokes to retrieve an integer from Windows. It was doing so, however, via a shared helper that was allocating a 4-byte `byte[]`, passing that to the OS to fill, and then converting those four bytes into an `int`. dotnet/runtime#91521 changed the interop path to instead just store the `int` on the stack, passing a pointer to it to the OS, avoiding the need to allocate and parse.
- Special-casing empty. Throughout the core libraries, we rely heavily on `Array.Empty<T>()` to avoid allocating lots of empty arrays when we could instead just employ singletons. The crypto libraries work with a lot of arrays, and as part of defense-in-depth, will often hand out clones of those arrays rather than handing out the same array to everyone; that’s handled by a shared `CloneByteArray` helper. As it turns out, however, it’s reasonably common for arrays to be empty, yet `CloneByteArray` wasn’t special-casing them, and was thus always allocating new arrays even if the input was empty. dotnet/runtime#93231 simply special-cased empty input arrays to return themselves rather than clone them.
- Avoiding unnecessary defensive copies. dotnet/runtime#97108 avoids more defensive copies than just those for empty arrays mentioned above. The `PublicKey` type is passed two `AsnEncodedData` instances, one for parameters and one for a key value, both of which it clones to avoid any issues that might arise from the provided instances being mutated. But in some internal uses, the caller is constructing a temporary `AsnEncodedData` and effectively transferring ownership, yet `PublicKey` would still make a defensive copy, even though the temporary could have just been used in its stead. This change enables the original instances to be used without a copy in such cases.
- Using collection expressions with spans. One of the really neat things about the collection expressions feature introduced in C# 12 is that it allows you to express your intent for what you want and let the system implement that as best it can. As part of initializing `OidLookup`, it had multiple lines that looked like this:
  Code:
  AddEntry("1.2.840.10045.3.1.7", "ECDSA_P256", new[] { "nistP256", "secP256r1", "x962P256v1", "ECDH_P256" });
  AddEntry("1.3.132.0.34", "ECDSA_P384", new[] { "nistP384", "secP384r1", "ECDH_P384" });
  AddEntry("1.3.132.0.35", "ECDSA_P521", new[] { "nistP521", "secP521r1", "ECDH_P521" });
  This effectively forced it to allocate these arrays, even though the `AddEntry` method doesn’t actually require the array-ness and just iterates through the supplied values. dotnet/runtime#100252 changed `AddEntry` to take a `ReadOnlySpan<string>` instead of a `string[]`, and changed all the call sites to be collection expressions:
  Code:
  AddEntry("1.2.840.10045.3.1.7", "ECDSA_P256", ["nistP256", "secP256r1", "x962P256v1", "ECDH_P256"]);
  AddEntry("1.3.132.0.34", "ECDSA_P384", ["nistP384", "secP384r1", "ECDH_P384"]);
  AddEntry("1.3.132.0.35", "ECDSA_P521", ["nistP521", "secP521r1", "ECDH_P521"]);
  allowing the compiler to do the “right thing.” All of those call sites then instead just end up using stack space to store the strings passed to `AddEntry`, rather than allocating any arrays at all.
- Presizing collections. Many collections, such as `List<T>` or `Dictionary<TKey, TValue>`, allow you to create a new one with no a priori knowledge of how large it’ll grow to be, and internally they handle growing their storage to accommodate additional data. The growth algorithm employed typically involves doubling capacity, as doing so strikes a reasonable balance between possibly wasting some memory and not having to re-grow too frequently. However, such growing does have overhead, avoiding it is desirable, and so many collections offer the ability to pre-size the capacity of the collection; e.g. `List<T>` has a constructor that accepts an `int capacity`, where the list will immediately create a backing store large enough to accommodate that many elements. The `OidCollection` in cryptography didn’t have such a capability even though many of the places it was being created did know the exact required size, which in turn resulted in unnecessary allocation and copying as the collection grew to reach the target size. dotnet/runtime#97106 added such a constructor internally and used it in various places, in order to avoid that overhead. As with `OidCollection`, `CborWriter` also lacked the ability to presize, making the aforementioned growth-algorithm problem even more stark. dotnet/runtime#92538 added such a constructor.
- Avoiding `O(N^2)` growth algorithms. dotnet/runtime#92435 from @MichalPetryka fixes a good example of what happens when you don’t employ such a doubling scheme as part of collection resizing. The algorithm used to grow the buffer used by `CborWriter` would increase the backing buffer by a fixed number of elements each time. A doubling strategy ensures you need no more than `O(log N)` growth operations, and ensures that `N` items can be added to a collection in `O(N)` time, since the number of element copies will be `O(2N)`, which is just `O(N)` (e.g. if N == 128 and you start with a buffer of size 1, you grow to 2, then 4, 8, 16, 32, 64, and 128; that’s 1 + 2 + 4 + 8 + 16 + 32 + 64 + 128, which is 255, just under twice N). But increasing by a fixed number can mean `O(N)` such operations. And since each growth operation also needs to copy all the elements (assuming the growing is done by array resizing), that makes the algorithm `O(N^2)`. In the extreme, if that fixed number was 1 and we were again growing from 1 to 128 one element at a time, that’s just summing all the numbers from 1 to 128, the formula for which is `N(N+1)/2`, which is `O(N^2)`. This PR changed `CborWriter`’s growth strategy to use doubling instead.
Code:
// Add a <PackageReference Include="System.Formats.Cbor" Version="8.0.0" /> to the csproj.
// dotnet run -c Release -f net8.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;
using System.Formats.Cbor;

var config = DefaultConfig.Instance
    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core80).WithNuGet("System.Formats.Cbor", "8.0.0").AsBaseline())
    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core90).WithNuGet("System.Formats.Cbor", "9.0.0-rc.1.24431.7"));

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);

[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "NuGetReferences")]
public class Tests
{
    [Benchmark]
    public CborWriter Test()
    {
        const int NumArrayElements = 100_000;
        CborWriter writer = new();
        writer.WriteStartArray(NumArrayElements);
        for (int i = 0; i < NumArrayElements; i++)
        {
            writer.WriteInt32(i);
        }
        writer.WriteEndArray();
        return writer;
    }
}
| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|
| Test | .NET 8.0 | 25,185.2 us | 1.00 | 65350.11 KB | 1.00 |
| Test | .NET 9.0 | 697.2 us | 0.03 | 1023.82 KB | 0.02 |
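The difference between the two growth strategies is easy to see by counting how many element copies each performs; here's a small standalone sketch (not `CborWriter`'s actual code):

```csharp
using System;

public static class GrowthCostDemo
{
    // Counts how many existing elements get copied while appending n items,
    // given a growth function mapping the old capacity to the new one.
    public static long CopiesWithGrowth(int n, Func<int, int> grow)
    {
        long copies = 0;
        int capacity = 1;
        for (int count = 0; count < n; count++)
        {
            if (count == capacity)
            {
                copies += capacity;   // a resize copies every existing element
                capacity = grow(capacity);
            }
        }
        return copies;
    }

    private static void Main()
    {
        // Doubling: 1+2+4+...+64 copies across 128 appends.
        Console.WriteLine(CopiesWithGrowth(128, c => c * 2)); // 127
        // Fixed +1 increment: 1+2+...+127 copies, i.e. O(N^2) behavior.
        Console.WriteLine(CopiesWithGrowth(128, c => c + 1)); // 8128
    }
}
```

Even at N = 128, doubling performs 127 copies while a fixed increment of 1 performs 8,128; the gap widens quadratically as N grows.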
Of course, improving performance is more than just avoiding allocation. A variety of changes helped in other ways.
dotnet/runtime#99053 “memoizes” (caches) various properties on `CngKey` that are accessed multiple times but where the answer stays the same from call to call; it does so simply by adding a few fields to the type to cache these values, which is a significant win if any is accessed multiple times over the lifetime of the object. The affected properties (`Algorithm`, `AlgorithmGroup`, and `Provider`) are particularly expensive because the OS implementation of these functions needs to make a remote procedure call to another Windows process to access the relevant data.
Code:
// Windows-only test.
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Security.Cryptography;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private RSACng _rsa = new RSACng(2048);
[GlobalCleanup]
public void Cleanup() => _rsa.Dispose();
[Benchmark]
public CngAlgorithm GetAlgorithm() => _rsa.Key.Algorithm;
[Benchmark]
public CngAlgorithmGroup? GetAlgorithmGroup() => _rsa.Key.AlgorithmGroup;
[Benchmark]
public CngProvider? GetProvider() => _rsa.Key.Provider;
}
| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|
| GetAlgorithm | .NET 8.0 | 63,619.352 ns | 1.000 | 88 B | 1.00 |
| GetAlgorithm | .NET 9.0 | 10.216 ns | 0.000 | – | 0.00 |
| GetAlgorithmGroup | .NET 8.0 | 62,580.363 ns | 1.000 | 88 B | 1.00 |
| GetAlgorithmGroup | .NET 9.0 | 8.354 ns | 0.000 | – | 0.00 |
| GetProvider | .NET 8.0 | 62,108.489 ns | 1.000 | 232 B | 1.00 |
| GetProvider | .NET 9.0 | 8.393 ns | 0.000 | – | 0.00 |

There were also several improvements related to loading certificates and keys. dotnet/runtime#97267 from @birojnayak addressed an issue on Linux where the same certificate was being processed multiple times rather than just once, and dotnet/runtime#97827 improved the performance of RSA key loading by avoiding some unnecessary work that the key validation was performing.
Networking
Quick, when was the last time you worked on a real application or service that didn’t involve networking at all? I’ll wait… (I’m so funny.) Effectively every modern application relies on networking in one way, shape, or form, especially one that’s following more cloud-native architectures, involving microservices, and the like. Driving down the costs associated with networking is something we take very seriously, and the .NET community whittles away at these costs every release, including in .NET 9.
`SslStream` has been a key focus for performance optimization in past releases. It’s used by a significant portion of traffic with both `HttpClient` and the ASP.NET Kestrel web server, putting it on the hot path for many systems. Previous improvements have targeted both steady-state throughput as well as creation overhead.

In .NET 9, a few PRs focused on steady-state throughput, such as dotnet/runtime#95595, which addressed an issue where some packets were being unnecessarily split into two, leading to extra overhead associated with needing to send and receive that extra packet. This was particularly impactful when writing out exactly 16K, and especially on Windows (where I’ve run this test):
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Net;
using System.Net.Security;
using System.Net.Sockets;
using System.Runtime.InteropServices;
using System.Security.Authentication;
using System.Security.Cryptography.X509Certificates;
using System.Security.Cryptography;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private SslStream _client, _server;
private byte[] _buffer = new byte[16 * 1024];
private readonly SslServerAuthenticationOptions _serverOptions = new SslServerAuthenticationOptions
{
ServerCertificateContext = SslStreamCertificateContext.Create(GetCertificate(), null),
EnabledSslProtocols = SslProtocols.Tls13,
};
private readonly SslClientAuthenticationOptions _clientOptions = new SslClientAuthenticationOptions
{
TargetHost = "localhost",
RemoteCertificateValidationCallback = delegate { return true; },
EnabledSslProtocols = SslProtocols.Tls13,
};
[GlobalSetup]
public async Task Setup()
{
using var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));
listener.Listen(1);
var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
client.Connect(listener.LocalEndPoint!);
Socket serverSocket = listener.Accept();
serverSocket.NoDelay = true;
_client = new SslStream(new NetworkStream(client, ownsSocket: true), leaveInnerStreamOpen: true);
_server = new SslStream(new NetworkStream(serverSocket, ownsSocket: true), leaveInnerStreamOpen: true);
await Task.WhenAll(
_client.AuthenticateAsClientAsync(_clientOptions),
_server.AuthenticateAsServerAsync(_serverOptions));
}
[GlobalCleanup]
public void Cleanup()
{
_client.Dispose();
_server.Dispose();
}
[Benchmark]
public async Task SendReceive()
{
await _client.WriteAsync(_buffer);
await _server.ReadExactlyAsync(_buffer);
}
private static X509Certificate2 GetCertificate()
{
X509Certificate2 cert;
using (RSA rsa = RSA.Create())
{
var certReq = new CertificateRequest("CN=localhost", rsa, HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);
certReq.CertificateExtensions.Add(new X509BasicConstraintsExtension(false, false, 0, false));
certReq.CertificateExtensions.Add(new X509EnhancedKeyUsageExtension(new OidCollection { new Oid("1.3.6.1.5.5.7.3.1") }, false));
certReq.CertificateExtensions.Add(new X509KeyUsageExtension(X509KeyUsageFlags.DigitalSignature, false));
cert = certReq.CreateSelfSigned(DateTimeOffset.UtcNow.AddMonths(-1), DateTimeOffset.UtcNow.AddMonths(1));
if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
{
cert = new X509Certificate2(cert.Export(X509ContentType.Pfx));
}
}
return cert;
}
}
| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| SendReceive | .NET 8.0 | 43.07 us | 1.00 |
| SendReceive | .NET 9.0 | 29.38 us | 0.68 |

dotnet/runtime#100513 also reduced the cost of checking `SslStream.IsMutuallyAuthenticated` or `SslStream.LocalCertificate` from a client when a client certificate is being used.

However, the bigger impacts in .NET 9 weren’t on steady-state throughput but rather on TLS connection establishment, aka the handshake. Establishing a TLS connection requires the client and server to engage in a conversation where they agree on details like the TLS version and cipher suite to use, confirm the other is who they say they are, and create and exchange dedicated symmetric keys for the communication. That’s a relatively expensive endeavor. For long-lived connections, that overhead is generally not a big deal, but there are plenty of scenarios where connections are more routinely established and torn down, and for those, we want to drive down the overhead associated with setting up such a connection.
dotnet/runtime#87874 focused on reducing allocations associated with the TLS handshake, by renting some buffers from `ArrayPool<byte>` rather than always allocating. And dotnet/runtime#97348 continued the effort by avoiding some unnecessary `SafeHandle` allocation. dotnet/runtime#103814 also switched a rarely-needed `ConcurrentDictionary<,>` in the Linux implementation to be lazily allocated rather than always being allocated as part of the handshake. These changes combine to significantly reduce the allocation incurred as part of setting up TLS:
Code:
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Net;
using System.Net.Security;
using System.Net.Sockets;
using System.Runtime.InteropServices;
using System.Security.Authentication;
using System.Security.Cryptography.X509Certificates;
using System.Security.Cryptography;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private NetworkStream _client, _server;
private readonly SslServerAuthenticationOptions _serverOptions = new SslServerAuthenticationOptions
{
ServerCertificateContext = SslStreamCertificateContext.Create(GetCertificate(), null),
EnabledSslProtocols = SslProtocols.Tls13,
};
private readonly SslClientAuthenticationOptions _clientOptions = new SslClientAuthenticationOptions
{
TargetHost = "localhost",
RemoteCertificateValidationCallback = delegate { return true; },
EnabledSslProtocols = SslProtocols.Tls13,
};
[GlobalSetup]
public void Setup()
{
using var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));
listener.Listen(1);
var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
client.Connect(listener.LocalEndPoint!);
Socket serverSocket = listener.Accept();
serverSocket.NoDelay = true;
_server = new NetworkStream(serverSocket, ownsSocket: true);
_client = new NetworkStream(client, ownsSocket: true);
}
[GlobalCleanup]
public void Cleanup()
{
_client.Dispose();
_server.Dispose();
}
[Benchmark]
public async Task Handshake()
{
using var client = new SslStream(_client, leaveInnerStreamOpen: true);
using var server = new SslStream(_server, leaveInnerStreamOpen: true);
await Task.WhenAll(
client.AuthenticateAsClientAsync(_clientOptions),
server.AuthenticateAsServerAsync(_serverOptions));
}
private static X509Certificate2 GetCertificate()
{
X509Certificate2 cert;
using (RSA rsa = RSA.Create())
{
var certReq = new CertificateRequest("CN=localhost", rsa, HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);
certReq.CertificateExtensions.Add(new X509BasicConstraintsExtension(false, false, 0, false));
certReq.CertificateExtensions.Add(new X509EnhancedKeyUsageExtension(new OidCollection { new Oid("1.3.6.1.5.5.7.3.1") }, false));
certReq.CertificateExtensions.Add(new X509KeyUsageExtension(X509KeyUsageFlags.DigitalSignature, false));
cert = certReq.CreateSelfSigned(DateTimeOffset.UtcNow.AddMonths(-1), DateTimeOffset.UtcNow.AddMonths(1));
if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
{
cert = new X509Certificate2(cert.Export(X509ContentType.Pfx));
}
}
return cert;
}
}
| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|
| Handshake | .NET 8.0 | 2.652 ms | 1.00 | 5.03 KB | 1.00 |
| Handshake | .NET 9.0 | 2.581 ms | 0.97 | 3.3 KB | 0.66 |

Of course, while driving down the costs of doing something is good, avoiding that thing altogether is even better. “TLS resumption” is a capability in the TLS protocol where, if a connection is closed and the same client later opens a new connection to the same server, the client may be able to effectively pick up where it left off with the previous TLS connection rather than starting a brand new one from scratch. Support for TLS resumption on Linux was added in .NET 7, but clients using client certificates weren’t supported… now, in .NET 9, thanks to dotnet/runtime#102656, even such clients can benefit from this significant time saver.
TLS resumption is an optimization where information is stored to enable more efficient operation later. In some ways, it’s not unlike pooling in that regard. We frequently talk about pooling as a way to optimize. Often our conversations are around avoiding allocations, where employing a pool is betting that you can be more efficient than the garbage collector. For small, cheap-to-create objects, that’s often a bad bet. For larger objects, such as larger arrays, it can be a good bet, which is why `ArrayPool<T>` exists and is used throughout the core libraries for getting temporary buffers (see this Deep .NET discussion on `ArrayPool` for more info). But there’s a much more impactful class of cases where pooling is useful, and that’s where the thing being pooled is really expensive to create. Such cases are no longer about memory management; they’re about amortizing the cost of that creation. And out of all of the pooling done throughout the core libraries, it’s hard to imagine a more impactful case than the HTTP connection pool. The objects in this pool represent established connections to an HTTP server, and the time to establish such a connection can be measured in milliseconds, or even seconds in certain environments. If such costs had to be paid every single time you were making an HTTP request, it would add huge latency throughout the system. Instead, outgoing HTTP requests try to grab a connection from the connection pool, reusing that connection for the individual request/response, and then putting the connection back into the pool when done.

However, as with any pool, the pool itself has cost. In the case of the HTTP connection pool in a `SocketsHttpHandler` instance, the most important factor impacting performance is how quickly a connection can be rented and returned to the pool, especially when under load. That load aspect is important, because as a shared resource, access to this pool must be synchronized in order to ensure the correctness of the system: it’d be really bad, for example, if two requests went to rent a connection at the same time and ended up incorrectly being given the same connection to use, concurrently. “Really bad” in such a case could mean not only corrupting data, but possibly even sending the wrong data to the wrong server. That obviously needs to be avoided. So, synchronization is used, but that synchronization creates a bottleneck, where under load lots of requests can end up being blocked just waiting to check whether a connection is even available. Over the years we’ve whittled away at that cost, but it gets even lower in .NET 9, in particular for HTTP/1.1 connections. (We talk about “the” pool, but in reality connections are only pooled together when they’re interchangeable, so there are actually many groupings of connections, for example with HTTP/1.1 connections separate from HTTP/2 and HTTP/3 connections, a separate pool for each endpoint, etc.) dotnet/runtime#99364 changes the synchronization mechanism from a pure lock-based scheme to a more opportunistic concurrency scheme that employs a first layer of lockless synchronization. There’s still a lock, but the hot path avoids it as long as there are connections in the pool, by using a `ConcurrentStack<T>`, such that renting is a `TryPop` and returning is a `Push`. `ConcurrentStack<T>` itself uses a lock-free algorithm that’s a lot more scalable than a lock. There is an interesting downside to `ConcurrentStack<T>`, which is that the algorithm it employs necessarily involves an allocation per pushed element, and for reasons related to the “ABA” problem, those allocations can’t be pooled. That means that every time a connection is returned to the pool now, there’s a small allocation. However, for an HTTP request/response, even though we’ve significantly reduced it over the years, there’s still a non-trivial amount of allocation that occurs over the lifetime of the operation, so one more tiny allocation doesn’t break the bank, and it’s worth it for the reduced synchronization overheads. We’ve experimented with other data structures, such as `ConcurrentQueue<T>` (which is able to avoid allocation per `Enqueue` at steady state), but they’ve had other downsides. I expect we’ll continue to push on this in the future, but for now, what’s there is a nice improvement.

Of course, once you’ve got the connection, there are all of the costs associated with actually making the request and handling the response, and those have been whittled away at as well.
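The shape of that hot path can be sketched as follows (a simplified illustration of the rent/return scheme; the real pool also enforces connection limits, scavenges idle connections, and handles waiters):

```csharp
using System;
using System.Collections.Concurrent;

// Opportunistic pooling: the hot path is a lock-free TryPop/Push on a
// ConcurrentStack<T>, so no lock is taken as long as items are available.
public sealed class LockFreePool<T> where T : class
{
    private readonly ConcurrentStack<T> _items = new();
    private readonly Func<T> _factory;

    public LockFreePool(Func<T> factory) => _factory = factory;

    // Renting: lock-free pop, falling back to creating a new item when empty
    // (the real pool falls back to a slower, lock-protected path instead).
    public T Rent() => _items.TryPop(out T? item) ? item : _factory();

    // Returning: lock-free push. Note each Push allocates a small internal
    // node, the per-return allocation mentioned above.
    public void Return(T item) => _items.Push(item);
}
```

The stack's LIFO behavior is also a nice fit here: the most recently returned connection is the one handed out next, which keeps the hottest connections in use.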
- Using vectorized helpers. dotnet/runtime#93511 replaces a scalar loop for writing out bytes from an HTTP/1.1 request header with `Ascii.FromUtf16`, which is vectorized. In fact, that vectorization was further improved by dotnet/runtime#102735, which improved the 256-bit and 512-bit code paths by using better instructions, made possible by not having to care about edge cases already weeded out.
- Avoiding extra async state machines. dotnet/runtime#100205 avoids an extra layer of async state machine that was incurred by most of the HTTP/1.1 response streams; a method was `async` only to accommodate some rare logging being enabled, so the `async` wrapper is now only employed when that logging is enabled.
- More caching of very common data. dotnet/runtime#100177 removes some more allocation and reduces some overheads by computing and caching some bytes that need to be written out on every request.
- Special-casing main use cases. dotnet/runtime#102859 and dotnet/runtime#103008 made `JsonContent` and `StringContent` cheaper by special-casing the most commonly used media types. dotnet/runtime#93759 further improved `JsonContent` by reducing the number of `async` frames on the hot path. And dotnet/runtime#102845 from @pedrobsaila made `TryAddWithoutValidation` cheaper for multiple values by special-casing the most common case of the input being an `IList<string>`, which enables presizing arrays while also avoiding an enumerator allocation.
- More `ArrayPool`. dotnet/runtime#103764 avoided a possibly large `char[]` allocation in the parsing of `Alt-Svc` headers by using `ArrayPool` rather than direct allocation.
While this simple benchmark doesn’t touch on all of these changes, it does highlight that the end-to-end performance of HTTP requests gets cheaper:
Code:
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Net.Sockets;
using System.Net;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private static readonly Socket s_listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
private static readonly HttpMessageInvoker s_client = new(new SocketsHttpHandler());
private static Uri? s_uri;
[Benchmark]
public async Task HttpGet()
{
var m = new HttpRequestMessage(HttpMethod.Get, s_uri) { Content = new StringContent("Hello, there! How are you today?") };
using (HttpResponseMessage r = await s_client.SendAsync(m, default))
using (Stream s = r.Content.ReadAsStream())
await s.CopyToAsync(Stream.Null);
}
[GlobalSetup]
public void CreateSocketServer()
{
s_listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));
s_listener.Listen(int.MaxValue);
var ep = (IPEndPoint)s_listener.LocalEndPoint!;
s_uri = new Uri($"http://{ep.Address}:{ep.Port}/");
Task.Run(async () =>
{
while (true)
{
Socket s = await s_listener.AcceptAsync();
_ = Task.Run(() =>
{
using (var ns = new NetworkStream(s, true))
{
byte[] buffer = new byte[1024];
int totalRead = 0;
while (true)
{
int read = ns.Read(buffer, totalRead, buffer.Length - totalRead);
if (read == 0) return;
totalRead += read;
if (buffer.AsSpan(0, totalRead).IndexOf("\r\n\r\n"u8) == -1)
{
if (totalRead == buffer.Length) Array.Resize(ref buffer, buffer.Length * 2);
continue;
}
ns.Write("HTTP/1.1 200 OK\r\nDate: Sun, 05 Jul 2020 12:00:00 GMT \r\nServer: Example\r\nContent-Length: 5\r\n\r\nHello"u8);
totalRead = 0;
}
}
});
}
});
}
}
| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|
| HttpGet | .NET 8.0 | 92.42 us | 1.00 | 1.98 KB | 1.00 |
| HttpGet | .NET 9.0 | 77.13 us | 0.83 | 1.8 KB | 0.91 |

Related to HTTP, the
`WebUtility` and `HttpUtility` types both got more efficient this release. dotnet/runtime#103737, in particular, made a variety of changes that have a measurable impact on `HtmlEncode` and `UrlEncode`:

- `HtmlEncode` had a scalar loop looking for characters that need to be encoded. That loop can instead be vectorized by using `SearchValues<char>`.
- `UrlEncode` also had a similar scalar loop as part of looking for the first non-safe character. `SearchValues<char>` can also solve this.
- `UrlEncode` had a complicated scheme where it would UTF8-encode into a newly-allocated `byte[]`, percent-encode in place in that array (thanks to the ability to reinterpret cast with spans), and then use the resulting chars to create a new `string`. Instead, `string.Create` can be used, with all of the work done in place in the buffer generated for that operation.
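The `SearchValues<char>` pattern looks like this (a generic illustration of the technique; the character set here is mine, not the encoder's actual one):

```csharp
using System;
using System.Buffers;

public static class HtmlScanExample
{
    // SearchValues precomputes an optimized (typically vectorized) membership
    // test for a fixed set of characters; create it once and reuse it.
    private static readonly SearchValues<char> s_needsEncoding =
        SearchValues.Create("<>&\"'");

    // Returns the index of the first character needing encoding, or -1 if the
    // whole input can be passed through untouched.
    public static int FirstIndexToEncode(ReadOnlySpan<char> text) =>
        text.IndexOfAny(s_needsEncoding);
}
```

The common case of "nothing to encode" then scans the entire input with vectorized code and returns -1, letting the caller return the original string without any copying.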
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Net;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
[Benchmark]
[Arguments("""
How much wood could a woodchuck chuck
If a woodchuck could chuck wood?
A woodchuck would chuck as much wood
As much wood as a woodchuck could chuck,
If a woodchuck could chuck wood.
""")]
public string HtmlEncode(string input) => WebUtility.HtmlEncode(input);
[Benchmark]
[Arguments("short_name.txt")]
public string UrlEncode(string input) => WebUtility.UrlEncode(input);
}
| Method | Runtime | input | Mean | Ratio |
|---|---|---|---|---|
| HtmlEncode | .NET 8.0 | How (…)ood. [181] | 102.607 ns | 1.00 |
| HtmlEncode | .NET 9.0 | How (…)ood. [181] | 10.188 ns | 0.10 |
| UrlEncode | .NET 8.0 | short_name.txt | 8.656 ns | 1.00 |
| UrlEncode | .NET 9.0 | short_name.txt | 2.463 ns | 0.28 |

`HttpUtility`
also received some attention. dotnet/runtime#102805 from @TrayanZapryanov updated `UrlEncodeToBytes`, using stack space instead of allocation for smaller inputs, and using `SearchValues<byte>` to optimize the search for invalid bytes. dotnet/runtime#102753 from @TrayanZapryanov did the same for `UrlDecodeToBytes`. dotnet/runtime#102909 from @TrayanZapryanov similarly reduced allocation in `UrlPathEncode`, but via the `ArrayPool`. dotnet/runtime#102917 from @TrayanZapryanov optimized `JavaScriptStringEncode`, in particular by using `SearchValues`. And dotnet/runtime#102745 from @TrayanZapryanov optimized `ParseQueryString` by using `stackalloc` instead of array allocation for smaller input lengths and by replacing `string.Substring` calls with span slicing.

There were also changes elsewhere in the networking stack that contribute to HTTP use cases. In dotnet/runtime#98074, for example, `Uri` gained new `TryEscapeDataString` and `TryUnescapeDataString` methods that store the output characters into a provided destination span rather than allocating new strings on each call. These methods were then used elsewhere in the networking stack, such as in `FormUrlEncodedContent`, to improve throughput and reduce allocation.
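A quick sketch of how the span-based API can be used (assuming the .NET 9 `Uri.TryEscapeDataString(ReadOnlySpan<char>, Span<char>, out int)` overload; the wrapper method is mine):

```csharp
using System;

public static class EscapeExample
{
    // Escapes into a stack buffer instead of allocating a new string per call;
    // TryEscapeDataString returns false if the destination is too small.
    public static string EscapeViaSpan(string value)
    {
        Span<char> destination = stackalloc char[256];
        if (Uri.TryEscapeDataString(value, destination, out int charsWritten))
        {
            return new string(destination.Slice(0, charsWritten));
        }

        // Fall back to the allocating API for inputs too large for the buffer.
        return Uri.EscapeDataString(value);
    }
}
```

This example still materializes a string at the end, but a caller composing a larger payload (as `FormUrlEncodedContent` does) can write the escaped chars directly into its own buffer and skip the intermediate strings entirely.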
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private KeyValuePair<string, string>[] _data =
[
new("key1", "value1"),
new("key2", "value2"),
new("key3", "value3"),
new("key4", "value4")
];
[Benchmark]
public FormUrlEncodedContent Create() => new FormUrlEncodedContent(_data);
}
| Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|
| Create | .NET 8.0 | 311.5 ns | 1.00 | 848 B | 1.00 |
| Create | .NET 9.0 | 218.7 ns | 0.70 | 384 B | 0.45 |

Beyond raw HTTP, there were also some new features for
WebSocket
in .NET 9, namely support for keep-alive pings and timeouts, but not many PRs focused solely on performance (though dotnet/runtime#101953 from @PaulusParssinen did utilize some newer APIs in ManagedWebSocket
in a way that may have removed a bit of fat). There was one notable improvement, however, in dotnet/runtime#104865. The web sockets RFC 6455 specification requires that when a data frame’s payload data has an opcode denoting it as text, that text must be checked to be valid UTF8-encoded bytes. The validation for that had been a hand-rolled scalar comparison loop. However, now that Utf8.IsValid
exists (it was introduced in .NET 8), that accelerated method can be used here instead. It can’t be used in all situations, which is probably why it wasn’t immediately employed when the method was added in the first place. Web sockets payloads may be split across data frames, so it’s possible that the frame being validated is actually the continuation of some previously-received data, and it’s possible that this frame is not the end of the payload, either. But, we know those two pieces of information up-front: if it’s a continuation from a previous frame, we would have already noted it as such, and if it’s not complete, its end-of-message bit won’t have been set. Thus, for the common case where the payload is complete, we can use the accelerated helper for UTF8 validation, and only fall back to the slower path for the corner cases. And this matters because even with the networking costs involved, that UTF8 validation shows up.
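For reference, Utf8.IsValid (from System.Text.Unicode, added in .NET 8) is simple to use, and the continuation caveat is easy to demonstrate with a truncated sequence:

```csharp
// Utf8.IsValid performs accelerated validation of a byte span as UTF8.
using System;
using System.Text.Unicode;

byte[] complete = "héllo"u8.ToArray();
Console.WriteLine(Utf8.IsValid(complete)); // True

// Truncating mid-sequence (é is two bytes in UTF8) leaves an incomplete
// sequence, which is exactly the continuation case the fast path must avoid.
Console.WriteLine(Utf8.IsValid(complete.AsSpan(0, 2))); // False
```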
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Net;
using System.Net.Sockets;
using System.Net.WebSockets;
using System.Text;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private WebSocket _client, _server;
private Memory<byte> _buffer = Encoding.UTF8.GetBytes("""
Shall I compare thee to a summer’s day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer’s lease hath all too short a date;
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm'd;
And every fair from fair sometime declines,
By chance or nature’s changing course untrimm'd;
But thy eternal summer shall not fade,
Nor lose possession of that fair thou ow’st;
Nor shall death brag thou wander’st in his shade,
When in eternal lines to time thou grow’st:
So long as men can breathe or eyes can see,
So long lives this, and this gives life to thee.
""");
private Memory<byte> _tmp = new byte[1024];
[GlobalSetup]
public void Setup()
{
using var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));
listener.Listen();
var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
client.Connect(listener.LocalEndPoint!);
Socket server = listener.Accept();
_client = WebSocket.CreateFromStream(new NetworkStream(client, ownsSocket: true), new WebSocketCreationOptions { IsServer = false, });
_server = WebSocket.CreateFromStream(new NetworkStream(server, ownsSocket: true), new WebSocketCreationOptions { IsServer = true });
}
[GlobalCleanup]
public void Cleanup()
{
_client.Dispose();
_server.Dispose();
}
[Benchmark]
public async Task SendReceive()
{
await _client.SendAsync(_buffer, WebSocketMessageType.Text, true, default);
while (!(await _server.ReceiveAsync(_tmp, default)).EndOfMessage) ;
}
}
Method | Runtime | Mean | Ratio
---|---|---|---
SendReceive | .NET 8.0 | 4.093 us | 1.00
SendReceive | .NET 9.0 | 3.438 us | 0.84

There are of course a variety of reasons that performance could have improved, e.g. maybe
WebSockets
is just exercising a code path that benefits from one of the other optimizations already discussed. How do we know it’s connected to the validation? Let’s profile. And since we already have a benchmark written, we can just use it. There’s another very handy nuget package, Microsoft.VisualStudio.DiagnosticsHub.BenchmarkDotNetDiagnosers
, which contains additional “diagnosers” for BenchmarkDotNet. Diagnosers are one of the main extensibility points within BenchmarkDotNet, enabling developers to perform additional tracking and analyses over benchmarks. You’ve already seen me use some, including the built-in [MemoryDiagnoser(false)]
and [DisassemblyDiagnoser]
; there are other built-in ones we haven’t used in this post but that are helpful in various situations, like [ThreadingDiagnoser]
and [ExceptionDiagnoser]
, but diagnosers can come from anywhere, and the aforementioned nuget package provides several more. The purpose of those diagnosers is to collect and export performance traces that Visual Studio’s performance tools can then consume. In my case, I want to collect a CPU trace, so as to understand where CPU consumption is going, so I added a [CPUUsageDiagnoser]
attribute to my Tests
class:
Code:
[CPUUsageDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
and then re-ran (on Windows). That’s it. While the test is running, you’ll see the same output as you’re used to seeing, plus a little more. For example, at the end of the benchmarking, I now also see this:
Code:
// * Diagnostic Output - VSDiagnosticsDiagnoser *
Collection result moved to 'BenchmarkDotNet_Tests_20240804_081400.diagsession'.
Session : {a1671047-d6da-4a56-9c71-eadef6c1dd00}
Stopped
Exported diagsession file: d:\Benchmarks\BenchmarkDotNet_Tests_20240804_081400.diagsession.
I then simply opened that .diagsession file, just by typing its name at the command-line, since that file extension is by default associated with Visual Studio; you could also File->Open from within Visual Studio itself. That results in a view like the following. Notice this single trace covers both the .NET 8 test execution and the .NET 9 test execution, and each is represented by a different entry in the Benchmarks table (but both are on the same execution timeline). I can then double-click one of the tests to narrow the timeline down to just the relevant portion of activity, and then switch over to the CPU Usage tab, comparing the top impacting methods for .NET 8 and for .NET 9. Notice that in the .NET 8 trace, TryValidateUtf8 is taking up almost 8% of the CPU time, but it doesn’t show up in the .NET 9 trace at all, instead being replaced by Utf8Utility.GetPointerToFirstInvalidByte, which is the implementation of Utf8.IsValid and which accounts for only half a percent. That ~8% correlates with the ~10% reduction we saw in benchmark execution time. Neat.
JSON
System.Text.Json hit the scene in .NET Core 3.0, and it’s gotten more capable and more efficient with every release since. .NET 9 is no exception. In addition to new features like support for exporting JSON schema, deep semantic equality comparison of
JsonElement
s, the ability to respect nullable reference type annotations, support for ordering JsonObject properties, new contract metadata APIs, and more, performance has also been a significant focus.
One improvement comes from the integration of
JsonSerializer
with System.IO.Pipelines
. Much of the .NET stack moves bytes around via Stream
, whereas ASP.NET internally is implemented with System.IO.Pipelines
. There are built-in bidirectional adapters between streams and pipes, but in some cases those adapters add some overhead. As JSON is so critical to modern services, it’s important that JsonSerializer
be able to work equally well with both streams and pipes. As such, dotnet/runtime#101461 adds new JsonSerializer.SerializeAsync
overloads that target PipeWriter
in addition to the existing overloads that target Stream
. That way, whether you have a Stream
or a PipeWriter
, JsonSerializer
will natively work with either without requiring any indirection to adapt between them. Just use whichever you already have.
JsonSerializer
‘s handling of enums was also improved by dotnet/runtime#105032. In addition to adding support for the new [JsonEnumMemberName]
attribute, it also moved to an allocation-free parsing solution for enums, utilizing the GetAlternateLookup
support added to Dictionary<TKey, TValue>
and ConcurrentDictionary<TKey, TValue>
to enable a cache of enum information queryable via a ReadOnlySpan<char>
.
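To see the span-based lookup pattern in action, here's a small standalone sketch (not the serializer's internal cache, just the same GetAlternateLookup API applied to an ordinary dictionary):

```csharp
// A Dictionary<string, T> queried with ReadOnlySpan<char> keys via the new
// .NET 9 GetAlternateLookup API, so slicing a larger buffer needs no Substring.
using System;
using System.Collections.Generic;

var map = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase)
{
    ["read"] = 1,
    ["write"] = 2,
};

// Requires the dictionary's comparer to support span lookups;
// the built-in string comparers do in .NET 9.
var bySpan = map.GetAlternateLookup<ReadOnlySpan<char>>();

ReadOnlySpan<char> csv = "Read,Write";
bySpan.TryGetValue(csv[..4], out int read);   // "Read", no string allocated
bySpan.TryGetValue(csv[5..], out int write);  // "Write"
Console.WriteLine(read + write); // 3
```

This is the same shape of trick the enum cache uses: the parsed name arrives as a span sliced out of the input JSON, and the lookup never has to materialize it as a string.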
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.Json;
using System.Reflection;
using System.Text.Json.Serialization;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private static readonly JsonSerializerOptions s_options = new()
{
Converters = { new JsonStringEnumConverter() },
DictionaryKeyPolicy = JsonNamingPolicy.KebabCaseLower,
};
[Params(BindingFlags.Default, BindingFlags.NonPublic | BindingFlags.Instance)]
public BindingFlags _value;
private byte[] _jsonValue;
private Utf8JsonWriter _writer = new(Stream.Null);
[GlobalSetup]
public void Setup() => _jsonValue = JsonSerializer.SerializeToUtf8Bytes(_value, s_options);
[Benchmark]
public void Serialize()
{
_writer.Reset();
JsonSerializer.Serialize(_writer, _value, s_options);
}
[Benchmark]
public BindingFlags Deserialize() =>
JsonSerializer.Deserialize<BindingFlags>(_jsonValue, s_options);
}
Method | Runtime | _value | Mean | Ratio | Allocated | Alloc Ratio
---|---|---|---|---|---|---
Serialize | .NET 8.0 | Default | 38.67 ns | 1.00 | 24 B | 1.00
Serialize | .NET 9.0 | Default | 27.23 ns | 0.70 | – | 0.00
Deserialize | .NET 8.0 | Default | 73.86 ns | 1.00 | – | NA
Deserialize | .NET 9.0 | Default | 70.48 ns | 0.95 | – | NA
Serialize | .NET 8.0 | Instance, NonPublic | 37.60 ns | 1.00 | 24 B | 1.00
Serialize | .NET 9.0 | Instance, NonPublic | 26.82 ns | 0.71 | – | 0.00
Deserialize | .NET 8.0 | Instance, NonPublic | 97.54 ns | 1.00 | – | NA
Deserialize | .NET 9.0 | Instance, NonPublic | 70.72 ns | 0.73 | – | NA

JsonSerializer
relies on lots of other functionality from System.Text.Json
, which has also improved. Here’s a sampling:
- Direct use of UTF8. JsonProperty.WriteTo would always use writer.WritePropertyName(Name) to output the property name. However, that Name property might end up allocating a new string if the JsonProperty wasn’t already caching one. dotnet/runtime#90074 from @karakasa tweaked the implementation to write out the string if it already had one, or else to directly write out a name based on the UTF8 bytes it would have used to create that string.
- Avoiding unnecessary intermediate state. dotnet/runtime#97687 from @habbes is one of those lovely PRs that’s a pure win. The primary change here is to a
Base64EncodeAndWrite method that’s Base64-encoding a source ReadOnlySpan<byte> to a destination Span<byte>. The implementation was either stackalloc’ing a buffer or renting a buffer, then encoding into that temporary, and then copying the data into a buffer that is guaranteed to be large enough. Why wasn’t it just encoding directly into that destination buffer rather than going through a temporary? Unclear. But thanks to this PR, that intermediate overhead was simply removed. Similarly, dotnet/runtime#92284 removed some unnecessary intermediate state from JsonNode.GetPath. JsonNode.GetPath was doing a lot of allocation, creating a List<string> of all of the path segments, which were then combined in reverse order into a StringBuilder. This change instead extracts the path segments in reverse order in the first place, then builds up the resulting path in stack space or an array rented from the ArrayPool.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.Json.Nodes;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private JsonNode _json = JsonNode.Parse("""
        {
          "staff": {
            "Elsa": {
              "age": 21,
              "position": "queen"
            }
          }
        }
        """)["staff"]["Elsa"]["position"];
    [Benchmark]
    public string GetPath() => _json.GetPath();
}
Method | Runtime | Mean | Ratio | Allocated | Alloc Ratio
---|---|---|---|---|---
GetPath | .NET 8.0 | 176.68 ns | 1.00 | 472 B | 1.00
GetPath | .NET 9.0 | 27.23 ns | 0.30 | 64 B | 0.14

- Using existing caches.
JsonNode.ToString and JsonNode.ToJsonString were allocating a new PooledByteBufferWriter and Utf8JsonWriter, but the internal Utf8JsonWriterCache type already provides support for using cached versions of these same objects. dotnet/runtime#92358 just updated these JsonNode methods to utilize the existing cache.
- Pre-sizing collections. JsonObject
has a constructor that accepts an enumerable of properties to add to the object. For a lot of properties, the backing store may need to keep growing and growing as they’re added, incurring the overhead of allocation and copies. dotnet/runtime#96486 from @olo-ntaylor tests whether a count can be retrieved from the enumerable, and if it can, uses that count to pre-size the dictionary.
- Allow fast paths to be fast. JsonValue has a niche feature that enables it to wrap an arbitrary .NET object. As JsonValue derives from JsonNode, JsonNode needs to take that capability into account. The way it previously did so made some common operations much more expensive than they needed to be. dotnet/runtime#103733 refactors the implementation to optimize for the common cases.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.Json;
using System.Text.Json.Nodes;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private JsonNode[] _nodes = [42, "I am a string", false, DateTimeOffset.Now];
    [Benchmark]
    [Arguments(JsonValueKind.String)]
    public int Count(JsonValueKind kind)
    {
        var count = 0;
        foreach (var node in _nodes)
        {
            if (node.GetValueKind() == kind)
            {
                count++;
            }
        }
        return count;
    }
}
Method | Runtime | kind | Mean | Ratio
---|---|---|---|---
Count | .NET 8.0 | String | 729.26 ns | 1.00
Count | .NET 9.0 | String | 12.14 ns | 0.02

- Deduplicating accesses.
JsonValue.CreateFromElement accesses JsonElement.ValueKind to determine how to process the data, e.g.
Code:
if (element.ValueKind is JsonValueKind.Null)
{
    ...
}
else if (element.ValueKind is JsonValueKind.Object or JsonValueKind.Array)
{
    ...
}
else
{
    ...
}
If ValueKind were a simple field access, that’d be fine. But it’s a bit more complicated than that, involving a large switch to determine what kind to return. Rather than possibly reading from it twice, dotnet/runtime#104108 from @andrewjsaid just makes a small tweak to only access the property once. No point in doing that work twice.
- Spans over existing data. The
JsonElement.GetRawText method is useful for extracting the original input backing the JsonElement, but the data is stored as UTF8 bytes and GetRawText returns a string, so every call allocates and transcodes to produce the result. From dotnet/runtime#104595, the new JsonMarshal.GetRawUtf8Value simply returns a span over the original data: no encoding, no allocation.
Code:
// dotnet run -c Release -f net9.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.InteropServices;
using System.Text.Json;
using System.Text.Json.Nodes;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private JsonElement _json = JsonSerializer.Deserialize<JsonElement>("""
        {
          "staff": {
            "Elsa": {
              "age": 21,
              "position": "queen"
            }
          }
        }
        """);
    [Benchmark(Baseline = true)]
    public string GetRawText() => _json.GetRawText();
    [Benchmark]
    public ReadOnlySpan<byte> TryGetRawText() => JsonMarshal.GetRawUtf8Value(_json);
}
Method | Mean | Ratio | Allocated | Alloc Ratio
---|---|---|---|---
GetRawText | 51.627 ns | 1.00 | 192 B | 1.00
TryGetRawText | 7.998 ns | 0.15 | – | 0.00
Note that the new method is on the new JsonMarshal class because it’s an API with safety concerns (in general, APIs on the Unsafe class or in the System.Runtime.InteropServices namespace are considered “unsafe”). The concern here is that the JsonElement might be backed by an array rented from the ArrayPool, if the JsonElement came from a JsonDocument. The ReadOnlySpan<byte> you get back is simply pointing into that array. If, after getting the span, the JsonDocument is disposed, it’ll return that array back to the pool, and now the span is referencing an array that someone else might rent. If they do and write into that array, the span will now contain whatever was written there, effectively yielding corrupted data. Try this:
Code:
// dotnet run -c Release -f net9.0
using System.Runtime.InteropServices;
using System.Text.Json;
using System.Text;
ReadOnlySpan<byte> elsaUtf8;
using (JsonDocument elsaJson = JsonDocument.Parse("""
    {
      "staff": {
        "Elsa": {
          "age": 21,
          "position": "queen"
        }
      }
    }
    """))
{
    elsaUtf8 = JsonMarshal.GetRawUtf8Value(elsaJson.RootElement);
}
using (JsonDocument annaJson = JsonDocument.Parse("""
    {
      "staff": {
        "Anna": {
          "age": 18,
          "position": "princess"
        }
      }
    }
    """))
{
    Console.WriteLine(Encoding.UTF8.GetString(elsaUtf8)); // uh oh!
}
When I run that, it prints out the information about “Anna,” even though I retrieved the raw text from the “Elsa” JsonElement. Oops! As with anything in C# or .NET that’s “unsafe,” you just need to make sure you hold it correctly.
One last improvement I want to call out. The feature itself is not actually about performance, but the workarounds I’ve seen folks employ for the lack of this capability do have a significant performance impact, and so having the feature built-in will be a net performance win. dotnet/runtime#104328 adds support to both
Utf8JsonReader
and JsonSerializer
for parsing out multiple top-level JSON objects from an input. Previously if any data was found after a JSON object in the input, that would be considered erroneous and fail to parse, and that means that if a particular data source served up multiple JSON objects one after the other, the data would need to be pre-parsed in order to feed only the relevant portions to System.Text.Json
. This is particularly relevant with services that stream data, as some of them use such a format.
Code:
// dotnet run -c Release -f net9.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.Json;
using System.Text.Json.Nodes;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private MemoryStream _source = new MemoryStream("""
{
"name": "Alice",
"age": 30,
"city": "New York"
}
{
"name": "Bob",
"age": 25,
"city": "Los Angeles"
}
{
"name": "Charlie",
"age": 35,
"city": "Chicago"
}
"""u8.ToArray());
[Benchmark]
[Arguments("Dave")]
public async Task<Person?> FindAsync(string name)
{
_source.Position = 0;
await foreach (var p in JsonSerializer.DeserializeAsyncEnumerable<Person>(_source, topLevelValues: true))
{
if (p?.Name == name)
{
return p;
}
}
return null;
}
public class Person
{
public string Name { get; set; }
public int Age { get; set; }
public string City { get; set; }
}
}
Diagnostics
Being able to observe one’s application in production is critical to the operation of modern services.
System.Diagnostics.Metrics.Meter
is .NET’s recommended type for emitting metrics, and several improvements have gone into making it more efficient in .NET 9.
Counter
and UpDownCounter
are often used for hot-path tracking of metrics like the number of active or queued requests. In production environments, these instruments are frequently bombarded from multiple threads concurrently, which means they not only need to be thread-safe but also need to scale well. The thread-safety had been achieved by using a lock
around updates (which were simply reading a value, adding to it, and storing it back), but under heavy load that could result in significant contention on the lock. To address this, dotnet/runtime#91566 changed the implementation in a few ways. First, rather than using a lock
to protect the state:
Code:
lock (this)
{
_delta += value;
}
it used an interlocked operation to perform the addition atomically. Here
_delta
is a double
, and there’s no Interlocked.Add
that works with double
values, so instead the standard approach of using a loop around an Interlocked.CompareExchange
was employed.
Code:
double currentValue;
do
{
currentValue = _delta;
}
while (Interlocked.CompareExchange(ref _delta, currentValue + value, currentValue) != currentValue);
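That loop can be wrapped up into a small reusable helper (my own sketch, not the runtime's code):

```csharp
// Atomic add for doubles via a CompareExchange loop, mirroring the pattern above.
using System;
using System.Threading;
using System.Threading.Tasks;

double total = 0;
Parallel.For(0, 1_000, _ => AtomicDouble.Add(ref total, 0.5));
Console.WriteLine(total); // 500

static class AtomicDouble
{
    public static void Add(ref double location, double value)
    {
        double current;
        do
        {
            // Re-read on each iteration; CompareExchange only stores the new
            // sum if no other thread changed the value in the meantime.
            current = Volatile.Read(ref location);
        }
        while (Interlocked.CompareExchange(ref location, current + value, current) != current);
    }
}
```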
That helps, but while this does reduce the overhead and improve scalability, it still represents a bottleneck under heavy contention. To address that, the change also split the single
_delta
into an array of values, one per core, and a thread picks one of them to update, typically the value associated with the core on which it’s currently running. That way, contention is significantly reduced: updates are distributed across N values instead of 1, and because threads prefer the value for the core they’re running on, and only one thread ever executes on a specific core at a given moment, the chances for conflicts drop significantly. There is still some contention, both because a thread isn’t guaranteed to use the associated value (e.g. the thread could migrate between the time it checks what core it’s on and when it does the access) and because we actually cap the size of the array (so that it doesn’t consume too much memory), but it still makes the system much more scalable.
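To make the striping concrete, here's a simplified sketch of the technique (hypothetical types, not the actual Counter internals): padded per-core slots, updated by processor ID and summed on read.

```csharp
// Hypothetical striped counter illustrating the per-core technique described
// above; the real Counter implementation differs in details.
using System;
using System.Runtime.InteropServices;
using System.Threading;
using System.Threading.Tasks;

var counter = new StripedCounter();
Parallel.For(0, 100_000, _ => counter.Add(1));
Console.WriteLine(counter.Total); // 100000

[StructLayout(LayoutKind.Explicit, Size = 64)] // one cache line per slot
struct PaddedLong
{
    [FieldOffset(0)] public long Value;
}

sealed class StripedCounter
{
    // Capped array of padded slots, roughly one per core.
    private readonly PaddedLong[] _slots =
        new PaddedLong[Math.Min(Environment.ProcessorCount, 64)];

    public void Add(long value)
    {
        // A thread may migrate after this read; that costs a little contention
        // but never correctness, since each slot is updated atomically.
        int i = Thread.GetCurrentProcessorId() % _slots.Length;
        Interlocked.Add(ref _slots[i].Value, value);
    }

    public long Total
    {
        get
        {
            long sum = 0;
            for (int i = 0; i < _slots.Length; i++)
            {
                sum += Volatile.Read(ref _slots[i].Value);
            }
            return sum;
        }
    }
}
```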
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Diagnostics.Metrics;
using System.Diagnostics.Tracing;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private MetricsEventListener _listener = new MetricsEventListener();
private Meter _meter = new Meter("Example");
private Counter<int> _counter;
[GlobalSetup]
public void Setup() => _counter = _meter.CreateCounter<int>("counter");
[GlobalCleanup]
public void Cleanup()
{
_listener.Dispose();
_meter.Dispose();
}
[Benchmark]
public void Counter_Parallel()
{
Parallel.For(0, 1_000_000, i =>
{
_counter.Add(1);
_counter.Add(1);
});
}
private sealed class MetricsEventListener : EventListener
{
protected override void OnEventSourceCreated(EventSource eventSource)
{
if (eventSource.Name == "System.Diagnostics.Metrics")
{
EnableEvents(eventSource, EventLevel.LogAlways, EventKeywords.All, new Dictionary<string, string?>() { { "Metrics", "Example\\upDownCounter;Example\\counter" } });
}
}
}
}
Method | Runtime | Mean | Ratio
---|---|---|---
Counter_Parallel | .NET 8.0 | 137.90 ms | 1.00
Counter_Parallel | .NET 9.0 | 30.65 ms | 0.22

There’s another interesting aspect of the improvement worth mentioning, and that’s the padding employed in the array. Going from a single
double _delta
to an array of deltas, you might imagine we’d end up with:
private readonly double[] _deltas;
but if you look at the code, it’s instead:
private readonly PaddedDouble[] _deltas;
where
PaddedDouble
is defined as:
Code:
[StructLayout(LayoutKind.Explicit, Size = 64)]
private struct PaddedDouble
{
[FieldOffset(0)]
public double Value;
}
This effectively increases the size of each value from 8 bytes to 64 bytes, where only the first 8 bytes of each value are used and the other 56 bytes are padding. That’s odd, right? Normally we’d jump at an opportunity to shrink 64 bytes down to 8 bytes in order to reduce allocation and memory consumption, but here we’re purposefully going in the other direction.
The reason for that is “false sharing.” Consider this benchmark, which I’ve shamelessly borrowed from a conversation Scott Hanselman and I recently recorded for the Deep .NET series but which hasn’t yet been posted online:
Code:
// dotnet run -c Release -f net9.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private int[] _values = new int[32];
[Params(1, 31)]
public int Index { get; set; }
[Benchmark]
public void ParallelIncrement()
{
Parallel.Invoke(
() => IncrementLoop(ref _values[0]),
() => IncrementLoop(ref _values[Index]));
static void IncrementLoop(ref int value)
{
for (int i = 0; i < 100_000_000; i++)
{
Interlocked.Increment(ref value);
}
}
}
}
When I run that, I get results like this:
Method | Index | Mean
---|---|---
ParallelIncrement | 1 | 1,779.9 ms
ParallelIncrement | 31 | 432.3 ms

In this benchmark, one thread is incrementing
_values[0]
and the other thread is incrementing either _values[1]
or _values[31]
. That index is the only difference, yet the one accessing _values[31]
is several times faster than the one accessing _values[1]
. That’s because there’s contention here even if it’s not obvious in the code. The contention comes from the fact that the hardware works with memory in groups of bytes called a “cache line.” Most hardware has cache lines of 64 bytes. In order to update a particular memory location, the hardware will acquire the whole cache line. If another core wants to update that same cache line, it’ll need to acquire it. That back and forth results in a lot of overhead. It doesn’t matter if one core is touching the first of those 64 bytes and another thread is touching the last; from the hardware’s perspective there’s still sharing happening. “False sharing.” Thus, the Counter
fix is using padding around the double
values to try to space them out more so as to minimize the sharing that limits scalability.
As an aside, there are some additional BenchmarkDotNet diagnosers that can help to highlight the effects of false sharing. ETW on Windows enables collecting various CPU performance counters, such as for branch misses or instructions retired, and BenchmarkDotNet has a
[HardwareCounters]
diagnoser that’s able to collect this ETW data. One such counter is for cache misses, which often reflect false sharing issues. If you’re on Windows, you can try grabbing the separate BenchmarkDotNet.Diagnostics.Windows
nuget package and using it as in this benchmark:
Code:
// This benchmark only works on Windows.
// Add a <PackageReference Include="BenchmarkDotNet.Diagnostics.Windows" Version="0.14.0" /> to the csproj.
// dotnet run -c Release -f net9.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HardwareCounters(HardwareCounter.InstructionRetired, HardwareCounter.CacheMisses)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private int[] _values = new int[32];
[Params(1, 31)]
public int Index { get; set; }
[Benchmark]
public void ParallelIncrement()
{
Parallel.Invoke(
() => IncrementLoop(ref _values[0]),
() => IncrementLoop(ref _values[Index]));
static void IncrementLoop(ref int value)
{
for (int i = 0; i < 100_000_000; i++)
{
Interlocked.Increment(ref value);
}
}
}
}
Here I’ve asked for both instructions retired, which reflects how many instructions were fully executed (this in and of itself can be a useful metric when analyzing performance, as it’s not as prone to variation as wall-clock measurements), and cache misses, which reflects how many times data wasn’t available in the CPU’s cache.
Method | Index | Mean | InstructionRetired/Op | CacheMisses/Op
---|---|---|---|---
ParallelIncrement | 1 | 1,846.2 ms | 804,300,000 | 177,889
ParallelIncrement | 31 | 442.5 ms | 824,333,333 | 52,429

In the two benchmarks, we can see that the number of instructions executed is almost the same between when false sharing occurred (Index == 1) and didn’t (Index == 31), but the number of cache misses is more than three times larger in the false sharing case, and reasonably well correlated with the time increase. When one core performs a write, that invalidates the corresponding cache line in the other core’s cache, such that the other core then needs to reload the cache line, resulting in cache misses. But I digress…
Another nice improvement comes in dotnet/runtime#105011 from @stevejgordon, adding a new constructor to
Measurement
. Often when creating Measurement
s, you’re also tagging them with additional key/value pairs of information, for which the TagList
type exists. TagList
implements IList<KeyValuePair<string, object?>>
, and Measurement
has a constructor that takes an IEnumerable<KeyValuePair<string, object?>>
, so you can pass a TagList
to a Measurement
and it “just works”… albeit slower than it could. If you had code like:
Code:
measurements.Add(new Measurement<long>(
snapshotV4.LastAckCount,
new TagList { tcpVersionFourTag, new(NetworkStateKey, "last_ack") }));
that would end up boxing the
TagList
struct as an enumerable, and then enumerating through it via the interface, which also entails an enumerator allocation. The new constructor this PR adds takes a TagList
, avoiding those overheads. TagList
is also a large struct, as common usage has it living only on the stack, and so as an optimization it stores some of the contained key/value pairs directly in fields on the struct rather than always forcing an array allocation. The net result is much less overhead in constructing these measurements.
TagList
itself was also improved by dotnet/runtime#104132, which re-implemented the type for .NET 8+ on top of [InlineArray]
. TagList
is effectively a list of key/value pairs, but in order to avoid always allocating a backing store, it stores some of those pairs inline in itself. This previously was done with dedicated fields for each pair, and then code that directly accessed each field. Now, an [InlineArray]
is used, cleaning up the code and enabling access via spans.
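For illustration, here's the pattern in miniature (a hypothetical eight-element buffer, not TagList's actual internals):

```csharp
// [InlineArray(8)] makes the runtime replicate the single field 8 times,
// and the compiler allows the struct to convert to a span over those elements.
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

EightTags tags = default;
Span<KeyValuePair<string, object?>> span = tags; // spans the 8 inline elements
span[0] = new("Name1", "Val1");
span[7] = new("Name8", "Val8");
Console.WriteLine($"{span.Length}: {span[0].Key}={span[0].Value}");

[InlineArray(8)]
struct EightTags
{
    private KeyValuePair<string, object?> _element0;
}
```

The span conversion is what "enables access via spans": instead of a switch over eight dedicated fields, code can just index or slice.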
Code:
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Diagnostics;
using System.Diagnostics.Metrics;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private Counter<long> _counter;
private Meter _meter;
[GlobalSetup]
public void Setup()
{
this._meter = new Meter("TestMeter");
this._counter = this._meter.CreateCounter<long>("counter");
}
[GlobalCleanup]
public void Cleanup() => this._meter.Dispose();
[Benchmark]
public void CounterAdd()
{
this._counter?.Add(100, new TagList
{
{ "Name1", "Val1" },
{ "Name2", "Val2" },
{ "Name3", "Val3" },
{ "Name4", "Val4" },
{ "Name5", "Val5" },
{ "Name6", "Val6" },
{ "Name7", "Val7" },
});
}
}
Method | Runtime | Mean | Ratio
---|---|---|---
CounterAdd | .NET 8.0 | 31.88 ns | 1.00
CounterAdd | .NET 9.0 | 13.93 ns | 0.44

Peanut Butter
Throughout this post, I’ve tried to group improvements by topic area in order to create a more fluid and interesting discussion. However, over the course of a year, with as vibrant a community as exists for .NET, and with the breadth of functionality that exists across the platform, there are invariably a large number of one-off PRs that improve this or that by a little. It’s often challenging to imagine any one of these significantly “moving the needle,” but altogether, such changes reduce the “peanut butter” of performance overhead spread thinly across the libraries. In no particular order, here’s a non-comprehensive look at some of these:
- StreamWriter.Null. StreamWriter exposes a static Null field. It stores a StreamWriter instance that's intended to be a "bit bucket," a sink you can write to that just ignores all of the data, à la /dev/null on Unix, Stream.Null, and so on. Unfortunately, the way it was implemented had two problems, one of which I'm incredibly surprised took us this long to discover (as it's been this way for as long as .NET has existed). It was implemented as new StreamWriter(Stream.Null, ...). All of the state tracking done in StreamWriter is not thread-safe, yet here this instance is exposed from a public static member, which means it should be thread-safe; if multiple threads hammered that StreamWriter instance, it could result in really strange exceptions occurring, like arithmetic overflow. Performance-wise, it's also problematic, because even though actual writes to the underlying Stream are ignored, all of the work done by StreamWriter is still performed, even though it's useless. dotnet/runtime#98473 fixes both of those problems by creating an internal NullStreamWriter : StreamWriter type that overrides everything to be nops, and then Null is initialized to an instance of that.
Code:
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    [Benchmark]
    public void WriteLine() => StreamWriter.Null.WriteLine("Hello, world!");
}
Method | Runtime | Mean | Ratio
---|---|---|---
WriteLine | .NET 8.0 | 7.5164 ns | 1.00
WriteLine | .NET 9.0 | 0.0283 ns | 0.004

- NonCryptographicHashAlgorithm.Append{Async}. NonCryptographicHashAlgorithm is the base class in System.IO.Hashing for types like XxHash3 and Crc32. One nice feature it provides is the ability to append an entire Stream's contents to it in a single call, e.g.
Code:
XxHash3 hash = new();
hash.Append(someStream);
The implementation of Append was fairly straightforward: rent a buffer from the ArrayPool, and then in a loop repeatedly Stream.Read (or Stream.ReadAsync in the case of AppendAsync) into that buffer and Append that filled portion of the buffer. This has a couple of performance downsides, however. First, the buffer being rented was 4096 bytes. That's not tiny, but using a larger buffer can reduce the number of calls to the source stream being appended, which in turn can reduce any I/O performed by that Stream. Second, many streams have optimized CopyTo implementations for pushing all of their contents to a sink like this. MemoryStream.CopyTo, for example, will just perform a single write of its entire internal buffer to the Stream passed to its CopyTo. But even if a Stream doesn't override CopyTo, the base CopyTo implementation already provides such a copying loop, and it does so by default using a much larger rented buffer. As such, dotnet/runtime#103669 changes the implementation of Append to allocate a small temporary Stream object that wraps this NonCryptographicHashAlgorithm instance, and any calls to Write on that stream are just translated into calls to Append. This is a neat example where sometimes we actually choose to pay for a small, short-lived allocation in exchange for significant throughput benefits.
Code:
// dotnet run -c Release -f net8.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;
using System.IO.Hashing;

var config = DefaultConfig.Instance
    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core80).WithNuGet("System.IO.Hashing", "8.0.0").AsBaseline())
    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core90).WithNuGet("System.IO.Hashing", "9.0.0-rc.1.24431.7"));

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);

[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "NuGetReferences")]
public class Tests
{
    private Stream _stream;
    private byte[] _bytes;

    [GlobalSetup]
    public void Setup()
    {
        _bytes = new byte[1024 * 1024];
        new Random(42).NextBytes(_bytes);

        string path = Path.GetRandomFileName();
        File.WriteAllBytes(path, _bytes);
        _stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, 0, FileOptions.DeleteOnClose);
    }

    [GlobalCleanup]
    public void Cleanup() => _stream.Dispose();

    [Benchmark]
    public ulong Hash()
    {
        _stream.Position = 0;

        var hash = new XxHash3();
        hash.Append(_stream);
        return hash.GetCurrentHashAsUInt64();
    }
}
Method | Runtime | Mean
---|---|---
Hash | .NET 8.0 | 91.60 us
Hash | .NET 9.0 | 61.26 us

- Unnecessary virtual. virtual methods have overhead. First, they're more expensive to invoke than non-virtual methods, because several indirections are required to find the actual target method to invoke (the actual target may differ based on the concrete type being used). And second, without a technology like dynamic PGO, virtual methods won't be inlined, because the compiler can't statically see which target should be inlined (and even if dynamic PGO makes such inlining possible for the most common type, there's still a check required to ensure it's ok to follow that path). As such, if things don't need to be virtual, it's better performance-wise for them not to be. And if such things are internal, unless they're actively being overridden by something, there's no reason for them to be virtual. dotnet/runtime#104453 from @xtqqczze, dotnet/runtime#104456 from @xtqqczze, and dotnet/runtime#104483 from @xtqqczze all address exactly such cases, removing virtual from a smattering of internal members that weren't being overridden. It might only save a few instructions here and there, but there's effectively no downside to such a change (other than some minimal code churn): a pure win.
- ReadOnlySpan vs Span. We as developers like to protect ourselves from ourselves, for example making fields readonly to avoid accidentally changing them. Such changes can also have performance benefits; for example, the JIT can better optimize static fields that are readonly than those that aren't. The same set of principles applies to Span<T> and ReadOnlySpan<T>. If a method doesn't need to mutate the contents of a collection being passed in, it's less accident-prone to use a ReadOnlySpan<T> rather than a Span<T>. It also signals to the caller that they don't need to be concerned about the data changing out from under them. Interestingly, here, too, there's both a correctness and a performance benefit to using ReadOnlySpan<T> instead of Span<T>. The implementations of these two types are almost word-for-word identical, the critical difference being whether the indexer returns a ref T or a ref readonly T. There is one additional line in Span<T>, however, that doesn't exist in ReadOnlySpan<T>. Span<T>'s constructor has this one extra check:
Code:
if (!typeof(T).IsValueType && array.GetType() != typeof(T[]))
    ThrowHelper.ThrowArrayTypeMismatchException();
This check exists because of array covariance. Let’s say you have this:
Code:
Base[] array = new Derived[3];

class Base { }
class Derived : Base { }
That code compiles and runs successfully, because .NET supports array covariance, meaning an array of a derived type can be used as an array of the base type. But there's an important catch here. Let's augment the example slightly:
Code:
Base[] array = new Derived[3];
array[0] = new AlsoDerived(); // uh oh!

class Base { }
class Derived : Base { }
class AlsoDerived : Base { }
This will compile successfully, but at run-time it'll fail with an ArrayTypeMismatchException. That's because it's trying to store an AlsoDerived instance into a Derived[], and there's no relationship between the two types that should permit that. The check required to enforce that comes at a cost, every single time you try to write into an array (except in cases where the compiler can prove it's safe and elide the check). When Span<T> was introduced, the decision was made to hoist that check up into the span's constructor; that way, once you have a valid span, no such checking needs to be performed on every write, only once at construction. That's what that additional line of code is doing: checking to ensure that the specified T is the same as the provided array's element type. That means code like this will also throw an ArrayTypeMismatchException:
Code:
Span<object> span = new string[2]; // uh oh
But that also means if you use Span<T> in situations where you could have used ReadOnlySpan<T>, there's a good chance you're unnecessarily incurring that check, which means you're both possibly going to hit unexpected exceptions depending on what arrays are passed in and incurring a bit of peanut-butter cost. dotnet/runtime#104864 replaced a bunch of Span<T>s with ReadOnlySpan<T>s to reduce the chances we'd incur such overhead, while also just improving the maintainability of the code.
- readonly and const. In the same vein, changing fields that could be const to be so, changing non-readonly fields that could be readonly to be so, and removing unnecessary property setters is all goodness for maintainability while also having the chance of improving performance. Making fields const avoids unnecessary memory accesses while also allowing the JIT to better employ constant propagation. And making static fields readonly enables the JIT to treat them as if they were const in tier-1 compilation. dotnet/runtime#100728 updates hundreds of occurrences.
- MemoryCache. dotnet/runtime#103992 from @ADNewsom09 addresses an inefficiency in Microsoft.Extensions.Caching.Memory. If multiple concurrent operations end up triggering the cache's compaction operation, the involved threads can all end up duplicating each other's work. The fix is to simply have only one of those threads perform the compaction.
- BinaryReader. dotnet/runtime#80331 from @teo-tsirpanis made BinaryReader allocations relevant only to reading text be performed lazily, only when such reading occurs. If the reader is never used for reading text, the application won't need to pay for the allocation.
- ArrayBufferWriter. dotnet/runtime#88009 from @AlexRadch adds a new ResetWrittenCount method to ArrayBufferWriter. ArrayBufferWriter.Clear already exists, but in addition to setting the written count to 0, it also clears the underlying buffer. In many situations, that clearing is unnecessary overhead, so ResetWrittenCount allows it to be avoided. (There was an interesting debate about whether such a new method was even necessary, and whether Clear could just be changed to remove the zeroing. But concerns about stale data finding their way into consuming code as corrupted data led to the new method being added instead.)
- Span-based File methods. The static File class provides simple helpers for interacting with files, e.g. File.WriteAllText. Historically, these methods have worked with strings and arrays. That means, though, that if someone instead has a span, they either can't use these simple helpers or they need to pay to create a string or an array from the span. dotnet/runtime#103308 adds new span-based overloads so that developers don't need to choose between simplicity and performance.
- string concat vs Append. String concatenation inside of a loop is a well-known no-no, as in the extreme it can lead to significant O(N^2) costs. Such a string concatenation was occurring, however, inside of MailAddressCollection, where an encoded version of every address in the collection was being appended onto a string using string concatenation. dotnet/runtime#95760 from @YohDeadfall changed that to use a builder instead.
- Closures. The config source generator was introduced in .NET 8 to significantly improve the performance of configuration binding, while also making it friendlier to Native AOT. It achieved both. However, it can be improved further. There’s an unanticipated extra allocation that occurs on success paths that’s only relevant to failure paths, because of how the code is being generated. For a call site like this:
Code:
public static void M(IConfiguration configuration, C1 c) => configuration.Bind(c);
the source generator would emit a method like this:
Code:
public static void BindCore(IConfiguration configuration, ref C1 obj, BinderOptions? binderOptions)
{
    ValidateConfigurationKeys(typeof(C1), s_configKeys_C1, configuration, binderOptions);

    if (configuration["Value"] is string value15)
        obj.Value = ParseInt(value15, () => configuration.GetSection("Value").Path);
}
That lambda being passed to the ParseInt helper is accessing configuration, which is defined outside of the lambda as a parameter. To get that data into the lambda, the compiler allocates a "display class" to store the information, with the body of the lambda translated into a method on that display class. That display class gets allocated at the beginning of the scope that contains the captured data, which in this case means it's allocated at the beginning of the BindCore method. That means it's allocated regardless of whether the if block is entered, and even when ParseInt is called, the delegate passed to it is only ever invoked when there's a failure. dotnet/runtime#100257 from @pedrobsaila reworks the source generator code so that this allocation isn't incurred.
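The allocation being described can be sketched as follows. ParseInt and ParseIntWithState here are stand-ins mirroring the generated helper, not the generator's actual code: capturing a local forces the compiler to hoist it into a heap-allocated display class, whereas threading the value through an explicit state parameter lets the lambda be static and capture-free.

```csharp
using System;

string configValue = "42";
string sectionPath = "Value";

// Capturing lambda: 'sectionPath' is hoisted into a compiler-generated
// display class, allocated up front even though the delegate body only
// ever runs on the failure path.
int viaCapture = ParseInt(configValue, () => sectionPath);

// Capture-free alternative: the path flows through an explicit state
// argument, so no display class is needed and the static lambda's
// delegate instance can be cached by the compiler.
int viaState = ParseIntWithState(configValue, sectionPath, static path => path);

Console.WriteLine(viaCapture + viaState); // prints 84

static int ParseInt(string value, Func<string> getPath) =>
    int.TryParse(value, out int result)
        ? result
        : throw new InvalidOperationException($"Failed to parse '{getPath()}'.");

static int ParseIntWithState<TState>(string value, TState state, Func<TState, string> getPath) =>
    int.TryParse(value, out int result)
        ? result
        : throw new InvalidOperationException($"Failed to parse '{getPath(state)}'.");
```

The state-passing shape is a general-purpose way to keep error-path-only delegates from taxing the success path.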
- Stream.Read/Write Span Overrides. Streams that don't override the span-based Read/Write methods end up utilizing the base implementations, which allocate. There are a ton of Stream implementations in dotnet/runtime, and we've overridden such methods almost everywhere, but now and again we find one that slipped through. dotnet/runtime#86674 from @hrrrrustic fixed one such case, on the StreamOnSqlBytes type.
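The pattern is simply to override the span overloads alongside the array ones. A minimal hypothetical stream (ZeroStream is made up for illustration) shows the shape; without the Read(Span<byte>) override, the base class would rent an array, call the array overload, and copy the result back into the caller's span.

```csharp
using System;
using System.IO;

// Calls with a span hit the span override directly; no array is rented
// and no copy-back occurs.
var stream = new ZeroStream();
Span<byte> buffer = stackalloc byte[16];
int read = stream.Read(buffer);
Console.WriteLine(read); // prints 16

// Hypothetical read-only stream that "reads" zeros: the array overload
// delegates to the span overload, so the logic lives in one place and
// both call shapes are efficient.
sealed class ZeroStream : Stream
{
    public override int Read(byte[] buffer, int offset, int count) =>
        Read(buffer.AsSpan(offset, count));

    public override int Read(Span<byte> buffer)
    {
        buffer.Clear();
        return buffer.Length;
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position { get => throw new NotSupportedException(); set => throw new NotSupportedException(); }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}
```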
- Globalization Arrays. Every NumberFormatInfo object defaults its NumberGroupSizes, CurrencyGroupSizes, and PercentGroupSizes to each be new instances of new int[] { 3 } (even if subsequent initialization overwrites them). And yet these arrays are never handed out to consumers: the properties that expose them make defensive copies. That means all of these can just refer to the same shared singleton array. The same is true for NativeDigits, which is initialized first to a new array of the digits 0 through 9. dotnet/runtime#93117 addresses all of these by creating and using such singletons.
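The pattern in miniature (names here are illustrative, not NumberFormatInfo's actual fields): because the public surface clones the array before handing it out, every instance can safely point at one shared default, and the per-instance allocation disappears.

```csharp
using System;

// One shared default; safe to reuse everywhere because it never escapes.
int[] sharedGroupSizes = { 3 };

// The public surface hands out defensive copies, so callers can't mutate
// the shared singleton.
int[] GetNumberGroupSizes() => (int[])sharedGroupSizes.Clone();

int[] copy1 = GetNumberGroupSizes();
copy1[0] = 99; // mutates only the caller's copy

int[] copy2 = GetNumberGroupSizes();
Console.WriteLine(copy2[0]); // prints 3: the singleton was untouched
```

The safety of the optimization hinges entirely on the defensive copy: sharing a mutable array is only valid because no caller ever observes the shared instance.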
- ColorTranslator.ToWin32. There's a .NET design guideline that says properties should be like smarter fields. The resulting expectation is that they should be cheap, like just accessing a field or doing a very simple calculation over a field. Unfortunately, we don't always follow our own guidance, and there exist some properties that really, really look like they should be trivial but actually aren't. System.Drawing.Color is a good example. A very reasonable mental model for Color (which according to the docs "Represents an ARGB (alpha, red, green, blue) color") is that it'd just be four byte values, one for each channel, either in their own fields or packed together into an int. Unfortunately, it's not quite as simple as that. Color can be that, such as if it's constructed using Color.FromArgb(int), but it can also be used to represent a "known color," as is evident from the SystemColors type having a bunch of static properties (e.g. SystemColors.Control) that return a color for the underlying OS. And even there, you might think "oh, ok, well those properties must be what Stephen is referring to; they must call out to the OS to get the color, and then use FromArgb." And again, that's a very intuitive mental model, and again it's not what actually happens. Those properties actually are cheap; all they do is construct a Color with an enum value corresponding to the system color. Then where is the actual OS color value retrieved, you ask? As part of the R, G, B, and A properties on Color! That means if you access each of those properties, as ColorTranslator was doing in a variety of its methods, you're making three or four times as many P/Invokes as you'd otherwise need to. dotnet/runtime#106042 fixes this for ColorTranslator, but it serves as a good reminder of why such guidelines exist. (This benchmark is Windows-specific, as SystemColors doesn't currently rely on OS information for Linux or macOS.)
Code:
// Windows-specific (it works on Linux and macOS, but doesn't demonstrate the same thing.)
// dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Drawing;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private Color _color = SystemColors.Control;

    [Benchmark]
    public int ColorToWin32() => ColorTranslator.ToWin32(_color);
}
Method | Runtime | Mean | Ratio
---|---|---|---
ColorToWin32 | .NET 8.0 | 11.263 ns | 1.00
ColorToWin32 | .NET 9.0 | 4.711 ns | 0.42
What’s Next?
Maybe one more poem? An acrostic this time:
Code:
Driving innovation with unmatched speed,
Opening doors to what developers need.
Turbocharged perf, breaking the mold,
New benchmarks surpassed, metrics so bold.
Empowering coders, dreams take flight,
Transforming visions with .NET might.
Navigating challenges with precision and flair,
Inspiring creativity, improvements everywhere.
Nurturing growth, pushing limits high,
Elevating success, reaching for the sky.
Several hundred pages later and still not a poet. Oh well.
I’m asked from time to time why I invest in writing these “Performance Improvements in .NET” posts. There’s no one answer. In no particular order:
- Personal learning. I pay close attention throughout the year to all of the various performance improvements happening in the release, sometimes from a distance, sometimes as the one making the changes. Writing this post serves as a forcing-function for me to revisit them all and really internalize the changes that were made and their relevance to the broader picture. It’s a learning opportunity for me.
- Testing. As one of the developers on the team recently said to me, “I like this time of the year when you give our optimizations a stress-test and uncover inefficiencies.” Every year when I’m going through the improvements, just the act of re-validating the improvements often highlights regressions, cases that were missed, or further opportunities that can be addressed in the future. Again, it’s a forcing function to do more testing and with a fresh set of eyes.
- Thanks. Many of the performance improvements in each release aren’t from the folks working on the .NET team or even at Microsoft. They’re from passionate and talented individuals throughout the global .NET ecosystem, and I like to highlight their contributions. That’s why throughout the post you see me calling out when PRs are from folks not employed by Microsoft as full-time employees. In this post, that accounts for ~20% of all the cited PRs. Amazing. Heartfelt thanks to everyone who’s worked to make .NET better for everyone.
- Excitement. Developers often have conflicting opinions about the speed at which .NET is advancing, some really appreciating the frequent introduction of new features, others concerned that they can’t keep up with all of the newness. But the one thing everyone seems to agree on is the love of “free perf,” and that’s a lot of what these posts talk about. .NET gets faster and faster every release, and it’s exciting to see a tour through the highlights collected all in one place.
- Education. There are multiple forms of performance improvements covered throughout the post. Some of the improvements you get completely for free just by upgrading the runtime; the implementations in the runtime are better, and so when you run on them, your code just gets better, too. Some of the improvements you get completely for free by upgrading the runtime and recompiling; the C# compiler itself generates better code, often taking advantage of newer surface area exposed in the runtime. And other improvements are new features that, in addition to the runtime and compiler utilizing, you can utilize directly and make your code even faster. Educating about those capabilities and why and where you’d want to utilize them is important to me. But beyond the new features, the techniques employed in making all of the rest of the optimizations throughout the runtime are often more broadly applicable. By learning how these optimizations are applied in the runtime, you can extrapolate and apply similar techniques to your own code, making it that much faster.
If you’ve read this far, I hope you indeed have learned something and are excited about the .NET 9 release. As is likely obvious from my enthusiastic ramblings and awkward poetry, I’m incredibly excited about .NET, everything that’s been achieved in .NET 9, and the future of the platform. If you’re already using .NET 8, upgrading to .NET 9 should be a breeze (the .NET 9 Release Candidate is available for download), and I’d love it if you’d do so and share with us any successes you achieve or issues you face along the way. We’d love to learn from you. And if you have ideas about how to further improve the performance of .NET for .NET 10, please join us in dotnet/runtime.
Happy coding!
The post Performance Improvements in .NET 9 appeared first on .NET Blog.