Translated from a blog post; original: The performance characteristics of async methods in C#
The async series:
- Dissecting the async methods in C#
- Extending the async methods in C#
- The performance characteristics of async methods in C#
- A user scenario that illustrates the issues to watch out for
In the first two articles we covered the internals of async methods in C# and the extensibility points the C# compiler provides for customizing their behavior. Today we will explore the performance characteristics of async methods.
As described in the first article, the compiler performs a number of transformations to make the asynchronous programming experience very similar to the synchronous one. But to do that, it creates a state machine instance, passes it to the async method builder, the builder calls the task awaiter, and so on. Obviously, all of this logic has a cost; the question is how much you have to pay.
Before the TPL, asynchronous operations were usually coarse-grained, so their overhead was mostly negligible. Today, however, even a relatively simple application can perform hundreds or thousands of asynchronous operations per second. The TPL was designed with such workloads in mind, but it is not magic: it still has some overhead.
To measure the overhead of async methods, we will use the example from the first article, slightly modified:
```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Threading.Tasks;

public class StockPrices
{
    private const int Count = 100;
    private List<(string name, decimal price)> _stockPricesCache;

    // Async version
    public async Task<decimal> GetStockPriceForAsync(string companyId)
    {
        await InitializeMapIfNeededAsync();
        return DoGetPriceFromCache(companyId);
    }

    // Synchronous version that calls the async initialization method
    public decimal GetStockPriceFor(string companyId)
    {
        InitializeMapIfNeededAsync().GetAwaiter().GetResult();
        return DoGetPriceFromCache(companyId);
    }

    // Purely synchronous version
    public decimal GetPriceFromCacheFor(string companyId)
    {
        InitializeMapIfNeeded();
        return DoGetPriceFromCache(companyId);
    }

    private decimal DoGetPriceFromCache(string name)
    {
        foreach (var kvp in _stockPricesCache)
        {
            if (kvp.name == name)
            {
                return kvp.price;
            }
        }

        throw new InvalidOperationException($"Can't find price for '{name}'.");
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private void InitializeMapIfNeeded()
    {
        // Similar initialization logic
    }

    private async Task InitializeMapIfNeededAsync()
    {
        if (_stockPricesCache != null)
        {
            return;
        }

        // Getting the stock prices from an external data source
        await Task.Delay(42);

        // Generate Count elements to make a cache hit slightly expensive
        _stockPricesCache = Enumerable.Range(1, Count)
            .Select(n => (name: n.ToString(), price: (decimal)n))
            .ToList();
        _stockPricesCache.Add((name: "MSFT", price: 42));
    }
}
```
The StockPrices class populates the cache with stock prices from an external data source and provides an API for querying it. The main difference from the example in the first article is that the price dictionary has been replaced with a list of prices. To measure the overhead of the different asynchronous and synchronous forms, the operation itself should do at least some work: DoGetPriceFromCache performs a linear search over _stockPricesCache using a plain loop, which avoids any extra allocations.
Synchronous vs. task-based asynchronous version
In the first benchmark we compare: 1. the async method that calls the asynchronous initialization method (GetStockPriceForAsync), 2. the synchronous method that calls the asynchronous initialization method (GetStockPriceFor), and 3. the synchronous method that calls the synchronous initialization method (GetPriceFromCacheFor).
```csharp
private readonly StockPrices _stockPrices = new StockPrices();

public SyncVsAsyncBenchmark()
{
    // Initialize _stockPricesCache
    _stockPrices.GetStockPriceForAsync("MSFT").GetAwaiter().GetResult();
}

[Benchmark]
public decimal GetPricesDirectlyFromCache()
{
    return _stockPrices.GetPriceFromCacheFor("MSFT");
}

[Benchmark(Baseline = true)]
public decimal GetStockPriceFor()
{
    return _stockPrices.GetStockPriceFor("MSFT");
}

[Benchmark]
public decimal GetStockPriceForAsync()
{
    return _stockPrices.GetStockPriceForAsync("MSFT").GetAwaiter().GetResult();
}
```
The results are as follows:
| Method                     | Mean     | Scaled | Gen 0  | Allocated |
|--------------------------- |---------:|-------:|-------:|----------:|
| GetPricesDirectlyFromCache | 2.177 us | 0.96   | -      | 0 B       |
| GetStockPriceFor           | 2.268 us | 1.00   | -      | 0 B       |
| GetStockPriceForAsync      | 2.523 us | 1.11   | 0.0267 | 88 B      |
The results are interesting:
- The async method is fast. GetStockPriceForAsync, which completes synchronously in this test, is about 15% slower than the purely synchronous method.
- The synchronous method GetStockPriceFor, which calls the asynchronous InitializeMapIfNeededAsync, has even less overhead, and, most interestingly, it does not allocate anything on the managed heap (the Allocated column is 0 B for it, just like for GetPricesDirectlyFromCache).
Of course, you cannot conclude that the overhead of the async machinery is 15% for every async method that completes synchronously. The percentage depends heavily on the amount of work the method does: comparing an async method that does nothing with a synchronous method that does nothing would show a much larger relative difference. This benchmark simply shows that the overhead of an async method doing a relatively small amount of work is modest.
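To make this concrete, here is a sketch of such a do-nothing comparison (the benchmark methods below are invented for illustration and are not part of the original measurements); because the methods do no real work, the relative difference they would report is far larger than 15%, even though the absolute overhead stays the same:

```csharp
// Hypothetical BenchmarkDotNet methods, for illustration only.
[Benchmark(Baseline = true)]
public int EmptySync() => 42;

[Benchmark]
public int EmptyAsync() => EmptyAsyncCore().GetAwaiter().GetResult();

// Completes synchronously, but still pays for the state machine and the builder.
private async Task<int> EmptyAsyncCore() => 42;
```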
Why does the call to InitializeMapIfNeededAsync not allocate anything? As I mentioned in the first article, an async method has to allocate at least one object on the managed heap: the task instance itself. Let's explore this question:
Optimization #1: caching the task instance when possible
The answer to the previous question is simple: AsyncMethodBuilder uses the same task instance for every asynchronous operation that completes successfully. An async method that returns a plain Task relies on AsyncMethodBuilder, whose SetResult method contains the following logic:
```csharp
// AsyncMethodBuilder.cs from mscorlib
public void SetResult()
{
    // I.e. the resulting task for all successfully completed
    // methods is the same -- s_cachedCompleted.
    m_builder.SetResult(s_cachedCompleted);
}
```
SetResult is called only for async methods that complete successfully, so the task representing a successful result can be shared by all such Task-returning methods. We can see this with the following test:
```csharp
[Test]
public void AsyncVoidBuilderCachesResultingTask()
{
    var t1 = Foo();
    var t2 = Foo();

    Assert.AreSame(t1, t2);

    async Task Foo() { }
}
```
But this is not the only optimization that can happen. AsyncTaskMethodBuilder<T> does a similar thing: it caches tasks for Task<bool> and for some other primitive types. For example, it caches the tasks for the default values of the integral types, and for Task<int> it also caches the tasks for values in the range [-1; 9) (see AsyncTaskMethodBuilder<T>.GetTaskForResult() for details).
The following test demonstrates this:
```csharp
[Test]
public void AsyncTaskBuilderCachesResultingTask()
{
    // These values are cached
    Assert.AreSame(Foo(-1), Foo(-1));
    Assert.AreSame(Foo(8), Foo(8));

    // But these are not
    Assert.AreNotSame(Foo(9), Foo(9));
    Assert.AreNotSame(Foo(int.MaxValue), Foo(int.MaxValue));

    async Task<int> Foo(int n) => n;
}
```
You should not rely on this behavior, but it is good to know that the language and framework authors try to optimize performance wherever they can. Caching a task is a common optimization pattern that is used in other places as well. For example, the new socket implementation in the CoreFX repository relies heavily on this optimization and uses cached tasks whenever possible.
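The same pattern is easy to apply in your own code. Here is a minimal sketch (the CachedTasks helper is invented for this illustration; it is not a CoreFX type) of pre-allocating completed tasks for the results you return most often, so that a hot path that usually completes synchronously does not allocate a new task on every call:

```csharp
public static class CachedTasks
{
    // Pre-allocated, already-completed tasks for the two possible bool results.
    private static readonly Task<bool> TrueTask = Task.FromResult(true);
    private static readonly Task<bool> FalseTask = Task.FromResult(false);

    // Returns a shared task instance instead of allocating a new one per call.
    public static Task<bool> FromBool(bool value) => value ? TrueTask : FalseTask;
}
```

A method that usually answers from a cache can return one of these shared instances and only fall back to a real asynchronous code path when it has to.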
Optimization #2: using ValueTask
The optimization above only works in a limited number of cases. Instead of relying on it, we can use ValueTask<T>: a special task-like type that avoids any extra allocation if the method completes synchronously.
We can think of ValueTask<T> as a discriminated union of T and Task<T>: if the "value task" is already completed, the underlying value is used directly; if the underlying task is not yet completed, a task instance is allocated.
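Conceptually, the idea looks roughly like the sketch below. This is a simplified illustration only, not the real System.Threading.Tasks.ValueTask<T>, which has more fields, members, and optimizations:

```csharp
// A simplified illustration of the idea behind ValueTask<T>.
public readonly struct SimpleValueTask<T>
{
    private readonly T _result;      // used when the result is already available
    private readonly Task<T> _task;  // used when the operation is still in flight

    public SimpleValueTask(T result) { _result = result; _task = null; }
    public SimpleValueTask(Task<T> task) { _result = default(T); _task = task; }

    public bool IsCompleted => _task == null || _task.IsCompleted;

    // When the value is already there, no task is ever allocated.
    public T Result => _task == null ? _result : _task.GetAwaiter().GetResult();
}
```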
This special type helps avoid unnecessary allocations when the operation completes synchronously. To use it, we only need to change the return type of GetStockPriceForAsync from Task<decimal> to ValueTask<decimal>:
```csharp
public async ValueTask<decimal> GetStockPriceForAsync(string companyId)
{
    await InitializeMapIfNeededAsync();
    return DoGetPriceFromCache(companyId);
}
```
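Call sites usually do not need to change, because ValueTask<T> can be awaited just like Task<T>. A small illustrative caller (this method is made up for the example):

```csharp
// Illustrative caller: 'await' works the same way whether GetStockPriceForAsync
// returns Task<decimal> or ValueTask<decimal>.
public async Task PrintStockPriceAsync(StockPrices stockPrices)
{
    decimal price = await stockPrices.GetStockPriceForAsync("MSFT");
    Console.WriteLine($"MSFT: {price}");
}
```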
Then we can use an additional benchmark to measure the difference:
```csharp
[Benchmark]
public decimal GetStockPriceWithValueTaskAsync()
{
    return _stockPrices.GetStockPriceValueTaskForAsync("MSFT").GetAwaiter().GetResult();
}
```
| Method                          | Mean     | Scaled | Gen 0  | Allocated |
|-------------------------------- |---------:|-------:|-------:|----------:|
| GetPricesDirectlyFromCache      | 1.260 us | 0.90   | -      | 0 B       |
| GetStockPriceFor                | 1.399 us | 1.00   | -      | 0 B       |
| GetStockPriceForAsync           | 1.552 us | 1.11   | 0.0267 | 88 B      |
| GetStockPriceWithValueTaskAsync | 1.519 us | 1.09   | -      | 0 B       |
As you can see, the ValueTask-returning method is slightly faster than the Task-returning one. The main difference, however, is the absence of heap allocations. We will discuss whether the conversion is worth it later, but before that I would like to introduce one more optimization.
Optimization #3: avoiding the async machinery on the common path
If you have a very widely used async method and want to reduce its overhead even further, you may consider the following optimization: remove the async modifier, check the state of the task inside the method, and perform the whole operation synchronously, without involving the async machinery at all.
Sounds complicated? Let's look at an example:
```csharp
public ValueTask<decimal> GetStockPriceWithValueTaskAsync_Optimized(string companyId)
{
    var task = InitializeMapIfNeededAsync();

    // Optimizing for a common case: no async machinery involved.
    if (task.IsCompleted)
    {
        return new ValueTask<decimal>(DoGetPriceFromCache(companyId));
    }

    return DoGetStockPricesForAsync(task, companyId);

    async ValueTask<decimal> DoGetStockPricesForAsync(Task initializeTask, string localCompanyId)
    {
        await initializeTask;
        return DoGetPriceFromCache(localCompanyId);
    }
}
```
In this example, the GetStockPriceWithValueTaskAsync_Optimized method does not have the async modifier. It gets a task from InitializeMapIfNeededAsync and checks whether that task has already completed; if it has, the method calls DoGetPriceFromCache and returns the result right away. If the task has not completed yet, it calls a local function (a feature available since C# 7.0) that awaits the task and then produces the result.
Using a local function is not the only option here, but it is the simplest one. Note, however, that the most natural way to write the local function captures the enclosing state: the local variables and parameters:
```csharp
public ValueTask<decimal> GetStockPriceWithValueTaskAsync_Optimized2(string companyId)
{
    // Oops! This will lead to a closure allocation at the beginning of the method!
    var task = InitializeMapIfNeededAsync();

    // Optimizing for a common case: no async machinery involved.
    if (task.IsCompleted)
    {
        return new ValueTask<decimal>(DoGetPriceFromCache(companyId));
    }

    return DoGetStockPricesForAsync();

    // Note: this time the local function captures the enclosing local variable and parameter
    async ValueTask<decimal> DoGetStockPricesForAsync()
    {
        await task;
        return DoGetPriceFromCache(companyId);
    }
}
```
Unfortunately, due to a compiler bug, this code allocates a closure even when the common path is taken (i.e. when the method returns from the if block). Here is what the method looks like after the compiler transformation:
```csharp
public ValueTask<decimal> GetStockPriceWithValueTaskAsync_Optimized(string companyId)
{
    var closure = new __DisplayClass0_0()
    {
        __this = this,
        companyId = companyId,
        task = InitializeMapIfNeededAsync()
    };

    if (closure.task.IsCompleted)
    {
        return ...
    }

    // The rest of the code
}
```
The compiler uses one shared closure instance for all the captured local variables and parameters in a given scope. So even though the code looks reasonable, it still causes the heap allocation we were trying to avoid.
Tip: this optimization technique is very brittle. The benefit is small, and even if you write the local function correctly today, a future change may accidentally capture an enclosing variable and reintroduce the heap allocation. It can still be worth using in highly reusable library code like the BCL, for methods that are definitely on a hot path.
The overhead of awaiting a task
So far we have only discussed one specific case: the overhead of an async method that completes synchronously. This was intentional. The smaller the async method, the more significant its overhead is for overall performance: fine-grained async methods do relatively little work, are more likely to complete synchronously, and are called relatively frequently.
But we should also understand the performance cost of the async machinery when a method awaits a task that is not yet finished. To measure it, we modify InitializeMapIfNeededAsync to call Task.Yield() even when the cache is already initialized:
```csharp
private async Task InitializeMapIfNeededAsync()
{
    if (_stockPricesCache != null)
    {
        await Task.Yield();
        return;
    }

    // Old initialization logic
}
```
Let's add the following methods to our benchmark suite:
```csharp
[Benchmark]
public decimal GetStockPriceFor_Await()
{
    return _stockPricesThatYield.GetStockPriceFor("MSFT");
}

[Benchmark]
public decimal GetStockPriceForAsync_Await()
{
    return _stockPricesThatYield.GetStockPriceForAsync("MSFT").GetAwaiter().GetResult();
}

[Benchmark]
public decimal GetStockPriceWithValueTaskAsync_Await()
{
    return _stockPricesThatYield.GetStockPriceValueTaskForAsync("MSFT").GetAwaiter().GetResult();
}
```
| Method                                | Mean      | Scaled | Gen 0  | Gen 1  | Allocated |
|-------------------------------------- |----------:|-------:|-------:|-------:|----------:|
| GetStockPriceFor                      | 2.332 us  | 1.00   | -      | -      | 0 B       |
| GetStockPriceForAsync                 | 2.505 us  | 1.07   | 0.0267 | -      | 88 B      |
| GetStockPriceWithValueTaskAsync       | 2.625 us  | 1.13   | -      | -      | 0 B       |
| GetStockPriceFor_Await                | 6.441 us  | 2.76   | 0.0839 | 0.0076 | 296 B     |
| GetStockPriceForAsync_Await           | 10.439 us | 4.48   | 0.1577 | 0.0122 | 553 B     |
| GetStockPriceWithValueTaskAsync_Await | 10.455 us | 4.48   | 0.1678 | 0.0153 | 577 B     |
As we can see, the difference in both speed and memory is significant. Here is a short explanation of the results:
- Each await of an unfinished task takes roughly 4 us and allocates roughly 300 B per call (the exact numbers depend on the platform, x64 vs. x86, and on the local variables and parameters of the async method). This explains why GetStockPriceFor_Await is much faster than GetStockPriceForAsync_Await and allocates roughly half the memory: it goes through the async machinery only once (inside InitializeMapIfNeededAsync), whereas the fully asynchronous version awaits an unfinished task in two methods.
- When an async method does not complete synchronously, the ValueTask-based version is slightly slower than the Task-based one, because the state machine of a ValueTask-based async method has to keep more data.
A summary of the performance of asynchronous methods
- If an async method completes synchronously, the extra performance overhead is fairly small.
- If an async method completes synchronously, the memory overhead is as follows: for async Task methods there is no extra allocation, and for async Task<T> methods there is an 88-byte allocation per operation (on the x64 platform, as in the benchmark above).
- ValueTask<T> removes that memory overhead for async methods that complete synchronously.
- A ValueTask<T>-based async method is slightly faster than a Task<T>-based one if the method completes synchronously, and slightly slower if it does not.
- Async methods that await an unfinished task have a much higher performance cost (roughly 300 extra bytes per operation on the x64 platform, as in the benchmark above).
As always, measure first. If you find that an asynchronous operation is causing a performance problem, you can switch from Task<T> to ValueTask<T>, cache a task, or add a synchronous fast path for the common case where possible. You can also try making your asynchronous operations more coarse-grained: this can improve performance, simplify debugging, and generally make the code easier to understand. Not every small piece of code has to be asynchronous.
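As a sketch of what coarse-graining can look like (the two batch methods below are invented for this example and assume they live inside the StockPrices class shown earlier):

```csharp
// Fine-grained: one async call per element pays the async overhead N times.
public async Task<decimal[]> GetPricesOneByOneAsync(string[] companyIds)
{
    var result = new decimal[companyIds.Length];
    for (int i = 0; i < companyIds.Length; i++)
    {
        result[i] = await GetStockPriceForAsync(companyIds[i]);
    }
    return result;
}

// Coarse-grained: await the initialization once, then do the per-element work synchronously.
public async Task<decimal[]> GetPricesBatchAsync(string[] companyIds)
{
    await InitializeMapIfNeededAsync();
    return companyIds.Select(DoGetPriceFromCache).ToArray();
}
```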
Other references
- Dissecting the async methods in C#
- Extending the async methods in C#
- Stephen Toub's comment about ValueTask's usage scenarios
- Dissecting the local functions in C#
Original post: The performance characteristics of async methods in C#