We have a latency-sensitive module that needs to reach another machine on the network to fetch a timestamp. A distributed transaction has to fetch the timestamp twice, so if the fetch is slow, the latency of the whole transaction rises with it. In theory, a round trip within the same data center should be under 0.5ms; most simple read requests should complete in about 1ms, and 80% of requests are expected to finish within 4ms. A customer reported latencies of more than 30ms, and the problem could be reproduced on our intranet by running an OLTP test with sysbench, so I set out to investigate.
The OpenTracing data showed that this step did indeed have a large delay, and the logs were full of slow-query entries; it was clearly affecting the overall completion time of the transaction. The first task was to pin down where the slowness came from: the network, or the runtime.
One colleague observed that the benchmark for this module spins up an extra 1000 worker goroutines that tick idly once per second. Compared with running the benchmark without them, the latency with the idle workers was much higher, which made us suspect a problem in the runtime.
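As a rough sketch (hypothetical code, not the colleague's actual benchmark), the setup looks something like this: 1000 extra goroutines that do nothing but wake up on a one-second tick, running alongside a trivial round trip whose latency we measure:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	// Spawn 1000 "idle" workers that only wake up once per second.
	for i := 0; i < 1000; i++ {
		go func() {
			t := time.NewTicker(time.Second)
			defer t.Stop()
			for range t.C {
				// Do nothing; the goroutine just gets scheduled once a second.
			}
		}()
	}

	// The "real" work: a trivial request/response round trip through a channel,
	// measured while the idle workers exist.
	var wg sync.WaitGroup
	req, resp := make(chan struct{}), make(chan struct{})
	wg.Add(1)
	go func() {
		defer wg.Done()
		for range req {
			resp <- struct{}{}
		}
	}()

	for i := 0; i < 10; i++ {
		start := time.Now()
		req <- struct{}{}
		<-resp
		fmt.Println("round trip:", time.Since(start))
	}
	close(req)
	wg.Wait()
}
```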
Another test ran the module alone while watching the network, and network retransmissions turned out to have a visible impact on the results. We then moved the client and server onto the same machine to rule out network interference, only to find that the two processes affected each other. After pinning the server and client to different cores, the server-side processing time became fairly stable, but the client still saw high latencies. The conclusion at this point was that neither the runtime nor the network could be counted on to be stable.
But can the runtime really account for delays of tens of milliseconds? That did not sound reasonable. In my mind everything should be on the order of microseconds; even back in Go 1.0, the stop-the-world GC was not that bad, and by 1.9 things have been optimized to the point where the GC no longer stops the world. So I turned to go tool trace to keep digging, and what it showed was genuinely shocking.
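For reference, a trace like the one discussed below can be captured with the standard runtime/trace package (a minimal sketch, assuming we can wrap the suspect workload) and then inspected with `go tool trace trace.out`:

```go
package main

import (
	"os"
	"runtime/trace"
)

func main() {
	// Write the execution trace to a file, then view it with: go tool trace trace.out
	f, err := os.Create("trace.out")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		panic(err)
	}
	defer trace.Stop()

	// ... run the latency-sensitive workload here ...
}
```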
I captured the section of the trace shown here; the red arrow marks the point where a network message is received and the goroutine blocked on the read is unblocked. Note that from the moment the network message became readable to the moment the reading goroutine was scheduled again, 4.368ms went by! I even found more extreme cases where it took 19ms from the message arriving and becoming readable to the goroutine actually being woken up. Some background on the business logic: for performance reasons, the implementation batches requests, so requests are forwarded over a channel to a single goroutine, which groups them into batches. That goroutine is obviously critical, because all the other goroutines depend on it, and a millisecond-level scheduling delay on it has a significant impact on the overall latency of the business.
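A rough sketch of that batching pattern, with hypothetical names (`request`, `batcher`, and `sendBatch` are illustrative, not the module's real code): many goroutines send requests into one channel, and a single goroutine flushes them as a batch. If that one goroutine is scheduled late, every caller waits.

```go
package batch

import "time"

// request is a hypothetical request; done carries the result back to the caller.
type request struct {
	payload []byte
	done    chan error
}

// sendBatch stands in for the real RPC that handles a whole batch in one round trip.
func sendBatch(batch []*request) error { return nil }

// batcher drains requests from in and flushes them either when the batch is full
// or when maxWait has elapsed since the first request of the batch arrived.
func batcher(in <-chan *request, maxBatch int, maxWait time.Duration) {
	for {
		first, ok := <-in
		if !ok {
			return
		}
		batch := []*request{first}
		timeout := time.After(maxWait)
	collect:
		for len(batch) < maxBatch {
			select {
			case r, ok := <-in:
				if !ok {
					break collect
				}
				batch = append(batch, r)
			case <-timeout:
				break collect
			}
		}
		err := sendBatch(batch)
		for _, r := range batch {
			r.done <- err
		}
	}
}
```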
Now a word about when goroutines get scheduled. A goroutine is a coroutine: once it is running, it keeps running until it blocks and gives up the CPU, for example when it hits a lock, reads from a channel, or issues an I/O request. After a goroutine is switched out, once its wakeup condition is met it is put back on the run queue to wait its turn. When it actually runs again is uncertain: it depends on the length of the run queue, how long the tasks ahead of it take to execute, and the load on the system at that moment.
The problem here is not the GC but the scheduler. The latency ultimately comes down to Go's scheduling design, in particular its fair scheduling of coroutines:
- No preemption
- No concept of priority
Because there is no preemption, suppose a network message arrives, but at that moment every CPU is occupied by a running goroutine that cannot be kicked off; then the goroutine reading the network gets no chance to be woken up.
Because there is no concept of priority, suppose some goroutine finally blocks and gives up a CPU; who runs next is entirely up to the scheduler's mood, and if the network-reading goroutine is unlucky, it still does not get woken up.
And as long as a goroutine never makes a function call, there is no opportunity to trigger scheduling at all, so it never yields the CPU.
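Here is a minimal sketch of that effect, assuming the pre-1.14 runtimes discussed in this article (Go 1.14 later added asynchronous preemption, which changes the outcome):

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	runtime.GOMAXPROCS(1) // a single P makes the effect easy to observe

	go func() {
		// A tight loop with no function calls: on pre-1.14 runtimes this loop
		// never reaches a preemption point, so it holds the CPU until it finishes.
		sum := 0
		for i := 0; i < 2e9; i++ {
			sum += i
		}
		fmt.Println(sum)
	}()

	start := time.Now()
	time.Sleep(time.Millisecond) // asks to be woken after ~1ms...
	// ...but on pre-1.14 runtimes this can report a delay of seconds, because
	// the spinning goroutine cannot be preempted off the only P.
	fmt.Println("woke up after", time.Since(start))
}
```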
Go is famous for being able to spawn tens of thousands of goroutines, but that comes at a cost: the more goroutines there are being scheduled "fairly", the more likely the wakeup of an important goroutine is delayed, and with it the overall latency.
Looking back at my colleague's test, the fact that idle workers increase latency now makes sense: since scheduling is probabilistically fair, the more unrelated goroutines there are, the lower the probability that the goroutine doing the real work gets scheduled, and so latency goes up.
Although Go's garbage collector no longer stops the world, it can still affect latency: the GC can interrupt a goroutine and ask it to give up the CPU, and when that goroutine gets scheduled back in is, again, a matter of luck.
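One simple way to check how much the GC contributes (a sketch; running with GODEBUG=gctrace=1 gives similar information) is to watch the pause history in runtime.MemStats and see whether pauses line up with the latency spikes:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	var ms runtime.MemStats
	for {
		runtime.ReadMemStats(&ms)
		if ms.NumGC > 0 {
			// PauseNs is a circular buffer; this indexes the most recent pause.
			last := time.Duration(ms.PauseNs[(ms.NumGC+255)%256])
			fmt.Printf("numGC=%d lastPause=%v\n", ms.NumGC, last)
		}
		time.Sleep(time.Second)
	}
}
```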
There are simply too many factors influencing scheduling, which makes latency inside the runtime uncontrollable. Under light load the scheduling pressure may not be noticeable, but under high pressure the behavior gets worse and worse.