Translated from: https://blog.twitch.tv/gos-march-to-low-latency-gc-a6fa96f06eb7
We have built many high-concurrency systems at Twitch using Go. Its simplicity, safety, performance, and readability make it a great tool for the problems we solve as we serve live video and chat to millions of users.
But this is not another article about how well Go works for us. It is about the limits we ran into while using Go, and how we overcame them.
Improvements to the Go runtime between Go 1.4 and Go 1.6 cut our garbage collection (GC) pause times by a factor of 20; further tuning on Go 1.6 cut them by another factor of 10; and sharing our experience with the Go team helped Go 1.7 deliver that additional 10x reduction without any manual tuning on our part.
The beginning
Our IRC-based chat system was first written in Go at the end of 2013, replacing an earlier Python implementation. Using a pre-release version of Go 1.2, it could serve more than 500,000 concurrent users per physical host without special tuning. With a set of three goroutines (Go's lightweight threads of execution) serving each connection, that meant 1,500,000 goroutines per process. Even with this huge number of goroutines, the only serious performance problem we hit on Go 1.2 was GC pause time: each collection froze our application for tens of seconds.
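To make that scale concrete, here is a minimal sketch of the three-goroutines-per-connection pattern. The split into a reader, a writer, and a controller is an illustrative assumption, not Twitch's actual code, but it shows how 500,000 connections turn into 1,500,000 goroutines.

    package main

    import (
        "bufio"
        "net"
    )

    // handle serves one client connection with three goroutines: a reader,
    // a writer, and a controller. The exact division of labor here is an
    // assumption for illustration only.
    func handle(conn net.Conn) {
        incoming := make(chan string, 16)
        outgoing := make(chan string, 16)

        go func() { // reader: parse lines off the socket
            scanner := bufio.NewScanner(conn)
            for scanner.Scan() {
                incoming <- scanner.Text()
            }
            close(incoming)
        }()

        go func() { // writer: flush queued messages back to the client
            for msg := range outgoing {
                conn.Write([]byte(msg + "\r\n"))
            }
        }()

        go func() { // controller: stand-in for routing and protocol logic
            for msg := range incoming {
                outgoing <- msg
            }
            close(outgoing)
            conn.Close()
        }()
    }

    func main() {
        ln, err := net.Listen("tcp", ":6667")
        if err != nil {
            panic(err)
        }
        for {
            conn, err := ln.Accept()
            if err != nil {
                continue
            }
            handle(conn)
        }
    }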
Not only was each GC pause very expensive, the collector also ran several times per minute. We worked to reduce the number and size of our memory allocations so the GC would run less often; if the heap grew by only 50% every two minutes, we knew the allocation rate was low enough. Although there were fewer pauses, each one was still just as disruptive.
Once Go 1.2 was officially released, the GC pause time dropped to "only" several seconds. We spread the traffic across a larger number of processes, which brought the pauses down to a more acceptable range.
The work to reduce allocations kept paying off for our chat server as new Go versions arrived, but splitting the chat service into more processes was a workaround tied to a specific range of Go versions. Workarounds like that do not stand the test of time, while providing good service to our users does. Sharing our experience helped create lasting improvements to the Go runtime, benefiting every program rather than just ours.
Starting with Go 1.5, released in August 2015, the Go garbage collector is mostly concurrent and incremental, meaning it no longer needs to stop the application for most of its work. Apart from two relatively short stop-the-world phases, at the start and end of marking, our program can keep running while garbage collection is in progress. Upgrading to Go 1.5 immediately cut our chat system's GC pause time by a factor of 10, from about 2 seconds to roughly 200ms on a heavily loaded test instance.
Go 1.5: a new GC era
While the latency reduction in Go 1.5 was cause for celebration, the biggest benefit of the new GC was that it laid the groundwork for further incremental improvements.
The Go 1.5 garbage collector still has the same two main phases: the mark phase, in which the GC determines which memory allocations are still in use, and the sweep phase, in which unused memory is prepared for reuse. Each phase is now split in two. First, the application pauses while the previous sweep phase is terminated. Then the concurrent mark phase finds memory that is still in use while user code runs. Finally, the application pauses a second time while the mark phase is terminated. After that, unused memory is swept while the application goes about its business.
The runtime's gctrace feature prints a summary line for each GC cycle, including the duration of each phase. For our chat server it showed that most of the remaining pause time was spent in mark termination, so that is where we focused our analysis.
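Turning on gctrace only takes an environment variable. A minimal example follows; the sample output line is illustrative rather than copied from our servers, and its exact layout differs between Go versions.

    # print one summary line per GC cycle to stderr
    GODEBUG=gctrace=1 ./chat-server

    # illustrative Go 1.6-style output (numbers invented):
    #   gc 42 @105.2s 1%: 0.51+340+45 ms clock, 4.0+120/300/600+360 ms cpu, ...
    # The first and last "+"-separated clock figures are the stop-the-world
    # pauses; the last one is the mark termination pause discussed here.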
Of course, we needed more detail about what the GC was doing during these pauses. Go's core packages include a CPU profiler, but we combined it with the Linux perf tool. Using perf allows a higher sampling frequency and shows time spent in the kernel, which helps when debugging slow system calls and the virtual memory management the kernel performs transparently on our behalf.
The picture below is part of a profile of our chat server running go1.5.1. It is a flame graph made with Brendan Gregg's tools, trimmed to include only samples with the runtime.gcMark function on the stack, which in Go 1.5 approximates the time spent in mark termination.
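For readers who want to reproduce this kind of profile, the commands below show one way to do it with perf and Brendan Gregg's FlameGraph scripts. The binary name, sampling rate, and duration are placeholders.

    # sample user and kernel stacks of the running server for 30 seconds
    sudo perf record -F 997 -g -p "$(pgrep chat-server)" -- sleep 30

    # fold the stacks, keep only those containing the GC mark function,
    # and render an SVG flame graph
    sudo perf script | ./stackcollapse-perf.pl | grep gcMark | ./flamegraph.pl > gcmark.svg

    # note: for Go 1.5/1.6 binaries, useful stacks require a toolchain
    # built with GOEXPERIMENT=framepointer (see the discussion below)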
Flame graphs display stack depth growing upward and CPU time as the width of each box. (The colors are meaningless, and the ordering on the x axis is irrelevant; it is simply alphabetical.) On the left of the chart we can see that runtime.gcMark calls runtime.parfordo in almost every sampled stack. Moving upward, we see most of the time spent in runtime.markroot calling runtime.scang, runtime.scanobject, and runtime.shrinkstack.
The runtime.scang function rescans memory to help terminate the mark phase. The whole idea of mark termination is to finish scanning the application's memory, so this work has to happen there.
Next comes runtime.scanobject. That function does several things, but the reason it runs during mark termination in Go 1.5 on a chat server is to process finalizers. Why would a program use so many finalizers, and why would they account for so much GC pause time? The application in question is a chat server that handles hundreds of thousands of users at a time. Go's core net package attaches a finalizer to each TCP connection to help guard against file descriptor leaks, and since each user has their own TCP connection, even one finalizer per connection adds up.
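To make the mechanism concrete, here is a minimal sketch of a connection wrapper that uses runtime.SetFinalizer to guard against descriptor leaks. The trackedConn type is hypothetical; it only stands in for what the net package does internally.

    package main

    import (
        "net"
        "runtime"
    )

    // trackedConn is a hypothetical wrapper, standing in for the kind of
    // structure the net package keeps around a descriptor.
    type trackedConn struct {
        c net.Conn
    }

    func newTrackedConn(c net.Conn) *trackedConn {
        t := &trackedConn{c: c}
        // Close the connection even if its owner forgets to. Before Go 1.6,
        // objects carrying finalizers were processed during the
        // stop-the-world mark termination phase, so one finalizer per
        // connection, multiplied across hundreds of thousands of
        // connections, showed up directly in pause times.
        runtime.SetFinalizer(t, func(t *trackedConn) { t.c.Close() })
        return t
    }

    func (t *trackedConn) Close() error {
        runtime.SetFinalizer(t, nil) // finalizer no longer needed once closed
        return t.c.Close()
    }

    func main() {
        c, err := net.Dial("tcp", "example.com:80")
        if err != nil {
            return
        }
        t := newTrackedConn(c)
        defer t.Close()
    }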
This seemed worth reporting to the Go runtime team. We communicated over email, and the Go team was very helpful, suggesting how to diagnose the performance problem and how to distill it into a minimal test case. In Go 1.6, the runtime team moved finalizer scanning into the concurrent mark phase, producing shorter pauses for applications with large numbers of TCP connections. Combined with all the other improvements in that release, our chat server's pause time on Go 1.6 was about half what it had been on Go 1.5, down to about 100ms on the test instance.
Stack shrinkage
Go's concurrency model makes it very cheap to launch large numbers of goroutines. While a program using 10,000 operating system threads might perform poorly, that many goroutines is perfectly normal. One difference is that goroutines start with a very small stack, only 2kB, which grows as needed, in contrast to the large fixed-size stacks that are common elsewhere.
Go's function call preamble ensures there is enough stack space for the next call; if there is not, it moves the goroutine's stack to a larger memory region, rewriting pointers as needed, before allowing the call to proceed.
So a goroutine's stack grows to support the deepest calls it makes. One of the garbage collector's duties is to reclaim stack memory that is no longer needed. Moving goroutine stacks to more appropriately sized memory regions is done by runtime.shrinkstack, which in Go 1.5 and 1.6 ran during mark termination, while the application was paused.
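The sketch below (our own illustration, not chat-server code) shows a goroutine whose stack grows during an occasional deep call chain; the oversized stack left behind is exactly what runtime.shrinkstack later reclaims.

    package main

    import "fmt"

    // deep forces stack growth by recursing with a frame that holds a
    // modest local buffer. A goroutine that has made such a call keeps its
    // enlarged stack until the GC shrinks it; in Go 1.5 and 1.6 that
    // shrinking happened inside the mark termination pause.
    func deep(n int) int {
        var pad [256]byte
        pad[0] = byte(n)
        if n == 0 {
            return int(pad[0])
        }
        return deep(n-1) + int(pad[0])
    }

    func main() {
        done := make(chan int)
        go func() {
            // An occasional deep call grows this goroutine's stack far
            // beyond its initial 2kB...
            done <- deep(10000)
        }()
        // ...but most of the time the goroutine needs only a shallow
        // stack, so the extra memory is reclaimable by the collector.
        fmt.Println(deep(1), <-done)
    }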
In the flame graph above, runtime.shrinkstack appears in about three quarters of the samples. If that work could be done while the application runs, it could significantly help our chat server and other programs like it.
The Go runtime package's documentation explains how to disable stack shrinking. For our chat server, wasting some memory was an easy trade for shorter pauses. With stack shrinking disabled, the chat server's pause time dropped again, to between 30 and 70ms.
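For reference, the switch lives in the GODEBUG environment variable; in the Go 1.5/1.6 runtime documentation the relevant setting is gcshrinkstackoff (check the runtime package docs for your version before relying on the exact name).

    # disable moving goroutines onto smaller stacks during GC
    GODEBUG=gcshrinkstackoff=1 ./chat-server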
While keeping the structure and operation of the chat service relatively constant, we went from enduring multi-second GC pauses on Go 1.2 through 1.4, to about 200ms on Go 1.5, to about 100ms on Go 1.6. Pauses were now typically under 70ms, an improvement of more than 30x.
Of course there was still room for improvement; let's look at another profile.
Page faults
GC pauses now varied across a range of roughly 30 to 70ms. Below is a flame graph of where cycles were spent during some of the longer mark termination pauses:
When the Go GC calls runtime.gcRemoveStackBarriers, the system generates a page fault, invoking the kernel's page_fault function, which is the wide tower in the center of the graph. A page fault is how the kernel maps a page of virtual memory (typically 4kB) onto a page of physical RAM. Processes are often allowed to allocate large amounts of virtual memory, which only becomes resident memory, via a page fault, when the program accesses it.
The runtime.gcRemoveStackBarriers function modifies memory the program has recently accessed: its job is to remove the stack barriers that were added at the start of the preceding GC cycle. The machine had plenty of memory available, and it was not handing physical RAM over to other, needier processes, so why were these memory accesses causing page faults?
Some background on our hardware may help. The servers we use for the chat system are modern dual-socket machines. Each CPU socket has several memory banks attached directly to it. This arrangement leads to NUMA, non-uniform memory access. When a thread runs on a core in socket 0, it has fast access to the physical memory attached to that socket and somewhat slower access to the rest. The Linux kernel tries to reduce this latency by keeping threads close to the memory they use and by moving pages of physical memory close to the threads that touch them.
With that in mind, we can take a closer look at the kernel's page_fault function. Walking up the call stack (moving upward in the flame graph), we see the kernel calling do_numa_page and migrate_misplaced_page, indicating that it is moving the program's memory between banks of physical memory.
The Linux kernel had picked up on the nearly meaningless memory access pattern of the GC's mark termination phase and was migrating memory pages at great expense to match it. This behavior was barely visible in the go1.5.1 flame graph, but now that our focus was on runtime.gcRemoveStackBarriers, it was much more pronounced.
This is where the benefit of profiling with perf is most obvious. The perf tool can show kernel stacks, which Go's user-level profiler cannot see. Using perf is considerably more involved: it requires root access to view kernel stacks, and for Go 1.5 and 1.6 it requires a non-standard build of the Go toolchain (via GOEXPERIMENT=framepointer ./make.bash; this is no longer necessary in Go 1.7). For problems like this one, it is well worth the trouble.
Controlling migration
If using two CPU sockets and two banks of memory is the trouble, let's use only one. The bluntest tool for the job is the taskset command, which restricts a program to run only on the CPUs of a single socket. Because the program's threads then access memory from only one socket, the kernel moves the process's memory onto that socket's adjacent banks.
After restricting the program to a single NUMA node, its mark termination time dropped to 10-15ms. (The same benefit can be had without giving up half the server by setting the process's memory policy to MPOL_BIND via set_mempolicy(2) or mbind(2).) The profile above is from an October 2015 build of what became Go 1.6; the runtime.freeStackSpans work shown on the left has since been moved into a concurrent GC phase and no longer causes long pauses. There was now little left to remove from the mark termination phase.
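As a concrete sketch, commands like the following pin a process to one socket or bind its memory policy. The CPU list, node number, and binary name are placeholders for whatever matches your hardware; numactl is simply a convenient front end to the same set_mempolicy/mbind machinery.

    # inspect the NUMA topology first
    numactl --hardware

    # run only on the CPUs of NUMA node 0 (adjust the CPU list to your machine)
    taskset -c 0-11 ./chat-server

    # or keep every CPU but bind memory allocations to node 0
    numactl --membind=0 ./chat-server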
Go 1.7
On Go 1.6, we avoided the cost of stack shrinking by disabling it. That had little impact on the chat server's memory usage, but it added operational complexity. Stack shrinking is very important for some programs, so we applied the change to a small set of applications rather than to everything we run. Go 1.7 now shrinks stacks concurrently, while the application keeps running.
Since the introduction of the concurrent GC in Go 1.5, the runtime has tracked whether a goroutine has executed since its stack was last scanned. The mark termination phase would check every goroutine to see whether it had recently run, and would rescan the few that had. In Go 1.7, the runtime instead maintains a separate short list of goroutines that have run since the last scan. This removes the need to walk the entire goroutine list while user code is paused, and it greatly reduces the memory accesses that could trigger the kernel's NUMA page migration.
Finally, the compiler for the amd64 architecture now maintains frame pointers by default, so standard debugging and performance tools such as perf can determine the current function call stack. Users who build their programs with the binary Go releases can reach for more advanced tools when they need them, without having to learn how to rebuild the Go toolchain and then recompile and redeploy their programs. This bodes well for future performance improvements to Go's core packages and runtime, since engineers will be able to collect high-quality bug reports.
With the Go 1.7 release in the summer of 2016, GC pause times are better than ever, without any manual tuning. Our chat server's pause time is close to 1ms out of the box, roughly 10x better than our tuned Go 1.6 configuration!
Sharing our experience with the Go team enabled them to find lasting solutions to the problems we faced. Profiling and tuning let our application cut its pause times 10x across Go 1.5 and 1.6, but between Go 1.5 and Go 1.7 the runtime team reduced pause times roughly 100x for all applications.
Next
All of this analysis focused on our chat server's stop-the-world pause times, but that is only one dimension of GC performance. With the GC's awkward pauses finally under control, the runtime team is ready to tackle throughput.
Their recent proposal for a transaction-oriented collector describes a way to transparently allocate and collect memory that is not shared between goroutines. This could delay the need for full GC runs and reduce the total CPU cycles a program spends on garbage collection.
And of course, Twitch is hiring! If this kind of work interests you, send us an email.
Thanks
I want to thank Chris Carroll and John Rizzo for safely testing new Go releases on their chat systems, and Spencer Nelson and Mike Ossareh for editing this article with me. I also want to thank the Go runtime team for helping me file good bug reports and for their continuous improvements to Go's garbage collector.