The Go language GC: the evolution from 2-second to 1-millisecond pauses



Reprinted from: http://blog.csdn.net/erlib/article/details/51850912


English original link: https://blog.twitch.tv/gos-march-to-low-latency-gc-a6fa96f06eb7#.lrmfby2xs



Below is the history of how GC pause times evolved at https://www.twitch.tv, a live video streaming site, over its years of using Go.

We run a live video streaming service with millions of online users. The messaging and chat systems are written entirely in Go, and a single server holds roughly 500,000 concurrent user connections. Over the version iterations from Go 1.4 to 1.5 the GC improved about 20x, Go 1.6 brought another 10x, and then, after working with the Go runtime development team, Go 1.7 delivered a further 10x (before 1.7 we did a lot of GC parameter tuning; on 1.7 none of that tuning is needed, the stock runtime handles it). In total that is roughly a 2000x improvement!!! GC pause times went from 2 seconds down to 1 millisecond, and without any GC tuning at all!


So let's start the GC adventure.

In 2013 we rewrote our IRC-based chat system in Go; it had previously been written in Python. The Go version at the time was 1.1, and after the rewrite we could reach 500,000 online users per machine without any special tuning. Each user is served by 3 goroutines, so the system had a full 1.5 million goroutines running. The remarkable thing is that the system had no performance problems at all, except for GC: it ran a few times per minute, and each collection paused the program for several seconds, up to around 10 seconds. For our interactive service, that was absolutely intolerable.
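As a rough illustration of what "3 goroutines per user" means in practice (the names and structure below are hypothetical, not Twitch's actual code), a per-connection design in Go typically dedicates one goroutine to reading, one to writing, and one to supervising the connection:

```go
// Hypothetical sketch of a per-connection goroutine layout.
package chat

import (
	"bufio"
	"net"
)

func serve(conn net.Conn, handle func([]byte)) {
	outgoing := make(chan []byte, 16)
	done := make(chan struct{})

	// Goroutine 1: read messages from the client.
	go func() {
		defer close(done)
		r := bufio.NewReader(conn)
		for {
			line, err := r.ReadBytes('\n')
			if err != nil {
				return
			}
			handle(line)
		}
	}()

	// Goroutine 2: write queued messages back to the client.
	go func() {
		for msg := range outgoing {
			if _, err := conn.Write(msg); err != nil {
				return
			}
		}
	}()

	// Goroutine 3: supervise the connection and clean up when the reader exits.
	go func() {
		<-done
		close(outgoing)
		conn.Close()
	}()
}
```

With 500,000 connections, a layout like this is exactly how a process ends up running well over a million goroutines.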

We made a lot of optimizations to the system, including reducing object allocations and controlling the number of live objects, and both the GC frequency and the STW (stop-the-world) pause time improved. The system ended up collecting roughly once every 2 minutes; although collections were now rare, each pause was still devastatingly long.

With the release of Go 1.2, the GC STW time shortened to roughly a couple of seconds, and we then sharded the service across more processes, which reduced pauses to a barely acceptable level. But partitioning the service this way was a huge burden for us, and it remained closely tied to the Go releases.

After we started using Go 1.5 in August 2015, Go adopted a concurrent, incremental GC, which means the system no longer has to endure a single super-long STW pause. Upgrading to 1.5 brought us a 10x GC improvement, from 2 seconds down to about 200 milliseconds.

Go 1.5: A New Era for GC

While Go 1.5's own GC improvement was great, what matters even more is that it set the stage for continuous improvement in future releases!

The GC in Go 1.5 still has two main phases: the mark phase, in which the GC works out which objects and memory are still in use and which are not, and the sweep phase, in which the unused memory is prepared for reuse. The mark phase itself has two sub-phases: first the application is paused while the previous sweep is terminated, and then the concurrent mark phase finds the memory that is in use; in the second sub-phase, mark termination, the application is paused again. Finally the unused memory is gradually reclaimed; this sweep work runs concurrently with the application and does not STW.

GC cycles can be traced with gctrace (the GODEBUG=gctrace=1 setting), which reports the time spent in each phase. For our service it showed that most of the time was spent at mark termination, so our GC analysis focused on the mark termination phase.
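For reference, here is what that output looks like; the numbers below are invented, not measurements from our service, and the exact format differs slightly between Go releases. In the 1.6-style format the three clock values are the sweep-termination STW pause, the concurrent mark phase, and the mark-termination STW pause:

```
$ GODEBUG=gctrace=1 ./chat-server
gc 42 @105.3s 1%: 0.21+45+180 ms clock, 1.6+12/88/210+1400 ms cpu, 940->965->480 MB, 980 MB goal, 8 P
```

The last clock value (mark termination) is the one that dominated for us.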

To profile the GC, Go ships with pprof, but we decided to use the Linux perf tool instead. perf lets us capture higher-frequency samples and also observe time spent in the OS kernel; monitoring the kernel helps us debug slow system calls and the like.

Here is our profile, taken with Go 1.5.1. It is a flame graph, produced with Brendan Gregg's tools and trimmed to remove the unimportant parts, leaving the runtime.gcMark section; the time spent in this function can be taken as the STW time of the mark phase.
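For reference, a flame graph like this can be produced with perf plus Brendan Gregg's FlameGraph scripts, roughly as follows (the PID and script paths are placeholders):

```
# Sample the running service with call stacks for 30 seconds.
sudo perf record -F 997 -g -p <chat-server-pid> -- sleep 30

# Fold the stacks and render an SVG flame graph.
sudo perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > gc-flame.svg
```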


The diagram stacks call frames vertically; the width of each block represents CPU time, while the color and the left-to-right order within a row carry no meaning. At the far left of the chart we find the runtime.gcMark function, which calls runtime.parfordo. Moving up the stack, we see that most of the time is spent in runtime.markroot, which calls runtime.scang, runtime.scanobject, and runtime.shrinkstack.

The runtime.scang function rescans stacks at mark termination; it is required there and cannot be optimized away, so let's look at the other two functions.

Next is runtime.scanobject. This function does several things, but the reason it runs during mark termination for our service is to process finalizers. You might wonder: why would a program use so many finalizers that they put pressure on the GC? Because our application is a messaging and chat service handling hundreds of thousands of connections, and Go's core net package attaches a finalizer to every TCP connection to help guard against file descriptor leaks.
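The mechanism behind this is runtime.SetFinalizer. A minimal sketch of the pattern, using a hypothetical wrapper type rather than the net package's real internals, looks like this:

```go
package fdguard

import (
	"runtime"
	"syscall"
)

// fd wraps an OS file descriptor, roughly the way a network connection does.
type fd struct {
	sysfd int
}

func newFD(sysfd int) *fd {
	f := &fd{sysfd: sysfd}
	// If the owner forgets to call Close, the GC eventually runs this
	// finalizer and releases the descriptor instead of leaking it.
	runtime.SetFinalizer(f, (*fd).Close)
	return f
}

func (f *fd) Close() error {
	if f.sysfd < 0 {
		return nil
	}
	err := syscall.Close(f.sysfd) // Unix-style close of the raw descriptor
	f.sysfd = -1
	runtime.SetFinalizer(f, nil) // descriptor released; finalizer no longer needed
	return err
}
```

Every live object carrying a finalizer is extra work for the collector, which is why hundreds of thousands of connections add up.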

We had a lot of communication with the Go runtime group, and they provided some diagnostic options for us to try. In Go 1.6 the finalizer scan was moved into the concurrent phase, which noticeably improved GC performance for applications with many connections. So on 1.6, STW time was half of what it was on 1.5: from 200 ms down to 100 ms!

Stack shrinkage

A Go goroutine starts with a 2KB stack and grows it as needed. Before each call, Go checks whether the stack is large enough; if not, the goroutine's old stack is copied to a new, larger memory region, and pointers into it are rewritten as needed.

As a result, goroutine stacks grow automatically while the program runs to meet the needs of function calls. One of the GC's tasks is to reclaim stack space that is no longer needed by moving goroutine stacks back into appropriately sized memory regions. This work is done by runtime.shrinkstack, and in Go 1.5 and 1.6 it was done during the mark STW phase.
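As a small, self-contained illustration of the growth half of this mechanism (not code from our service), deep recursion with non-trivial stack frames forces the runtime to copy the goroutine's stack to a larger block; shrinking it back afterwards is the GC's job:

```go
package main

import "fmt"

// grow recurses with a sizeable frame so the goroutine quickly outgrows its
// initial 2KB stack; the runtime copies the stack to a larger block and fixes
// up pointers, invisibly to the program.
func grow(depth int) int {
	var buf [256]byte
	buf[0] = byte(depth)
	if depth == 0 {
		return int(buf[0])
	}
	return grow(depth-1) + int(buf[0])
}

func main() {
	// After this burst of deep calls returns, the large stack is no longer
	// needed; returning it to a smaller size is what runtime.shrinkstack does.
	fmt.Println(grow(100000))
}
```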


We recorded a GC flame graph on Go 1.6, and runtime.shrinkstack accounted for roughly three quarters of the pause time. If this work could be done asynchronously while the application keeps running, it would be a big improvement for our service.

The Go runtime package documentation describes how to disable stack shrinking. For our service, wasting some memory was a fair trade for a GC improvement, so we decided to disable stack shrinking. The GC then got another 2x boost, and STW time came down to 30-70 ms.
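For reference, the knob described there is a GODEBUG option (gcshrinkstackoff; check the runtime package documentation for your Go version), set when the process starts:

```
GODEBUG=gcshrinkstackoff=1 ./chat-server
```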

Is there any way to continue optimizing? Let's get another profile!


Page faults?!

Careful readers will have noticed that the range of GC times above is still quite wide: 30-70 ms. The flame graph here shows the STW situation for one of the longer pauses:

When the GC calls runtime.gcRemoveStackBarriers, the system generates a page fault, leading to a kernel function call: page_fault. A page fault is how the kernel maps virtual memory to physical memory: a program is often allowed to allocate large amounts of virtual memory, and only when the program accesses it does a page fault map it onto physical memory.

The runtime.gcRemoveStackBarriers function modifies stack memory that the program has just recently been accessing; its purpose is to remove the stack barriers inserted at the start of the GC cycle. During this period the system has plenty of memory available, so the question is: why do these memory accesses cause page faults?

Some computer hardware background may help here. The servers we use are modern dual-socket machines, i.e. motherboards with two CPU sockets, each socket with its own attached memory banks. This is NUMA, the non-uniform memory access architecture: when a thread runs on socket 0, it accesses socket 0's memory quickly and the other socket's memory slowly. The Linux kernel tries to reduce this latency by running threads near the memory they use and by migrating physical memory pages close to the threads that access them.

With this background in mind, look at the kernel's page_fault function. Continuing up the flame graph's call stack, we can see that the kernel calls do_numa_page and migrate_misplaced_page, which move the program's memory between the two sockets' memory banks.

Here, the memory access pattern the kernel is reacting to is essentially meaningless, and migrating memory pages to match it is expensive.

Fortunately we had perf; with it we could trace the kernel's behavior, which is impossible with Go's built-in pprof alone: there you would only see that the program is mysteriously slow, with no idea where. Using perf is relatively involved, however: it requires root access to collect kernel stacks, and with Go 1.5 and 1.6 it required a non-standard build of the Go toolchain (compiled with GOEXPERIMENT=framepointer ./make.bash). The good news is that Go 1.7 supports this kind of debugging natively, with no extra work. Troublesome or not, this kind of measurement was essential for our service.
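As a rough sketch of that pre-1.7 workflow (the source path and PID are placeholders):

```
# Rebuild the Go toolchain with frame pointers enabled (only needed before Go 1.7).
cd /path/to/go/src
GOEXPERIMENT=framepointer ./make.bash

# Record user and kernel stacks from the running service (requires root).
sudo perf record -g -p <chat-server-pid> -- sleep 30
sudo perf report
```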

Controlling Memory migration

If using two CPU sockets and two banks of memory is too complicated, then let's use only one CPU socket: we can bind the process to one socket's CPUs with the Linux taskset command. In that scenario, the program's threads only access the adjacent memory, and the kernel settles the program's memory onto that socket's banks.

After this change (besides binding the CPUs, you can also use the set_mempolicy(2) or mbind(2) system calls to set the memory policy to MPOL_BIND), STW time dropped to 10-15 ms. This profile was taken with a pre-release of Go 1.6. Note runtime.freeStackSpans here: this function has since been moved into the concurrent GC phase, so it no longer needs attention. At this point there was not much left to optimize in the STW phase.
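For reference, the pinning can look roughly like this; the CPU numbers depend on the machine's topology, and numactl is an alternative that also binds the memory policy:

```
# Run the service only on the CPUs of socket 0 (CPUs 0-11 here, as an example).
taskset -c 0-11 ./chat-server

# Or bind both CPUs and memory to NUMA node 0.
numactl --cpunodebind=0 --membind=0 ./chat-server
```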

Go 1.7

Through 1.6, we optimized the GC by disabling stack shrinking. These workarounds had side effects, such as increased memory consumption, and they greatly increased operational complexity. For some programs stack shrinking matters a great deal, so we only applied these tweaks to some applications. Fortunately Go 1.7 arrived. It is said to be the most improved release in Go's history, and the GC improvement is substantial: stack shrinking is now concurrent, so we get low latency while avoiding runtime tuning entirely; the standard runtime is all we need.

Since Go 1.5 introduced the concurrent GC, the runtime has tracked whether each goroutine has executed since the last scan of its stack. The STW phase would check every goroutine to see whether it had run and then rescan the ones that had. Starting with Go 1.7, the runtime maintains a separate short list of such goroutines, so the STW phase no longer needs to traverse every goroutine, which also greatly reduces the memory accesses that trigger the kernel's NUMA migrations.

Finally, in 1.7 the amd64 compiler maintains frame pointers by default, so standard debugging and performance tools such as perf can walk the current Go call stack. This lets you use more advanced tools with standard builds of your programs, eliminating the need for a specially built Go toolchain. It is a very good improvement for whole-system performance work!

Using a pre-release of 1.7 published in June 2016, GC STW time reached an astonishing 1 ms, without any tuning, which is another 10x improvement over Go 1.6!

We shared our experience with the Go development team and helped them find solutions to several GC issues. All in all, from where we started to Go 1.7, GC pause time improved by roughly 20 x 10 x 10 = 2000x!!! Hats off to the Go development team!

What's next?

All of the analysis above focused on the GC's STW phase, but for GC that is just one dimension of tuning. The next steps in Go runtime development will focus on throughput.

Their recent Transaction Oriented Collector proposal describes a way to provide cheap allocation and reclamation for memory that is not shared between goroutines (effectively goroutine-private memory). This would reduce the number of full GC cycles and cut the total CPU cycles spent on GC.



Summary:

With today's Go versions, it no longer makes sense to cling to the old notion that Go's GC is not good enough, unless your application is extremely demanding, for example one that cannot tolerate even a 1 ms pause.

Generics are now on the Go development team's agenda, but they are still looking for a more complete solution; perhaps we will see it next year.

Here's wishing the Go language an even better tomorrow!









