Real-time GC: theory and practice in the Go language


The Go language can support a real-time, highly concurrent messaging system: in a message system handling millions of messages, latency can be kept below 100 ms, and a large part of that is attributable to Go's efficient garbage collector.

For real-time systems, garbage collection can be a huge risk, because the whole program may need to pause during collection. So when we designed our message bus system, we had to choose our language carefully. Go has long emphasized its low latency, but has it really achieved it? If so, how?

In this article, we will look at how Go's GC is implemented (the tri-color mark-and-sweep algorithm), why this approach can achieve such low GC pauses, and, most importantly, whether it really works (benchmarking these GC pauses and comparing the results with other languages).

From Haskell to Go

We will use a pub/sub message bus as the example to illustrate the problem; such systems hold messages in memory as they are published. In the early days we implemented the first version of the messaging system in Haskell, but after finding that GHC's garbage collector had fundamental latency problems, we abandoned that implementation and moved to Go.

Here are some implementation details of our Haskell messaging system. The crucial point about GHC is that its GC pause time is proportional to the size of the current working set (that is, GC time scales with the number of objects held in memory). In our case the number of in-memory objects is often very large, so GC pauses frequently reached hundreds of milliseconds, blocking the entire system during collection.

Unlike GHC's stop-the-world collector, Go's garbage collector runs concurrently with the main program, which avoids long pauses. We were drawn by Go's promised low latency and wanted to know whether the latency improvements claimed in each new release really hold up.

How does parallel garbage collection work?

How is Go's GC able to run in parallel with the program? The key is the tri-color mark-and-sweep algorithm, which makes the system's GC pause time predictable: the scheduler can run GC work in very short slices with minimal impact on the program. Let's look at how the tri-color mark-and-sweep algorithm works.
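The walkthrough below manipulates a singly linked list, but the original never shows the node type. A minimal definition (our assumption, not from the article) under which the snippets compile might be:

```go
package main

import "fmt"

// LinkedListNode is the node type assumed by the snippets below;
// only the next pointer matters for the GC discussion, so the
// payload is omitted.
type LinkedListNode struct {
	next *LinkedListNode
}

func main() {
	var A, B LinkedListNode
	B.next = &LinkedListNode{next: nil} // B -> C
	A.next = &LinkedListNode{next: nil} // A -> D
	fmt.Println(B.next != nil, A.next != nil) // → true true
}
```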

Let's say we have this kind of code for a list operation:

var A LinkedListNode
var B LinkedListNode
// ...
B.next = &LinkedListNode{next: nil}
// ...
A.next = &LinkedListNode{next: nil}
B.next.next = &LinkedListNode{next: nil}
B.next = B.next.next
B.next = nil

Step One

var A LinkedListNode
var B LinkedListNode
// ...
B.next = &LinkedListNode{next: nil}

Assume we start with three nodes A, B, and C. A and B are root nodes and therefore always reachable, and the assignment B.next = &C has just happened. The garbage collector maintains three sets: black, grey, and white. Since the collector is not yet running, all three nodes are in the white set.

Step Two

Next we create a new node D and assign it to A.next:

var A LinkedListNode
var B LinkedListNode
// ...
B.next = &LinkedListNode{next: nil}
// ...
A.next = &LinkedListNode{next: nil}

Note that, as a newly allocated object, D is placed in the grey set. Why grey? The rule is: whenever a pointer field changes, the object now being pointed to must change color. Since every new object has its address assigned to some reference, new objects immediately become grey. (One might ask: why is C not grey? Because C was assigned before the collector started running, so no coloring rule applied at that time.)
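This pointer-update rule is what is known as a write barrier. A hypothetical sketch of the idea (the names and structure here are our own illustration, not the Go runtime's actual implementation):

```go
package main

import "fmt"

type color int

const (
	white color = iota
	grey
	black
)

type object struct {
	c    color
	next *object
}

var gcRunning = true

// writeNext is a toy write barrier: whenever a pointer field is
// updated while the collector runs, the newly referenced object
// is shaded grey so the collector cannot miss it.
func writeNext(slot *object, target *object) {
	if gcRunning && target != nil && target.c == white {
		target.c = grey
	}
	slot.next = target
}

func main() {
	d := &object{c: white}
	a := &object{c: black}
	writeNext(a, d)          // A.next = &D
	fmt.Println(d.c == grey) // → true
}
```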

Step Three

When GC begins, the root nodes are moved into the grey set, so A, B, and D are now all grey. The program's goroutines (not OS processes, and not exactly OS threads either; Go multiplexes them onto threads) run either normal program logic or GC work, and since the GC runs concurrently with the program logic, the two alternate in their use of CPU resources.

Step Four: Scan memory objects

When a grey object is scanned, the collector marks it black and then marks its child objects grey. At any stage we can bound the number of moves the collector still has to perform: at most 2*|white| + |grey| moves remain, since each object can move at most twice (white to grey, then grey to black). The collector keeps scanning until the grey set is empty.
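To make the bound concrete, here is a toy simulation of the tri-color mark loop (a sketch under our own simplified model, not the runtime's actual code): each scan blackens one grey object and greys its white children, and the total number of set moves never exceeds 2*|white| + |grey|.

```go
package main

import "fmt"

type color int

const (
	white color = iota
	grey
	black
)

type node struct {
	c        color
	children []*node
}

// mark runs the tri-color mark loop from the given roots and
// returns the number of set moves (white->grey and grey->black)
// that were performed.
func mark(roots []*node) int {
	moves := 0
	var greySet []*node
	for _, r := range roots {
		if r.c == white {
			r.c = grey
			greySet = append(greySet, r)
			moves++
		}
	}
	for len(greySet) > 0 {
		n := greySet[len(greySet)-1]
		greySet = greySet[:len(greySet)-1]
		for _, ch := range n.children {
			if ch.c == white {
				ch.c = grey
				greySet = append(greySet, ch)
				moves++
			}
		}
		n.c = black
		moves++
	}
	return moves
}

func main() {
	// A -> D and B -> C, mirroring the article's example.
	d := &node{}
	a := &node{children: []*node{d}}
	c := &node{}
	b := &node{children: []*node{c}}
	whiteCount, greyCount := 4, 0 // all four nodes start white
	moves := mark([]*node{a, b})
	fmt.Println(moves <= 2*whiteCount+greyCount) // → true
}
```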

Step Five

Meanwhile the program logic allocates a new object E and assigns it to C.next; the code is as follows:

var A LinkedListNode
var B LinkedListNode
// ...
B.next = &LinkedListNode{next: nil}
// ...
A.next = &LinkedListNode{next: nil}
// New assignment: C.next = &E
B.next.next = &LinkedListNode{next: nil}

Following our earlier rule, the new object is placed in the grey set:

This means the collector has more work to do, but when many objects are being allocated it also delays the final sweep. Note that the white set shrinks as marking proceeds, until the collector actually sweeps the heap and newly allocated objects refill it.

Step Six: Pointer reassignment

The program logic now assigns B.next.next to B.next, i.e. E is assigned to B.next. The code is as follows:

var A LinkedListNode
var B LinkedListNode
// ...
B.next = &LinkedListNode{next: nil}
// ...
A.next = &LinkedListNode{next: nil}
B.next.next = &LinkedListNode{next: nil}
// Pointer reassignment:
B.next = B.next.next

After this, C becomes unreachable.

This means C will remain in the white set, and the collector will reclaim its memory at the end of this GC cycle.

Step Seven

Grey objects with no remaining references to scan are moved to the black set. Here D references nothing else in the grey set, and the object that references it, A, is already black, so D is moved to the black set.

Step Eight

The program logic now sets B.next to nil, at which point E becomes unreachable. But E is already in the grey set, so it will not be collected this cycle. Does this cause a memory leak? No: E will be collected in the next GC cycle, and the tri-color algorithm guarantees this property: if an object is unreachable at the start of a GC cycle, it will be reclaimed by the end of that cycle.

Step Nine

In the next round of scanning, E is moved into the black set. C does not change, because C references E rather than E referencing C, so scanning E does not mark C.

Step Ten

The collector then scans B, the last object in the grey set, and moves it to the black set.

Step Eleven: Reclaim the white set

Now that the grey set is empty, the objects in the white set can be reclaimed. At this stage the collector knows that every white object is unreferenced and unreachable, so each is collected as garbage. Note that E is not reclaimed at this stage: it only became unreachable during this cycle, so it will be reclaimed in the next one.

Step Twelve: Swap the set colors

This step is the most interesting: for the next GC cycle there is no need to move every object back into the white set. Instead, the collector simply swaps the meanings of the black and white sets. Simple and efficient.
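The swap can be done by flipping what the color bit means rather than touching any object. A minimal sketch of the idea (our illustration, not the runtime's actual representation):

```go
package main

import "fmt"

// Each object stores a color bit; whether that bit means "black"
// is decided by a global flag that flips once per GC cycle.
type object struct {
	colorBit bool
}

var blackMeans = true // which bit value currently means black

func isBlack(o *object) bool { return o.colorBit == blackMeans }

// swapColors makes every black object white (and vice versa)
// in O(1), without visiting a single object.
func swapColors() { blackMeans = !blackMeans }

func main() {
	o := &object{colorBit: true} // marked black this cycle
	fmt.Println(isBlack(o))      // → true
	swapColors()                 // start of the next cycle
	fmt.Println(isBlack(o))      // → false: now white
}
```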

Summary of the tri-color GC algorithm

There are some details of the tri-color mark-and-sweep algorithm we have glossed over. Under the current implementation, two phases still stop the world: the stack scan of the root objects, and the mark termination pause at the end of the marking phase. Encouragingly, the mark termination pause is expected to be removed. In practice we found that this algorithm achieves GC pauses of under 1 ms even when collecting very large heaps.

Latency vs. Throughput

If a concurrent GC can achieve such low latency on very large heaps, why does anyone still use a stop-the-world collector? Isn't Go's GC good enough?

Not necessarily, because low latency has a cost. The main cost is reduced throughput: concurrency requires extra synchronization and copying work, which steals time from the program's own logic. Haskell's GHC is optimized for throughput, while Go focuses on latency; we need to choose the language that fits our own needs, and for a push system with hard real-time requirements, choosing Go is the right trade-off.

Actual performance

For now, Go seems able to meet the requirements of a low-latency system, but how does it perform in practice? We compared implementations of the same benchmark logic: the benchmark continuously pushes messages into a fixed-size buffer while old messages constantly expire and become garbage. This keeps the heap large, which matters because the entire heap must be scanned during collection to determine whether objects are still referenced. It is also why a GC's running time is proportional to the number of live objects and pointers.

Here is the Go version of the benchmark, in which the buffer is implemented as an array:

package main

import (
    "fmt"
    "time"
)

const (
    windowSize = 200000
    msgCount   = 1000000
)

type (
    message []byte
    buffer  [windowSize]message
)

var worst time.Duration

func mkMessage(n int) message {
    m := make(message, 1024)
    for i := range m {
        m[i] = byte(n)
    }
    return m
}

func pushMsg(b *buffer, highID int) {
    start := time.Now()
    m := mkMessage(highID)
    (*b)[highID%windowSize] = m
    elapsed := time.Since(start)
    if elapsed > worst {
        worst = elapsed
    }
}

func main() {
    var b buffer
    for i := 0; i < msgCount; i++ {
        pushMsg(&b, i)
    }
    fmt.Println("Worst push time: ", worst)
}

The same logic was implemented in other languages (Haskell/OCaml/Racket by Gabriel Scherer, Java by Santeri Hiltunen). The results under the same test conditions were as follows:

Benchmark Longest Pause (ms)
OCaml 4.03.0 (map based) (manual timing) 2.21
Haskell/GHC 8.0.1 (map based) (RTS timing) 67.00
Haskell/GHC 8.0.1 (array based) (RTS timing) 58.60
Racket 6.6 experimental incremental GC (map based) (tuned) (RTS timing) 144.21
Racket 6.6 experimental incremental GC (map based) (untuned) (RTS timing) 124.14
Racket 6.6 (map based) (tuned) (RTS timing) 113.52
Racket 6.6 (map based) (untuned) (RTS timing) 136.76
Go 1.7.3 (array based) (manual timing) 7.01
Go 1.7.3 (map based) (manual timing) 37.67
Go HEAD (map based) (manual timing) 7.81
Java 1.8.0_102 (map based) (RTS timing) 161.55
Java 1.8.0_102 G1 GC (map based) (RTS timing) 153.89

Surprisingly, Java performed poorly while OCaml performed very well: OCaml achieved GC pauses of about 3 ms, because it uses an incremental GC algorithm. (The reason we do not use OCaml for our real-time system is its poor support for multicore.)

As the table shows, Go's GC pauses of around 7 ms are entirely sufficient for our requirements.

Some caveats

    1. Benchmarks should be treated with caution, because different runtimes are optimized for different use cases, so performance varies. Write test cases that match your own product requirements and benchmark against those. As the example above shows, Go fully meets our product's needs.
    2. Map vs. array: our benchmark originally inserted into and deleted from a map, but Go had a GC bug with large maps, so we designed the Go benchmark around a mutable array as an alternative. The large-map bug was fixed in Go 1.8, but not all benchmarks have been re-run against it, which is something still to address. In any case, there is no fundamental reason why using a map should greatly increase GC time (bugs and poor implementations aside).
    3. Manual timing vs. RTS timing: another consideration is that some benchmarks vary under different timing systems, because some languages do not expose runtime pause statistics (such as Go, in this benchmark) while others do. Therefore, for comparison, we fall back to manual timing in the tests.
    4. Finally, the implementation of the test case itself strongly affects the benchmark result: a poor implementation of map insertion and deletion will skew the numbers, which is another reason for using an array.

Why can't Go's results be better?

Although our Go implementations (with the map bug fixed, or using an array) achieve ~7 ms GC pauses, which is good, Go's official "1.5 Garbage Benchmark Latency" slides (https://talks.golang.org/2015...) report GC pause latencies of ~1 ms on a 200 MB heap (GC pause time should depend on the number of pointer references rather than the raw heap capacity, but we could not obtain exact data). The Twitch team also published an article saying they achieved GC pauses of about 1 ms on Go 1.7.

After asking on the go-nuts mailing list, the answer was that the pauses we observed were probably caused by an as-yet-unfixed bug: idle mark workers can block the program logic. To pin down the problem, I used the go tool trace visualizer to inspect Go's runtime behavior.

Sure enough, the trace showed background mark workers running for nearly 12 ms across all processors (CPU cores), which convinced me that the pauses were caused by the bug above.

Summary

The key finding of this investigation is that a GC is designed either for low latency or for high throughput. Which is better depends on how your program uses heap space (how many objects are there? how long does each object live?).

It is important to understand whether the underlying GC algorithm suits your use case. Of course, the quality of the actual GC implementation also matters greatly. The memory footprint of your benchmark program should resemble that of the real program you intend to build, in order to verify in practice that the GC handles your workload efficiently. As noted above, Go's GC is not perfect, but it is acceptable for our system.

Despite some problems, Go's GC performance is better than that of most comparable garbage-collected languages, and the Go team continues to optimize GC latency. Go's GC lives up to its claims, both in theory and in practice.

Ref: Golang's Real-time GC in Theory and Practice (en)
