The Go blog recently published Richard Hudson's "Getting to Go" talk from ISMM 2018, including the keynote slides and speaker notes, which walk through the considerations behind the Go GC's design and how the context around it evolved. This article summarizes some of that content.
The Go scheduler multiplexes lightweight goroutines onto a limited number of OS threads, and each goroutine has its own stack. A Go program can therefore have thousands of stacks acting as GC roots, all of which must be stopped and traversed at GC safepoints.
Go supports value types, or rather is primarily built on value types. A value type carries no extra object-header overhead; because value types exist, the programmer can control memory layout, and FFI interaction with C/C++ is fast. For the GC, value types allocated on the stack reduce GC pressure. A pointer can refer to a value type allocated on the heap, or to a field inside a value type, and either keeps the whole value alive during GC.
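A small illustration of these points (my own example, not from the talk): a value type has no header, embedding by value gives one contiguous layout, and a pointer to an interior field keeps the enclosing value alive.

```go
package main

import "fmt"

// Point is a plain value type: two words of data, no object header.
type Point struct {
	X, Y int64
}

// Path embeds Points by value: one contiguous block, layout under our control.
type Path struct {
	Start, End Point
}

func main() {
	// A stack-allocated value: no GC pressure (assuming it does not escape).
	p := Point{1, 2}

	// A pointer to a field inside a heap-allocated value: as long as py
	// is live, the GC keeps the whole Path alive.
	path := &Path{Start: Point{0, 0}, End: Point{3, 4}}
	py := &path.End.Y
	*py = 5
	fmt.Println(p, path.End) // {1 2} {3 5}
}
```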
Static AOT compilation. Go is only compiled ahead of time and does not support JIT. The upside is that the compiled binary has predictable, stable execution performance; the downside is that it cannot use runtime feedback to guide optimization the way a JIT does.
Go currently exposes only two knobs for controlling the GC, which reflects the simple tone of the GC, and of Go's language design in general.
The first, available since very early on, sets the ratio of memory newly allocated since the last GC to the memory in use after the last GC; when this ratio is reached, a new GC is triggered. The default is 100. It can be specified via the GOGC environment variable, or at runtime with SetGCPercent.
The other, SetMaxHeap, is currently only used internally. It sets the maximum heap size Go may use, similar to Java's -Xmx. The rationale for adding it: if the GC cannot bring memory usage down, the program should shed load instead of allocating more memory.
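Of the two knobs, only the GOGC ratio is publicly settable today; this sketch uses runtime/debug.SetGCPercent, the programmatic equivalent of the GOGC environment variable (SetMaxHeap has no public API).

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// GOGC=100 (the default): trigger a GC once newly allocated memory
	// reaches 100% of the live heap left behind by the previous cycle.
	old := debug.SetGCPercent(50) // trigger twice as often: at 50% growth
	fmt.Println("previous GOGC value:", old)

	// A negative value disables the GC entirely until it is restored.
	debug.SetGCPercent(old)
}
```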
The Go GC was designed with low latency as its primary goal from the beginning, which turned out to be exactly the right decision. The JVM's collectors are now likewise moving from throughput-oriented GCs to low-latency ones such as G1 and ZGC.
A simple math problem: suppose the system guarantees that 99% of GC pauses are below 10ms, and a user's browser sends 100 requests to the server, say five pages with 20 requests each. Then only about 37% of users (0.99^100 ≈ 0.366) get the full sub-10ms latency on every request. In a service-oriented architecture, the "user" may itself be a server calling other services, which makes the problem even more pronounced.
If you want 99% of users to see latency below 10ms, the system must guarantee that 99.99% of GC pauses are below 10ms. Jeff Dean's 2013 article "The Tail at Scale" discusses this issue, sometimes called the tyranny of the 9s, in detail.
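A quick check of the arithmetic above: the probability that all 100 requests in a session beat the SLO is p^100.

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// 99% per request: only ~36.6% of 100-request sessions are fully fast.
	fmt.Printf("p = 99%%:    %.1f%% of sessions\n", math.Pow(0.99, 100)*100)
	// 99.99% per request: ~99.0% of sessions are fully fast.
	fmt.Printf("p = 99.99%%: %.1f%% of sessions\n", math.Pow(0.9999, 100)*100)
}
```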
The Go GC targets were set in 2014. At the time, the GC implementations of other languages were largely catastrophic in terms of latency, so the goal amounted to digging a very deep hole for themselves to climb out of.
The original plan was a read-barrier-free, concurrent, copying GC, but time was tight and the task heavy, so what was ultimately built is a non-copying GC. Of the three goals of no read-barrier cost, low latency, and memory compaction, the last was dropped. The usual consequence of giving up compaction is memory fragmentation, which slows memory allocation. But C allocators such as TCMalloc, Hoard, and Intel's Scalable Malloc gave the Go team confidence that the GC could avoid moving memory.
Of course, existing low-latency concurrent copying collectors all rely on read barriers, so Go started out with a big ambition.
The write barrier, however, could not be avoided: concurrent marking requires write-barrier support. Because the write barrier is enabled only during GC, its impact on program performance is kept as small as possible.
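As an illustration of the idea, here is a minimal sketch of a Dijkstra-style insertion write barrier. This is not the Go runtime's actual (hybrid) barrier, and all names here are made up: while marking is active, every pointer store also shades the new target so the concurrent marker cannot lose it.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// obj is a toy heap object with a simplified two-color mark state.
type obj struct {
	marked int32 // 0 = white (unseen), 1 = shaded (queued or scanned)
	child  *obj
}

var (
	markingActive bool   // set by the GC while concurrent marking runs
	greyQueue     []*obj // the marker's work queue (not thread-safe; a toy)
)

// shade greys an object so the concurrent marker will scan it.
func shade(o *obj) {
	if o != nil && atomic.CompareAndSwapInt32(&o.marked, 0, 1) {
		greyQueue = append(greyQueue, o)
	}
}

// writePointer stands in for a compiled pointer store: the barrier runs
// only while marking is active, keeping the cost off the common path.
func writePointer(slot **obj, ptr *obj) {
	if markingActive {
		shade(ptr)
	}
	*slot = ptr
}

func main() {
	markingActive = true
	a, b := &obj{}, &obj{}
	writePointer(&a.child, b)
	fmt.Println("grey queue length:", len(greyQueue)) // 1: b was shaded
}
```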
On to the Go GC's memory allocator. Memory is divided into spans, each span serves allocations of only one size, and objects of different sizes are kept in separate spans. The benefits of this design:
- The allocation size within a span is fixed, so given a pointer to a field inside an object, the object's start address can be computed directly (see the sketch after this list).
- Low memory fragmentation: even though the GC does not compact, severe fragmentation does not occur.
- With memory segregated by size, contention during allocation is low, so allocation performance is high.
- Allocation speed: without a compacting GC, Go cannot use fast bump-pointer allocation the way the JVM can, but allocation is still faster than C.
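A toy sketch of the first point (the addresses and sizes are made up for the example, and this is not the runtime's code): because every object in a span has the same size, an interior pointer maps back to its object's base with simple arithmetic.

```go
package main

import "fmt"

// span is a toy version of the idea: a memory region serving one fixed size.
type span struct {
	base     uintptr // start address of the span (made up for the example)
	elemSize uintptr // the single object size this span serves
}

// objectBase recovers the start address of the object containing addr,
// which only works because every object in the span has the same size.
func (s *span) objectBase(addr uintptr) uintptr {
	offset := addr - s.base
	return s.base + offset/s.elemSize*s.elemSize
}

func main() {
	s := &span{base: 0x1000, elemSize: 48}
	interior := s.base + 2*48 + 20                           // points into the third object
	fmt.Printf("object base: %#x\n", s.objectBase(interior)) // 0x1060
}
```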
Separate mark bits record, for each word, the metadata of whether it holds a pointer. This is used both for GC marking and for memory allocation.
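A toy rendering of that idea (not the runtime's actual heap-bitmap layout): one bit per word says whether that word is a pointer, telling the marker which words to trace.

```go
package main

import "fmt"

// ptrBits describes a hypothetical 4-word object: words 1 and 3 are pointers.
const ptrBits = 0b1010

func main() {
	for word := 0; word < 4; word++ {
		isPtr := ptrBits>>word&1 == 1
		fmt.Printf("word %d: pointer=%v\n", word, isPtr)
	}
}
```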
Three successive releases, 1.6, 1.7, and 1.8, dramatically reduced GC latency, from 40ms down to the 1ms level.
The 10ms latency target of the 2014 SLO was reached in 1.6.3. Now, in 2018, there is a new SLO of 500 microseconds of STW time, another deep hole Rick is looking to dig for himself.
Rick also talked about some of the failed experiments, mainly the request-oriented collector (ROC) and the generational GC.
ROC aims to collect the short-lived objects created during a request more efficiently, which fits most request-response style online applications. The idea: when a goroutine dies, only the objects used exclusively by that goroutine are reclaimed; since those objects were never used by other goroutines, no synchronization is needed to free them.
However, this requires the write barrier to stay on at all times, in order to record whether an object is used only by the current goroutine or has been passed to another goroutine, and the write barrier was too slow.
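A rough sketch of the ROC idea as described above (all names hypothetical; this is my reading, not the actual prototype): an always-on barrier marks objects published outside their owning goroutine, so goroutine exit can free only the private ones.

```go
package main

import "fmt"

type gid int64 // goroutine id (hypothetical)

type rocObj struct {
	owner  gid  // goroutine that allocated the object
	shared bool // set by the write barrier once published
}

// rocWriteBarrier runs on every pointer store, always on, which is exactly
// the cost that killed the approach: a store made by a goroutine other
// than the owner marks the object as shared.
func rocWriteBarrier(writer gid, target *rocObj) {
	if target != nil && target.owner != writer {
		target.shared = true
	}
}

// goroutineExit reclaims objects the dying goroutine still owns privately;
// no synchronization is needed because nobody else ever saw them.
func goroutineExit(g gid, allocated []*rocObj) (reclaimed int) {
	for _, o := range allocated {
		if o.owner == g && !o.shared {
			reclaimed++
		}
	}
	return
}

func main() {
	a := &rocObj{owner: 1}
	b := &rocObj{owner: 1}
	rocWriteBarrier(2, b)                          // goroutine 2 touches b: published
	fmt.Println(goroutineExit(1, []*rocObj{a, b})) // 1: only a is reclaimed
}
```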
The next attempt after ROC failed was the long-established, well-understood generational GC. But given the low-latency goal, the GC was not going to copy, which raises a problem: without copying, how do you promote? The workaround was to not separate old and young regions, but instead keep a bit vector recording whether each chunk of memory is old (1) or young (0). At each young GC, chunks reachable from old pointers are marked old, and everything still marked 0 is reclaimed. Allocation then scans this bit vector for the next chunk whose bit is 0, until the next young GC.
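A toy sketch of that scheme (my reading of the description, not the actual prototype): promotion happens in place by flipping a bit, so no copying is needed, and the allocator simply scans forward for bit-0 chunks.

```go
package main

import "fmt"

const chunks = 8

type heap struct {
	old  [chunks]bool // the bit vector: true(1) = old, false(0) = young/free
	next int          // allocation cursor over bit-0 chunks
}

// alloc hands out the next chunk whose bit is 0.
func (h *heap) alloc() int {
	for ; h.next < chunks; h.next++ {
		if !h.old[h.next] {
			i := h.next
			h.next++
			return i
		}
	}
	return -1 // no young space left: a young GC is due
}

// youngGC promotes young chunks reachable from old pointers in place
// (bit 0 -> 1) and reclaims the rest; reclaimed bits stay 0, so the
// allocator reuses them once the cursor resets.
func (h *heap) youngGC(allocated, reachable map[int]bool) (freed []int) {
	for i := 0; i < chunks; i++ {
		if h.old[i] || !allocated[i] {
			continue // old chunks and free chunks are ignored
		}
		if reachable[i] {
			h.old[i] = true // promote in place: no copying
		} else {
			freed = append(freed, i)
		}
	}
	h.next = 0
	return freed
}

func main() {
	h := &heap{}
	a, b := h.alloc(), h.alloc()
	freed := h.youngGC(map[int]bool{a: true, b: true}, map[int]bool{a: true})
	fmt.Println("promoted:", a, "reclaimed:", freed) // promoted: 0 reclaimed: [1]
}
```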
This scheme still requires the write barrier to stay on, although a fast-path write barrier can be used outside of GC. The final performance of the generational GC was still unsatisfactory: the write barrier was fast, but not fast enough.
Another reason the generational GC disappointed: Go is built on value types, and even where pointers are used, an object is allocated on the stack as long as escape analysis finds that it does not escape its scope. As a result, Go's short-lived objects usually live on the stack, which shrinks the benefit of a young GC.
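Two small functions illustrating this (compile with `go build -gcflags=-m` to see the compiler's escape-analysis decisions):

```go
package main

type point struct{ x, y int }

// sum's point stays on the stack: p never escapes the function's scope,
// even though it is manipulated through a pointer.
func sum() int {
	p := &point{1, 2}
	return p.x + p.y
}

// leak's point escapes to the heap: the pointer outlives the function.
func leak() *point {
	return &point{1, 2}
}

func main() {
	_ = sum()
	_ = leak()
}
```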
Card marking was used to eliminate the write barrier outside of GC. The card table is a common optimization in generational GCs that saves write-barrier overhead, at the cost of hashing pointers: each card records a hash of the pointers in its region, so if any pointer changes, the hash changes and the card is considered dirty. On modern hardware with AES (Advanced Encryption Standard) instructions, maintaining such a hash is very fast.
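A simplified sketch of hash-based card marking (using FNV here purely for illustration, rather than the AES instructions the talk mentions; the card size is made up): rehash each card's pointer words and compare against the stored hash to find dirty cards, with no write barrier involved.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

const cardWords = 64 // pointer slots covered by one card (hypothetical)

type card struct {
	words [cardWords]uintptr // the pointer slots this card covers
	hash  uint64             // hash recorded at the last GC
}

// computeHash hashes every pointer word in the card.
func (c *card) computeHash() uint64 {
	h := fnv.New64a()
	var buf [8]byte
	for _, w := range c.words {
		binary.LittleEndian.PutUint64(buf[:], uint64(w))
		h.Write(buf[:])
	}
	return h.Sum64()
}

// dirty reports whether any pointer changed since the hash was recorded.
func (c *card) dirty() bool { return c.computeHash() != c.hash }

func main() {
	var c card
	c.hash = c.computeHash()
	fmt.Println(c.dirty()) // false: nothing changed
	c.words[3] = 0xdeadbeef
	fmt.Println(c.dirty()) // true: the card must be rescanned
}
```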
Performance tests of the card-marking generational GC were still not particularly encouraging, with many factors in play. Perhaps it will end up as an option: enabled for workloads where generational collection proves faster, and turned off otherwise.
Looking at the hardware side: RAM capacity is growing fast and prices are falling fast, so perhaps there is no need to be so fixated on the GC?