JDK ZGC Introduction
Note 1: This article is translated from this article
Objective
ZGC is a new open-source garbage collector developed by Oracle for OpenJDK. It was written mainly by Per Liden. ZGC is similar to Shenandoah or Azul's C4 in that it focuses on reducing pause times while still compacting the heap.
Although I will not give a complete description here, "compacting the heap" simply means moving the objects that are still alive to other areas of the heap. This helps to reduce fragmentation, but it usually also means that the entire application (including all of its threads) has to be paused, which is often referred to as stop-the-world (STW). The application can resume only after the GC has finished.
In GC literature, the application is often called the mutator because, from the GC's point of view, it changes (mutates) the heap. Depending on the size of the heap, such pauses can take several seconds, which may be unacceptable for interactive applications.
There are several ways to reduce the pause time:
- The GC can use multiple threads during compaction (parallel compaction).
- Compaction work can be split across multiple pauses (incremental compaction).
- The heap can be compacted without pausing the application, or with only short pauses (concurrent compaction).
- The heap is not compacted at all (the approach taken by Go's GC).
As mentioned earlier, ZGC performs concurrent compaction. This is certainly not simple to implement, so I want to describe how it works. Why is it complicated?
- An object has to be copied to another memory address while another thread may still be reading or writing the old object.
- Even after the object has been copied successfully, there may still be many references to the old address somewhere in the heap that need to be updated to the new address.
While concurrent compaction seems to be the best of the above options for reducing pause times, there are certainly tradeoffs involved. So if you do not care about pause times, you are probably better off with a GC that focuses on throughput.
GC barriers
The key to understanding how ZGC performs concurrent compaction is the load barrier (often called a read barrier in GC literature). This section gives a brief overview; see the Load barrier section below for details.
If a GC has a load barrier, some extra action has to be performed whenever a reference is read from the heap. In Java, this happens whenever code such as Object x = obj.field is executed.
A GC may also need a write barrier (also called a store barrier) for operations such as obj.field = value. Write barriers are used, for example, in generational GCs and in reference counting.
Both operations are special because they run each time a reference in the heap is read or written. The names load barrier and store barrier are a bit confusing: these GC barriers are a completely different concept from CPU memory barriers.
Reads and writes of the heap are extremely common, so both kinds of GC barrier need to be very efficient, in the common case just a few assembly instructions. Reads are usually an order of magnitude more frequent than writes (although this varies by application), so load barriers are even more performance-sensitive.
For example, generational GCs usually need only a write barrier and no read barrier. ZGC needs a read barrier but no write barrier. For concurrent compaction, I have not seen a solution that works without a read barrier.
It is important to note that even if a GC requires some kind of barrier, the barrier is needed only when reading or writing references in the heap. Reading or writing primitive types such as int or double does not require a barrier.
Pointer tagging (colored pointers)
ZGC stores additional metadata in heap references, which are 64 bits wide on x64 (ZGC currently does not support compressed oops or compressed class pointers). On x64, 48 of the 64 bits are used for virtual memory addresses. Strictly speaking it is only 47 bits, because bit 47 determines the value of bits 48-63 (for our purposes these bits are all 0). ZGC reserves the lowest 42 bits for the actual address of the object (called the offset in the source code). 42-bit addresses give ZGC a theoretical heap size limit of 4TB. The remaining bits are used for these flags: finalizable, remapped, marked1, and marked0 (one bit is reserved for future use). The layout is as follows:
 6                 4 4 4  4 4                                             0
 3                 7 6 5  2 1                                             0
+-------------------+-+----+-----------------------------------------------+
|00000000 00000000 0|0|1111|11 11111111 11111111 11111111 11111111 11111111|
+-------------------+-+----+-----------------------------------------------+
|                   | |    |
|                   | |    * 41-0 Object Offset (42-bits, 4TB address space)
|                   | |
|                   | * 45-42 Metadata Bits (4-bits)  0001 = Marked0
|                   |                                 0010 = Marked1
|                   |                                 0100 = Remapped
|                   |                                 1000 = Finalizable
|                   |
|                   * 46-46 Unused (1-bit, always zero)
|
* 63-47 Fixed (17-bits, always zero)
Having metadata in heap references makes dereferencing more expensive, because the address has to be masked to obtain the real address without the metadata. ZGC uses a neat trick to avoid this cost:
When a reference is read from memory, exactly one of the marked0, marked1, or remapped bits is set.
When allocating a page at offset X, ZGC maps the same page to 3 different addresses:
- For marked0: (0b0001 << 42) | X
- For marked1: (0b0010 << 42) | X
- For remapped: (0b0100 << 42) | X
ZGC therefore reserves 16TB of address space starting at address 4TB (but does not actually use all of this memory), laid out as follows:
  +--------------------------------+ 0x0000140000000000 (20TB)
  |         Remapped View          |
  +--------------------------------+ 0x0000100000000000 (16TB)
  |     (Reserved, but unused)     |
  +--------------------------------+ 0x00000c0000000000 (12TB)
  |          Marked1 View          |
  +--------------------------------+ 0x0000080000000000 (8TB)
  |          Marked0 View          |
  +--------------------------------+ 0x0000040000000000 (4TB)
At any point in time, only one of these three views is in use. For debugging, the unused views can therefore be unmapped to help verify correctness.
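The address arithmetic behind the three views can be sketched in a few lines of Java. This is only an illustration of the bit layout described above, using plain long values in place of real pointers; the class, constant, and method names are hypothetical and are not taken from the ZGC source.

```java
// Illustrative sketch only: names are hypothetical, not ZGC's own.
final class ColoredPointers {
    static final int OFFSET_BITS = 42;                        // bits 41-0 hold the object offset
    static final long OFFSET_MASK = (1L << OFFSET_BITS) - 1;  // low 42 bits set

    // One metadata bit per view, shifted above the offset (bits 45-42).
    static final long MARKED0     = 0b0001L << OFFSET_BITS;   // view starts at 4TB
    static final long MARKED1     = 0b0010L << OFFSET_BITS;   // view starts at 8TB
    static final long REMAPPED    = 0b0100L << OFFSET_BITS;   // view starts at 16TB
    static final long FINALIZABLE = 0b1000L << OFFSET_BITS;

    // Combine a heap offset with one metadata bit to form a colored pointer.
    static long color(long offset, long metadataBit) {
        return metadataBit | (offset & OFFSET_MASK);
    }

    // Strip the metadata bits to recover the plain offset.
    static long offsetOf(long pointer) {
        return pointer & OFFSET_MASK;
    }

    public static void main(String[] args) {
        long x = 0x200000L; // a page that starts 2MB into the heap
        System.out.printf("marked0  view: 0x%016x%n", color(x, MARKED0));
        System.out.printf("marked1  view: 0x%016x%n", color(x, MARKED1));
        System.out.printf("remapped view: 0x%016x%n", color(x, REMAPPED));
    }
}
```

Because all three colored addresses are mapped to the same page, a colored pointer can be dereferenced directly; offsetOf is only needed when the plain offset itself is of interest.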
Pages & Physical & Virtual Memory
Shenandoah divides the heap into a large number of equally sized regions. Objects typically do not span multiple regions, except for large objects that do not fit in a single region; those are allocated in multiple contiguous regions. I like this approach very much because it is so simple.
In this regard, ZGC is very similar to Shenandoah. In ZGC's parlance, regions are called pages.
The main difference from Shenandoah is that pages in ZGC can have different sizes (but always a multiple of 2MB on x64).
ZGC has 3 different page types: small (2MB), medium (32MB), and large (a multiple of 2MB).
Small objects (up to 256KB) are allocated in small pages, and medium-sized objects (up to 4MB) are allocated in medium pages. Objects larger than 4MB are allocated in large pages. A large page can store only one object, whereas small and medium pages can hold multiple objects.
Somewhat confusingly, a large page can actually be smaller than a medium page (for example, the large page for a 6MB object is only 6MB, while a medium page is 32MB).
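The size rules above can be written down as a small Java sketch. It only encodes the thresholds quoted in this section (256KB, 4MB, 2MB granularity); the class and method names are made up for illustration, and the real allocator in ZGC is of course more involved.

```java
// Illustrative sketch only: encodes the size thresholds quoted in the text.
final class PageSizing {
    static final long KB = 1024, MB = 1024 * KB;
    static final long SMALL_PAGE = 2 * MB, MEDIUM_PAGE = 32 * MB;
    static final long SMALL_OBJECT_MAX = 256 * KB, MEDIUM_OBJECT_MAX = 4 * MB;

    enum PageType { SMALL, MEDIUM, LARGE }

    // Pick the page type an object of the given size would be allocated in.
    static PageType pageTypeFor(long objectSize) {
        if (objectSize <= SMALL_OBJECT_MAX) return PageType.SMALL;
        if (objectSize <= MEDIUM_OBJECT_MAX) return PageType.MEDIUM;
        return PageType.LARGE;
    }

    // Size of the page that would hold the object.
    static long pageSizeFor(long objectSize) {
        switch (pageTypeFor(objectSize)) {
            case SMALL:
                return SMALL_PAGE;
            case MEDIUM:
                return MEDIUM_PAGE;
            default:
                // Large page: the object size rounded up to a multiple of 2MB.
                return ((objectSize + SMALL_PAGE - 1) / SMALL_PAGE) * SMALL_PAGE;
        }
    }

    public static void main(String[] args) {
        long size = 6 * MB;
        System.out.println(pageTypeFor(size) + " page of " + pageSizeFor(size) / MB + "MB");
    }
}
```

For a 6MB object this yields a 6MB large page, which is the case where a large page ends up smaller than a 32MB medium page.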
Another nice property of ZGC is that it distinguishes between physical and virtual memory. The idea is that virtual memory is usually plentiful (always 4TB in ZGC), while physical memory is scarcer. Physical memory can grow up to the maximum heap size (set with -Xmx), which is typically much smaller than the 4TB of virtual memory. Allocating a page of a given size in ZGC means allocating both physical and virtual memory. In ZGC, the physical memory does not need to be contiguous; only the virtual memory range does.
Why is that a nice property?
Allocating a contiguous range of virtual memory is easy because we usually have more than enough of it. But it is quite easy to imagine a situation where physical memory has 3 free 2MB pages somewhere, while we need 6MB of contiguous memory for a large object allocation. There is enough free physical memory, but unfortunately it is not contiguous. ZGC can map these non-contiguous physical pages into a single contiguous virtual memory range. If that were not possible, we would have run out of memory (OOM).
Marking and relocating objects
Garbage collection is split into two major phases: marking and relocating (there are actually more than these two phases; see the source code for details).
[Relocating refers to moving objects from one area of memory to another; remapping refers only to updating references from an object's old address to its new address.]
A GC cycle starts with the marking phase, which marks all reachable objects. At the end of this phase, we know which objects are still alive and which are garbage. ZGC stores this information in a live map for each page. A live map is a bitmap that records whether the object at a given index is reachable and/or final-reachable (for objects with a finalize method).
During the marking phase, the load barrier in application threads pushes unmarked references into a thread-local marking buffer. As soon as this buffer is full, the GC threads can take ownership of it and recursively traverse all objects reachable from it. Marking in an application thread just pushes the reference into a buffer; the GC threads are responsible for walking the object graph and updating the live map.
After marking, ZGC needs to relocate all live objects in the relocation set.
The relocation set is the set of pages chosen to be evacuated, for example the pages containing the most garbage. A live object is relocated (moved to its new address) either by a GC thread or by an application thread via the load barrier. ZGC allocates a forwarding table for each page in the relocation set.
The forwarding table is basically a hash map that stores the address an object has been relocated to (if the object has already been relocated).
The advantage of ZGC's approach is that forwarding table space is needed only for the pages in the relocation set.
Shenandoah, by contrast, stores the forwarding pointer in each object itself, which has some memory overhead.
The GC threads walk over the live objects in the relocation set and relocate all objects that have not been relocated yet. An application thread and a GC thread may even try to relocate the same object at the same time; in that case, the first thread to relocate the object wins. ZGC uses an atomic CAS operation to determine the winner.
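The "first relocation wins" race can be sketched in Java as follows. This is a simplified model, not ZGC's implementation: ConcurrentHashMap.putIfAbsent stands in for the atomic CAS on the forwarding table, offsets are plain long values, the copy of the object's contents is omitted, and all names are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only: a per-page forwarding table with a "first writer wins" insert.
final class ForwardingTable {
    // old offset -> new offset, only for objects that have already been relocated
    private final ConcurrentHashMap<Long, Long> entries = new ConcurrentHashMap<>();
    // bump pointer into the (hypothetical) target page
    private final AtomicLong nextFree;

    ForwardingTable(long targetPageStart) {
        nextFree = new AtomicLong(targetPageStart);
    }

    // Returns the object's new offset. May be called concurrently by GC and application threads.
    long relocate(long oldOffset, long size) {
        Long existing = entries.get(oldOffset);
        if (existing != null) {
            return existing;                         // already relocated by another thread
        }
        long candidate = nextFree.getAndAdd(size);   // speculatively reserve space for a copy
        // (copying the object's contents to 'candidate' would happen here)
        Long winner = entries.putIfAbsent(oldOffset, candidate); // atomic: the first insert wins
        return winner != null ? winner : candidate;  // a losing thread abandons its own copy
    }

    // Remapping: look up the new offset, or null if the object has not been relocated.
    Long lookup(long oldOffset) {
        return entries.get(oldOffset);
    }
}
```

Whichever thread inserts its entry first determines the object's new address; every other thread, including later callers, gets that same address back.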
While not in the marking phase, the load barrier relocates or remaps all references loaded from the heap. This ensures that every new reference the mutator sees already points to the newest copy of the object. Remapping a reference means looking up the object's new address in the forwarding table.
The relocation phase is finished as soon as the GC threads have finished walking the relocation set. Although this means that all objects have been relocated, there will generally still be references into the relocation set that need to be remapped to the new addresses. These references are then healed by the load barrier when they are loaded. If such references are not loaded soon enough (that is, the application does not read them), they are fixed up by the next marking phase. This means that marking also needs to consult the forwarding tables in order to remap (but not relocate, since all objects are guaranteed to have been relocated by then) references to their new addresses.
This also explains why there are two marking bits (marked0 and marked1) in an object reference. The marking phase alternates between the marked0 and marked1 bits. After the relocation phase there may still be references that have not been remapped, and these still carry the marking bit of the last GC cycle. If the new marking phase used the same marking bit, the load barrier would wrongly treat such a reference as already marked.
As a consequence, the remapping of one GC cycle overlaps with the marking of the next cycle.
(More details can be found in this slide.)
Load barrier
When reading a reference from the heap, ZGC needs a so-called load barrier (also known as a read barrier). This load barrier has to be inserted every time the Java program accesses a field of object type, such as obj.field. Accesses to fields of primitive type, such as obj.anInt or obj.aDouble, do not need a barrier. ZGC does not need store/write barriers for obj.field = someValue.
Depending on the current phase of the GC (stored in the global variable ZGlobalPhase), the barrier either marks the object or relocates it, if the reference has not already been marked or remapped.
The global variables ZAddressGoodMask and ZAddressBadMask store the masks that determine whether a reference is already considered good (that is, already marked or remapped/relocated) or whether some action is still required. These variables are changed only at the start of the marking phase and at the start of the relocation phase. This table from the ZGC source code gives a good overview of the state these masks can be in:
               GoodMask         BadMask          WeakGoodMask     WeakBadMask
               --------------------------------------------------------------
Marked0        001              110              101              010
Marked1        010              101              110              001
Remapped       100              011              100              011
The assembly code for the barrier can be seen in the MacroAssembler for x64. I will only show some pseudo-assembly code for this barrier:
mov rax, [r10 + some_field_offset]
test rax, [address of ZAddressBadMask]
jnz load_barrier_mark_or_relocate

# otherwise reference in rax is considered good
The first instruction reads a reference from the heap: r10 holds the object reference, and some_field_offset is a constant field offset. The loaded reference is stored in the rax register.
The reference is then tested against the current bad mask (this is just a bitwise AND). No synchronization is needed here, because ZAddressBadMask is only updated while the world is stopped (STW). If the result is non-zero, the barrier has to be executed.
The barrier needs to either mark or relocate the object, depending on which GC phase we are currently in. After this action, it needs to update the reference stored at r10 + some_field_offset with the good reference. This is necessary so that subsequent loads from this field return a good reference.
Since the reference in the heap may have to be updated, two registers are needed: r10 for the address of the object and rax for the loaded reference. The good reference also has to end up in rax, so that execution can continue just as if a good reference had been loaded in the first place.
Because every single reference needs to be marked or relocated, throughput is likely to drop right after the start of a marking or relocation phase. It should recover quickly once most references have been healed.
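To make the fast path and the self-healing step more concrete, here is a toy model of the barrier in Java. It is a sketch under several assumptions: the heap is a long array, references are colored long values as described in the pointer tagging section, the slow path only remaps through a single forwarding map (marking is left out), and every name is hypothetical rather than taken from HotSpot.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: a toy load barrier over a long[] "heap".
final class LoadBarrierSketch {
    static final long MARKED0 = 0b001L << 42, MARKED1 = 0b010L << 42, REMAPPED = 0b100L << 42;
    static final long ALL_COLORS = MARKED0 | MARKED1 | REMAPPED;
    static final long OFFSET_MASK = (1L << 42) - 1;

    // In this sketch the masks are changed only by flipTo, which models a stop-the-world pause.
    static long goodMask = REMAPPED;
    static long badMask = ALL_COLORS & ~goodMask;

    // old offset -> new offset for relocated objects (stand-in for the forwarding tables)
    static final Map<Long, Long> forwarding = new HashMap<>();

    // Switch the good color; the bad mask becomes the other two colors.
    static void flipTo(long newGood) {
        goodMask = newGood;
        badMask = ALL_COLORS & ~newGood;
    }

    // Load the reference stored in heap[slot], running the barrier on it.
    static long loadReference(long[] heap, int slot) {
        long ref = heap[slot];
        if ((ref & badMask) == 0) {
            return ref;                       // fast path: reference is already good (or null)
        }
        // Slow path: remap through the forwarding map and apply the good color.
        long offset = ref & OFFSET_MASK;
        long newOffset = forwarding.getOrDefault(offset, offset);
        long healed = goodMask | newOffset;
        heap[slot] = healed;                  // self-healing: the next load takes the fast path
        return healed;
    }
}
```

With goodMask set to MARKED0, badMask evaluates to the bits 110, which matches the first row of the mask table above.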
Stop-the-world pauses
ZGC does not get rid of stop-the-world pauses completely. The collector needs a pause at mark start, at mark end, and at relocation start. But these pauses are usually quite short, only a few milliseconds.
At mark start, ZGC walks all thread stacks to mark the root set. The root set is the set of references from which traversal of the object graph starts. It usually consists of local and global variables, but also other internal VM structures such as JNI handles.
Another pause is required at mark end. In this pause, the GC needs to empty and traverse all thread-local marking buffers. Because the GC may discover a large unmarked subgraph, this can take longer. ZGC tries to avoid this by ending the mark-end pause after 1 millisecond. It then returns to concurrent marking until the whole object graph has been traversed, after which the mark-end pause can be attempted again.
Relocation start pauses the application once more. This pause is quite similar to mark start, except that it relocates the objects in the root set.
Conclusion
I hope I was able to give a short introduction to ZGC. I certainly could not describe every detail of this GC in a single blog post. If you need more information, ZGC is open source, so you can study the whole implementation.