Disclaimer: This article is a translation of [Erlang Garbage Collection Details and Why It Matters][1], found via Hacker News. It was compiled in the spirit of learning and research; if you reproduce it, please indicate the source.
One of the main problems Erlang set out to solve was providing a platform for highly responsive soft real-time systems. Such systems need a fast garbage collection mechanism that does not stop the system from responding in a timely manner. Garbage collection matters even more when we consider that Erlang is an immutable language with non-destructive updates, which produces garbage at a high rate.
Memory layout
Before delving into the GC, it is worth examining the memory layout of an Erlang process, which is divided into three main parts: the process control block, the stack, and the heap. It is very similar to the memory layout of a UNIX process.
Process Control Block (PCB): The process control block holds information about the process, such as its identifier (PID) in the process table, its current status (running, waiting), its registered name, and its initial and current calls. The PCB also holds pointers to incoming messages, which are members of a linked list stored in the heap.
Stack: A downward-growing memory area that holds incoming and outgoing parameters, return addresses, local variables, and temporary space for evaluating expressions.
Heap: An upward-growing memory area that holds the physical messages of the process mailbox, compound terms such as lists, tuples, and binaries, and objects larger than a single machine word, such as floating-point numbers. Binaries larger than 64 bytes are not stored in the process's private heap. They are called Refc binaries (reference-counted binaries) and are stored in a large shared heap, which is accessible to every process that holds a pointer to the Refc binary. That pointer is called a ProcBin and is stored in the process's private heap.
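The sizes of these areas can be inspected at run time. The following is a minimal sketch, assuming a standard Erlang/OTP installation; the module name mem_layout and the function inspect/0 are made up for illustration, while erlang:process_info/2 and the item names it is given are part of the standard library. Sizes are reported in machine words.

```erlang
-module(mem_layout).
-export([inspect/0]).

%% Return a few memory-related facts about the calling process.
%% heap_size, stack_size and total_heap_size are measured in words;
%% message_queue_len is the number of messages waiting in the mailbox.
inspect() ->
    erlang:process_info(self(), [heap_size,
                                 stack_size,
                                 total_heap_size,
                                 message_queue_len]).
```

Calling mem_layout:inspect() from the shell returns a list of {Item, Value} pairs for the shell process itself.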
The details of the GC
To put it concisely, the current default Erlang GC is a generational copying garbage collector that runs independently inside the private heap of each Erlang process, combined with a reference-counting garbage collector for the global shared heap.
Private Heap GC
The GC of the private heap is generational. A generational GC divides the heap into two segments: a young generation and an old generation. The division rests on the observation that an object which survives a GC cycle is unlikely to become garbage in the near future. The young generation therefore holds newly allocated data, and the old generation holds data that has survived several collections. This separation spares the GC unnecessary cycles over data that has not yet become garbage. The Erlang garbage collector has two strategies: generational (minor) and fullsweep (major). A generational GC collects only the young heap, while a fullsweep collects both the young and the old heap. Now let's review the GC steps in the private heap of a newly started Erlang process:
Scenario 1:
Spawn > No GC > Terminate
GC never occurs in a short-lived process that uses no more heap than min_heap_size and then terminates. In that case, all the memory used by the process is reclaimed at once when it exits.
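Scenario 1 can be encouraged deliberately by giving a short-lived worker a large enough initial heap. The sketch below is only an illustration: the module name short_lived and the value 100000 are arbitrary choices, while spawn_opt/2 with the min_heap_size option (given in machine words) and process_info/2 are standard. The worker reports its own minor_gcs counter just before exiting.

```erlang
-module(short_lived).
-export([run/0]).

%% Spawn a worker with a generous initial heap and report how many
%% generational collections it has counted just before it exits.
run() ->
    Parent = self(),
    Worker = fun() ->
        List = lists:seq(1, 1000),          % a couple of thousand words of data
        {garbage_collection, Info} =
            erlang:process_info(self(), garbage_collection),
        Parent ! {gc_report, self(), length(List),
                  proplists:get_value(minor_gcs, Info)}
    end,
    %% min_heap_size is in machine words; 100000 is an arbitrary value
    %% chosen to be far larger than the worker needs.
    Pid = spawn_opt(Worker, [{min_heap_size, 100000}]),
    receive
        {gc_report, Pid, Len, MinorGcs} -> {Len, MinorGcs}
    after 5000 ->
        timeout
    end.
```

A minor_gcs value of zero is consistent with no collection having run, but the counter is also reset by a fullsweep, so it should be read as an indication rather than proof.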
Scenario 2:
Spawn > Fullsweep > Generational > Terminate
If the data of a newly spawned process grows beyond min_heap_size, a fullsweep GC is used, because no GC has occurred yet and so the heap is not yet separated into young and old generations. After this first fullsweep GC, the heap is divided into young and old segments, and the GC strategy switches to generational and stays there until the process terminates.
Scenario 3:
Spawn > Fullsweep > Generational > Fullsweep > Generational > ... > Terminate
The GC strategy can switch from generational back to fullsweep during the life of a process in several cases. The first is when a certain number of generational GCs have run. This number can be set globally or per process with the fullsweep_after flag. The per-process counter of generational GCs and its upper bound before a fullsweep are minor_gcs and fullsweep_after respectively, and both are visible in the return value of process_info(PID, garbage_collection). The second case is when a generational GC cannot collect enough memory, and the last is when garbage_collect(PID) is called manually. After any of these, the GC strategy reverts from fullsweep to generational and stays there until one of the cases above occurs again.
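The counters and triggers described above can be read and exercised from code. This is a small sketch under stated assumptions: the module gc_tuning and its function names are hypothetical helpers, while erlang:process_info/2, spawn_opt/2 with the fullsweep_after option, and erlang:garbage_collect/1 are standard BIFs.

```erlang
-module(gc_tuning).
-export([gc_info/1, spawn_eager/1, force_fullsweep/1]).

%% Read the generational GC counter and its upper bound for a process.
%% Returns undefined if the process is no longer alive.
gc_info(Pid) ->
    case erlang:process_info(Pid, garbage_collection) of
        {garbage_collection, Info} ->
            {proplists:get_value(minor_gcs, Info),
             proplists:get_value(fullsweep_after, Info)};
        undefined ->
            undefined
    end.

%% Spawn a process whose collections are always fullsweeps
%% (per-process setting; a global default can be set with
%% erlang:system_flag(fullsweep_after, N)).
spawn_eager(Fun) ->
    spawn_opt(Fun, [{fullsweep_after, 0}]).

%% Manually request a collection of the given process.
force_fullsweep(Pid) ->
    erlang:garbage_collect(Pid).
```

For example, gc_tuning:gc_info(self()) in the shell returns a {MinorGcs, FullsweepAfter} pair for the shell process. Setting fullsweep_after to 0 trades extra collection work for lower memory retention, so it is usually reserved for processes known to hold on to large garbage.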
Scenario 4:
Spawn > Fullsweep > Generational > Fullsweep > Increase Heap > Fullsweep > ... > Terminate
If the second fullsweep GC in Scenario 3 does not collect enough memory, the heap size is increased and the GC strategy switches to fullsweep again, just as for a newly spawned process. All four scenarios can then occur again and again.
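Heap growth is easy to observe from inside a process. The following sketch, with a hypothetical module name heap_growth and an arbitrary list length, records total_heap_size (in words) before and after building a large list that stays live.

```erlang
-module(heap_growth).
-export([run/0]).

%% Measure the total heap size of a fresh process before and after it
%% builds a large list that is still referenced at measurement time.
run() ->
    Parent = self(),
    Pid = spawn(fun() ->
        {total_heap_size, Before} =
            erlang:process_info(self(), total_heap_size),
        List = lists:seq(1, 200000),        % stays live until the send below
        {total_heap_size, After} =
            erlang:process_info(self(), total_heap_size),
        Parent ! {sizes, self(), Before, After, length(List)}
    end),
    receive
        {sizes, Pid, B, A, _Len} -> {B, A}
    after 5000 ->
        timeout
    end.
```

Because a list cell takes two words, the second number must be at least 400000, far above the initial heap of a fresh process.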
The question now is why this matters in an automatically garbage-collected language like Erlang. First, this knowledge can help you make your system run faster by tuning when GC occurs and which strategy it uses. Second, it reveals one of the main reasons, from the garbage collection point of view, why Erlang works as a soft real-time platform. Each process has its own private heap and its own GC, so whenever GC occurs in a process it stops only the process being collected and not the others, which is exactly what a soft real-time system needs.
Shared Heap GC
The GC for the shared heap is reference counting. Each object in the shared heap (Refc binary) has a counter of the references held to it by other objects (ProcBins), which are stored inside the private heaps of Erlang processes. When an object's reference counter reaches 0, the object has become inaccessible and is destroyed. Reference counting is cheap; it helps the system avoid unexpectedly long pauses and improves responsiveness. However, being unaware of some well-known anti-patterns when designing your actor-model system can cause trouble in the form of memory leaks.
The first is when a Refc binary is split into a sub-binary. To stay cheap, a sub-binary is not a new copy of the split part of the original binary, just a reference into that part. However, the sub-binary counts as a new reference to the original binary, which causes problems when a large original binary has to hang around only because a small sub-binary of it is still alive.
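The effect can be made visible with binary:referenced_byte_size/1, and avoided with binary:copy/1, both from the standard binary module. The sketch below uses a hypothetical module name subbin and arbitrary sizes (a 1 MiB binary and a 16-byte slice).

```erlang
-module(subbin).
-export([demo/0]).

%% Show how much memory a small sub-binary keeps alive, and how
%% copying it detaches it from the large original binary.
demo() ->
    Big = binary:copy(<<"x">>, 1024 * 1024),        % 1 MiB Refc binary
    <<Head:16/binary, _/binary>> = Big,             % 16-byte sub-binary of Big
    Held = binary:referenced_byte_size(Head),       % bytes Head keeps alive
    Copy = binary:copy(Head),                       % fresh, independent binary
    Detached = binary:referenced_byte_size(Copy),
    {byte_size(Head), Held, Detached}.
```

On a standard emulator this should return {16, 1048576, 16}: the slice is 16 bytes long but references the whole megabyte, while the copy references only itself. Copying costs time and memory up front, so it pays off mainly when a small piece of a large binary is kept for a long time.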
The other known problem arises when a long-lived middleware process acts as a request controller or message router that handles and forwards large Refc binary messages. Each Refc message this process touches has its counter incremented. Collecting those Refc messages therefore depends on collecting all of their ProcBin objects, including the ones inside the middleware process. Unfortunately, because ProcBins are just pointers, they are very cheap, and it can take a long time before a GC is triggered in the middleware process. As a result, the Refc messages remain in the shared heap even after they have been collected from every process other than the middleware.
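One commonly suggested mitigation is to have the long-lived process collect itself from time to time so that its stale ProcBins are dropped. The router below is a hypothetical sketch, not a recommended design: the module name router, the message shapes, and the threshold of 100 forwarded messages are all assumptions made for illustration. Only erlang:garbage_collect/0 is a standard call.

```erlang
-module(router).
-export([start/0]).

%% Start a router that forwards binary payloads and periodically
%% collects its own heap to release references to Refc binaries.
start() ->
    spawn(fun() -> loop(0) end).

loop(Count) ->
    receive
        {route, To, Payload} when is_pid(To), is_binary(Payload) ->
            To ! {payload, Payload},
            maybe_collect(Count + 1);
        stop ->
            ok
    end.

%% After every 100 forwarded messages (an arbitrary threshold),
%% force a collection of this process and reset the counter.
maybe_collect(Count) when Count >= 100 ->
    erlang:garbage_collect(),
    loop(0);
maybe_collect(Count) ->
    loop(Count).
```

A message such as {route, Pid, Binary} sent to the router is forwarded to Pid as {payload, Binary}. The threshold trades GC overhead in the router against how long unused binaries linger in the shared heap.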
The shared heap matters because it reduces the IO of passing large binary messages between processes. Sub-binaries are also very fast to create, since they are just pointers into another binary. But as a rule of thumb, shortcuts taken for speed have a cost, and here the cost is having to architect your system so that it does not get trapped in the bad conditions above. There are also well-known architectural patterns for dealing with Refc binary leaks, which Fred Hebert explains in his free ebook, Erlang in Anger. I don't think I could explain them better than he does, so I strongly recommend reading it.
Summary:
Even when we use a language that manages memory for us, like Erlang, it is still necessary to understand how memory is allocated and released. The Go memory model documentation advises: "If you must read the rest of this document to understand the behavior of your program, you are being too clever. Don't be clever." Unlike that advice, I believe we have to be clever enough to make our systems faster and safer, and that is not possible without understanding how things work underneath.
Resources:
Academic and Historical Questions about Erlang
Implementation of FPL & Concurrency
Efficient Memory Management for Message-Passing Concurrency Paper
Programming the Parallel World by Erlang Paper
Some analysis of Erlang memory leak problems can be found in an earlier article on Erlang memory leak analysis (prior to Inyumba).
If you have any questions, you are welcome to leave a message to discuss them.