90-New garbage collection mechanism description
PHP 5.2 and earlier have no dedicated garbage collector (GC). The engine decides whether a variable's memory can be released purely from the refcount of its zval: if the refcount is 0 the memory is freed, otherwise it is kept. This is a very simple form of GC. Under this scheme, however, memory can leak in ways the engine cannot recover from (bug: http://bugs.php.net/bug.php?id=33595). PHP 5.3 therefore introduced a new GC with a dedicated mechanism for cleaning up this garbage and preventing such leaks. This article explains how the new GC in PHP 5.3 works.
At present there is very little detailed material on the new GC, and this article aims to be the most detailed source-level explanation of the PHP 5.3 GC. The sections on how garbage is produced and on the algorithm are my translation of the manual, mixed with some of my own views. Related content in the manual: Garbage Collection
What is garbage?
First we need to define "garbage". The garbage the new GC is responsible for is a variable container (zval) that still exists although no variable name points to it any more. So the key test the GC applies is whether any variable name still points to the zval.
Suppose some PHP code uses a temporary variable $tmp to hold a string, and once the string has been processed $tmp is no longer needed. To us, $tmp is "garbage". To the GC it is not: the variable is meaningless to us, but it still exists and the symbol $tmp still points to its zval. The GC assumes the variable may still be used later in the code, so it does not treat it as garbage.
What if we call unset on $tmp once we are done with it? Does $tmp become garbage then? It still does not. After unset, the refcount drops by 1 to 0 (assuming no other variable points to the same zval), and the engine immediately frees the zval's memory, so the zval no longer exists at all. That is not the kind of "junk" the new GC deals with either. So what kind of garbage does the new GC handle? Let's produce some.
How stubborn garbage is produced
If you have read the material on how variables are stored internally, the refcount and is_ref fields of a zval will already be familiar. Here we use an example from the manual to show how garbage is produced:
<?php
$a = "new string";
?>
In this simple code, the information stored internally for $a is:
a: (refcount=1, is_ref=0)='new string'
When $a is assigned to another variable, the refcount of the zval that $a points to is increased by 1.
<?php
$a = "new string";
$b = $a;
?>
The internal information for $a and $b is now:
a,b: (refcount=2, is_ref=0)='new string'
When we remove $b with unset, the refcount of that zval is reduced by 1:
<?php
$a = "new string"; // a: (refcount=1, is_ref=0)='new string'
$b = $a;           // a,b: (refcount=2, is_ref=0)='new string'
unset($b);         // a: (refcount=1, is_ref=0)='new string'
?>
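The counting described above can be observed indirectly with nothing more than memory_get_usage(). The following is a minimal sketch, not taken from the manual: the 1 MiB payload size and the variable names are arbitrary choices made for illustration, and the exact byte counts will differ between PHP versions.

```php
<?php
$size = 1048576; // 1 MiB payload (assumption: large enough to stand out from bookkeeping noise)

$base = memory_get_usage();

$a = str_repeat('x', $size);   // one value, one name pointing to it
$withA = memory_get_usage();

$b = $a;                       // a second name for the same value; nothing is copied
$withBoth = memory_get_usage();

unset($b);                     // one name removed; the value is still referenced by $a
$afterUnsetB = memory_get_usage();

unset($a);                     // last name removed; the memory is released immediately
$afterUnsetA = memory_get_usage();

printf("after creating \$a : %d bytes above base\n", $withA - $base);
printf("after \$b = \$a    : %d bytes above base\n", $withBoth - $base);
printf("after unset(\$b)  : %d bytes above base\n", $afterUnsetB - $base);
printf("after unset(\$a)  : %d bytes above base\n", $afterUnsetA - $base);
```

The second and third figures should stay close to the first, because the assignment only adds a reference, and the last figure should fall back by roughly the payload size, because no collector is needed once the refcount reaches 0.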
For ordinary variables everything behaves as expected, but with compound types (arrays and objects) something more interesting happens:
<?php
$a = array('meaning' => 'life', 'number' => 42);
?>
The internal information stored for $a is:
a: (refcount=1, is_ref=0)=array (
   'meaning' => (refcount=1, is_ref=0)='life',
   'number' => (refcount=1, is_ref=0)=42
)
Inside the engine the array variable $a is a hash table holding two zval items, meaning and number. That one line of code therefore involves 3 zvals in total, and all 3 follow the reference and counting rules described above.
[Figure: the array zval and its two element zvals]
Next we add an element to $a and assign it the value of an existing element:
<?php
$a = array('meaning' => 'life', 'number' => 42);
$a['life'] = $a['meaning'];
?>
The internal storage for $a is then:
a: (refcount=1, is_ref=0)=array (
   'meaning' => (refcount=2, is_ref=0)='life',
   'number' => (refcount=1, is_ref=0)=42,
   'life' => (refcount=2, is_ref=0)='life'
)
The meaning and life elements point to the same zval.
[Figure: the meaning and life elements sharing one zval]
Now, if we assign a reference to the array itself to one of its own elements, things get interesting:
<?php
$a = array('one');
$a[] = &$a;
?>
The array $a now has two elements: index 0 holds the string one, and index 1 is a reference to $a itself. Internally it is stored as follows:
a: (refcount=2, is_ref=1)=array (
   0 => (refcount=1, is_ref=0)='one',
   1 => (refcount=2, is_ref=1)=...
)
The "..." means that element 1 points to a itself, which is a circular reference.
[Figure: the array zval with element 1 pointing back to itself]
If we now unset $a, $a is removed from the symbol table and the refcount of the zval it pointed to is reduced by 1:
<?php
$a = array('one');
$a[] = &$a;
unset($a);
?>
This is where the problem appears. $a is no longer in the symbol table, so user code can never reach the variable again, yet the refcount of the zval it pointed to has dropped to 1, not 0, so it cannot be reclaimed and the memory leaks.
[Figure: the array zval after unset, kept alive only by its own element 1]
A zval like this is garbage in the true sense, and cleaning it up is the job of the new GC.
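The leak can be made visible from user code with a small sketch. It assumes PHP 5.3 or later. Probe is a hypothetical helper class introduced only for this illustration: its destructor counts how many probes the engine has actually destroyed. gc_enable() and gc_collect_cycles() are PHP's built-in functions for switching the collector on and forcing a collection run.

```php
<?php
// Hypothetical helper: the static counter records when a Probe is really destroyed.
class Probe
{
    public static $destroyed = 0;

    public function __destruct()
    {
        self::$destroyed++;
    }
}

gc_enable(); // make sure the cycle collector is switched on

// Same shape as the example above, with a Probe stored in element 0.
$a = array(new Probe());
$a[] = &$a;  // element 1 now refers back to the array itself
unset($a);   // the name is gone, but the self-reference keeps the refcount above 0

$beforeCollect = Probe::$destroyed; // the unreachable array, and the Probe in it, still exist

gc_collect_cycles();                // force a run of the cycle collector

$afterCollect = Probe::$destroyed;

echo "probes destroyed before collection: ", $beforeCollect, "\n";
echo "probes destroyed after collection:  ", $afterCollect, "\n";
```

With plain reference counting the Probe would never be destroyed before the script ends. Here the counter should change only once the collector has run, which is the behaviour the rest of this article explains.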
A new GC algorithm
The new GC was created to deal with this kind of garbage.
PHP 5.3 uses a dedicated GC mechanism to clean it up. Earlier versions had no such mechanism, so once this garbage was produced it could not be cleaned up and the memory was wasted. In the PHP 5.3 source, the new GC is implemented in {phpsrc}/Zend/zend_gc.h and {phpsrc}/Zend/zend_gc.c. We first outline the idea behind the algorithm and then look at how the engine implements it in the source.
The newer PHP manual gives a short introduction to the algorithm the new GC uses, which comes from the paper "Concurrent Cycle Collection in Reference Counted Systems". The paper is not covered in detail here; following the manual, this is the basic idea.
First, a few basic rules:
- If a zval's refcount is increased, the zval is still in use and is not garbage.
- If a zval's refcount is decreased to 0, the zval can be freed and is not garbage.
- If a zval's refcount is decreased and is still greater than 0, the zval cannot be freed and may be garbage.
Only in the third case does the GC collect the zval and apply the new algorithm to decide whether it is garbage. So how does it decide whether such a variable really is garbage?
In short: subtract 1 from the refcount of every element contained in the zval; if the zval's own refcount is 0 afterwards, the zval is garbage. The principle looks simple but is not easy to grasp. I did not understand what the original text meant until I had read the source. If it is not clear yet, it is explained in detail below. First, the algorithm in steps, as shown in a figure from the manual:
[Figure: steps A to D of the garbage collection algorithm, from the manual]
- A: To avoid running the garbage check on every single refcount decrement, the algorithm first puts every zval that falls under rule 3 into a node buffer (the root buffer) and marks these zval nodes purple. It also ensures that each zval node appears in the buffer only once. When the buffer is full, the GC starts its garbage analysis on the zval nodes in the buffer.
- B: Once the buffer is full, the algorithm walks each node depth first and subtracts 1 from the refcount of every zval the node contains. To make sure the same zval is never decremented twice, a zval is marked gray as soon as it has been decremented. Note that at the start of this step the node zval itself is not decremented; it is decremented only if a zval it contains points back to it (a circular reference).
- C: The algorithm walks each node depth first again and checks the refcount of every zval it contains. If the refcount is 0, the zval is marked white (garbage). If the refcount is greater than 0, 1 is added back to the refcount of that zval and of the zvals it contains, which restores the non-garbage zvals, and their colour is set back to black (the default colour of a zval).
- D: The algorithm walks the zval nodes once more and frees the zvals that were marked white in step C.
These four steps, A to D, are how the manual presents the algorithm. The principle is still not easy to see from them. What does the method actually mean? My own understanding is as follows.
Take the zval of the array $a that turned into garbage earlier and call it zval_a. If unset has not been executed, the refcount of zval_a is 2: it is pointed to by $a and by index 1 inside $a. The algorithm subtracts 1 from the refcount of the zval of every element of the array (index 0 and index 1). Index 1 corresponds to zval_a, so the refcount of zval_a becomes 1, and zval_a is not garbage. If unset has been executed, the refcount of zval_a is 1, held only by index 1 inside zval_a. After subtracting 1 for every element of the array, the refcount of zval_a becomes 0, so zval_a is found to be garbage. This is how the algorithm finds stubborn garbage.
This example should make the idea clear:
For an array that contains a circular reference, subtract 1 from the refcount of the zval of each element it contains; if the refcount of the array's own zval is then 0, the array is garbage.
The reasoning is simple. Suppose the refcount of array a is m and n of the elements in a point to a itself. The algorithm computes m - n. If m equals n, then m - n = 0 and a is garbage; if m > n, then m - n > 0 and a is not garbage.
What does m = n mean? It means every reference counted in a's refcount comes from zval elements contained in a itself, so no variable outside a points to it. User code can no longer reach the zval of a, which means a is leaked memory, and the GC reclaims it.
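Steps B to D and the m - n argument can be tried out on a toy model. The sketch below is not the engine's implementation: Node, link_nodes, mark_gray, scan, scan_black, collect_white, collect_cycles and build_example are all hypothetical names, a Node merely stands in for a zval, and the purple marking and fixed-size buffer of step A are reduced to a plain array of roots.

```php
<?php
// Toy model only: a Node stands in for a zval.
class Node
{
    public $name;
    public $refcount = 0;
    public $color = 'black';     // black is the default colour
    public $children = array();  // the zvals this one contains
    public function __construct($name) { $this->name = $name; }
}

// $from contains $to, which costs $to one reference.
function link_nodes(Node $from, Node $to)
{
    $from->children[] = $to;
    $to->refcount++;
}

// Step B: depth first, subtract 1 for every contained zval; gray marks visited nodes.
function mark_gray(Node $node)
{
    if ($node->color === 'gray') { return; }
    $node->color = 'gray';
    foreach ($node->children as $child) {
        $child->refcount--;
        mark_gray($child);
    }
}

// Step C, live branch: give back the subtracted references and restore black.
function scan_black(Node $node)
{
    $node->color = 'black';
    foreach ($node->children as $child) {
        $child->refcount++;
        if ($child->color !== 'black') { scan_black($child); }
    }
}

// Step C: a refcount of 0 means all references were internal, so mark white.
function scan(Node $node)
{
    if ($node->color !== 'gray') { return; }
    if ($node->refcount > 0) {
        scan_black($node);
    } else {
        $node->color = 'white';
        foreach ($node->children as $child) { scan($child); }
    }
}

// Step D: gather every white node reachable from $node.
function collect_white(Node $node, array &$freed)
{
    if ($node->color !== 'white') { return; }
    $node->color = 'black';
    $freed[] = $node->name;
    foreach ($node->children as $child) { collect_white($child, $freed); }
}

// Run steps B to D over the buffered roots and return the names judged to be garbage.
function collect_cycles(array $roots)
{
    foreach ($roots as $root) { mark_gray($root); }
    foreach ($roots as $root) { scan($root); }
    $freed = array();
    foreach ($roots as $root) { collect_white($root, $freed); }
    return $freed;
}

// The article's array: element 0 is 'one', element 1 points back to the array.
function build_example($stillNamed)
{
    $a = new Node('zval_a');
    link_nodes($a, new Node('one'));
    link_nodes($a, $a);
    $a->refcount += $stillNamed ? 1 : 0; // the symbol $a, if it still exists
    return $a;
}

$live = build_example(true);
$liveResult = collect_cycles(array($live));
$garbageResult = collect_cycles(array(build_example(false)));

echo 'garbage while $a is still set: ', implode(', ', $liveResult), "\n";
echo 'garbage after unset($a):       ', implode(', ', $garbageResult), "\n";
```

In the first case m = 2 and n = 1, so zval_a keeps a positive count, is restored to black, and nothing is reported. In the second case m = n = 1, so zval_a and the string it contains are both reported as garbage.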
In PHP the GC is on by default, and it can be switched on or off with the zend.enable_gc setting in the INI file. When the GC is on, the garbage analysis starts once the node buffer (root buffer) is full. By default the buffer holds 10,000 nodes; this can be changed by modifying GC_ROOT_BUFFER_MAX_ENTRIES in Zend/zend_gc.c, which requires recompiling PHP. When the GC is off, the analysis never runs, but candidate nodes are still placed in the buffer. If the buffer is already full, new nodes are not recorded, and nodes that were never recorded will never be analysed. If such nodes contain circular references, memory can leak. Nodes are still recorded while the GC is off because simply recording them is cheaper than checking whether the GC is on every time a node is produced, and because the GC can be switched on from within a script: if it is switched on at some point during the run, the recorded nodes can then be analysed. The garbage analysis itself is, of course, a fairly time-consuming operation.
In PHP code the GC can be switched on and off with gc_enable() and gc_disable(), and the garbage analysis can be forced before the node buffer is full by calling gc_collect_cycles(). This lets a program turn the GC off or on around particular sections, or force the analysis at a chosen point.
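A minimal sketch of that pattern follows, assuming PHP 5.3 or later. build_cycles is a hypothetical stand-in for a time-critical piece of work that happens to create circular references, and the count of 1000 is arbitrary; only gc_disable(), gc_enable(), gc_enabled() and gc_collect_cycles() are built-in functions.

```php
<?php
// Hypothetical workload: every iteration leaves one self-referencing object behind.
function build_cycles($count)
{
    for ($i = 0; $i < $count; $i++) {
        $node = new stdClass();
        $node->self = $node; // circular reference
    }
}

gc_disable();                     // no automatic analysis inside the critical section
$enabledInside = gc_enabled();

build_cycles(1000);               // candidate nodes are still recorded in the root buffer

gc_enable();                      // switch the collector back on
$enabledAfter = gc_enabled();

$collected = gc_collect_cycles(); // run the analysis now instead of waiting for a full buffer

echo "GC enabled inside the section: ", var_export($enabledInside, true), "\n";
echo "GC enabled afterwards:         ", var_export($enabledAfter, true), "\n";
echo "items collected:               ", $collected, "\n";
```

The value returned by gc_collect_cycles() depends on the PHP version, so it is best treated as a rough indicator and not as an exact object count.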
Performance of the new GC algorithm
1. Preventing leaks and saving memory
The purpose of the new GC algorithm is to prevent the memory leaks caused by circular references. When the node buffer is full, the garbage analysis runs and frees the garbage it finds, reclaiming the memory. The PHP manual gives the following code together with a memory usage graph:
<?php
class Foo
{
    public $var = '3.1415962654';
}

$baseMemory = memory_get_usage();

for ( $i = 0; $i <= 100000; $i++ )
{
    $a = new Foo;
    $a->self = $a;
    if ( $i % 500 === 0 )
    {
        echo sprintf( '%8d: ', $i ), memory_get_usage() - $baseMemory, "\n";
    }
}
?>
Each pass through the loop creates a new object and makes one of its members point to the object itself, forming a circular reference. On the next pass $a is assigned again, and the previous object leaks. Two variables leak here, the object itself and its member self, but only the object is placed in the buffer as a node for the garbage collector (the re-assignment acts like an unset on it, which satisfies rule 3 above). The loop runs 100,000 times and the GC starts its analysis whenever the buffer holds 10,000 nodes, so the analysis runs 10 times in total. The manual's graph shows that under PHP 5.3 memory usage drops noticeably each time the analysis is triggered, whereas under PHP 5.2 memory usage only ever grows.
2. Impact on execution speed
With the new GC enabled, the garbage analysis is a fairly time-consuming operation. The manual provides a test script:
<?php
class Foo
{
    public $var = '3.1415962654';
}

for ( $i = 0; $i <= 1000000; $i++ )
{
    $a = new Foo;
    $a->self = $a;
}

echo memory_get_peak_usage(), "\n";
?>
The script is then run once with the GC off and once with it on:
time php -dzend.enable_gc=0 -dmemory_limit=-1 -n example2.php
# and
time php -dzend.enable_gc=1 -dmemory_limit=-1 -n example2.php
On the test machine the first run took about 10.7 seconds and the second about 11.4 seconds, a slowdown of about 7%, while peak memory usage fell by 98%, from 931 MB to 10 MB. This is not a rigorous benchmark, but it does illustrate the trade-off. The script is also an extreme case: real code, especially web applications, rarely produces circular references in such numbers, so the analysis does not run this often, and small scripts seldom trigger it at all.
Summary:
While the GC's garbage analysis is running, script execution slows down, but small scripts generally never trigger it. When the analysis does run, it spends a small amount of time to save a large amount of memory, which is a very good trade. The new GC pays off most for long-running PHP scripts, such as PHP daemons or php-gtk processes.