PHP kernel exploration: the new garbage collection mechanism explained

Source: Internet
Author: User
In PHP 5.2 and earlier there is no dedicated garbage collector (GC). The engine decides whether a variable's storage can be released solely from the refcount of its zval: if the refcount is 0, the storage is released; otherwise it is kept. This is a very simple form of garbage collection, but it lets certain variables leak memory unexpectedly (bug: http://bugs.php.net/bug.php?id=33595), because the engine can never reclaim them. PHP 5.3 therefore introduced a new GC with a dedicated mechanism for cleaning up this garbage and preventing such leaks. This article describes how the new GC in PHP 5.3 works.


At present, few detailed descriptions of the new GC are available. This article aims to be the most detailed domestic article on the GC principles in PHP 5.3, written from the source code point of view. The introduction to how garbage is produced and to the algorithm is the author's translation of the manual, with some of the author's own views added. Related content in the manual: Garbage Collection


What is garbage


First, we need to define "garbage". The garbage that the new GC is responsible for cleaning up is a variable container (zval) that still exists although no variable name points to it any longer. An important criterion the GC uses to decide whether a zval is garbage is therefore whether any variable name still points to it.


Suppose a piece of PHP code uses a temporary variable $tmp to store a string, and after processing the string we no longer need $tmp. To us, $tmp is now "garbage", but to the GC it is not: although the variable has no further meaning for us, it still exists, and the symbol $tmp still points to its zval. The GC assumes the variable may still be used later in the code, so it does not treat it as garbage.


What if we call unset to delete $tmp once we are done with it? Does $tmp become garbage then? Unfortunately, the GC still does not regard it as garbage. After unset, the refcount drops by 1 to 0 (assuming no other variable points to the same zval), so the engine frees the zval's memory immediately, and neither $tmp nor its zval exists any more. This is not the kind of "garbage" the new GC has to deal with either. So what kind of garbage does the new GC deal with? Below we produce some.
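
The immediate release described above can be observed with an object destructor, which PHP runs at the moment the last reference to an object disappears. The following is only a minimal sketch: the class name TempValue and its $log array are made up for illustration and do not come from the article or the manual.

```php
<?php
// Hypothetical helper class: records the moment its instance is destroyed.
class TempValue
{
    public static $log = array();

    public function __destruct()
    {
        self::$log[] = 'destroyed';
    }
}

$tmp = new TempValue();   // one reference: the symbol $tmp
unset($tmp);              // refcount drops to 0, the object is freed right away

// The destructor has already run, without any help from the cycle collector.
echo count(TempValue::$log), "\n";
```

Because nothing else referred to the object, plain reference counting was enough here, and the cycle collector never had to be involved.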


The process of producing stubborn garbage


If you have read about how variables are stored internally, you will already be familiar with the refcount and is_ref information kept for each variable. Here we follow an example from the manual to show how garbage is produced:


<?php
$a = "new string";
?>

In code this simple, the internal storage information of $a is: a: (refcount=1, is_ref=0)='new string'


When $a is assigned to another variable, the refcount of the zval that $a points to is incremented by 1.


<?php
$a = "new string";
$b = $a;
?>

The internal storage information for $a and $b is now: a, b: (refcount=2, is_ref=0)='new string'


When we use unset to delete $b, the refcount of the zval that $b points to is decremented by 1.

<?php
$a = "new string"; //a: (refcount=1, is_ref=0)='new string'
$b = $a;           //a,b: (refcount=2, is_ref=0)='new string'
unset($b);         //a: (refcount=1, is_ref=0)='new string'
?>
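
The rise and fall of the refcount described above can be checked with PHP's built-in debug_zval_dump(). The sketch below is only an illustration. refcount_of() is a hypothetical helper, and the absolute numbers it returns depend on the PHP version (passing the value to a function adds references of its own), so only the differences between the readings are meaningful. The string is built at run time because literal strings may be interned in newer PHP versions and are then not reference-counted.

```php
<?php
// Hypothetical helper: extracts the refcount printed by debug_zval_dump().
// The value includes the extra references created by the function calls,
// so compare readings with each other rather than with the article's numbers.
function refcount_of($value)
{
    ob_start();
    debug_zval_dump($value);
    $dump = ob_get_clean();
    preg_match('/refcount\((\d+)\)/', $dump, $m);
    return (int) $m[1];
}

$a = 'new string ' . mt_rand();   // built at run time, so it is reference-counted

$alone = refcount_of($a);         // only $a points to the value
$b = $a;
$shared = refcount_of($a);        // $a and $b share the same value
unset($b);
$after = refcount_of($a);         // back to $a only

echo $shared - $alone, "\n";      // difference caused by $b = $a
echo $after - $alone, "\n";       // difference after unset($b)
```

The first difference corresponds to the step from refcount=1 to refcount=2 in the listing, and the second to the return to refcount=1 after unset($b).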

For ordinary variables all of this looks unremarkable, but with compound variables (arrays and objects) something more interesting happens:


<?php
$a = array('meaning' => 'life', 'number' => 42);
?>

The internal storage information of $a is:


a: (refcount=1, is_ref=0)=array (
   'meaning' => (refcount=1, is_ref=0)='life',
   'number' => (refcount=1, is_ref=0)=42
)

The array variable $a itself is a hash table inside the engine. This table has two zval entries, meaning and number, so that one line of code actually creates three zvals, all of which follow the reference-counting rules for variables.



Now add an element to $a whose value is that of an existing element:


<?php
$a = array('meaning' => 'life', 'number' => 42);
$a['life'] = $a['meaning'];
?>

The internal storage of $a is now:


a: (refcount=1, is_ref=0)=array (
   'meaning' => (refcount=2, is_ref=0)='life',
   'number' => (refcount=1, is_ref=0)=42,
   'life' => (refcount=2, is_ref=0)='life'
)

The meaning and life elements point to the same zval.


Now, if we assign a reference to the array itself to one of its own elements, things get interesting:


<?php
$a = array('one');
$a[] = &$a;
?>

The array $a now has two elements: index 0 holds the value 'one', and index 1 is a reference to $a itself. The internal storage is as follows:


a: (refcount=2, is_ref=1)=array (
   0 => (refcount=1, is_ref=0)='one',
   1 => (refcount=2, is_ref=1)=...
)

The "..." indicates that element 1 points back to a itself, that is, a circular reference.


When $a is unset, $a is removed from the symbol table and the refcount of the zval it pointed to is decremented by 1.

<?php
$a = array('one');
$a[] = &$a;
unset($a);
?>

This is where the problem arises. $a is no longer in the symbol table, so user code can no longer reach the variable. However, the refcount of the zval that $a used to point to has dropped to 1 rather than 0, so it cannot be reclaimed, and the result is a memory leak.


A zval in this state is real garbage, and cleaning it up is the job of the new GC.
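
Whether such a zval really is left behind can be checked by asking the collector to run and looking at what it reports. The following is a small sketch that assumes PHP 5.3 or later. It uses only the built-in gc_enable() and gc_collect_cycles() functions, and the variable $collected is introduced here purely for illustration.

```php
<?php
gc_enable();
gc_collect_cycles();            // discard anything collectable from earlier code

$a = array('one');
$a[] = &$a;                     // element 1 refers back to the array itself
unset($a);                      // the symbol is gone, the refcount is still 1

// Nothing in user code can reach the array any more; only the cycle
// collector can reclaim it. A non-zero result means it found the garbage.
$collected = gc_collect_cycles();
echo $collected > 0 ? "cycle collected\n" : "nothing collected\n";
```

The exact number returned may differ between PHP versions, which is why the sketch only checks that it is greater than zero.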


New GC Algorithm


The new GC was created to deal with exactly this kind of garbage.


PHP 5.3 uses a dedicated GC mechanism to clean up garbage. Earlier versions had no dedicated GC, so garbage of this kind could not be cleaned up once it was produced, and the memory was simply wasted. The PHP 5.3 source adds the files {PHPSRC}/Zend/zend_gc.h and {PHPSRC}/Zend/zend_gc.c, which contain the implementation of the new GC. We first outline the idea behind the algorithm and then describe, from the source code, how the engine implements it.


The new PHP manual briefly describes the garbage-cleaning algorithm used by the new GC. It comes from the paper Concurrent Cycle Collection in Reference Counted Systems. The paper itself is not covered in detail here; instead, the idea is outlined below following the manual:


First, there are a few basic criteria:


1. If the refcount of a zval is increased, the zval is still in use and is not garbage.

2. If the refcount of a zval is decreased to 0, the zval can be freed immediately and is not garbage.

3. If the refcount of a zval is decreased but remains greater than 0, the zval cannot be freed, and it may be garbage.

Only under Criterion 3 does the GC record the zval and later use the new algorithm to decide whether it is garbage. So how can we tell whether such a variable really is garbage?


Put simply, we subtract 1 from the refcount of every element contained in the zval. If the zval's own refcount is 0 once this is done, the zval is garbage. The principle sounds simple but is not easy to grasp; the author did not understand what it meant until digging into the source code. It does not matter if it is unclear at this point, because it is explained in detail later. The steps of the algorithm are described first, following a figure in the manual whose four parts are labelled A to D:


A: To avoid running the GC algorithm every time a variable's refcount is decreased, the algorithm first places every zval that meets Criterion 3 into a root buffer as a node and marks it purple. The algorithm must ensure that each zval node appears in the buffer only once. When the buffer is full, the GC starts analyzing the zval nodes in it.

B: Once the buffer is full, the algorithm walks each node depth-first and subtracts 1 from the refcount of every zval the node contains. To make sure the refcount of the same zval is not decreased twice, a zval is marked gray as soon as its refcount has been decreased. Note that the node's own zval is not decreased at the start of this step; it is decreased only if a zval contained in the node points back to the node's zval (a circular reference).

C: The algorithm walks each node depth-first again and checks the refcount of every zval the node contains. If the refcount is 0, the zval is marked white (meaning garbage). If the refcount is greater than 0, the refcount of that zval and of the zvals it contains is increased by 1 again, which restores the non-garbage zvals, and their color is set back to black (the default color of a zval).

D: The algorithm traverses the zval nodes and frees every zval that was marked white in step C.
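
Steps A to D can be tried out on a toy model written in PHP itself. The sketch below is not the engine's implementation. ToyZval and the functions mark_grey(), scan(), scan_black() and collect_white() are hypothetical names chosen for this illustration, the purple marking of roots is left out, and "freeing" is modelled by collecting names in an array. It models the array('one', &$a) example after unset($a).

```php
<?php
// Toy model only: ToyZval is a hypothetical stand-in for a zval, not engine code.
class ToyZval
{
    public $name;
    public $refcount;
    public $children = array();   // zvals contained in this one
    public $color = 'black';      // black is the default color

    public function __construct($name, $refcount)
    {
        $this->name = $name;
        $this->refcount = $refcount;
    }
}

// Step B: depth-first, subtract 1 from every contained zval, once per zval.
function mark_grey(ToyZval $z)
{
    if ($z->color === 'grey') {
        return;
    }
    $z->color = 'grey';
    foreach ($z->children as $child) {
        $child->refcount--;
        mark_grey($child);
    }
}

// Step C: refcount 0 means garbage (white); otherwise undo the subtraction.
function scan(ToyZval $z)
{
    if ($z->color !== 'grey') {
        return;
    }
    if ($z->refcount > 0) {
        scan_black($z);
        return;
    }
    $z->color = 'white';
    foreach ($z->children as $child) {
        scan($child);
    }
}

function scan_black(ToyZval $z)
{
    $z->color = 'black';
    foreach ($z->children as $child) {
        $child->refcount++;
        if ($child->color !== 'black') {
            scan_black($child);
        }
    }
}

// Step D: gather every white zval; a real collector would free them here.
function collect_white(ToyZval $z, array &$freed)
{
    if ($z->color !== 'white') {
        return;
    }
    $z->color = 'black';
    foreach ($z->children as $child) {
        collect_white($child, $freed);
    }
    $freed[] = $z->name;
}

// Step A: the root buffer. zval_a models array('one', &$a) after unset($a):
// its only remaining reference comes from its own element 1.
$one = new ToyZval('one', 1);
$zvalA = new ToyZval('zval_a', 1);
$zvalA->children = array($one, $zvalA);
$roots = array($zvalA);

$freed = array();
foreach ($roots as $root) { mark_grey($root); }
foreach ($roots as $root) { scan($root); }
foreach ($roots as $root) { collect_white($root, $freed); }

echo implode(', ', $freed), "\n";
```

If zval_a is created with a refcount of 2 instead, modelling a variable that still points to it, step C finds a refcount greater than 0, restores the counts, and nothing ends up in $freed.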

Steps A to D are how the manual presents the algorithm, and the underlying principle is not easy to see from them. What does the algorithm actually mean? The author's own understanding is as follows:


Take the garbage array from the earlier example and call the zval that $a pointed to zval_a. Before unset is executed, the refcount of zval_a is 2: it is pointed to by $a and by index 1 of the array. The algorithm subtracts 1 from the refcount of the zval of every element in the array (index 0 and index 1). Because index 1 is zval_a itself, the refcount of zval_a becomes 1, so zval_a is not garbage. After unset is executed, the refcount of zval_a is 1, and the only thing pointing to it is its own index 1. When the algorithm subtracts 1 from the zval of every element (index 0 and index 1), the refcount of zval_a becomes 0, so zval_a is garbage. This is how the algorithm finds stubborn garbage.


From this example the general rule should be clear:


For an array that contains a circular reference, subtract 1 from the refcount of the zval of each element in the array. If the refcount of the array's own zval drops to 0, the array is garbage.


The principle is actually very simple. Suppose the refcount of array a is m, and n of the elements in a point to a. The algorithm computes m - n. If m equals n, then m - n = 0 and a is garbage. If m > n, then m - n > 0 and a is not garbage.


What does m = n mean? It means that every reference counted in a's refcount comes from an element contained in a itself. No variable points to a, so the zval of a cannot be reached from user code, and a is leaked memory. The GC therefore reclaims it.


In PHP the GC is enabled by default, and it can be turned on or off with the zend.enable_gc setting in the INI file. When the GC is enabled, the garbage analysis algorithm runs whenever the root buffer is full. By default the buffer holds 10,000 nodes; this can be changed by modifying GC_ROOT_BUFFER_MAX_ENTRIES in Zend/zend_gc.c, after which PHP must be recompiled and relinked. When the GC is disabled, the garbage analysis algorithm does not run, but possible roots are still recorded in the root buffer. If the buffer fills up, new nodes are no longer recorded, and nodes that were never recorded will never be analyzed. If those nodes are part of circular references, memory may leak. Nodes are recorded even when the GC is disabled because simply recording them is faster than checking whether the GC is enabled every time a node is produced. In addition, the GC can be enabled while a script is running, and nodes recorded earlier can then be analyzed as soon as it is switched on. The garbage analysis algorithm is, of course, a time-consuming operation.


In PHP code, the GC can be enabled and disabled with the gc_enable() and gc_disable() functions, and gc_collect_cycles() forces the garbage analysis algorithm to run even when the root buffer is not full. This makes it possible to disable or enable the GC for specific parts of a program, or to force a collection at a chosen point.
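
As an illustration of these functions, the sketch below switches the collector off around a piece of code that produces circular references and then forces a collection afterwards. It assumes PHP 5.3 or later. The class SelfRef and the variables $wasEnabled and $collected are hypothetical names used only in this example, and gc_enabled() is the built-in function that reports the current state.

```php
<?php
// Hypothetical class used only to build self-referencing objects.
class SelfRef
{
    public $self;
}

gc_enable();
gc_collect_cycles();              // start from an empty root buffer

gc_disable();                     // automatic collection is now off
$wasEnabled = gc_enabled();       // false while the collector is disabled

for ($i = 0; $i < 100; $i++) {
    $node = new SelfRef();
    $node->self = $node;          // circular reference
}
unset($node);

gc_enable();                      // switch automatic collection back on
$collected = gc_collect_cycles(); // force a run although the buffer is not full

echo $wasEnabled ? "enabled" : "disabled", "\n";
echo $collected > 0 ? "cycles collected\n" : "nothing collected\n";
```

The roots recorded while the collector was disabled are still analyzed by the forced run, which matches the behavior of the root buffer described above.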


Performance of new GC Algorithms


1. Preventing leaks and saving memory


The purpose of the new GC algorithm is to prevent the memory leaks caused by circularly referenced variables. In PHP, the GC algorithm starts when the root buffer is full and frees the garbage it detects, so that the memory can be reused. The PHP manual provides the following code together with a chart of its memory usage:


<?php
class Foo
{
    public $var = '3.1415962654';
}

$baseMemory = memory_get_usage();

for ( $i = 0; $i <= 100000; $i++ )
{
    $a = new Foo;
    $a->self = $a;
    if ( $i % 500 === 0 )
    {
        echo sprintf( '%8d: ', $i ), memory_get_usage() - $baseMemory, "\n";
    }
}
?>

In the loop body, a new object is created and one of its members is made to point to the object itself, which forms a circular reference. On the next iteration the object variable is assigned again, and the previous object is leaked. Two variables are leaked in this example, the object itself and its member self, but of the two only the object is placed in the root buffer as a node for the garbage collector (re-assigning the variable is equivalent to calling unset on the object, which satisfies Criterion 3 above). The loop runs 100,000 times, and the GC starts the garbage analysis algorithm whenever the buffer holds 10,000 nodes, so the algorithm runs 10 times in total. The chart shows clearly that in PHP 5.3 memory usage drops sharply each time the garbage analysis algorithm is triggered, whereas in PHP 5.2 it keeps growing.
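
The memory effect of a single collection can also be measured directly, without a chart. The sketch below is a variation on the manual's loop, not the manual's own code. The class Leaky is a hypothetical name, its $self property is declared explicitly so that the example also runs cleanly on recent PHP versions, and the loop count of 20,000 is an arbitrary choice.

```php
<?php
// Hypothetical class; $self is declared so the example also runs on PHP 8.2+.
class Leaky
{
    public $var = '3.1415962654';
    public $self;
}

gc_disable();                         // let the garbage pile up first
for ($i = 0; $i < 20000; $i++) {
    $a = new Leaky();
    $a->self = $a;                    // each iteration orphans the previous object
}
unset($a);

$before = memory_get_usage();
gc_enable();
$collected = gc_collect_cycles();     // reclaim the orphaned objects
$after = memory_get_usage();

echo "collected: $collected\n";
echo "memory released: ", $before - $after, " bytes\n";
```

The figures printed depend on the PHP version and platform, so the sketch deliberately does not state expected values; only the direction of the change matters.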


2. Impact on execution speed


With the new GC enabled, the garbage analysis algorithm costs time whenever it runs. The manual provides test code for measuring this:


<?php
class Foo
{
    public $var = '3.1415962654';
}

for ( $i = 0; $i <= 1000000; $i++ )
{
    $a = new Foo;
    $a->self = $a;
}

echo memory_get_peak_usage(), "\n";
?>

The code is then run once with the GC disabled and once with it enabled:



time php -dzend.enable_gc=0 -dmemory_limit=-1 -n example2.php

# and

time php -dzend.enable_gc=1 -dmemory_limit=-1 -n example2.php

On the test machine, the first run took about 10.7 seconds and the second about 11.4 seconds, a slowdown of about 7%, while memory usage fell by 98%, from 931 MB to 10 MB. This is not a scientific benchmark, but it does show the trend. Code like this is an extremely harsh test. In real code, especially in web applications, large numbers of circular references are rare, so the GC analysis algorithm is not started this often, and small scripts may never trigger it at all.


Summary:


Running the GC analysis algorithm does affect the speed of a PHP script, but small scripts generally never get to run it. When the algorithm does run, it spends a small amount of time to save a large amount of memory, which is a very good trade. The new GC is most beneficial for long-running PHP scripts, such as daemons or PHP-GTK processes.


Site: http://www.phpdoor.com/PHP/280.html
