[Translation] Understanding PHP's internal array implementation (PHP source code for PHP developers, Part 4)
Article from: http://www.aintnot.com/2016/02/15/understanding-phps-internal-array-implementation-ch
Original article: https://nikic.github.io/2012/03/28/Understanding-PHPs-internal-array-implementation.html
Welcome to the fourth part of the "PHP source code for PHP developers" series. In this part we will look at how PHP arrays are represented internally and how they are used throughout the code base.
In case you missed the previous parts, here they are:
Part 1: PHP source code for PHP developers - source code structure
Part 2: Understanding the definition of PHP internal functions
Part 3: PHP variable implementation
Everything is a hash table
Basically, everything in PHP is a hash table. Hash tables are not only the underlying implementation of PHP arrays; they are also used to store object properties, methods, functions, variables and pretty much everything else.
Because the hash table is so fundamental to PHP, it is worth taking an in-depth look at how it works.
So what is a hash table?
Remember that in C, arrays are just blocks of memory that you access by offset. An array in C can therefore only have integer keys, and they must be contiguous (that is, you cannot have a key 0 followed directly by a key 1332423442). C has no such thing as an associative array.
Hash tables use a hash function to convert string keys into ordinary integer keys. The resulting hash can then be used as an offset into a normal C array (that is, a block of memory). The problem is that hash functions can collide: several string keys may end up at the same offset. For example, in PHP arrays with up to 64 elements, the strings "foo" and "oof" hash to the same offset.
This is solved by not storing the value directly at the generated offset, but storing a linked list of possible values there instead.
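The idea of chaining colliding keys can be sketched in a few lines of C. This is a toy table, not the Zend implementation: the names toy_table, toy_entry, toy_set and toy_find are made up for illustration, the table has a fixed 64 slots, and keys are not copied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TOY_SLOTS 64u   /* hypothetical fixed size; must be a power of two */

/* One entry of a collision chain. */
typedef struct toy_entry {
    const char *key;          /* not copied: the caller keeps it alive */
    int value;
    struct toy_entry *next;   /* next entry that landed in the same slot */
} toy_entry;

typedef struct {
    toy_entry *slots[TOY_SLOTS];   /* zero-initialize before use */
} toy_table;

/* "Times 33" string hash, starting from 5381. */
static unsigned long toy_hash(const char *key)
{
    unsigned long h = 5381;
    assert(key != NULL);
    while (*key) {
        h = h * 33 + (unsigned char)*key++;
    }
    return h;
}

/* Slot that a key falls into. */
static size_t toy_slot(const char *key)
{
    return (size_t)(toy_hash(key) & (TOY_SLOTS - 1));
}

/* Find the entry for key by walking the chain of its slot. */
static toy_entry *toy_find(toy_table *t, const char *key)
{
    for (toy_entry *e = t->slots[toy_slot(key)]; e != NULL; e = e->next) {
        if (strcmp(e->key, key) == 0) {
            return e;
        }
    }
    return NULL;
}

/* Insert or overwrite; returns 0 on success, -1 if malloc fails. */
static int toy_set(toy_table *t, const char *key, int value)
{
    toy_entry *e = toy_find(t, key);
    if (e != NULL) {
        e->value = value;
        return 0;
    }
    e = malloc(sizeof *e);
    if (e == NULL) {
        return -1;
    }
    size_t slot = toy_slot(key);
    e->key = key;
    e->value = value;
    e->next = t->slots[slot];   /* push onto the front of the chain */
    t->slots[slot] = e;
    return 0;
}
```

With a table declared as toy_table t = {{0}}, storing "foo" and "oof" puts both entries into the same slot, and toy_find still returns the right one because it compares the full key while walking the chain. Freeing the entries is left out to keep the sketch short.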
HashTable and Bucket
Now the basic concepts of the hash table are clear. Let's look at the structure of the hash table implemented in PHP:
typedef struct _hashtable {
    uint nTableSize;
    uint nTableMask;
    uint nNumOfElements;
    ulong nNextFreeElement;
    Bucket *pInternalPointer;
    Bucket *pListHead;
    Bucket *pListTail;
    Bucket **arBuckets;
    dtor_func_t pDestructor;
    zend_bool persistent;
    unsigned char nApplyCount;
    zend_bool bApplyProtection;
#if ZEND_DEBUG
    int inconsistent;
#endif
} HashTable;
A quick rundown:
nNumOfElements is the number of values currently stored in the array. This is also the value that count($array) returns.
nTableSize is the capacity of the hash table. It is always the next power of two greater than or equal to nNumOfElements. For example, if the array stores 32 elements, the hash table also has a capacity of 32. If one more element is added, so that the array holds 33 elements, the capacity grows to 64.
This keeps the hash table efficient in both space and time. If the table is too small, there will be many collisions and performance suffers. If it is too large, memory is wasted. Powers of two are a good compromise.
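The "next power of two" rule can be written as a small helper. This is only a sketch of the arithmetic described above; toy_table_size is a hypothetical name, and any minimum capacity the engine may enforce is ignored here.

```c
#include <assert.h>

/* Smallest power of two that is >= n (hypothetical helper, not a Zend API). */
static unsigned int toy_table_size(unsigned int n)
{
    unsigned int size = 1;
    assert(n <= 0x80000000u);   /* larger values would overflow the shift */
    while (size < n) {
        size <<= 1;
    }
    return size;
}
```

For 32 elements this yields 32, and for 33 elements it yields 64, matching the example in the text.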
nTableMask is the capacity of the hash table minus one. This mask is used to fit the generated hash to the current table size. For example, the real hash of "foo" (using the DJBX33A hash function) is 193491849. With a table of capacity 64 we obviously cannot use that as an array offset. Instead we apply the table mask, which keeps only the low bits of the hash:
hash           |        193491849  |     0b1011100010000111001110001001
& mask         | &             63  | &   0b0000000000000000000000111111
---------------------------------------------------------
= index        | = 9               | =   0b0000000000000000000000001001
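The calculation above can be reproduced with a few lines of C. This is a minimal sketch of DJBX33A (multiply by 33, add the next byte, starting from 5381) followed by the mask step. The function names are made up for illustration, and the engine's real hash function is written differently even though it computes the same value for short keys like this one.

```c
#include <assert.h>
#include <stddef.h>

/* DJBX33A ("times 33 with addition"), seeded with 5381. */
static unsigned long djbx33a(const char *key, size_t len)
{
    unsigned long hash = 5381;
    assert(key != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        hash = hash * 33 + (unsigned char)key[i];
    }
    return hash;
}

/* Map a full hash to a slot; table_size must be a power of two. */
static unsigned long toy_index(unsigned long hash, unsigned long table_size)
{
    return hash & (table_size - 1);   /* table_size - 1 plays the role of nTableMask */
}
```

djbx33a("foo", 3) gives 193491849, and masking with a table size of 64 gives index 9. "oof" also lands on index 9 at that size, but the two keys separate once the table grows to 128 slots.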
nNextFreeElement is the next free integer key. It is used when you write $array[] = xyz.
pInternalPointer stores the current position in the array. During iteration this value is accessible through the reset(), current(), key(), next(), prev() and end() functions.
pListHead and pListTail point to the first and last elements of the array. Remember: PHP arrays are ordered. For example, ['foo' => 'bar', 'bar' => 'foo'] and ['bar' => 'foo', 'foo' => 'bar'] contain the same elements but in a different order.
arBuckets is the actual "hash table" (the internal C array) that we have been talking about. It is declared as Bucket **, so it can be seen as an array of bucket pointers (we will get to what a Bucket is in a moment).
pDestructor is the destructor for the values. It is called whenever a value is removed from the hash table. A common destructor is zval_ptr_dtor, which decrements the reference count of the zval and, if it reaches 0, destroys and frees it.
The last four fields are not that important for us. In short: persistent means the hash table can survive across multiple requests, nApplyCount and bApplyProtection prevent infinite recursion, and inconsistent is used to catch illegal uses of the hash table in debug mode.
Let's continue with the second important structure: Bucket:
typedef struct bucket {
    ulong h;
    uint nKeyLength;
    void *pData;
    void *pDataPtr;
    struct bucket *pListNext;
    struct bucket *pListLast;
    struct bucket *pNext;
    struct bucket *pLast;
    const char *arKey;
} Bucket;
h is the hash (before the table mask is applied).
arKey holds the string key and nKeyLength is its length. For integer keys neither of the two is used.
pData and pDataPtr store the actual value. For PHP arrays the value is a zval (but the structure is used for other things as well). Don't worry about why there are two fields; the difference between them is who is responsible for freeing the value.
pListNext and pListLast point to the next and the previous element of the array. When PHP traverses the array in order, it starts at the pListHead bucket (from the HashTable structure) and follows the pListNext pointers. Reverse traversal works the same way: it starts at pListTail and follows the pListLast pointers. (You can get this behavior from PHP code by calling end() and then repeatedly calling prev().)
pNext and pLast form the "linked list of possible values" I mentioned above. The arBuckets array stores a pointer to the first bucket that might hold the value. If that bucket does not have the right key, PHP looks at the bucket pNext points to, and keeps following pNext until it finds the right bucket. pLast does the same in reverse order.
As you can see, PHP's hash table implementation is fairly complex. That is the price of having one ultra-flexible array type.
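To make the two sets of pointers more concrete, here is a heavily simplified model in which every bucket sits in two lists at once: the collision chain of its slot and the global insertion-order list. It only handles integer keys, the buckets are owned by the caller, duplicate keys are not checked, and all toy_ names are invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SIZE 8u   /* hypothetical table size; must be a power of two */

typedef struct toy_bucket {
    unsigned long h;               /* integer key, doubles as the hash */
    int value;
    struct toy_bucket *list_next;  /* like pListNext: insertion order */
    struct toy_bucket *list_last;  /* like pListLast */
    struct toy_bucket *next;       /* like pNext: same-slot chain */
} toy_bucket;

typedef struct {
    toy_bucket *slots[TOY_SIZE];   /* like arBuckets */
    toy_bucket *head;              /* like pListHead */
    toy_bucket *tail;              /* like pListTail */
} toy_ht;

/* Link a caller-owned bucket into its slot chain and into the ordered list. */
static void toy_link(toy_ht *ht, toy_bucket *b, unsigned long key, int value)
{
    size_t slot = (size_t)(key & (TOY_SIZE - 1));
    assert(ht != NULL && b != NULL);
    b->h = key;
    b->value = value;
    b->next = ht->slots[slot];     /* new bucket becomes the chain head */
    ht->slots[slot] = b;
    b->list_next = NULL;           /* and the tail of the ordered list */
    b->list_last = ht->tail;
    if (ht->tail != NULL) {
        ht->tail->list_next = b;
    } else {
        ht->head = b;
    }
    ht->tail = b;
}

/* Lookup: pick the slot, then follow the collision chain. */
static toy_bucket *toy_lookup(toy_ht *ht, unsigned long key)
{
    for (toy_bucket *b = ht->slots[key & (TOY_SIZE - 1)]; b != NULL; b = b->next) {
        if (b->h == key) {
            return b;
        }
    }
    return NULL;
}

/* Copy the keys in insertion order into out; returns how many were written. */
static size_t toy_keys_in_order(const toy_ht *ht, unsigned long *out, size_t max)
{
    size_t n = 0;
    for (const toy_bucket *b = ht->head; b != NULL && n < max; b = b->list_next) {
        out[n++] = b->h;
    }
    return n;
}
```

Linking the keys 9, 1 and 17 (which all fall into slot 1 of an 8-slot table) produces one collision chain of three buckets, while toy_keys_in_order still reports them as 9, 1, 17, because the ordered list is independent of the slot layout.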
How is a hash table used?
The Zend Engine defines a large number of API functions for working with hash tables. An overview of the low-level hash table functions can be found in zend_hash.h. In addition, the Zend Engine defines a slightly higher-level API in zend_API.h.
We don't have time to go through all of the functions, but we can at least look at a sample function to see how they are used. We will use array_fill_keys as that sample.
You can easily find the function in ext/standard/array.c. Let's walk through it quickly.
Like most functions, it starts with a bunch of variable definitions, followed by a call to zend_parse_parameters:
zval *keys, *val, **entry;
HashPosition pos;

if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "az", &keys, &val) == FAILURE) {
    return;
}
The "az" specifier says that the first parameter is an array (stored in keys) and the second is an arbitrary zval (stored in val).
After the parameters are parsed, the return array is initialized:
/* Initialize return array */
array_init_size(return_value, zend_hash_num_elements(Z_ARRVAL_P(keys)));
This line contains three important pieces of the array API:
1. The Z_ARRVAL_P macro fetches the hash table from a zval.
2. zend_hash_num_elements returns the number of elements in that hash table (the nNumOfElements field).
3. array_init_size initializes an array with a size hint.
So this line initializes the return_value array with the same size as the keys array.
The size is only an optimization. The function could just as well call array_init(return_value), but then PHP would have to resize the array several times as more and more elements are added. With an explicit size, PHP allocates the right amount of memory from the start.
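To get a feeling for what the size hint saves, the following sketch counts how often a table has to grow if it doubles its capacity every time it runs full. It is a model of the growth pattern only: toy_count_resizes is a hypothetical helper, and the starting capacity of 8 used in the example is an assumption for illustration.

```c
#include <assert.h>

/*
 * Count how often a table that starts at `capacity` slots and doubles
 * whenever it is full has to grow before it can hold `n` elements.
 */
static unsigned int toy_count_resizes(unsigned int capacity, unsigned int n)
{
    unsigned int resizes = 0;
    assert(capacity > 0 && n <= 0x80000000u);
    while (capacity < n) {
        capacity *= 2;
        resizes++;
    }
    return resizes;
}
```

Filling 1000 elements into a table that starts at 8 slots takes 7 resizes (8, 16, 32, ... up to 1024), while a table that is created with 1024 slots right away needs none.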
After the return array is initialized, the function loops over the keys array with a while loop that has roughly the following structure:
zend_hash_internal_pointer_reset_ex(Z_ARRVAL_P(keys), &pos);
while (zend_hash_get_current_data_ex(Z_ARRVAL_P(keys), (void **)&entry, &pos) == SUCCESS) {
    // some code

    zend_hash_move_forward_ex(Z_ARRVAL_P(keys), &pos);
}
This can be easily translated into PHP code:
reset($keys);
while (null !== key($keys)) { // key() returns null once the pointer is past the end
    $entry = current($keys);
    // some code

    next($keys);
}
Which is the same as:
foreach ($keys as $entry) {
    // some code
}
The only difference is that the C iteration does not use the internal array pointer, but its own pos variable to keep track of the current position.
The code inside the loop has two branches: one for integer keys and one for all other keys. The integer key branch consists of just these two lines:
zval_add_ref(&val);
zend_hash_index_update(Z_ARRVAL_P(return_value), Z_LVAL_PP(entry), &val, sizeof(zval *), NULL);
This is quite straightforward: first the refcount of the value is incremented (adding the value to the hash table means adding another reference to it), then the value is inserted into the hash table. The arguments to the zend_hash_index_update macro are the hash table to update, Z_ARRVAL_P(return_value); the integer index, Z_LVAL_PP(entry); the value, &val; the size of the value, sizeof(zval *); and the destination pointer (which we don't need, hence NULL).
The branch for non-integer keys is slightly more complicated:
zval key, *key_ptr = *entry;

if (Z_TYPE_PP(entry) != IS_STRING) {
    key = **entry;
    zval_copy_ctor(&key);
    convert_to_string(&key);
    key_ptr = &key;
}

zval_add_ref(&val);
zend_symtable_update(Z_ARRVAL_P(return_value), Z_STRVAL_P(key_ptr), Z_STRLEN_P(key_ptr) + 1, &val, sizeof(zval *), NULL);

if (key_ptr != *entry) {
    zval_dtor(&key);
}
First, the key is converted to a string using convert_to_string (unless it already is one). Before that, entry is copied into the new key variable. The line key = **entry does this. In addition, zval_copy_ctor has to be called, otherwise complex structures (such as strings or arrays) would not be copied correctly.
The copy is necessary to make sure that the conversion does not change the original array. Without it, the cast would modify not only the local variable but also the value in the keys array (which would obviously be very unexpected for the user).
The copy has to be removed again once it has been used, and that is what zval_dtor(&key) does. The difference between zval_ptr_dtor and zval_dtor is that zval_ptr_dtor only destroys the zval once its refcount drops to 0, whereas zval_dtor destroys it right away, regardless of the refcount. That is why you will see zval_ptr_dtor used for "normal" values and zval_dtor used for temporary variables that are not used anywhere else. Also, zval_ptr_dtor frees the zval itself after destroying it, while zval_dtor does not. Since we did not malloc() anything, we do not need to free() anything either, so zval_dtor is the right choice in this respect too.
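The distinction between the two destructors, and the copy step before the conversion, can be mimicked with a toy reference-counted string value. The three functions below are stand-ins written for this sketch. They imitate the roles of zval_copy_ctor, zval_dtor and zval_ptr_dtor as described above but are not the Zend functions, and toy_val is far simpler than a real zval.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a string zval: an owned buffer plus a reference count. */
typedef struct {
    char *str;
    unsigned int refcount;
} toy_val;

/* Role of zval_copy_ctor: give v its own copy of the buffer it points at. */
static int toy_copy_ctor(toy_val *v)
{
    assert(v != NULL && v->str != NULL);
    size_t len = strlen(v->str) + 1;
    char *copy = malloc(len);
    if (copy == NULL) {
        return -1;
    }
    memcpy(copy, v->str, len);
    v->str = copy;
    return 0;
}

/* Role of zval_dtor: release the contents now, leave the struct itself alone. */
static void toy_dtor(toy_val *v)
{
    assert(v != NULL);
    free(v->str);
    v->str = NULL;
}

/*
 * Role of zval_ptr_dtor: drop one reference; at zero, destroy the contents
 * and free the struct. Returns 1 if the value was freed, 0 if it is alive.
 */
static int toy_ptr_dtor(toy_val *v)
{
    assert(v != NULL && v->refcount > 0);
    if (--v->refcount > 0) {
        return 0;
    }
    toy_dtor(v);
    free(v);
    return 1;
}
```

A temporary on the stack that was filled with tmp = *shared followed by toy_copy_ctor(&tmp) owns its own buffer, so toy_dtor(&tmp) can release it without touching the shared value. The shared, heap-allocated value is only released by the toy_ptr_dtor call that drops its last reference.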
Now let's look at the two remaining lines (the important ones):
zval_add_ref(&val);
zend_symtable_update(Z_ARRVAL_P(return_value), Z_STRVAL_P(key_ptr), Z_STRLEN_P(key_ptr) + 1, &val, sizeof(zval *), NULL);
This is very similar to what the integer key branch does. The difference is that zend_symtable_update is called instead of zend_hash_index_update, and that the key string and its length are passed (the length includes the terminating NUL byte, hence the + 1).
Symbol table
The "normal" function for inserting string keys into a hash table is zend_hash_update, but here zend_symtable_update is used instead. What is the difference?
A symbol table (symtable) is basically a special kind of hash table, the kind used for arrays. It differs from a plain hash table in how it handles integer-like string keys: in a symtable, the keys "123" and 123 are considered the same. So if you store a value in $array["123"], you can later retrieve it with $array[123].
This could be implemented in two ways: either use "123" to store both 123 and "123", or use 123 for both. PHP chooses the latter (because integers are faster and smaller than strings).
You can have some fun with symbol tables if you manage to insert the key "123" without it being converted to 123. One way to do so is to abuse the object-to-array cast:
$obj = new stdClass;
$obj->{123} = "foo";
$arr = (array) $obj;
var_dump($arr[123]);   // Undefined offset: 123
var_dump($arr["123"]); // Undefined offset: 123
Object properties are always stored with string keys, even if they are numbers. So the line $obj->{123} = 'foo' actually stores 'foo' under the key "123", not 123. The array cast does not change this. But since $arr[123] and $arr["123"] both try to access the key 123 (not the existing key "123"), both raise an error. Congratulations, you have created a hidden array element.
Next part
The next part will again be published on ircmaxell's blog. It will cover how objects and classes work internally.