Understanding the implementation of arrays in PHP
Welcome to the fourth part of the "PHP source code for PHP developers" series. In this part we will look at how PHP arrays are represented internally and how they are used throughout the code base. In case you missed the earlier parts, here they are:
Part 1: PHP source code for PHP developers: source code structure
Part 2: Understanding the definition of PHP internal functions
Part 3: PHP variable implementation
Everything is a hash table
Basically, everything in PHP is a hash table. Hash tables are not only the basis of the PHP array implementation described below; they are also used to store object properties, methods, functions, variables, and almost everything else.
Because the hash table is so fundamental to PHP, it is worth studying in depth how it works.
So what is a hash table?
Remember that in C, arrays are chunks of memory that you access by index. Arrays in C therefore only have integer keys, and those keys must be contiguous (that is, you cannot have a key 0 followed directly by a key 1332423442). C has no associative arrays.
Hash tables use a hash function to convert string keys into ordinary integer keys. The result of the hash can then be used as an offset into a normal C array (that is, a chunk of memory). The problem is that hash functions can collide: several different string keys may produce the same hash value. For example, in PHP the strings "foo" and "oof" end up in the same slot in arrays with up to 64 elements.
This problem is solved by storing colliding values in a linked list instead of storing a single value directly at the computed offset.
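The chaining idea can be sketched in a few lines of C. This is a toy illustration, not PHP's code: the names entry, toy_hash, chain_insert and chain_find are made up for this example, and the hash function is deliberately weak (a sum of bytes) so that anagrams such as "foo" and "oof" are guaranteed to collide.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SLOTS 8 /* toy table size (hypothetical), must be a power of two */

/* One stored entry; entries that land in the same slot are chained via next. */
struct entry {
    const char *key;
    int value;
    struct entry *next;
};

/* Deliberately weak hash (sum of bytes), so anagrams always collide. */
static unsigned long toy_hash(const char *key) {
    unsigned long h = 0;
    while (*key) h += (unsigned char)*key++;
    return h;
}

/* Prepend e to the chain of the slot its key hashes to. */
static void chain_insert(struct entry *slots[SLOTS], struct entry *e) {
    assert(e != NULL);
    size_t i = toy_hash(e->key) & (SLOTS - 1);
    e->next = slots[i];
    slots[i] = e;
}

/* Walk the chain of the key's slot until the key matches; NULL if absent. */
static struct entry *chain_find(struct entry *slots[SLOTS], const char *key) {
    struct entry *e = slots[toy_hash(key) & (SLOTS - 1)];
    while (e != NULL && strcmp(e->key, key) != 0) e = e->next;
    return e;
}
```

Both "foo" and "oof" land in the same slot here, yet chain_find still returns the right entry for each, because it compares the full key while walking the chain.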
HashTable and Bucket
Now that the basic concept of a hash table is clear, let's look at the structure PHP uses to implement it:
typedef struct _hashtable {
    uint nTableSize;
    uint nTableMask;
    uint nNumOfElements;
    ulong nNextFreeElement;
    Bucket *pInternalPointer;
    Bucket *pListHead;
    Bucket *pListTail;
    Bucket **arBuckets;
    dtor_func_t pDestructor;
    zend_bool persistent;
    unsigned char nApplyCount;
    zend_bool bApplyProtection;
#if ZEND_DEBUG
    int inconsistent;
#endif
} HashTable;
A quick walk-through of the fields:
nNumOfElements is the number of values currently stored in the array. This is also the value returned by count($array).
nTableSize is the capacity of the hash table. It is always a power of 2 that is equal to or greater than nNumOfElements. For example, if the array stores 32 elements, the hash table also has a capacity of 32. If one more element is added, so that the array now has 33 elements, the capacity of the hash table grows to 64. This keeps the hash table efficient in both space and time. If the hash table is too small, there will be many collisions and performance suffers. If it is too large, memory is wasted. Growing by powers of 2 is a good compromise.
nTableMask is the capacity of the hash table minus one. This mask is used to fit a hash value into the current table size. For example, the real hash value of "foo" (using the DJBX33A hash function) is 193491849. With a hash table of capacity 64 we obviously cannot use that number as an array index. Instead, we apply the table mask and keep only the low bits of the hash:
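hash    |   193491849 |   0b1011100010000111001110001001
& mask  | &        63 | & 0b0000000000000000000000111111
--------------------------------------------------------
= index | =         9 | = 0b0000000000000000000000001001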
nNextFreeElement is the next free integer key. It is used when you write $array[] = xyz.
pInternalPointer stores the current position in the array. This is the position used during foreach-style iteration, and it is exposed to PHP code through the reset(), current(), key(), next(), prev(), and end() functions.
pListHead and pListTail point to the first and last elements of the array. Remember: PHP arrays are ordered collections. For example, ['foo' => 'bar', 'bar' => 'foo'] and ['bar' => 'foo', 'foo' => 'bar'] contain the same elements, but in a different order.
arBuckets is the actual "hash table" (the internal C array) we have been talking about. It is declared as Bucket **, so it can be seen as an array of bucket pointers (we will get to what a Bucket is in a moment).
pDestructor is the value destructor. Whenever a value is removed from the hash table, this function is called on it. For arrays the usual destructor is zval_ptr_dtor. zval_ptr_dtor decrements the refcount of the zval and, if it reaches 0, destroys and frees it.
The last four fields are less important for us. In short: persistent marks a hash table that can outlive a single request, nApplyCount and bApplyProtection guard against infinite recursion, and inconsistent is used to catch illegal uses of the hash table in debug mode.
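The capacity and mask arithmetic described above can be sketched in C. This is a simplified illustration rather than the engine's own code: table_capacity, djbx33a and slot_index are names invented for this example, and the minimum capacity of 8 is an assumption of the sketch. The hash is the classic DJBX33A recurrence (start at 5381, multiply by 33, add the next byte).

```c
#include <assert.h>
#include <stddef.h>

/* Smallest power of two >= n_elements, starting from an assumed minimum of 8. */
static unsigned int table_capacity(unsigned int n_elements) {
    unsigned int size = 8;
    assert(n_elements <= 0x40000000u); /* keep the shift below from overflowing */
    while (size < n_elements) size <<= 1;
    return size;
}

/* DJBX33A: hash = hash * 33 + byte, seeded with 5381. */
static unsigned long djbx33a(const char *key, size_t len) {
    unsigned long hash = 5381;
    for (size_t i = 0; i < len; i++) hash = hash * 33 + (unsigned char)key[i];
    return hash;
}

/* Keep only the low bits of the hash; this is the role of nTableMask. */
static unsigned long slot_index(unsigned long hash, unsigned int table_size) {
    assert(table_size != 0 && (table_size & (table_size - 1)) == 0);
    return hash & (table_size - 1);
}
```

With these helpers, djbx33a("foo", 3) yields 193491849 and slot_index(193491849, 64) yields 9. The hashes of "foo" and "oof" differ by 9792, which is a multiple of 64 but not of 128, so the two keys share a slot in a 64-slot table and separate once the table grows to 128 slots.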
Let's continue with the second important structure, the Bucket:
typedef struct bucket {
    ulong h;
    uint nKeyLength;
    void *pData;
    void *pDataPtr;
    struct bucket *pListNext;
    struct bucket *pListLast;
    struct bucket *pNext;
    struct bucket *pLast;
    const char *arKey;
} Bucket;
h is the hash value (the full value, before the table mask is applied).
arKey holds the string key and nKeyLength is its length. For integer keys, neither of these two fields is used.
pData and pDataPtr are used to store the actual value. For PHP arrays the value is a zval (but buckets are also used elsewhere with other types). Don't worry about there being two fields. The difference between them is who is responsible for freeing the value.
pListNext and pListLast point to the next and previous elements of the array. When PHP traverses an array in order, it starts at the pListHead bucket (in the HashTable structure) and follows the pListNext pointers. Reverse traversal works the same way, starting at pListTail and following the pListLast pointers. (You can get this behavior in PHP code by calling end() and then repeatedly calling prev().)
pNext and pLast form the "linked list of possibly colliding values" mentioned above. The arBuckets array stores a pointer to the first bucket for each slot. If that bucket does not have the right key, PHP looks at the bucket pNext points to, and keeps following pNext until the right bucket is found. pLast does the same in the reverse direction.
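To make the two kinds of links concrete, here is a reduced sketch in C. It is an illustration under simplifying assumptions, not the Zend structures: only integer keys are stored, both lists are singly linked (the real Bucket has back pointers too), and the names table_append, table_find and table_keys_in_order are invented for this example. The field next plays the role of pNext (collision chain) and list_next the role of pListNext (insertion order).

```c
#include <assert.h>
#include <stddef.h>

#define SLOTS 4 /* toy size (hypothetical), must be a power of two */

struct bucket {
    unsigned long h;          /* full hash, here simply the integer key */
    int value;
    struct bucket *list_next; /* next element in insertion order (cf. pListNext) */
    struct bucket *next;      /* next bucket in the same slot (cf. pNext) */
};

struct table {
    struct bucket *slots[SLOTS]; /* cf. arBuckets */
    struct bucket *head;         /* cf. pListHead */
    struct bucket *tail;         /* cf. pListTail */
};

/* Link b into its slot's collision chain and at the end of the ordered list. */
static void table_append(struct table *t, struct bucket *b) {
    assert(t != NULL && b != NULL);
    size_t i = b->h & (SLOTS - 1);
    b->next = t->slots[i];
    t->slots[i] = b;
    b->list_next = NULL;
    if (t->tail != NULL) t->tail->list_next = b;
    else t->head = b;
    t->tail = b;
}

/* Lookup: start at the slot and follow the collision chain. */
static struct bucket *table_find(const struct table *t, unsigned long h) {
    struct bucket *b = t->slots[h & (SLOTS - 1)];
    while (b != NULL && b->h != h) b = b->next;
    return b;
}

/* Ordered traversal: start at head and follow list_next, ignoring the slots. */
static size_t table_keys_in_order(const struct table *t, unsigned long *out, size_t cap) {
    size_t n = 0;
    for (const struct bucket *b = t->head; b != NULL && n < cap; b = b->list_next)
        out[n++] = b->h;
    return n;
}
```

Inserting the keys 5, 1 and 9 puts all three into slot 1, so lookups have to walk the chain, while table_keys_in_order still reports them in insertion order. Keeping the two lists separate is what lets the array be both a hash table and an ordered collection.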
As you can see, the implementation of hash tables in PHP is quite complicated. This is the cost of using an ultra-flexible array type.
How is a hash table used?
The Zend Engine defines a large number of API functions for working with hash tables. The low-level hash table functions can be found in the zend_hash.h file. In addition, the Zend Engine defines a slightly higher-level API in the zend_API.h file.
We don't have time to go through all of these functions, but we can at least look at one example function to see how they are used. We will use array_fill_keys as the example.
Using the technique described in the second part, you can easily find the function's definition in the ext/standard/array.c file. Let's take a quick look at it.
Like most functions, it starts with a number of variable declarations, followed by a call to zend_parse_parameters:
zval *keys, *val, **entry;
HashPosition pos;

if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "az", &keys, &val) == FAILURE) {
    return;
}
The "az" format string says that the first parameter must be an array (stored in the variable keys) and that the second parameter can be any zval (stored in the variable val).
After the parameters are parsed, the returned array is initialized:
/* Initialize return array */
array_init_size(return_value, zend_hash_num_elements(Z_ARRVAL_P(keys)));
This line contains three important parts of the array API:
1. The Z_ARRVAL_P macro extracts the hash table from a zval.
2. zend_hash_num_elements returns the number of elements in a hash table (the nNumOfElements field).
3. array_init_size initializes an array with a given size hint.
So this line initializes return_value as an array with the same size as the keys array.
The size is only an optimization. The function could just as well call array_init(return_value), but then PHP would have to resize the hash table several times as more and more elements are added. By passing an explicit size, PHP allocates the right amount of memory from the start.
After the return array is initialized, the function iterates over the keys array with a while loop that has roughly the following structure:
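A rough way to see what the size hint saves is to count how often a doubling table has to grow. The following C sketch is a model only: count_resizes is a name invented for this example, and the growth rule (double whenever the element count exceeds the capacity) is the one described earlier in this article, not code taken from the engine.

```c
#include <assert.h>

/* Count how many doublings are needed to hold n_inserts elements,
   starting from initial_capacity (which must be a power of two). */
static unsigned int count_resizes(unsigned int initial_capacity, unsigned int n_inserts) {
    unsigned int capacity = initial_capacity;
    unsigned int resizes = 0;
    assert(capacity != 0 && (capacity & (capacity - 1)) == 0);
    assert(n_inserts <= 0x40000000u); /* keep capacity * 2 from overflowing */
    for (unsigned int used = 1; used <= n_inserts; used++) {
        if (used > capacity) {
            capacity *= 2; /* the table is full: double it */
            resizes++;
        }
    }
    return resizes;
}
```

Under this model, filling 1000 elements into a table that starts at capacity 8 takes 7 resizes (8 to 16, 32, 64, 128, 256, 512, 1024), whereas a table that starts at capacity 1024 needs none.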
zend_hash_internal_pointer_reset_ex(Z_ARRVAL_P(keys), &pos);
while (zend_hash_get_current_data_ex(Z_ARRVAL_P(keys), (void **)&entry, &pos) == SUCCESS) {
    // some code

    zend_hash_move_forward_ex(Z_ARRVAL_P(keys), &pos);
}
This can be easily translated into PHP code:
reset($keys);
// key() returns null once the pointer has moved past the last element
while (null !== key($keys)) {
    $entry = current($keys);
    // some code
    next($keys);
}
Which is equivalent to:
foreach ($keys as $entry) {
    // some code
}
The only difference is that the C loop does not use the array's internal pointer; it keeps the current position in its own pos variable instead.
The code inside the loop has two branches: one for integer keys and one for all other keys. The branch for integer keys consists of just the following two lines:
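Why an external position matters can be shown with a small C sketch. It is a toy model rather than the Zend API: list, position, pos_reset, pos_current and pos_forward are hypothetical names that only mirror the shape of the *_ex functions used above. The container has its own internal cursor, and the loop leaves it untouched because it iterates through a caller-owned position.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

struct list {
    struct node *head;
    struct node *internal; /* the container's own cursor (cf. pInternalPointer) */
};

typedef struct node *position; /* caller-owned cursor (cf. HashPosition) */

static void pos_reset(const struct list *l, position *pos) { *pos = l->head; }

/* Store the current value in *out; returns 0 once the position is past the end. */
static int pos_current(position pos, int *out) {
    if (pos == NULL) return 0;
    *out = pos->value;
    return 1;
}

static void pos_forward(position *pos) {
    if (*pos != NULL) *pos = (*pos)->next;
}

/* Iterate with an external position; l->internal is never read or written. */
static int list_sum(const struct list *l) {
    position pos;
    int value, sum = 0;
    assert(l != NULL);
    pos_reset(l, &pos);
    while (pos_current(pos, &value)) {
        sum += value;
        pos_forward(&pos);
    }
    return sum;
}
```

Because list_sum takes the list as const and keeps its own position, any iteration the caller already has in progress through the internal cursor is not disturbed.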
zval_add_ref(&val);
zend_hash_index_update(Z_ARRVAL_P(return_value), Z_LVAL_PP(entry), &val, sizeof(zval *), NULL);
This is fairly straightforward: first the refcount of the value is incremented (adding a value to a hash table means adding another reference to it), and then the value is inserted into the hash table. The arguments of the zend_hash_index_update macro are: the hash table to update, Z_ARRVAL_P(return_value); the integer index, Z_LVAL_PP(entry); the value, &val; the size of the value, sizeof(zval *); and a destination pointer (which we do not need here, so it is NULL).
The branch for non-integer keys is a little more complicated:
zval key, *key_ptr = *entry;

if (Z_TYPE_PP(entry) != IS_STRING) {
    key = **entry;
    zval_copy_ctor(&key);
    convert_to_string(&key);
    key_ptr = &key;
}

zval_add_ref(&val);
zend_symtable_update(Z_ARRVAL_P(return_value), Z_STRVAL_P(key_ptr), Z_STRLEN_P(key_ptr) + 1, &val, sizeof(zval *), NULL);

if (key_ptr != *entry) {
    zval_dtor(&key);
}
First, the key is converted to a string with convert_to_string (unless it already is a string). Before that, the entry is copied into the new key variable with key = **entry. In addition, zval_copy_ctor has to be called, because otherwise complex values (such as strings or arrays) would not be copied correctly.
The copy is necessary to make sure that the type conversion does not change the original array. Without the copy, the cast would modify not only the local variable but also the value in the keys array (which would obviously be very unexpected for the user).
Naturally the copy has to be destroyed again once it is no longer needed, which is what zval_dtor(&key) does at the end of the branch. The difference between zval_ptr_dtor and zval_dtor is that zval_ptr_dtor first decrements the refcount and only destroys the zval once the refcount reaches 0, whereas zval_dtor destroys it immediately, regardless of the refcount. That is why zval_ptr_dtor is used for "normal" values, while zval_dtor is used for temporary variables that are not referenced anywhere else. In addition, zval_ptr_dtor frees the zval itself after destroying it, whereas zval_dtor does not. Since key lives on the stack and was never malloc()ed, nothing needs to be free()d, so zval_dtor is the right choice here as well.
Now let's take a look at the remaining two lines (the important ones):
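The division of labor between the two destructors can be modeled with a toy reference-counted value in C. This is a sketch of the idea only, not the Zend implementation: toy_val, toy_new, toy_dtor and toy_ptr_dtor are invented names, and the return value of toy_ptr_dtor exists purely so the behavior can be observed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct toy_val {
    int refcount;
    char *str; /* heap-allocated content */
};

/* Allocate a value on the heap with refcount 1; returns NULL on failure. */
static struct toy_val *toy_new(const char *s) {
    struct toy_val *v = malloc(sizeof *v);
    if (v == NULL) return NULL;
    v->str = malloc(strlen(s) + 1);
    if (v->str == NULL) {
        free(v);
        return NULL;
    }
    strcpy(v->str, s);
    v->refcount = 1;
    return v;
}

/* Like zval_dtor in spirit: destroy the content right away, ignore the
   refcount, and do not free the struct itself (it may live on the stack). */
static void toy_dtor(struct toy_val *v) {
    free(v->str);
    v->str = NULL;
}

/* Like zval_ptr_dtor in spirit: drop one reference, and only when none are
   left destroy the content and free the struct. Returns 1 if it was freed. */
static int toy_ptr_dtor(struct toy_val *v) {
    assert(v->refcount > 0);
    if (--v->refcount > 0) return 0;
    toy_dtor(v);
    free(v);
    return 1;
}
```

A value shared by two owners survives the first toy_ptr_dtor call and is released by the second. A temporary that lives on the stack is cleaned up with toy_dtor alone, since there is no heap-allocated struct to free.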
zval_add_ref(&val);
zend_symtable_update(Z_ARRVAL_P(return_value), Z_STRVAL_P(key_ptr), Z_STRLEN_P(key_ptr) + 1, &val, sizeof(zval *), NULL);
This is very similar to what the integer-key branch does. The difference is that zend_symtable_update is called instead of zend_hash_index_update, and that the key string and its length are passed instead of an integer index.
Symbol table
The "normal" function for inserting string key values to the hash table is zend_hash_update, but zend_symtable_update is used here. What are their differences?
A symbol table is simply a special kind of hash table, and it is the kind used for arrays. What distinguishes it from a plain hash table is how it handles integer-like string keys: in a symbol table, the keys "123" and 123 are considered the same. So if you store a value in $array["123"], you can later retrieve it with $array[123].
Under the hood this could be implemented in two ways: either use "123" to store both 123 and "123", or use 123 to store both keys. PHP chose the latter (because integers are faster than strings and take less space).
You can have some fun with symbol tables if you manage to insert a "123" key without it being converted to 123. One way to do this is to abuse the object-to-array cast:
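The key normalization can be sketched as a small C function that decides whether a string key should be stored as an integer instead. This is an approximation written for this article, not the engine's macro: key_as_integer is a hypothetical name, it works on NUL-terminated strings rather than on a pointer plus length, and the exact rules (only canonical decimal integers that fit into a long qualify) are stated here as assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Returns 1 and stores the number in *out if key is a canonical decimal
   integer such as "123" or "-5". Strings like "0123", "-0", "1.5", " 1"
   or "" stay string keys and yield 0. */
static int key_as_integer(const char *key, long *out) {
    const char *p = key;
    assert(key != NULL && out != NULL);
    if (*p == '-') p++;
    if (*p < '0' || *p > '9') return 0;                    /* needs a leading digit */
    if (*p == '0' && (p[1] != '\0' || p != key)) return 0; /* no leading zeros, no "-0" */
    for (const char *q = p; *q != '\0'; q++)
        if (*q < '0' || *q > '9') return 0;                /* digits only */
    errno = 0;
    long v = strtol(key, NULL, 10);
    if (errno == ERANGE) return 0;                         /* does not fit into a long */
    *out = v;
    return 1;
}
```

A symbol-table style update would call such a check first and then go down the integer-index path when it succeeds, and the string-key path otherwise. That is why "123" and 123 end up addressing the same element, while "0123" remains a distinct string key.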
$obj = new stdClass;
$obj->{123} = "foo";
$arr = (array) $obj;
var_dump($arr[123]);   // Undefined offset: 123
var_dump($arr["123"]); // Undefined offset: 123
Object properties are always stored under string keys, even when they look like numbers. The line $obj->{123} = 'foo' therefore actually stores 'foo' under the key "123". The array cast does not change this key. However, both $arr[123] and $arr["123"] try to access the integer key 123 (not the existing string key "123"), so both raise an error. Congratulations, you have created a hidden array element.
The next part will again be published on ircmaxell's blog. It will cover how objects and classes work internally.
About the Author: hoohack