[Translation] Understanding PHP's internal array implementation (PHP's Source Code for PHP Developers - Part 4)

Article from: http://www.aintnot.com/2016/02/15/understanding-phps-internal-array-implementation-ch

Original article: https://nikic.github.io/2012/03/28/Understanding-PHPs-internal-array-implementation.html

Welcome to the fourth part of the "PHP's Source Code for PHP Developers" series, in which we look at how PHP arrays are represented internally and used throughout the code base.

In case you missed the previous parts, here are the links:

  • Part 1: PHP source code for PHP developers-source code structure

  • Part 2: Understanding the definition of PHP internal functions

  • Part 3: PHP variable implementation

Everything is a hash table

Basically, everything in PHP is a hash table. Hash tables are not only used in the implementation of PHP arrays; they also store object properties, methods, functions, variables, and pretty much everything else.

Because the hash table is so fundamental to PHP, it is worth taking a closer look at how it works.

So what is a hash table?

Remember that in C, arrays are just chunks of memory that you access by offset. This means C arrays can only have integer keys, and those keys are contiguous (that is, you cannot have a key 1332423442 right after a key 0). C has nothing like an associative array.

Hash tables work as follows: they run the string key through a hash function, which returns an integer. That integer can then be used as an offset into a "normal" C array (that is, a chunk of memory). The problem is that hash functions can collide, meaning that multiple string keys can produce the same hash. For example, in PHP arrays with up to 64 elements, the strings "foo" and "oof" end up in the same slot, because their hashes are identical once reduced to the table size.

This problem is solved by storing the possibly colliding values in a linked list, rather than storing the value directly at the generated offset.

HashTable and Bucket

Now that the basic concept of a hash table is clear, let's look at the structures PHP uses to implement it:

typedef struct _hashtable {
    uint nTableSize;
    uint nTableMask;
    uint nNumOfElements;
    ulong nNextFreeElement;
    Bucket *pInternalPointer;
    Bucket *pListHead;
    Bucket *pListTail;
    Bucket **arBuckets;
    dtor_func_t pDestructor;
    zend_bool persistent;
    unsigned char nApplyCount;
    zend_bool bApplyProtection;
#if ZEND_DEBUG
    int inconsistent;
#endif
} HashTable;

A quick rundown:

nNumOfElements is the number of values currently stored in the array. This is also the number that count($array) returns.

nTableSize is the capacity of the hash table, that is, the size of the internal C array. It is usually the next power of two that is greater than or equal to nNumOfElements. For example, if the array stores 32 elements, the hash table also has a capacity of 32. But if one more element is added, so that the array contains 33 elements, the hash table is resized to a capacity of 64.

This is done to keep the hash table efficient in both space and time. Obviously, if the hash table is too small there will be many collisions and performance will degrade. If it is too large, on the other hand, memory is wasted. Power-of-two sizes are a good compromise.

nTableMask is the capacity of the hash table minus one. This mask is used to adjust the generated hash to the current table size. For example, the real hash of "foo" (using the DJBX33A hash function) is 193491849. With a table of capacity 64 we obviously cannot use that as an offset into the array. Instead, we apply the table mask, which keeps only the low bits of the hash.

hash           |        193491849 |     0b1011100010000111001110001001
& mask         | &             63 | &   0b0000000000000000000000111111
---------------------------------------------------------
= index        | = 9              | =   0b0000000000000000000000001001

nNextFreeElement is the next free integer key. It is used when you write $array[] = xyz.

pInternalPointer stores the current position in the array. This is the position used during iteration, and you can work with it through the reset(), current(), key(), next(), prev() and end() functions.

pListHead and pListTail point to the first and last elements of the array. Remember: PHP arrays are ordered collections. For example, ['foo' => 'bar', 'bar' => 'foo'] and ['bar' => 'foo', 'foo' => 'bar'] contain the same elements, but in a different order.

arBuckets is the "hash table" (the internal C array) we have been talking about. It is declared as Bucket **, so it can be seen as an array of bucket pointers (we will get to what a Bucket is in a moment).

pDestructor is the value destructor. Whenever a value is removed from the hash table, this function is called on it. For arrays, the usual destructor is zval_ptr_dtor. zval_ptr_dtor decrements the reference count of the zval and, if it reaches 0, destroys and frees it.

The last four members are not that important to us. In short: persistent means the hash table can survive multiple requests; nApplyCount and bApplyProtection protect against infinite recursion; inconsistent is used in debug mode to catch incorrect use of the hash table.

Let's continue with the second important structure, the Bucket:

typedef struct bucket {
    ulong h;
    uint nKeyLength;
    void *pData;
    void *pDataPtr;
    struct bucket *pListNext;
    struct bucket *pListLast;
    struct bucket *pNext;
    struct bucket *pLast;
    const char *arKey;
} Bucket;

h is the hash value (before the mask is applied).

arKey holds the string key and nKeyLength is its length. For integer keys neither of the two is used.

pData and pDataPtr are used to store the actual value. For PHP arrays the value is a zval (but hash tables are used for other things too). Don't worry about why there are two members; the difference between them is who is responsible for freeing the value.

pListNext and pListLast point to the next and previous element of the array. When PHP wants to traverse the array in order, it starts at the pListHead bucket (from the HashTable structure) and then follows the pListNext pointers. Reverse traversal works the same way, starting at pListTail and following the pListLast pointers. (You can get this behavior in PHP code by calling end() and then repeatedly calling prev().)

pNext and pLast form the "linked list of possibly colliding values" mentioned above. The arBuckets array stores a pointer to the first possible bucket. If that bucket does not have the right key, PHP looks at the bucket pNext points to, and keeps following the chain until it finds the right bucket. pLast does the same in reverse.

As you can see, the implementation of hash tables in PHP is quite complicated. This is the cost of using an ultra-flexible array type.

How is a hash table used?

The Zend Engine defines a large number of API functions for working with hash tables. You can find the low-level hash table functions in zend_hash.h. In addition, the Zend Engine defines a slightly higher-level API in zend_API.h.

We don't have time to go through all of these functions, but we can at least look at a sample function to see some of them in action. We will use array_fill_keys as the example.

You can easily find the function in ext/standard/array.c. Now let's quickly walk through it.

Like most functions, it starts with a bunch of variable declarations, followed by a call to zend_parse_parameters:

zval *keys, *val, **entry;
HashPosition pos;

if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "az", &keys, &val) == FAILURE) {
    return;
}

Obviously, the "az" specifier means that the first parameter is an array (stored in the variable keys) and the second parameter is an arbitrary zval (stored in the variable val).

After the parameters are parsed, the returned array is initialized:

/* Initialize return array */
array_init_size(return_value, zend_hash_num_elements(Z_ARRVAL_P(keys)));

This line contains three important parts of the array API:

1. The Z_ARRVAL_P macro extracts the hash table from a zval.

2. zend_hash_num_elements returns the number of elements in a hash table (the nNumOfElements member).

3. array_init_size initializes an array with a size hint.

So this line initializes an array into return_value with the same size as the keys array.

The size is only an optimization. The function could also just call array_init(return_value), in which case PHP would have to resize the array several times as more and more elements are added. By specifying an explicit size, PHP allocates the right amount of memory from the start.

After the return array is initialized, the function iterates over the keys array with a while loop that has roughly the following structure:

zend_hash_internal_pointer_reset_ex(Z_ARRVAL_P(keys), &pos);
while (zend_hash_get_current_data_ex(Z_ARRVAL_P(keys), (void **)&entry, &pos) == SUCCESS) {
    // some code

    zend_hash_move_forward_ex(Z_ARRVAL_P(keys), &pos);
}

This can be easily translated into PHP code:

reset($keys);
while (null !== $entry = current($keys)) {
    // some code

    next($keys);
}

Which is the same as:

foreach ($keys as $entry) {
    // some code
}

The only real difference is that the C traversal does not use the internal array pointer, but keeps the current position in its own pos variable.

The code inside the loop has two branches: one for integer keys and one for all other keys. The integer key branch consists of just these two lines:

zval_add_ref(&val);
zend_hash_index_update(Z_ARRVAL_P(return_value), Z_LVAL_PP(entry), &val, sizeof(zval *), NULL);

This is pretty straightforward: first the reference count of the value is incremented (adding a value to a hash table means adding another reference to it), then the value is inserted into the hash table. The arguments of the zend_hash_index_update macro are the hash table to update (Z_ARRVAL_P(return_value)), the integer index (Z_LVAL_PP(entry)), the value (&val), the size of the value (sizeof(zval *)) and the destination pointer (which we don't care about, hence NULL).

The branch for non-integer keys is a little more complicated:

zval key, *key_ptr = *entry;

if (Z_TYPE_PP(entry) != IS_STRING) {
    key = **entry;
    zval_copy_ctor(&key);
    convert_to_string(&key);
    key_ptr = &key;
}

zval_add_ref(&val);
zend_symtable_update(Z_ARRVAL_P(return_value), Z_STRVAL_P(key_ptr), Z_STRLEN_P(key_ptr) + 1, &val, sizeof(zval *), NULL);

if (key_ptr != *entry) {
    zval_dtor(&key);
}

First the key is converted to a string using convert_to_string (unless it already is one). Before that, entry is copied into the new key variable; this is what the key = **entry line does. In addition, zval_copy_ctor has to be called, otherwise complex structures (such as strings or arrays) would not be copied correctly.

The copy is necessary to make sure that the type conversion does not change the original array. Without it, the cast would modify not just the local variable but also the value in the keys array (which would obviously be very unexpected).

Of course the copy has to be destroyed again afterwards, which is what zval_dtor(&key) does. The difference between zval_ptr_dtor and zval_dtor is that zval_ptr_dtor only destroys the zval when its refcount drops to 0, whereas zval_dtor destroys it right away, without looking at the refcount. That's why you see zval_ptr_dtor used for "normal" values and zval_dtor used for temporary variables that are not used anywhere else. Also, zval_ptr_dtor frees the zval itself after destroying it, whereas zval_dtor does not. As we did not malloc() anything, we do not need to free() either, so zval_dtor is the right choice in this respect too.

Now let's look at the two remaining lines (the important ones ^^):

zval_add_ref(&val);
zend_symtable_update(Z_ARRVAL_P(return_value), Z_STRVAL_P(key_ptr), Z_STRLEN_P(key_ptr) + 1, &val, sizeof(zval *), NULL);

This is very similar to what the integer key branch does. The difference is that zend_symtable_update is called instead of zend_hash_index_update, and that the key string and its length are passed.

Symbol table

The "normal" function for inserting string keys into a hash table is zend_hash_update, but here zend_symtable_update is used instead. What is the difference?

A symbol table (symtable) is simply a special kind of hash table, the kind used for arrays. What distinguishes it from a raw hash table is how it handles numeric keys: in a symbol table the keys "123" and 123 are considered the same. So if you store a value in $array["123"], you can retrieve it later using $array[123].

This could be implemented in two ways: either use "123" to store both 123 and "123", or use 123 to store both keys. PHP obviously chose the latter (as integers are faster than strings and take less space).

You can have some fun with symbol tables if you manage to insert the key "123" without it being converted to 123. One way to do this is to abuse the object-to-array cast:

$obj = new stdClass;
$obj->{123} = "foo";
$arr = (array) $obj;
var_dump($arr[123]); // Undefined offset: 123
var_dump($arr["123"]); // Undefined offset: 123

Object properties are always stored with string keys, even if they are numeric. So the line $obj->{123} = 'foo' actually stores 'foo' under the key "123", not 123. The array cast does not change this. But both $arr[123] and $arr["123"] try to access the key 123 (not the existing "123" key), so both raise a notice. Congratulations, you have created a hidden array element.

Next part

The next part will again be published on ircmaxell's blog. It will cover how objects and classes work internally.
