Understanding the implementation of arrays within PHP

Source: Internet
Author: User
Welcome to the fourth part of the "PHP Source for PHP Developers" series, in which we'll talk about how PHP arrays are represented internally and used throughout the code base. In case you missed the previous articles, here they are:

Part I: PHP Source for PHP Developers: Source Structure

Part II: Understanding the definition of PHP internal functions

Part III: PHP's variable implementation

Everything is a hash table.

Basically, everything in PHP is a hash table. Hash tables are not only used in the underlying PHP array implementation, they are also used to store object properties, methods, functions, variables, and almost everything else.

Because hash tables are so fundamental to PHP, it's worth taking a deeper look at how they work.

So, what is a hash table?

Remember, in C, arrays are just blocks of memory that you access by index. Therefore, arrays in C can only have integer keys, and the keys must be contiguous (that is, you cannot have key 0 followed directly by key 1332423442). There's no such thing as an associative array in C.

This is where hash tables come in: they use a hash function to convert a string key into a normal integer. The result can then be used as an index into a normal C array (that is, a block of memory). The problem is that hash functions have collisions, which means that several string keys can produce the same hash. For example, in a PHP array whose internal table has 32 slots or fewer, the strings "foo" and "oof" end up at the same index.

This is solved by storing potentially colliding values in a linked list, rather than storing the value directly at the computed index.
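To make the linked-list idea concrete, here is a minimal C sketch of a chained hash table. It is not PHP's implementation: the `toy_*` names, the fixed size of 8 slots, and the deliberately weak byte-sum hash are all assumptions chosen so that a collision is easy to produce. Buckets are owned by the caller, and duplicate keys are not checked.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_SLOTS 8 /* must be a power of two */

/* Hypothetical, simplified bucket: one key/value pair plus a chain link. */
struct toy_bucket {
    const char *key;
    int value;
    struct toy_bucket *next; /* next bucket that landed in the same slot */
};

struct toy_table {
    struct toy_bucket *slots[TOY_SLOTS]; /* each slot heads a linked list */
};

/* Deliberately weak hash (sum of bytes) so that collisions are easy to see. */
static unsigned toy_hash(const char *key)
{
    unsigned sum = 0;
    while (*key) {
        sum += (unsigned char)*key++;
    }
    return sum;
}

/* Prepend a caller-owned bucket to the chain of its slot. */
static void toy_insert(struct toy_table *t, struct toy_bucket *b)
{
    unsigned slot;
    assert(t != NULL && b != NULL && b->key != NULL);
    slot = toy_hash(b->key) & (TOY_SLOTS - 1);
    b->next = t->slots[slot];
    t->slots[slot] = b;
}

/* Walk the chain of the key's slot until the key matches; NULL if absent. */
static struct toy_bucket *toy_find(const struct toy_table *t, const char *key)
{
    struct toy_bucket *b = t->slots[toy_hash(key) & (TOY_SLOTS - 1)];
    while (b != NULL && strcmp(b->key, key) != 0) {
        b = b->next;
    }
    return b;
}
```

With this byte-sum hash, "foo" and "oof" always share a slot, yet both can still be found because the lookup compares the actual keys while walking the chain.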

Hashtable and Buckets

So, now that the basic concept of a hash table is clear, let's look at the hash table structure implemented within PHP:

    typedef struct _hashtable {
        uint nTableSize;
        uint nTableMask;
        uint nNumOfElements;
        ulong nNextFreeElement;
        Bucket *pInternalPointer;
        Bucket *pListHead;
        Bucket *pListTail;
        Bucket **arBuckets;
        dtor_func_t pDestructor;
        zend_bool persistent;
        unsigned char nApplyCount;
        zend_bool bApplyProtection;
    #if ZEND_DEBUG
        int inconsistent;
    #endif
    } HashTable;

Let's quickly go through the fields:

nNumOfElements is the number of values currently stored in the array. This is also the value returned by count($array).

nTableSize is the capacity of the internal C array. It is usually the next power of two greater than or equal to nNumOfElements. For example, if the array stores 32 elements, the internal C array also has 32 slots. If one more element is added, so that the array holds 33 elements, the internal C array is resized to 64 slots. This keeps the hash table efficient in both space and time: if the table is too small there will be many collisions and performance suffers, while a table that is too large wastes memory. Rounding up to a power of two is a good compromise.
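The rounding rule can be sketched in a few lines of C. The helper name `table_size_for` is made up for this illustration, and the sketch only shows the "next power of two" rule from the text; the real engine may apply a minimum size or compute the value differently.

```c
#include <assert.h>

/* Smallest power of two that is >= n (and at least 1).
 * Hypothetical helper; n is limited to 2^31 so the loop always terminates. */
static unsigned int table_size_for(unsigned int n)
{
    unsigned int size = 1;
    assert(n <= 0x80000000u);
    while (size < n) {
        size <<= 1; /* double until the elements fit */
    }
    return size;
}
```

Under these assumptions, `table_size_for(32)` yields 32 and `table_size_for(33)` yields 64, matching the example above.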

nTableMask is the table size minus one. This mask is used to fit the generated hash into the current table size. For example, the real hash of "foo" (using the DJBX33A hash function) is 193491849. With a table of 64 slots we obviously cannot use that number as an array index. Instead, the table mask is applied, so that only the low bits of the hash are kept:

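    hash           |        193491849 |     0b1011100010000111001110001001
    & mask         | &             63 | &   0b0000000000000000000000111111
    ----------------------------------------------------------------------
    = index        | =              9 | =   0b0000000000000000000000001001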

nNextFreeElement is the next free integer key, which is used when you write $array[] = xyz.

pInternalPointer stores the current position in the array. It is used during foreach traversal and can be accessed with the reset(), current(), key(), next(), prev(), and end() functions.

pListHead and pListTail point to the first and last elements of the array. Remember: PHP arrays are ordered. For example, ['foo' => 'bar', 'bar' => 'foo'] and ['bar' => 'foo', 'foo' => 'bar'] contain the same elements, but in a different order.

arBuckets is the "internal C array" we have been talking about. It is declared as Bucket **, so it can be seen as an array of bucket pointers (we'll talk about what a bucket is in a moment).

pDestructor is the destructor for the values. If a value is removed from the hash table, this function is called on it. For normal arrays the destructor is zval_ptr_dtor. zval_ptr_dtor decrements the reference count of the zval and, if it reaches 0, destroys and frees it.

The last four fields are not that important for us. In short: persistent specifies whether the hash table can survive across multiple requests, nApplyCount and bApplyProtection prevent multiple recursion, and inconsistent is used to catch illegal uses of the hash table in debug mode.

Let's continue with the second important structure, the Bucket:

    typedef struct bucket {
        ulong h;
        uint nKeyLength;
        void *pData;
        void *pDataPtr;
        struct bucket *pListNext;
        struct bucket *pListLast;
        struct bucket *pNext;
        struct bucket *pLast;
        const char *arKey;
    } Bucket;

h is the hash value (before the table mask is applied).

arKey holds the string key and nKeyLength is its length. For integer keys, neither of these fields is used.

pData and pDataPtr are used to store the actual value. For PHP arrays the value is a zval struct (but hash tables are used elsewhere with other value types too). Don't worry about why there are two fields; the difference between them is who is responsible for freeing the value.

pListNext and pListLast point to the next and the previous element of the array. When PHP iterates over the array in order, it starts at the pListHead bucket (in the HashTable structure) and follows the pListNext pointers. Iterating in reverse works the same way, starting at pListTail and following the pListLast pointers. (In userland code you get this behavior by calling end() and then repeatedly calling prev().)
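The ordering mechanism can be sketched independently of the hashing part. The following C fragment keeps only the insertion-order links; the `ord_*` names and the reduced structs are assumptions for illustration, with `list_next`, `list_last`, `head`, and `tail` playing the roles of pListNext, pListLast, pListHead, and pListTail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down element: only the insertion-order links. */
struct ord_elem {
    int value;
    struct ord_elem *list_next; /* element inserted after this one */
    struct ord_elem *list_last; /* element inserted before this one */
};

struct ord_list {
    struct ord_elem *head; /* first inserted element */
    struct ord_elem *tail; /* last inserted element */
};

/* Append a caller-owned element at the tail, preserving insertion order. */
static void ord_append(struct ord_list *l, struct ord_elem *e)
{
    assert(l != NULL && e != NULL);
    e->list_next = NULL;
    e->list_last = l->tail;
    if (l->tail != NULL) {
        l->tail->list_next = e;
    } else {
        l->head = e; /* first element in the list */
    }
    l->tail = e;
}

/* Copy values front to back into out[]; returns how many were written. */
static size_t ord_forward(const struct ord_list *l, int *out, size_t max)
{
    size_t n = 0;
    for (const struct ord_elem *e = l->head; e != NULL && n < max; e = e->list_next) {
        out[n++] = e->value;
    }
    return n;
}

/* Copy values back to front into out[]; returns how many were written. */
static size_t ord_backward(const struct ord_list *l, int *out, size_t max)
{
    size_t n = 0;
    for (const struct ord_elem *e = l->tail; e != NULL && n < max; e = e->list_last) {
        out[n++] = e->value;
    }
    return n;
}
```

Because every element is linked to its neighbors in insertion order, iteration never has to look at the slots of the hash table, which is why the order of a PHP array does not depend on the hash values of its keys.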

pNext and pLast form the "linked list of potentially colliding values" that I mentioned above. The arBuckets array stores a pointer to the first candidate bucket. If that bucket does not have the right key, PHP looks at the bucket pointed to by pNext, and keeps following pNext until it finds the right bucket. pLast does the same in reverse.

As you can see, PHP's hash table implementation is quite complex. This is the price you pay for such a flexible array type.

How is a hash table used?

The Zend engine defines a large number of API functions for working with hash tables. An overview of the low-level hash table functions can be found in the zend_hash.h file. In addition, the Zend engine defines a slightly higher-level API in the zend_API.h file.

We don't have time to cover all of these functions, but we can at least look at a sample function to see how they are used. We will use array_fill_keys as the example.

Using the techniques from the second part, you can easily find that the function is defined in the ext/standard/array.c file. Let's take a quick look at it.

Like most functions, it starts with a bunch of variable declarations followed by a call to zend_parse_parameters:

    zval *keys, *val, **entry;
    HashPosition pos;

    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "az", &keys, &val) == FAILURE) {
        return;
    }

The "az" type string says that the first parameter is an array (stored in the variable keys) and the second parameter is an arbitrary zval (stored in the variable val).

After parsing the parameters, the returned array is initialized:

    /* Initialize return array */
    array_init_size(return_value, zend_hash_num_elements(Z_ARRVAL_P(keys)));

This line contains three important parts of the array API:

1. The Z_ARRVAL_P macro extracts the hash table from a zval.

2. zend_hash_num_elements returns the number of elements in the hash table (the nNumOfElements field).

3. array_init_size initializes an array with a size hint.

So this line initializes an array into return_value with the same size as the keys array.

The size here is only an optimization. The function could also just call array_init(return_value), in which case PHP would have to resize the array several times as more and more elements are added. By specifying the size up front, PHP allocates the right amount of memory from the start.
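To get a feeling for what the size hint saves, here is a small C model that counts how often a doubling table has to grow. It is a sketch under stated assumptions, not engine code: the function name `count_resizes` is hypothetical, and the model simply doubles the capacity whenever a new element no longer fits.

```c
#include <assert.h>

/* Count how often a table that doubles when full has to grow while
 * `inserts` elements are added, starting from `initial_size` slots.
 * Hypothetical model: initial_size must be a non-zero power of two and
 * inserts is limited to 2^30 to keep the arithmetic free of overflow. */
static unsigned int count_resizes(unsigned int initial_size, unsigned int inserts)
{
    unsigned int size = initial_size;
    unsigned int resizes = 0;
    assert(size != 0 && (size & (size - 1)) == 0);
    assert(inserts <= (1u << 30));
    for (unsigned int n = 1; n <= inserts; n++) {
        if (n > size) { /* the n-th element no longer fits */
            size <<= 1;
            resizes++;
        }
    }
    return resizes;
}
```

In this model, filling a table that starts with 8 slots with 1000 elements takes 7 resizes, while a table created with 1024 slots needs none.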

After the return array has been initialized, the function loops over the keys array with a while loop, using a code structure like the following:

    zend_hash_internal_pointer_reset_ex(Z_ARRVAL_P(keys), &pos);
    while (zend_hash_get_current_data_ex(Z_ARRVAL_P(keys), (void **)&entry, &pos) == SUCCESS) {
        // some code

        zend_hash_move_forward_ex(Z_ARRVAL_P(keys), &pos);
    }

This can be easily translated into PHP code:

    reset($keys);
    while (null !== $entry = current($keys)) {
        // some code

        next($keys);
    }

Which is the same as:

    foreach ($keys as $entry) {
        // some code
    }

The only difference is that the C loop does not use the internal array pointer, but stores the current position in its own pos variable.

The code inside the loop is divided into two branches: one for numeric keys and another for other keys. The branch of a numeric key value has only the following two lines of code:

    zval_add_ref(&val);
    zend_hash_index_update(Z_ARRVAL_P(return_value), Z_LVAL_PP(entry), &val, sizeof(zval *), NULL);

This looks straightforward enough: first the refcount of the value is incremented (adding the value to the hash table means adding another reference to it), and then the value is inserted into the hash table. The arguments of the zend_hash_index_update macro are the hash table to update, Z_ARRVAL_P(return_value); the integer index, Z_LVAL_PP(entry); the value, &val; the size of the value, sizeof(zval *); and a destination pointer (which we don't care about, hence NULL).

The branch of a non-numeric subscript is slightly more complicated:

    zval key, *key_ptr = *entry;

    if (Z_TYPE_PP(entry) != IS_STRING) {
        key = **entry;
        zval_copy_ctor(&key);
        convert_to_string(&key);
        key_ptr = &key;
    }

    zval_add_ref(&val);
    zend_symtable_update(Z_ARRVAL_P(return_value), Z_STRVAL_P(key_ptr), Z_STRLEN_P(key_ptr) + 1, &val, sizeof(zval *), NULL);

    if (key_ptr != *entry) {
        zval_dtor(&key);
    }

First, the key is converted to a string using convert_to_string (unless it already is one). Before that, the entry is copied into the new key variable, which is what the line key = **entry does. In addition, zval_copy_ctor has to be called, otherwise complex structures (such as strings or arrays) would not be copied correctly.

The copy is necessary to make sure that the type conversion does not change the original array. Without it, the cast would modify not only the local variable but also the value in the keys array (which would obviously be very surprising to the user).

Obviously, the copy has to be destroyed again once it is no longer needed, which is what zval_dtor(&key) does. The difference between zval_ptr_dtor and zval_dtor is that zval_ptr_dtor destroys the zval only when its refcount drops to 0, whereas zval_dtor destroys it immediately without looking at the refcount. That's why you see zval_ptr_dtor used on "normal" values and zval_dtor on temporary variables that cannot be used anywhere else anyway. Furthermore, zval_ptr_dtor also frees the zval itself after destroying it, whereas zval_dtor does not. As we didn't malloc() anything, we don't need to free() anything either, so zval_dtor is the right choice in this respect too.
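The distinction between the two destructors can be modeled with a toy refcounted value in C. Everything here is hypothetical: `struct toy_val` stands in for a string zval, `toy_dtor` mimics the spirit of zval_dtor, and `toy_ptr_dtor` mimics the spirit of zval_ptr_dtor. None of it is the engine's actual code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a refcounted string value. */
struct toy_val {
    char *str;             /* heap-allocated contents, or NULL */
    unsigned int refcount; /* number of holders of this value */
};

/* Allocate a value on the heap with one reference; NULL on allocation failure. */
static struct toy_val *toy_new(const char *s)
{
    struct toy_val *v = malloc(sizeof *v);
    if (v == NULL) {
        return NULL;
    }
    v->str = malloc(strlen(s) + 1);
    if (v->str == NULL) {
        free(v);
        return NULL;
    }
    strcpy(v->str, s);
    v->refcount = 1;
    return v;
}

/* In the spirit of zval_dtor: release the contents right away, ignore the
 * refcount, and leave the struct itself alone (it may live on the stack). */
static void toy_dtor(struct toy_val *v)
{
    free(v->str);
    v->str = NULL;
}

/* In the spirit of zval_ptr_dtor: drop one reference; only when none remain,
 * release the contents and the heap-allocated struct itself.
 * Returns 1 if the value was freed, 0 if it is still referenced. */
static int toy_ptr_dtor(struct toy_val *v)
{
    assert(v != NULL && v->refcount > 0);
    if (--v->refcount > 0) {
        return 0;
    }
    toy_dtor(v);
    free(v);
    return 1;
}
```

A stack-allocated temporary, like key in the loop above, is cleaned up with the `toy_dtor` style of call, while a value shared with a hash table goes through the `toy_ptr_dtor` style so that other holders keep a valid value.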

Now look at the remaining two lines (the important ones ^^):

    zval_add_ref(&val);
    zend_symtable_update(Z_ARRVAL_P(return_value), Z_STRVAL_P(key_ptr), Z_STRLEN_P(key_ptr) + 1, &val, sizeof(zval *), NULL);

This is very similar to what the integer key branch does. The difference is that zend_symtable_update is called instead of zend_hash_index_update, and that the key string and its length are passed.

Symbol table

The "normal" function for inserting string keys into a hash table is zend_hash_update, but here zend_symtable_update is used instead. What is the difference?

A symbol table is simply a special kind of hash table, and it is the kind used for arrays. It differs from a plain hash table in how it handles numeric keys: in a symbol table, "123" and 123 are considered the same key. So if you store a value in $array["123"], you can later fetch it using $array[123].

Under the hood this could be implemented in two ways: either use "123" to store both 123 and "123", or use 123 to store both. PHP chose the latter (because integers are faster and take less space than strings).
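The normalization step can be sketched as a C function that decides whether a string key should be treated as an integer key. This is a simplified illustration, not the engine's check: the name `key_as_integer` is made up, and the rules (optional minus sign, no leading zeros, no "-0", value must fit in a long) are an assumption modeled on the key casting rules described in the PHP manual.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical check in the spirit of a symbol table: if `key` is the
 * canonical decimal form of an integer, store it in *out and return 1.
 * Otherwise return 0, meaning the key stays a string.
 * Simplification: values below -LONG_MAX are rejected. */
static int key_as_integer(const char *key, long *out)
{
    const char *p = key;
    int negative = 0;
    unsigned long value = 0;
    const unsigned long limit = (unsigned long)LONG_MAX;

    assert(key != NULL && out != NULL);
    if (*p == '-') {
        negative = 1;
        p++;
    }
    if (*p < '0' || *p > '9') {
        return 0; /* empty string, lone "-", or not a digit */
    }
    if (*p == '0' && (p[1] != '\0' || negative)) {
        return 0; /* "012" or "-0": not in canonical form */
    }
    for (; *p != '\0'; p++) {
        unsigned long digit;
        if (*p < '0' || *p > '9') {
            return 0; /* "12a", "1.5" */
        }
        digit = (unsigned long)(*p - '0');
        if (value > (limit - digit) / 10) {
            return 0; /* would not fit in a long */
        }
        value = value * 10 + digit;
    }
    *out = negative ? -(long)value : (long)value;
    return 1;
}
```

Under these rules "123" becomes the integer key 123, while "0123", "1.5", and "foo" remain string keys.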

If you somehow manage to insert "123" into a symbol table without it being cast to 123, something funny happens. One way to do this is the object-to-array cast:

    $obj = new stdClass;
    $obj->{123} = "foo";
    $arr = (array) $obj;
    var_dump($arr[123]);   // Undefined offset: 123
    var_dump($arr["123"]); // Undefined offset: 123

Object properties are always stored with string keys, even if they are numbers. So the line $obj->{123} = 'foo' actually saves 'foo' under the "123" index, not the 123 index. The array cast does not change this. But since both $arr[123] and $arr["123"] try to access the 123 index (not the existing "123" index), both raise an "Undefined offset" notice. So, congratulations, you have created a hidden array element.

The next part of the series will again be published on ircmaxell's blog. It describes how objects and classes work internally.


About the Author: hoohack
