21-Understanding the hash table in Zend
The Zend engine in PHP relies on one data structure almost everywhere: the HashTable. It is the core of PHP's data storage, and constants, variables, functions, classes, objects, and more are all organized with it.
A HashTable is what data structure textbooks usually call a hash table. The basic principle is simple (if you are unfamiliar with it, consult any data structures textbook or search online), but PHP's implementation has its own distinctive features. Understanding how the HashTable stores data is essential for reading the PHP source code, especially the virtual machine implementation in the Zend engine, because it lets us build a complete mental picture of that virtual machine. It is also the foundation for other PHP data structures such as arrays.
The Zend HashTable implementation combines the advantages of a doubly linked list and a vector (array), which gives PHP a very efficient mechanism for storing and looking up data.
Let's begin!
Data structure of Hashtable
The HashTable implementation in the Zend engine lives mainly in two files, zend_hash.h and zend_hash.c. The Zend HashTable consists of two main data structures: the Bucket structure and the HashTable structure. A Bucket is a container that stores one data element, while the HashTable structure provides the mechanism for managing all of these buckets.
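typedef struct bucket {
    ulong h;                    /* Used for numeric indexing */
    uint nKeyLength;            /* Length of the key */
    void *pData;                /* Pointer to the data stored in the bucket */
    void *pDataPtr;             /* Pointer data */
    struct bucket *pListNext;   /* Next element in the HashTable's global bucket list */
    struct bucket *pListLast;   /* Previous element in the HashTable's global bucket list */
    struct bucket *pNext;       /* Next element in the chain of buckets with the same hash value */
    struct bucket *pLast;       /* Previous element in the chain of buckets with the same hash value */
    char arKey[1];              /* Key name; must be the last member */
} Bucket;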
In the Zend HashTable, each data element (Bucket) has a key that is unique across the whole HashTable and cannot be duplicated. A data element can therefore be identified uniquely by its key. The key takes one of two forms. The first form is the string arKey, whose length is nKeyLength. Notice that in the structure above arKey is declared as an array of only 1 character, but this does not mean that a key can only be a single character. Bucket is actually a variable-length structure: because arKey is the last member, the memory allocated for a bucket can extend past the end of the declared struct, and arKey together with nKeyLength identifies a key of nKeyLength bytes. This is a common technique in C programming. The second form is an index: nKeyLength is always 0 and the long integer field h is the key. In short, if nKeyLength = 0 the key is h; otherwise the key is arKey and its length is nKeyLength.
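The variable-length trick can be tried in isolation. The following is a minimal sketch rather than the engine's code: SketchBucket, sketch_bucket_new and sketch_bucket_has_string_key are hypothetical names, the struct is trimmed to the key-related fields, and plain malloc stands in for the engine's allocator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, trimmed-down bucket: only the fields that describe the key. */
typedef struct sketch_bucket {
    unsigned long h;          /* numeric key, or hash of arKey */
    unsigned int nKeyLength;  /* 0 means "the key is h" */
    char arKey[1];            /* must stay the last member */
} SketchBucket;

/* Allocate one block that holds the struct plus the whole key.
 * sizeof(SketchBucket) already counts one byte of arKey, hence the "- 1". */
SketchBucket *sketch_bucket_new(const char *key, unsigned int len, unsigned long h)
{
    SketchBucket *p;

    assert(key != NULL || len == 0);
    p = malloc(sizeof(SketchBucket) - 1 + len);
    if (!p) {
        return NULL;
    }
    p->h = h;
    p->nKeyLength = len;
    if (len > 0) {
        memcpy(p->arKey, key, len);  /* the key bytes follow the fixed fields */
    }
    return p;
}

/* A bucket has a string key exactly when nKeyLength is non-zero. */
int sketch_bucket_has_string_key(const SketchBucket *p)
{
    return p->nKeyLength > 0;
}
```

A caller would pass the key bytes and their length for a string key, or NULL and 0 for a numeric key, and release the bucket with free().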
When nKeyLength > 0, the value of h is not meaningless: it holds the hash value computed from arKey. However the hash function is designed, collisions are unavoidable, meaning that different arKey values may produce the same hash value. Buckets whose hash values map to the same index of the HashTable's arBuckets array (explained below) are stored in the same bucket chain. This chain is a doubly linked list whose previous and next elements are given by pLast and pNext, respectively. A newly inserted bucket is placed at the front of the chain.
In a bucket, the actual data is stored in the block of memory that pData points to, and that block is usually allocated separately. There is one exception: when the data held by the bucket is itself a pointer, the HashTable does not allocate extra space for it. Instead it stores the pointer directly in pDataPtr and makes pData point to the address of that struct member. This improves efficiency and reduces memory fragmentation, and it shows the care that went into the design of PHP's HashTable. If the data in the bucket is not a pointer, pDataPtr is NULL.
All buckets in the HashTable also form one doubly linked list through pListNext and pListLast. A newly inserted bucket is placed at the end of this list.
Note that, in general, a bucket does not record the size of the data it stores. Therefore, in PHP's implementation, the data stored in a bucket must be able to manage its own size.
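The pData / pDataPtr arrangement can be sketched as follows. This is an illustration under stated assumptions, not the engine's macro: SketchDataBucket, sketch_store and sketch_release are hypothetical names, malloc stands in for the engine's allocator, and "the data is a pointer" is assumed to be decided purely by the data size being equal to sizeof(void *).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical bucket reduced to its two data fields. */
typedef struct sketch_data_bucket {
    void *pData;     /* always points at the stored bytes */
    void *pDataPtr;  /* inline storage for pointer-sized data, otherwise NULL */
} SketchDataBucket;

/* Store a copy of `size` bytes from `data`.
 * Returns 0 on success and -1 if the allocation fails. */
int sketch_store(SketchDataBucket *p, const void *data, size_t size)
{
    assert(p != NULL && data != NULL && size > 0);
    if (size == sizeof(void *)) {
        /* Pointer-sized data lives inside the bucket: no extra allocation. */
        memcpy(&p->pDataPtr, data, sizeof(void *));
        p->pData = &p->pDataPtr;
        return 0;
    }
    p->pData = malloc(size);
    if (!p->pData) {
        return -1;
    }
    memcpy(p->pData, data, size);
    p->pDataPtr = NULL;
    return 0;
}

/* Free only what sketch_store() allocated separately. */
void sketch_release(SketchDataBucket *p)
{
    assert(p != NULL);
    if (p->pData != (void *)&p->pDataPtr) {
        free(p->pData);
    }
    p->pData = NULL;
    p->pDataPtr = NULL;
}
```

Either way the caller reads the stored value through pData, so code that consumes a bucket does not need to know which of the two layouts was used.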
typedef struct _hashtable {
    uint nTableSize;
    uint nTableMask;
    uint nNumOfElements;
    ulong nNextFreeElement;
    Bucket *pInternalPointer;
    Bucket *pListHead;
    Bucket *pListTail;
    Bucket **arBuckets;
    dtor_func_t pDestructor;
    zend_bool persistent;
    unsigned char nApplyCount;
    zend_bool bApplyProtection;
#if ZEND_DEBUG
    int inconsistent;
#endif
} HashTable;
In the HashTable structure, nTableSize specifies the size of the HashTable. It determines how many buckets the table can hold before it has to grow, and the larger the value, the more memory is allocated for the HashTable. To make index calculations efficient, nTableSize is automatically rounded up to the smallest power of 2 that is not less than the requested size. In other words, if the size you specify when initializing a HashTable is not a power of 2, it is adjusted automatically. That is:
nTableSize = 2^ceil(log2(nTableSize)), or equivalently nTableSize = pow(2, ceil(log(nTableSize, 2)))
For example, if you specify nTableSize = 11 when initializing a HashTable, the initialization routine automatically increases nTableSize to 16.
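The rounding rule can be checked with a few lines of C. This sketch uses a hypothetical function name, sketch_round_table_size; the floor of 8 and the cap at 0x80000000 are taken from the _zend_hash_init listing shown later in this article.

```c
#include <assert.h>

/* Round a requested size up to a power of two.
 * Starting at i = 3 gives a minimum table size of 8. */
unsigned int sketch_round_table_size(unsigned int nSize)
{
    unsigned int i = 3;
    unsigned int result;

    if (nSize >= 0x80000000u) {
        return 0x80000000u;          /* prevent overflow of the shift below */
    }
    while ((1u << i) < nSize) {
        i++;
    }
    result = 1u << i;
    assert((result & (result - 1)) == 0);  /* sanity check: a power of two */
    return result;
}
```

With this function, a request for 11 yields 16, a request for 8 stays at 8, and any request below 8 is raised to 8.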
arBuckets is the heart of the HashTable. The initialization routine requests a block of memory large enough for nTableSize pointers and assigns its address to arBuckets. We can think of arBuckets as an array of size nTableSize in which each element is a pointer to a bucket that actually holds data. Initially, of course, every pointer is NULL.
The value of nTableMask is always nTableSize - 1. The field exists for efficiency: it allows the index of a bucket's key in the arBuckets array to be calculated quickly.
nNumOfElements records the number of data elements currently stored in the HashTable. When nNumOfElements becomes greater than nTableSize, the HashTable automatically grows to twice its previous size.
nNextFreeElement records the next numeric index that will be used as the key when an element is inserted without an explicit index.
pListHead and pListTail point to the first and last elements of the global doubly linked list of buckets. The buckets are normally kept in insertion order, but the list can also be rearranged by the various sort functions. pInternalPointer records the current position while the HashTable is being traversed: it points to the bucket currently being visited, and its initial value is pListHead.
pDestructor is a function pointer. It is called automatically when an element of the HashTable is overwritten or removed, so that the related data can be cleaned up.
The persistent flag indicates how bucket memory is allocated. If persistent is true, buckets are allocated with the operating system's own memory allocation functions; otherwise PHP's memory allocator is used. See PHP's memory management for details.
nApplyCount and bApplyProtection together provide a mechanism that prevents endless recursion while a HashTable is being traversed.
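Why the mask works can be shown in a few lines. The function name sketch_slot_index below is hypothetical; the sketch relies only on the fact, stated above, that the table size is always a power of two.

```c
#include <assert.h>

/* Slot index for a hash value. Valid only when table_size is a power of two,
 * because only then does (table_size - 1) have all low bits set. */
unsigned int sketch_slot_index(unsigned long h, unsigned int table_size)
{
    unsigned int mask;

    assert(table_size != 0 && (table_size & (table_size - 1)) == 0);
    mask = table_size - 1;            /* the value nTableMask caches */
    return (unsigned int)(h & mask);  /* same result as h % table_size */
}
```

The bitwise AND gives the same result as a modulo by the table size, but avoids a division, and the result can never fall outside the array.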
The inconsistent member exists only when PHP is compiled as a debug build and is used for debugging. It represents the state of the HashTable, which is one of four values:
- HT_IS_DESTROYING: all content is being deleted, including arBuckets itself
- HT_DESTROYED: everything has been deleted, including arBuckets itself
- HT_CLEANING: everything that arBuckets points to is being cleared, but not arBuckets itself
- HT_OK: normal state, all data is consistent
typedef struct _zend_hash_key {
    char *arKey;        /* Key name of the hash element */
    uint nKeyLength;    /* Length of the key */
    ulong h;            /* Hash value computed from the key, or the directly specified numeric index */
} zend_hash_key;
With the above in mind, the zend_hash_key structure is easy to understand. Its three fields arKey, nKeyLength, and h together uniquely identify an element in the HashTable.
Based on this explanation of the HashTable data structures, we can draw a diagram of the HashTable memory layout:
[Figure: HashTable structure]
Implementation of the Zend HashTable
This section describes how the HashTable is implemented in PHP. The functions below are taken from zend_hash.c. Once the data structures above are fully understood, the implementation code is not difficult to follow.
1. HashTable initialization
HashTable provides the zend_hash_init macro to initialize a HashTable. It is actually implemented by the following internal function:
ZEND_API int _zend_hash_init(HashTable *ht, uint nSize, hash_func_t pHashFunction, dtor_func_t pDestructor, zend_bool persistent ZEND_FILE_LINE_DC)
{
    uint i = 3;
    Bucket **tmp;

    SET_INCONSISTENT(HT_OK);

    if (nSize >= 0x80000000) {
        /* prevent overflow */
        ht->nTableSize = 0x80000000;
    } else {
        while ((1U << i) < nSize) {
            /* Round nTableSize up to a power of 2 */
            i++;
        }
        ht->nTableSize = 1 << i;    /* i is at least 3, so the minimum HashTable size is 8 */
    }

    ht->nTableMask = ht->nTableSize - 1;
    ht->pDestructor = pDestructor;
    ht->arBuckets = NULL;
    ht->pListHead = NULL;
    ht->pListTail = NULL;
    ht->nNumOfElements = 0;
    ht->nNextFreeElement = 0;
    ht->pInternalPointer = NULL;
    ht->persistent = persistent;
    ht->nApplyCount = 0;
    ht->bApplyProtection = 1;

    /* Allocate arBuckets in the way selected by persistent, with all pointers initialized to NULL */
    /* Uses ecalloc() so that Bucket* == NULL */
    if (persistent) {
        tmp = (Bucket **) calloc(ht->nTableSize, sizeof(Bucket *));
        if (!tmp) {
            return FAILURE;
        }
        ht->arBuckets = tmp;
    } else {
        tmp = (Bucket **) ecalloc_rel(ht->nTableSize, sizeof(Bucket *));
        if (tmp) {
            ht->arBuckets = tmp;
        }
    }

    return SUCCESS;
}
In earlier versions, the pHashFunction parameter could be used to specify the hash function. PHP now always uses the DJBX33A algorithm, so the pHashFunction parameter is no longer used and is kept only for compatibility with older code.
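For readers who have not met DJBX33A ("Daniel J. Bernstein, times 33 with addition"), here is a minimal sketch of the idea. It is not the engine's zend_inline_hash_func: the name sketch_djbx33a is hypothetical, the loop is not unrolled, the bytes are assumed to be treated as unsigned, and the start value 5381 is the one conventionally used for this hash.

```c
#include <assert.h>
#include <stddef.h>

/* Times-33 hash: for every byte, hash = hash * 33 + byte. */
unsigned long sketch_djbx33a(const char *key, size_t len)
{
    unsigned long hash = 5381UL;
    size_t i;

    assert(key != NULL || len == 0);
    for (i = 0; i < len; i++) {
        /* (hash << 5) + hash is hash * 33 without a multiplication. */
        hash = ((hash << 5) + hash) + (unsigned char)key[i];
    }
    return hash;
}
```

The result is what would be stored in the h field of a bucket with a string key, and it is reduced to an array index with the table mask.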
2. Adding, inserting, and modifying elements
The key step in adding a new element to the HashTable is determining where in the arBuckets array the element should go. From the explanation of the key in the Bucket structure, we know there are two ways to add a new element. The first inserts a bucket using a string as the key; the second inserts a bucket using an index as the key. The second way has two variants: with or without an explicit index. With an explicit index, the bucket is inserted under exactly that index; without one, the bucket is inserted under the index given by nNextFreeElement. These ways of inserting data are implemented similarly and differ only in how the bucket is located.
The method of modifying data in Hashtable is similar to the method of adding data.
Let's first look at the first method of adding or modifying buckets using a string as a key name:
ZEND_API int _zend_hash_add_or_update(HashTable *ht, char *arKey, uint nKeyLength, void *pData, uint nDataSize, void **pDest, int flag ZEND_FILE_LINE_DC)
{
    ulong h;
    uint nIndex;
    Bucket *p;

    IS_CONSISTENT(ht);    /* Debug information output */

    if (nKeyLength <= 0) {
#if ZEND_DEBUG
        ZEND_PUTS("zend_hash_update: Can't put in empty key\n");
#endif
        return FAILURE;
    }

    /* Use the hash function to compute the hash value of arKey */
    h = zend_inline_hash_func(arKey, nKeyLength);
    /* AND the hash value with nTableMask to get the element's index in arBuckets.
     * The bitwise AND with nTableMask guarantees that the resulting subscript
     * never goes out of the bounds of arBuckets. */
    nIndex = h & ht->nTableMask;

    p = ht->arBuckets[nIndex];    /* Get the bucket pointer for this index */
    /* Check whether the chain already contains the element (key, hash) */
    while (p != NULL) {
        if ((p->h == h) && (p->nKeyLength == nKeyLength)) {
            if (!memcmp(p->arKey, arKey, nKeyLength)) {
                if (flag & HASH_ADD) {
                    return FAILURE;    /* The element already exists, so it cannot be inserted */
                }
                HANDLE_BLOCK_INTERRUPTIONS();
#if ZEND_DEBUG
                if (p->pData == pData) {
                    ZEND_PUTS("Fatal error in zend_hash_update: p->pData == pData\n");
                    HANDLE_UNBLOCK_INTERRUPTIONS();
                    return FAILURE;
                }
#endif
                if (ht->pDestructor) {
                    /* The element exists: destruct the original data */
                    ht->pDestructor(p->pData);
                }
                /* Replace the original data with the new data */
                UPDATE_DATA(ht, p, pData, nDataSize);
                if (pDest) {
                    *pDest = p->pData;
                }
                HANDLE_UNBLOCK_INTERRUPTIONS();
                return SUCCESS;
            }
        }
        p = p->pNext;
    }

    /* No data for this key in the HashTable: add a new bucket */
    p = (Bucket *) pemalloc(sizeof(Bucket) - 1 + nKeyLength, ht->persistent);
    if (!p) {
        return FAILURE;
    }
    memcpy(p->arKey, arKey, nKeyLength);
    p->nKeyLength = nKeyLength;
    INIT_DATA(ht, p, pData, nDataSize);
    p->h = h;
    /* Add the bucket to the corresponding bucket chain */
    CONNECT_TO_BUCKET_DLLIST(p, ht->arBuckets[nIndex]);
    if (pDest) {
        *pDest = p->pData;
    }

    HANDLE_BLOCK_INTERRUPTIONS();
    /* Add the bucket to the HashTable's global doubly linked list */
    CONNECT_TO_GLOBAL_DLLIST(p, ht);
    ht->arBuckets[nIndex] = p;
    HANDLE_UNBLOCK_INTERRUPTIONS();

    ht->nNumOfElements++;
    ZEND_HASH_IF_FULL_DO_RESIZE(ht);    /* If the hash table is full, resize it */
    return SUCCESS;
}
Because this function inserts data using a string as the key, it first checks that nKeyLength is greater than 0 and returns with a failure if it is not. It then computes the hash value h of arKey and combines it with nTableMask using a bitwise AND to obtain the unsigned integer nIndex. This nIndex is the position in the arBuckets array where the bucket will be inserted.
The entry of arBuckets at that index is a pointer to a doubly linked list of buckets. If the list is not empty, the function first checks whether it already contains a bucket whose key is the string arKey. If such a bucket exists and the requested operation is an insertion (as indicated by flag), the function reports an error, because keys cannot be duplicated in a HashTable. If the bucket exists and the operation is a modification, the data that pData originally pointed to is destructed with the pDestructor function registered in the HashTable, the original data is replaced by the new data, and the function returns success.
If no data for the key is found in the HashTable, the data is wrapped in a new bucket, which is then inserted into the HashTable. Two macros deserve attention here:
- CONNECT_TO_BUCKET_DLLIST(p, ht->arBuckets[nIndex])
- CONNECT_TO_GLOBAL_DLLIST(p, ht)
The former inserts the bucket into the doubly linked bucket chain at the given index; the latter inserts it into the doubly linked list that spans the whole HashTable. The two insert in different places: the former puts the bucket at the front of its list, while the latter appends it at the very end.
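The difference between the two insertions can be sketched with ordinary functions instead of macros. The code below is an illustration only: SketchNode, SketchTable and sketch_link are hypothetical names, and the table is fixed at 8 slots to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

typedef struct sketch_node {
    int id;
    struct sketch_node *pNext, *pLast;          /* collision chain links */
    struct sketch_node *pListNext, *pListLast;  /* global list links */
} SketchNode;

typedef struct sketch_table {
    SketchNode *slots[8];   /* heads of the collision chains */
    SketchNode *pListHead;  /* first inserted element */
    SketchNode *pListTail;  /* most recently inserted element */
} SketchTable;

/* Link p into slot idx: front of the chain, end of the global list. */
void sketch_link(SketchTable *ht, SketchNode *p, unsigned int idx)
{
    assert(ht != NULL && p != NULL && idx < 8);

    /* Collision chain: the new node becomes the head. */
    p->pNext = ht->slots[idx];
    p->pLast = NULL;
    if (p->pNext) {
        p->pNext->pLast = p;
    }
    ht->slots[idx] = p;

    /* Global list: the new node becomes the tail. */
    p->pListLast = ht->pListTail;
    p->pListNext = NULL;
    if (ht->pListTail) {
        ht->pListTail->pListNext = p;
    }
    ht->pListTail = p;
    if (!ht->pListHead) {
        ht->pListHead = p;
    }
}
```

Putting new nodes at the head of the chain makes insertion constant-time without walking the chain, while appending to the global list is what preserves insertion order during traversal.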
Here is the second way to insert or modify a bucket, which uses an index as the key:
ZEND_API int _zend_hash_index_update_or_next_insert(HashTable *ht, ulong h, void *pData, uint nDataSize, void **pDest, int flag ZEND_FILE_LINE_DC)
{
    uint nIndex;
    Bucket *p;

    IS_CONSISTENT(ht);

    if (flag & HASH_NEXT_INSERT) {
        h = ht->nNextFreeElement;
    }
    nIndex = h & ht->nTableMask;

    p = ht->arBuckets[nIndex];
    /* Check whether the corresponding data is already present */
    while (p != NULL) {
        if ((p->nKeyLength == 0) && (p->h == h)) {
            if (flag & HASH_NEXT_INSERT || flag & HASH_ADD) {
                return FAILURE;
            }
            /* ...... modify the bucket data (omitted) */
            if ((long)h >= (long)ht->nNextFreeElement) {
                ht->nNextFreeElement = h + 1;
            }
            if (pDest) {
                *pDest = p->pData;
            }
            return SUCCESS;
        }
        p = p->pNext;
    }

    p = (Bucket *) pemalloc_rel(sizeof(Bucket) - 1, ht->persistent);
    if (!p) {
        return FAILURE;
    }
    p->nKeyLength = 0;    /* Numeric indices are marked by making nKeyLength == 0 */
    p->h = h;
    INIT_DATA(ht, p, pData, nDataSize);
    if (pDest) {
        *pDest = p->pData;
    }

    CONNECT_TO_BUCKET_DLLIST(p, ht->arBuckets[nIndex]);

    HANDLE_BLOCK_INTERRUPTIONS();
    ht->arBuckets[nIndex] = p;
    CONNECT_TO_GLOBAL_DLLIST(p, ht);
    HANDLE_UNBLOCK_INTERRUPTIONS();

    if ((long)h >= (long)ht->nNextFreeElement) {
        ht->nNextFreeElement = h + 1;
    }
    ht->nNumOfElements++;
    ZEND_HASH_IF_FULL_DO_RESIZE(ht);
    return SUCCESS;
}
The flag argument indicates whether the current operation is HASH_NEXT_INSERT (insert without specifying an index), HASH_ADD (insert at a specified index), or HASH_UPDATE (modify at a specified index). Because the implementation of these operations is essentially the same, they are merged into one function and distinguished by flag.
This function is essentially the same as the previous one; the difference lies in how the index into the arBuckets array is determined. If the operation is HASH_NEXT_INSERT, nNextFreeElement is used directly as the index to insert at. Note how the value of nNextFreeElement is used and updated.
3. Accessing elements
Similarly, the HashTable offers two ways to access an element: zend_hash_find(), which uses the string arKey, and zend_hash_index_find(), which uses an index. Because the code is simple, its analysis is left to the reader.
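As a starting point for that analysis, the lookup by string key can be sketched as a walk along one collision chain. This is a simplified illustration, not the body of zend_hash_find(): SketchFindBucket and sketch_find are hypothetical names, the bucket stores a pointer to its key instead of embedding it, and the caller is assumed to pass in the already computed hash value h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct sketch_find_bucket {
    unsigned long h;          /* hash of the key */
    unsigned int nKeyLength;  /* length of the key in bytes */
    const char *key;          /* simplified: points at the key bytes */
    int value;                /* payload */
    struct sketch_find_bucket *pNext;  /* next bucket in the same chain */
} SketchFindBucket;

/* Return the bucket with the given key, or NULL if it is not present.
 * mask is the table size minus one. */
SketchFindBucket *sketch_find(SketchFindBucket *const *slots, unsigned int mask,
                              const char *key, unsigned int len, unsigned long h)
{
    SketchFindBucket *p;

    assert(slots != NULL && key != NULL);
    for (p = slots[h & mask]; p != NULL; p = p->pNext) {
        /* Cheap integer comparisons first, memcmp only on a likely match. */
        if (p->h == h && p->nKeyLength == len && memcmp(p->key, key, len) == 0) {
            return p;
        }
    }
    return NULL;
}
```

Comparing h and the length before the key bytes means that most non-matching buckets in a chain are rejected without touching the key at all.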
4. Deleting an element
Data is deleted from a HashTable with the zend_hash_del_key_or_index() function. Its code is also fairly simple and is not analyzed in detail here. The points to watch are how the array subscript is calculated from arKey or h, and how the pointers of the two doubly linked lists are updated.
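The pointer handling for the two lists can be sketched as follows. This is not the body of zend_hash_del_key_or_index(): SketchDelNode, SketchDelTable and sketch_unlink are hypothetical names, the table is fixed at 8 slots, and the sketch only unlinks the node without destructing or freeing anything.

```c
#include <assert.h>
#include <stddef.h>

typedef struct sketch_del_node {
    struct sketch_del_node *pNext, *pLast;          /* collision chain links */
    struct sketch_del_node *pListNext, *pListLast;  /* global list links */
} SketchDelNode;

typedef struct sketch_del_table {
    SketchDelNode *slots[8];
    SketchDelNode *pListHead;
    SketchDelNode *pListTail;
} SketchDelTable;

/* Detach p, which lives in slot idx, from both doubly linked lists. */
void sketch_unlink(SketchDelTable *ht, SketchDelNode *p, unsigned int idx)
{
    assert(ht != NULL && p != NULL && idx < 8);

    /* Collision chain: patch the neighbours, or the slot if p was the head. */
    if (p->pLast) {
        p->pLast->pNext = p->pNext;
    } else {
        ht->slots[idx] = p->pNext;
    }
    if (p->pNext) {
        p->pNext->pLast = p->pLast;
    }

    /* Global list: patch the neighbours, or the head/tail pointers. */
    if (p->pListLast) {
        p->pListLast->pListNext = p->pListNext;
    } else {
        ht->pListHead = p->pListNext;
    }
    if (p->pListNext) {
        p->pListNext->pListLast = p->pListLast;
    } else {
        ht->pListTail = p->pListLast;
    }
}
```

Because both lists are doubly linked, removal needs no search for the predecessor once the bucket has been found.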
5. Traversing elements
/* This is used to recurse elements and selectively delete certain entries
 * from a hashtable. apply_func() receives the data and decides if the entry
 * should be deleted or recursion should be stopped. The following three
 * return codes are possible:
 * ZEND_HASH_APPLY_KEEP   - continue
 * ZEND_HASH_APPLY_STOP   - stop iteration
 * ZEND_HASH_APPLY_REMOVE - delete the element, combineable with the former
 */
ZEND_API void zend_hash_apply(HashTable *ht, apply_func_t apply_func TSRMLS_DC)
{
    Bucket *p;

    IS_CONSISTENT(ht);

    HASH_PROTECT_RECURSION(ht);
    p = ht->pListHead;
    while (p != NULL) {
        int result = apply_func(p->pData TSRMLS_CC);

        if (result & ZEND_HASH_APPLY_REMOVE) {
            p = zend_hash_apply_deleter(ht, p);
        } else {
            p = p->pListNext;
        }
        if (result & ZEND_HASH_APPLY_STOP) {
            break;
        }
    }
    HASH_UNPROTECT_RECURSION(ht);
}
Because every bucket in the HashTable can be reached through the doubly linked list that starts at pListHead, traversing a HashTable is straightforward. It is worth noting that each visited bucket is processed by a callback function of type apply_func_t. Depending on what is needed, the callback returns one of the following values:
- ZEND_HASH_APPLY_KEEP
- ZEND_HASH_APPLY_STOP
- ZEND_HASH_APPLY_REMOVE
These mean, respectively, continue the traversal, stop the traversal, and delete the current element and then continue. As the comment in the code notes, ZEND_HASH_APPLY_REMOVE can be combined with ZEND_HASH_APPLY_STOP.
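The callback protocol can be tried out on a plain linked list. Everything in the sketch below is hypothetical (the SKETCH_APPLY_* constants and their numeric values, SketchItem, sketch_apply, sketch_drop_negative); it uses a singly linked list instead of the HashTable's doubly linked one, and nodes are owned by the caller, so removal only unlinks them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values; REMOVE and STOP are distinct bits so they combine. */
enum {
    SKETCH_APPLY_KEEP = 0,
    SKETCH_APPLY_REMOVE = 1 << 0,
    SKETCH_APPLY_STOP = 1 << 1
};

typedef struct sketch_item {
    int value;
    struct sketch_item *next;
} SketchItem;

typedef int (*sketch_apply_func_t)(int *value);

/* Visit items in order, unlinking those the callback flags for removal.
 * Returns the number of items the callback was called on. */
int sketch_apply(SketchItem **head, sketch_apply_func_t fn)
{
    SketchItem **link = head;
    int visited = 0;

    assert(head != NULL && fn != NULL);
    while (*link != NULL) {
        int result = fn(&(*link)->value);
        visited++;
        if (result & SKETCH_APPLY_REMOVE) {
            *link = (*link)->next;   /* unlink; link now refers to the successor */
        } else {
            link = &(*link)->next;   /* keep; advance to the next item */
        }
        if (result & SKETCH_APPLY_STOP) {
            break;
        }
    }
    return visited;
}

/* Example callback: drop negative values, stop when a zero is seen. */
int sketch_drop_negative(int *value)
{
    if (*value < 0) {
        return SKETCH_APPLY_REMOVE;
    }
    if (*value == 0) {
        return SKETCH_APPLY_STOP;
    }
    return SKETCH_APPLY_KEEP;
}
```

As in zend_hash_apply, the removal is handled before the stop check, so a callback can return both flags at once to delete the current element and end the traversal.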
Another point to be aware of is recursion during traversal, that is, nested traversals of the same HashTable running at the same time. This is guarded against with the following two macros:
- HASH_PROTECT_RECURSION(ht)
- HASH_UNPROTECT_RECURSION(ht)
The principle is as follows: if the traversal protection flag bApplyProtection is true, nApplyCount is incremented each time a traversal function is entered and decremented when it exits. If nApplyCount is found to exceed 3 when a traversal starts, an error is reported and the traversal is abandoned.
The apply_func_t callback above receives no extra arguments. HashTable also provides traversal functions whose callbacks take one extra argument or a variable number of arguments:
typedef int (*apply_func_arg_t)(void *pDest, void *argument TSRMLS_DC);
void zend_hash_apply_with_argument(HashTable *ht, apply_func_arg_t apply_func, void *data TSRMLS_DC);

typedef int (*apply_func_args_t)(void *pDest, int num_args, va_list args, zend_hash_key *hash_key);
void zend_hash_apply_with_arguments(HashTable *ht, apply_func_args_t apply_func, int numargs, ...);
Besides the operations described above, there are many other APIs for working with a HashTable, such as sorting, copying, and merging. Once you fully understand the HashTable data structures described above, that code is not difficult to understand either.