PHP source code analysis: Zend HashTable

Source: Internet
Author: User

Recently I read an article about the HashTable in PHP, reprinted here. HashTable sits at the core of PHP's data storage: it is used to organize constants, variables, functions, classes, and objects.

HashTable is what data structure textbooks call a hash table. Its basic principle is relatively simple (if you are not familiar with it, consult any data structures textbook or search online), but PHP's implementation has its own distinctive features. Understanding how HashTable stores data is a great help when analyzing the PHP source code, especially the implementation of the virtual machine in the Zend Engine, because it lets us form a complete mental picture of the virtual machine. HashTable is also the basis on which PHP implements other data structures, such as arrays.

The Zend HashTable implementation combines the advantages of a doubly linked list and a vector (array), giving PHP a very efficient mechanism for storing and looking up data.

I. HashTable Data Structure

The HashTable implementation in the Zend Engine lives mainly in zend_hash.h and zend_hash.c. Zend HashTable consists of two main data structures: the Bucket structure and the HashTable structure. The Bucket structure is the container that holds the data, while the HashTable structure provides the mechanism for managing all of these Buckets (or Bucket chains).

    typedef struct bucket {
        ulong h;                    /* used for numeric indexing */
        uint nKeyLength;            /* key length */
        void *pData;                /* pointer to the data stored in the Bucket */
        void *pDataPtr;             /* pointer data */
        struct bucket *pListNext;   /* next element in the list of all Buckets in the HashTable */
        struct bucket *pListLast;   /* previous element in the list of all Buckets in the HashTable */
        struct bucket *pNext;       /* next element in the chain of Buckets with the same hash value */
        struct bucket *pLast;       /* previous element in the chain of Buckets with the same hash value */
        char arKey[1];              /* key name; must be the last member */
    } Bucket;

In Zend HashTable, each data element (Bucket) has a key that is unique within the entire HashTable, so a data element can be identified uniquely by its key. A key can take one of two forms. The first uses the string arKey as the key, with the length of the string given by nKeyLength. Note that although arKey is declared as an array of one character, this does not mean the key can only be one character long. A Bucket is in fact a variable-length struct: because arKey is the last member of Bucket, the memory allocated for a Bucket can extend past the end of the struct, and arKey together with nKeyLength describes a key of nKeyLength bytes. This is a common technique in C programming. The second form is a numeric index. In this case nKeyLength is always 0 and the long integer field h holds the key. Simply put, if nKeyLength == 0 the key is h; otherwise the key is arKey and its length is nKeyLength.
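The variable-length struct technique can be illustrated with a minimal sketch. The names DemoBucket, demo_bucket_new, and demo_bucket_has_string_key are hypothetical and mirror only the h, nKeyLength, and arKey fields described above; this is not the Zend Engine code, and it uses plain malloc rather than PHP's allocators.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified Bucket: only the key-related fields are kept. */
typedef struct demo_bucket {
    unsigned long h;          /* numeric key, or hash value of arKey */
    unsigned int nKeyLength;  /* 0 means the key is the number h */
    char arKey[1];            /* must be the last member; key bytes start here */
} DemoBucket;

/* Allocate the struct and the key in one block. sizeof(DemoBucket) already
 * counts one byte of arKey, hence the "- 1". Returns NULL on failure. */
DemoBucket *demo_bucket_new(const char *key, unsigned int key_len,
                            unsigned long h)
{
    DemoBucket *b;

    assert(key != NULL || key_len == 0);
    b = malloc(sizeof(DemoBucket) - 1 + key_len);
    if (b == NULL) {
        return NULL;
    }
    b->h = h;
    b->nKeyLength = key_len;
    if (key_len > 0) {
        memcpy(b->arKey, key, key_len); /* key bytes run past the declared array */
    }
    return b;
}

/* Returns 1 if the bucket is keyed by a string, 0 if keyed by the number h. */
int demo_bucket_has_string_key(const DemoBucket *b)
{
    return b->nKeyLength > 0;
}
```

Because the key lives in the same allocation as the struct, a single free() releases both.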

When nKeyLength > 0, the value of h is not meaningless: it stores the hash value computed from arKey. No matter how the hash function is designed, collisions are inevitable, that is, different arKeys may have the same hash value. Buckets with the same hash value (more precisely, Buckets whose hash values map to the same index) are stored in the Bucket chain at the same index of the HashTable's arBuckets array (described below). This Bucket chain is a doubly linked list whose previous and next elements are referenced by pLast and pNext respectively. A newly inserted Bucket is placed at the head of the chain.

In a Bucket, the actual data is stored in the memory block pointed to by pData. Normally this block is allocated separately. The one exception is when the data stored in the Bucket is itself a pointer: HashTable then does not allocate a separate block for it, but saves the pointer directly in pDataPtr and points pData at that member of the struct. This improves efficiency and reduces memory fragmentation, and it shows the subtlety of the PHP HashTable design. If the data in the Bucket is not a pointer, pDataPtr is NULL.
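The pData/pDataPtr arrangement can be sketched as follows. This is a simplified illustration, not the engine's code: DemoBucket, demo_init_data, and demo_free_data are hypothetical names, and the sketch assumes that "the data is a pointer" is detected by comparing the data size with sizeof(void *).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical Bucket reduced to its two data fields. */
typedef struct demo_bucket {
    void *pData;     /* always points at the stored data */
    void *pDataPtr;  /* inline storage, used when the data is pointer-sized */
} DemoBucket;

/* Store a copy of `size` bytes from `data`. Returns 0 on success, -1 on failure. */
int demo_init_data(DemoBucket *b, const void *data, size_t size)
{
    assert(b != NULL && data != NULL && size > 0);
    if (size == sizeof(void *)) {
        /* Assumed rule: pointer-sized data is kept inside the Bucket itself. */
        memcpy(&b->pDataPtr, data, sizeof(void *));
        b->pData = &b->pDataPtr;
        return 0;
    }
    b->pData = malloc(size);  /* everything else gets its own block */
    if (b->pData == NULL) {
        return -1;
    }
    memcpy(b->pData, data, size);
    b->pDataPtr = NULL;
    return 0;
}

/* Release the data; only separately allocated blocks are freed. */
void demo_free_data(DemoBucket *b)
{
    if (b->pData != (void *)&b->pDataPtr) {
        free(b->pData);
    }
    b->pData = NULL;
}
```

In either case the caller reads the data through pData, so code that consumes a Bucket does not need to know which storage path was taken.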

All Buckets in a HashTable also form a doubly linked list through pListNext and pListLast. A newly inserted Bucket is placed at the end of this list.

Note that a Bucket does not record the size of the data it stores. Therefore, in the PHP implementation, the data stored in a Bucket must be able to manage its own size.

    typedef struct _hashtable {
        uint nTableSize;
        uint nTableMask;
        uint nNumOfElements;
        ulong nNextFreeElement;
        Bucket *pInternalPointer;
        Bucket *pListHead;
        Bucket *pListTail;
        Bucket **arBuckets;
        dtor_func_t pDestructor;
        zend_bool persistent;
        unsigned char nApplyCount;
        zend_bool bApplyProtection;
    #if ZEND_DEBUG
        int inconsistent;
    #endif
    } HashTable;

In the HashTable structure, nTableSize specifies the size of the HashTable, that is, the number of slots in the arBuckets array, and it bounds the number of Buckets the HashTable can hold before it has to grow. The larger it is, the more memory is allocated to the HashTable. To make index computation efficient, nTableSize is always rounded up to the smallest integer power of 2 that is not less than the requested size. In other words, if you specify an nTableSize that is not a power of 2 when initializing a HashTable, the initialization code adjusts it automatically. That is:

$\mathrm{nTableSize} = 2^{\lceil \log_2(\mathrm{nTableSize}) \rceil}$, or equivalently, nTableSize = pow(2, ceil(log(nTableSize, 2)))

For example, if nTableSize = 11 is specified during initialization, the HashTable initialization code automatically increases nTableSize to 16.
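The rounding rule can be sketched as a small function. demo_round_table_size is a hypothetical name; the starting exponent of 3 (so the smallest table has 8 slots) and the overflow guard follow the initialization code in zend_hash.c, and the sketch assumes a 32-bit unsigned int.

```c
#include <assert.h>

/* Round nSize up to a power of 2, with a minimum of 8. */
unsigned int demo_round_table_size(unsigned int nSize)
{
    unsigned int i = 3; /* start at 2^3, so the smallest result is 8 */
    unsigned int result;

    if (nSize >= 0x80000000U) {
        return 0x80000000U; /* largest power of 2 that fits in 32 bits */
    }
    while ((1U << i) < nSize) {
        i++;
    }
    result = 1U << i;
    assert((result & (result - 1)) == 0); /* result is a power of 2 */
    return result;
}
```

With this sketch, a requested size of 11 yields 16, and any request of 8 or less yields 8.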

arBuckets is the key to HashTable. The initialization code allocates a block of memory large enough to hold nTableSize pointers and assigns its address to arBuckets. We can regard arBuckets as an array of size nTableSize in which each element is a pointer to a Bucket that actually stores data. Initially, of course, every pointer is NULL.

The value of nTableMask is always nTableSize - 1. This field exists for efficiency: because nTableSize is a power of 2, the index of a Bucket's key in the arBuckets array can be computed quickly with a bitwise AND against nTableMask.
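A short sketch shows why the mask works. demo_bucket_index is a hypothetical helper; it relies only on the fact stated above that nTableSize is a power of 2, in which case h & (nTableSize - 1) gives the same result as h % nTableSize without a division.

```c
#include <assert.h>

/* Map a hash value (or numeric key) h to a slot in a table of nTableSize slots. */
unsigned int demo_bucket_index(unsigned long h, unsigned int nTableSize)
{
    unsigned int nTableMask;

    /* The mask trick is only valid for power-of-2 sizes. */
    assert(nTableSize != 0 && (nTableSize & (nTableSize - 1)) == 0);
    nTableMask = nTableSize - 1;
    return (unsigned int)(h & nTableMask);
}
```

For a table of 16 slots the mask is 15 (binary 1111), so only the low four bits of h select the slot; the result can never fall outside the array.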

nNumOfElements records the number of data elements currently stored in the HashTable. When nNumOfElements becomes greater than nTableSize, the HashTable automatically expands to twice its original size.

nNextFreeElement records the next free numeric index in the HashTable, that is, the index used when a data element is inserted without an explicit key.

pListHead and pListTail point to the first and last elements of the doubly linked list of all Buckets. These elements are normally kept in insertion order, but they can be reordered by the sorting functions. pInternalPointer records the current position while the HashTable is being traversed: it points to the Bucket currently being visited, and its initial value is pListHead.

pDestructor is a function pointer. HashTable calls it automatically to clean up the associated data when a Bucket's data is replaced or a Bucket is deleted.

The persistent flag specifies how Bucket memory is allocated. If persistent is TRUE, memory is allocated with the operating system's allocation functions; otherwise PHP's own memory allocation functions are used. For details, refer to PHP memory management.

Together, nApplyCount and bApplyProtection provide a mechanism that prevents runaway recursion while traversing a HashTable.

The inconsistent member is used for debugging and is only available when PHP is compiled as a debug build. It indicates the state of the HashTable, which is one of four values:

Status             Meaning
HT_IS_DESTROYING   All contents are being deleted, including arBuckets itself
HT_DESTROYED       Everything has been deleted, including arBuckets itself
HT_CLEANING        All content pointed to by arBuckets is being cleared, but not arBuckets itself
HT_OK              Normal state; all data is consistent

    typedef struct _zend_hash_key {
        char *arKey;        /* hash element key name */
        uint nKeyLength;    /* hash element key length */
        ulong h;            /* hash value computed from the key, or the specified numeric index */
    } zend_hash_key;

With the above in mind, the zend_hash_key structure is easy to understand: it uniquely identifies an element in a HashTable through the arKey, nKeyLength, and h fields.

Based on the data structures above, we can draw the memory layout of a HashTable:

[Figure: HashTable structure]

II. Implementation of Zend HashTable

This section describes how HashTable is implemented in PHP. The functions below are taken from zend_hash.c. Once you fully understand the data structures above, the HashTable code is not hard to follow.

1. HashTable Initialization

HashTable provides the zend_hash_init macro for initialization. It is implemented by the following internal function:

    ZEND_API int _zend_hash_init(HashTable *ht, uint nSize, hash_func_t pHashFunction, dtor_func_t pDestructor, zend_bool persistent ZEND_FILE_LINE_DC)
    {
        uint i = 3;
        Bucket **tmp;

        SET_INCONSISTENT(HT_OK);

        if (nSize >= 0x80000000) {
            /* prevent overflow */
            ht->nTableSize = 0x80000000;
        } else {
            while ((1U << i) < nSize) {
                /* automatically adjust nTableSize to a power of 2 */
                i++;
            }
            ht->nTableSize = 1 << i; /* i starts at 3, so the minimum HashTable size is 8 */
        }

        ht->nTableMask = ht->nTableSize - 1;
        ht->pDestructor = pDestructor;
        ht->arBuckets = NULL;
        ht->pListHead = NULL;
        ht->pListTail = NULL;
        ht->nNumOfElements = 0;
        ht->nNextFreeElement = 0;
        ht->pInternalPointer = NULL;
        ht->persistent = persistent;
        ht->nApplyCount = 0;
        ht->bApplyProtection = 1;

        /* allocate the arBuckets memory in a way that depends on persistent,
         * and initialize all of its pointers to NULL */
        /* Uses ecalloc() so that Bucket* == NULL */
        if (persistent) {
            tmp = (Bucket **) calloc(ht->nTableSize, sizeof(Bucket *));
            if (!tmp) {
                return FAILURE;
            }
            ht->arBuckets = tmp;
        } else {
            tmp = (Bucket **) ecalloc_rel(ht->nTableSize, sizeof(Bucket *));
            if (tmp) {
                ht->arBuckets = tmp;
            }
        }

        return SUCCESS;
    }

In earlier versions, pHashFunction could be used to specify the hash function. PHP now always uses the DJBX33A algorithm, so the pHashFunction parameter is not actually used; it is kept only for compatibility with older code.
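For readers unfamiliar with DJBX33A (Daniel J. Bernstein, "times 33 with addition"), here is a minimal sketch of the algorithm. demo_djbx33a is a hypothetical name, the seed 5381 is the value conventionally used with this hash, and the plain loop below leaves out any loop unrolling that a production implementation might apply.

```c
#include <assert.h>
#include <stddef.h>

/* DJBX33A: multiply the running hash by 33 and add the next byte. */
unsigned long demo_djbx33a(const char *key, unsigned int len)
{
    unsigned long hash = 5381; /* conventional starting value */
    unsigned int i;

    assert(key != NULL || len == 0);
    for (i = 0; i < len; i++) {
        hash = hash * 33 + (unsigned char)key[i];
    }
    return hash;
}
```

The multiplication by 33 is often written as (hash << 5) + hash, which is the same operation expressed with a shift and an add.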

2. Add, insert, and modify elements

The most important step in adding a new element to a HashTable is determining where in the arBuckets array the element should go. From the discussion of Bucket keys above, we know there are two ways to add a new element. The first inserts a Bucket with a string as the key; the second inserts a Bucket with a numeric index as the key. The second can be further divided into two cases: with or without a specified index. With a specified index, the Bucket is inserted at exactly that index; without one, the Bucket is inserted at the index given by nNextFreeElement. The implementations of these insertion methods are similar; they differ only in how the Bucket is located.

The method for modifying data in HashTable is similar to that for adding data.

First, let's look at the first method to add or modify a Bucket using a string as the key name:

    ZEND_API int _zend_hash_add_or_update(HashTable *ht, char *arKey, uint nKeyLength, void *pData, uint nDataSize, void **pDest, int flag ZEND_FILE_LINE_DC)
    {
        ulong h;
        uint nIndex;
        Bucket *p;

        IS_CONSISTENT(ht); /* debug information output */

        if (nKeyLength <= 0) {
    #if ZEND_DEBUG
            ZEND_PUTS("zend_hash_update: can't put in empty key\n");
    #endif
            return FAILURE;
        }

        /* use the hash function to compute the hash value of arKey */
        h = zend_inline_hash_func(arKey, nKeyLength);
        /* AND the hash value with nTableMask to get the index in arBuckets;
         * the mask guarantees that the index never falls outside arBuckets */
        nIndex = h & ht->nTableMask;

        p = ht->arBuckets[nIndex]; /* head of the Bucket chain at that index */
        /* check whether the chain already contains the element (key, hash) */
        while (p != NULL) {
            if ((p->h == h) && (p->nKeyLength == nKeyLength)) {
                if (!memcmp(p->arKey, arKey, nKeyLength)) {
                    if (flag & HASH_ADD) {
                        return FAILURE; /* the element already exists, so it cannot be inserted */
                    }
                    HANDLE_BLOCK_INTERRUPTIONS();
    #if ZEND_DEBUG
                    if (p->pData == pData) {
                        ZEND_PUTS("Fatal error in zend_hash_update: p->pData == pData\n");
                        HANDLE_UNBLOCK_INTERRUPTIONS();
                        return FAILURE;
                    }
    #endif
                    if (ht->pDestructor) {
                        /* the element exists: destroy the original data */
                        ht->pDestructor(p->pData);
                    }
                    /* replace the original data with the new data */
                    UPDATE_DATA(ht, p, pData, nDataSize);
                    if (pDest) {
                        *pDest = p->pData;
                    }
                    HANDLE_UNBLOCK_INTERRUPTIONS();
                    return SUCCESS;
                }
            }
            p = p->pNext;
        }

        /* no data for this key in the HashTable: add a new Bucket */
        p = (Bucket *) pemalloc(sizeof(Bucket) - 1 + nKeyLength, ht->persistent);
        if (!p) {
            return FAILURE;
        }
        memcpy(p->arKey, arKey, nKeyLength);
        p->nKeyLength = nKeyLength;
        INIT_DATA(ht, p, pData, nDataSize);
        p->h = h;
        /* add the Bucket to the Bucket chain at its index */
        CONNECT_TO_BUCKET_DLLIST(p, ht->arBuckets[nIndex]);
        if (pDest) {
            *pDest = p->pData;
        }

        HANDLE_BLOCK_INTERRUPTIONS();
        /* add the Bucket to the HashTable-wide doubly linked list */
        CONNECT_TO_GLOBAL_DLLIST(p, ht);
        ht->arBuckets[nIndex] = p;
        HANDLE_UNBLOCK_INTERRUPTIONS();

        ht->nNumOfElements++;
        ZEND_HASH_IF_FULL_DO_RESIZE(ht); /* If the Hash table is full, resize it */
        return SUCCESS;
    }

Because this function inserts data using a string as the key, it first checks that nKeyLength is greater than 0 and returns immediately if it is not. It then computes the hash value h of arKey and ANDs it bitwise with nTableMask to obtain the unsigned integer nIndex, which is the index in the arBuckets array at which the Bucket will be inserted.
Now that we have an index into arBuckets, recall that the element at that index is a pointer to a doubly linked list of Buckets. If the list is not empty, the function first checks whether it already contains a Bucket whose key matches the string arKey. If such a Bucket exists and the operation is an insertion (as indicated by flag), the function reports an error, because keys in a HashTable must be unique. If it exists and the operation is a modification, the function calls the destructor pDestructor specified for the HashTable, if any, on the data pointed to by the original pData, then replaces the original data with the new data and returns successfully.
If no data with the given key is found in the HashTable, the data is wrapped in a new Bucket and inserted into the HashTable. Note the following two macros:
CONNECT_TO_BUCKET_DLLIST(p, ht->arBuckets[nIndex])
CONNECT_TO_GLOBAL_DLLIST(p, ht)
The former inserts the Bucket into the doubly linked Bucket list at the given index, and the latter inserts it into the doubly linked list of the whole HashTable. They also insert at different positions: the former at the head of its list, the latter at the tail.
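The effect of the two macros can be sketched with ordinary functions. This is an illustration under simplifying assumptions, not the macro source: DemoBucket, DemoTable, demo_connect_to_chain, and demo_connect_to_global are hypothetical names, the table has a fixed size of 8, and the chain helper also updates the slot, which the function shown earlier does in a separate assignment.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_TABLE_SIZE 8

typedef struct demo_bucket {
    unsigned long h;
    struct demo_bucket *pListNext; /* next in the table-wide list */
    struct demo_bucket *pListLast; /* previous in the table-wide list */
    struct demo_bucket *pNext;     /* next in the chain of this slot */
    struct demo_bucket *pLast;     /* previous in the chain of this slot */
} DemoBucket;

typedef struct demo_table {
    DemoBucket *pListHead;
    DemoBucket *pListTail;
    DemoBucket *arBuckets[DEMO_TABLE_SIZE];
} DemoTable;

/* Insert p at the head of the chain stored in *slot. */
void demo_connect_to_chain(DemoBucket *p, DemoBucket **slot)
{
    assert(p != NULL && slot != NULL);
    p->pNext = *slot;
    p->pLast = NULL;
    if (p->pNext != NULL) {
        p->pNext->pLast = p;
    }
    *slot = p;
}

/* Append p at the tail of the table-wide list, preserving insertion order. */
void demo_connect_to_global(DemoBucket *p, DemoTable *ht)
{
    assert(p != NULL && ht != NULL);
    p->pListLast = ht->pListTail;
    p->pListNext = NULL;
    if (p->pListLast != NULL) {
        p->pListLast->pListNext = p;
    } else {
        ht->pListHead = p; /* the list was empty */
    }
    ht->pListTail = p;
}
```

Inserting at the head of a chain is the cheapest option and needs no traversal, while appending to the table-wide list is what keeps elements in insertion order during iteration.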

The second method to insert or modify a Bucket is as follows:

    ZEND_API int _zend_hash_index_update_or_next_insert(HashTable *ht, ulong h, void *pData, uint nDataSize, void **pDest, int flag ZEND_FILE_LINE_DC)
    {
        uint nIndex;
        Bucket *p;

        IS_CONSISTENT(ht);

        if (flag & HASH_NEXT_INSERT) {
            h = ht->nNextFreeElement;
        }
        nIndex = h & ht->nTableMask;

        p = ht->arBuckets[nIndex];
        /* check whether the corresponding data already exists */
        while (p != NULL) {
            if ((p->nKeyLength == 0) && (p->h == h)) {
                if (flag & HASH_NEXT_INSERT || flag & HASH_ADD) {
                    return FAILURE;
                }
                /* ... modify the Bucket data ... */
                if ((long)h >= (long)ht->nNextFreeElement) {
                    ht->nNextFreeElement = h + 1;
                }
                if (pDest) {
                    *pDest = p->pData;
                }
                return SUCCESS;
            }
            p = p->pNext;
        }

        p = (Bucket *) pemalloc_rel(sizeof(Bucket) - 1, ht->persistent);
        if (!p) {
            return FAILURE;
        }
        p->nKeyLength = 0; /* Numeric indices are marked by making the nKeyLength == 0 */
        p->h = h;
        INIT_DATA(ht, p, pData, nDataSize);
        if (pDest) {
            *pDest = p->pData;
        }

        CONNECT_TO_BUCKET_DLLIST(p, ht->arBuckets[nIndex]);

        HANDLE_BLOCK_INTERRUPTIONS();
        ht->arBuckets[nIndex] = p;
        CONNECT_TO_GLOBAL_DLLIST(p, ht);
        HANDLE_UNBLOCK_INTERRUPTIONS();

        if ((long)h >= (long)ht->nNextFreeElement) {
            ht->nNextFreeElement = h + 1;
        }
        ht->nNumOfElements++;
        ZEND_HASH_IF_FULL_DO_RESIZE(ht);
        return SUCCESS;
    }

flag indicates whether the current operation is HASH_NEXT_INSERT (insert without specifying an index), HASH_ADD (insert at a specified index), or HASH_UPDATE (modify at a specified index). Because the code for these operations is essentially the same, they are merged into a single function and distinguished by flag.
This function is essentially the same as the previous one; the difference lies in how the index into the arBuckets array is determined. If the operation is HASH_NEXT_INSERT, nNextFreeElement is used directly as the index to insert at. Note how the value of nNextFreeElement is used and updated.
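The bookkeeping around nNextFreeElement can be isolated in a small sketch. DemoCounter and demo_choose_index are hypothetical names; the logic mirrors only the two steps visible in the function above (pick h, then advance nNextFreeElement if h is not below it) and leaves out all Bucket handling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the single HashTable field involved. */
typedef struct demo_counter {
    unsigned long nNextFreeElement;
} DemoCounter;

/* Return the numeric key an insertion would use and update the counter.
 * next_insert != 0 models HASH_NEXT_INSERT: the caller's h is ignored. */
unsigned long demo_choose_index(DemoCounter *c, unsigned long h, int next_insert)
{
    assert(c != NULL);
    if (next_insert) {
        h = c->nNextFreeElement;
    }
    /* Same signed comparison as in the function above. */
    if ((long)h >= (long)c->nNextFreeElement) {
        c->nNextFreeElement = h + 1;
    }
    return h;
}
```

Starting from an empty table, an append uses key 0; inserting at explicit key 5 moves the counter to 6, so the next append uses key 6; inserting at a smaller explicit key such as 2 leaves the counter unchanged.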
3. Access Element
As with insertion, HashTable provides two ways to access elements: zend_hash_find(), which looks up by the string key arKey, and zend_hash_index_find(), which looks up by numeric index. Because the implementation is very simple, its analysis is left to the reader.
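As a starting point for that analysis, a string lookup can be sketched as a walk along one Bucket chain. This is a simplified illustration, not the zend_hash_find() source: DemoBucket and demo_find are hypothetical names, the key is held through a pointer instead of an inline array, the caller passes in the precomputed hash h, and the return values 0 and -1 stand in for SUCCESS and FAILURE.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical Bucket with only the fields a lookup needs. */
typedef struct demo_bucket {
    unsigned long h;
    unsigned int nKeyLength;
    void *pData;
    struct demo_bucket *pNext;
    const char *arKey; /* simplification: pointer instead of inline array */
} DemoBucket;

/* Look up a string key. On success store the data pointer in *pData and
 * return 0; return -1 if the key is not present. */
int demo_find(DemoBucket **arBuckets, unsigned int nTableMask,
              const char *arKey, unsigned int nKeyLength, unsigned long h,
              void **pData)
{
    DemoBucket *p;

    assert(arBuckets != NULL && arKey != NULL && pData != NULL);
    p = arBuckets[h & nTableMask]; /* head of the chain for this hash */
    while (p != NULL) {
        /* Cheap integer comparisons first, memcmp only if they match. */
        if (p->h == h && p->nKeyLength == nKeyLength &&
            memcmp(p->arKey, arKey, nKeyLength) == 0) {
            *pData = p->pData;
            return 0;
        }
        p = p->pNext;
    }
    return -1;
}
```

Comparing h and nKeyLength before calling memcmp means most non-matching Buckets in a chain are rejected without touching the key bytes.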
4. Delete an Element
HashTable uses the zend_hash_del_key_or_index() function to delete data. Its code is also relatively simple and is not analyzed in detail here. The points to pay attention to are how the index is computed from arKey or h, and how the pointers of the two doubly linked lists are updated.
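The pointer handling mentioned above can be sketched as follows. This is not the zend_hash_del_key_or_index() source: DemoBucket, DemoTable, and demo_unlink are hypothetical names, the Bucket to remove is assumed to have been located already, and freeing the Bucket, calling the destructor, and adjusting the internal pointer are left out.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_TABLE_SIZE 8

typedef struct demo_bucket {
    unsigned long h;
    struct demo_bucket *pListNext;
    struct demo_bucket *pListLast;
    struct demo_bucket *pNext;
    struct demo_bucket *pLast;
} DemoBucket;

typedef struct demo_table {
    unsigned int nTableMask;
    unsigned int nNumOfElements;
    DemoBucket *pListHead;
    DemoBucket *pListTail;
    DemoBucket *arBuckets[DEMO_TABLE_SIZE];
} DemoTable;

/* Detach p from its slot chain and from the table-wide list. */
void demo_unlink(DemoTable *ht, DemoBucket *p)
{
    unsigned int nIndex;

    assert(ht != NULL && p != NULL);
    nIndex = (unsigned int)(p->h & ht->nTableMask);

    /* 1. The chain of Buckets that share this slot. */
    if (p->pLast != NULL) {
        p->pLast->pNext = p->pNext;
    } else {
        ht->arBuckets[nIndex] = p->pNext; /* p was the chain head */
    }
    if (p->pNext != NULL) {
        p->pNext->pLast = p->pLast;
    }

    /* 2. The list of all Buckets in the table. */
    if (p->pListLast != NULL) {
        p->pListLast->pListNext = p->pListNext;
    } else {
        ht->pListHead = p->pListNext; /* p was the first element */
    }
    if (p->pListNext != NULL) {
        p->pListNext->pListLast = p->pListLast;
    } else {
        ht->pListTail = p->pListLast; /* p was the last element */
    }
    ht->nNumOfElements--;
}
```

Each Bucket belongs to two lists at once, so a deletion that repairs only one of them would leave a dangling pointer in the other.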
5. Traverse Elements

    /* This is used to recurse elements and selectively delete certain entries
     * from a hashtable. apply_func() receives the data and decides if the entry
     * should be deleted or recursion should be stopped. The following three
     * return codes are possible:
     * ZEND_HASH_APPLY_KEEP   - continue
     * ZEND_HASH_APPLY_STOP   - stop iteration
     * ZEND_HASH_APPLY_REMOVE - delete the element, combineable with the former
     */
    ZEND_API void zend_hash_apply(HashTable *ht, apply_func_t apply_func TSRMLS_DC)
    {
        Bucket *p;

        IS_CONSISTENT(ht);

        HASH_PROTECT_RECURSION(ht);
        p = ht->pListHead;
        while (p != NULL) {
            int result = apply_func(p->pData TSRMLS_CC);

            if (result & ZEND_HASH_APPLY_REMOVE) {
                p = zend_hash_apply_deleter(ht, p);
            } else {
                p = p->pListNext;
            }
            if (result & ZEND_HASH_APPLY_STOP) {
                break;
            }
        }
        HASH_UNPROTECT_RECURSION(ht);
    }

Because all Buckets in a HashTable can be reached through the doubly linked list that starts at pListHead, traversing a HashTable is also relatively simple. It is worth noting that a callback function of type apply_func_t is used to process each Bucket as it is visited. Depending on what is needed, the callback returns one of the following values:

ZEND_HASH_APPLY_KEEP
ZEND_HASH_APPLY_STOP
ZEND_HASH_APPLY_REMOVE

They indicate, respectively, that the traversal should continue, that it should stop, or that the current element should be deleted before the traversal continues.
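The callback protocol can be tried out on a plain doubly linked list. The sketch below is not the HashTable code: every demo_ and DEMO_ name is hypothetical, the list stores ints instead of Buckets, and the three constants are assumed to be distinct bit flags so that "remove" and "stop" can be combined, as the comment above zend_hash_apply() suggests.

```c
#include <assert.h>
#include <stddef.h>

enum {
    DEMO_APPLY_KEEP = 0,
    DEMO_APPLY_REMOVE = 1 << 0,
    DEMO_APPLY_STOP = 1 << 1
};

typedef struct demo_node {
    int value;
    struct demo_node *next;
    struct demo_node *prev;
} DemoNode;

typedef struct demo_list {
    DemoNode *head;
    DemoNode *tail;
    int count;
} DemoList;

typedef int (*demo_apply_func_t)(int *value);

/* Append a caller-owned node to the list. */
void demo_push(DemoList *l, DemoNode *n)
{
    n->next = NULL;
    n->prev = l->tail;
    if (l->tail != NULL) {
        l->tail->next = n;
    } else {
        l->head = n;
    }
    l->tail = n;
    l->count++;
}

/* Unlink n and return the node that followed it. */
DemoNode *demo_remove(DemoList *l, DemoNode *n)
{
    DemoNode *next = n->next;

    if (n->prev != NULL) n->prev->next = n->next; else l->head = n->next;
    if (n->next != NULL) n->next->prev = n->prev; else l->tail = n->prev;
    l->count--;
    return next;
}

/* Visit every node in order and act on the callback's return flags. */
void demo_apply(DemoList *l, demo_apply_func_t fn)
{
    DemoNode *p;

    assert(l != NULL && fn != NULL);
    p = l->head;
    while (p != NULL) {
        int result = fn(&p->value);

        if (result & DEMO_APPLY_REMOVE) {
            p = demo_remove(l, p); /* removal already yields the next node */
        } else {
            p = p->next;
        }
        if (result & DEMO_APPLY_STOP) {
            break;
        }
    }
}

/* Example callback: remove odd values, stop at the first zero. */
int demo_drop_odd_stop_at_zero(int *value)
{
    if (*value == 0) {
        return DEMO_APPLY_STOP;
    }
    return (*value % 2 != 0) ? DEMO_APPLY_REMOVE : DEMO_APPLY_KEEP;
}
```

Applied to the values 1, 2, 3, 0, 5, the example callback removes 1 and 3, stops at 0, and never visits 5, leaving 2, 0, 5 in the list.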

Another point to note is protection against recursion, that is, preventing the same HashTable from being traversed recursively too many times at once. This is implemented with the following two macros:
HASH_PROTECT_RECURSION(ht)
HASH_UNPROTECT_RECURSION(ht)
The principle is as follows: if the traversal protection flag bApplyProtection is true, nApplyCount is incremented by 1 each time a traversal function is entered and decremented by 1 when it exits. If nApplyCount > 3 is detected before the traversal starts, an error is reported and the traversal is aborted.
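The counting scheme can be sketched with two small functions. DemoGuard, demo_protect, demo_unprotect, and DEMO_MAX_NESTING are hypothetical names; the limit of 3 comes from the description above, and where the real macros report an error, this sketch simply returns -1.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_NESTING 3

/* Hypothetical stand-in for the two HashTable fields involved. */
typedef struct demo_guard {
    unsigned char nApplyCount;
    int bApplyProtection;
} DemoGuard;

/* Call on entry to a traversal. Returns 0 if the traversal may proceed,
 * -1 if the nesting would exceed DEMO_MAX_NESTING. */
int demo_protect(DemoGuard *g)
{
    assert(g != NULL);
    if (!g->bApplyProtection) {
        return 0; /* protection disabled: nothing is counted */
    }
    if (g->nApplyCount + 1 > DEMO_MAX_NESTING) {
        return -1;
    }
    g->nApplyCount++;
    return 0;
}

/* Call on exit from a traversal that demo_protect() allowed. */
void demo_unprotect(DemoGuard *g)
{
    assert(g != NULL);
    if (g->bApplyProtection) {
        assert(g->nApplyCount > 0);
        g->nApplyCount--;
    }
}
```

With protection enabled, three nested traversals of the same table are accepted and a fourth is refused until one of the earlier ones finishes.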

The apply_func_t callback above takes no extra arguments. HashTable also provides traversal functions whose callbacks take one additional argument or a variable number of arguments:

    typedef int (*apply_func_arg_t)(void *pDest, void *argument TSRMLS_DC);
    void zend_hash_apply_with_argument(HashTable *ht, apply_func_arg_t apply_func, void *data TSRMLS_DC);

    typedef int (*apply_func_args_t)(void *pDest, int num_args, va_list args, zend_hash_key *hash_key);
    void zend_hash_apply_with_arguments(HashTable *ht, apply_func_args_t apply_func, int numargs, ...);

Besides the functions covered above, HashTable has many other APIs, for example for sorting, copying, and merging HashTables.

 
