Implementation principle of PHP7 hash table

Source: Internet
Author: User
Introduction

Hash tables are used in almost every C program. Because C only allows integers as array indexes, PHP uses a hash table to map string keys onto an array of limited size through a hash function. Different keys will inevitably end up in the same slot, which is called a collision. PHP resolves collisions with linked lists.
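To make the idea of mapping string keys onto a small array concrete, here is a minimal sketch. The names toy_hash and toy_slot are hypothetical, and the times-33 hash is only an illustrative stand-in in the style of DJB's function; it is not claimed to be PHP's exact routine.

```c
#include <stdint.h>

/* Times-33 string hash (DJB style). Illustrative assumption only,
 * not PHP's exact hashing routine. */
static uint64_t toy_hash(const char *key)
{
    uint64_t h = 5381;
    while (*key != '\0') {
        h = h * 33 + (unsigned char)*key++;
    }
    return h;
}

/* Reduce a hash to a slot of a table whose size is a power of two. */
static uint32_t toy_slot(const char *key, uint32_t table_size)
{
    return (uint32_t)(toy_hash(key) & (table_size - 1));
}
```

Under these assumptions, the keys "a" and "i" both land in slot 6 of an 8-slot table, which is exactly the kind of collision the linked list has to resolve.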

There are many ways to implement a hash table, and none of them is perfect. Each design optimizes for its own concerns: some reduce CPU usage, some use memory more efficiently, and some scale better. No single implementation can cover every aspect.

Data structure

Before getting started, we need to declare something in advance:

  • The key name of the hash table may be a string or an integer. When it is a string, we declare the type zend_string; when it is an integer, it is declared as zend_ulong.

  • The hash table preserves the insertion order of its elements.

  • The capacity of the hash table is scaled automatically.

  • Internally, the capacity of the hash table is always a power of 2.

  • Each element in the hash table must be zval-type data.
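As a rough illustration of the capacity rule, the sketch below rounds a requested size up to the next power of 2. The function name toy_check_size and the upper cap are hypothetical; the minimum of 8 follows the default size mentioned later in this article.

```c
#include <stdint.h>

#define TOY_MIN_SIZE 8u            /* default size mentioned in the article */
#define TOY_MAX_SIZE 0x80000000u   /* assumed cap, keeps the shift from overflowing */

/* Round a requested capacity up to the next power of two. */
static uint32_t toy_check_size(uint32_t requested)
{
    uint32_t size = TOY_MIN_SIZE;

    if (requested >= TOY_MAX_SIZE) {
        return TOY_MAX_SIZE;
    }
    while (size < requested) {
        size <<= 1;
    }
    return size;
}
```

A power-of-2 capacity is what allows a hash to be reduced to a slot with a single bitwise operation instead of a division.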

The following is the structure of HashTable:

    struct _zend_array {
        zend_refcounted_h gc;
        union {
            struct {
                ZEND_ENDIAN_LOHI_4(
                    zend_uchar    flags,
                    zend_uchar    nApplyCount,
                    zend_uchar    nIteratorsCount,
                    zend_uchar    reserve)
            } v;
            uint32_t flags;
        } u;
        uint32_t          nTableMask;
        Bucket           *arData;
        uint32_t          nNumUsed;
        uint32_t          nNumOfElements;
        uint32_t          nTableSize;
        uint32_t          nInternalPointer;
        zend_long         nNextFreeElement;
        dtor_func_t       pDestructor;
    };

This struct occupies 56 bytes on a 64-bit platform.

The most important field is arData, which is a pointer to an array of Bucket structures. The Bucket structure is defined as follows:

    typedef struct _Bucket {
        zval              val;
        zend_ulong        h;                /* hash value (or numeric index)   */
        zend_string      *key;              /* string key or NULL for numerics */
    } Bucket;

The Bucket does not hold a pointer to a zval; it embeds the zval itself. In PHP 7 the zval is no longer allocated on the heap. Only data that really needs heap allocation (such as a PHP string) lives on the heap, and the zval stores a pointer to it.
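The sizes involved can be checked with simplified mirror structures. This is only a sketch: the toy_ types below are hypothetical stand-ins that copy the field widths of the real structures, and the sizes in the comments assume a typical 64-bit platform with 8-byte pointers.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for zval: an 8-byte value plus two 32-bit words. */
typedef struct {
    union { int64_t lval; double dval; void *ptr; } value;
    uint32_t type_info;   /* plays the role of u1 */
    uint32_t next;        /* plays the role of u2.next */
} toy_zval;

/* Simplified stand-in for Bucket: the zval is embedded, not pointed to. */
typedef struct {
    toy_zval  val;
    uint64_t  h;
    void     *key;
} toy_bucket;

/* Simplified stand-in for struct _zend_array. */
typedef struct {
    uint64_t    gc;
    uint32_t    flags;
    uint32_t    nTableMask;
    toy_bucket *arData;
    uint32_t    nNumUsed;
    uint32_t    nNumOfElements;
    uint32_t    nTableSize;
    uint32_t    nInternalPointer;
    int64_t     nNextFreeElement;
    void      (*pDestructor)(toy_zval *);
} toy_array;

static size_t toy_zval_size(void)   { return sizeof(toy_zval); }   /* 16 on 64-bit */
static size_t toy_bucket_size(void) { return sizeof(toy_bucket); } /* 32 on 64-bit */
static size_t toy_array_size(void)  { return sizeof(toy_array); }  /* 56 on 64-bit */
```

Because the value is embedded, reading an element's value needs no extra pointer dereference beyond reaching the bucket itself.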

In memory, arData is a contiguous array of buckets, stored one after another in order.

Insert element

PHP guarantees that the elements of an array are stored in insertion order, so a foreach loop over the array traverses them in the order they were inserted. Suppose we have an array like this:

    $a = [9 => "foo", 2 => 42, []];
    var_dump($a);

    array(3) {
      [9]=>
      string(3) "foo"
      [2]=>
      int(42)
      [10]=>
      array(0) {
      }
    }

All of the data is contiguous in memory.

This makes the iterator logic for hash tables quite simple: it only needs to walk the arData array directly. Walking contiguous memory makes very good use of the CPU cache, because neighboring elements of arData are loaded together, so access to each element is very fast.

    size_t i;
    Bucket p;
    zval val;

    /* only the first nNumUsed slots have been written to */
    for (i = 0; i < ht->nNumUsed; i++) {
        p   = ht->arData[i];
        val = p.val;
        /* do something with val */
    }

As you can see, data is stored in arData sequentially. To achieve this structure, we need to know the location of the next available node. This position is stored in the nNumUsed field of the array struct.

Every time we add a new element, we store it in that slot and execute ht->nNumUsed++. When nNumUsed reaches the capacity of the hash table (nTableSize), the "compact or grow" algorithm is triggered.

The following is a simple example of how to insert elements into a hash table:

    idx = ht->nNumUsed++;            /* take the next available slot number */
    ht->nNumOfElements++;            /* increment the number of elements */
    /* ... */
    p = ht->arData + idx;            /* get the bucket in that slot from arData */
    p->key = key;                    /* store the key we are inserting */
    /* ... */
    p->h = h = ZSTR_H(key);          /* save the hash of the current key into the bucket */
    ZVAL_COPY_VALUE(&p->val, pData); /* copy the value into the bucket's value: add operation */

We can see that the insert operation will only insert at the end of the arData array, but will not fill in the deleted nodes.
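The append-only behavior can be sketched with a small fixed-capacity table. Everything here is hypothetical (toy_table, toy_append, the capacity of 8); a real hash table would compact or grow when full instead of returning an error.

```c
#include <stdint.h>

#define TOY_CAPACITY 8

enum { TOY_UNDEF = 0, TOY_LONG = 1 };

typedef struct { int type; long value; } toy_val;
typedef struct { toy_val val; uint64_t h; } toy_bucket;

typedef struct {
    toy_bucket data[TOY_CAPACITY];
    uint32_t   num_used;      /* next free slot, like nNumUsed */
    uint32_t   num_elements;  /* live elements, like nNumOfElements */
} toy_table;

/* Append a value at the end of the data array.
 * Returns the slot index, or -1 when the table is full. */
static int toy_append(toy_table *t, uint64_t h, long value)
{
    uint32_t idx;

    if (t->num_used == TOY_CAPACITY) {
        return -1;  /* a real table would compact or grow here */
    }
    idx = t->num_used++;
    t->num_elements++;
    t->data[idx].h = h;
    t->data[idx].val.type = TOY_LONG;
    t->data[idx].val.value = value;
    return (int)idx;
}
```

Appending the keys 9 and then 2 places them in slots 0 and 1, so the storage order is the insertion order regardless of the key values.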

Delete element

When an element is deleted from a hash table, the table does not shrink its data area. Instead, the element's zval is set to the type UNDEF to mark that slot as deleted.

Therefore, when looping over the array elements, UNDEF slots need special handling:

    size_t i;
    Bucket p;
    zval val;

    /* only the first nNumUsed slots have been written to */
    for (i = 0; i < ht->nNumUsed; i++) {
        p   = ht->arData[i];
        val = p.val;
        if (Z_TYPE(val) == IS_UNDEF) { /* empty hole ? */
            continue; /* skip it */
        }
        /* do something with val */
    }

Even for a very large hash table, it is very fast to loop every node and skip those deleted nodes, thanks to the fact that the arData nodes are always adjacent in the memory.
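Deletion by marking and iteration that skips holes can be sketched as follows. The toy_ names and the TOY_UNDEF marker are hypothetical simplifications of the zval type check.

```c
#include <stdint.h>

enum { TOY_UNDEF = 0, TOY_LONG = 1 };

typedef struct { int type; long value; } toy_bucket;

/* Mark a slot as deleted. The slot is neither reused nor reclaimed here. */
static void toy_delete(toy_bucket *data, uint32_t idx)
{
    data[idx].type = TOY_UNDEF;
}

/* Sum the live values, skipping deleted slots. */
static long toy_sum_live(const toy_bucket *data, uint32_t num_used)
{
    long sum = 0;
    uint32_t i;

    for (i = 0; i < num_used; i++) {
        if (data[i].type == TOY_UNDEF) {
            continue;  /* hole left by a deletion */
        }
        sum += data[i].value;
    }
    return sum;
}
```

The loop still visits every used slot, but the check for a hole is a single comparison on memory that is already being read.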

Locate element by hash

Given a string key, we must compute its hash and then use that hash to find the corresponding element in arData.

We cannot use the hash value directly as an index into the arData array, because then the elements could not be stored in insertion order.

For example, suppose we insert the keys foo and then bar, that foo hashes to 5, and that bar hashes to 3. If we stored foo in arData[5] and bar in arData[3], bar would come before foo, which is the opposite of the insertion order.

Instead, when we want to access the element stored under foo, we hash the key and reduce the hash to the size of the table (a modulo operation, performed in practice by combining the hash with nTableMask). The result is an index into a separate conversion table, and the value stored in that slot is the index of the element in arData.

Conversion table indexes run in the opposite direction to arData indexes: they are negative. nTableMask is equal to the negative value of the hash table size, so combining a hash with it yields a number from -1 to -8 for a table of size 8, and the conversion table slot at that offset holds the index of the required element. In summary, when allocating storage for arData, the size to allocate is tablesize * sizeof(Bucket) + tablesize * sizeof(uint32_t).
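The effect of the negative mask can be sketched in a few lines. The function name toy_hash_slot is hypothetical; the sketch assumes the usual two's complement conversion when a 32-bit unsigned value is cast to int32_t.

```c
#include <stdint.h>

/* Combine a hash with a negative table mask.
 * For table_size 8 the mask is 0xFFFFFFF8, so the result, read as a
 * signed 32-bit number, always falls between -8 and -1. */
static int32_t toy_hash_slot(uint64_t h, uint32_t table_size)
{
    uint32_t mask = (uint32_t)(-(int32_t)table_size);
    return (int32_t)((uint32_t)h | mask);
}
```

With a table of size 8, a hash of 5 gives slot -3 and a hash of 3 gives slot -5. A hash of 13 also gives -3, because only the low three bits of the hash survive, so it collides with 5.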

In the source code, two regions are clearly divided:

    #define HT_HASH_SIZE(nTableMask) \
        (((size_t)(uint32_t)-(int32_t)(nTableMask)) * sizeof(uint32_t))
    #define HT_DATA_SIZE(nTableSize) \
        ((size_t)(nTableSize) * sizeof(Bucket))
    #define HT_SIZE_EX(nTableSize, nTableMask) \
        (HT_DATA_SIZE((nTableSize)) + HT_HASH_SIZE((nTableMask)))
    #define HT_SIZE(ht) \
        HT_SIZE_EX((ht)->nTableSize, (ht)->nTableMask)

    Bucket *arData;
    arData = emalloc(HT_SIZE(ht)); /* now alloc this */

Expanding the macros gives:

(((size_t)(((ht)->nTableSize)) * sizeof(Bucket)) + (((size_t)(uint32_t)-(int32_t)(((ht)->nTableMask))) * sizeof(uint32_t)))
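The same arithmetic can be written as plain functions to check a concrete case. The toy_ function names are hypothetical, and the bucket size is passed in as a parameter so that the sketch does not depend on the real Bucket definition.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes used by the conversion table, derived from the negative mask. */
static size_t toy_hash_size(uint32_t table_mask)
{
    return (size_t)(uint32_t)(-(int32_t)table_mask) * sizeof(uint32_t);
}

/* Bytes used by the bucket array. */
static size_t toy_data_size(uint32_t table_size, size_t bucket_size)
{
    return (size_t)table_size * bucket_size;
}

/* Total bytes to allocate for both regions. */
static size_t toy_total_size(uint32_t table_size, size_t bucket_size)
{
    uint32_t mask = (uint32_t)(-(int32_t)table_size);
    return toy_data_size(table_size, bucket_size) + toy_hash_size(mask);
}
```

For a table of 8 elements and an assumed bucket size of 32 bytes, this gives 8 * 32 + 8 * 4 = 288 bytes in a single allocation.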

Collision

Next, let's look at how hash table collisions are resolved. Several keys may hash to the same conversion table slot. Therefore, when we reach the element that the slot points to, we must compare its key with the one we are looking for. If they differ, we follow the zval.u2.next field to the next element in the chain.

Note that this linked list is not laid out in memory like a traditional linked list. The links are indexes into the arData array, so following them stays inside one contiguous block instead of chasing pointers to scattered heap addresses.

This is an important source of the performance gains in PHP 7. Data locality means the CPU has to go to slow main memory less often, because much of the data can be served from the CPU caches.
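The chain-by-index idea can be sketched with a tiny table. All names are hypothetical, the hash is an illustrative times-33 function, and for readability the sketch keeps the conversion table in a separate array with positive slot numbers rather than the negative-offset layout described above. It also assumes the key is not already present when toy_add is called.

```c
#include <stdint.h>
#include <string.h>

#define TOY_SIZE    8u
#define TOY_INVALID UINT32_MAX

typedef struct {
    const char *key;
    long        value;
    uint64_t    h;
    uint32_t    next;  /* index of the next bucket in the chain */
} toy_bucket;

typedef struct {
    uint32_t   hash[TOY_SIZE];  /* slot -> index of first bucket in chain */
    toy_bucket data[TOY_SIZE];  /* buckets in insertion order */
    uint32_t   num_used;
} toy_table;

static uint64_t toy_hash(const char *s)
{
    uint64_t h = 5381;
    while (*s != '\0') {
        h = h * 33 + (unsigned char)*s++;
    }
    return h;
}

static void toy_init(toy_table *t)
{
    uint32_t i;
    for (i = 0; i < TOY_SIZE; i++) {
        t->hash[i] = TOY_INVALID;
    }
    t->num_used = 0;
}

/* Append a bucket and push it onto the front of its slot's chain. */
static int toy_add(toy_table *t, const char *key, long value)
{
    uint32_t idx, slot;
    toy_bucket *p;

    if (t->num_used == TOY_SIZE) {
        return -1;
    }
    idx = t->num_used++;
    p = &t->data[idx];
    p->key = key;
    p->value = value;
    p->h = toy_hash(key);
    slot = (uint32_t)(p->h & (TOY_SIZE - 1));
    p->next = t->hash[slot];
    t->hash[slot] = idx;
    return (int)idx;
}

/* Walk the chain for the key's slot, comparing hash and then key. */
static toy_bucket *toy_find(toy_table *t, const char *key)
{
    uint64_t h = toy_hash(key);
    uint32_t idx = t->hash[h & (TOY_SIZE - 1)];

    while (idx != TOY_INVALID) {
        toy_bucket *p = &t->data[idx];
        if (p->h == h && strcmp(p->key, key) == 0) {
            return p;
        }
        idx = p->next;
    }
    return NULL;
}
```

With this hash, "a" and "i" share slot 6. After adding both, the slot points at the bucket for "i", whose next field holds index 0, the bucket for "a".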

Therefore, we can see that adding an element to the hash table is as follows:

    idx = ht->nNumUsed++;
    ht->nNumOfElements++;
    if (ht->nInternalPointer == HT_INVALID_IDX) {
        ht->nInternalPointer = idx;
    }
    zend_hash_iterators_update(ht, HT_INVALID_IDX, idx);
    p = ht->arData + idx;
    p->key = key;
    if (!ZSTR_IS_INTERNED(key)) {
        zend_string_addref(key);
        ht->u.flags &= ~HASH_FLAG_STATIC_KEYS;
        zend_string_hash_val(key);
    }
    p->h = h = ZSTR_H(key);
    ZVAL_COPY_VALUE(&p->val, pData);
    nIndex = h | ht->nTableMask;
    Z_NEXT(p->val) = HT_HASH(ht, nIndex);
    HT_HASH(ht, nIndex) = HT_IDX_TO_HASH(idx);

The same rule applies to deleting elements:

    #define HT_HASH_TO_BUCKET_EX(data, idx) ((data) + (idx))
    #define HT_HASH_TO_BUCKET(ht, idx) HT_HASH_TO_BUCKET_EX((ht)->arData, idx)

    h = zend_string_hash_val(key); /* get the hash from the key (assuming string key here) */
    nIndex = h | ht->nTableMask;   /* get the translation table index */

    idx = HT_HASH(ht, nIndex);     /* get the slot corresponding to that translation index */
    while (idx != HT_INVALID_IDX) { /* if there is a corresponding slot */
        p = HT_HASH_TO_BUCKET(ht, idx); /* get the bucket from that slot */
        if ((p->key == key) || /* is it the right bucket ? same key pointer ? */
            (p->h == h && /* ... or same hash */
             p->key && /* and a key (string key based) */
             ZSTR_LEN(p->key) == ZSTR_LEN(key) && /* and same key length */
             memcmp(ZSTR_VAL(p->key), ZSTR_VAL(key), ZSTR_LEN(key)) == 0)) { /* and same key content ? */
            _zend_hash_del_el_ex(ht, idx, p, prev); /* that's us ! delete us */
            return SUCCESS;
        }
        prev = p;
        idx = Z_NEXT(p->val); /* get the next corresponding slot from current one */
    }
    return FAILURE;

Conversion table and hash table initialization

HT_INVALID_IDX is a special marker: a conversion table slot that holds it points to no valid element, so it is skipped directly.

The hash table greatly reduces the overhead of arrays that are created but never filled, thanks to a two-step initialization process. When a new hash table is created, it only gets a conversion table of two slots, both set to HT_INVALID_IDX.

    #define HT_MIN_MASK ((uint32_t) -2)
    #define HT_HASH_SIZE(nTableMask) \
        (((size_t)(uint32_t)-(int32_t)(nTableMask)) * sizeof(uint32_t))
    #define HT_SET_DATA_ADDR(ht, ptr) do { \
            (ht)->arData = (Bucket*)(((char*)(ptr)) + HT_HASH_SIZE((ht)->nTableMask)); \
        } while (0)

    static const uint32_t uninitialized_bucket[-HT_MIN_MASK] = {HT_INVALID_IDX, HT_INVALID_IDX};

    /* hash lazy init */
    ZEND_API void ZEND_FASTCALL _zend_hash_init(HashTable *ht, uint32_t nSize, dtor_func_t pDestructor, zend_bool persistent ZEND_FILE_LINE_DC)
    {
        /* ... */
        ht->nTableSize = zend_hash_check_size(nSize);
        ht->nTableMask = HT_MIN_MASK;
        HT_SET_DATA_ADDR(ht, &uninitialized_bucket);
        ht->nNumUsed = 0;
        ht->nNumOfElements = 0;
    }

Note that no heap allocation takes place here. The table points at a static memory area instead, which is cheaper.

Then, when the first element is inserted, the hash table is fully initialized and the real conversion table is created (if no size was specified for the array, the default is 8). This time the memory is allocated on the heap.

    #define HT_HASH_EX(data, idx) ((uint32_t*)(data))[(int32_t)(idx)]
    #define HT_HASH(ht, idx) HT_HASH_EX((ht)->arData, idx)

    (ht)->nTableMask = -(ht)->nTableSize;
    HT_SET_DATA_ADDR(ht, pemalloc(HT_SIZE(ht), (ht)->u.flags & HASH_FLAG_PERSISTENT));
    memset(&HT_HASH(ht, (ht)->nTableMask), HT_INVALID_IDX, HT_HASH_SIZE((ht)->nTableMask));
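The two-step initialization can be sketched in isolation. The toy_ names are hypothetical, only the conversion table is modeled (no buckets), and malloc stands in for the engine's own allocator.

```c
#include <stdint.h>
#include <stdlib.h>

#define TOY_INVALID      UINT32_MAX
#define TOY_DEFAULT_SIZE 8u

/* Shared, read-only two-slot table used by every empty table. */
static const uint32_t toy_uninitialized[2] = { TOY_INVALID, TOY_INVALID };

typedef struct {
    const uint32_t *hash;        /* conversion table currently in use */
    uint32_t       *heap;        /* heap copy, NULL until first insert */
    uint32_t        table_size;  /* capacity to allocate later */
} toy_table;

/* Step 1: cheap init, no heap allocation. */
static void toy_init(toy_table *t, uint32_t size)
{
    t->hash = toy_uninitialized;
    t->heap = NULL;
    t->table_size = size ? size : TOY_DEFAULT_SIZE;
}

/* Step 2: real init, called before the first insert.
 * Returns 0 on success, -1 if the allocation fails. */
static int toy_real_init(toy_table *t)
{
    uint32_t *mem;
    uint32_t i;

    if (t->heap != NULL) {
        return 0;  /* already initialized */
    }
    mem = malloc((size_t)t->table_size * sizeof(uint32_t));
    if (mem == NULL) {
        return -1;
    }
    for (i = 0; i < t->table_size; i++) {
        mem[i] = TOY_INVALID;
    }
    t->heap = mem;
    t->hash = mem;
    return 0;
}

static void toy_destroy(toy_table *t)
{
    free(t->heap);
    t->heap = NULL;
    t->hash = toy_uninitialized;
}
```

An array that is created and destroyed without ever receiving an element never touches the allocator in this scheme.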

The HT_HASH macro uses a negative offset to access slots in the conversion table. The mask of the hash table is always negative, because conversion table indexes run in the opposite direction to arData indexes. This is one of the strengths of C programming: a pointer can be indexed with a negative offset, with no penalty on memory access.
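The negative-offset layout can be sketched with a single allocation that holds the conversion table first and the buckets after it. The toy_ names are hypothetical, malloc replaces the engine's allocator, and the sketch assumes an even table size so that the bucket area stays 8-byte aligned.

```c
#include <stdint.h>
#include <stdlib.h>

#define TOY_INVALID UINT32_MAX

typedef struct { int64_t value; uint64_t h; } toy_bucket;

/* Access conversion table slot idx (negative) relative to the data pointer. */
static uint32_t *toy_hash_slot(toy_bucket *data, int32_t idx)
{
    return &((uint32_t *)data)[idx];
}

/* Allocate [conversion table][buckets] as one block and return a pointer
 * to the first bucket. The conversion table sits just before it. */
static toy_bucket *toy_alloc(uint32_t size)
{
    size_t hash_bytes = (size_t)size * sizeof(uint32_t);
    char *mem = malloc(hash_bytes + (size_t)size * sizeof(toy_bucket));
    toy_bucket *data;
    int32_t i;

    if (mem == NULL) {
        return NULL;
    }
    data = (toy_bucket *)(void *)(mem + hash_bytes);
    for (i = -(int32_t)size; i < 0; i++) {
        *toy_hash_slot(data, i) = TOY_INVALID;
    }
    return data;
}

/* Free the block, stepping back to where the allocation really starts. */
static void toy_free(toy_bucket *data, uint32_t size)
{
    if (data != NULL) {
        free((char *)data - (size_t)size * sizeof(uint32_t));
    }
}
```

One pointer therefore serves both regions: non-negative indexes reach buckets, and negative indexes reach conversion table slots.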

[Figure: structure of a lazily initialized hash table]

The compaction algorithm traverses all the elements in arData and moves each live element down into the earliest UNDEF hole, rebuilding the conversion table as it goes. As follows:

    Bucket *p;
    uint32_t nIndex, i;

    HT_HASH_RESET(ht);
    i = 0;
    p = ht->arData;

    do {
        if (UNEXPECTED(Z_TYPE(p->val) == IS_UNDEF)) {
            uint32_t j = i;
            Bucket *q = p;
            while (++i < ht->nNumUsed) {
                p++;
                if (EXPECTED(Z_TYPE_INFO(p->val) != IS_UNDEF)) {
                    ZVAL_COPY_VALUE(&q->val, &p->val);
                    q->h = p->h;
                    nIndex = q->h | ht->nTableMask;
                    q->key = p->key;
                    Z_NEXT(q->val) = HT_HASH(ht, nIndex);
                    HT_HASH(ht, nIndex) = HT_IDX_TO_HASH(j);
                    if (UNEXPECTED(ht->nInternalPointer == i)) {
                        ht->nInternalPointer = j;
                    }
                    q++;
                    j++;
                }
            }
            ht->nNumUsed = j;
            break;
        }
        nIndex = p->h | ht->nTableMask;
        Z_NEXT(p->val) = HT_HASH(ht, nIndex);
        HT_HASH(ht, nIndex) = HT_IDX_TO_HASH(i);
        p++;
    } while (++i < ht->nNumUsed);
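Stripped of the conversion table bookkeeping, the core of compaction is a two-index sweep. The sketch below is a simplification under that assumption: the toy_ names are hypothetical and it does not rebuild any hash chains or adjust an internal pointer.

```c
#include <stdint.h>

enum { TOY_UNDEF = 0, TOY_LONG = 1 };

typedef struct { int type; long value; } toy_bucket;

/* Slide live buckets down over the holes, keeping their order.
 * Returns the new number of used slots. */
static uint32_t toy_compact(toy_bucket *data, uint32_t num_used)
{
    uint32_t i, j = 0;

    for (i = 0; i < num_used; i++) {
        if (data[i].type == TOY_UNDEF) {
            continue;           /* hole: nothing to keep */
        }
        if (i != j) {
            data[j] = data[i];  /* move the live bucket into the gap */
        }
        j++;
    }
    return j;
}
```

Because elements only move toward the front and never pass each other, the insertion order is preserved.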

Conclusion

This covers the basics of the PHP hash table implementation. Some more advanced material on hash tables has not been translated here, because I plan to move on to other parts of the PHP kernel. If you are interested in hash tables, please refer to the original text.
