PHP hash tables in detail

Source: Internet
Author: User
Tags php source code
In the PHP kernel, one of the most important data structures is the HashTable. The arrays we use every day are implemented with it. So how is PHP's HashTable implemented? I had recently been studying the hash table data structure, but the algorithm books I had did not give a concrete implementation. Since I was also reading the PHP source code at the time, I used PHP's HashTable as a reference and wrote a simple hash table of my own. Below I share some of what I learned.

I have put a simple hash table implementation on GitHub: Hashtable implementation

I have also added fairly detailed comments to the PHP source code on GitHub, and stars are welcome if you find them useful: PHP5.4 source annotation. The comments added so far can be browsed through the commit history.

Introduction to hash tables

A hash table is an efficient data structure that implements dictionary operations.

Definition

Simply put, a hash table is a data structure that stores key-value pairs and supports insert, find, delete and similar operations. Under some reasonable assumptions, the average time complexity of all these operations is O(1) (readers interested in the proof can look it up).

Key to implementing a hash table

In a hash table, the key itself is not used as the array subscript. Instead, a hash function computes the key's hash value, and that value is used as the subscript. Lookups and deletions recompute the hash value of the key to quickly locate where the element is stored.

Different keys may produce the same hash value; this is called a hash collision. Resolving a collision means handling two or more keys that share one hash value. There are many ways to do this, such as open addressing and separate chaining (the "zipper" method).

Therefore, the key to implementing a good hash table is a good hash function and a good method of handling hash collisions.

Hash function

The following three criteria are used to judge whether a hash function is good or bad:

* Consistency: equal keys must produce equal hash values.
* Efficiency: the hash is cheap to compute.
* Uniformity: the keys are spread evenly over the hash values.

The hash function establishes the correspondence between a key and its hash value: h = hash_func(key).

(Figure: the mapping from keys to hash values.)

Designing a perfect hash function is a job for experts; the rest of us can simply use one of the mature hash functions that already exist. The hash function used by the PHP kernel is the time33 function, also called DJBX33A, which is implemented as follows:

    static inline ulong zend_inline_hash_func(const char *arKey, uint nKeyLength)
    {
        register ulong hash = 5381;

        /* variant with the hash unrolled eight times */
        for (; nKeyLength >= 8; nKeyLength -= 8) {
            hash = ((hash << 5) + hash) + *arKey++;
            hash = ((hash << 5) + hash) + *arKey++;
            hash = ((hash << 5) + hash) + *arKey++;
            hash = ((hash << 5) + hash) + *arKey++;
            hash = ((hash << 5) + hash) + *arKey++;
            hash = ((hash << 5) + hash) + *arKey++;
            hash = ((hash << 5) + hash) + *arKey++;
            hash = ((hash << 5) + hash) + *arKey++;
        }
        switch (nKeyLength) {
            case 7: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
            case 6: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
            case 5: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
            case 4: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
            case 3: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
            case 2: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
            case 1: hash = ((hash << 5) + hash) + *arKey++; break;
            case 0: break;
            EMPTY_SWITCH_DEFAULT_CASE()
        }
        return hash;
    }

Note: the function combines a loop unrolled eight times with a switch. Unrolling reduces the number of loop iterations, and the switch then processes the remaining bytes (fewer than eight) that the loop did not reach.
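To see what the unrolled code computes, here is a minimal sketch of the same recurrence as a plain loop. The name djbx33a_simple is made up for this illustration and is not part of the PHP source; the sketch assumes the only difference from the kernel function is the missing unrolling.

```c
#include <stddef.h>

/* Plain-loop form of DJBX33A: start at 5381, then hash = hash * 33 + byte.
 * (hash << 5) + hash is the multiplication by 33 used in the kernel code.
 * djbx33a_simple is a hypothetical name, not a PHP kernel function. */
unsigned long djbx33a_simple(const char *key, size_t len)
{
    unsigned long hash = 5381;
    size_t i;

    for (i = 0; i < len; i++) {
        hash = ((hash << 5) + hash) + (unsigned long)key[i];
    }
    return hash;
}
```

Calling djbx33a_simple("foo", 3) yields 193491849, the value used for "foo" later in this article.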

Separate chaining (zipper method)

Storing all elements that share a hash value in one linked list is called separate chaining, or the zipper method. A lookup first computes the hash value of the key, then finds the linked list for that hash value, and finally walks the list in order until it finds the matching entry.

(Figure: the storage structure used by separate chaining.)
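The lookup described above can be sketched in a few lines of C. This is a toy table for illustration only, not PHP's implementation: the fixed SLOT_COUNT, the ChainTable and Node types and the chain_* functions are all assumptions of the sketch, and keys are not copied.

```c
#include <stdlib.h>
#include <string.h>

#define SLOT_COUNT 8   /* hypothetical fixed size; must be a power of two */

typedef struct Node {
    const char  *key;     /* not copied: the caller keeps the string alive */
    int          value;
    struct Node *next;    /* next entry that landed in the same slot */
} Node;

typedef struct {
    Node *slots[SLOT_COUNT];
} ChainTable;

static unsigned long chain_hash(const char *key)
{
    unsigned long h = 5381;
    while (*key) {
        h = h * 33 + (unsigned char)*key++;
    }
    return h;
}

/* Look up key: pick the slot, then walk its list in order. */
Node *chain_find(ChainTable *t, const char *key)
{
    Node *n = t->slots[chain_hash(key) & (SLOT_COUNT - 1)];
    while (n != NULL && strcmp(n->key, key) != 0) {
        n = n->next;
    }
    return n;
}

/* Insert or overwrite; new nodes go to the head of the slot's list.
 * Returns 0 on success, -1 if memory allocation fails. */
int chain_put(ChainTable *t, const char *key, int value)
{
    Node *n = chain_find(t, key);
    unsigned long idx;

    if (n != NULL) {
        n->value = value;
        return 0;
    }
    n = malloc(sizeof *n);
    if (n == NULL) {
        return -1;
    }
    idx = chain_hash(key) & (SLOT_COUNT - 1);
    n->key = key;
    n->value = value;
    n->next = t->slots[idx];
    t->slots[idx] = n;
    return 0;
}
```

A table declared as ChainTable t = {{0}}; starts with every slot empty, and colliding keys simply end up in the same list.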

PHP's HashTable structure

With the basic hash table data structure introduced, let's look at how the hash table is implemented in PHP.

Definition of HashTable in the PHP kernel:

    typedef struct _hashtable {
        uint nTableSize;
        uint nTableMask;
        uint nNumOfElements;
        ulong nNextFreeElement;
        Bucket *pInternalPointer;
        Bucket *pListHead;
        Bucket *pListTail;
        Bucket **arBuckets;
        dtor_func_t pDestructor;
        zend_bool persistent;
        unsigned char nApplyCount;
        zend_bool bApplyProtection;
    #if ZEND_DEBUG
        int inconsistent;
    #endif
    } HashTable;

nTableSize, the size of the hash table; it grows by doubling, so it is always a power of 2

nTableMask, the mask that is ANDed with a hash value to obtain an index into arBuckets; once arBuckets has been initialized it is always nTableSize - 1

nNumOfElements, the number of elements currently in the hash table; the count function returns this value directly

nNextFreeElement, the next free numeric index for arrays with numeric keys

pInternalPointer, the internal pointer, which points to the current element and is used when traversing elements

pListHead, points to the first element of the hash table, which is also the first element of the array

pListTail, points to the last element of the hash table, which is also the last element of the array. Together with the pointers above, this makes it convenient to traverse the array, for example in the reset and end APIs

arBuckets, an array of doubly linked lists of Buckets; the index is obtained by ANDing the key's hash value with nTableMask

pDestructor, the destructor called when an element is deleted from the hash table

persistent, selects the memory allocation functions: if true, the operating system's own allocation functions are used; otherwise PHP's memory allocation functions are used

nApplyCount, records how many times the hash table is currently being accessed recursively, to prevent endless recursion

bApplyProtection, indicates whether the hash table uses recursion protection; the default is 1 (enabled)

An example of combining a hash value with the mask:

The full hash value of "foo" (using the DJBX33A hash function) is 193491849. If the hash table has a capacity of 64, this value obviously cannot be used directly as an array subscript. Instead, the table's mask is applied to it, so that only the low bits of the hash value are kept.

      Hash      193491849     0b1011100010000111001110001001
    & Mask    &        63   & 0b0000000000000000000000111111
    ----------------------------------------------------------
    = Index   =         9   = 0b0000000000000000000000001001

Therefore, in this hash table, "foo" is stored in the bucket list at index 9 of arBuckets.
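The same calculation as a small C function. The name slot_index is hypothetical; the sketch assumes, as in the text above, that the table size is a power of two so that the mask is the size minus one.

```c
/* Map a full hash value to a slot of a table whose size is a power of two.
 * The mask is table_size - 1, so the AND keeps only the low bits.
 * slot_index is an illustrative name, not a PHP kernel function. */
unsigned long slot_index(unsigned long hash, unsigned long table_size)
{
    unsigned long mask = table_size - 1;
    return hash & mask;
}
```

For a power-of-two size, hash & mask gives the same result as hash % table_size, but with a single AND instead of a division.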

Definition of the Bucket structure

    typedef struct bucket {
        ulong h;
        uint nKeyLength;
        void *pData;
        void *pDataPtr;
        struct bucket *pListNext;
        struct bucket *pListLast;
        struct bucket *pNext;
        struct bucket *pLast;
        const char *arKey;
    } Bucket;

h, the hash value of the key (or the key itself for numeric keys)

nKeyLength, the length of the key

pData, pointer to the data

pDataPtr, holds the data directly when the data is itself a pointer

pListNext, points to the next element in the hash table's global doubly linked list (insertion order)

pListLast, points to the previous element in the hash table's global doubly linked list

pNext, points to the next element in the bucket list for the same arBuckets index

pLast, points to the previous element in the bucket list for the same arBuckets index

arKey, the key itself

PHP's HashTable is implemented as a vector plus doubly linked lists. The vector is stored in arBuckets and holds Bucket pointers, each of which points to a doubly linked list of Buckets. New elements are inserted at the head, so the newest element is always first in its bucket list. As the definitions above show, PHP's hash table implementation is quite complex. That is the price of offering such a flexible array type.

(Figure: an example of a HashTable in PHP.)

HashTable related APIs

* zend_hash_init
* zend_hash_add_or_update
* zend_hash_find
* zend_hash_del_key_or_index

zend_hash_init

Function execution steps

1. Set the hash table size.

2. Set initial values for the other members of the struct (including pDestructor, the destructor used to free memory).

For detailed code and annotations, see: zend_hash_init source

Note:

1. The pHashFunction parameter is not used here; PHP always hashes with the internal zend_inline_hash_func.

2. zend_hash_init does not actually allocate memory for arBuckets or compute nTableMask. Both happen when an element is inserted and the CHECK_INIT check finds that the table has not been initialized yet.
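The deferred allocation in note 2 can be sketched as follows. LazyTable, lazy_init and lazy_check_init are hypothetical names for this illustration. The minimum size of 8, rounding the requested size up to a power of two, and using a mask of 0 to mean "not allocated yet" are stated here as assumptions about the PHP 5.4 behavior rather than quotations from it.

```c
#include <stdlib.h>

typedef struct {
    unsigned int  size;     /* number of slots, always a power of two */
    unsigned int  mask;     /* size - 1 once slots exist, 0 before that */
    void        **slots;    /* NULL until the first insertion */
} LazyTable;

/* Record the size (rounded up to a power of two, minimum 8 by assumption)
 * but do not allocate anything yet. */
void lazy_init(LazyTable *t, unsigned int requested)
{
    unsigned int size = 8;
    while (size < requested && size < 0x80000000u) {
        size <<= 1;
    }
    t->size = size;
    t->mask = 0;
    t->slots = NULL;
}

/* Run at the start of every insertion: allocate on first use only.
 * Returns 0 on success, -1 if allocation fails. */
int lazy_check_init(LazyTable *t)
{
    if (t->slots == NULL) {
        t->slots = calloc(t->size, sizeof *t->slots);
        if (t->slots == NULL) {
            return -1;
        }
        t->mask = t->size - 1;
    }
    return 0;
}
```

The benefit of this design is that an array that is created but never written to costs no slot memory at all.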

zend_hash_add_or_update

Function execution steps

1. Check the length of the key.

2. Check initialization.

3. Compute the hash value and the subscript.

4. Traverse the bucket list for that subscript. If the same key is found and the value is to be updated, update the data; otherwise move on to the next element until the end of the bucket list is reached.

5. Allocate a Bucket for the new element, set its fields, and add it to the hash table.

6. Resize the hash table if it is full.

Function execution flowchart

CONNECT_TO_BUCKET_DLLIST adds the new element to the bucket list of elements with the same subscript.

CONNECT_TO_GLOBAL_DLLIST adds the new element to the hash table's global doubly linked list.

For detailed code and annotations, see: zend_hash_add_or_update code annotations.
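Steps 3 to 5 can be sketched in C as below. This is a simplified illustration, not the kernel function: STable, SBucket, s_hash and s_add_or_update are hypothetical names, values are plain ints, keys are not copied, and the length check, lazy initialization and resize steps are left out. The two linking sections only play the role that the two CONNECT_TO_* macros play in the kernel.

```c
#include <stdlib.h>
#include <string.h>

typedef struct SBucket {
    unsigned long   h;                      /* full hash of the key */
    const char     *key;                    /* not copied in this sketch */
    int             value;
    struct SBucket *pNext, *pLast;          /* chain of one slot */
    struct SBucket *pListNext, *pListLast;  /* insertion-order list */
} SBucket;

typedef struct {
    unsigned int mask;          /* slot count - 1 */
    unsigned int count;
    SBucket    **slots;
    SBucket     *head, *tail;   /* ends of the insertion-order list */
} STable;

static unsigned long s_hash(const char *key)
{
    unsigned long h = 5381;
    while (*key) {
        h = h * 33 + (unsigned char)*key++;
    }
    return h;
}

/* Returns 0 on success, -1 on allocation failure. */
int s_add_or_update(STable *t, const char *key, int value)
{
    unsigned long h = s_hash(key);
    unsigned int idx = (unsigned int)(h & t->mask);
    SBucket *p;

    /* Walk the slot's chain; update in place if the key already exists. */
    for (p = t->slots[idx]; p != NULL; p = p->pNext) {
        if (p->h == h && strcmp(p->key, key) == 0) {
            p->value = value;
            return 0;
        }
    }

    p = malloc(sizeof *p);
    if (p == NULL) {
        return -1;
    }
    p->h = h;
    p->key = key;
    p->value = value;

    /* Role of CONNECT_TO_BUCKET_DLLIST: push onto the head of the chain. */
    p->pNext = t->slots[idx];
    p->pLast = NULL;
    if (p->pNext != NULL) {
        p->pNext->pLast = p;
    }
    t->slots[idx] = p;

    /* Role of CONNECT_TO_GLOBAL_DLLIST: append to the ordered list. */
    p->pListLast = t->tail;
    p->pListNext = NULL;
    if (t->tail != NULL) {
        t->tail->pListNext = p;
    } else {
        t->head = p;
    }
    t->tail = p;

    t->count++;
    return 0;
}
```

Note that updating an existing key leaves its position in the insertion-order list unchanged, which is why PHP arrays keep their original order when a value is overwritten.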

zend_hash_find

Function execution steps

1. Compute the hash value and the subscript.

2. Traverse the bucket list for that subscript. If the Bucket holding the key is found, return its value; otherwise move on to the next Bucket until the end of the bucket list is reached.

For detailed code and annotations, see: zend_hash_find code annotations.
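A minimal sketch of step 2, assuming the caller has already computed the hash. FBucket and f_find are hypothetical names, and the struct keeps only the fields the lookup needs. Comparing the stored hash and key length before the key bytes is one reasonable ordering, chosen here so that the byte comparison runs only for likely matches.

```c
#include <stddef.h>
#include <string.h>

typedef struct FBucket {
    unsigned long   h;        /* full hash of the key */
    size_t          key_len;
    const char     *key;
    int             value;
    struct FBucket *pNext;    /* next bucket in the same slot */
} FBucket;

/* Walk one slot's chain. Cheap checks (hash, length) come first so that
 * memcmp only runs for likely matches. Returns NULL if the key is absent. */
FBucket *f_find(FBucket *const *slots, unsigned long mask,
                unsigned long h, const char *key, size_t key_len)
{
    FBucket *p;

    for (p = slots[h & mask]; p != NULL; p = p->pNext) {
        if (p->h == h && p->key_len == key_len &&
            memcmp(p->key, key, key_len) == 0) {
            return p;
        }
    }
    return NULL;
}
```

The cost of a lookup is proportional to the length of one chain, which is why evenly spread hash values matter so much.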

zend_hash_del_key_or_index

Function execution steps

1. Compute the hash value and the subscript of the key.

2. Traverse the bucket list for that subscript. If the Bucket holding the key is found, go to step 3; otherwise move on to the next Bucket until the end of the bucket list is reached.

3. If the element to delete is the first one in its bucket list, point arBuckets[nIndex] at the second element; otherwise, set the pNext of the previous element to the current element's pNext.

4. Adjust the related pointers.

5. Free the data memory and the Bucket structure itself.

For detailed code and annotations, see: zend_hash_del_key_or_index code annotations.
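Steps 2 to 4 amount to unlinking one node from two doubly linked lists. The sketch below uses numeric keys only and hypothetical names (DTable, DBucket, d_detach). Unlike the kernel function, it does not call a destructor or free anything: it hands the detached bucket back to the caller, which keeps the example independent of how buckets were allocated.

```c
#include <stddef.h>

typedef struct DBucket {
    unsigned long   h;                      /* numeric key, used as the hash */
    struct DBucket *pNext, *pLast;          /* chain of one slot */
    struct DBucket *pListNext, *pListLast;  /* insertion-order list */
} DBucket;

typedef struct {
    unsigned int mask;
    unsigned int count;
    DBucket    **slots;
    DBucket     *head, *tail;
} DTable;

/* Detach the bucket with numeric key h from both lists.
 * Returns the detached bucket (the caller releases it), or NULL if absent. */
DBucket *d_detach(DTable *t, unsigned long h)
{
    unsigned int idx = (unsigned int)(h & t->mask);
    DBucket *p;

    for (p = t->slots[idx]; p != NULL; p = p->pNext) {
        if (p->h == h) {
            break;
        }
    }
    if (p == NULL) {
        return NULL;
    }

    /* Unlink from the slot's chain. */
    if (p == t->slots[idx]) {
        t->slots[idx] = p->pNext;   /* first element: slot starts at the second */
    } else {
        p->pLast->pNext = p->pNext;
    }
    if (p->pNext != NULL) {
        p->pNext->pLast = p->pLast;
    }

    /* Unlink from the insertion-order list. */
    if (p->pListLast != NULL) {
        p->pListLast->pListNext = p->pListNext;
    } else {
        t->head = p->pListNext;
    }
    if (p->pListNext != NULL) {
        p->pListNext->pListLast = p->pListLast;
    } else {
        t->tail = p->pListLast;
    }

    p->pNext = p->pLast = p->pListNext = p->pListLast = NULL;
    t->count--;
    return p;
}
```

Because every bucket carries back pointers for both lists, the unlinking itself takes constant time once the bucket has been found.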

Performance analysis

Advantages: PHP's HashTable makes array operations very convenient, and it performs well when creating arrays and when adding or deleting elements. Its shortcomings become more obvious with large amounts of data, in terms of both time and space complexity.

The shortcomings are as follows:

* The zval structure that holds the data must be allocated separately, and this extra memory has to be managed; each zval takes 16 bytes of memory.

* Each new Bucket is also allocated separately and also needs 16 bytes of memory.

* To support traversal in order, the whole hash table is connected by a doubly linked list, which adds many more pointers, each of which also takes 16 bytes of memory.

* During a lookup, if the element is at the end of its bucket list, the complete bucket list has to be traversed to find the value.

In short, the shortcomings of PHP's HashTable come mainly from the pointers of its doubly linked lists and from the separate allocations for zvals and Buckets, which cost a lot of extra memory and make lookups slower.

Follow-up

The shortcomings mentioned above are addressed well in PHP 7, which reworked the kernel's data structures substantially and made PHP more efficient, so PHP developers are encouraged to upgrade the versions they develop and deploy on. Take a look at the following PHP code:

    <?php
    $size = pow(2, 16);

    $startTime = microtime(true);
    $array = array();
    for ($key = 0, $maxKey = ($size - 1) * $size; $key <= $maxKey; $key += $size) {
        $array[$key] = 0;
    }
    $endTime = microtime(true);
    echo 'Inserting ', $size, ' malicious elements takes ', $endTime - $startTime, ' seconds', "\n";

    $startTime = microtime(true);
    $array = array();
    for ($key = 0, $maxKey = $size - 1; $key <= $maxKey; ++$key) {
        $array[$key] = 0;
    }
    $endTime = microtime(true);
    echo 'Inserting ', $size, ' normal elements takes ', $endTime - $startTime, ' seconds', "\n";

The demo above compares the time taken when every insertion causes a hash collision with the time taken when there are no collisions. I ran this code under PHP 5.4, with the following result:

Inserting 65536 malicious elements takes 43.72204709053 seconds

Inserting 65536 normal elements takes 0.009843111038208 seconds

And the result of running it on PHP 7:

Inserting 65536 malicious elements takes 4.4028408527374 seconds

Inserting 65536 normal elements takes 0.0018510818481445 seconds

PHP 7 is much faster for array operations both with and without collisions, and the absolute time saved is largest in the colliding case. Why PHP 7 performs so much better is worth digging into further.
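Why the first loop of the PHP script is "malicious" can be checked with a small C sketch. It assumes that an integer key is used directly as its hash value and that the slot is the key ANDed with the table size minus one; keys_in_slot is a hypothetical helper written for this illustration.

```c
#include <stddef.h>

/* Count how many of the keys 0, step, 2*step, ... (count keys in total)
 * map to the given slot of a table with table_size slots (a power of two).
 * Assumes an integer key is its own hash, so the slot is key & mask. */
size_t keys_in_slot(unsigned long step, size_t count,
                    unsigned long table_size, unsigned long slot)
{
    unsigned long mask = table_size - 1;
    size_t i, hits = 0;

    for (i = 0; i < count; i++) {
        unsigned long key = (unsigned long)i * step;
        if ((key & mask) == slot) {
            hits++;
        }
    }
    return hits;
}
```

With a step of 65536 and a table of 65536 slots, all 65536 keys land in slot 0, so each insertion has to walk one ever-growing list. With a step of 1, only a single key lands there.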
