Understanding the PHP kernel (6): hash tables and the PHP hash table implementation

Source: Internet
Author: User

Link: http://www.orlion.ga/241/

I. Hash table (HashTable)

Hash tables are used in the implementation of most dynamic languages. A hash table is a data structure that uses a hash function to map a specific key to a specific value, maintaining a one-to-one correspondence between keys and values.

Key: the data used to identify an entry, such as the numeric index or string key of a PHP array.

Slot/bucket: a unit of storage in the hash table, that is, the container in which the data is actually stored.

Hash function: a function that maps a key to the location of the slot where its data is stored.

Hash collision: the situation in which the hash function maps two different keys to the same index.

There are two common ways to resolve hash collisions: chaining and open addressing.

1. Conflict resolution

(1) Chaining

Chaining resolves collisions by keeping the values of a slot in a linked list: when different keys map to the same slot, the colliding entries are stored in that slot's linked list. (This is the method PHP uses.)

(2) Open addressing

With open addressing, the data is stored directly in the slots themselves. When inserting, if the slot that the key maps to already holds data, a collision has occurred, and the next slot is probed. If that slot is also occupied, probing continues until a free slot is found. Lookups follow the same probe sequence.
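The probing described above can be sketched in C as follows. This is only an illustration under stated assumptions, not code from PHP (which uses chaining): the names OaTable, OaSlot, oa_hash, oa_insert and oa_lookup are hypothetical, the capacity is fixed at 8, the hash is a toy byte sum, and deletion is not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define OA_SIZE 8  /* hypothetical fixed capacity for this sketch */

typedef struct {
    const char *key;   /* NULL marks an empty slot */
    int value;
} OaSlot;

typedef struct {
    OaSlot slots[OA_SIZE];
} OaTable;

/* Toy hash: sum of the bytes of the key. */
static unsigned oa_hash(const char *key)
{
    unsigned h = 0;
    while (*key != '\0') {
        h += (unsigned char)*key++;
    }
    return h;
}

/* Returns 0 on success, -1 if every slot is occupied. */
static int oa_insert(OaTable *t, const char *key, int value)
{
    assert(t != NULL && key != NULL);
    unsigned start = oa_hash(key) % OA_SIZE;
    for (unsigned i = 0; i < OA_SIZE; i++) {
        OaSlot *s = &t->slots[(start + i) % OA_SIZE];
        if (s->key == NULL || strcmp(s->key, key) == 0) {
            s->key = key;      /* the sketch stores the pointer, not a copy */
            s->value = value;
            return 0;
        }
    }
    return -1;
}

/* Returns 0 and writes *out on a hit, -1 on a miss. */
static int oa_lookup(const OaTable *t, const char *key, int *out)
{
    assert(t != NULL && key != NULL && out != NULL);
    unsigned start = oa_hash(key) % OA_SIZE;
    for (unsigned i = 0; i < OA_SIZE; i++) {
        const OaSlot *s = &t->slots[(start + i) % OA_SIZE];
        if (s->key == NULL) {
            return -1;         /* an empty slot ends the probe sequence */
        }
        if (strcmp(s->key, key) == 0) {
            *out = s->value;
            return 0;
        }
    }
    return -1;
}
```

With a zero-initialized OaTable, inserting "ab" and then "ba" places both keys in neighbouring slots, because the toy hash gives them the same starting index and the second insert probes one slot further.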

2. Implementation of hash tables

Implementing a hash table mainly involves three tasks:

* Implementing the hash function

* Resolving collisions

* Implementing the operation interface

(1) Data structure

First, we need a container to hold the hash table. What the hash table mainly stores is the data put into it. To make it easy to know how many elements the table holds, it also keeps a size field, and it needs a container that holds the data itself. The following describes how to implement a simple hash table. There are two basic data structures: one for the hash table itself and one that actually stores the data. They are defined as follows:

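    typedef struct _Bucket
    {
        char *key;
        void *value;
        struct _Bucket *next;   // next bucket in the same slot
    } Bucket;

    typedef struct _HashTable
    {
        int size;               // number of slots
        int elem_num;           // number of stored elements
        Bucket **buckets;       // array of slot heads
    } HashTable;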

The above definition is similar to the implementation in PHP. To keep things simple, keys are strings, while the stored value can be of any type.

The Bucket struct is a singly linked list used to resolve hash collisions: when multiple keys map to the same index, the colliding elements are linked together.

(2) Implementation of the hash function

We use a very simple hash algorithm: add up all the characters of the key string, then take the result modulo the size of the hash table, so that the index always falls within the range of the array.

    static int hash_str(char *key)
    {
        int hash = 0;
        char *cur = key;

        while (*cur != '\0') {
            hash += *cur;
            ++cur;
        }

        return hash;
    }

    // use this macro to obtain the index of the key in the hash table
    #define HASH_INDEX(ht, key) (hash_str((key)) % (ht)->size)
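One weakness of an additive hash is worth seeing concretely: any two keys made of the same characters in a different order produce the same sum and therefore collide. The sketch below restates the scheme with hypothetical names (sum_hash, sum_index) so that it stands alone; it is an illustration, not code from the article's source.

```c
#include <assert.h>
#include <stddef.h>

/* Additive scheme as described above: sum the bytes of the key. */
static int sum_hash(const char *key)
{
    assert(key != NULL);
    int hash = 0;
    while (*key != '\0') {
        hash += (unsigned char)*key;
        ++key;
    }
    return hash;
}

/* Map a key to a slot index in a table with `size` slots. */
static int sum_index(const char *key, int size)
{
    assert(size > 0);
    return sum_hash(key) % size;
}
```

For example, "ab" and "ba" both sum to 195, so they land in the same slot for every table size. This is why the collision handling in the insert and lookup functions matters even for small tables.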

PHP itself uses a hash algorithm called DJBX33A. For our simple table, the following functions are defined to operate on the hash table:

    int hash_init(HashTable *ht);                               // initialize the hash table
    int hash_lookup(HashTable *ht, char *key, void **result);   // look up content by key
    int hash_insert(HashTable *ht, char *key, void *value);     // insert content into the hash table
    int hash_remove(HashTable *ht, char *key);                  // remove the content the key points to
    int hash_destroy(HashTable *ht);

The following uses the insert and lookup functions as examples:

    int hash_insert(HashTable *ht, char *key, void *value)
    {
        // check if we need to resize the hashtable:
        // the table size is not fixed; when the inserted content is about to
        // fill the table, it is resized to accommodate all the elements
        resize_hash_table_if_needed(ht);

        int index = HASH_INDEX(ht, key);            // find the index the key maps to

        Bucket *org_bucket = ht->buckets[index];
        Bucket *bucket = (Bucket *)malloc(sizeof(Bucket));  // allocate a bucket for the new element

        bucket->key   = strdup(key);
        // store only the pointer to the value; the content itself is not copied
        bucket->value = value;
        bucket->next  = NULL;

        LOG_MSG("Insert data p: %p\n", value);

        ht->elem_num += 1;                          // record the number of elements in the table

        if (org_bucket != NULL)
        {
            // a collision occurred: place the new element at the head of the linked list
            LOG_MSG("Index collision found with org hashtable: %p\n", org_bucket);
            bucket->next = org_bucket;
        }

        ht->buckets[index] = bucket;

        LOG_MSG("Element inserted at index %i, now we have: %i elements\n",
            index, ht->elem_num);

        return SUCCESS;
    }

A lookup first finds the slot the key maps to. If the slot holds elements, the key of each element in the linked list is compared with the key being searched for until a matching element is found; if none matches, the lookup fails.

    int hash_lookup(HashTable *ht, char *key, void **result)
    {
        int index = HASH_INDEX(ht, key);
        Bucket *bucket = ht->buckets[index];

        if (bucket == NULL) return FAILED;

        // walk the linked list to find the right element. Ideally the list
        // holds only one element, so the loop runs only once; ensuring this
        // requires a suitable hash algorithm.
        while (bucket)
        {
            if (strcmp(bucket->key, key) == 0)
            {
                LOG_MSG("HashTable found key in index: %i with key: %s value: %p\n",
                    index, key, bucket->value);
                *result = bucket->value;
                return SUCCESS;
            }

            bucket = bucket->next;
        }

        LOG_MSG("HashTable lookup missed the key: %s\n", key);
        return FAILED;
    }
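The article declares hash_remove but does not show it. A minimal sketch of how removal from a chained table might look is given below. It restates the two structs so that it compiles on its own, and it makes several assumptions that are not from the original: it returns 0 and -1 instead of SUCCESS and FAILED, it assumes the key was heap-allocated on insert, and it leaves the value untouched because the table only stores a pointer to it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct _Bucket {
    char *key;
    void *value;
    struct _Bucket *next;
} Bucket;

typedef struct _HashTable {
    int size;
    int elem_num;
    Bucket **buckets;
} HashTable;

/* Additive hash, as in the simple table above. */
static int hash_str(const char *key)
{
    int hash = 0;
    while (*key != '\0') {
        hash += *key;
        ++key;
    }
    return hash;
}

/* Unlink and free the bucket for `key`. Returns 0 if removed, -1 if absent. */
static int hash_remove(HashTable *ht, const char *key)
{
    assert(ht != NULL && key != NULL);
    int index = hash_str(key) % ht->size;
    Bucket **link = &ht->buckets[index];   /* the pointer that may need patching */

    while (*link != NULL) {
        Bucket *cur = *link;
        if (strcmp(cur->key, key) == 0) {
            *link = cur->next;   /* bypass the bucket */
            free(cur->key);      /* assumes the key was heap-allocated on insert */
            free(cur);           /* the value stays owned by the caller */
            ht->elem_num -= 1;
            return 0;
        }
        link = &cur->next;
    }
    return -1;
}
```

Walking a pointer to the link, rather than a pointer to the bucket, avoids a special case for removing the head of the chain.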

Arrays in PHP are implemented on top of hash tables. When elements are added to an array one after another, they have an order. A hash table, however, distributes its elements close to uniformly across the slots, so the elements cannot be retrieved in insertion order from the slots alone. In PHP's implementation, the Bucket struct therefore keeps additional pointer fields that record the order of the elements.

II. PHP hash table implementation

1. PHP hash table implementation

The hash table is a very important data structure in PHP. Most language features are built on it: variable scopes and variable storage, much of the class implementation, and a great deal of the Zend Engine's internal data are all stored in hash tables.

(1) Data structure and description

To preserve the ordering relationship between elements, Zend stores the data in a doubly linked list.

(2) Hash table structure

The hash table in PHP is implemented in Zend/zend_hash.c. PHP uses the following two data structures to implement it: the HashTable struct holds the basic information needed for the whole table, and the Bucket struct holds the actual data. They are defined as follows:

    typedef struct _hashtable {
        uint nTableSize;            // size of the hash table, minimum 8, grows by doubling
        uint nTableMask;            // nTableSize - 1, used to optimize index computation
        uint nNumOfElements;        // number of elements in the table; count() returns this directly
        ulong nNextFreeElement;     // the next available numeric index
        Bucket *pInternalPointer;   // the element currently being traversed (one reason foreach is faster than for)
        Bucket *pListHead;          // pointer to the first element of the array
        Bucket *pListTail;          // pointer to the last element of the array
        Bucket **arBuckets;         // the hash array itself
        dtor_func_t pDestructor;
        zend_bool persistent;
        unsigned char nApplyCount;  // number of recursive visits to this table (guards against runaway recursion)
        zend_bool bApplyProtection; // whether repeated visits are restricted; if so, recursion is limited to 3 levels
    #if ZEND_DEBUG
        int inconsistent;
    #endif
    } HashTable;

The nTableSize field holds the capacity of the hash table. The minimum capacity at initialization is 8. Consider the initialization function first:

    ZEND_API int _zend_hash_init(HashTable *ht, uint nSize, hash_func_t pHashFunction,
                        dtor_func_t pDestructor, zend_bool persistent ZEND_FILE_LINE_DC)
    {
        uint i = 3;
        //...
        if (nSize >= 0x80000000) {
            /* prevent overflow */
            ht->nTableSize = 0x80000000;
        } else {
            while ((1U << i) < nSize) {
                i++;
            }
            ht->nTableSize = 1 << i;
        }
        // ...
        ht->nTableMask = ht->nTableSize - 1;

        /* Uses ecalloc() so that Bucket* == NULL */
        if (persistent) {
            tmp = (Bucket **) calloc(ht->nTableSize, sizeof(Bucket *));
            if (!tmp) {
                return FAILURE;
            }
            ht->arBuckets = tmp;
        } else {
            tmp = (Bucket **) ecalloc_rel(ht->nTableSize, sizeof(Bucket *));
            if (tmp) {
                ht->arBuckets = tmp;
            }
        }

        return SUCCESS;
    }

For example, if the requested initial size is 10, the algorithm above adjusts it to 16. In other words, the size is always rounded up to the smallest power of two that is at least the requested size, with a minimum of 8.
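The rounding step can be isolated into a small function to make its behaviour easy to check. The sketch below follows the same loop as the initialization code; the function name table_size_for is hypothetical and the allocation part is left out.

```c
#include <assert.h>

/* Smallest power of two >= n_size, with a floor of 8 and a cap of 2^31. */
static unsigned int table_size_for(unsigned int n_size)
{
    unsigned int i = 3;                 /* start at 2^3 = 8 */
    unsigned int size;

    if (n_size >= 0x80000000u) {
        return 0x80000000u;             /* prevent overflow of the shift */
    }
    while ((1u << i) < n_size) {
        i++;
    }
    size = 1u << i;
    assert((size & (size - 1)) == 0);   /* the result is always a power of two */
    return size;
}
```

A request for 10 yields 16, a request for 8 or less yields 8, and a request for 1000 yields 1024.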

Why make this adjustment? First, look at how HashTable maps hash values to slots:

    h = zend_inline_hash_func(arKey, nKeyLength);
    nIndex = h & ht->nTableMask;

As the _zend_hash_init() function above shows, ht->nTableMask is ht->nTableSize - 1. A bitwise AND is used here instead of a modulo operation because modulo is considerably more expensive than a bitwise operation. This only works because the table size is a power of two: nTableSize - 1 then has all of its low bits set, so h & nTableMask gives the same result as h % nTableSize.
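To make the hash-then-mask step concrete, here is a minimal sketch of a times-33 hash in the DJBX33A family followed by the mask. It is an approximation for illustration: the function names are hypothetical, and the real zend_inline_hash_func differs in details such as loop unrolling, so the exact values should not be assumed to match PHP's.

```c
#include <assert.h>
#include <stddef.h>

/* Times-33 hash (DJBX33A style): hash = hash * 33 + byte, seeded with 5381. */
static unsigned long djbx33a(const char *key, size_t len)
{
    unsigned long hash = 5381;
    for (size_t i = 0; i < len; i++) {
        hash = ((hash << 5) + hash) + (unsigned char)key[i];
    }
    return hash;
}

/* Slot index for a table whose size is a power of two. */
static unsigned long slot_index(unsigned long h, unsigned long table_size)
{
    assert(table_size != 0 && (table_size & (table_size - 1)) == 0);
    return h & (table_size - 1);       /* same result as h % table_size */
}
```

For any power-of-two table_size, slot_index(h, table_size) equals h % table_size; for a size such as 10 the mask would be 9 (binary 1001) and some slots could never be reached, which is why the size is rounded up first.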

Once the size has been set, storage has to be allocated for the hash table. As the initialization code above shows, a different allocation function is called depending on whether the table is persistent. Persistence was described in the earlier discussion of the PHP lifecycle: persistent content remains accessible across multiple requests, whereas non-persistent storage is released at the end of the request. The details are covered in the discussion of memory management.

The nNumOfElements field in HashTable is easy to understand. It is updated each time an element is inserted or removed with unset, so that count() can return the number of array elements quickly.

The nNextFreeElement field is very useful. First, look at this PHP code:

    <?php
    $a = array(10 => 'Hello');
    $a[] = 'TIPI';
    var_dump($a);

    // output
    array(2) {
      [10]=>
      string(5) "Hello"
      [11]=>
      string(4) "TIPI"
    }

In PHP, elements can be appended to an array without specifying an index. In that case an integer is used as the index by default, similar to an enumeration in C, and the value of that index is determined by the nNextFreeElement field. If the array already contains integer keys, the largest integer key used so far plus 1 is used by default. In the preceding example, 10 is already in use as a key, so the new default index is 11.

Next, look at the struct of the slot that holds the hash table's data:
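The bookkeeping behind this can be sketched in a few lines of C. This is a simplified model under stated assumptions, not the Zend source: IndexState, note_explicit_index and take_next_index are hypothetical names, and the error PHP raises when the next index is no longer available is not modelled.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for the one HashTable field this sketch needs. */
typedef struct {
    long next_free;   /* plays the role of nNextFreeElement */
} IndexState;

/* Called when an element is stored under an explicit integer key. */
static void note_explicit_index(IndexState *st, long key)
{
    assert(st != NULL);
    if (key >= st->next_free) {
        st->next_free = (key < LONG_MAX) ? key + 1 : LONG_MAX;
    }
}

/* Called for an append without a key: returns the key the new element gets. */
static long take_next_index(IndexState *st)
{
    assert(st != NULL);
    long key = st->next_free;
    if (st->next_free < LONG_MAX) {
        st->next_free++;
    }
    return key;
}
```

Starting from a state of 0, storing an element under key 10 moves next_free to 11, so the following append receives 11, matching the PHP example. Storing under a smaller key afterwards, such as 3, leaves next_free unchanged.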

    typedef struct bucket {
        ulong h;                    // hash value of the char *key, or the user-specified numeric index
        uint nKeyLength;            // length of the key; 0 if the array index is a number
        void *pData;                // points to the value, generally a copy of the user data;
                                    // if the data is itself a pointer, this points to pDataPtr
        void *pDataPtr;             // if the data is a pointer, it is stored here and pData points to this field
        struct bucket *pListNext;   // next element of the entire hash table
        struct bucket *pListLast;   // previous element of the entire hash table
        struct bucket *pNext;       // next element stored in the same hash slot
        struct bucket *pLast;       // previous element stored in the same hash slot
        char arKey[1];              /* Stores the string key. This field must come last:
                                       only one byte is declared here, and the bytes of the
                                       key are stored starting at this position, which saves
                                       a separate allocation and assignment. The field is not
                                       always needed, so this also saves space. */
    } Bucket;

As the comments on the fields above indicate, the h field stores the hash value of the key. PHP arrays can be indexed by strings or by numbers. A numeric index is already unique, so hashing it again would be wasteful; for numeric keys, h holds the index itself. The nKeyLength field that follows h is the length of the key; for a numeric index, nKeyLength is 0. When an array is defined in PHP, a string key that can be converted to an integer is converted. So in PHP, string indexes such as '10' and '11' are no different from numeric indexes such as 10 and 11.

  • The Bucket struct maintains two doubly linked lists. The pNext and pLast pointers link the bucket into the chain of its own slot.

  • The pListNext and pListLast pointers link all the elements of the hash table together in order. In the HashTable struct, pListHead and pListTail point to the first and last elements of the entire hash table.
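The interplay of the two lists can be sketched as follows. This is a reduced model under stated assumptions: the type and function names (OBucket, OTable, otable_add, otable_values) are hypothetical, the table has a fixed size of 8, keys are integers, buckets are allocated by the caller, and the per-slot chain is singly linked here for brevity, whereas PHP's is doubly linked through pNext and pLast.

```c
#include <assert.h>
#include <stddef.h>

typedef struct OBucket {
    unsigned long h;              /* integer key (also the hash in this sketch) */
    int value;
    struct OBucket *pNext;        /* next bucket in the same slot's chain */
    struct OBucket *pListNext;    /* next bucket in insertion order */
    struct OBucket *pListLast;    /* previous bucket in insertion order */
} OBucket;

typedef struct {
    OBucket *slots[8];
    OBucket *pListHead;
    OBucket *pListTail;
} OTable;

/* Link a caller-allocated bucket into both the slot chain and the global list. */
static void otable_add(OTable *t, OBucket *b)
{
    assert(t != NULL && b != NULL);
    unsigned long idx = b->h & 7;          /* table size 8, mask 7 */

    b->pNext = t->slots[idx];              /* push onto the slot chain */
    t->slots[idx] = b;

    b->pListNext = NULL;                   /* append to the global list */
    b->pListLast = t->pListTail;
    if (t->pListTail != NULL) {
        t->pListTail->pListNext = b;
    } else {
        t->pListHead = b;
    }
    t->pListTail = b;
}

/* Copy values in insertion order into out[]; returns how many were written. */
static size_t otable_values(const OTable *t, int *out, size_t cap)
{
    size_t n = 0;
    for (const OBucket *b = t->pListHead; b != NULL && n < cap; b = b->pListNext) {
        out[n++] = b->value;
    }
    return n;
}
```

Lookups go through slots and pNext, while iteration follows pListHead and pListNext, so elements come back in the order they were added regardless of which slot each one hashed to.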

Operation interface of the hash table:

PHP provides the following kinds of operation interfaces:

  • Initialization, for example the zend_hash_init() function, which initializes the hash table and allocates space.

  • Lookup, insertion, deletion, and update. These are the most commonly used operations.

  • Iteration and looping. These interfaces are used to traverse a hash table.

  • Copy, sort, reverse, and destroy operations.
