Data structure and algorithm analysis: Hash table

Last Update:2015-09-25 Source: Internet

Author: User


The following is a summary of hash tables, written after reading Introduction to Algorithms:

A hash table (also called a hash list) is an effective data structure for implementing dictionary operations. Hash tables are highly efficient: when there is no collision (described later), an element can be reached in a single access, and under reasonable assumptions the average time to find an element is O(1). In the worst case, finding an element takes as long as in a linked list, O(n), but in practice hash tables generally perform very well.

A hash table is a data structure for the key-value mapping problem. More precisely, it establishes a deterministic correspondence f between a record's keyword and the location where the record is stored, so that each keyword corresponds to a single storage location in the table. We call this correspondence f the hash function, and the resulting storage structure a hash table.

I. Direct addressing table

Direct addressing is a simple and efficient technique when the universe U of keywords is reasonably small. Its hash function is simple: f(key) = key, that is, the value of the keyword is directly the index of the table position where the element resides. If the keyword is not an integer, we need to convert it to an integer by some means, for example by converting a letter keyword to its ordinal number in the alphabet. The advantage of direct addressing is that two different keywords never correspond to the same address (f(key1) = f(key2) never happens for key1 ≠ key2), so there is no need to handle collisions. However, direct addressing has a natural limitation: if the universe U is large, storing a table of size |U| is impractical given the memory available on a typical computer.
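A minimal sketch of a direct addressing table in C may make this concrete. The universe size `U_SIZE`, the `Record` type, and all function names are assumptions introduced for illustration; they do not come from the original article.

```c
#include <assert.h>
#include <stddef.h>

#define U_SIZE 26  /* assumed universe: keys 0..25, e.g. the letters 'a'..'z' */

typedef struct {
    int key;    /* key in 0..U_SIZE-1 */
    int value;  /* satellite data */
} Record;

/* One slot per possible key; NULL means "no element with this key". */
typedef struct {
    Record *slot[U_SIZE];
} DirectTable;

/* Convert a lowercase letter to its ordinal position (assumes ASCII). */
int letter_to_key(char c)
{
    assert(c >= 'a' && c <= 'z');
    return c - 'a';
}

void direct_init(DirectTable *t)
{
    for (int i = 0; i < U_SIZE; i++)
        t->slot[i] = NULL;
}

/* f(key) = key: the key itself is the slot index, so collisions cannot occur. */
void direct_insert(DirectTable *t, Record *r)
{
    assert(r->key >= 0 && r->key < U_SIZE);
    t->slot[r->key] = r;
}

Record *direct_search(const DirectTable *t, int key)
{
    assert(key >= 0 && key < U_SIZE);
    return t->slot[key];
}

void direct_delete(DirectTable *t, int key)
{
    assert(key >= 0 && key < U_SIZE);
    t->slot[key] = NULL;
}
```

Every operation is a single array access, which is why direct addressing is O(1) in the worst case; the cost is that the array must have one slot for every key in the universe, whether or not it is used.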

II. Hash table

As mentioned above, with real data the universe U tends to be very large, and storing a table of size |U| may be impractical given the memory available on a typical computer. At the same time, the set K of keywords actually stored may be small relative to U, in which case a hash table requires much less storage than a direct addressing table.

A hash table T computes the position of a keyword in the table with a hash function f(key); each such position is called a "slot". The hash function f maps the keyword universe U to the slots T[0...m-1] of the hash table. Because the number of keywords is greater than the number of slots, there is a problem: several keywords may be mapped to the same position in the table. We call this situation a collision (conflict). We want the hash table to save space while keeping its performance close to O(1), so collisions should be avoided as much as possible. The usual strategy is to design the hash function f well, so that it maps keywords to the slots of the table as randomly as possible. However, because the number of keywords |U| is greater than the number of slots m, at least two keywords must map to the same slot, so the hash function f alone cannot avoid collisions completely.

III. Hash functions

There are many ways to construct a hash function. Ideally, for any keyword in the keyword set, the hash function is equally likely to map it to any address in the address set. In other words, hashing a keyword yields an effectively random address, so that the hash addresses of a set of keywords are evenly distributed over the whole address space, which reduces collisions. Most hash functions assume that the universe of keywords is the set of natural numbers N = {0, 1, 2, ...}, so a keyword that is not a natural number must first be converted to one. Let's look at some commonly used hash functions.

1. Direct addressing method

This corresponds to the direct addressing table described earlier: keywords are in one-to-one correspondence with addresses in the table, so there is no need to handle collisions.

2. Division hashing method

Hash function: f(key) = key % m

The function takes the remainder of the given keyword key divided by m. Here m must not be greater than the length len of the hash table, and m is usually chosen to be a prime that is not too close to an integer power of 2.
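As a small illustration, the division method can be sketched in C as follows. The function name `hash_division` and the use of unsigned keys are assumptions made for this sketch.

```c
#include <assert.h>

/* Division method: f(key) = key % m.
   m is the divisor; a prime not too close to a power of 2 is the usual choice. */
unsigned hash_division(unsigned key, unsigned m)
{
    assert(m > 0);
    return key % m;
}
```

With m = 13, for example, the keys 15 and 41 both hash to slot 2, which is exactly the kind of collision that the later sections deal with.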

3. Multiplication hashing method

Multiply the keyword key by a constant A (0 < A < 1) and take the fractional part of the product; then multiply that value by m and round down. The function is written as: f(key) = floor(m * (key * A % 1))

Here (key * A % 1) denotes the fractional part of key * A. The quality of this function depends on the choice of the parameter A. Knuth suggests that A = (sqrt(5)-1)/2 = 0.6180339887... is a relatively good value; this is the golden ratio point.

Division hashing and multiplication hashing are the more commonly used ways of designing hash functions. There are many other methods, such as universal hashing, the folding method, and the digit analysis method. For more information, please refer to the relevant books.
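The formula can be sketched in C using floating-point arithmetic. This is only an illustration under assumptions: the function name `hash_multiplication` is hypothetical, keys are taken to be unsigned integers, and the fractional part is computed by subtracting the integer part rather than with a library call.

```c
#include <assert.h>

/* Multiplication method: f(key) = floor(m * frac(key * A)), with 0 < A < 1. */
unsigned hash_multiplication(unsigned key, unsigned m)
{
    const double A = 0.6180339887498949;   /* (sqrt(5) - 1) / 2 */
    double product = (double)key * A;
    /* Fractional part: subtract the integer part (product is never negative). */
    double frac = product - (double)(unsigned long long)product;

    assert(m > 0);
    /* For non-negative values the cast truncates, which equals floor(). */
    return (unsigned)((double)m * frac);
}
```

Unlike the division method, the value of m is not critical here, so a power of 2 is a common choice. Production implementations usually replace the floating-point arithmetic with fixed-point integer multiplication, which this sketch does not attempt.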

IV. Collision handling

As mentioned earlier, to save space the number of slots in the table is smaller than the number of possible keywords, so collisions cannot be avoided completely even with a well-designed hash function. Here are two ways to resolve collisions: chaining and open addressing.

1. Chaining (linking method)

The idea of chaining is simple: if multiple keywords are mapped to the same slot of the hash table (such keywords are called synonyms), the synonyms are stored in the same linked list. The slot holds a pointer to the head of the linked list that stores all elements hashed to that slot.

In the original diagram, keywords k1 and k4 are mapped to the same slot of the hash table, and k5, k2, and k7 are mapped to another common slot. Introduction to Algorithms mentions that, to delete an element more quickly, the chain can be implemented as a doubly linked list. If the chain is singly linked, then to delete element x we must first find x in the list T[h(x.key)] so that the next attribute of x's predecessor can be updated. With singly linked lists, deletion and search therefore have the same asymptotic running time.
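To illustrate why a doubly linked chain helps, here is a minimal sketch of constant-time deletion given a pointer to the node. The `DNode` type and the function names are assumptions for this example and are not part of the implementation given later in the article, which uses singly linked lists.

```c
#include <assert.h>
#include <stddef.h>

typedef struct DNode {
    int key;
    struct DNode *prev;
    struct DNode *next;
} DNode;

/* Insert x at the head of the chain whose head pointer is *head. */
void chain_insert(DNode **head, DNode *x)
{
    x->prev = NULL;
    x->next = *head;
    if (*head)
        (*head)->prev = x;
    *head = x;
}

/* Unlink x from its chain in O(1): no search for the predecessor is needed. */
void chain_delete(DNode **head, DNode *x)
{
    assert(x != NULL);
    if (x->prev)
        x->prev->next = x->next;
    else
        *head = x->next;          /* x was the first node of the chain */
    if (x->next)
        x->next->prev = x->prev;
    x->prev = x->next = NULL;
}
```

The extra `prev` pointer costs memory in every node, so this is a trade-off between space and deletion time.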

2. Open addressing

In open addressing, all elements are stored in the hash table itself: each table entry either contains an element of the dynamic set or is empty, and no linked lists are used. A slot at address t is open not only to synonyms whose hash value equals t but also to records whose hash value is not t, so a hash address may be taken on a first-come, first-served ("preemptive") basis.

This method computes the probe address with the following formula:

F(key,i) = (f(key) + di) % len

where f(key) is the hash function, len is the length of the hash table, and di is an increment sequence, which may take one of the following three forms:

di = 1, 2, 3, ..., m-1
di = 1, -1, 4, -4, 9, -9, ..., k^2, -k^2
di is a pseudo-random sequence

Using the first sequence is called linear probing, the second quadratic probing, and the third random probing. Put plainly, when a collision occurs, the keyword is moved forward or backward by some number of positions from where it would have been placed. With the first sequence, for example, on a collision we move back one position and probe again; if that slot is also occupied we continue moving back until an empty slot is found, and the keyword is inserted there.

Linear probing is relatively easy to implement, but it suffers from a problem called primary clustering: runs of continuously occupied slots grow longer, and the average search time grows with them. Clustering arises easily because an empty slot preceded by i full slots is the next to be occupied with probability (i+1)/len.

Quadratic probing likewise produces clustering, called secondary clustering. Because each collision makes the search for an insertion position jump forward or backward, it is milder than primary clustering.
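The probe formula F(key,i) = (f(key) + di) % len can be sketched in C for the first two increment sequences. This is an illustrative sketch under assumptions: keys are non-negative integers, `EMPTY` is a sentinel chosen here to mark free slots, the hash function is f(key) = key % m, and all names are hypothetical.

```c
#include <assert.h>

#define EMPTY (-1)   /* assumed sentinel: valid keys are non-negative */

/* Linear probing: d_i = i, so F(key, i) = (f(key) + i) % len. */
int probe_linear(int home, int i, int len)
{
    return (home + i) % len;
}

/* Quadratic probing: d_i = 1, -1, 4, -4, 9, -9, ... for i = 1, 2, 3, ... */
int probe_quadratic(int home, int i, int len)
{
    int k = (i + 1) / 2;                       /* 1, 1, 2, 2, 3, 3, ... */
    int d = (i % 2 == 1) ? k * k : -(k * k);
    int pos = (home + d) % len;
    return pos < 0 ? pos + len : pos;          /* C's % may yield a negative value */
}

/* Insert key with linear probing; return its slot, or -1 if the table is full. */
int oa_insert(int table[], int len, int m, int key)
{
    assert(key >= 0 && m > 0 && m <= len);
    int home = key % m;
    for (int i = 0; i < len; i++) {
        int pos = probe_linear(home, i, len);
        if (table[pos] == EMPTY || table[pos] == key) {
            table[pos] = key;
            return pos;
        }
    }
    return -1;
}

/* Search with the same probe sequence; an empty slot ends the search. */
int oa_search(const int table[], int len, int m, int key)
{
    assert(key >= 0 && m > 0 && m <= len);
    int home = key % m;
    for (int i = 0; i < len; i++) {
        int pos = probe_linear(home, i, len);
        if (table[pos] == EMPTY)
            return -1;
        if (table[pos] == key)
            return pos;
    }
    return -1;
}
```

Deletion is deliberately left out: simply resetting a slot to `EMPTY` would cut probe sequences short, so open-addressing tables normally mark deleted slots with a separate "deleted" sentinel instead.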

V. An example

Known keyword sequence: 26, 36, 41, 38, 44, 15, 68, 12, 06, 51, 25.

The hash function is constructed by the division method, and collisions are resolved with linear probing. The task is to: ① build the hash table; ② find the average search length (ASL) for successful and unsuccessful searches.

Solution: The number of keywords in the problem is n = 11. Set the load factor α = 0.75, so that n / α ≈ 14.67, and take m to be the prime 13. The hash function is f(key) = key % 13. The steps to build the hash table are shown in the original figures:

Lookup performance analysis for a hash table:
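The construction can be checked with a short C sketch. Note the assumptions: the table length is taken to be len = 15 (n / α rounded up), which the text does not state explicitly, the divisor is 13 as in the example, and `EMPTY` and the function names are hypothetical. Each insertion returns the number of probes it needed, which equals the number of comparisons a later successful search for that key performs.

```c
#include <assert.h>

#define LEN   15     /* assumed table length: 11 / 0.75 rounded up */
#define M     13     /* divisor from the example: f(key) = key % 13 */
#define EMPTY (-1)   /* assumed sentinel for a free slot */

/* Insert key with linear probing; return the number of probes used. */
int insert_linear(int table[LEN], int key)
{
    int home = key % M;
    for (int i = 0; i < LEN; i++) {
        int pos = (home + i) % LEN;
        if (table[pos] == EMPTY) {
            table[pos] = key;
            return i + 1;
        }
    }
    assert(!"table is full");
    return 0;
}

/* Build the table from keys[0..n-1]; return the total number of probes.
   ASL for successful search = total / n. */
int build_table(int table[LEN], const int keys[], int n)
{
    int total = 0;
    for (int i = 0; i < LEN; i++)
        table[i] = EMPTY;
    for (int i = 0; i < n; i++)
        total += insert_linear(table, keys[i]);
    return total;
}
```

Under these assumptions, inserting the eleven keys in the given order takes 20 probes in total, so the ASL for a successful search would be 20/11 ≈ 1.82. A different table length gives different positions for the keys that wrap around (12, 51, and 25), and therefore a different ASL.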

VI. Code implementation

The following code implements a hash table that uses chaining to handle collisions. The header file describing the data structures is as follows:

    #include <stdbool.h>

    #define M 7  // Divisor in the hash function; must be less than or equal to the table length

    typedef int ElemType;

    // The hash table uses chaining to resolve collisions.
    // Node: data structure for linked-list nodes
    typedef struct Node
    {
        ElemType data;
        struct Node *next;
    } Node, *pNode;

    // HashNode: data structure for each slot of the hash table
    typedef struct HashNode
    {
        pNode first;  // Points to the first node of the list
    } HashNode, *HashTable;

    // Create a hash table
    HashTable create_hashtable(int);

    // Find data in a hash table
    pNode search_hashtable(HashTable, ElemType);

    // Insert data into a hash table
    bool insert_hashtable(HashTable, ElemType);

    // Delete data from a hash table
    bool delete_hashtable(HashTable, ElemType);

    // Destroy a hash table
    void destroy_hashtable(HashTable, int);

First create an empty hash table, then perform insert, delete, and search operations, and finally destroy the hash table. The implementation of the hash table is as follows:

    #include <stdio.h>
    #include <stdlib.h>
    #include "data_structure.h"

    // Create a hash table of n slots
    HashTable create_hashtable(int n)
    {
        int i;
        HashTable hashtable = (HashTable)malloc(n * sizeof(HashNode));
        if (!hashtable)
        {
            printf("hashtable malloc failed, program exit...");
            exit(-1);
        }

        // Set every slot of the hash table to empty
        for (i = 0; i < n; i++)
            hashtable[i].first = NULL;

        return hashtable;
    }

    // Find data in the hash table; the hash function is H(key) = key % M
    // On success, return the node's position in the linked list
    // On failure, return NULL
    pNode search_hashtable(HashTable hashtable, ElemType data)
    {
        if (!hashtable)
            return NULL;

        pNode pCur = hashtable[data % M].first;
        while (pCur && pCur->data != data)
            pCur = pCur->next;

        return pCur;
    }

    // Insert data into the hash table; the hash function is H(key) = key % M
    // If data already exists, return false
    // Otherwise, append it to the corresponding list and return true
    bool insert_hashtable(HashTable hashtable, ElemType data)
    {
        // If it already exists, return false
        if (search_hashtable(hashtable, data))
            return false;

        // Otherwise allocate space for the new data
        pNode pNew = (pNode)malloc(sizeof(Node));
        if (!pNew)
        {
            printf("pNew malloc failed, program exit...");
            exit(-1);
        }
        pNew->data = data;
        pNew->next = NULL;

        // Insert the node at the end of the corresponding linked list
        pNode pCur = hashtable[data % M].first;
        if (!pCur)  // The new node becomes the first node of the list
            hashtable[data % M].first = pNew;
        else        // The new node is not the first node of the list
        {
            // Link pNew through pCur->next. Assigning pNew to pCur itself
            // would not work: pCur is only a local pointer to a node,
            // not a link stored inside the list.
            while (pCur->next)
                pCur = pCur->next;
            pCur->next = pNew;
        }
        return true;
    }

    // Delete data from the hash table; the hash function is H(key) = key % M
    // If data does not exist, return false
    // Otherwise, delete it and return true
    bool delete_hashtable(HashTable hashtable, ElemType data)
    {
        // If not found, return false
        if (!search_hashtable(hashtable, data))
            return false;

        // Otherwise, delete the data
        pNode pCur = hashtable[data % M].first;
        pNode pPre = pCur;  // Predecessor of the node to delete; starts equal to pCur
        if (pCur->data == data)  // The node to delete is the first node of the list
            hashtable[data % M].first = pCur->next;
        else
        {   // The node to delete is not the first node
            while (pCur && pCur->data != data)
            {
                pPre = pCur;
                pCur = pCur->next;
            }
            pPre->next = pCur->next;
        }
        free(pCur);
        pCur = 0;
        return true;
    }

    // Destroy a hash table with n slots
    void destroy_hashtable(HashTable hashtable, int n)
    {
        int i;
        // Release the chains one by one
        for (i = 0; i < n; i++)
        {
            pNode pCur = hashtable[i].first;
            pNode pDel = NULL;
            while (pCur)
            {
                pDel = pCur;
                pCur = pCur->next;
                free(pDel);
                pDel = 0;
            }
        }
        // Finally free the hash table itself
        free(hashtable);
        hashtable = 0;
    }

Final Test code:

    #include <stdio.h>
    #include "data_structure.h"

    int main()
    {
        // Hash table length, i.e. the number of slots. The value in the source
        // text is garbled; 15 is assumed here, and any value >= M works.
        int len = 15;
        printf("We set the length of hashtable %d\n", len);

        // Create a hash table and insert data
        HashTable hashtable = create_hashtable(len);
        if (insert_hashtable(hashtable, 1))
            printf("Insert 1 success\n");
        else
            printf("Insert 1 fail,it is already existed in the hashtable\n");
        if (insert_hashtable(hashtable, 8))
            printf("Insert 8 success\n");
        else
            printf("Insert 8 fail,it is already existed in the hashtable\n");
        if (insert_hashtable(hashtable, 3))
            printf("Insert 3 success\n");
        else
            printf("Insert 3 fail,it is already existed in the hashtable\n");
        if (insert_hashtable(hashtable, 10))
            printf("Insert 10 success\n");
        else
            printf("Insert 10 fail,it is already existed in the hashtable\n");
        if (insert_hashtable(hashtable, 8))
            printf("Insert 8 success\n");
        else
            printf("Insert 8 fail,it is already existed in the hashtable\n");

        // Find data
        pNode pFind1 = search_hashtable(hashtable, 10);
        if (pFind1)
            printf("Find %d in the hashtable\n", pFind1->data);
        else
            printf("Not find 10 in the hashtable\n");
        pNode pFind2 = search_hashtable(hashtable, 4);
        if (pFind2)
            printf("Find %d in the hashtable\n", pFind2->data);
        else
            printf("Not find 4 in the hashtable\n");

        // Delete data
        if (delete_hashtable(hashtable, 1))
            printf("Delete 1 success\n");
        else
            printf("Delete 1 fail\n");
        pNode pFind3 = search_hashtable(hashtable, 1);
        if (pFind3)
            printf("Find %d in the hashtable\n", pFind3->data);
        else
            printf("Not find 1 in the hashtable,it have been deleted\n");

        // Destroy the hash table
        destroy_hashtable(hashtable, len);
        return 0;
    }

VII. References

Introduction to Algorithms, Third Edition (Cormen, T. H.); Chinese edition translated by Yin Jianping
http://blog.csdn.net/ns_code


