Data structure and algorithm analysis: Hash table

Last Update:2018-07-26 Source: Internet

Author: User

Here is a summary of hash tables, written after reading Introduction to Algorithms:

A hash table (also called a hash map) is an effective data structure for implementing dictionary operations. Hash tables are highly efficient: when there is no collision (described later), an element can be reached with a single access, and under reasonable assumptions the average time to find an element is O(1). In the worst case, finding an element takes as long as in a linked list, O(n), but in practice hash tables generally perform well.

A hash table is a data structure for the key-value mapping problem. More precisely, it establishes a deterministic correspondence f between the key of a record and the location where the record is stored, so that each key corresponds to exactly one storage location in the table. We call this correspondence f a hash function, and a table built this way a hash table.

I. Direct addressing table

When the universe U of keys is fairly small, direct addressing is a simple and efficient technique. Its hash function is trivial: f(key) = key, that is, the key itself is the index of the table position where the element resides. If the key is not an integer, we need to convert it to an integer by some means, for example by converting a letter key to its ordinal number in the alphabet. The advantage of direct addressing is that two keys never correspond to the same address (f(key1) = f(key2) never occurs for key1 ≠ key2), so there are no collisions to handle. However, the direct-address table also has a natural limitation: if the universe U is large, storing a table of size |U| is not practical within the memory available on a standard computer.
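A minimal sketch of a direct-address table, assuming the universe is the 26 lowercase letters and each key carries an integer value. The names `DirectTable`, `da_index`, `da_insert`, `da_search`, and `da_delete` are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define U_SIZE 26   /* assumed universe: the lowercase letters 'a'..'z' */

/* One slot per possible key; present[k] records whether key k is stored. */
typedef struct {
    bool present[U_SIZE];
    int  value[U_SIZE];
} DirectTable;

/* f(key) = key: a letter is converted to its ordinal in the alphabet. */
size_t da_index(char key)
{
    assert(key >= 'a' && key <= 'z');
    return (size_t)(key - 'a');
}

/* Store value under key; an existing value for the same key is overwritten. */
void da_insert(DirectTable *t, char key, int value)
{
    size_t i = da_index(key);
    t->present[i] = true;
    t->value[i] = value;
}

/* Returns true and writes the stored value to *out if key is present. */
bool da_search(const DirectTable *t, char key, int *out)
{
    size_t i = da_index(key);
    if (!t->present[i])
        return false;
    *out = t->value[i];
    return true;
}

/* Mark the slot of key as empty. */
void da_delete(DirectTable *t, char key)
{
    t->present[da_index(key)] = false;
}
```

Every operation touches exactly one slot, which is why no collision handling is needed; the cost is one slot for every possible key, whether or not it is ever stored.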

II. Hash table

As mentioned above, with real data the universe U tends to be very large, so storing a table of size |U| may not be practical within the memory available on a standard computer. At the same time, the set K of keys that actually needs to be stored may be small relative to U. In that case a hash table needs much less storage than a direct-address table.

The hash table T computes the position of a key in the table using the hash function f; each position is called a "slot". The hash function f maps the universe U of keys to the slots of the hash table T[0...m-1]. Because the number of possible keys is greater than the number of slots, a problem arises: several keys may be mapped to the same slot. We call this situation a collision. We want lookups to stay close to O(1) while saving space, so collisions should be avoided as much as possible. The usual strategy is to design a good hash function f that maps keys to the slots as randomly as possible. Why only "as much as possible"? Because the number of possible keys |U| is larger than the number of slots m, at least two keys must map to the same slot, so no hash function f can avoid collisions completely on its own.

III. Hash function

There are many ways to construct a hash function. Ideally, for any key in the key set, the hash function maps it to each address in the address set with equal probability. In other words, hashing a key yields an effectively random address, so the hash addresses of a set of keys are evenly distributed over the address space, which reduces collisions. As before, most hash functions assume that the universe of keys is the set of natural numbers N = {0, 1, 2, ...}, so a key that is not a natural number must first be converted to one. Let's look at the usual hash functions.

1. Direct Addressing Method

This corresponds to the direct-address table described earlier: keys are in one-to-one correspondence with addresses in the table, so there are no collisions to handle.

2. Division Hashing Method

Hash function: f(key) = key % m

The function takes the remainder of the key divided by m, where m must not be greater than the hash table length len. A good choice for m is usually a prime that is not too close to an exact power of 2.
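The division method can be sketched as a one-line function. The name `div_hash` is hypothetical, and keys are assumed to be non-negative integers.

```c
#include <assert.h>

/* Division method: f(key) = key % m, for a non-negative key and m > 0. */
unsigned div_hash(unsigned key, unsigned m)
{
    assert(m > 0);
    return key % m;
}
```

With m = 8 (a power of 2) the result depends only on the lowest three bits of the key, so 16 and 32 both hash to 0. With the prime m = 13 they hash to 3 and 6 respectively, which illustrates why a prime away from a power of 2 is preferred.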

3. Multiplication Hashing Method

Multiply the key by a constant A (0 < A < 1), take the fractional part of the product, multiply that fraction by m, and round the result down. This is written as: f(key) = floor(m * (key * A % 1))

Here (key * A % 1) is the fractional part of (key * A). The method depends on the choice of the parameter A. Knuth suggests that A = (sqrt(5) - 1)/2 = 0.6180339887... is a good value; this number is the golden section point.
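A minimal sketch of the multiplication method using double-precision arithmetic. The name `mul_hash` is hypothetical, the constant is (sqrt(5) - 1)/2 written out to double precision, and keys are assumed to be non-negative integers small enough that the product keeps its fractional digits.

```c
#include <assert.h>

/* Multiplication method: f(key) = floor(m * (key * A mod 1)). */
unsigned mul_hash(unsigned key, unsigned m)
{
    const double A = 0.6180339887498949;   /* (sqrt(5) - 1) / 2 */
    double product = (double)key * A;
    /* Fractional part of key * A; the cast truncates, which equals
       floor here because the product is non-negative. */
    double frac = product - (double)(unsigned long long)product;
    assert(m > 0);
    return (unsigned)((double)m * frac);   /* truncation == floor for values >= 0 */
}
```

Unlike the division method, the value of m is not critical here, so a power of 2 can be used for m.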

The division method and the multiplication method are the most commonly used ways to design a hash function. There are many other methods, such as universal hashing, the folding method, and the digit analysis method. For more information, please refer to the relevant books.

IV. Collision resolution

As mentioned earlier, to save space the number of slots in the table is smaller than the number of possible keys, so collisions cannot be avoided completely even with a well-designed hash function. Here are two ways to resolve collisions: chaining and open addressing.

1. Chaining

The idea of chaining is simple: if several keys are mapped to the same slot of the hash table (such keys are called synonyms), they are stored in the same linked list. The slot holds a pointer to the head of the linked list that stores all the elements hashed to that slot.

In the original illustration, the keys k1 and k4 are mapped to the same slot, and k5, k2, and k7 are mapped to another common slot. Introduction to Algorithms mentions that, to delete an element more quickly, the list can be made doubly linked. If the list is singly linked, deleting an element x requires first finding x in the list T[h(x.key)], so that the next attribute of x's predecessor can be updated to unlink x. With a singly linked list, deletion and search therefore have the same asymptotic running time.
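The difference between the two list types can be sketched as follows. This is an illustrative doubly linked chain, not the implementation given later in this article (which uses singly linked lists); the names `DNode`, `chain_push`, and `chain_unlink` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Node of a doubly linked chain attached to one hash table slot. */
typedef struct DNode {
    int key;
    struct DNode *prev, *next;
} DNode;

/* Insert x at the head of the chain whose head pointer is *head. */
void chain_push(DNode **head, DNode *x)
{
    assert(head && x);
    x->prev = NULL;
    x->next = *head;
    if (*head)
        (*head)->prev = x;
    *head = x;
}

/* Unlink x in O(1): no search is needed because x knows its predecessor. */
void chain_unlink(DNode **head, DNode *x)
{
    assert(head && x);
    if (x->prev)
        x->prev->next = x->next;
    else
        *head = x->next;        /* x was the first node of the chain */
    if (x->next)
        x->next->prev = x->prev;
    x->prev = x->next = NULL;
}
```

The O(1) bound applies when the caller already holds a pointer to the node; deleting by key still requires a search of the chain first.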

2. Open addressing

In open addressing, all elements are stored in the hash table itself: each table entry either contains an element of the dynamic set or is empty, and no linked lists are used. A slot t of the hash table is open not only to synonyms whose hash value equals t, but also to records whose hash value is not equal to t; whichever record arrives first takes the address.

The method computes the probe addresses with the following formula:

F(key, i) = (f(key) + d_i) % len

where f(key) is the hash function, len is the hash table length, and d_i is an increment sequence, which may take one of the following three forms:

d_i = 1, 2, 3, ..., m-1

d_i = 1, -1, 4, -4, 9, -9, ..., k^2, -k^2

d_i is a pseudo-random sequence

Using the first sequence is called linear probing, using the second is called quadratic probing, and using the third is called random probing. Put plainly, when a collision occurs, the key is moved forward or backward by some number of positions. With the first sequence, for example, on a collision we move one position further and probe again; if that slot is also occupied we keep moving until an empty slot is found, and the key is inserted there.

Linear probing is relatively easy to implement, but it suffers from a problem called primary clustering: runs of continuously occupied slots grow longer, and the average lookup time increases. Clusters form easily because an empty slot preceded by i full slots is the next slot to be filled with probability (i+1)/len.

Quadratic probing produces clustering as well, called secondary clustering. Because each collision makes the search for an insertion position jump forward or backward, it is milder than primary clustering.
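The probe formula F(key, i) = (f(key) + d_i) % len can be sketched as below. For simplicity this sketch assumes f(key) = key % len, non-negative keys, and -1 as the marker for an empty slot; the names `linear_probe`, `quadratic_probe`, `oa_insert`, and `oa_search` are hypothetical. Deletion is left out, since it needs extra care in open addressing.

```c
#include <assert.h>

#define EMPTY (-1)   /* assumed marker for an unused slot; keys must be >= 0 */

/* Linear probing: slot examined on the i-th probe (i = 0 is the home slot). */
int linear_probe(int key, int i, int len)
{
    return (key % len + i) % len;
}

/* Quadratic probing with increments 0, +1, -1, +4, -4, ... for i = 0, 1, 2, ... */
int quadratic_probe(int key, int i, int len)
{
    int k = (i + 1) / 2;                 /* 0, 1, 1, 2, 2, ... */
    int d = (i % 2 ? 1 : -1) * k * k;    /* 0, +1, -1, +4, -4, ... */
    return ((key % len + d) % len + len) % len;   /* keep the result in 0..len-1 */
}

/* Insert key with linear probing; returns the slot used, or -1 if the table is full. */
int oa_insert(int table[], int len, int key)
{
    assert(key >= 0 && len > 0);
    for (int i = 0; i < len; i++) {
        int s = linear_probe(key, i, len);
        if (table[s] == EMPTY || table[s] == key) {
            table[s] = key;
            return s;
        }
    }
    return -1;
}

/* Search with linear probing; returns the slot holding key, or -1 if absent. */
int oa_search(const int table[], int len, int key)
{
    assert(key >= 0 && len > 0);
    for (int i = 0; i < len; i++) {
        int s = linear_probe(key, i, len);
        if (table[s] == EMPTY)
            return -1;          /* an empty slot ends the probe sequence */
        if (table[s] == key)
            return s;
    }
    return -1;
}
```

Inserting 10, 17, and 24 into a 7-slot table shows primary clustering on a small scale: all three keys have home slot 3, so they end up in the consecutive slots 3, 4, and 5.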

V. An example

Given the key sequence: 26, 36, 41, 38, 44, 15, 68, 12, 6, 51, 25.

The hash function is constructed with the division method, and linear probing is used to resolve collisions. The tasks are: ① build the hash table; ② find the average search length (ASL) for successful and unsuccessful searches.

Solution: the number of keys is n = 11. Set the load factor α = 0.75, so n/α ≈ 14.7, and take m to be the prime 13. The hash function is f(key) = key % 13. (Figure: steps to build the hash table.)

Lookup performance analysis for a hash table:
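The construction and the search-length counts can be reproduced with a short sketch. The table length is not stated in the text; the code assumes len = 15, the smallest integer not less than n/α, with f(key) = key % 13 and linear probing modulo len. The names `build_table` and `failed_probes` are hypothetical.

```c
#include <assert.h>

#define LEN   15     /* assumed table length: smallest integer >= 11 / 0.75 */
#define MOD   13     /* divisor from the text: f(key) = key % 13 */
#define EMPTY (-1)   /* assumed marker for an unused slot */

/* Builds the table with linear probing and returns the total number of
   probes made while inserting, which equals the sum of the successful
   search lengths over all keys. */
int build_table(const int keys[], int n, int table[LEN])
{
    int total = 0;
    assert(n < LEN);                     /* at least one slot must stay empty */
    for (int s = 0; s < LEN; s++)
        table[s] = EMPTY;
    for (int k = 0; k < n; k++) {
        int s = keys[k] % MOD;
        int probes = 1;
        while (table[s] != EMPTY) {      /* collision: try the next slot */
            s = (s + 1) % LEN;
            probes++;
        }
        table[s] = keys[k];
        total += probes;
    }
    return total;
}

/* Sum, over the 13 possible hash values, of the probes needed to reach an
   empty slot, i.e. the total cost of one unsuccessful search per value. */
int failed_probes(const int table[LEN])
{
    int total = 0;
    for (int h = 0; h < MOD; h++) {
        int s = h;
        int probes = 1;
        while (table[s] != EMPTY) {
            s = (s + 1) % LEN;
            probes++;
        }
        total += probes;
    }
    return total;
}
```

Under these assumptions, the 11 keys need 20 probes in total, so the successful ASL is 20/11 ≈ 1.82, and the unsuccessful searches need 52 probes over 13 hash values, so the unsuccessful ASL is 52/13 = 4. A different table length would give different numbers.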

VI. Code implementation

The following code implements a hash table that uses chaining to handle collisions. The header file data_structure.h, which describes the data structures, is as follows:

#include <stdbool.h>

#define M 7     // divisor in the hash function; must be less than or equal to the table length
typedef int ElemType;

// The hash table uses chaining to resolve collisions

typedef struct Node
{   // Node is the data structure of a linked list node
    ElemType data;
    struct Node *next;
} Node, *pNode;

// Data structure of each slot of the hash table
typedef struct HashNode
{
    pNode first;    // first node of the slot's list
} HashNode, *HashTable;

// Create a hash table
HashTable create_hashtable(int);

// Find data in the hash table
pNode search_hashtable(HashTable, ElemType);

// Insert data into the hash table
bool insert_hashtable(HashTable, ElemType);

// Delete data from the hash table
bool delete_hashtable(HashTable, ElemType);

// Destroy the hash table
void destroy_hashtable(HashTable, int);

First create an empty hash table, then perform insert, delete, and search operations on it, and finally destroy it. The implementation code of the hash table is as follows:

#include <stdio.h>
#include <stdlib.h>
#include "data_structure.h"

// Create a hash table with n slots
HashTable create_hashtable(int n)
{
    HashTable hashtable = (HashTable)malloc(n * sizeof(HashNode));
    if (!hashtable) {
        printf("hashtable malloc failed, program exit...");
        exit(-1);
    }
    // Initialize every slot to empty
    for (int i = 0; i < n; i++)
        hashtable[i].first = NULL;
    return hashtable;
}

// Find data in the hash table; the hash function is H(key) = key % M.
// Returns the node's position in the linked list on success, NULL on failure.
// Keys are assumed to be non-negative so that data % M is a valid index.
pNode search_hashtable(HashTable hashtable, ElemType data)
{
    if (!hashtable)
        return NULL;
    pNode pCur = hashtable[data % M].first;
    while (pCur && pCur->data != data)
        pCur = pCur->next;
    return pCur;
}

// Insert data into the hash table; the hash function is H(key) = key % M.
// Returns false if data already exists; otherwise appends it to the end
// of the corresponding list and returns true.
bool insert_hashtable(HashTable hashtable, ElemType data)
{
    if (!hashtable || search_hashtable(hashtable, data))
        return false;
    // Allocate space for the new node
    pNode pNew = (pNode)malloc(sizeof(Node));
    if (!pNew) {
        printf("pNew malloc failed, program exit...");
        exit(-1);
    }
    pNew->data = data;
    pNew->next = NULL;
    // Append the node to the end of the corresponding list
    pNode pCur = hashtable[data % M].first;
    if (!pCur)      // the list is empty, so the new node becomes the first node
        hashtable[data % M].first = pNew;
    else {          // walk to the last node; the link must be made through
                    // pCur->next, because assigning to pCur changes nothing in the list
        while (pCur->next)
            pCur = pCur->next;
        pCur->next = pNew;
    }
    return true;
}

// Delete data from the hash table; the hash function is H(key) = key % M.
// Returns false if data does not exist; otherwise deletes it and returns true.
bool delete_hashtable(HashTable hashtable, ElemType data)
{
    if (!search_hashtable(hashtable, data))
        return false;
    pNode pCur = hashtable[data % M].first;
    pNode pPre = pCur;  // predecessor of the node to delete; starts equal to pCur
    if (pCur->data == data)     // the node to delete is the first node of the list
        hashtable[data % M].first = pCur->next;
    else {                      // the node to delete is not the first node
        while (pCur && pCur->data != data) {
            pPre = pCur;
            pCur = pCur->next;
        }
        pPre->next = pCur->next;
    }
    free(pCur);
    return true;
}

// Destroy the hash table
void destroy_hashtable(HashTable hashtable, int n)
{
    // Free the nodes list by list
    for (int i = 0; i < n; i++) {
        pNode pCur = hashtable[i].first;
        while (pCur) {
            pNode pDel = pCur;
            pCur = pCur->next;
            free(pDel);
        }
    }
    // Finally free the hash table itself
    free(hashtable);
}

Finally, the test code:

#include <stdio.h>
#include "data_structure.h"

int main()
{
    int len = 15;   // hash table length, i.e. the number of slots in the table
    printf("We set the length of hashtable %d\n", len);

    // Create a hash table and insert data
    HashTable hashtable = create_hashtable(len);
    if (insert_hashtable(hashtable, 1)) printf("Insert 1 success\n");
    else printf("Insert 1 fail, it already exists in the hashtable\n");
    if (insert_hashtable(hashtable, 8)) printf("Insert 8 success\n");
    else printf("Insert 8 fail, it already exists in the hashtable\n");
    if (insert_hashtable(hashtable, 3)) printf("Insert 3 success\n");
    else printf("Insert 3 fail, it already exists in the hashtable\n");
    if (insert_hashtable(hashtable, 10)) printf("Insert 10 success\n");
    else printf("Insert 10 fail, it already exists in the hashtable\n");
    if (insert_hashtable(hashtable, 8)) printf("Insert 8 success\n");
    else printf("Insert 8 fail, it already exists in the hashtable\n");

    // Find data
    pNode pFind1 = search_hashtable(hashtable, 10);
    if (pFind1) printf("Find %d in the hashtable\n", pFind1->data);
    else printf("Not find 10 in the hashtable\n");
    pNode pFind2 = search_hashtable(hashtable, 4);
    if (pFind2) printf("Find %d in the hashtable\n", pFind2->data);
    else printf("Not find 4 in the hashtable\n");

    // Delete data
    if (delete_hashtable(hashtable, 1)) printf("Delete 1 success\n");
    else printf("Delete 1 fail\n");
    pNode pFind3 = search_hashtable(hashtable, 1);
    if (pFind3) printf("Find %d in the hashtable\n", pFind3->data);
    else printf("Not find 1 in the hashtable, it has been deleted\n");

    // Destroy the hash table
    destroy_hashtable(hashtable, len);
    return 0;
}

VII. References

Introduction to Algorithms, Third Edition, T. H. Cormen et al. (Chinese edition translated by Yin Jianping et al.)
http://blog.csdn.net/ns_code

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
