Hash table and perfect hash

Source: Internet
Author: User

We know that direct addressing lets us access any element of an array in O(1) time. Therefore, if storage permits, we can provide an array that reserves a position for every possible key and apply direct addressing.

A hash table is a generalization of the ordinary array concept. When the number of keys actually stored is small relative to the total number of possible keys, a hash table is more effective than direct array addressing, because the size of the array used by a hash table is proportional to the number of keys to be stored.

A hash table is a dynamic-set data structure. Under some reasonable assumptions, the expected time to find an element in a hash table is O(1).

  • Hash table (hashtable)
    • Hash Conflict Resolution Policy: Open addressing
      • Linear probing
      • Quadratic probing
      • Rehashing / double hashing
    • Hash Conflict Resolution Policy: Chaining
  • Design of Hash Functions
    • The division method
    • The multiplication method
    • Universal hashing
  • Perfect hashing

Hash table (hashtable)

Suppose we want to use each employee's social security number as the unique identifier under which the employee's record is stored. A social security number has the format DDD-DD-DDDD, where each D is a digit from 0 to 9.

If the employee records are stored in an unsorted array and we want to find the employee with social security number 111-22-3333, we may have to traverse every position in the array, so the running time is O(n). A better way is to keep the array sorted by social security number and use binary search, which reduces the query time to O(log n). Ideally, however, we would like the query time to be O(1).

One solution is to create a large array ranging from 000-00-0000 to 999-99-9999.

The disadvantage of this solution is a waste of space. If we only need to store the information of 1000 employees, we only use 0.0001% of the space.

The second solution is to use a hash function to compress the range of keys.

We choose to use the last four digits of the social security number as the index to reduce the span of the interval. The range is from 0000 to 9999.

In mathematics, this conversion from nine digits to four digits is called hashing. It compresses the index space of the array down to the size of the corresponding hash table.

In the preceding example, the input of the hash function is a nine-digit Social Security number, and the output result is the last four digits.

H(x) = last four digits of x
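A minimal Python sketch of this hash function is shown here for illustration. The function name and the string representation of the social security number are assumptions made for the example, not part of the original text.

```python
def last_four_digits_hash(ssn: str) -> int:
    """Map a social security number such as '111-22-3333' to a slot in 0..9999."""
    digits = ssn.replace("-", "")   # drop the separators: '111223333'
    return int(digits[-4:])         # keep only the last four digits


print(last_four_digits_hash("111-22-3333"))  # 3333
print(last_four_digits_hash("999-99-0000"))  # 0
```

Two different inputs that end in the same four digits map to the same slot, which is exactly the situation discussed next.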

This example also illustrates a common occurrence when computing hash functions: hash collisions (also called hash conflicts). For instance, two different social security numbers may both end in 0000.

When a new element is added to a hash table, collisions are what make the operation expensive. If no collision occurs, the element is inserted directly. If a collision occurs, extra work is needed to deal with it. Hash collisions therefore increase the cost of operations, and hash tables are designed to keep collisions to a minimum.

There are two ways to handle hash collisions: avoiding them and resolving them, that is, collision avoidance and collision resolution.

One way to avoid hash collisions is to choose an appropriate hash function. The probability of collisions under a given hash function depends on the distribution of the data. For example, if the last four digits of social security numbers are randomly distributed, using the last four digits is appropriate. However, if the last four digits were assigned from the employee's year of birth, they would clearly not be evenly distributed, and using them would cause a large number of collisions. Choosing an appropriate hash function in this way is called collision avoidance.

Many strategies can be applied once a collision has occurred. These strategies are called collision resolution. One approach is to place the element being inserted in a different location, because its own hash location is already occupied.

Hash Conflict Resolution Policy: Open addressing

The most common collision resolution strategy is open addressing, in which all elements are stored in the hash table's own array, without any additional data structure.

One of the simplest implementations of open addressing is linear probing. The steps are as follows:

  1. When a new element is inserted, use the hash function to locate the element's position in the hash table.
  2. Check whether that position in the hash table is already occupied. If it is empty, insert the element and return. Otherwise, go to step 3.
  3. If the position is i, check whether i + 1 is empty. If it is occupied, check i + 2, and so on until an empty location is found.

Now, if we want to insert the information of five employees into the hash table:

  • Alice (333-33-1234)
  • Bob (444-44-1234)
  • Cal (555-55-1237)
  • Danny (000-00-1235)
  • Edward (111-00-1235)

After insertion, locations 1234 through 1238 of the hash table hold Alice, Bob, Danny, Cal, and Edward, respectively.

Element insertion process:

  • Alice's social security number hashes to 1234, so she is stored at location 1234.
  • Bob's social security number also hashes to 1234, but Alice is already stored at location 1234, so the next location, 1235, is checked. It is empty, so Bob is placed at 1235.
  • Cal's social security number hashes to 1237. Location 1237 is empty, so Cal is placed at 1237.
  • Danny's social security number hashes to 1235. Location 1235 is occupied, so location 1236 is checked. It is empty, so Danny is placed at 1236.
  • Edward's social security number hashes to 1235. Location 1235 is occupied, so 1236 is checked, which is also occupied, then 1237, which is also occupied. Location 1238 is empty, so Edward is placed at 1238.
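The insertion process can be reproduced with a short Python sketch. The table size of 10,000 slots follows the four-digit index range used in the example. The helper names and the wrap-around at the end of the array are assumptions made for illustration.

```python
TABLE_SIZE = 10_000  # slots 0000..9999, as in the example


def h(ssn: str) -> int:
    """Hash function from the example: the last four digits of the number."""
    return int(ssn.replace("-", "")[-4:])


def insert_linear(table: list, ssn: str, name: str) -> int:
    """Insert (ssn, name) using linear probing and return the slot used."""
    start = h(ssn)
    for step in range(TABLE_SIZE):
        slot = (start + step) % TABLE_SIZE  # assumption: wrap around at the end
        if table[slot] is None:
            table[slot] = (ssn, name)
            return slot
    raise RuntimeError("hash table is full")


table = [None] * TABLE_SIZE
employees = [
    ("333-33-1234", "Alice"),
    ("444-44-1234", "Bob"),
    ("555-55-1237", "Cal"),
    ("000-00-1235", "Danny"),
    ("111-00-1235", "Edward"),
]
placed = {name: insert_linear(table, ssn, name) for ssn, name in employees}
print(placed)
# {'Alice': 1234, 'Bob': 1235, 'Cal': 1237, 'Danny': 1236, 'Edward': 1238}
```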

Although linear probing is simple, it is not the best way to resolve collisions, because it causes elements with nearby hash values to cluster together (primary clustering). As a result, lookups also have to step over occupied locations. For example, to look up Edward in the hash table above, his social security number 111-00-1235 hashes to 1235, but Bob is found at location 1235. The search moves on to 1236 and finds Danny, then to 1237 and finds Cal, and only at 1238 does it find Edward.

An improved method is quadratic probing, in which the distance to each checked location grows as a square. That is, if location S is occupied, first check S + 1², then S - 1², S + 2², S - 2², S + 3², and so on, instead of growing as S + 1, S + 2, ... as in linear probing. However, quadratic probing also causes a similar clustering problem (secondary clustering).
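As an illustration, the alternating quadratic probe sequence can be generated as follows. This is only a sketch: the function name, the `max_steps` parameter, and the reduction modulo the table size are assumptions made for the example.

```python
def quadratic_probe_sequence(start: int, table_size: int, max_steps: int):
    """Yield start + 1², start - 1², start + 2², start - 2², ... modulo table_size."""
    for i in range(1, max_steps + 1):
        yield (start + i * i) % table_size
        yield (start - i * i) % table_size


# Locations tried after the home location 1235 turns out to be occupied.
print(list(quadratic_probe_sequence(1235, 10_000, 3)))
# [1236, 1234, 1239, 1231, 1244, 1226]
```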

Another improved open addressing method is called rehashing (also called double hashing).

The principle of rehashing is as follows:

There is a set of hash functions H1 ... Hn. To add an element to the hash table or retrieve one from it, first use the hash function H1. If a collision occurs, try H2, and so on up to Hn. All of the hash functions are very similar to H1. They differ only in the multiplicative factor they use.

In .NET, the hash function Hk of the Hashtable class is defined as follows:

Hk(key) = [GetHash(key) + k * (1 + (((GetHash(key) >> 5) + 1) % (hashsize - 1)))] % hashsize

With rehashing, it is important that after hashsize probes every location in the hash table has been visited exactly once. That is, for a given key, Hi and Hj never yield the same position when i ≠ j (for i and j between 0 and hashsize - 1). The Hashtable class guarantees this by always keeping (1 + (((GetHash(key) >> 5) + 1) % (hashsize - 1))) and hashsize relatively prime (two numbers are relatively prime if they share no common prime factor).
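The following Python sketch evaluates the Hk formula for a small table to show that the probe sequence visits every slot exactly once. It is not the .NET implementation. The integer `hash_code` is a hypothetical stand-in for GetHash(key), and the table size 11 is chosen only because it is a small prime.

```python
def probe(hash_code: int, k: int, hashsize: int) -> int:
    """k-th probe position, following the Hk formula quoted above."""
    increment = 1 + (((hash_code >> 5) + 1) % (hashsize - 1))
    return (hash_code + k * increment) % hashsize


hashsize = 11      # a prime, so any increment in 1..10 is relatively prime to it
hash_code = 1234   # hypothetical stand-in for GetHash(key)

sequence = [probe(hash_code, k, hashsize) for k in range(hashsize)]
print(sequence)
print(sorted(sequence) == list(range(hashsize)))  # True
```

Because the increment always lies between 1 and hashsize - 1, choosing a prime hashsize is enough to make the two numbers relatively prime.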

Double hashing uses Θ(m²) probe sequences, whereas linear probing and quadratic probing use only Θ(m) probe sequences, so double hashing provides a better strategy for avoiding collisions.

When a new element is added, Hashtable checks that the ratio of stored elements to the number of slots (the load factor) does not exceed its maximum. If it would, the hash table is expanded as follows:

  • The number of slots in the hash table is roughly doubled. To be precise, it increases from the current prime value to a larger prime value.
  • With rehashing, the position of every element depends on the number of slots in the hash table, so all elements in the table must be re-hashed.

From this we can see that expanding a hash table costs performance. We should therefore estimate in advance how many elements the hash table is likely to hold and construct it with a suitable initial size, to avoid unnecessary expansion.
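The expansion step can be sketched in Python as follows. This is an illustration under stated assumptions, not the Hashtable implementation: keys are non-negative integers, the new size is taken to be the first prime at or above twice the old size, and linear probing is used during re-insertion only to keep the example short.

```python
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def next_prime_at_least(n: int) -> int:
    while not is_prime(n):
        n += 1
    return n


def grow(slots: list) -> list:
    """Return a larger table in which every stored (key, value) pair is re-hashed."""
    new_size = next_prime_at_least(2 * len(slots))
    new_slots = [None] * new_size
    for entry in slots:
        if entry is None:
            continue
        key, _value = entry
        pos = key % new_size               # the slot depends on the new size
        while new_slots[pos] is not None:  # linear probing, for brevity only
            pos = (pos + 1) % new_size
        new_slots[pos] = entry
    return new_slots


old = [None] * 11
old[1], old[2], old[3] = (12, "a"), (23, "b"), (34, "c")  # all hash to 1 mod 11
new = grow(old)
print(len(new))  # 23
```

Every stored element is visited once during expansion, which is why the text recommends choosing a suitable initial size.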

Hash Conflict Resolution Policy: Chaining

Chaining is a collision resolution strategy. With chaining, all elements that hash to the same slot are placed in a linked list.

With probing, when a collision occurs the next location in the table is tried. With rehashing, a new hash value is computed for each further probe. Chaining instead uses an additional data structure to handle collisions: each location (slot) in the hash table maps to a linked list. Each bucket holds a linked list of the elements that share the same hash, and when a collision occurs the colliding element is added to that bucket's list.

Consider a hash table that contains eight buckets. When a new element is added to the hash table, it is added to the bucket that corresponds to the hash of its key. If an element already exists at that position, the new element is added to the front of the list.

Adding an element with chaining involves computing a hash and a linked-list operation, so it still takes constant time: the asymptotic running time is O(1). The average time for lookup and deletion depends on the number of elements and the number of buckets. Specifically, the running time is O(n/m), where n is the total number of elements and m is the number of buckets. Hash table implementations almost always keep n = O(m), that is, the number of elements is at most a constant multiple of the number of buckets, so O(n/m) is also the constant O(1).
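A minimal Python sketch of a chained hash table is given here for illustration. The class and method names are hypothetical, Python's built-in `hash` stands in for the hash function, and a `collections.deque` stands in for the linked list because it supports constant-time insertion at the front. Duplicate keys are not checked on insertion.

```python
from collections import deque


class ChainedHashTable:
    def __init__(self, num_buckets: int = 8):
        # One chain per bucket; every chain starts out empty.
        self.buckets = [deque() for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        # New elements go to the front of the chain, which takes O(1) time.
        self._bucket(key).appendleft((key, value))

    def search(self, key):
        # Walk the chain; on average it holds n/m elements.
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

    def delete(self, key) -> bool:
        bucket = self._bucket(key)
        for i, (k, _v) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return True
        return False


t = ChainedHashTable()
t.insert(1, "one")
t.insert(9, "nine")   # 9 % 8 == 1, so this collides with key 1
print(t.search(1), t.search(9), t.search(17))  # one nine None
```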

Design of Hash Functions

A good hash function should (approximately) satisfy the assumption of simple uniform hashing: every key is equally likely to hash to any of the m slots, regardless of which slots other keys have hashed to. Unfortunately, it is usually impossible to check this condition, because the probability distribution that the keys follow is rarely known, and the keys may not be drawn independently. In practice, heuristics are often used to construct hash functions that perform well. For example, known information about the distribution of keys can be used in the design.

The division method and the multiplication method are heuristics, whereas universal hashing uses randomization to obtain good performance.

The Division Method

A good hash function derives the hash value in a way that is independent of any patterns that might exist in the data. For example, the division method divides the given key by a specific prime number and uses the remainder as the key's hash value.

The division hash function can be expressed as:

hash(key) = key mod m

Here key is the key to be hashed, m is the size of the hash table, and mod is the remainder operation. Provided that the chosen prime m is unrelated to any pattern in the distribution of keys, this method often gives good results.
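The effect of the choice of m can be seen in a small Python sketch. The keys and the two table sizes below are made up for illustration.

```python
def division_hash(key: int, m: int) -> int:
    """Division method: the remainder of key divided by the table size m."""
    return key % m


keys = [100, 200, 300, 400]   # keys that share an obvious pattern

# A table size that matches the pattern sends every key to the same slot.
print([division_hash(k, 100) for k in keys])  # [0, 0, 0, 0]

# A prime table size unrelated to the pattern spreads the keys out.
print([division_hash(k, 97) for k in keys])   # [3, 6, 9, 12]
```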

The Multiplication Method

The multiplication hash function can be expressed as:

hash(key) = floor( m * ( A * key mod 1) )

Here floor rounds its argument down to an integer, A is a constant in the range 0 < A < 1, m is the size of the hash table, and mod is the remainder operation. [A * key mod 1] means that the key is multiplied by a number between 0 and 1 and only the fractional part of the product is kept. This expression is equivalent to [A * key - floor(A * key)].

One advantage of the multiplication method is that the choice of m is not critical. It is usually chosen to be a power of 2, because the hash function can then be implemented conveniently on most computers.

Although this method works with any value of A, it works better with some values than with others. The best choice depends on the characteristics of the data being hashed. Donald Knuth suggests that A ≈ (√5 - 1)/2 = 0.6180339887... is a good choice. This value is the reciprocal of the golden ratio.
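A floating-point Python sketch of the multiplication method follows. The key 123456 and the table size 2^14 are arbitrary values picked for the example, and a production implementation would more likely use integer arithmetic on machine words than floating point.

```python
import math

A = (math.sqrt(5) - 1) / 2   # Knuth's suggested constant, about 0.6180339887


def multiplication_hash(key: int, m: int) -> int:
    """Multiplication method: floor(m * fractional part of A * key)."""
    fractional = (A * key) % 1.0
    return math.floor(m * fractional)


m = 2 ** 14   # 16384 slots; a power of 2, as the text suggests
print(multiplication_hash(123456, m))  # 67
```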

Universal hashing

When elements are inserted into a hash table, if all of them hash to the same bucket, the data is effectively stored in a single linked list and the average search time is Θ(n). In fact, any fixed hash function is vulnerable to this worst case. The only effective remedy is to choose the hash function randomly, in a way that is independent of the elements to be stored. This method is called universal hashing.

The basic idea of universal hashing is to select a hash function at random from a family of hash functions at the beginning of execution. As with quicksort, randomization ensures that no single input always triggers the worst case. Randomization also means that the algorithm may behave differently on different runs with the same input. This gives the algorithm good average-case performance for any input.

hash_a,b(key) = ((a * key + b) mod p) mod m

Here p is a prime large enough that every possible key falls in the range 0 to p - 1, and m is the number of slots in the hash table. a is any element of {1, 2, 3, ..., p - 1} and b is any element of {0, 1, 2, ..., p - 1}.
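Drawing one function from this family can be sketched in Python as follows. The factory function name, the choice p = 2^31 - 1, and the fixed random seed are assumptions made so that the example is reproducible. Keys are assumed to be integers in the range 0 to p - 1.

```python
import random


def make_universal_hash(p: int, m: int, rng: random.Random):
    """Pick a and b at random and return hash_a,b(key) = ((a*key + b) mod p) mod m."""
    a = rng.randrange(1, p)   # a is drawn from {1, ..., p - 1}
    b = rng.randrange(0, p)   # b is drawn from {0, ..., p - 1}

    def h(key: int) -> int:
        return ((a * key + b) % p) % m

    return h


p = 2_147_483_647             # the prime 2^31 - 1, assumed larger than every key
h = make_universal_hash(p, 8, random.Random(42))
print([h(key) for key in (10, 20, 30, 40)])
```

The function is chosen once, when the table is created, and then used unchanged for every operation on that table.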

Perfect hashing

When the set of keys is static, that is, it never changes once stored, hashing can also be used to obtain excellent worst-case performance. A hashing technique is called perfect hashing if searching requires O(1) memory accesses in the worst case.

The basic idea of perfect hashing is to use a two-level hashing scheme, with universal hashing at each level.

The first level is essentially the same as a hash table with chaining: the n keys are hashed into m slots using a hash function h selected at random from a universal family of hash functions.

At the second level, instead of the linked list used in chaining, each slot j uses a small secondary hash table Sj with an associated hash function hj. By choosing the hash function hj carefully, we can guarantee that there are no collisions at the second level.

If n keys are stored in a hash table of size m = n² using a hash function h selected at random from a universal family of hash functions, the probability that any collision occurs is less than 1/2.

To guarantee that there are no collisions at the second level, the size mj of the hash table Sj must be the square of the number nj of keys that hash to slot j. This quadratic dependence of mj on nj seems to demand a lot of total storage, but with a suitable choice of the first-level hash function the expected total storage is still O(n).

If the number of keys n is equal to the number of slots m, the hash function is called a minimal perfect hash function.
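The two-level scheme described above can be sketched in Python as follows. This is an illustrative construction under stated assumptions rather than a reference implementation: keys are distinct non-negative integers smaller than the prime P, the set of keys is non-empty, the first level uses m = n slots, and the bound 4 * n on the total secondary size is a constant chosen for the sketch.

```python
import random

P = 2_147_483_647  # a prime assumed to be larger than every key


def random_hash(m: int, rng: random.Random):
    """Draw ((a*key + b) mod P) mod m from the universal family."""
    a, b = rng.randrange(1, P), rng.randrange(0, P)
    return lambda key: ((a * key + b) % P) % m


def build_perfect_table(keys, rng: random.Random):
    n = len(keys)
    # Level 1: retry until the secondary tables need only O(n) slots in total.
    while True:
        h = random_hash(n, rng)
        groups = [[] for _ in range(n)]
        for key in keys:
            groups[h(key)].append(key)
        if sum(len(g) ** 2 for g in groups) <= 4 * n:
            break
    # Level 2: slot j gets a table of size nj², and hj is redrawn until
    # no two keys of the group collide.
    secondary = []
    for group in groups:
        m_j = len(group) ** 2
        while True:
            h_j = random_hash(m_j, rng) if m_j else None
            slots = [None] * m_j
            collision = False
            for key in group:
                pos = h_j(key)
                if slots[pos] is not None:
                    collision = True
                    break
                slots[pos] = key
            if not collision:
                break
        secondary.append((h_j, slots))
    return h, secondary


def contains(table, key: int) -> bool:
    """Look up a key with one first-level probe and one second-level probe."""
    h, secondary = table
    h_j, slots = secondary[h(key)]
    return bool(slots) and slots[h_j(key)] == key


keys = [10, 22, 37, 40, 52, 60, 70, 72, 75]
table = build_perfect_table(keys, random.Random(1))
print(all(contains(table, k) for k in keys))  # True
print(contains(table, 11))                    # False
```

Construction may retry a few times at each level, but once the table is built every lookup touches exactly two slots.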

This article, "Hash table and perfect hash", was published by Dennis Gao on the blog site. Reposting by any person or crawler without the author's permission is not allowed.