Hash table and Perfect hash

Source: Internet
Author: User

Any element of an array can be accessed in O(1) time by direct addressing. Therefore, if storage space allows, you can allocate an array that reserves one position for every possible key and apply direct addressing.

A hash table is a generalization of the ordinary array. When the number of keys actually stored is small relative to the total number of possible keys, a hash table is more space-efficient than direct array addressing, because a hash table typically uses an array whose size is proportional to the number of keys actually stored.

A hash table is a data structure for dynamic sets, and under some reasonable assumptions the expected time to find an element in a hash table is O(1).

    • Hash table (Hashtable)
      • Hash collision resolution strategy: Open addressing
        • Linear probing
        • Quadratic probing
        • Rehashing / double hashing
      • Hash collision resolution strategy: Chaining
    • Design of hash function
      • Division Hash Method (the division method)
      • Multiplication Hash Method (the multiplication method)
      • Universal Hash Method (universal hashing)
    • Perfect Hash (perfect hashing)
Hash table (Hashtable)

Now suppose we want to use an employee's Social Security number as the unique identifier for storage. A Social Security number has the form DDD-DD-DDDD, where each D is a digit from 0 to 9.

If you store employee information in an unsorted array, then to find the employee with Social Security number 111-22-3333 you may have to traverse every position of the array, that is, the query has an asymptotic running time of O(n). A better approach is to keep the array sorted by Social Security number so that binary search reduces the query time to O(log n). Ideally, though, we would like queries to take O(1) time.

One option is to create a large array, ranging from 000-00-0000 to 999-99-9999.

The disadvantage of this scheme is that it wastes space. If we only need to store the information of 1000 employees, we use only 1000 of the 10^9 positions, that is, 0.0001% of the space.

The second option is to compress the index range with a hash function.

We choose the last four digits of the Social Security number as the index, which shrinks the index range to 0000 through 9999.

Mathematically, this conversion from 9 digits to 4 digits is called hashing. Hashing compresses the index space of an array into a hash table of suitable size.

In the example above, the input of the hash function is a 9-digit Social Security number, and the output is its last 4 digits.

H(x) = last four digits of x

This example also exposes a common occurrence in hashing: hash collisions. For instance, two different Social Security numbers may both end in 0000.
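As a minimal sketch of the hash function H described above (the function name last_four is made up for this illustration), the following Python snippet maps a Social Security number to its last four digits and shows two different numbers landing on the same index:

```python
def last_four(ssn: str) -> int:
    """Hash an SSN of the form DDD-DD-DDDD to its last four digits."""
    digits = ssn.replace("-", "")
    return int(digits[-4:])

print(last_four("111-22-3333"))  # 3333

# Two different SSNs that share their last four digits collide.
print(last_four("333-33-1234") == last_four("444-44-1234"))  # True
```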

When you add a new element to a hash table, a collision complicates the operation. If no collision occurs, the element is inserted directly. If a collision occurs, it must be resolved before the element can be stored. Collisions therefore increase the cost of operations, and a design goal of a hash table is to minimize how often they occur.

There are two ways of dealing with hash collisions: collision avoidance and collision resolution.

One way to avoid hash collisions is to choose an appropriate hash function. The probability that a hash function produces collisions depends on the distribution of the data. For example, if the last 4 digits of Social Security numbers are randomly distributed, then using the last 4 digits is appropriate. However, if the last 4 digits encode the employee's year of birth, which is clearly not evenly distributed, then choosing the last 4 digits will cause many collisions. Choosing an appropriate hash function in this way is called collision avoidance.

Several strategies are available for handling a collision once it occurs; they are called collision resolution mechanisms. One approach is to place the element being inserted in a different slot, because its own hash position is already occupied.

Hash collision resolution strategy: Open addressing

A commonly used collision resolution strategy is open addressing, in which all elements are stored in the array of the hash table itself, without any additional data structure.

One of the simplest implementations of open addressing is linear probing, which works as follows:

    1. When a new element is inserted, the hash function is used to compute its position in the hash table.
    2. Check whether that position in the hash table is already occupied. If it is empty, insert the element and return; otherwise go to step 3.
    3. If the position is i, check whether i+1 is empty; if it is occupied, check i+2, and so on, until an empty position is found.

Now suppose we insert the information of five employees into a hash table:

    • Alice (333-33-1234)
    • Bob (444-44-1234)
    • Cal (555-55-1237)
    • Danny (000-00-1235)
    • Edward (111-00-1235)

After these insertions, the hash table holds Alice at position 1234, Bob at 1235, Danny at 1236, Cal at 1237, and Edward at 1238.

The elements are inserted as follows:

    • Alice's Social Security number hashes to 1234, so she is stored at position 1234.
    • Bob's Social Security number also hashes to 1234, but position 1234 already holds Alice's information, so the next position, 1235, is checked. Position 1235 is empty, so Bob's information is stored at 1235.
    • Cal's Social Security number hashes to 1237. Position 1237 is empty, so Cal is stored at 1237.
    • Danny's Social Security number hashes to 1235. Position 1235 is occupied, so position 1236 is checked. It is empty, so Danny is stored at 1236.
    • Edward's Social Security number hashes to 1235. Position 1235 is occupied, 1236 is also occupied, and so is 1237. Position 1238 is empty, so Edward is stored at 1238.
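The walkthrough above can be reproduced with a short Python sketch. The names TABLE_SIZE, h, insert and find are illustrative only, and wrapping around at the end of the table is an assumption the text does not spell out. The find helper also reports how many slots it had to examine.

```python
TABLE_SIZE = 10_000  # slots 0000..9999, matching the last-four-digits hash

def h(ssn: str) -> int:
    """Hash an SSN to its last four digits."""
    return int(ssn.replace("-", "")[-4:])

def insert(table, name, ssn):
    """Insert with linear probing and return the slot that was used."""
    start = h(ssn)
    for step in range(TABLE_SIZE):
        slot = (start + step) % TABLE_SIZE  # wrap-around is an assumption
        if table[slot] is None:
            table[slot] = (ssn, name)
            return slot
    raise RuntimeError("hash table is full")

def find(table, ssn):
    """Return (name, slots_examined); name is None if the SSN is absent."""
    start = h(ssn)
    for step in range(TABLE_SIZE):
        entry = table[(start + step) % TABLE_SIZE]
        if entry is None:
            return None, step + 1
        if entry[0] == ssn:
            return entry[1], step + 1
    return None, TABLE_SIZE

table = [None] * TABLE_SIZE
employees = [
    ("Alice", "333-33-1234"),
    ("Bob", "444-44-1234"),
    ("Cal", "555-55-1237"),
    ("Danny", "000-00-1235"),
    ("Edward", "111-00-1235"),
]
slots = {name: insert(table, name, ssn) for name, ssn in employees}
print(slots)
# {'Alice': 1234, 'Bob': 1235, 'Cal': 1237, 'Danny': 1236, 'Edward': 1238}
print(find(table, "111-00-1235"))  # ('Edward', 4)
```

Looking up Edward examines four slots (1235, 1236, 1237, 1238) even though only one other key shares his hash value, which is the cost of the run of occupied slots.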

Linear probing is simple, but it is not the best strategy for resolving collisions, because it causes elements with the same or nearby hash values to pile up in runs of occupied slots (primary clustering). As a result, searches in the hash table also run into collisions. For example, to access Edward's information in the table above, we hash his Social Security number 111-00-1235 to 1235, but at position 1235 we find Bob, so we search 1236 and find Danny, and so on until we find Edward.

An improvement is quadratic probing, in which the offset from the original position grows as the square of the probe number. That is, if position s is occupied, first check s + 1², then s - 1², s + 2², s - 2², s + 3², and so on, instead of growing as s + 1, s + 2, ... as in linear probing. However, quadratic probing can cause a similar clustering problem (secondary clustering).
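To make the probe order concrete, here is a small Python sketch that generates the sequence s, s + 1², s - 1², s + 2², s - 2², ... The function name and the reduction modulo the table size m are assumptions made for this illustration.

```python
def quadratic_probe_sequence(s: int, m: int, limit: int):
    """Yield the first `limit` slots of s, s+1², s-1², s+2², s-2², ... (mod m)."""
    yield s % m
    produced = 1
    i = 1
    while produced < limit:
        for sign in (+1, -1):
            if produced >= limit:
                break
            yield (s + sign * i * i) % m
            produced += 1
        i += 1

print(list(quadratic_probe_sequence(1235, 10_000, 7)))
# [1235, 1236, 1234, 1239, 1231, 1244, 1226]
```

Keys with the same initial position s still follow exactly the same sequence, which is why secondary clustering remains.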

Another improved open addressing method is known as rehashing (or double hashing).

Rehashing works as follows:

There is a set of hash functions H1 ... Hn. When you need to add an element to the hash table or retrieve one from it, first use the hash function H1. If this leads to a collision, try H2, and so on, up to Hn. All of the hash functions are very similar to H1; they differ only in the multiplicative factor they use.

The hash function Hk of the Hashtable class in .NET is defined as follows:

Hk(key) = [GetHash(key) + k * (1 + (((GetHash(key) >> 5) + 1) % (hashsize - 1)))] % hashsize

When using rehashing, it is important that after hashsize probes every position in the hash table has been visited exactly once. That is, for a given key, no position in the hash table is produced by both Hi and Hj for i ≠ j. The Hashtable class guarantees this because its rehashing formula always keeps the value of (1 + (((GetHash(key) >> 5) + 1) % (hashsize - 1))) and hashsize relatively prime (two numbers are relatively prime when they share no common prime factor).
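The following Python sketch evaluates the Hk formula for every k from 0 to hashsize - 1 and checks that each slot is visited exactly once. It is not the .NET implementation: the integer get_hash stands in for GetHash(key), is assumed to be non-negative, and its value is made up, as is the prime table size of 11.

```python
def probe(get_hash: int, k: int, hashsize: int) -> int:
    """k-th probe position according to the Hk formula quoted above."""
    step = 1 + (((get_hash >> 5) + 1) % (hashsize - 1))
    return (get_hash + k * step) % hashsize

hashsize = 11        # a prime table size, chosen for illustration
get_hash = 1234567   # hypothetical stand-in for GetHash(key)

sequence = [probe(get_hash, k, hashsize) for k in range(hashsize)]
print(sequence)
# The step lies between 1 and hashsize - 1, so it shares no factor with the
# prime hashsize and the sequence is a permutation of all slots.
print(sorted(sequence) == list(range(hashsize)))  # True
```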

Rehashing uses Θ(m²) probe sequences, whereas linear probing and quadratic probing use only Θ(m) probe sequences, so rehashing provides a better strategy for avoiding collisions.

When you add a new element to a Hashtable, a check ensures that the ratio of elements to slots does not exceed the maximum allowed ratio (the load factor). If it would be exceeded, the hash table is expanded. The steps are as follows:

    • The number of slots in the hash table is approximately doubled. More precisely, it is increased from the current prime number to the next larger prime number.
    • Because rehashing is used, the position of every element in the hash table depends on the number of slots, so all values in the table must be rehashed.

As a result, expanding the hash table costs performance. You should therefore estimate in advance how many elements the hash table is likely to hold and pass a suitable initial capacity when constructing the Hashtable, to avoid unnecessary expansions.
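The expansion steps can be sketched roughly as follows. This is an illustration in Python, not the .NET Hashtable: the class name DoubleHashTable, the threshold MAX_LOAD = 0.72, the initial size of 11, and picking the first prime at or above twice the current size are all assumptions made for the example.

```python
MAX_LOAD = 0.72  # illustrative load-factor threshold (an assumption)

def is_prime(n: int) -> bool:
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def next_prime_at_least(n: int) -> int:
    while not is_prime(n):
        n += 1
    return n

class DoubleHashTable:
    def __init__(self, size: int = 11):
        self.slots = [None] * size   # size is kept prime
        self.count = 0

    def _probe(self, h: int, k: int) -> int:
        size = len(self.slots)
        step = 1 + (((h >> 5) + 1) % (size - 1))
        return (h + k * step) % size

    def _place(self, key, value):
        h = hash(key) & 0x7FFFFFFF   # non-negative hash
        for k in range(len(self.slots)):
            i = self._probe(h, k)
            if self.slots[i] is None:
                self.slots[i] = (key, value)
                self.count += 1
                return
            if self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return
        raise RuntimeError("table full")

    def _grow(self):
        # Positions depend on the table size, so every entry is re-inserted.
        old = [entry for entry in self.slots if entry is not None]
        self.slots = [None] * next_prime_at_least(2 * len(self.slots))
        self.count = 0
        for key, value in old:
            self._place(key, value)

    def put(self, key, value):
        if (self.count + 1) / len(self.slots) > MAX_LOAD:
            self._grow()
        self._place(key, value)

    def get(self, key):
        h = hash(key) & 0x7FFFFFFF
        for k in range(len(self.slots)):
            entry = self.slots[self._probe(h, k)]
            if entry is None:
                return None
            if entry[0] == key:
                return entry[1]
        return None

t = DoubleHashTable()
for i in range(20):
    t.put(i * 7, str(i))
print(len(t.slots))  # 47: the table grew from 11 to 23 to 47 slots
```

Each growth step re-inserts every stored entry, which is the cost the text warns about; starting with a larger size avoids those passes.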

Hash collision resolution strategy: Chaining

Chaining is another collision resolution strategy. In chaining, all elements that hash to the same slot are placed in a linked list.

With probing, when a collision occurs the next position in the probe sequence is tried, and with rehashing the hash is recalculated with the next hash function. Chaining instead uses an additional data structure to handle collisions: it maps each slot (bucket) of the hash table to a linked list. When a collision occurs, the colliding element is added to the bucket's list, so each bucket holds a linked list of the elements with the same hash.

Consider a hash table that contains 8 buckets. When a new element is added to the hash table, it is added to the bucket that corresponds to the hash of its key. If an element already exists at that location, the new element is added to the front of the list.

Adding an element with chaining involves a hash computation and a linked list operation, but it still takes constant time, that is, an asymptotic running time of O(1). For queries and deletions, the average time depends on the number of elements and the number of buckets. Specifically, the running time is O(n/m), where n is the total number of elements and m is the number of buckets. Hash table implementations almost always keep n = O(m), that is, the total number of elements is at most a constant multiple of the number of buckets, so O(n/m) is also a constant, O(1).
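A minimal chaining sketch in Python is shown below. The class and method names are invented for this illustration, and Python lists stand in for the linked lists (so inserting at the front is not truly O(1) here). Unlike the plain insert described above, put first scans the chain so that an existing key is overwritten rather than duplicated.

```python
class ChainedHashTable:
    def __init__(self, num_buckets: int = 8):
        # One chain per bucket; a Python list stands in for a linked list.
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.insert(0, (key, value))    # new elements go to the front

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

    def remove(self, key) -> bool:
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return True
        return False

table = ChainedHashTable(num_buckets=8)
for key in (1, 9, 17):   # all three hash to bucket 1 when there are 8 buckets
    table.put(key, f"value-{key}")
print([k for k, _ in table.buckets[1]])  # [17, 9, 1], newest first
```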

Design of hash function

A good hash function should (approximately) satisfy the assumption of simple uniform hashing: each key is equally likely to hash to any of the m slots, independently of where the other keys have hashed to. Unfortunately, it is usually not possible to check this condition, because the probability distribution from which the keys are drawn is rarely known, and the keys may not be drawn independently. In practice, heuristic techniques are often used to construct a hash function that performs well. For example, the design can take advantage of qualitative information about the distribution of the keys.

The division method and the multiplication method are heuristic, whereas universal hashing uses randomization to obtain good performance.

Division Hash Method (the division method)

A good approach derives the hash value in a way that is independent of any patterns that might exist in the data. For example, the division method divides a given key by a specific prime number and uses the remainder as the hash value of that key.

The division hash function can be expressed as:

hash(key) = key mod m

where key is the key being hashed, m is the size of the hash table, and mod is the remainder (modulo) operation. Assuming the chosen prime m is unrelated to any pattern in the distribution of the keys, this method often gives good results.
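As a small illustration in Python (the function name and the prime 701 are choices made for this example), the division method can be compared with the earlier "last four digits" hash, which is the same formula with m = 10000:

```python
def division_hash(key: int, m: int) -> int:
    """hash(key) = key mod m"""
    return key % m

alice, bob = 333331234, 444441234   # the SSNs 333-33-1234 and 444-44-1234

# m = 10000 keeps only the last four digits, so the shared suffix collides.
print(division_hash(alice, 10_000) == division_hash(bob, 10_000))  # True

# A prime m unrelated to the decimal pattern separates these two keys.
print(division_hash(alice, 701) == division_hash(bob, 701))  # False
```

The two keys differ by 111110000 = 2^4 * 5^4 * 41 * 271, which is not a multiple of 701, so their remainders modulo 701 cannot be equal.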

Multiplication Hash Method (the multiplication method)

The multiplication hash function can be expressed as:

hash(key) = floor(m * (A * key mod 1))

where floor rounds its argument down to the nearest integer, the constant A lies in the range 0 < A < 1, m is the size of the hash table, and mod is the remainder operation. The expression (A * key mod 1) multiplies the key by a number between 0 and 1 and takes the fractional part of the product; it is equivalent to A * key - floor(A * key).

One advantage of the multiplication method is that the value of m is not critical. It is usually chosen to be a power of 2, because that makes the hash function easy to implement on most computers.

Although this method works with any value of A, it works better with some values than with others, and the best choice depends on the characteristics of the data being hashed. Don Knuth suggests that A ≈ (√5 - 1)/2 = 0.6180339887..., the reciprocal of the golden ratio, is likely to work reasonably well.
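A direct floating-point transcription of the formula in Python might look like the sketch below. The function name is made up, and real implementations usually use fixed-point integer arithmetic rather than floats, so treat this only as an illustration of the arithmetic.

```python
import math

A = (math.sqrt(5) - 1) / 2   # about 0.6180339887

def multiplication_hash(key: int, m: int) -> int:
    """hash(key) = floor(m * (A * key mod 1))"""
    fractional = (A * key) % 1.0          # fractional part of A * key
    return math.floor(m * fractional)

m = 2 ** 14   # a power of two, as the text suggests
for key in (123456, 123457, 123458):
    print(key, "->", multiplication_hash(key, m))
```

Consecutive keys end up far apart in the table because multiplying by A scatters their fractional parts.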

Universal Hash Method (universal hashing)

When inserting elements into a hash table, if all the elements hash to the same bucket, the data is effectively stored as one linked list, and the average lookup time is Θ(n). Any fixed hash function is vulnerable to this worst case; the only effective remedy is to choose the hash function randomly, independently of the elements to be stored. This approach is called universal hashing.

The basic idea of universal hashing is to select the hash function at random from a family of hash functions at the beginning of execution. As in quicksort, randomization guarantees that no single input always triggers the worst-case behavior. Randomization also means the algorithm can behave differently on each execution, even for the same input. This ensures good average-case performance for any input.

hash_{a,b}(key) = ((a * key + b) mod p) mod m

where p is a prime large enough that every possible key lies in the range 0 to p-1, and m is the number of slots in the hash table. The parameters are chosen as a ∈ {1, 2, 3, ..., p-1} and b ∈ {0, 1, 2, ..., p-1}.
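A sketch of drawing one function from this family in Python follows. The prime P = 2^31 - 1 and the helper name make_universal_hash are assumptions for the example; keys are assumed to be integers in the range 0 to P-1.

```python
import random

P = 2_147_483_647   # the prime 2**31 - 1; keys are assumed to lie in 0..P-1

def make_universal_hash(m: int, rng: random.Random):
    """Pick a and b at random and return hash_{a,b} for a table of m slots."""
    a = rng.randrange(1, P)   # a in {1, ..., p-1}
    b = rng.randrange(0, P)   # b in {0, ..., p-1}

    def hash_ab(key: int) -> int:
        return ((a * key + b) % P) % m

    return hash_ab

rng = random.Random()          # unseeded: a different function on each run
h1 = make_universal_hash(16, rng)
h2 = make_universal_hash(16, rng)
print([h1(k) for k in range(8)])
print([h2(k) for k in range(8)])
```

Because a and b are drawn when the table is created, an adversary who fixes the keys in advance cannot pick them to collide under whichever function is drawn.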

Perfect Hash (perfect hashing)

Hashing can also be used to obtain excellent worst-case performance when the set of keys is static, that is, once the keys are stored the set never changes. A hashing technique is called perfect hashing if the number of memory accesses required for a search is O(1) in the worst case.

The basic idea for designing a perfect hashing scheme is to use two levels of hashing, with universal hashing at each level.

The first level is essentially the same as hashing with chaining: the n keys are hashed into m slots using a hash function h chosen at random from a universal family of hash functions.

However, instead of a linked list of the keys hashing to slot j, the scheme uses a small secondary hash table Sj with an associated hash function hj. By choosing the hash functions hj carefully, we can guarantee that no collisions occur at the second level.

If n keys are stored in a hash table of size m = n² using a hash function h chosen at random from a universal family of hash functions, then the probability of any collision occurring is less than 1/2.
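The bound of 1/2 follows from a short counting argument, sketched here. It assumes only the defining property of a universal family, namely that any two distinct keys collide with probability at most 1/m; X denotes the number of colliding pairs.

```latex
% X = number of colliding pairs among the n keys; table size m = n^2
\begin{align*}
\mathbb{E}[X] &\le \binom{n}{2}\cdot\frac{1}{m}
              = \frac{n(n-1)}{2}\cdot\frac{1}{n^{2}}
              < \frac{1}{2}, \\
\Pr[X \ge 1]  &\le \mathbb{E}[X] < \frac{1}{2}
              \qquad \text{(Markov's inequality)}.
\end{align*}
```

So a randomly chosen function is collision-free with probability greater than 1/2, and a few random draws are expected to suffice to find one.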

To guarantee that no collisions occur at the second level, the size mj of hash table Sj must be the square of the number nj of keys that hash to slot j. This quadratic dependence of mj on nj may seem to make the overall storage requirement excessive, but by choosing the first-level hash function appropriately, the expected total amount of storage used is still O(n).

If the number of slots m equals the number of keys n, the perfect hash function is called a minimal perfect hash function.
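The two-level construction can be sketched in Python as follows. This is an illustration under several assumptions rather than a reference implementation: keys are distinct non-negative integers below the prime P, the first level uses m = n slots, a first-level function is accepted once the total secondary space is at most 4n (an illustrative threshold), and the names random_hash, build and contains are made up.

```python
import random

P = 2_147_483_647   # prime 2**31 - 1; keys are assumed distinct and in 0..P-1

def random_hash(m: int, rng: random.Random):
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda key: ((a * key + b) % P) % m

def build(keys, seed=0):
    """Build a static two-level table for a non-empty list of distinct keys."""
    rng = random.Random(seed)
    n = len(keys)
    # First level: retry until the secondary tables need at most 4n slots.
    while True:
        h = random_hash(n, rng)
        groups = [[] for _ in range(n)]
        for key in keys:
            groups[h(key)].append(key)
        if sum(len(g) ** 2 for g in groups) <= 4 * n:
            break
    # Second level: table j has size n_j squared; retry h_j until collision-free.
    secondary = []
    for group in groups:
        size = len(group) ** 2
        while True:
            hj = random_hash(size, rng) if size else None
            slots = [None] * size
            collision = False
            for key in group:
                i = hj(key)
                if slots[i] is not None:
                    collision = True
                    break
                slots[i] = key
            if not collision:
                break
        secondary.append((hj, slots))
    return h, secondary

def contains(table, key) -> bool:
    """Look up a key with one first-level and one second-level hash."""
    h, secondary = table
    hj, slots = secondary[h(key)]
    return bool(slots) and slots[hj(key)] == key

keys = [10, 22, 37, 40, 52, 60, 70, 72, 75]
table = build(keys)
print(all(contains(table, k) for k in keys))  # True
print(contains(table, 11))                    # False
```

A lookup never probes more than one slot per level, which is the O(1) worst case the text describes.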

