Hashing (notes on Introduction to Algorithms)

Source: Internet
Author: User

Hash

Direct-address tables

A direct-address table is an array T[0 .. m-1] in which each position, or slot, corresponds to a key in the universe U. Slot k points to the element of the set whose key is k. If the set contains no element with key k, then T[k] = NIL.

For example, each key in the universe U = {0, 1, ..., 9} corresponds to an index in the table. The set K = {2, 3, 5, 8} of actual keys determines which slots contain pointers to elements; the other slots contain NIL.
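
As a minimal sketch of the table just described, the following Python class stores one slot per possible key. The class name `DirectAddressTable` and the use of `None` for NIL are assumptions made for illustration only.

```python
class DirectAddressTable:
    """Direct-address table over the universe {0, 1, ..., universe_size - 1}."""

    def __init__(self, universe_size):
        # One slot per possible key; None plays the role of NIL.
        self.slots = [None] * universe_size

    def insert(self, key, element):
        self.slots[key] = element

    def search(self, key):
        return self.slots[key]

    def delete(self, key):
        self.slots[key] = None


table = DirectAddressTable(10)      # U = {0, 1, ..., 9}
for key in (2, 3, 5, 8):            # K = {2, 3, 5, 8}
    table.insert(key, f"element-{key}")
```

Every operation is a single array access, but the array always has |U| slots no matter how few keys are stored.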

The drawback of direct addressing is obvious: if the universe U is large, storing a table T of size |U| may be impractical, or even impossible, given the memory available on a typical computer. Furthermore, the set K of keys actually stored may be so small relative to U that most of the space allocated for T is wasted. In this case, a hash table is the better choice.

Hash tables

With direct addressing, an element with key k is stored in slot k. With hashing, the element is stored in slot h(k); that is, a hash function h is used to compute the slot from the key k. Here h maps the universe U of keys into the slots of a hash table T[0 .. m-1]:

h : U → {0, 1, ..., m-1}

The size m of the hash table is typically much smaller than |U|. We say that an element with key k hashes to slot h(k).

There is one drawback: two keys may hash to the same slot. This situation is called a collision, and there are several methods for resolving collisions.

Collision resolution

In chaining, all elements that hash to the same slot are placed in a linked list. Slot j contains a pointer to the head of the list of all stored elements that hash to j; if there are no such elements, slot j contains NIL.

The list can be singly linked, but deletion is faster if it is doubly linked.
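
A rough sketch of a hash table with chaining might look like the following. It assumes integer keys and the hash function h(k) = k mod m, and it uses Python lists in place of real linked lists; the class and method names are hypothetical.

```python
class ChainedHashTable:
    """Hash table that resolves collisions by chaining."""

    def __init__(self, m):
        self.m = m
        # One chain per slot; an empty chain stands for NIL.
        self.slots = [[] for _ in range(m)]

    def _h(self, k):
        return k % self.m  # assumed hash function for this sketch

    def insert(self, k, element):
        chain = self.slots[self._h(k)]
        for i, (key, _) in enumerate(chain):
            if key == k:               # key already present: overwrite
                chain[i] = (k, element)
                return
        chain.append((k, element))

    def search(self, k):
        for key, element in self.slots[self._h(k)]:
            if key == k:
                return element
        return None

    def delete(self, k):
        j = self._h(k)
        self.slots[j] = [(key, e) for key, e in self.slots[j] if key != k]


t = ChainedHashTable(5)
for k in (3, 8, 13, 4):                # 3, 8 and 13 all hash to slot 3
    t.insert(k, str(k))
```

Searching and deleting scan one chain only, so their cost depends on the length of that chain rather than on the total number of stored elements.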

Analysis (searching for a key)

Given a hash table T with m slots that stores n elements, the load factor α of T is defined as n/m, that is, the average number of elements stored in a chain. α can be less than, equal to, or greater than 1.

The worst-case behavior of hashing with chaining is poor: all n keys hash to the same slot, creating a list of length n. The worst-case search time is then Θ(n) plus the time to compute the hash function, which is no better than storing all the elements in a single linked list.

The average performance of hashing depends on the chosen hash function h and on how evenly it distributes the set of keys among the m slots.

In the average case, a search for a key has two possible outcomes: it is successful or it is unsuccessful.

Under simple uniform hashing, any key k not already stored in the table is equally likely to hash to any of the m slots. The expected time to search unsuccessfully for k is therefore the expected time to search to the end of the list T[h(k)], whose expected length is α. An unsuccessful search thus examines α elements on average, and the total time required (including the time to compute h(k)) is Θ(1 + α).

A successful search also takes Θ(1 + α) time on average. For details, see page 146 of the Chinese edition of Introduction to Algorithms.

Summary

The analysis above means that if the number of slots in the hash table is at least proportional to the number of elements in the table (that is, as the number of hashed elements grows, the number of slots in T grows in the same proportion), then n = O(m) and α = n/m = O(m)/m = O(1), so a search takes constant time on average. If the number of elements grows while the number of slots stays fixed, n = O(m) no longer holds, and operations on the hash table no longer run in constant expected time.

Open addressing

In open addressing, all elements are stored in the hash table itself; that is, each table entry contains either an element of the dynamic set or NIL. To search for an element, we systematically examine table slots until we either find the element or determine that it is not in the table. Unlike chaining, there are no lists and no elements stored outside the table. Consequently, the table can fill up so that no further insertions can be made, and the load factor α = n/m can never exceed 1; that is, the number of stored elements never exceeds the number of slots.

Probing methods

Linear probing

Given an ordinary hash function h' : U → {0, 1, ..., m-1} (called the auxiliary hash function), linear probing uses the hash function:

h(k, i) = (h'(k) + i) mod m, i = 0, 1, ..., m-1

Given a key k, the first slot probed is T[h'(k)], the slot given by the auxiliary hash function. Next we probe T[h'(k) + 1], and so on up to T[m-1]; then we wrap around to T[0], T[1], ..., until we finally probe T[h'(k) - 1]. Because the initial probe position determines the entire sequence, linear probing uses only m distinct probe sequences.

Linear probing is easy to implement, but it suffers from a problem known as primary clustering. Over time, long runs of occupied slots build up and the average search time increases. Clusters arise easily because an empty slot preceded by i full slots is the next one filled with probability (i + 1)/m. Runs of occupied slots therefore tend to get longer, and the average search time grows with them.
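
To make the probe order concrete, here is a small sketch that lists the slots visited by linear probing. The auxiliary hash h'(k) = k mod m is an assumption made for illustration, and the function name is hypothetical.

```python
def linear_probe_sequence(k, m):
    """Slots probed for key k under h(k, i) = (h'(k) + i) mod m."""
    start = k % m  # assumed auxiliary hash h'(k) = k mod m
    return [(start + i) % m for i in range(m)]


sequence = linear_probe_sequence(13, 8)  # starts at slot 5, wraps to slot 0
```

Each sequence is a rotation of 0, 1, ..., m-1, which is why only m distinct sequences exist.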

Quadratic probing

Quadratic probing uses a hash function of the form:

h(k, i) = (h'(k) + c1·i + c2·i²) mod m

Here h' is an auxiliary hash function, c1 and c2 are auxiliary constants (not equal to 0), and i = 0, 1, ..., m-1. The initial position probed is T[h'(k)]; later positions are offset by amounts that depend quadratically on the probe number i. This method works much better than linear probing. However, if two keys have the same initial probe position, their probe sequences are identical, because h(k1, 0) = h(k2, 0) implies h(k1, i) = h(k2, i). This property leads to a milder form of clustering, called secondary clustering. As with linear probing, the initial probe determines the whole sequence, so only m distinct probe sequences are used.
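
The sketch below generates a quadratic probe sequence. The auxiliary hash h'(k) = k mod m and the sample constants c1 = 1, c2 = 3, m = 11 are assumptions chosen only for illustration; the function name is hypothetical.

```python
def quadratic_probe_sequence(k, m, c1, c2):
    """Slots probed for key k under h(k, i) = (h'(k) + c1*i + c2*i^2) mod m."""
    start = k % m  # assumed auxiliary hash h'(k) = k mod m
    return [(start + c1 * i + c2 * i * i) % m for i in range(m)]


seq_a = quadratic_probe_sequence(10, 11, 1, 3)
seq_b = quadratic_probe_sequence(21, 11, 1, 3)  # 21 mod 11 == 10 as well
```

Both keys start at slot 10 and therefore follow exactly the same sequence, which is the secondary clustering effect. Note that for arbitrary c1, c2 and m the sequence need not visit every slot.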

Double hashing

Double hashing is one of the best methods available for open addressing. It uses a hash function of the form:

h(k, i) = (h1(k) + i·h2(k)) mod m

where h1 and h2 are auxiliary hash functions. The initial probe position is T[h1(k)]; each later probe position is offset from the previous one by h2(k), modulo m.

For the entire hash table to be searched, the value h2(k) must be relatively prime to the table size m. One way to ensure this is to let m be a power of 2 and design h2 so that it always produces an odd number. Another way is to let m be prime and design h2 so that it always returns a positive integer less than m. For example, we could choose m prime and let m' be slightly less than m, as follows:

h1(k) = k mod m, h2(k) = 1 + (k mod m')

Double hashing uses Θ(m²) probe sequences, because each possible pair (h1(k), h2(k)) yields a distinct probe sequence.
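
As an illustration, the following sketch lists the probe sequence produced by h1(k) = k mod m and h2(k) = 1 + (k mod m'). The sample values m = 13 and m' = 11 are assumptions chosen for the example, and the function name is hypothetical.

```python
def double_hash_sequence(k, m, m_prime):
    """Slots probed for key k under h(k, i) = (h1(k) + i*h2(k)) mod m."""
    h1 = k % m
    h2 = 1 + (k % m_prime)  # always in 1..m_prime, so never 0
    return [(h1 + i * h2) % m for i in range(m)]


seq_14 = double_hash_sequence(14, 13, 11)  # h1 = 1, h2 = 4
seq_27 = double_hash_sequence(27, 13, 11)  # h1 = 1, h2 = 6
```

The two keys share the same first slot but diverge afterwards, which is what distinguishes double hashing from linear and quadratic probing. Because 13 is prime, each sequence visits every slot exactly once.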

Analysis

Compared with chaining, the advantage of open addressing is that it avoids pointers altogether; instead, the sequence of slots to examine is computed. The memory freed by not storing pointers provides more slots for the same amount of space, potentially yielding fewer collisions and faster retrieval.

Deleting an element from an open-address hash table is difficult. When we delete a key from slot i, we cannot simply store NIL there to mark the slot as empty. Suppose key k was inserted after its probe found slot i occupied, so that k was placed in a later slot of its probe sequence. If slot i were then set to NIL, a search for k would stop at slot i and k could no longer be retrieved. One solution is to store a special value DELETED in slot i instead of NIL. Searches pass over DELETED slots, while insertions treat them as empty. When DELETED is used, however, search times no longer depend on the load factor α. For this reason, chaining is the more common choice for resolving collisions in applications where keys must be deleted.
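
The deletion scheme can be sketched as follows. This is only an illustration under stated assumptions: integer keys, linear probing with h'(k) = k mod m, `None` for NIL, and a private sentinel object for DELETED. The class and method names are hypothetical.

```python
NIL = None
DELETED = object()  # sentinel marking a slot whose key was removed


class OpenAddressTable:
    def __init__(self, m):
        self.m = m
        self.slots = [NIL] * m

    def _probe(self, k):
        start = k % self.m  # assumed auxiliary hash, linear probing
        for i in range(self.m):
            yield (start + i) % self.m

    def insert(self, k):
        for j in self._probe(k):
            # Both NIL and DELETED slots may receive a new key.
            if self.slots[j] is NIL or self.slots[j] is DELETED:
                self.slots[j] = k
                return j
        raise OverflowError("hash table overflow")

    def search(self, k):
        for j in self._probe(k):
            if self.slots[j] is NIL:      # truly empty: k is absent
                return None
            if self.slots[j] is not DELETED and self.slots[j] == k:
                return j
        return None

    def delete(self, k):
        j = self.search(k)
        if j is not None:
            self.slots[j] = DELETED       # do not store NIL here


t = OpenAddressTable(7)
t.insert(3)    # lands in slot 3
t.insert(10)   # 10 mod 7 == 3 is occupied, so it lands in slot 4
t.delete(3)    # slot 3 becomes DELETED, not NIL
```

After the deletion, a search for 10 still passes over slot 3 and finds the key in slot 4; had slot 3 been reset to NIL, the search would have stopped there.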

Given an open-address hash table with load factor α = n/m < 1, and assuming uniform hashing, the expected number of probes in an unsuccessful search is at most 1/(1-α). For details, see p. 155.

The bound 1/(1-α) = 1 + α + α² + α³ + ... has an intuitive interpretation. The first probe is always made. With probability approximately α, the first probe finds an occupied slot, so a second probe is needed. With probability approximately α², the first two slots probed are both occupied, so a third probe is needed, and so on.

If α is a constant, an unsuccessful search runs in O(1) time. For example, if the hash table is half full, the average number of probes in an unsuccessful search is at most 1/(1-0.5) = 2. If it is 90% full, the average number of probes is at most 1/(1-0.9) = 10.

Assuming uniform hashing, inserting an element into an open-address hash table with load factor α requires at most 1/(1-α) probes on average. Inserting a key requires an unsuccessful search followed by placing the key in the first empty slot found, so the expected number of probes for an insertion is the same as for an unsuccessful search.

For an open-address hash table with load factor α < 1, the expected number of probes in a successful search is at most (1/α) ln(1/(1-α)). For details, see p. 155. If the hash table is half full, the expected number of probes in a successful search is less than 1.39. If the hash table is 90% full, the expected number of probes is less than 2.56.
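
The figures quoted above can be checked with a few lines of Python. The function names are hypothetical; the formulas are the two bounds stated in the text.

```python
import math


def unsuccessful_search_bound(alpha):
    """Upper bound 1/(1 - alpha) on expected probes, unsuccessful search."""
    return 1.0 / (1.0 - alpha)


def successful_search_bound(alpha):
    """Upper bound (1/alpha) * ln(1/(1 - alpha)), successful search."""
    return (1.0 / alpha) * math.log(1.0 / (1.0 - alpha))


for alpha in (0.5, 0.9):
    print(alpha,
          round(unsuccessful_search_bound(alpha), 3),
          round(successful_search_bound(alpha), 3))
```

The gap between the two bounds widens quickly as α approaches 1, which is why open-address tables are usually kept well below full.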

Hash functions

Division method

In the division method, a key k is mapped into one of m slots by taking the remainder of k divided by m. That is, the hash function is:

h(k) = k mod m

When using the division method, certain values of m should be avoided. A good choice for m is usually a prime that is not too close to an exact power of 2.
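
The following sketch shows the division method and one reason a power of 2 is a poor choice for m: h(k) then depends only on the low-order bits of k. The sample keys are assumptions picked to make the effect visible, and the function name is hypothetical.

```python
def division_hash(k, m):
    """Division method: h(k) = k mod m."""
    return k % m


keys = (8, 16, 24, 32)                                  # multiples of 8
slots_power_of_two = {division_hash(k, 8) for k in keys}
slots_prime = {division_hash(k, 7) for k in keys}
```

With m = 8 every sample key lands in slot 0, while the prime m = 7 spreads them over four different slots.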

Multiplication method

The multiplication method operates in two steps. First, multiply the key k by a constant A in the range 0 < A < 1 and extract the fractional part of kA. Then multiply this value by m and round the result down. The hash function is:

h(k) = floor(m (kA mod 1))

Here floor() rounds down to an integer, and "kA mod 1" denotes the fractional part of kA. One advantage of the multiplication method is that the choice of m is not critical; it is typically a power of 2 (m = 2^p for some integer p).

For example, h(k) = (A·k mod 2^w) rsh (w - r), where w is the word size of the computer (32 or 64 bits), m = 2^r is the number of slots, rsh (w - r) means a right shift by w - r bits, and A is an odd integer with 2^(w-1) < A < 2^w.
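
Both forms of the multiplication method can be sketched as below. The default constant A = (√5 - 1)/2 is an assumption (a commonly suggested value, not one given in these notes), and the small word size w = 7 in the usage lines is chosen only to keep the numbers readable. Function names are hypothetical.

```python
import math


def multiplication_hash(k, m, A=(math.sqrt(5) - 1) / 2):
    """h(k) = floor(m * (k*A mod 1)) using floating-point arithmetic."""
    fractional_part = (k * A) % 1.0
    return math.floor(m * fractional_part)


def multiply_shift_hash(k, A, w, r):
    """h(k) = (A*k mod 2^w) rsh (w - r), for a table with m = 2^r slots."""
    return ((A * k) % (1 << w)) >> (w - r)


slot_float = multiplication_hash(123456, 2 ** 14)
slot_shift = multiply_shift_hash(107, 89, 7, 3)  # A = 89 is odd, 64 < 89 < 128
```

The shift form needs only one multiplication, one mask and one shift, which is why it is convenient when m is a power of 2.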

Universal hashing

Any fixed hash function is vulnerable to worst-case behavior in which n keys all hash to the same slot, giving an average retrieval time of Θ(n). The only effective way to improve the situation is to choose the hash function randomly, in a way that is independent of the keys actually being stored. This approach is called universal hashing.

The basic idea of universal hashing is to select the hash function at random from a carefully designed class of functions at the beginning of execution. Randomization guarantees that no single input always evokes worst-case behavior. Because of the randomization, the algorithm can behave differently on each execution even for the same input, which guarantees good average-case performance for any input.

Let H be a finite collection of hash functions that map a given universe U of keys into the range {0, 1, ..., m-1}. Such a collection is said to be universal if, for each pair of distinct keys k, l ∈ U, the number of hash functions h ∈ H for which h(k) = h(l) is at most |H|/m. In other words, with a hash function chosen at random from H, the chance of a collision between distinct keys k and l is no more than 1/m, which is exactly the chance of a collision if h(k) and h(l) were chosen randomly and independently from the set {0, 1, ..., m-1}.

Suppose a hash function h is chosen from a universal collection of hash functions and is used to hash n keys into a table T of size m, using chaining to resolve collisions. If key k is not in the table, then the expected length E[n_h(k)] of the list that k hashes to is at most α. If key k is in the table, then the expected length E[n_h(k)] of the list containing k is at most 1 + α.

Using universal hashing and collision resolution by chaining in a table with m slots, it takes expected time Θ(n) to handle any sequence of n insert, search, and delete operations containing O(m) insert operations.

Designing a universal class of hash functions

1. Choose a prime number m.

2. Decompose the key k into r + 1 digits in base m: k = <k0, k1, k2, ..., kr>, where 0 <= ki <= m-1.

3. Pick a vector a = <a0, a1, ..., ar>, where each ai is chosen at random from {0, 1, ..., m-1}.

4. Define h_a(k) = (a0·k0 + a1·k1 + ... + ar·kr) mod m.
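
The four steps above can be sketched in Python as follows. The helper names are hypothetical, and storing the base-m digits least significant first is an assumption, since the notes do not fix a digit order.

```python
import itertools
import random


def base_m_digits(k, m, r):
    """Split k into r + 1 base-m digits <k0, ..., kr>, least significant first."""
    digits = []
    for _ in range(r + 1):
        digits.append(k % m)
        k //= m
    return digits


def make_hash(a, m):
    """Return h_a for the coefficient vector a = <a0, ..., ar> and prime m."""
    r = len(a) - 1

    def h(k):
        digits = base_m_digits(k, m, r)
        return sum(ai * ki for ai, ki in zip(a, digits)) % m

    return h


def random_hash(m, r, rng=random):
    """Pick h_a uniformly at random from the class (m prime)."""
    a = [rng.randrange(m) for _ in range(r + 1)]
    return make_hash(a, m)


h = random_hash(5, 1, random.Random(0))  # keys 0..24 have two base-5 digits
```

For small parameters the universal property can be checked exhaustively: with m = 5 and r = 1 there are 25 functions in the class, and any two distinct keys below 25 collide under exactly 25/5 = 5 of them.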

Perfect hashing

A perfect hashing scheme can be designed using two levels of hashing, with universal hashing at each level.

The outer hash function is h(k) = ((a·k + b) mod p) mod m. The secondary hash table S_j stores all keys that hash to slot j; its size is m_j = n_j², and its hash function is h_j(k) = ((a_j·k + b_j) mod p) mod m_j.

The first level is essentially the same as hashing with chaining: the n keys are hashed into m slots using a hash function h carefully selected from a universal class.

Instead of building a linked list of the keys that hash to slot j, however, we use a small secondary hash table S_j with an associated hash function h_j. By choosing h_j carefully, we can guarantee that there are no collisions at the second level.

To guarantee that there are no collisions at the second level, the size m_j of hash table S_j must be the square of the number n_j of keys that hash to slot j. Although this quadratic dependence of m_j on n_j may seem to make the overall storage requirement excessive, choosing the first-level hash function well limits the expected total storage to O(n).

Probability of collisions at the second level

If n keys are stored in a hash table of size m = n² using a hash function h randomly chosen from a universal class of hash functions, the probability of there being any collision in the table is less than 1/2.

The meaning of this theorem: a hash function h chosen at random from H is more likely than not to have no collisions. Given the set K of n keys to be hashed, only a few random trials are needed to find a collision-free hash function h.
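
The bound of 1/2 follows from a short counting argument, sketched here in LaTeX. The symbol X, the number of colliding pairs, is introduced only for this derivation. There are n(n-1)/2 pairs of keys, and under a universal class each pair collides with probability at most 1/m, where m = n². Markov's inequality then bounds the chance of at least one collision.

```latex
\begin{align*}
E[X] &\le \binom{n}{2} \cdot \frac{1}{m}
      = \frac{n(n-1)}{2} \cdot \frac{1}{n^{2}}
      = \frac{n-1}{2n} < \frac{1}{2}, \\
\Pr\{X \ge 1\} &\le \frac{E[X]}{1} < \frac{1}{2}.
\end{align*}
```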

Space required for perfect hashing

If n keys are stored in a hash table of size m = n using a hash function h randomly chosen from a universal class of hash functions, and the size of each secondary hash table is set to m_j = n_j², then the expected total storage required for all secondary hash tables in a perfect hashing scheme is less than 2n.
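
Putting the pieces together, a two-level table for a fixed set of keys might be built as in the sketch below. It assumes distinct non-negative integer keys that are all smaller than a prime p supplied by the caller, and it simply retries random (a_j, b_j) pairs until a secondary table has no collision. The function names and the tuple layout are hypothetical.

```python
import random


def build_perfect_hash(keys, p, rng):
    """Build a two-level table for distinct integer keys, all less than prime p."""
    m = len(keys)                      # first level uses m = n slots
    a, b = rng.randrange(1, p), rng.randrange(p)
    buckets = [[] for _ in range(m)]
    for k in keys:
        buckets[((a * k + b) % p) % m].append(k)

    secondary = []
    for bucket in buckets:
        m_j = len(bucket) ** 2         # m_j = n_j^2
        while True:                    # retry until S_j has no collision
            a_j, b_j = rng.randrange(1, p), rng.randrange(p)
            table = [None] * m_j
            collision = False
            for k in bucket:
                slot = ((a_j * k + b_j) % p) % m_j
                if table[slot] is not None:
                    collision = True
                    break
                table[slot] = k
            if not collision:
                break
        secondary.append((a_j, b_j, table))
    return (a, b, p, m, secondary)


def contains(perfect_hash, k):
    """Look up k with one first-level and one second-level probe."""
    a, b, p, m, secondary = perfect_hash
    if m == 0:
        return False
    a_j, b_j, table = secondary[((a * k + b) % p) % m]
    if not table:                      # empty secondary table
        return False
    return table[((a_j * k + b_j) % p) % len(table)] == k


keys = [10, 22, 37, 40, 60, 70, 75]
ph = build_perfect_hash(keys, 101, random.Random(1))
total_secondary_slots = sum(len(table) for _, _, table in ph[4])
```

Because each retry succeeds with probability greater than 1/2, only a few attempts per slot are expected. The value `total_secondary_slots` varies with the random first-level function; the bound of 2n applies to its expectation, not to every individual run.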

 
