Vernacular Algorithms (6): Hash Tables, from Theory to Practice

Source: Internet
Author: User

Is there any way to handle collisions other than chaining? I was almost afraid to ask. Chaining is so natural and direct that it is hard to believe other (let alone better) methods exist. Yet technical progress is driven by people who dare to ask questions that seem even more naive than a layman's, who have the imagination to see new possibilities, and who can test them with scientific methods.
If no linked lists are needed, the space that would have held the list pointers can be spent on extra empty slots instead, which lowers the chance of a collision and speeds up lookups.

Use the open addressing method to handle collisions

There is no linked list or any other auxiliary data structure, just one array. What happens when a collision occurs? The only answer is to find another empty slot. This is open addressing. But isn't that irresponsible? Imagine a train with a single carriage and numbered seats. The passenger holding the ticket for seat 7 boards and sits down. A little later, another passenger boards with a forged ticket, also for seat 7. What now? The conductor thinks it over and sends the passenger with the forged ticket to seat 8. Later still, the rightful holder of seat 8 boards. The conductor tells him that seat 8 is taken and sends him to seat 9. Seat 9 is taken too? And seat 10? Then sit in seat 11. As the number of empty seats shrinks, collisions become more likely and an empty seat becomes harder to find. But what if the train is only 50% full or less? The rightful holder of seat 8 may never board at all, in which case giving seat 8 to the passenger with the forged ticket is a good strategy. So this is a game of trading space for time.

The key to playing this game well is to scatter the passengers throughout the carriage. How? By using different probe sequences for different passengers. For example, for passenger A, probe seats 7, 8, ... until a vacant one is found; for passenger B, probe a different series of seats until a vacant one is found. With m seats, each passenger can use any of the m! permutations of <0, 1, 2, ..., m-1> as a probe sequence. Clearly, the fewer passengers share the same probe sequence, the better. In other words, we want to scatter the passengers across the m! probe sequences. Ideally, every boarding passenger is equally likely to use any of the m! probe sequences. This is called uniform hashing.
(The word "random" is avoided here because a probe sequence cannot be chosen at random: the same probe sequence must be used again when looking the passenger up.)
True uniform hashing is hard to implement, so in practice approximations are used. The common ways to generate probe sequences are linear probing, quadratic probing, and double hashing. None of them achieves uniform hashing, because none can produce more than m^2 distinct probe sequences (uniform hashing requires m!). Of the three, double hashing produces the most probe sequences and therefore gives the best results. (Note: the .NET Framework Hashtable uses double hashing.)
In the previous article we implemented a function h(k) whose job is to map a value k to an array address (scattered as widely as possible). This time, with open addressing, we need a function h(k, i) that maps a value k to a sequence of addresses: the first address is h(k, 0), the second is h(k, 1), and so on. The addresses in the sequence should be scattered as widely as possible.
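To make the idea of a probe sequence concrete, here is a minimal sketch that enumerates h(k, 0), h(k, 1), ..., h(k, m - 1) for any probe function passed in. The names ProbeSequences, Of, and StepByOne are hypothetical and not from the original article; StepByOne is only a placeholder probe function (start at k mod 10 and move one slot at a time), which is the simplest choice one could make.

```csharp
using System;
using System.Collections.Generic;

public static class ProbeSequences
{
    // Enumerates h(key, 0), h(key, 1), ..., h(key, slotCount - 1).
    public static List<int> Of(Func<int, int, int> h, int key, int slotCount)
    {
        var sequence = new List<int>(slotCount);
        for (int i = 0; i < slotCount; i++)
            sequence.Add(h(key, i));
        return sequence;
    }

    // Placeholder probe function (an assumption for illustration):
    // start at key mod 10 and step forward by one slot, wrapping around.
    public static int StepByOne(int key, int i)
    {
        return (key % 10 + i) % 10;
    }
}
```

Calling ProbeSequences.Of(ProbeSequences.StepByOne, 7, 10) yields the ten addresses 7, 8, 9, 0, 1, ..., 6, that is, a permutation of all slots that depends only on the starting address.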

Linear probing

The class IntSet1 below uses 10 slots to store int values from 0 to int.MaxValue (but cannot handle collisions):

public class IntSet1
{
    private object[] _values = new object[10];

    // Maps a value to one of the 10 slots.
    private int H(int value)
    {
        return value % 10;
    }

    // Overwrites whatever is already in the slot: collisions are not handled.
    public void Add(int item)
    {
        _values[H(item)] = item;
    }

    public void Remove(int item)
    {
        _values[H(item)] = null;
    }

    public bool Contains(int item)
    {
        if (_values[H(item)] == null)
            return false;
        else
            return (int)_values[H(item)] == item;
    }
}

How can we change it to handle collisions with open addressing? The simplest approach: if _values[8] is occupied, check whether _values[9] is empty; if _values[9] is also occupied, check _values[0]. In full: first use H() to obtain the initial address for k. If that address is occupied, probe the address right after it; if that one is occupied too, probe the one after that, wrapping around to the start of the array on reaching the end. If no empty slot has been found after m probes, the array is full. This is linear probing. The implementation:

using System;

public class IntSet2
{
    private object[] _values = new object[10];

    private int H(int value)
    {
        return value % 10;
    }

    // Linear probe function: the i-th address in the probe sequence of value.
    private int LH(int value, int i)
    {
        return (H(value) + i) % 10;
    }

    public void Add(int item)
    {
        int i = 0; // number of slots probed so far
        do
        {
            int j = LH(item, i); // address to probe
            if (_values[j] == null)
            {
                _values[j] = item;
                return;
            }
            else
            {
                i += 1;
            }
        } while (i < 10); // at most one probe per slot
        throw new Exception("set overflow");
    }

    public bool Contains(int item)
    {
        int i = 0; // number of slots probed so far
        int j = 0; // address to probe
        do
        {
            j = LH(item, i);
            if (_values[j] == null) return false;
            if ((int)_values[j] == item)
                return true;
            else
                i += 1;
        } while (i < 10);
        return false;
    }

    public void Remove(int item)
    {
        // Not easy to handle; discussed below.
    }
}

Add() first probes LH(value, 0), which equals H(value). On a collision it goes on to probe LH(value, 1), the address right after H(value). The "... % 10" makes the slot after the last slot of the array wrap around to the first slot. Contains() follows the same probe sequence as Add(): if the item is found, it returns true; if it reaches null, the item is not in the array.
The trouble is Remove(). You cannot simply set the slot being deleted to null, because that breaks Contains(). For example, add 3, 13, and 23 to IntSet2 in that order, so that _values[3] = 3, _values[4] = 13, and _values[5] = 23. Then call Remove(13), which executes _values[4] = null. Now Contains(23) checks _values[3], _values[4], and _values[5] in turn until it finds 23 or meets null; since _values[4] has been set to null, Contains(23) returns false. One fix is for Remove(13) to set _values[4] to a special value (for example, -1) instead of null. Then Contains(23) no longer returns false because of a null in _values[4]. In addition, Add() treats both null and -1 as an empty slot. The modified code is as follows:

using System;

public class IntSet2
{
    private object[] _values = new object[10];
    private readonly int DELETED = -1; // marks a removed slot; stored values must be >= 0

    private int H(int value)
    {
        return value % 10;
    }

    private int LH(int value, int i)
    {
        return (H(value) + i) % 10;
    }

    public void Add(int item)
    {
        int i = 0; // number of slots probed so far
        do
        {
            int j = LH(item, i); // address to probe
            // Both never-used (null) and removed (DELETED) slots can be reused.
            if (_values[j] == null || (int)_values[j] == DELETED)
            {
                _values[j] = item;
                return;
            }
            else
            {
                i += 1;
            }
        } while (i < 10);
        throw new Exception("set overflow");
    }

    public bool Contains(int item)
    {
        int i = 0; // number of slots probed so far
        int j = 0; // address to probe
        do
        {
            j = LH(item, i);
            if (_values[j] == null) return false; // DELETED does not stop the search
            if ((int)_values[j] == item)
                return true;
            else
                i += 1;
        } while (i < 10);
        return false;
    }

    public void Remove(int item)
    {
        int i = 0; // number of slots probed so far
        int j = 0; // address to probe
        do
        {
            j = LH(item, i);
            if (_values[j] == null) return;
            if ((int)_values[j] == item)
            {
                _values[j] = DELETED;
                return;
            }
            else
            {
                i += 1;
            }
        } while (i < 10);
    }
}

However, this way of implementing Remove() has a serious problem. Imagine adding 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9 in order, then removing 0, 1, 2, 3, 4, 5, 6, 7, and 8, and then calling Contains(0). The function checks _values[0], _values[1], ..., _values[9], every slot in the array, which is totally unacceptable! Keep this problem in mind; the next article discusses how to solve it.
Although linear probing is easy to implement, it suffers from a problem called primary clustering. As in the example at the start of this article, if seats 7, 8, and 9 are occupied, the next passenger to board is sent to seat 10 whether his ticket says 7, 8, 9, or 10; the passenger after that is sent to seat 11 whether his ticket says 7, 8, 9, 10, or 11; and so on. If an empty slot is preceded by i consecutive occupied slots, the probability that it is the next slot filled is (i + 1)/m. It is like a blood clot: once a blockage forms, it only gets worse. Linear probing therefore tends to build long runs of occupied slots, which slows down Contains().
With linear probing, the entire probe sequence is determined by the initial position LH(k, 0) = H(k), so there are only m distinct probe sequences.
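The cost described above can be measured with a small instrumented set. This is only a sketch under stated assumptions: CountingIntSet and LastProbeCount are hypothetical names, the table is fixed at 10 slots like IntSet2, int?[] is used instead of object[] to avoid casts, and -1 is the deletion marker, so stored values must be non-negative.

```csharp
using System;

// Minimal linear-probing set with deletion markers, instrumented to count
// how many slots Contains() inspects.
public class CountingIntSet
{
    private const int Deleted = -1;                  // deletion marker; values must be >= 0
    private readonly int?[] _values = new int?[10];

    // Number of slots inspected by the most recent Contains() call.
    public int LastProbeCount { get; private set; }

    private int LH(int value, int i)
    {
        return (value % 10 + i) % 10;
    }

    public void Add(int item)
    {
        for (int i = 0; i < _values.Length; i++)
        {
            int j = LH(item, i);
            if (_values[j] == null || _values[j] == Deleted)
            {
                _values[j] = item;
                return;
            }
        }
        throw new InvalidOperationException("set overflow");
    }

    public void Remove(int item)
    {
        for (int i = 0; i < _values.Length; i++)
        {
            int j = LH(item, i);
            if (_values[j] == null) return;
            if (_values[j] == item) { _values[j] = Deleted; return; }
        }
    }

    public bool Contains(int item)
    {
        LastProbeCount = 0;
        for (int i = 0; i < _values.Length; i++)
        {
            LastProbeCount++;
            int j = LH(item, i);
            if (_values[j] == null) return false;   // only a never-used slot stops the search
            if (_values[j] == item) return true;
        }
        return false;
    }
}
```

After adding 0 through 9 and removing 0 through 8, Contains(0) returns false only after inspecting all 10 slots, because no slot is null any more and deletion markers never end a search.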
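The (i + 1)/m figure can be checked with a short sketch. It assumes that the next key's initial position is equally likely to be any of the m slots; ClusterDemo and FillWeight are hypothetical names introduced for illustration. FillWeight returns the numerator i + 1 for a given empty slot, so dividing by m gives the probability that this slot is the next one filled.

```csharp
public static class ClusterDemo
{
    // For an empty slot, counts the i consecutive occupied slots immediately
    // before it (wrapping around the array) and returns i + 1.
    // Any initial position inside that run, or the slot itself, ends up here.
    public static int FillWeight(bool[] occupied, int slot)
    {
        int m = occupied.Length;
        int run = 0;
        int j = (slot - 1 + m) % m;
        while (run < m - 1 && occupied[j])
        {
            run++;
            j = (j - 1 + m) % m;
        }
        return run + 1;
    }
}
```

For example, with 10 slots of which 3 through 7 are occupied, slot 8 has weight 6 (probability 6/10), while each of the other empty slots 9, 0, 1, and 2 has weight 1 (probability 1/10). The weights add up to 10, and the slot at the end of the cluster is six times as likely to be filled as any other.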

Quadratic probing

On a collision, instead of probing the adjacent slot as linear probing does, probe with a larger offset to relieve clustering. This is quadratic probing:
h(k, i) = (h'(k) + c1*i + c2*i^2) mod m
where c1 and c2 are nonzero constants. For example, with c1 = c2 = 1, the quadratic probing function is:

private int QH(int value, int i)
{
    return (H(value) + i + i * i) % 10;
}

For the value 7, QH() gives the probe sequence 7, 9, 3, 9, ... Because the initial position QH(k, 0) = H(k) determines the entire probe sequence, quadratic probing also has only m distinct probe sequences. Offsetting each probe by a quadratic function of i makes occupied slots less likely to merge into one long run than with linear probing. However, keys with the same initial position still share an identical probe sequence, so occupied slots gather into small scattered groups. This milder form of clustering is called secondary clustering.
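The sequence quoted above can be reproduced with a small standalone sketch. QuadraticDemo, Sequence, and DistinctSlots are hypothetical names; QH() follows the article's function with c1 = c2 = 1 and 10 slots, with H(value) written out as value % 10 so that the block is self-contained.

```csharp
using System.Collections.Generic;

public static class QuadraticDemo
{
    // Quadratic probe function with c1 = c2 = 1 and m = 10.
    public static int QH(int value, int i)
    {
        return (value % 10 + i + i * i) % 10;
    }

    // First `count` addresses of the probe sequence for value.
    public static List<int> Sequence(int value, int count)
    {
        var result = new List<int>(count);
        for (int i = 0; i < count; i++)
            result.Add(QH(value, i));
        return result;
    }

    // Number of different slots reached in the first 10 probes.
    public static int DistinctSlots(int value)
    {
        return new HashSet<int>(Sequence(value, 10)).Count;
    }
}
```

QuadraticDemo.Sequence(7, 5) gives 7, 9, 3, 9, 7. Note that with these particular constants and m = 10, DistinctSlots(7) is only 3 (slots 7, 9, and 3), so this choice of c1, c2, and m does not reach every slot. It serves to illustrate the formula rather than as a recommended configuration.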

Double hashing

The main cause of clustering in linear and quadratic probing is that the entire probe sequence is the same whenever the initial probe position is the same. Once a collision happens, things only get worse. Why is the whole sequence the same when the initial position is the same? Because both methods offset each subsequent probe from the initial position, H(k), and that offset, whether linear or quadratic, is a function of i alone, whereas what differs between keys is k. So we must find a way to make the offset depend on k. Taking linear probing as an example, we need LH(k, i) to be a function of k and i, not merely of H(k) and i. Let's try it, starting from linear probing:
H(k) = k % 10
LH(k, i) = (H(k) + i) % 10
First, try multiplying i by k:
H(k) = k % 10
LH(k, i) = (H(k) + i * k) % 10
Does this work? Unfortunately,
LH(k, i) = (H(k) + i * k) % 10
= (H(k) + i * (k % 10)) % 10
= (H(k) + i * H(k)) % 10
= (H(k) * (1 + i)) % 10
so LH(k, i) is still a function of H(k) and i only.
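The algebra for the multiplication attempt can be confirmed numerically. In this sketch (MultiplyAttempt and Sequence are hypothetical names), the keys 7 and 17 share H(k) = 7, so if LH(k, i) really depends only on H(k) and i, their probe sequences must be identical.

```csharp
public static class MultiplyAttempt
{
    public static int H(int k)
    {
        return k % 10;
    }

    // The attempted probe function: offset the initial position by i * k.
    public static int LH(int k, int i)
    {
        return (H(k) + i * k) % 10;
    }

    // The first 10 addresses of the probe sequence for k.
    public static int[] Sequence(int k)
    {
        var result = new int[10];
        for (int i = 0; i < result.Length; i++)
            result[i] = LH(k, i);
        return result;
    }
}
```

Both Sequence(7) and Sequence(17) begin 7, 4, 1, 8 and agree at every position, so multiplying by k did not separate keys that collide on H(k).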
Next, try adding k to i:
H(k) = k % 10
LH(k, i) = (H(k) + i + k) % 10
How about this?
LH(k, i) = (H(k) + i + k) % 10
= (H(k) + i + k % 10) % 10
= (H(k) + i + H(k)) % 10
= (2 * H(k) + i) % 10
Unfortunately, LH(k, i) is still a function of H(k) and i. It seems nothing can be done unless H(k) is changed to a multiplicative hash, or we use double hashing:
h(k, i) = (h1(k) + i * h2(k)) mod m
where h1(k) and h2(k) are two different hash functions. For example:
h1(k) = k mod 13
h2(k) = k mod 11
h(k, i) = (h1(k) + i * h2(k)) mod 10
With these, the probe sequence produced by h(7, i) is 7, 4, 1, 8, 5, ...
and the probe sequence produced by h(20, i) is 7, 6, 5, 4, 3, ...
At last: the initial probe positions are the same, but the subsequent positions differ.
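The two sequences above can be reproduced with a few lines of code. DoubleHashDemo and Sequence are hypothetical names; H1, H2, and the table size of 10 are taken directly from the example, and keys are assumed to be non-negative.

```csharp
using System.Collections.Generic;

public static class DoubleHashDemo
{
    private const int M = 10; // number of slots in the example

    public static int H1(int k) { return k % 13; }
    public static int H2(int k) { return k % 11; }

    // Double hashing: the step size H2(k) depends on the key itself.
    public static int DH(int k, int i)
    {
        return (H1(k) + i * H2(k)) % M;
    }

    // First `count` addresses of the probe sequence for k.
    public static List<int> Sequence(int k, int count)
    {
        var result = new List<int>(count);
        for (int i = 0; i < count; i++)
            result.Add(DH(k, i));
        return result;
    }
}
```

Sequence(7, 5) gives 7, 4, 1, 8, 5 and Sequence(20, 5) gives 7, 6, 5, 4, 3: both keys start at slot 7, but key 7 steps by 7 while key 20 steps by 9.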
h2(k) must be designed with care; otherwise the probe sequence may not reach every empty slot. With the h(k, i) above, the probe sequence of h(6, i) is 6, 2, 8, 4, 0, 6, 2, 8, 4, 0, ... If slots 6, 2, 8, 4, and 0 are all occupied, the program throws a "set overflow" exception even though empty slots remain. To avoid this, h2(k) and m must be coprime. Let's see why the probe sequence cannot reach every slot when h2(k) and m are not coprime. For example, h2(6) = 6 and 10 share the common divisor 2. Substituting into h(k, i):
h(6, i) = (h1(6) + i * h2(6)) mod 10
= (6 + i * 6) mod 10
= (6 + (i * 6) mod 10) mod 10
= (6 + 2 * ((i * 3) mod 5)) mod 10
Since (i * 3) mod 5 takes only five distinct values, h(6, i) takes only five distinct values. Likewise, h(16, i) = (3 + 5 * (i mod 2)) mod 10 takes only two values, which is really bad.
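The effect of a common divisor can be checked by counting the slots a probe sequence actually visits. This is a sketch with hypothetical names (CoprimeDemo, Gcd, DistinctSlots); it relies on the standard number-theory fact that stepping by s modulo m visits exactly m / gcd(s, m) distinct positions.

```csharp
using System.Collections.Generic;

public static class CoprimeDemo
{
    // Greatest common divisor by Euclid's algorithm.
    public static int Gcd(int a, int b)
    {
        while (b != 0)
        {
            int t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    // Number of different slots visited by (start + i * step) mod m, i = 0..m-1.
    public static int DistinctSlots(int start, int step, int m)
    {
        var seen = new HashSet<int>();
        for (int i = 0; i < m; i++)
            seen.Add((start + i * step) % m);
        return seen.Count;
    }
}
```

DistinctSlots(6, 6, 10) is 5 and DistinctSlots(3, 5, 10) is 2, matching the two examples above, while DistinctSlots(7, 7, 10) is 10 because 7 and 10 are coprime. In each case the count equals 10 / Gcd(step, 10).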
There are two ways to guarantee that h2(k) and m are coprime. One is to make m a power of 2 and design h2(k) so that it always produces an odd number, relying on the fact that an odd number and a power of 2 are always coprime. The other is to make m a prime and design h2(k) so that it always produces a positive integer smaller than m. The latter can be implemented like this: first use the GetPrime() function from the previous article to obtain a suitable prime as m, and then let
h1(k) = k mod m
h2(k) = 1 + (k mod (m - 1))
The 1 is added to (k mod (m - 1)) so that h2(k) is never 0. If h2(k) were 0, i would have no effect: once h1(k) collides, no other slot could ever be reached.
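Both strategies can be checked with a short sketch that tests whether the first m probes reach every slot. CoverageDemo and its methods are hypothetical names. The prime-table functions follow the formulas above; for the power-of-two table, the article gives no concrete h2(k), so ((k / m) % m) | 1 is used purely as one assumed way to force an odd step. Keys are assumed to be non-negative.

```csharp
public static class CoverageDemo
{
    // True if (h1 + i * h2) mod m, for i = 0..m-1, visits every slot once.
    public static bool CoversAllSlots(int h1, int h2, int m)
    {
        var seen = new bool[m];
        int count = 0;
        for (int i = 0; i < m; i++)
        {
            int j = (h1 + i * h2) % m;
            if (!seen[j])
            {
                seen[j] = true;
                count++;
            }
        }
        return count == m;
    }

    // m prime: h1(k) = k mod m, h2(k) = 1 + (k mod (m - 1)).
    public static bool PrimeTableCovers(int key, int m)
    {
        return CoversAllSlots(key % m, 1 + key % (m - 1), m);
    }

    // m a power of 2: an assumed h2(k) that is always odd.
    public static bool PowerOfTwoTableCovers(int key, int m)
    {
        return CoversAllSlots(key % m, ((key / m) % m) | 1, m);
    }
}
```

For m = 11, PrimeTableCovers returns true for every non-negative key, and for m = 16 so does PowerOfTwoTableCovers. By contrast, CoversAllSlots(6, 6, 10) returns false, which is the h(6, i) failure described earlier.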
Here is the complete sample code. We will continue to improve it in the next article:

using System;

public class IntSet4
{
    private object[] _values;
    private readonly int DELETED = -1; // marks a removed slot; stored values must be >= 0

    public IntSet4(int capacity)
    {
        int size = GetPrime(capacity);
        _values = new object[size];
    }

    // Prime number table
    private readonly int[] primes = {
        3, 7, 11, 17, 23, 29, 37, 47, 59, 71, 89, 107, 131, 163, 197, 239, 293, 353, 431, 521, 631, 761, 919,
        1103, 1327, 1597, 1931, 2333, 2801, 3371, 4049, 4861, 5839, 7013, 8419, 10103, 12143, 14591,
        17519, 21023, 25229, 30293, 36353, 43627, 52361, 62851, 75431, 90523, 108631, 130363, 156437,
        187751, 225307, 270371, 324449, 389357, 467237, 560689, 672827, 807403, 968897, 1162687, 1395263,
        1674319, 2009191, 2411033, 2893249, 3471899, 4166287, 4999559, 5999471, 7199369 };

    // Determines whether candidate is a prime number.
    private bool IsPrime(int candidate)
    {
        if ((candidate & 1) != 0) // odd number
        {
            int limit = (int)Math.Sqrt(candidate);
            // divisor = 3, 5, 7, ... up to the square root of candidate
            for (int divisor = 3; divisor <= limit; divisor += 2)
            {
                if ((candidate % divisor) == 0)
                    return false;
            }
            return true;
        }
        return (candidate == 2); // 2 is the only even prime
    }

    // Returns min if it is prime; otherwise returns a prime slightly larger than min.
    private int GetPrime(int min)
    {
        // Look in the prime table first.
        for (int i = 0; i < primes.Length; i++)
        {
            int prime = primes[i];
            if (prime >= min) return prime;
        }
        // Beyond the table, test every odd number from min upward until a prime is found.
        for (int i = (min | 1); i < Int32.MaxValue; i += 2)
        {
            if (IsPrime(i))
                return i;
        }
        return min;
    }

    int H1(int value)
    {
        return value % _values.Length;
    }

    // Always in the range 1 .. Length - 1, so it is coprime with the prime Length.
    int H2(int value)
    {
        return 1 + (value % (_values.Length - 1));
    }

    // Double hashing probe function; long arithmetic avoids overflow on large tables.
    int DH(int value, int i)
    {
        return (int)((H1(value) + (long)i * H2(value)) % _values.Length);
    }

    public void Add(int item)
    {
        int i = 0; // number of slots probed so far
        do
        {
            int j = DH(item, i); // address to probe
            if (_values[j] == null || (int)_values[j] == DELETED)
            {
                _values[j] = item;
                return;
            }
            else
            {
                i += 1;
            }
        } while (i < _values.Length);
        throw new Exception("set overflow");
    }

    public bool Contains(int item)
    {
        int i = 0; // number of slots probed so far
        int j = 0; // address to probe
        do
        {
            j = DH(item, i);
            if (_values[j] == null) return false;
            if ((int)_values[j] == item)
                return true;
            else
                i += 1;
        } while (i < _values.Length);
        return false;
    }

    public void Remove(int item)
    {
        int i = 0; // number of slots probed so far
        int j = 0; // address to probe
        do
        {
            j = DH(item, i);
            if (_values[j] == null) return;
            if ((int)_values[j] == item)
            {
                _values[j] = DELETED;
                return;
            }
            else
            {
                i += 1;
            }
        } while (i < _values.Length);
    }
}

Is there an even better method than chaining and open addressing? People will never stop asking questions, but this article has to end. Next time, we will study the .NET Framework source code and discuss some important details of implementing a hash table.
