How Hash Tables work

Last Update:2014-10-12 Source: Internet

Author: User

1. Introduction
Hash tables have appeared in NOI problems for nearly two years. As an efficient data structure, the hash table is playing an increasingly important role in competitions.
The biggest advantage of a hash table is that storing and searching for data takes very little time, close to constant time per operation. The price is relatively high memory consumption. However, with more and more memory available, trading space for time is worthwhile. Ease of coding is another of its strengths.
Hash tables can be divided into "open hashing" and "closed hashing". Since most contestants avoid dynamic storage structures during a competition, "hash table" in this article refers only to closed hashing. For open hashing, see other books.

2. Basic operations
2.1 Basic Principles
We use an array with a large index range to store the elements. We design a function (the hash function) that maps the key of each element to a function value (an array index), and the element is stored in the array cell at that index. Put simply, each element is "classified" by its key and stored in the cell belonging to its "class".
However, there is no guarantee that keys and function values correspond one to one. Different elements may well produce the same function value, and then a "conflict" (collision) occurs: different elements fall into the same "class". A simple way to resolve conflicts is described later.
In short, "direct addressing" and "conflict resolution" are the two defining features of a hash table.

2.2 Function construction
Common ways to construct the hash function (below, H(k) denotes the function value for the element whose key is k):
A) Division method:
Choose a suitable positive integer p and let H(k) = k mod p. The results are better when p is a fairly large prime. This method is very easy to implement, so it is the most commonly used.
B) Digit selection method:
If the keys have many digits, exceeding the range of a long integer, select several digits whose values are evenly distributed. The number formed by those digits is used as a new key or directly as the function value.
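
As a rough illustration of the two methods, here is a sketch in Python rather than the Pascal used in this article. The function names hash_division and hash_digits, the prime 9997, and the chosen digit positions are assumptions made for the example, not part of the original text.

```python
P = 9997  # table size; a prime, as recommended for the division method


def hash_division(key: int, p: int = P) -> int:
    """Division method: H(k) = k mod p."""
    return key % p


def hash_digits(key: str, positions=(1, 3, 5, 7)) -> int:
    """Digit selection: build a smaller number from a few chosen digits.

    `key` is a decimal string too long for a machine integer; `positions`
    (hypothetical) are the indices of digits assumed to be evenly distributed.
    """
    return int("".join(key[i] for i in positions))


print(hash_division(123456789))             # 3836
print(hash_digits("31415926535897932384"))  # 1196
```

The value produced by digit selection can be used directly as an index if it fits the table, or passed through the division method afterwards.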

2.3 Conflict handling
Linear rehashing (linear probing) is easy to implement and works well. Let S be the size of the array. If cell H(k) is already occupied, probe (H(k) + i) mod S for i = 1, 2, 3, ... until an empty cell is found. If the scan wraps around the whole table without finding one, the hash table is full and an error has occurred; this can be avoided by making the array larger.
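
The probing order can be sketched as follows (Python, not the article's Pascal; the helper name probe_sequence is hypothetical). It simply lists the cells that linear rehashing would examine, starting from the home cell.

```python
from typing import Iterator


def probe_sequence(h: int, s: int) -> Iterator[int]:
    """Yield h, (h + 1) mod s, (h + 2) mod s, ... visiting each cell once."""
    for i in range(s):
        yield (h + i) % s


# With a table of size 7 and H(k) = 5, the probes wrap around the end.
print(list(probe_sequence(5, 7)))  # [5, 6, 0, 1, 2, 3, 4]
```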

2.4 Supported operations
A hash table supports the following operations: initialization (makenull), computing the hash function value (h(x)), insertion (insert), and membership testing (member). Let x be the key of the element being inserted and a the storage array. Initialization is easy:

const
  empty = maxlongint; // a very large integer marks a cell that holds no element
  P = 9997;           // table size
var
  a: array[0..P-1] of longint; // storage array

procedure makenull;
var i: integer;
begin
  for i := 0 to P-1 do
    a[i] := empty;
end;

How the hash function value is computed depends on the function chosen. For the division method, for example:

function h(x: longint): integer;
begin
  h := x mod P; // division method; assumes x is non-negative
end;

Both insertion and search must first locate the element, that is, determine where it is stored if it is present, or where it should be stored if it is not. We therefore add a locating function, locate.

function locate(x: longint): integer;
var orig, i: integer;
begin
  orig := h(x);
  i := 0;
  // P is the table size (called S in section 2.3)
  while (i < P) and (a[(orig + i) mod P] <> x) and (a[(orig + i) mod P] <> empty) do
    inc(i);
  // When the loop stops, it has found an empty cell, or the cell
  // that already stores x, or the table is full
  locate := (orig + i) mod P;
end;

Insert element

procedure insert(x: longint);
var posi: integer;
begin
  posi := locate(x); // cell returned by the locating function
  if a[posi] = empty then a[posi] := x
  else error; // x is already stored or the table is full; error is a user-supplied handler
end;

Check whether the element is already in the table.

function member(x: longint): boolean; // a function, since it returns a value
var posi: integer;
begin
  posi := locate(x);
  if a[posi] = x then member := true
  else member := false;
end;

These are common basic operations built on hash tables.
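
For readers who want something runnable, the operations above can be put together roughly as follows. This is a sketch in Python, not a translation the author provided. The class name ClosedHashTable is hypothetical, None replaces maxlongint as the "empty" marker, and, unlike the Pascal insert, a duplicate key is silently ignored instead of raising an error.

```python
EMPTY = None  # marker for an unused cell (the Pascal code uses maxlongint)


class ClosedHashTable:
    def __init__(self, size: int = 9997):
        self.size = size
        self.slots = [EMPTY] * size  # plays the role of makenull

    def _h(self, x: int) -> int:
        return x % self.size  # division method

    def _locate(self, x: int) -> int:
        """Return the cell holding x, or the first empty cell on its probe path.

        If the table is full and x is absent, the home cell is returned.
        """
        orig = self._h(x)
        i = 0
        while (i < self.size
               and self.slots[(orig + i) % self.size] != x
               and self.slots[(orig + i) % self.size] is not EMPTY):
            i += 1
        return (orig + i) % self.size

    def insert(self, x: int) -> None:
        pos = self._locate(x)
        if self.slots[pos] is EMPTY:
            self.slots[pos] = x
        elif self.slots[pos] != x:
            raise OverflowError("hash table is full")
        # otherwise x is already present and nothing needs to be done

    def member(self, x: int) -> bool:
        return self.slots[self._locate(x)] == x


table = ClosedHashTable(size=7)
for key in (10, 17, 24):  # all three keys hash to cell 3
    table.insert(key)
print(table.slots)        # [None, None, None, 10, 17, 24, None]
print(table.member(17), table.member(31))  # True False
```

The tiny table size of 7 is chosen only so that the conflicts are visible in the printed array.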

Preliminary conclusion:
When the amount of data is very small or close to the capacity of the hash table, the hash table shows no efficiency advantage and may even be slower than ordinary algorithms. For data sizes in between, its efficiency shows fully. Experiments indicate that once the table is about 90% full, efficiency drops noticeably. The lesson: if you decide to use a hash table, make the array as large as you reasonably can. A very large array is itself costly to work with, though, so a balance must be found. As a rule of thumb, a capacity of at least 120% of the maximum number of elements the problem requires works fairly well (this is only experience, with no rigorous proof).
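
The effect of the fill level can be observed with a small experiment. The following Python sketch is not the experiment the article refers to. The function name average_probes, the random keys, and the fixed seed are all assumptions. It fills a linear-probing table to a given fraction and reports the average number of cells examined per insertion.

```python
import random


def average_probes(size: int, load: float, seed: int = 0) -> float:
    """Fill a linear-probing table to `load` and return mean probes per insert."""
    rng = random.Random(seed)
    slots = [None] * size
    n = int(size * load)
    total = 0
    for key in rng.sample(range(10**9), n):  # n distinct random keys
        pos = key % size
        probes = 1
        while slots[pos] is not None:  # step forward until a free cell is found
            pos = (pos + 1) % size
            probes += 1
        slots[pos] = key
        total += probes
    return total / n


for load in (0.5, 0.8, 0.9, 0.99):
    print(f"load {load:.2f}: {average_probes(9997, load):.2f} probes per insert")
```

The exact figures depend on the keys and the seed, but the probe count should climb steeply as the table approaches full.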

4. Application Example
4.1 Simple principles of application
When should a hash table be used? If solving a problem keeps raising the question "is this element in a known set?", that is, if you need efficient storage and lookup, a hash table is the best choice. What should you watch for when applying one?
The design of the hash function is very important. A bad hash function causes many conflicts, and as the earlier code shows, resolving conflicts wastes a lot of time. Our goal is therefore to avoid conflicts as far as possible. As mentioned above, with the division method H(k) = k mod p, p should preferably be a large prime, precisely to reduce conflicts. Why? If p = 1000, the hash function in effect classifies keys by their last three decimal digits. There are at most 1000 classes, and keys that share their last three digits always conflict. In general, the more divisors p has, the greater the probability of conflict.
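
A quick check of the p = 1000 case, as a Python sketch with made-up keys (the helper distinct_slots and the key pattern are assumptions for illustration):

```python
def distinct_slots(keys, p: int) -> int:
    """Number of different cells the keys occupy under H(k) = k mod p."""
    return len({k % p for k in keys})


# 100 hypothetical keys that differ only above the last three digits.
keys = [i * 1000 + 7 for i in range(100)]

print(distinct_slots(keys, 1000))  # 1   (every key lands in cell 7)
print(distinct_slots(keys, 997))   # 100 (no two keys share a cell)
```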
A simple argument: suppose p has many divisors, and a key q in the data satisfies gcd(p, q) = d > 1, that is, p = a * d and q = b * d. Then q mod p = q - p * (q div p) = q - p * (b div a). ① Here b div a is an integer in the range [0, b], so it can take at most b + 1 values; since p is fixed in advance, expression ① can also take at most b + 1 values. Moreover, because q and p are both multiples of d, so is ①. So although the remainder still lies in [0, p-1], it is confined to the values that ① can produce. In other words, the remainders are no longer evenly distributed. The more divisors p has, the more uneven the distribution becomes and the greater the chance of conflict. Primes have the fewest divisors, so we choose a large prime. Remember: "primes are our trusty helpers".
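
The argument can be checked numerically. In this Python sketch (the helper name residues and the choice d = 8 are assumptions), every key is a multiple of 8, and we count how many cells are ever hit for a composite p and for a prime p.

```python
from math import gcd


def residues(p: int, d: int, count: int = 5000) -> set:
    """Cells hit by the keys d, 2d, 3d, ... (count of them) under k mod p."""
    return {(d * b) % p for b in range(1, count + 1)}


# All keys are multiples of d = 8 (an assumed pattern in the data).
for p in (1000, 997):
    hit = residues(p, 8)
    print(p, gcd(p, 8), len(hit))
# 1000 8 125   -> only p / d = 125 of the 1000 cells are ever used
# 997 1 997    -> every cell is reachable
```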
On the other hand, blindly pursuing a low conflict rate is not good either. In theory we could design a nearly perfect function with almost no conflicts, but that is clearly not worth it: designing such a function takes a lot of time and the code is bound to be complicated. Rather than spend so much effort on the function, it is better to accept one with somewhat more conflicts but simple code. The function therefore also needs to be easy to code, that is, easy to implement.
To sum up, designing a good hash function is critical. "Good" means a low conflict rate and easy implementation.
In addition, using a hash table does not mean applying the basic operations above unchanged every time. Sometimes the structure of the hash table must be adapted to the requirements of the problem, and some simple modifications can bring great convenience.
These are only general principles. Real problems vary endlessly, and each one has to be analyzed on its own terms.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
