1. Preface
This article introduces hashing, the design of hash functions, and methods for handling collisions. A simple code example is provided at the end.
2. Introduction to hashing
Suppose the keys are drawn from a universe U = {0, 1, ..., m-1}, so there are at most m elements in total. If m is not very large, we can define an array T[0..m-1] in which each position corresponds to one key in U: T[k] holds the element whose key is k, and T[k] = NULL if U contains no element with key k. T is called a direct-address table. Insertion, deletion, and search each take only O(1) time. Note the premise, however: "m is not very large". This premise is clearly very restrictive. When m is very large, most of the table goes unused and a great deal of space is wasted. What can we do? This is where hashing comes in: given n elements and m storage locations (also called slots), a hash function associates keys with storage locations, ideally matching each key with its own location in the structure. A search can then compute the key's location from the hash function and fetch the record directly, without comparing against other keys. A table organized in this way is called a hash table.
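The direct-address table described above can be sketched in C as follows. This is only a minimal illustration: the universe size of 16, the Record type, and the da_* function names are assumptions made for the example and do not come from the article.

```c
#include <assert.h>
#include <stddef.h>

#define UNIVERSE_SIZE 16   /* assumed small universe U = {0, ..., 15} */

typedef struct {
    int key;
    const char *value;
} Record;

/* T[k] points at the record whose key is k, or is NULL if there is none. */
static Record *T[UNIVERSE_SIZE];

/* Each operation touches exactly one array slot, hence O(1) time. */
void da_insert(Record *r) {
    assert(r != NULL && r->key >= 0 && r->key < UNIVERSE_SIZE);
    T[r->key] = r;
}

void da_delete(int key) {
    assert(key >= 0 && key < UNIVERSE_SIZE);
    T[key] = NULL;
}

Record *da_search(int key) {
    assert(key >= 0 && key < UNIVERSE_SIZE);
    return T[key];
}
```

The array needs one slot per possible key, which is exactly the space cost that becomes prohibitive when the universe is large.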
3. Hash Functions
There are many hash functions. A good hash function (approximately) satisfies simple uniform hashing: for any key in the key universe U, the hash function is equally likely to map it to any address in the address set. Such a function is called a uniform hash function. In other words, the hash function turns each key into a "random address", so that the hash addresses of a group of keys are spread evenly over the whole address range and collisions are reduced. Most of the time, keys are interpreted as natural numbers. Several common hash functions follow.
(1) Division hash
Hash function: H(key) = key % p. The choice of p is very important. The hash address produced by this function always lies in the range 0 to p-1. A good choice for p is a prime number that is not too close to an integer power of 2. When the number of storage locations m is large, p should not be too small.
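A one-line C sketch of the division formula follows. The function name division_hash and the use of unsigned keys are assumptions for illustration; the value 701 used in the usage note is just one example of a prime that is not close to a power of 2.

```c
#include <assert.h>

/* Division method: H(key) = key % p. The result is always in 0..p-1. */
unsigned division_hash(unsigned key, unsigned p) {
    assert(p > 0);
    return key % p;
}
```

For instance, division_hash(1000, 701) evaluates to 299, because 1000 = 1 * 701 + 299.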
(2) Multiplication hash
Hash function: H(key) = floor(p * (key * A - floor(key * A))), where 0 < A < 1. The expression key * A - floor(key * A) takes the fractional part of key * A, which is then multiplied by the constant p; the result is rounded down. The method places no special requirement on the choice of p, which is usually taken to be a power of 2.
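The multiplication formula can be sketched in C with floating-point arithmetic, as below. The function name is hypothetical, and the constant A in the usage note (2654435769 / 2^32, close to (sqrt(5) - 1) / 2, a value often attributed to Knuth) is an assumption for the example, not something the article specifies. Truncating casts are used in place of floor, which is equivalent here because all values are non-negative.

```c
#include <assert.h>

/* Multiplication method: floor(p * frac(key * A)), with 0 < A < 1. */
unsigned multiplication_hash(unsigned key, double A, unsigned p) {
    assert(A > 0.0 && A < 1.0);
    double product = (double)key * A;
    /* Fractional part: subtract the integer part (truncation == floor for x >= 0). */
    double fraction = product - (double)(unsigned long long)product;
    return (unsigned)((double)p * fraction);
}
```

With key = 123456, A = 2654435769.0 / 4294967296.0 and p = 16384 (a power of 2), the function returns 67.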
(3) Universal hashing
At the start of execution, a hash function is chosen at random from a carefully designed family of functions. The random choice is made once per use of the hash table, not once per insert or search operation. Because the chosen function is then fixed and deterministic, search operations are carried out correctly. Universal hashing guarantees that for any two keys key1 != key2, the probability that they collide is at most 1/m. One way to design a universal family of hash functions is described below; with this method, the hash table size m can be arbitrary.
Designing a universal family of hash functions: choose a prime p large enough that every possible key lies in the range 0 to p-1. Let Z_p denote the set {0, 1, ..., p-1} and Z_p* the set {1, 2, ..., p-1}. For any a in Z_p* and any b in Z_p, define the hash function h_{a,b}(k) = ((a*k + b) mod p) mod m. The family of all such hash functions is H_{p,m} = {h_{a,b} : a in Z_p*, b in Z_p}. Since there are p-1 choices for a and p choices for b, H_{p,m} contains p(p-1) hash functions in total. In a given use of the hash table, a and b are generated at random within these ranges. For example, with p = 17 and m = 6, if a = 3 and b = 4 are generated, then h_{3,4}(8) = ((3*8 + 4) mod 17) mod 6 = (28 mod 17) mod 6 = 11 mod 6 = 5.
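As a rough C sketch of the family h_{a,b}(k) = ((a*k + b) mod p) mod m, the code below stores the parameters in a struct and draws a and b with rand(). The type and function names are assumptions for illustration, rand() is used only for brevity (it is not a high-quality random source), and the arithmetic assumes p is small enough that a*k + b does not overflow 64 bits.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    unsigned long long a, b, p, m;   /* a in 1..p-1, b in 0..p-1 */
} UniversalHash;

/* Draw a and b at random once, when the table is created. */
UniversalHash universal_pick(unsigned long long p, unsigned long long m) {
    assert(p > 1 && m > 0);
    UniversalHash h;
    h.a = 1 + (unsigned long long)rand() % (p - 1);
    h.b = (unsigned long long)rand() % p;
    h.p = p;
    h.m = m;
    return h;
}

/* h_{a,b}(k) = ((a*k + b) mod p) mod m; assumes a*k + b fits in 64 bits. */
unsigned long long universal_hash(const UniversalHash *h, unsigned long long k) {
    assert(h->a >= 1 && h->a < h->p && h->b < h->p && h->m > 0);
    return ((h->a * k + h->b) % h->p) % h->m;
}
```

Filling the struct by hand with a = 3, b = 4, p = 17, m = 6 and hashing k = 8 gives 5, which matches the worked example in the text.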
4. Collision Handling Methods
When H(key1) = H(key2) for two different keys, both keys correspond to the same hash address and a collision occurs. A good hash function can only reduce collisions; it cannot eliminate them completely. How do we deal with collisions? Several common methods follow.
(1) Open addressing
In open addressing, all elements are stored in the hash table itself. To insert an element, we probe slots of the hash table one after another until an empty slot is found to hold the key. Open addressing requires that, for every key k, the probe sequence be a permutation of (0, 1, ..., m-1), so that every slot can be probed.
The i-th probe address is H_i = (H(key) + d_i) mod m, where m is the length of the hash table, d_i is the i-th term of the increment sequence, and H(key) is the hash function.
Linear probing: when the increments d in the formula are 1, 2, 3, ..., m-1, the method is called linear probing. With this method the initial probe position determines the entire probe sequence: if the first probe is at T[1], the next is at T[2], then T[3], and so on. There are therefore only m distinct probe sequences. As the table fills up, runs of consecutively occupied slots grow longer and the average search time increases. This phenomenon is called clustering.
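The probe formula with increments 0, 1, 2, ... can be sketched in C as below. This is a minimal illustration under stated assumptions: keys are non-negative ints, the home hash is key % m, the table length 11 is arbitrary, and the lp_* names are invented for the example. Deletion is not handled.

```c
#include <assert.h>

#define LP_M 11          /* assumed table length */
#define LP_EMPTY (-1)    /* marks an unused slot; keys must be >= 0 */

static int lp_table[LP_M];

void lp_init(void) {
    for (int i = 0; i < LP_M; i++)
        lp_table[i] = LP_EMPTY;
}

/* Returns the slot where key is stored, or -1 if the table is full. */
int lp_insert(int key) {
    assert(key >= 0);
    for (int d = 0; d < LP_M; d++) {
        int slot = (key % LP_M + d) % LP_M;
        if (lp_table[slot] == LP_EMPTY || lp_table[slot] == key) {
            lp_table[slot] = key;
            return slot;
        }
    }
    return -1;
}

/* Returns the slot holding key, or -1 if key is absent. */
int lp_search(int key) {
    assert(key >= 0);
    for (int d = 0; d < LP_M; d++) {
        int slot = (key % LP_M + d) % LP_M;
        if (lp_table[slot] == LP_EMPTY)
            return -1;           /* an empty slot ends the probe sequence */
        if (lp_table[slot] == key)
            return slot;
    }
    return -1;
}
```

Inserting 3, 14 and 25 (all with home slot 3) places them in slots 3, 4 and 5, a small instance of the run of occupied slots that slows later searches.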
Quadratic probing: when the increments are d = +1^2, -(1^2), +2^2, -(2^2), ..., +k^2, -(k^2), the method is called quadratic probing. The initial probe position again determines the entire probe sequence, so there are only m distinct probe sequences. However, clustering is much less likely than with linear probing.
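To make the alternating square increments concrete, here is a small C helper that returns the slot visited at the i-th probe. The function name and the convention that probe 0 is the home slot are assumptions for the example.

```c
#include <assert.h>

/* Slot of the i-th probe: home, home+1, home-1, home+4, home-4, home+9, ... */
int quadratic_probe(int home, int i, int m) {
    assert(m > 0 && home >= 0 && home < m && i >= 0);
    long long k = (i + 1) / 2;                        /* 0, 1, 1, 2, 2, 3, ... */
    long long offset = (i % 2 == 1) ? k * k : -(k * k);
    long long slot = (home + offset) % m;             /* may be negative in C */
    if (slot < 0)
        slot += m;
    return (int)slot;
}
```

With m = 11 and home slot 3, probes 0 through 5 visit slots 3, 4, 2, 7, 10 and 1.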
Pseudo-random probing: the increments d form a pseudo-random number sequence.
Double hashing: the probe function is h(k, i) = (h1(k) + i * h2(k)) mod m, for i = 0, 1, ..., m-1.
For the probe sequence to cover the entire hash table, the value h2(k) must be relatively prime to the table size m. One way to guarantee this is to take m to be a power of 2 and design h2 so that it always produces an odd number. Another way is to take m to be a prime number and design h2 so that it always produces a positive integer smaller than m. For example, take m prime and let h1(k) = k mod m, h2(k) = 1 + (k mod m'), where m' = m - 1.
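The double-hashing example with h1(k) = k mod m and h2(k) = 1 + (k mod (m - 1)) can be sketched as a single C function. The function name is hypothetical, and the code assumes m is a prime greater than 2 and that keys are non-negative.

```c
#include <assert.h>

/* i-th probe slot for key k: (h1(k) + i * h2(k)) mod m, with m prime. */
int double_hash_probe(int k, int i, int m) {
    assert(k >= 0 && i >= 0 && m > 2);
    int h1 = k % m;
    int h2 = 1 + k % (m - 1);     /* always in 1..m-1, so coprime to a prime m */
    return (int)(((long long)h1 + (long long)i * h2) % m);
}
```

For m = 13 and k = 14, h1 = 1 and h2 = 3, so the first three probes visit slots 1, 4 and 7, and the 13 probes i = 0..12 visit every slot exactly once.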
(2) Chaining
Place all elements that hash to the same slot in a linked list. Compared with open addressing, this may require more storage space, since each element also stores a pointer.
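A minimal C sketch of storing colliding elements in per-slot linked lists is shown below. The table size 7, the int keys hashed with key % 7, and the chain_* names are assumptions for illustration; duplicates are not checked on insertion.

```c
#include <assert.h>
#include <stdlib.h>

#define CHAIN_SLOTS 7    /* assumed number of slots */

typedef struct Node {
    int key;
    struct Node *next;
} Node;

static Node *buckets[CHAIN_SLOTS];   /* each slot heads a linked list */

/* Insert at the head of the slot's list: O(1). Returns 0 if out of memory. */
int chain_insert(int key) {
    assert(key >= 0);
    Node *n = malloc(sizeof *n);
    if (n == NULL)
        return 0;
    n->key = key;
    n->next = buckets[key % CHAIN_SLOTS];
    buckets[key % CHAIN_SLOTS] = n;
    return 1;
}

/* Walk the slot's list; returns the node or NULL if key is absent. */
Node *chain_search(int key) {
    assert(key >= 0);
    for (Node *n = buckets[key % CHAIN_SLOTS]; n != NULL; n = n->next)
        if (n->key == key)
            return n;
    return NULL;
}

/* Unlink and free the first node holding key. Returns 1 if one was removed. */
int chain_delete(int key) {
    assert(key >= 0);
    Node **link = &buckets[key % CHAIN_SLOTS];
    while (*link != NULL && (*link)->key != key)
        link = &(*link)->next;
    if (*link == NULL)
        return 0;
    Node *dead = *link;
    *link = dead->next;
    free(dead);
    return 1;
}
```

Deleting a key only unlinks one node, so other keys in the same chain stay reachable without any special marker.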
(3) Public overflow area
When a collision occurs, the colliding key is stored in a separate public overflow area shared by all slots.
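One possible reading of the overflow idea is sketched below in C: a primary table holds at most one key per slot, and every colliding key is appended to a shared array that is searched sequentially. The sizes, the ov_* names, and the sequential scan are assumptions for this sketch; the article does not specify how the overflow area is organized.

```c
#include <assert.h>

#define OV_M 5           /* assumed primary table length */
#define OV_CAP 16        /* assumed overflow capacity */
#define OV_EMPTY (-1)    /* marks an unused primary slot; keys must be >= 0 */

static int ov_primary[OV_M];
static int ov_area[OV_CAP];
static int ov_count;     /* number of keys currently in the overflow area */

void ov_init(void) {
    for (int i = 0; i < OV_M; i++)
        ov_primary[i] = OV_EMPTY;
    ov_count = 0;
}

/* Returns 1 if key is in the primary table or the overflow area. */
int ov_search(int key) {
    assert(key >= 0);
    int slot = key % OV_M;
    if (ov_primary[slot] == OV_EMPTY)
        return 0;                       /* nothing ever hashed here */
    if (ov_primary[slot] == key)
        return 1;
    for (int i = 0; i < ov_count; i++)  /* sequential scan of the overflow area */
        if (ov_area[i] == key)
            return 1;
    return 0;
}

/* Returns 1 on success (or if key is already present), 0 if overflow is full. */
int ov_insert(int key) {
    assert(key >= 0);
    if (ov_search(key))
        return 1;
    int slot = key % OV_M;
    if (ov_primary[slot] == OV_EMPTY) {
        ov_primary[slot] = key;
        return 1;
    }
    if (ov_count == OV_CAP)
        return 0;
    ov_area[ov_count++] = key;
    return 1;
}
```

Because the overflow area is scanned linearly, this layout only stays fast while collisions are rare.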
5. Perfect hashing
A hashing technique is called perfect hashing if a search requires only O(1) memory accesses in the worst case (that is, no collisions occur). A two-level hashing scheme is generally used, with universal hashing at each level. All keys that hash to slot j are stored in a secondary hash table S_j, which is like replacing each linked list in the chaining method with a hash table. To guarantee that there are no collisions at the second level, the size m_j of the secondary hash table S_j must be the square of the number n_j of keys that hash to slot j. If n keys are stored in a hash table of size m = n using a hash function h chosen at random from a universal family, and the size of each secondary hash table is set to m_j = n_j^2 (j = 0, 1, ..., m-1), then the expected total storage required for all the secondary hash tables is less than 2n.
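The sizing rule m_j = n_j^2 can be illustrated with a short C function that counts how many keys fall into each first-level slot and sums the squares. This covers only the storage calculation, not a full two-level table; the function names and the choice of ((a*k + b) mod p) mod m as the first-level hash are assumptions for the sketch.

```c
#include <assert.h>
#include <stdlib.h>

/* First-level hash ((a*k + b) mod p) mod m; assumes a*k + b does not overflow. */
static size_t first_level(unsigned long k, unsigned long a, unsigned long b,
                          unsigned long p, size_t m) {
    return (size_t)(((a * k + b) % p) % m);
}

/* Sum of m_j = n_j^2 over all slots j, or (size_t)-1 if allocation fails. */
size_t secondary_storage(const unsigned long *keys, size_t n,
                         unsigned long a, unsigned long b,
                         unsigned long p, size_t m) {
    assert(m > 0 && p > 0);
    size_t *count = calloc(m, sizeof *count);    /* n_j for each slot j */
    if (count == NULL)
        return (size_t)-1;
    for (size_t i = 0; i < n; i++)
        count[first_level(keys[i], a, b, p, m)]++;
    size_t total = 0;
    for (size_t j = 0; j < m; j++)
        total += count[j] * count[j];
    free(count);
    return total;
}
```

For the nine keys 10, 22, 37, 40, 52, 60, 70, 72, 75 with a = 3, b = 42, p = 101 and m = 9, the slot counts are 1, 3, 1 and 4, so the secondary tables need 1 + 9 + 1 + 16 = 27 cells. The bound of 2n applies to the expectation over random choices of the hash function, so a single fixed choice such as this one can exceed it.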
6. Hash Table Performance Analysis
Load factor α = (number of records in the table) / (length of the hash table). α indicates how full the hash table is. The analysis of the average search length for successful and unsuccessful searches is fairly involved. With chaining, handling a collision on insertion or deletion takes O(1) time, which is convenient and makes chaining well suited to hash tables from which records are deleted frequently. Chaining depends heavily on the hash function: with a poor hash function, many slots stay empty while a few chains grow long, which can waste a lot of space. When a record is deleted under open addressing, the deleted location can be assigned a special value marking the record as deleted, so that insertion and search of other records are not affected.
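The special "deleted" marker for open addressing, together with the load factor, can be sketched in C as follows. The sketch assumes linear probing, non-negative int keys, a table of length 7, and invented ts_* names; none of these details come from the article.

```c
#include <assert.h>

#define TS_M 7                              /* assumed table length */
enum { TS_EMPTY = -1, TS_DELETED = -2 };    /* slot markers; keys must be >= 0 */

static int ts_table[TS_M];
static int ts_count;                        /* number of live records */

void ts_init(void) {
    for (int i = 0; i < TS_M; i++)
        ts_table[i] = TS_EMPTY;
    ts_count = 0;
}

/* Deleted slots do not stop the search; only a truly empty slot does. */
int ts_search(int key) {
    assert(key >= 0);
    for (int d = 0; d < TS_M; d++) {
        int slot = (key % TS_M + d) % TS_M;
        if (ts_table[slot] == TS_EMPTY)
            return -1;
        if (ts_table[slot] == key)
            return slot;
    }
    return -1;
}

/* Reuses the first empty or deleted slot on the probe sequence. */
int ts_insert(int key) {
    assert(key >= 0);
    int slot = ts_search(key);
    if (slot >= 0)
        return slot;                        /* already present */
    for (int d = 0; d < TS_M; d++) {
        slot = (key % TS_M + d) % TS_M;
        if (ts_table[slot] < 0) {           /* TS_EMPTY or TS_DELETED */
            ts_table[slot] = key;
            ts_count++;
            return slot;
        }
    }
    return -1;                              /* table is full */
}

/* Marks the slot as deleted instead of emptying it. */
int ts_delete(int key) {
    int slot = ts_search(key);
    if (slot < 0)
        return 0;
    ts_table[slot] = TS_DELETED;
    ts_count--;
    return 1;
}

/* Load factor: live records divided by table length. */
double ts_load_factor(void) {
    return (double)ts_count / TS_M;
}
```

After inserting 1, 8 and 15 (slots 1, 2, 3) and deleting 8, a search for 15 still succeeds because the marker in slot 2 does not end the probe sequence.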
7. Appendix
Reference books: Introduction to Algorithms; Data Structures.
Hash Table application example:
/*
 * Problem: a dictionary consists of strings made up of uppercase letters.
 * Each string has a numeric code, obtained by replacing every letter with a
 * digit according to the telephone keypad:
 *     2: A, B, C    5: J, K, L    8: T, U, V
 *     3: D, E, F    6: M, N, O    9: W, X, Y, Z
 *     4: G, H, I    7: P, Q, R, S
 * Given a number of 1 to 12 digits, find all strings in the dictionary whose
 * code is that number. The dictionary contains at most 5000 strings.
 *
 * Approach: 1. Backtrack to enumerate every string the number could encode.
 *           2. Look each string up in the dictionary (stored in a hash table).
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define HASHTABLE_LENGTH 5001   /* hash table length */
#define STRING_LENGTH 13        /* maximum word length plus terminator */

typedef struct {
    char str[STRING_LENGTH];
    int length;
} HString;

HString string = {"", 0};               /* candidate string being built */
HString hashTable[HASHTABLE_LENGTH];    /* hash table; length == 0 marks an empty slot */

/* Hash function: sum of the letter offsets, reduced modulo the table length. */
int hashKey(const char *str)
{
    int key = 0;
    while (*str)
        key += *str++ - 'A';
    return (key % HASHTABLE_LENGTH + HASHTABLE_LENGTH) % HASHTABLE_LENGTH;
}

/* Slot probed at a given attempt: home, home+1, home-1, home+2, home-2, ... */
int probe(int home, int attempt)
{
    int offset = (attempt % 2) ? (attempt + 1) / 2 : -(attempt / 2);
    return (home + offset + HASHTABLE_LENGTH) % HASHTABLE_LENGTH;
}

/* Insert a dictionary word into the hash table. */
void createHashTable(const char *str)
{
    int home = hashKey(str), attempt, key;
    if (*str == '\0')
        return;                         /* skip empty lines */
    for (attempt = 0; attempt < HASHTABLE_LENGTH; attempt++) {
        key = probe(home, attempt);     /* resolve collisions by probing */
        if (hashTable[key].length == 0) {
            hashTable[key].length = (int)strlen(str);
            strcpy(hashTable[key].str, str);
            return;
        }
    }
}

/* Read the dictionary from the file, one word per line. */
void readString(void)
{
    int i = 0, ch;                      /* ch is an int so that EOF can be detected */
    char str[STRING_LENGTH];
    FILE *fp = fopen("document/dictionary.txt", "r");
    if (fp == NULL) {
        printf("can not open file!\n");
        exit(EXIT_FAILURE);
    }
    while ((ch = getc(fp)) != EOF) {
        if (ch == '\n') {               /* a complete word has been read */
            str[i] = '\0';
            createHashTable(str);
            i = 0;
        } else if (i < STRING_LENGTH - 1) {
            str[i++] = (char)ch;
        }
    }
    if (i > 0) {                        /* last line without a trailing newline */
        str[i] = '\0';
        createHashTable(str);
    }
    if (fclose(fp)) {
        printf("can not close file!\n");
        exit(EXIT_FAILURE);
    }
}

/* Return 1 if the string is in the hash table, 0 otherwise. */
int search(const char *str)
{
    int home = hashKey(str), attempt, key;
    for (attempt = 0; attempt < HASHTABLE_LENGTH; attempt++) {
        key = probe(home, attempt);
        if (hashTable[key].length == 0)
            return 0;
        if (strcmp(hashTable[key].str, str) == 0)
            return 1;
    }
    return 0;
}

/* Enumerate every string the digits can encode and print those in the dictionary. */
void getString(const char *num)
{
    int i, digit, max;
    if (*num == '\0') {                 /* end of recursion: the string is complete */
        string.str[string.length] = '\0';
        if (search(string.str))         /* the string is in the dictionary */
            puts(string.str);
        return;
    }
    digit = *num - '0';                 /* convert the character to a number */
    if (digit >= 2 && digit <= 6) {
        i = (digit - 2) * 3 + 'A';
        max = i + 3;
    } else if (digit == 7) {
        i = 'P';
        max = 'P' + 4;
    } else if (digit == 8) {
        i = 'T';
        max = 'T' + 3;
    } else if (digit == 9) {
        i = 'W';
        max = 'W' + 4;
    } else {
        return;                         /* 0, 1 and non-digits encode no letter */
    }
    for (; i < max; i++) {
        string.str[string.length++] = (char)i;
        getString(num + 1);             /* recurse on the remaining digits */
        string.length--;
    }
}

int main(void)
{
    /* The input is read as a string because a 12-digit number may exceed the range of unsigned long. */
    char num[STRING_LENGTH];
    readString();                       /* load the dictionary into memory */
    printf("Please input a number (1 to 12 digits, without 0 or 1)\n");
    if (scanf("%12s", num) == 1)
        getString(num);
    return 0;
}