Document directory
- 1. Direct addressing
- 2. Digit analysis
- 3. Mid-square method
- 4. Folding method
- 5. Division remainder method
- 6. Random number method
- 1. Open address method
- 2. Rehash
- 3. Link address method
- 4. Create a public overflow zone
Hash function construction method
Common methods for constructing hash functions include:
1. Direct addressing
The keyword itself, or a linear function of the keyword, is taken as the hash address. That is:
H(key) = key or H(key) = a · key + b
where a and b are constants (such a hash function is called an identity function).
[Example]: There is a statistical table of population numbers for ages 1 to 100. Age is used as the keyword, and the hash function returns the keyword itself. To find the number of people aged 25, you only need to look at the 25th entry of the table.
[Another example]: There is a population survey table of people born after liberation. The keyword is the year of birth. The hash function adds a constant to the keyword: H(key) = key + (-1948). To find the number of people born in 1970, you only need to look at entry (1970 - 1948) = 22.
With direct addressing, the address set has the same size as the keyword set, so different keywords never conflict. In practice, however, situations where this kind of hash function can be used are rare.
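The birth-year example above can be sketched in C as follows. This is a minimal illustration, not the textbook's code: the names direct_hash, population_of, BASE_YEAR, and TABLE_SIZE are hypothetical, and the table size of 100 entries is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define BASE_YEAR 1948   /* the constant from H(key) = key + (-1948) */
#define TABLE_SIZE 100   /* assumed number of table entries */

/* H(key) = key + (-1948): the birth year maps straight to a table index. */
static size_t direct_hash(int year)
{
    assert(year >= BASE_YEAR && year < BASE_YEAR + TABLE_SIZE);
    return (size_t)(year - BASE_YEAR);
}

/* Look up the population count for a birth year with one array access. */
static long population_of(const long table[TABLE_SIZE], int year)
{
    return table[direct_hash(year)];
}
```

Because every keyword gets its own slot, no conflict handling is needed; the cost is that the table must cover the whole keyword range.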
2. Digit analysis
Assume that the keyword is a number in radix r (for example, a decimal number with radix 10). If all the keywords that may appear in the hash table are known in advance, several digits of the keyword can be taken to form the hash address.
[Example]: There are 80 records whose keywords are 8-digit decimal numbers. Assume that the hash table length is 100 (decimal); then two decimal digits can be taken to form the hash address. Which two? The principle is that the resulting hash addresses should avoid conflicts as far as possible, so the 80 keywords must be analyzed.
Analysis of all the keywords shows that digits ① and ② are always "8 1", digit ③ only takes the values 1, 2, 3, or 4, and the last digit (⑧) only takes the values 2, 5, or 7. None of these four digits is therefore suitable.
Because the four digits in the middle can be regarded as nearly random, any two of them can be taken, or the sum of two of them and the other two, with the carry discarded, can be used as the hash address.
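A small C sketch of the digit selection described above. It assumes 8-digit decimal keywords and arbitrarily picks digits ⑤ and ⑥ from the four nearly random middle digits; the function names and the sample keyword 81346532 mentioned below are hypothetical, not taken from the source.

```c
#include <assert.h>

/* Return the digit at 1-based position pos (counted from the left)
   of an 8-digit decimal keyword. */
static int digit_at(unsigned long key, int pos)
{
    assert(pos >= 1 && pos <= 8);
    for (int i = 8; i > pos; --i)
        key /= 10;            /* drop the digits to the right of pos */
    return (int)(key % 10);
}

/* Hash address in 0..99 built from digits 5 and 6, an assumed choice
   among the four nearly random middle digits. */
static int digit_hash(unsigned long key)
{
    return 10 * digit_at(key, 5) + digit_at(key, 6);
}
```

For the hypothetical keyword 81346532, digits ⑤ and ⑥ are 6 and 5, so the address is 65.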
3. Mid-square method
The middle digits of the square of the keyword are taken as the hash address. This is a commonly used method for constructing hash functions.
Usually the full set of keywords is not known when a hash function is chosen, so taking particular digits of the keyword is not necessarily suitable. The middle digits of a squared number depend on every digit of the number, so randomly distributed keywords yield randomly distributed hash addresses. The number of digits taken is determined by the table length.
[Example]: Build a hash table for the identifiers in a BASIC source program. Assume that an identifier in BASIC is either a single letter or a letter followed by a digit. Inside the computer, each letter or digit can be represented by two octal digits. With a table length of 512 = 2^9, the middle 9 bits of the square of the keyword are taken as the hash address. Figure 9.23 (b) lists some identifiers and their hash addresses.
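A hedged C sketch of the mid-square idea for binary keywords. It assumes that a two-character identifier is encoded in 12 bits (two octal digits per character), so its square has 24 bits; exactly which 9 bits count as "the middle" is a choice of this sketch and may differ from the figure. The name mid_square and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Mid-square hash: square a keyword of key_bits bits and return the
   addr_bits bits centred in the 2*key_bits-bit square. */
static uint32_t mid_square(uint32_t key, unsigned key_bits, unsigned addr_bits)
{
    assert(key_bits >= 1 && key_bits <= 16);
    assert(addr_bits >= 1 && addr_bits < 32 && addr_bits <= 2 * key_bits);
    assert((key >> key_bits) == 0);              /* keyword fits in key_bits */

    uint32_t square = key * key;                 /* at most 32 bits */
    unsigned shift = (2 * key_bits - addr_bits) / 2;
    return (square >> shift) & ((1u << addr_bits) - 1u);
}
```

With key_bits = 12 and addr_bits = 9 the result always lies in 0..511, matching a table of length 512.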
4. Folding method
Divide the keyword into several parts with the same number of digits (the last part may have fewer digits), and take the sum of these parts, with the carry discarded, as the hash address. This method is called folding. Folding is suitable when the keyword has many digits and the values of each digit are roughly evenly distributed. For example, consider computing a hash address by folding for the book whose International Standard Book Number (ISBN) is 0-442-20586-4.
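One way to fold the ISBN above can be sketched in C. The sketch assumes 4-digit hash addresses, treats the ISBN digits 0442205864 as one decimal number, cuts the parts from the low-order end, and discards the carry of the sum; this variant is often called shift folding. The name fold_shift is hypothetical.

```c
#include <assert.h>

/* Shift folding: cut the decimal keyword into parts of part_digits digits,
   starting from the low-order end, add the parts, and discard the carry. */
static unsigned long fold_shift(unsigned long long key, unsigned part_digits)
{
    assert(part_digits >= 1 && part_digits <= 9);
    unsigned long long base = 1;
    for (unsigned i = 0; i < part_digits; ++i)
        base *= 10;                       /* base = 10^part_digits */

    unsigned long long sum = 0;
    while (key > 0) {
        sum += key % base;                /* add the next part */
        key /= base;
    }
    return (unsigned long)(sum % base);   /* drop the carry */
}
```

Under these assumptions the parts are 5864, 4220, and 04; their sum is 10088, and discarding the carry gives the address 0088.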
5. Division remainder method
The remainder obtained by dividing the keyword by a number p that is not greater than the hash table length m is taken as the hash address. That is:
H(key) = key mod p, p ≤ m
This is the simplest and most commonly used method for constructing hash functions. The modulo operation can be applied to the keyword directly, or after folding, mid-square, or similar operations. Note that the choice of p is very important when the division remainder method is used. If p is chosen badly, synonyms are easily produced.
Experience shows that p can generally be chosen as a prime number, or as a composite number with no prime factor smaller than 20.
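A minimal C sketch of the division remainder method, assuming p is chosen as the largest prime not greater than the table length m (one common choice consistent with the advice above, not the only one). The names is_prime, largest_prime_le, and div_hash are hypothetical.

```c
#include <assert.h>

/* Trial division primality test; adequate for table-sized numbers. */
static int is_prime(unsigned n)
{
    if (n < 2)
        return 0;
    for (unsigned d = 2; d <= n / d; ++d)
        if (n % d == 0)
            return 0;
    return 1;
}

/* Largest prime p with p <= m, usable as the modulus. */
static unsigned largest_prime_le(unsigned m)
{
    assert(m >= 2);
    while (!is_prime(m))
        --m;
    return m;
}

/* H(key) = key mod p */
static unsigned div_hash(unsigned key, unsigned p)
{
    assert(p > 0);
    return key % p;
}
```

For a table of length 100 this picks p = 97. With p = 11, the keywords 38 and 60 both map to address 5, so they are synonyms.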
6. Random number method
Choose a random function and take its value for the keyword as the hash address, that is, H(key) = random(key), where random is a random function. This method is usually more suitable when the keywords have unequal lengths.
In practice, different hash functions should be chosen for different situations. The factors usually considered are:
(1) Time required to compute the hash function (including hardware instructions);
(2) Length of the keywords;
(3) Size of the hash table;
(4) Distribution of the keywords;
(5) Frequency of record lookups.
Methods for handling conflicts
A well-chosen hash function can reduce conflicts but cannot avoid them. How conflicts are handled is therefore an essential part of building a hash table.
Assume that the address set of the hash table is 0 ~ (n-1). A conflict means that a record already exists at position j (0 ≤ j ≤ n-1), the hash address obtained from the keyword. "Handling the conflict" means finding another "empty" hash address for the record with this keyword. This process may produce an address sequence Hi, i = 1, 2, ..., k (Hi ∈ [0, n-1]). That is, if the first alternative address H1 still conflicts, find the next address H2; if H2 still conflicts, find H3; and so on until Hk does not conflict. Hk is then the address of the record in the table. The following methods are usually used to handle conflicts:
1. Open address method
Hi = (H(key) + di) mod m, i = 1, 2, ..., k (k ≤ m-1)    (9-25)
Here H(key) is the hash function, m is the length of the hash table, and di is the increment sequence, which can be chosen in three ways:
(1) di = 1, 2, 3, ..., m-1, called linear probing;
(2) di = 1², -1², 2², -2², ..., ±k² (k ≤ m/2), called quadratic probing;
(3) di = a pseudo-random number sequence, called pseudo-random probing.
[Example]: A hash table of length 11 already holds three records with keywords 17, 60, and 29 (hash function H(key) = key mod 11). A fourth record with keyword 38 is now to be inserted.
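The example above can be traced with a short C sketch of linear and quadratic probing. It assumes non-negative integer keywords and a hypothetical EMPTY marker of -1; the function names are illustrative and not taken from the source.

```c
#include <assert.h>

#define M 11          /* table length from the example */
#define EMPTY (-1)    /* assumed marker for a free slot */

/* i-th increment of quadratic probing: 1, -1, 4, -4, 9, -9, ... */
static int quad_increment(int i)
{
    int k = (i + 1) / 2;
    return (i % 2 == 1) ? k * k : -(k * k);
}

/* Insert key with H(key) = key mod M. A nonzero quadratic flag selects
   quadratic probing, otherwise linear probing is used.
   Returns the slot used, or -1 if no free slot was found. */
static int probe_insert(int table[M], int key, int quadratic)
{
    assert(key >= 0);
    int h = key % M;
    for (int i = 0; i < M; ++i) {
        int d = (i == 0) ? 0 : (quadratic ? quad_increment(i) : i);
        int pos = ((h + d) % M + M) % M;   /* keep the result in 0..M-1 */
        if (table[pos] == EMPTY) {
            table[pos] = key;
            return pos;
        }
    }
    return -1;
}
```

With 17, 60, and 29 stored at addresses 6, 5, and 7, keyword 38 first hashes to 5. Linear probing then tries 6 and 7 and settles at 8, while quadratic probing tries 6 and then settles at 4.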
Linear probing has the following drawback: once positions i, i+1, and i+2 in the table are filled, the next record whose hash address is i, i+1, i+2, or i+3 will be placed at position i+3. This competition for the same next hash address between records whose first hash addresses differ is called "secondary clustering"; that is, handling conflicts between synonyms adds conflicts between non-synonyms. This clearly hurts searching. On the other hand, linear probing guarantees that a non-conflicting address Hk can always be found as long as the hash table is not full, whereas quadratic probing guarantees this only when the table length m is a prime of the form 4j + 3 (j is an integer), and for pseudo-random probing it depends on the pseudo-random sequence.
2. Rehash
Hi = RHi(key), i = 1, 2, ..., k
The RHi are different hash functions. When synonyms produce an address conflict, the address is computed with another hash function, and so on until no conflict occurs. This method is less prone to "clustering" but increases the computing time.
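A rough C sketch of rehashing with a fixed list of hash functions. The source does not say which functions RHi to use, so rh1, rh2, and rh3 below are purely hypothetical choices, as are the table length of 11 and the FREE_SLOT marker.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_LEN 11     /* assumed table length */
#define FREE_SLOT (-1)   /* assumed marker for a free slot */

typedef int (*hash_fn)(int key);

/* Three hypothetical hash functions RH1, RH2, RH3. */
static int rh1(int key) { return key % TABLE_LEN; }
static int rh2(int key) { return (key / TABLE_LEN) % TABLE_LEN; }
static int rh3(int key) { return (7 * key + 3) % TABLE_LEN; }

/* Try each hash function in turn until one yields a free slot.
   Returns the slot used, or -1 if every function conflicts. */
static int rehash_insert(int table[TABLE_LEN], int key)
{
    static const hash_fn rh[] = { rh1, rh2, rh3 };
    assert(key >= 0);
    for (size_t i = 0; i < sizeof rh / sizeof rh[0]; ++i) {
        int pos = rh[i](key);
        if (table[pos] == FREE_SLOT) {
            table[pos] = key;
            return pos;
        }
    }
    return -1;
}
```

If 60 already occupies address 5, inserting 38 conflicts under rh1 (38 mod 11 = 5) and lands at address 3 under rh2. Each retry costs one more hash computation, which is the extra computing time mentioned above.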
3. Link address method
Store all records whose keywords are synonyms in the same linear linked list. Assume that the hash addresses produced by the hash function lie in the range [0 .. m-1]; then set up a vector of pointers, Chain chainhash[m], in which every component is initially a null pointer. All records whose hash address is i are inserted into the linked list whose head pointer is chainhash[i]. A record can be inserted at the head or the tail of the list, or in the middle so that the synonyms in each list stay sorted by keyword.
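The link address method can be sketched in C as below. The sketch assumes non-negative integer keywords, a table length of 13 with H(key) = key mod 13, and insertion in sorted order; the type ChainNode and the functions chain_insert and chain_search are hypothetical names, not the textbook's declarations.

```c
#include <assert.h>
#include <stdlib.h>

#define CHAIN_M 13   /* assumed table length, H(key) = key mod 13 */

typedef struct ChainNode {
    int key;
    struct ChainNode *next;
} ChainNode;

/* Insert key into the list headed by chainhash[key mod CHAIN_M], keeping
   the list sorted by keyword. Returns 1 on insertion, 0 if the key is
   already present or memory allocation fails. */
static int chain_insert(ChainNode *chainhash[CHAIN_M], int key)
{
    assert(key >= 0);
    ChainNode **link = &chainhash[key % CHAIN_M];
    while (*link != NULL && (*link)->key < key)
        link = &(*link)->next;           /* advance to the insert position */
    if (*link != NULL && (*link)->key == key)
        return 0;                        /* duplicate keyword */
    ChainNode *node = malloc(sizeof *node);
    if (node == NULL)
        return 0;
    node->key = key;
    node->next = *link;
    *link = node;
    return 1;
}

/* Return the node holding key, or NULL if the key is absent. */
static ChainNode *chain_search(ChainNode *chainhash[CHAIN_M], int key)
{
    assert(key >= 0);
    ChainNode *n = chainhash[key % CHAIN_M];
    while (n != NULL && n->key < key)
        n = n->next;
    return (n != NULL && n->key == key) ? n : NULL;
}
```

Keeping each list sorted lets an unsuccessful search stop as soon as a larger keyword is seen. For instance, the illustrative keywords 14 and 27 both hash to address 1 and end up in the same list in that order.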
[Example 9-3] Given a group of keywords (,), the hash function H(key) = key mod 13, and the link address method for handling conflicts, the resulting hash table is shown in Figure 9.26.
4. Create a public overflow zone
This is another way to handle conflicts. Assume that the value range of the hash function is [0 .. m-1]. Set the vector hashtable[0 .. m-1] as the basic table, in which each component stores one record, and set another vector overtable[0 .. v] as the overflow table. Every record whose keyword is a synonym of a keyword already in the basic table is placed in the overflow table when a conflict occurs, regardless of the hash address that the hash function gives it.
Searching a hash table and its analysis
Searching a hash table follows basically the same process as creating it. Given a value K, compute the hash address with the hash function chosen when the table was created. If that position in the table holds no record, the search fails. Otherwise compare keywords: if the keyword equals the given value, the search succeeds; if not, find the "next address" using the conflict handling method chosen when the table was created, and repeat until a position in the hash table is "empty" or the keyword of the record stored there equals the given value. The search process of a hash table shows the following:
1) Although a hash table establishes a direct mapping between the keyword and the storage location of the record, conflicts mean that searching a hash table is still a process of comparing the given value with keywords. The average search length is therefore still needed as the measure of hash table search efficiency.
2) The number of keywords that must be compared with the given value during a search depends on three factors: the hash function, the method for handling conflicts, and the load factor of the hash table.
// Storage structure of the open address hash table
int hashsize[] = {997, ...};   // hash table capacity increment table, a suitable sequence of primes
typedef struct {
    ElemType *elem;    // base address of the data elements, dynamically allocated array
    int count;         // current number of data elements
    int sizeindex;     // hashsize[sizeindex] is the current capacity
} HashTable;

#define SUCCESS 1
#define UNSUCCESS 0
#define DUPLICATE -1

Status SearchHash(HashTable H, KeyType K, int &p, int &c) {
    // Search the open address hash table H for the element with keyword K.
    // If found, p gives the position of the element in the table and SUCCESS is returned;
    // otherwise p gives the insert position and UNSUCCESS is returned.
    // c counts conflicts; it starts at zero and is used as a reference when inserting during table creation.
    p = Hash(K);                                   // compute the hash address
    while (H.elem[p].key != NULLKEY &&             // this position holds a record
           !EQ(K, H.elem[p].key))                  // and its keyword differs from K
        collision(p, ++c);                         // compute the next probe address p
    if (EQ(K, H.elem[p].key))
        return SUCCESS;      // search succeeded; p is the position of the element found
    else
        return UNSUCCESS;    // search failed (H.elem[p].key == NULLKEY); p is the insert position
}
Insertion into an open address hash table can be implemented by calling the search algorithm.
Status InsertHash(HashTable &H, ElemType e) {
    // When the search fails, insert data element e into the open address hash table H and return OK;
    // if the number of conflicts is too large, rebuild the hash table.
    int p, c = 0;           // p: position found by the search, c: conflict counter
    if (SearchHash(H, e.key, p, c))
        return DUPLICATE;   // the table already holds an element with the same keyword as e
    else if (c
This article is excerpted from Data Structures (C Language Edition) by Yan Weimin and Wu Weimin, Tsinghua University Press. Many thanks to the original authors.