Data Structure-Hash Function
Hash search
In the search algorithms discussed previously, the time complexity is O(n) or O(log2 n), and the efficiency depends on the number of comparisons.
Even when the table is organized as a sorted tree, comparing the search key with a data element that is not equal to it yields one of two results, "greater than" or "less than", so there are two possible directions for the next step. O(log2 n) is therefore the best that comparison-based search can do.
A hash table takes a different approach. The time complexity of searching a hash table can be as low as O(1): given a keyword, the element can be located immediately.
Hash tables are widely used in practice. Many programming languages provide built-in hash-based lookup, such as Hashtable and HashSet in Java.
Basic Ideas
Establish a definite correspondence between a record's storage address and its keyword. The element being searched for can then be obtained with a single access, without any comparison.
Hash function f: keyword → storage location. That is, the hash function is the correspondence between the keyword and the storage location (hash address). Given a keyword, the storage location can be computed through this function.
For example, an array containing 10 elements is given.
[0 1 2 3 4 5 6 7 8 9]
With the hash function f(K) = K mod 10, the array subscript of the element corresponding to a key value is obtained immediately. With f(K) = K mod 2, by contrast, the keys are only divided into two groups: the search range is reduced, but an element can no longer be located directly as with the previous function.
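The two functions in this example can be sketched in C as follows. This is only an illustration: the names hash_mod10, hash_mod2 and found_directly are made up for this example, and keys are assumed to be non-negative integers.

```c
#define TABLE_SIZE 10   /* the 10-element array from the example */

/* f(K) = K mod 10: every key 0..9 gets its own array subscript. */
static unsigned hash_mod10(unsigned key) {
    return key % TABLE_SIZE;
}

/* f(K) = K mod 2: only two addresses exist, so many keys share one. */
static unsigned hash_mod2(unsigned key) {
    return key % 2;
}

/* Check for `key` with a single array access, no search loop.
   Returns 1 if the slot addressed by f(K) = K mod 10 holds the key. */
static int found_directly(const unsigned table[TABLE_SIZE], unsigned key) {
    return table[hash_mod10(key)] == key;
}
```

With f(K) = K mod 2, the keys 4 and 8 both map to address 0, so the function alone no longer identifies a unique slot.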
Hash function conflict
If the hash function maps different key values to the same hash address, this is called a conflict (collision): ki ≠ kj but H(ki) = H(kj). Two different keywords with the same function value are called synonyms of the hash function. A hash function cannot avoid conflicts completely; it can only reduce them as much as possible. A good hash function maps keywords to hash addresses that are distributed over the hash table as evenly as possible. Besides using a good hash function, handling conflicts effectively is also important.
Hash table:
Based on the chosen hash function H(key) and the conflict handling method, a set of keywords is mapped onto a finite, contiguous address range, and the image of each keyword is used as the storage location of its record in the table. Such a table is called a hash table, and the storage location obtained in this way is called the hash address.
Construction of Hash Functions
A hash function is a mapping that can be defined very flexibly, as long as the hash function value of every keyword falls within the range allowed by the table length. The main criteria for judging whether a hash function is good or bad are:
◆ The hash function is easy to construct;
◆ After keywords are mapped by the hash function, every address is obtained with equal probability, that is, the mapping is "uniform". Such a hash function is called a "uniform hash function".
A good Hash function can reduce keyword conflicts.
Common construction methods:
◆ Direct addressing: use the keyword itself, or a linear function of the keyword, as the hash address, that is, H(key) = key or H(key) = a × key + b, where a and b are constants. The address set obtained by direct addressing has the same size as the keyword set, so no conflict occurs. However, it is rarely used in practice because the distribution of keywords must be known in advance.
◆ Digit analysis method: take some of the digits of the keyword, or a combination of them, as the hash address.
◆ Mid-square method: square the keyword and take several digits from the middle of the result as the hash address.
◆ Folding method: divide the keyword into several parts with the same number of digits (the last part may have a different number), then add these parts together and use the sum as the hash address.
◆ Division-remainder (modulo) method: take the remainder of the keyword divided by a number p that is not greater than the table length m of the hash table, and use it as the hash address: H(key) = key MOD p, p ≤ m. The key to this method is the choice of p; a poorly chosen p easily produces synonyms. As a rule of thumb, if the hash table length is m, p is usually a prime number less than or equal to m (preferably close to m), or a composite number that has no prime factor less than 20 (for example 23 × 29).
◆ Random number method: H(key) = random(key), that is, use the value of a pseudo-random function of the keyword as the hash address. This method is suitable when the keywords are of unequal length.
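Three of these construction methods can be sketched in C for non-negative integer keywords. The function names and the decimal-digit conventions are assumptions made for this sketch: the mid-square version drops the same number of digits on both sides where possible, and the folding version splits from the low-order end and discards the carry, which is only one of several common variants.

```c
#include <stdint.h>

/* Division-remainder method: H(key) = key MOD p, with p <= m. */
static uint32_t hash_division(uint32_t key, uint32_t p) {
    return key % p;
}

/* Mid-square method: square the key and keep `take` decimal digits
   from the middle of the square. Assumes 1 <= take <= 9. */
static uint32_t hash_mid_square(uint32_t key, unsigned take) {
    uint64_t sq = (uint64_t)key * key;
    unsigned digits = 0;
    for (uint64_t t = sq; t > 0; t /= 10) digits++;
    if (digits <= take) return (uint32_t)sq;   /* square is already short */
    uint64_t modulus = 1;
    for (unsigned i = 0; i < take; i++) modulus *= 10;
    unsigned drop = (digits - take) / 2;       /* low digits to discard */
    for (unsigned i = 0; i < drop; i++) sq /= 10;
    return (uint32_t)(sq % modulus);
}

/* Folding method: cut the key into groups of `width` decimal digits,
   add the groups, and keep the low `width` digits of the sum.
   Assumes 1 <= width <= 9. */
static uint32_t hash_fold(uint32_t key, unsigned width) {
    uint32_t base = 1;
    for (unsigned i = 0; i < width; i++) base *= 10;
    uint32_t sum = 0;
    while (key > 0) {
        sum += key % base;
        key /= base;
    }
    return sum % base;
}
```

For example, 1234 squared is 1522756, whose middle three digits are 227, and folding 12345678 in groups of three digits gives 678 + 345 + 12 = 1035, which is reduced to 35.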
Summary: the hash function should be chosen according to the specific problem, taking the following factors into account:
Hash function calculation time
Keyword Length
Hash table size (hash address range)
Keyword Distribution
Record query frequency
Conflict Handling Methods
Conflict: the location at the hash address computed from the keyword already holds a record.
A hash function can only distribute hash addresses as evenly as possible; it cannot avoid conflicts, so conflicts must be handled.
The basic idea of handling conflicts: the process may produce an address sequence Hi, i = 1, 2, ..., k. Each time a hash address Hi is obtained and it still conflicts, the chosen method computes the next hash address Hi+1, until an address without a conflict is found.
In other words, when a conflict occurs, another storage location is found for the conflicting element.
Open Addressing Method
The conflict handling function is
Hi = (H(key) + di) MOD m,
i = 1, 2, ..., k, k ≤ m-1
where H(key) is the hash function, m is the table length of the hash table, and di is the increment sequence.
Choices of di
di = 1, 2, 3, ...: called "linear probing rehash"
di = 1², -1², 2², -2², ..., k², -k², k ≤ m/2: called "quadratic probing rehash"
di is a pseudo-random number sequence: called "pseudo-random probing rehash"
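A minimal C sketch of open addressing with the linear and quadratic increment sequences might look like this. It assumes non-negative integer keys, H(key) = key MOD m, a table length of 11, and -1 as the free-slot marker; all names are hypothetical and deletion is not handled.

```c
#define M 11          /* table length m (assumed; odd, so 2 * (M / 2) = M - 1) */
#define EMPTY (-1)    /* free-slot marker; keys are assumed non-negative */

typedef enum { LINEAR, QUADRATIC } ProbeKind;

/* Increment di for the i-th probe, i = 1, 2, ... */
static int increment(ProbeKind kind, int i) {
    if (kind == LINEAR) return i;              /* 1, 2, 3, ... */
    int k = (i + 1) / 2;                       /* 1, 1, 2, 2, 3, 3, ... */
    return (i % 2 == 1) ? k * k : -(k * k);    /* 1, -1, 4, -4, ... */
}

/* Hi = (H(key) + di) MOD m, kept in 0..M-1; i = 0 gives H(key) itself. */
static int probe_address(ProbeKind kind, int key, int i) {
    int d = (i == 0) ? 0 : increment(kind, i);
    return ((key % M + d) % M + M) % M;
}

static void table_init(int table[M]) {
    for (int i = 0; i < M; i++) table[i] = EMPTY;
}

/* Returns the slot used, or -1 if the probe sequence found no free slot. */
static int table_insert(int table[M], ProbeKind kind, int key) {
    for (int i = 0; i < M; i++) {
        int addr = probe_address(kind, key, i);
        if (table[addr] == EMPTY || table[addr] == key) {
            table[addr] = key;
            return addr;
        }
    }
    return -1;
}

/* Follows the same address sequence as insertion.
   Returns the slot holding key, or -1 if the search fails. */
static int table_search(const int table[M], ProbeKind kind, int key) {
    for (int i = 0; i < M; i++) {
        int addr = probe_address(kind, key, i);
        if (table[addr] == EMPTY) return -1;   /* free slot: key is absent */
        if (table[addr] == key) return addr;
    }
    return -1;
}
```

Inserting the synonyms 22, 33 and 44 (all with H(key) = 0) places them at addresses 0, 1, 2 under linear probing and at 0, 1, 10 under quadratic probing.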
Rehash Method
Construct several hash functions. When a conflict occurs, use a different hash function to compute the next hash address, until no conflict occurs. That is, Hi = RHi(key), i = 1, 2, ..., k, where the RHi are a set of different hash functions. When the first conflict occurs, RH1 is used; when the second conflict occurs, RH2 is used; and so on, until some Hi no longer conflicts.
◆ Advantage: conflicts are less likely to "cluster";
◆ Disadvantage: the computing time increases.
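The rehash idea can be sketched in C with an array of function pointers. The concrete functions h_base, rh1 and rh2 below are arbitrary choices made up for this illustration, not functions prescribed by the method; keys are assumed to be non-negative integers and duplicates are not checked.

```c
#include <stddef.h>

#define M 11          /* table length m (assumed) */
#define EMPTY (-1)    /* free-slot marker; keys are assumed non-negative */

typedef int (*HashFn)(int key);

static int h_base(int key) { return key % M; }           /* H(key) */
static int rh1(int key)    { return (key / M) % M; }     /* RH1, hypothetical */
static int rh2(int key)    { return (7 * key + 3) % M; } /* RH2, hypothetical */

static const HashFn RH[] = { rh1, rh2 };
static const size_t RH_COUNT = sizeof RH / sizeof RH[0];

/* Try H(key) first; on the i-th conflict switch to RHi.
   Returns the slot used, or -1 if every function led to a conflict. */
static int rehash_insert(int table[M], int key) {
    int addr = h_base(key);
    for (size_t i = 0; table[addr] != EMPTY; i++) {
        if (i == RH_COUNT) return -1;   /* ran out of hash functions */
        addr = RH[i](key);
    }
    table[addr] = key;
    return addr;
}
```

In this sketch each extra conflict costs one more hash computation, which is the increase in computing time mentioned above.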
Link Address (Chaining) Method
Method: store all records whose keywords are synonyms (that is, have the same hash address) in one singly linked list, and use a one-dimensional array to store the head pointers of the linked lists.
Let the hash table length be m and define a one-dimensional pointer array:
RecNode *linkhash[m];
where RecNode is the node type and the initial value of each component is NULL.
All records with hash address k are inserted into the linked list whose head pointer is linkhash[k].
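A minimal C sketch of the link address method, built around the linkhash array, is shown below. The fields of RecNode (key and next), the table length of 7, the hash function key MOD m, and the helper names are all assumptions made for this example; the original only names the node type and the array.

```c
#include <stdlib.h>

#define M 7   /* table length m (assumed) */

typedef struct RecNode {
    int key;                 /* assumed field: the record's keyword */
    struct RecNode *next;    /* assumed field: next synonym in the chain */
} RecNode;

/* Static storage, so every head pointer starts out as NULL. */
static RecNode *linkhash[M];

static int hash(int key) { return key % M; }   /* keys assumed non-negative */

/* Insert at the head of chain linkhash[k].
   Returns 1 on success, 0 if the key exists or allocation fails. */
static int chain_insert(int key) {
    int k = hash(key);
    for (RecNode *p = linkhash[k]; p != NULL; p = p->next)
        if (p->key == key) return 0;
    RecNode *node = malloc(sizeof *node);
    if (node == NULL) return 0;
    node->key = key;
    node->next = linkhash[k];
    linkhash[k] = node;
    return 1;
}

/* Walk only the chain for the key's hash address. */
static RecNode *chain_search(int key) {
    for (RecNode *p = linkhash[hash(key)]; p != NULL; p = p->next)
        if (p->key == key) return p;
    return NULL;
}

/* Release all nodes and reset the head pointers. */
static void chain_free(void) {
    for (int k = 0; k < M; k++) {
        while (linkhash[k] != NULL) {
            RecNode *next = linkhash[k]->next;
            free(linkhash[k]);
            linkhash[k] = next;
        }
    }
}
```

Inserting at the head keeps insertion cheap once the duplicate check is done; keeping each chain sorted is another possible design.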
The search process of a hash table is basically the same as the construction process: the hash function and the conflict handling function are applied repeatedly until the search either succeeds or fails.
Hash Search Analysis
We can see from the hash search process:
Hashing establishes a direct mapping between the keyword and the storage address of the record;
Because of conflicts, hash search still involves keyword comparisons, and its performance depends on the number of comparisons.
The average search length (ASL) is therefore still used to evaluate the efficiency of hash search. It is affected by the following factors.
Hash Functions
The quality of the hash function affects how often conflicts occur. For the same set of keywords, however, any uniform hash function produces conflicts with the same probability, so the hash function is generally not the decisive factor affecting ASL.
Conflict Handling Method
The conflict handling methods introduced above (linear probing, quadratic probing, pseudo-random probing, rehashing, and link addressing) result in different ASL values.
Fill factor (measures how full the hash table is)
α = number of records in the table / length of the hash table
The smaller the fill factor, the lower the probability of conflict; the larger the fill factor, the higher the probability of conflict and the more comparisons are required.
Conclusion: the average search length (ASL) of a hash table depends on the fill factor α, not directly on the number of records.
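As a rough illustration, the fill factor and the ASL of successful searches can be measured for a small table. The sketch below assumes linear probing, H(key) = key MOD m with m = 11, distinct non-negative integer keys, and hypothetical function names.

```c
#define M 11          /* table length m (assumed for this sketch) */
#define EMPTY (-1)    /* free-slot marker; keys are assumed non-negative */

/* Fill factor: alpha = number of records / table length. */
static double fill_factor(int n, int m) {
    return (double)n / (double)m;
}

/* Insert with linear probing (di = 1, 2, 3, ...).
   Returns how many slots were examined, which equals the number of
   comparisons a later successful search for this key needs.
   Returns 0 if the table is full. */
static int insert_counting(int table[M], int key) {
    for (int i = 0; i < M; i++) {
        int addr = (key % M + i) % M;
        if (table[addr] == EMPTY) {
            table[addr] = key;
            return i + 1;
        }
    }
    return 0;
}

/* ASL for successful search: average comparisons over all stored keys.
   Assumes 0 < n <= M and that the keys are distinct. */
static double asl_success(const int keys[], int n) {
    int table[M];
    int total = 0;
    for (int i = 0; i < M; i++) table[i] = EMPTY;
    for (int i = 0; i < n; i++) total += insert_counting(table, keys[i]);
    return (double)total / (double)n;
}
```

For the keys 22, 33, 44, 1, 13 in a table of length 11, α = 5/11 and the comparisons per key are 1, 2, 3, 3, 3, giving ASL = 12/5 = 2.4.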
The design of the hash function and of the conflict handling method are the two core issues in designing a hash table. When both are designed well, hash search is more efficient than other search methods, which is why hash tables are so widely used.