I. Hash table
1. Concept
A hash table (also called a hash map) is a data structure that is accessed directly by key. It maps each key to a location in the table in order to speed up lookups. The mapping function is called the hash function, and the array that holds the records is called the hash table.
2. The basic idea of hashed storage
Take the key k of each element in the data as the argument, compute the function value h(k) with the hash function, and use that value as a cell address in a contiguous storage space. The element is stored in the cell at that address.
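The idea can be sketched in a few lines of Python. This is only an illustration: the table size M, the hash function k % M, and the sample keys are assumptions chosen so that no two keys share an address.

```python
# Minimal sketch: store each element in the cell whose address is h(k).
# M, h and the sample data are illustrative assumptions; collisions are ignored here.
M = 11  # number of cells in the contiguous storage space

def h(k: int) -> int:
    """Hash function: map key k to a cell address in 0..M-1."""
    return k % M

cells = [None] * M
for key, value in [(15, "a"), (27, "b"), (42, "c")]:
    cells[h(key)] = (key, value)

def lookup(k: int):
    """Go straight to cell h(k); no scan over the other cells is needed."""
    entry = cells[h(k)]
    return entry[1] if entry is not None and entry[0] == k else None
```

Here lookup(27) goes directly to cell 5, while lookup(16) lands on the same cell, sees a different key, and reports that the key is absent.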
3. Time complexity of hash table lookup
A hash table stores key-value pairs. To locate an element it computes the hash code of the key and accesses that position directly, so the cost of a lookup does not depend on the number of elements. As long as collisions are rare, the time complexity of a hash table lookup is therefore O(1) on average; with many collisions a lookup needs more comparisons.
II. Commonly used hash functions
1. Direct addressing method
Take the key itself, or the value of a linear function of the key, as the hash address, that is, H(key) = key or H(key) = a*key + b (a and b are integer constants). This kind of hash function is also called a "self function". If the hash address H(key) already holds a value, move on to the next position until a position with no value is found, and put the element there.
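A small sketch of direct addressing follows. The function name direct_address and both data sets (ages, birth years starting at 1980) are hypothetical examples, not taken from the text.

```python
def direct_address(key: int, a: int = 1, b: int = 0) -> int:
    """H(key) = a*key + b; with a=1 and b=0 this is H(key) = key."""
    return a * key + b

# Hypothetical example: ages 0..120 index a table of head counts directly.
counts = [0] * 121
for age in [30, 25, 30, 41]:
    counts[direct_address(age)] += 1

# Hypothetical example: birth years from 1980 on, H(key) = key - 1980.
year_slot = direct_address(1995, a=1, b=-1980)
```

The method only makes sense when the keys fall in a small, known range, because the table needs one cell per possible key.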
2. Digit analysis method
Analyze the data set first. Take the birth dates of a group of employees: the leading digits (the year) are largely the same, so using them would make collisions very likely, whereas the digits for the month and day differ widely. Building the hash address from those later digits significantly lowers the chance of a collision. The digit analysis method therefore looks for regularities in the digits of the keys and uses the well-distributed digits to construct hash addresses with a lower collision probability.
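The analysis step can be sketched as follows. The dates are made-up sample data in YYYYMMDD form, and the helper names are hypothetical.

```python
# Birth dates as YYYYMMDD strings; made-up sample data.
dates = ["19900312", "19900725", "19911104", "19900918"]

def distinct_per_position(keys):
    """For each digit position, count how many distinct digits appear."""
    return [len({k[i] for k in keys}) for i in range(len(keys[0]))]

spread = distinct_per_position(dates)   # [1, 1, 1, 2, 2, 4, 3, 4]

def digit_hash(key: str) -> int:
    """Use the month and day digits (positions 4..7) as the hash address."""
    return int(key[4:8])

addresses = [digit_hash(d) for d in dates]
```

The first three positions show a single distinct digit, so they carry no information, while the month and day digits vary and give four different addresses for the four sample keys.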
3. Mid-square method
Take the middle digits of the square of the key as the hash address. The middle digits of the square of a number depend on every digit of that number. Therefore the hash address obtained by the mid-square method depends on every digit of the key, and the addresses are well dispersed. This method suits cases where no digit of the key is sufficiently well distributed, or where the number of well-distributed digits is smaller than the number of digits the hash address needs.
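A minimal sketch of the mid-square method, assuming decimal digits and a caller-chosen address width (the parameter name digits is an assumption):

```python
def mid_square(key: int, digits: int = 2) -> int:
    """Square the key and return the middle `digits` decimal digits."""
    sq = str(key * key)
    if len(sq) <= digits:
        return int(sq)
    start = (len(sq) - digits) // 2
    return int(sq[start:start + digits])
```

For example, 1234 squared is 1522756, whose middle two digits give the address 22, and 4321 squared is 18671041, which gives 71.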
4. Folding method
The folding method splits the key into parts with the same number of digits (the last part may have fewer), then takes the sum of these parts as the hash address (note: the carry out of the highest digit is discarded). There are two ways to form the sum: shift folding and boundary folding. Shift folding aligns the lowest digit of each part and then adds them. Boundary folding folds the key back and forth along the part boundaries, from one end to the other, and then aligns and adds the parts.
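Both variants can be sketched as below. The sketch assumes the key is a string of decimal digits split from the left, and it models boundary folding by reversing every second part; the function names are hypothetical.

```python
def split_parts(key: str, width: int):
    """Split the digit string into parts of `width` digits (the last may be shorter)."""
    return [key[i:i + width] for i in range(0, len(key), width)]

def shift_fold(key: str, width: int) -> int:
    """Align the low digits of every part and add; drop the carry out of the top digit."""
    total = sum(int(p) for p in split_parts(key, width))
    return total % 10 ** width

def boundary_fold(key: str, width: int) -> int:
    """Reverse every second part (folding at the boundaries), then add and drop the carry."""
    parts = split_parts(key, width)
    total = sum(int(p[::-1]) if i % 2 else int(p) for i, p in enumerate(parts))
    return total % 10 ** width
```

With the key "123456789" and three-digit parts, shift folding adds 123 + 456 + 789 = 1368 and keeps 368, while boundary folding adds 123 + 654 + 789 = 1566 and keeps 566.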
5. Random number method
Choose a random function and take the random function value of the key as its hash address. This method is usually used when the keys have different lengths.
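One possible reading of this method is to seed a pseudo-random generator with the key, so that the same key always produces the same address. The sketch below takes that reading as an assumption; random_hash is a hypothetical name.

```python
import random

def random_hash(key: str, m: int) -> int:
    """Seed a private generator with the key and draw an address in 0..m-1."""
    return random.Random(key).randrange(m)

first = random_hash("apple", 13)
second = random_hash("apple", 13)   # same key, same seed, same address
```

Because the seed is the key itself, keys of any length can be used, which matches the case the text describes.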
6. Division-remainder method
Take the remainder of the key divided by some number p that is not greater than the hash table length m as the hash address, that is, H(key) = key MOD p, p <= m. The modulo can be applied to the key directly, or after folding, mid-square, or a similar operation. The choice of p is very important: generally a prime or m is used, and a poorly chosen p easily causes collisions. In general p is taken to be the table length tablesize.
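The effect of choosing p can be illustrated with a small sketch. The keys are made-up values that all share the factor 12.

```python
def division_hash(key: int, p: int) -> int:
    """H(key) = key MOD p."""
    return key % p

keys = [12, 24, 36, 48, 60]                      # made-up keys, all multiples of 12
with_prime = [division_hash(k, 13) for k in keys]       # [12, 11, 10, 9, 8]
with_composite = [division_hash(k, 12) for k in keys]   # [0, 0, 0, 0, 0]
```

With the prime 13 every key gets its own address, whereas p = 12 sends all five keys to address 0.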
III. Methods for handling hash collisions
1. Open addressing method -- linear probing
The address increments of linear probing are di = 1, 2, ..., m-1, where i is the probe number. The method probes the next address each time until an empty address is found and the element is inserted there. If no free address can be found in the whole space, an overflow occurs.
Linear probing is prone to clustering (also translated as "aggregation"). When keys are already stored at positions i, i+1, and i+2 of the table, the next key whose hash address is i, i+1, i+2, or i+3 will try to fill position i+3. Clustering is this phenomenon of keys with different hash addresses competing for the same subsequent address, and it has a significant impact on search efficiency.
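Linear probing and the clustering effect can be sketched as follows. The hash function key % m and the tiny table of length 7 are assumptions for illustration.

```python
def linear_insert(table, key):
    """Insert key with linear probing; return the address it lands on."""
    m = len(table)
    home = key % m
    for d in range(m):                 # d = 0 is the home address
        addr = (home + d) % m
        if table[addr] is None:
            table[addr] = key
            return addr
    raise OverflowError("no free address in the table")

def linear_search(table, key):
    """Return the address of key, or None once an empty cell is reached."""
    m = len(table)
    home = key % m
    for d in range(m):
        addr = (home + d) % m
        if table[addr] is None:
            return None
        if table[addr] == key:
            return addr
    return None

table = [None] * 7
addresses = [linear_insert(table, k) for k in (7, 14, 21, 1)]
```

Keys 7, 14 and 21 all hash to 0 and fill addresses 0, 1 and 2. Key 1 hashes to a different address, 1, yet it ends up at address 3 because it has to walk through the cluster.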
2. Open addressing method -- quadratic probing
The address increment sequence for quadratic probing is di = 1^2, -1^2, 2^2, -2^2, ..., q^2, -q^2 (q <= m/2). Quadratic probing effectively avoids the clustering phenomenon, but it cannot reach every storage cell in the hash table; at least half of the cells can be probed.
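The increment sequence can be sketched like this. The hash function key % m and q = m // 2 are assumptions, and the table of length 7 is only an example.

```python
def quadratic_offsets(m: int):
    """Yield the increments 1^2, -1^2, 2^2, -2^2, ..., q^2, -q^2 with q = m // 2."""
    for i in range(1, m // 2 + 1):
        yield i * i
        yield -i * i

def quadratic_insert(table, key):
    """Insert key with quadratic probing; return the address it lands on."""
    m = len(table)
    home = key % m
    for d in [0, *quadratic_offsets(m)]:
        addr = (home + d) % m          # Python's % keeps negative offsets in range
        if table[addr] is None:
            table[addr] = key
            return addr
    raise OverflowError("probe sequence found no free address")

table = [None] * 7
addresses = [quadratic_insert(table, k) for k in (7, 14, 21, 28)]
```

All four keys hash to 0, but they land on addresses 0, 1, 6 and 4 instead of forming one contiguous run.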
3. Chain address method
The chain address method is also called the zipper method (separate chaining). The basic idea is to link all data elements whose keys differ but whose hash address is the same into one singly linked list. If the chosen hash table length is m, the hash table can be defined as an array of m head pointers t[0..m-1], and every data element with hash address i is inserted as a node into the singly linked list whose head pointer is t[i]. New elements are inserted at the front of the list, not only because this is convenient, but also because a newly inserted element is the most likely to be accessed again soon.
Features of the chain address method
(1) The zipper method handles collisions simply and has no clustering, that is, non-synonyms (keys with different hash addresses) never collide, so the average search length is short;
(2) Because the node space of each linked list is allocated dynamically, the zipper method is better suited to cases where the table length cannot be determined before the table is built;
(3) To reduce collisions, open addressing requires a small load factor α, so when each node is large a lot of space is wasted. With the zipper method α ≥ 1 is acceptable, and when the nodes are large the extra pointer field is negligible, so space is saved;
(4) In a hash table built with the zipper method, deleting a node is easy to implement: simply remove the corresponding node from its list. In a hash table built with open addressing, a deleted node's cell cannot simply be emptied, because that would cut off the search path of the synonym nodes stored after it. The reason is that in every open addressing scheme an empty address cell (that is, an open address) is the condition that signals a failed search. Therefore, deletion in a hash table that handles collisions with open addressing can only place a deletion mark on the node rather than really removing it.
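A minimal sketch of such a table follows. The class names Node and ChainedTable and the hash function key % m are assumptions for illustration.

```python
class Node:
    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node

class ChainedTable:
    def __init__(self, m: int):
        self.heads = [None] * m        # t[0..m-1], one head pointer per address

    def insert(self, key: int) -> None:
        i = key % len(self.heads)
        self.heads[i] = Node(key, self.heads[i])   # new node goes to the front

    def search(self, key: int):
        """Return the number of key comparisons if found, else None."""
        node = self.heads[key % len(self.heads)]
        comparisons = 0
        while node is not None:
            comparisons += 1
            if node.key == key:
                return comparisons
            node = node.next
        return None

t = ChainedTable(13)
for k in (14, 1, 27):                  # all three keys hash to address 1
    t.insert(k)
```

After these inserts the list at address 1 reads 27, 1, 14, so the most recently inserted key 27 is found with one comparison and the oldest key 14 needs three.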
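The deletion mark from point (4) can be sketched for linear probing as follows. The sentinel DELETED, the helper names, and the hash function key % m are assumptions.

```python
EMPTY = None
DELETED = object()     # deletion mark: the cell is free, but the probe path continues

def find(table, key):
    """Linear-probing search that skips deletion marks."""
    m = len(table)
    for d in range(m):
        addr = (key % m + d) % m
        if table[addr] is EMPTY:       # only a truly empty cell ends the search
            return None
        if table[addr] is not DELETED and table[addr] == key:
            return addr
    return None

def delete(table, key):
    """Mark the cell of key as deleted instead of emptying it."""
    addr = find(table, key)
    if addr is not None:
        table[addr] = DELETED
    return addr

table = [7, 14, 21, EMPTY, EMPTY, EMPTY, EMPTY]   # all three keys hash to 0
delete(table, 14)
found = find(table, 21)                           # still reachable at address 2
```

If address 1 had been set to EMPTY instead, the search for 21 would stop there and wrongly report that the key is absent.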
IV. Load factor of a hash table
Load factor = (number of records in the hash table) / (length of the hash table)
The load factor indicates how full the hash table is. The larger its value, the more data elements have been put into the table and the more likely collisions become.
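As a quick check of the formula, here is a one-line sketch; the function name load_factor is hypothetical.

```python
def load_factor(record_count: int, table_length: int) -> float:
    """Load factor = number of records / length of the hash table."""
    return record_count / table_length

alpha_open = load_factor(12, 13)      # 12 records in a table of length 13
alpha_chained = load_factor(26, 13)   # only possible with chaining
```

With open addressing the value can never exceed 1, because each cell holds at most one record; with the chain address method 26 records in 13 lists give a load factor of 2.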
V. Average search length for different collision handling methods
Example:
Assume the hash table length is 13, the hash function is h(k) = k % 13, and the given key sequence is {32, 14, 23, 01, 42, 20, 45, 27, 55, 24, 10, 53}. Draw the hash tables constructed when linear probing and the chain address method are used to resolve collisions, and find the average search length (ASL) of each method when all lookups are equally likely.
(1) Linear probing:
The number of comparisons needed to find an element successfully equals the number of probes made when that element was inserted. Average search length for a successful lookup:
ASL = (1+2+1+4+3+1+1+3+9+1+1+3)/12 = 2.5
Comparisons for an unsuccessful lookup: for a key that hashes to position n, the count is the distance from position n to the first position that holds no data, counting that empty position itself. In this table only position 0 is empty, so position 0 costs 1, position 12 costs 2 (position 12, then position 0), and position 1 costs 13.
Average search length for an unsuccessful lookup:
ASL = (1+13+12+11+10+9+8+7+6+5+4+3+2)/13 = 91/13 = 7
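The linear probing figures for this example can be reproduced with a short script. It follows the counting convention used above, in which an unsuccessful lookup also counts the comparison against the empty cell; the variable names are assumptions.

```python
from fractions import Fraction

KEYS = [32, 14, 23, 1, 42, 20, 45, 27, 55, 24, 10, 53]
M = 13

table = [None] * M
probes = {}                              # key -> comparisons needed to find it
for key in KEYS:
    for d in range(M):
        addr = (key % M + d) % M
        if table[addr] is None:
            table[addr] = key
            probes[key] = d + 1
            break

asl_success = Fraction(sum(probes.values()), len(KEYS))

def failed_probes(home: int) -> int:
    """Cells examined from `home` up to and including the first empty one."""
    count = 0
    while True:
        count += 1
        if table[(home + count - 1) % M] is None:
            return count

asl_failure = Fraction(sum(failed_probes(j) for j in range(M)), M)
```

The resulting table, from address 0 to 12, is: empty, 14, 1, 42, 27, 55, 32, 20, 45, 53, 23, 24, 10. The script gives 30/12 = 5/2 for successful lookups and 91/13 = 7 for unsuccessful ones.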
(2) Chain address method
Average search length for a successful lookup:
ASL = (1*6+2*4+3*1+4*1)/12 = 7/4
Average search length for an unsuccessful lookup (the terms are the lengths of the lists at addresses 1, 3, 6, 7, 10 and 11; the seven empty addresses contribute 0 because only comparisons with keys are counted):
ASL = (4+2+2+1+2+1)/13 = 12/13
Note: for a successful lookup the denominator is the number of elements in the hash table; for an unsuccessful lookup the denominator is the hash table length.
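The chain address figures can be checked the same way. The sketch uses plain Python lists in place of linked lists and counts only comparisons with stored keys, so an empty list costs 0; the variable names are assumptions.

```python
from fractions import Fraction

KEYS = [32, 14, 23, 1, 42, 20, 45, 27, 55, 24, 10, 53]
M = 13

chains = [[] for _ in range(M)]
for key in KEYS:
    chains[key % M].append(key)

# The element at depth d (1-based) of a list costs d comparisons.
# Inserting at the front or the back only changes which key sits at which depth.
success_total = sum(d for chain in chains for d in range(1, len(chain) + 1))
asl_success = Fraction(success_total, len(KEYS))

# An unsuccessful lookup at address j compares against every key in list j.
asl_failure = Fraction(sum(len(chain) for chain in chains), M)
```

The list lengths at addresses 1, 3, 6, 7, 10 and 11 are 4, 2, 2, 1, 2 and 1, which gives 21/12 = 7/4 for successful lookups and 12/13 for unsuccessful ones.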