Data Structure Hash Summary 1: Theoretical Study
Other articles in this series: Data Structure Hash Summary 2: Program Learning; Data Structure Hash Summary 3: Practice Basics; Data Structure Hash Summary 4: Advanced Programs; Data Structure Hash Summary 5: Hash in nginx (Version 0.1)
When reprinting, please indicate the source: http://blog.csdn.net/yankai0219/article/details/8185796
Suggested reading order: start with the program learning article, then come back to the theoretical and practice articles.
I. Hash basics
1. Hash definition
Hash: an input of arbitrary length is transformed by a hash algorithm into an output of fixed length. That output is the hash value.
Hash function: a function that compresses a message of any length into a fixed-length message digest.
Hash table: a data structure that makes lookups fast without occupying too much memory.
2. Hash usage
1) In the information security field, hash algorithms are used mainly alongside encryption algorithms to convert information of different lengths into a scrambled fixed-length code, for example a 128-bit digest.
2) Hashing is widely used in massive data processing.
3. Common hash functions (hash algorithms)
1) Division hash method
2) Square hash method
3) Fibonacci hash method
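As a rough illustration of the three methods just listed, the sketch below maps a 32-bit integer key to a table index. The article only names the methods, so the function names, the 32-bit key width, the power-of-two table size (2^bits slots) and the exact form of the square method are assumptions made for this example.

```c
#include <stdint.h>

/* Division method: index = key mod m, where m > 0 is the table length. */
uint32_t division_hash(uint32_t key, uint32_t m)
{
    return key % m;
}

/* Square method (assumed form): square the key with 32-bit wraparound
 * and keep the top `bits` bits. Requires 1 <= bits <= 32. */
uint32_t square_hash(uint32_t key, unsigned bits)
{
    return (uint32_t)(key * key) >> (32u - bits);
}

/* Fibonacci method: multiply by 2654435769, which is about 2^32 divided
 * by the golden ratio, then keep the top `bits` bits.
 * Requires 1 <= bits <= 32. */
uint32_t fibonacci_hash(uint32_t key, unsigned bits)
{
    return (uint32_t)(key * 2654435769u) >> (32u - bits);
}
```

With bits = 4 each of the last two functions returns an index in 0..15, so they suit a table of 16 slots; the division method works with any table length m, and a prime m is a common choice.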
4. Hash table introduction
Essence of the hash table: classification.
Hash table: a data structure that combines the easy addressing of arrays with the easy insertion and deletion of linked lists.
An intuitive picture: a pile of books kept as a linked list or a linear table is messy and hard to search. If the books are numbered and kept in order, binary search finds one quickly. If they are first sorted into categories such as engineering, science, and liberal arts, the search is faster still.
Hash table implementation: there are several implementations; the most common is the zipper method.
5. Example
The string hash function below obtains the hash value by traversing the string and applying the expression hash = 31 * hash + *p to each character. Many hash functions work this way; only the expression (the hash algorithm) differs.
    unsigned int yk_simple_hash(char *str, int str_len)
    {
        register unsigned int hash;
        register unsigned char *p;
        int i;

        /* Stop at the terminating NUL or after str_len bytes, whichever comes first. */
        for (hash = 0, i = 0, p = (unsigned char *) str; *p && i < str_len; p++, i++)
            hash = 31 * hash + *p;

        return (hash & 0x7fffffff);
    }
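To make the arithmetic concrete: for the string "ab" (ASCII codes 97 and 98) the loop computes 31 * 0 + 97 = 97 and then 31 * 97 + 98 = 3105. The sketch below repeats the same recurrence in a self-contained form and reduces the result to a bucket index with the division method. The names `poly31_hash` and `bucket_index` and the bucket count parameter are hypothetical, chosen only for illustration.

```c
#include <stddef.h>

/* Same recurrence as the article's example: hash = 31 * hash + byte. */
unsigned int poly31_hash(const char *str, size_t len)
{
    const unsigned char *p = (const unsigned char *) str;
    unsigned int hash = 0;
    size_t i;

    /* Stop after len bytes or at the terminating NUL, whichever comes first. */
    for (i = 0; i < len && p[i] != '\0'; i++)
        hash = 31 * hash + p[i];

    return hash & 0x7fffffff;   /* keep the low 31 bits, as the original does */
}

/* Map the hash value to a slot in a table with nbuckets slots (nbuckets > 0). */
unsigned int bucket_index(const char *str, size_t len, unsigned int nbuckets)
{
    return poly31_hash(str, len) % nbuckets;
}
```

For example, with 16 buckets the string "ab" lands in bucket 3105 mod 16 = 1.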
II. Conflicts
1. Conflict definition: assume that the address set of the hash table is 0 ~ (n-1). A conflict occurs when the hash address j (0 <= j <= n-1) obtained from a keyword already holds a record.
2. Handling conflicts: find another "empty" hash address for the record of this keyword. That is, when a conflict is handled, if the first address obtained, H1, still conflicts, the next address H2 is computed; if H2 still conflicts, H3 is computed, and so on until some Hk does not conflict. Hk is then the address of the record in the table.
Methods for handling conflicts:
1) Open address method
Hi = (H(key) + di) mod m, i = 1, 2, ..., k (k <= m-1)
where H(key) is the hash function, m is the hash table length, and di is the increment sequence. There are three choices of increment sequence:
  1) Linear probing: di = 1, 2, 3, ..., m-1
  2) Quadratic probing: di = 1^2, -(1^2), 2^2, -(2^2), ..., +k^2, -(k^2) (k <= m/2)
  3) Pseudo-random probing: di = a pseudo-random number sequence
Disadvantages:
With linear probing, when positions i, i+1, and i+2 of the table already hold records, the next record whose hash address is i, i+1, i+2, or i+3 will be placed in position i+3. This phenomenon, in which records with different initial hash addresses compete for the same next hash address while conflicts are being handled, is called "secondary clustering": conflicts between non-synonyms are added in the course of handling conflicts between synonyms. On the other hand, linear probing guarantees that a non-conflicting address Hk can always be found as long as the hash table is not full. Quadratic probing gives this guarantee only when the table length m is a prime number of the form 4j + 3 (j is an integer).
In short, the open address method can cause secondary clustering, which is not conducive to searching.
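The clustering effect described above can be reproduced with a small linear-probing table. This is a minimal sketch under stated assumptions: keys are non-negative integers, H(key) = key mod m, the table length 11 and the -1 sentinel for an empty slot are arbitrary, and the `oa_` names are hypothetical. The loop index i = 0 tries H(key) itself, and i = 1, 2, ... correspond to the increments di.

```c
#define OA_SIZE  11      /* m: table length (arbitrary for this sketch) */
#define OA_EMPTY (-1)    /* sentinel for an empty slot; keys must be >= 0 */

/* Mark every slot as empty. */
void oa_init(int table[OA_SIZE])
{
    int i;
    for (i = 0; i < OA_SIZE; i++)
        table[i] = OA_EMPTY;
}

/* Insert key with linear probing: try (H(key) + i) mod m for i = 0..m-1.
 * Returns the slot used, or -1 if the table is full. */
int oa_insert(int table[OA_SIZE], int key)
{
    int i;
    for (i = 0; i < OA_SIZE; i++) {
        int slot = (key % OA_SIZE + i) % OA_SIZE;
        if (table[slot] == OA_EMPTY || table[slot] == key) {
            table[slot] = key;
            return slot;
        }
    }
    return -1;
}

/* Follow the same probe sequence; an empty slot means the key is absent.
 * Returns the slot holding key, or -1 if it is not in the table. */
int oa_search(const int table[OA_SIZE], int key)
{
    int i;
    for (i = 0; i < OA_SIZE; i++) {
        int slot = (key % OA_SIZE + i) % OA_SIZE;
        if (table[slot] == OA_EMPTY)
            return -1;
        if (table[slot] == key)
            return slot;
    }
    return -1;
}
```

Inserting 0, 11, and 22 (all with hash address 0) fills slots 0, 1, and 2. Key 1, whose own hash address is 1 and which is not a synonym of the others, then ends up in slot 3.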
2) Rehash method
Hi = RHi(key), i = 1, 2, ..., k
Each RHi is a different hash function: when synonyms produce an address conflict, the address is computed with another hash function, and so on until no conflict occurs. This method is unlikely to produce clustering, but it increases the computing time.
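One way to picture the rehash method is a fixed list of hash functions that are tried in order. The sketch below assumes non-negative integer keys and a table of length 7; the three functions `rh1`, `rh2`, `rh3` and all other names are hypothetical and were picked only so that they give different addresses for the same key.

```c
#include <stddef.h>

#define RH_SIZE  7       /* table length (arbitrary for this sketch) */
#define RH_EMPTY (-1)    /* sentinel for an empty slot; keys must be >= 0 */

typedef unsigned int (*rh_func)(unsigned int key);

/* Three different hash functions RH1, RH2, RH3 (illustrative choices). */
static unsigned int rh1(unsigned int key) { return key % RH_SIZE; }
static unsigned int rh2(unsigned int key) { return (key / RH_SIZE) % RH_SIZE; }
static unsigned int rh3(unsigned int key) { return (31u * key + 3u) % RH_SIZE; }

/* Mark every slot as empty. */
void rh_init(int table[RH_SIZE])
{
    int i;
    for (i = 0; i < RH_SIZE; i++)
        table[i] = RH_EMPTY;
}

/* Try each hash function in turn until one yields an empty slot.
 * Returns the slot used, or -1 if every function produced a conflict. */
int rh_insert(int table[RH_SIZE], int key)
{
    static const rh_func funcs[] = { rh1, rh2, rh3 };
    size_t i;

    for (i = 0; i < sizeof funcs / sizeof funcs[0]; i++) {
        unsigned int slot = funcs[i]((unsigned int) key);
        if (table[slot] == RH_EMPTY) {
            table[slot] = key;
            return (int) slot;
        }
    }
    return -1;
}
```

Each conflict costs one more hash computation, which is the extra computing time mentioned above. With a fixed list of functions an insert can also fail while the table still has free slots, so a practical design needs a fallback.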
3) Link address method (zipper method)
All records whose keywords are synonyms are stored in the same linear linked list.
4) Public overflow area
Assume that the range of the hash function is [0, m-1]. The vector hashtable[0..m-1] is the basic table, each component of which stores one record, and another vector overtable[0..v] is the overflow table. Any record whose keyword is a synonym of a keyword already in the basic table is placed in the overflow table when the conflict occurs, whatever hash address the hash function produced for it.
Advantages of the zipper method:
① The zipper method handles conflicts simply and without clustering, that is, non-synonyms never conflict, so the average search length is short.
② Because the node space of each linked list in the zipper method is allocated dynamically, it is better suited to cases where the table length cannot be determined before the table is created.
③ To reduce conflicts, the open address method requires a small load factor α, which wastes a lot of space when nodes are large. The zipper method can work with α ≥ 1, and when nodes are large the pointer field it adds is negligible, so it saves space.
④ Deleting a node is easy in a hash table built with the zipper method: simply remove the corresponding node from its linked list. In a hash table built with the open address method, the slot of a deleted node cannot simply be set to empty, otherwise the search path to synonym nodes inserted after it is cut off. This is because in every open address method an empty address unit (that is, an open address) is the condition for a failed search. Therefore, deletion in a hash table that handles conflicts with the open address method can only mark the node as deleted; it cannot actually remove it.
Disadvantage of the zipper method: pointers require extra space. Therefore, when nodes are small, the open address method saves more space. If the saved pointer space is used to enlarge the hash table, the load factor decreases, which reduces conflicts in the open address method and increases the average search speed.
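The deletion rule for the open address method can be sketched with a per-slot state, where a deleted slot is marked rather than emptied. This is an illustrative sketch only: it assumes non-negative integer keys, linear probing with H(key) = key mod m, and a table length of 11; the `ts_` names and the three-state enum are hypothetical.

```c
#define TS_SIZE 11   /* table length (arbitrary for this sketch) */

enum slot_state { SLOT_EMPTY = 0, SLOT_USED, SLOT_DELETED };

struct slot {
    enum slot_state state;
    int key;             /* keys are assumed to be >= 0 */
};

/* Search along the linear probe sequence. Only a truly empty slot ends
 * the search; a slot marked deleted is skipped. Returns the slot or -1. */
int ts_search(const struct slot table[TS_SIZE], int key)
{
    int i;
    for (i = 0; i < TS_SIZE; i++) {
        int pos = (key % TS_SIZE + i) % TS_SIZE;
        if (table[pos].state == SLOT_EMPTY)
            return -1;
        if (table[pos].state == SLOT_USED && table[pos].key == key)
            return pos;
    }
    return -1;
}

/* Insert key, reusing the first deleted or empty slot on its probe path.
 * Returns the slot used, or -1 if the table is full. */
int ts_insert(struct slot table[TS_SIZE], int key)
{
    int i, first_free = -1;
    for (i = 0; i < TS_SIZE; i++) {
        int pos = (key % TS_SIZE + i) % TS_SIZE;
        if (table[pos].state == SLOT_USED) {
            if (table[pos].key == key)
                return pos;              /* already present */
            continue;
        }
        if (first_free < 0)
            first_free = pos;            /* remember the first reusable slot */
        if (table[pos].state == SLOT_EMPTY)
            break;                       /* key cannot be stored further along */
    }
    if (first_free < 0)
        return -1;
    table[first_free].state = SLOT_USED;
    table[first_free].key = key;
    return first_free;
}

/* Delete by marking the slot, so later synonyms stay reachable.
 * Returns 1 if the key was found and marked, 0 otherwise. */
int ts_delete(struct slot table[TS_SIZE], int key)
{
    int pos = ts_search(table, key);
    if (pos < 0)
        return 0;
    table[pos].state = SLOT_DELETED;
    return 1;
}
```

After inserting 0, 11, and 22 and then deleting 11, a search for 22 still succeeds because the marked slot 1 does not stop the probe. Had slot 1 been reset to empty, the search would have ended there and reported 22 as missing.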
3. Search
The process of searching a hash table shows that:
1) Although a hash table maps a keyword directly to the storage location of its record, conflicts mean that a lookup is still a process of comparing the given value with keywords. The average search length therefore still serves as the measure of the lookup efficiency of a hash table.
2) The number of keywords that must be compared with the given value during a search depends on three factors: the hash function, the method for handling conflicts, and the load factor of the hash table. In general, for hash tables that handle conflicts in the same way, the average search length depends on the load factor.
Load factor = (number of records in the table) / (length of the hash table). The smaller the load factor, the lower the chance of a conflict. The larger the load factor, the more records the table already holds, the more likely a new record is to conflict when it is inserted, and the more keywords have to be compared with the given value during a search.
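The definition above is a one-line computation. The sketch below also evaluates two textbook approximations of the average search length for a successful search as a function of the load factor α: about (1 + 1/(1 - α)) / 2 for linear probing and about 1 + α/2 for the zipper method. These approximations come from standard analyses of hashing rather than from this article, and the function names are hypothetical.

```c
/* Load factor alpha = records / table length; table_len must be > 0. */
double load_factor(unsigned long records, unsigned long table_len)
{
    return (double) records / (double) table_len;
}

/* Approximate average search length (successful search) for
 * linear probing. Requires 0 <= alpha < 1. */
double asl_linear_probing(double alpha)
{
    return 0.5 * (1.0 + 1.0 / (1.0 - alpha));
}

/* Approximate average search length (successful search) for
 * the zipper method. alpha may exceed 1 here. */
double asl_chaining(double alpha)
{
    return 1.0 + alpha / 2.0;
}
```

At α = 0.5 the two estimates are 1.5 and 1.25 comparisons; at α = 0.75 the linear probing estimate has already grown to 2.5, which illustrates why a larger load factor means more comparisons.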
4. Using and learning hash tables
First, understand the zipper method, because many applications implement their hash tables with it. The zipper method can be thought of as an array of linked lists:
1) Each element of the array is a linked list.
2) All nodes in one linked list have the same hash value, and that hash value is the subscript of the array element.
Second, learn three operations: initializing the hash table, inserting an element, and searching for an element.
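The three operations can be sketched for the zipper method as follows. This is a minimal illustration, not the implementation used in the later articles of the series: it assumes integer keys and values, a fixed bucket count of 8, and key mod 8 as the hash function, and every `ch_` name is hypothetical.

```c
#include <stdlib.h>

#define CH_BUCKETS 8   /* number of array elements (arbitrary for this sketch) */

struct ch_node {
    int key;
    int value;
    struct ch_node *next;
};

/* The table is an array of linked lists, one list per hash value. */
struct ch_table {
    struct ch_node *buckets[CH_BUCKETS];
};

/* The hash value is the subscript of the array element. */
static unsigned int ch_index(int key)
{
    return (unsigned int) key % CH_BUCKETS;
}

/* Initialization: every list starts out empty. */
void ch_init(struct ch_table *t)
{
    int i;
    for (i = 0; i < CH_BUCKETS; i++)
        t->buckets[i] = NULL;
}

/* Search: walk the one list selected by the hash value.
 * Returns the node holding key, or NULL if it is absent. */
struct ch_node *ch_search(const struct ch_table *t, int key)
{
    struct ch_node *n = t->buckets[ch_index(key)];
    while (n != NULL && n->key != key)
        n = n->next;
    return n;
}

/* Insert: update an existing node or push a new one at the list head.
 * Returns 0 on success, -1 if memory allocation fails. */
int ch_insert(struct ch_table *t, int key, int value)
{
    unsigned int idx = ch_index(key);
    struct ch_node *n = ch_search(t, key);

    if (n != NULL) {
        n->value = value;
        return 0;
    }
    n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->key = key;
    n->value = value;
    n->next = t->buckets[idx];
    t->buckets[idx] = n;
    return 0;
}

/* Release every node; the table can be reused after ch_init. */
void ch_destroy(struct ch_table *t)
{
    int i;
    for (i = 0; i < CH_BUCKETS; i++) {
        struct ch_node *n = t->buckets[i];
        while (n != NULL) {
            struct ch_node *next = n->next;
            free(n);
            n = next;
        }
        t->buckets[i] = NULL;
    }
}
```

Keys 1 and 9 both map to bucket 1 and simply share that list, so inserting one never displaces the other, and a search only compares keys within a single list.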