Part II: A detailed analysis of the hash table algorithm
What is a hash?
Hash (the term is usually translated by meaning, though a direct transliteration is also common) means taking an input of arbitrary length (also called the pre-image) and transforming it, through a hash algorithm, into a fixed-length output; that output is the hash value. This transformation is a compressing map: the space of hash values is usually much smaller than the input space, so different inputs may hash to the same output, and the input cannot be uniquely determined from the hash value. Put simply, a hash is a function that compresses a message of any length into a message digest of fixed length.
Hashing is mainly used in cryptographic algorithms in the information security field, where it turns messages of varying lengths into a scrambled fixed-length code (128 bits in the case of MD5, for example); these codes are called hash values. One can also say that hashing establishes a mapping between the content of a piece of data and the address where that data is stored.
An array is easy to address but hard to insert into and delete from; a linked list is hard to address but easy to insert into and delete from. So can we combine the strengths of both and build a data structure that is easy to address and also easy to insert into and delete from? The answer is yes: this is the hash table. A hash table can be implemented in several ways; I will explain the most common one, the zipper method (separate chaining), which we can think of as an "array of linked lists", as shown:
On the left is the array. Each member of the array holds a pointer to the head of a linked list; the list may be empty or may contain many elements. We assign elements to different linked lists according to some feature of each element, and we use the same feature to find the correct linked list and then find the element within that list.
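The "array of linked lists" idea can be sketched in C roughly as follows. This is only a minimal illustration, not the layout used in the figure: the names chained_table, chained_insert and chained_contains, the bucket count of 16, and the use of a plain modulo to pick the list are all assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>

#define BUCKET_COUNT 16   /* assumed table size */

/* One node of a bucket's singly linked list. */
typedef struct node {
    unsigned int value;
    struct node *next;
} node;

/* The "array of linked lists": each slot points to the head of a chain
   (NULL when the chain is empty). */
typedef struct {
    node *buckets[BUCKET_COUNT];
} chained_table;

/* Picks the chain for a value; a plain modulo is used here for illustration. */
static unsigned int bucket_of(unsigned int value) {
    return value % BUCKET_COUNT;
}

/* Inserts at the head of the chain, which takes constant time.
   Returns 0 on success, -1 if memory runs out. */
int chained_insert(chained_table *t, unsigned int value) {
    assert(t != NULL);
    node *n = malloc(sizeof *n);
    if (n == NULL) return -1;
    unsigned int b = bucket_of(value);
    n->value = value;
    n->next = t->buckets[b];
    t->buckets[b] = n;
    return 0;
}

/* Walks only the one chain the value can be in.
   Returns 1 if the value is present, 0 otherwise. */
int chained_contains(const chained_table *t, unsigned int value) {
    assert(t != NULL);
    for (const node *n = t->buckets[bucket_of(value)]; n != NULL; n = n->next)
        if (n->value == value) return 1;
    return 0;
}
```

A table declared as `chained_table t = {0};` starts with every chain empty. Addressing the bucket is as cheap as indexing an array, and insertion is as cheap as pushing onto a list, which is the combination the paragraph above is after.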
The method that turns an element's feature into an array subscript is the hash method. There is of course more than one hash method; three commonly used ones are listed below:
1, Division hash method
The most intuitive one. The figure above uses this hash method. Formula:
index = value % 16
Anyone who has studied assembly knows that the modulus is actually obtained through a division operation, hence the name "division hash method".
2, Square hash method
Computing the index is a very frequent operation, and multiplication is cheaper than division (though on today's CPUs we would probably not notice the difference), so we consider replacing the division with a multiplication and a shift. Formula:
index = (value * value) >> 28 (a right shift by 28 bits, i.e. division by 2^28. Mnemonic: a left shift makes the value larger, so it multiplies; a right shift makes it smaller, so it divides.)
This method gives good results if the values are evenly distributed, but for the element values in the figure I drew above, every computed index is 0, which is a complete failure. You may also ask: if value is large, won't value * value overflow? It will, but this multiplication does not care about overflow, because we are not after the product itself, only the index.
3, Fibonacci hash method
The weakness of the square hash method is obvious, so can we find an ideal multiplier instead of using value itself as the multiplier? The answer is yes.
1, for 16-bit integers, this multiplier is 40503
2, for 32-bit integers, this multiplier is 2654435769
3, for 64-bit integers, this multiplier is 11400714819323198485
How are these "ideal multipliers" obtained? They come from the golden ratio: each multiplier is approximately 2^n divided by the golden ratio (about 1.618), where n is the number of bits in the integer. The most classic expression of the golden ratio is undoubtedly the famous Fibonacci sequence, that is, a sequence of this form: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, ..., in which the ratio of consecutive terms approaches the golden ratio. It is also sometimes said that the ratios between Fibonacci numbers roughly match the ratios between the orbital radii of the eight planets in the solar system.
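One way to check where these constants come from: each is roughly 2^n divided by the golden ratio. The sketch below recomputes the 16-bit and 32-bit multipliers that way. The function name fib_multiplier is made up for this example, and the 64-bit constant is deliberately left out because a double does not have enough precision to reproduce it.

```c
#include <assert.h>
#include <stdint.h>

/* The golden ratio (1 + sqrt(5)) / 2, written out to double precision. */
#define GOLDEN_RATIO 1.6180339887498949

/* Returns floor(2^bits / golden ratio) for 1 <= bits <= 32.
   A double cannot hold the 64-bit result exactly, so that case is not handled. */
uint32_t fib_multiplier(unsigned bits) {
    assert(bits >= 1 && bits <= 32);
    double two_pow = (double)(1ULL << bits);
    return (uint32_t)(two_pow / GOLDEN_RATIO);
}
```

Under these assumptions, fib_multiplier(16) gives 40503 and fib_multiplier(32) gives 2654435769, matching the values listed above.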
For the 32-bit integers we commonly use, the formula is:
index = (value * 2654435769) >> 28
If this Fibonacci hash method is used, the figure above becomes:
Clearly, the Fibonacci hash method distributes the elements much better than the original division (modulo) hash method.
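The three formulas above can be put side by side in C. This is a small sketch for 32-bit unsigned values and a 16-slot table; the function names and the helper count_slots are invented for the example, and the sample values are not the ones from the figure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 1. Division hashing: the remainder selects one of 16 slots. */
uint32_t division_hash(uint32_t value) {
    return value % 16;
}

/* 2. Square hashing: keep the top 4 bits of the 32-bit square.
   Unsigned overflow simply wraps, which is harmless here. */
uint32_t square_hash(uint32_t value) {
    return (value * value) >> 28;
}

/* 3. Fibonacci hashing: multiply by 2654435769 and keep the top 4 bits. */
uint32_t fibonacci_hash(uint32_t value) {
    return (value * 2654435769u) >> 28;
}

/* Counts how many of the n values land in each of the 16 slots
   under the given hash function. */
void count_slots(const uint32_t *values, size_t n,
                 uint32_t (*hash)(uint32_t), unsigned counts[16]) {
    assert(values != NULL && counts != NULL);
    for (int s = 0; s < 16; s++) counts[s] = 0;
    for (size_t i = 0; i < n; i++) counts[hash(values[i])]++;
}
```

For the small sample values 1, 2, 3 and 4, square_hash puts all four into slot 0, while fibonacci_hash spreads them over slots 9, 3, 13 and 7. That is the difference the text describes: small inputs defeat the square method, but a fixed golden-ratio multiplier still scatters them.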
Scope of application
A basic data structure for fast lookup and deletion; it usually requires that the total amount of data fit in memory.
Fundamentals and key points
Choice of hash function: strings, integers, and permutations each have their own specific hash methods.
Collision handling: one approach is open hashing, also known as the zipper method; the other is closed hashing, also known as open addressing.
Extensions
The d in d-left hashing means "multiple". To simplify the problem, let us first look at 2-left hashing. 2-left hashing splits a hash table into two halves of equal length, called T1 and T2, and gives each half its own hash function, H1 and H2. When a new key is stored, both hash functions are evaluated, yielding two addresses, H1[key] and H2[key]. You then check position H1[key] in T1 and position H2[key] in T2 to see which one already holds more (colliding) keys, and store the new key in the less loaded position. If both are equally loaded, for example both are empty or both hold one key, the new key goes into the left sub-table T1; this is where the name 2-left comes from. When looking up a key, you must hash twice and check both positions.
Problem example (massive data processing)
We know that hash tables are widely used in massive data processing. Below is another Baidu interview question:
Problem: Given massive log data, extract the IP that visited Baidu the most times on a given day.
Approach: The number of IPs is limited, at most 2^32, so you can consider hashing the IPs directly into memory and then counting them.
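The counting step could look something like the sketch below, which treats each IPv4 address as a 32-bit integer and keeps one counter per distinct address in a chained hash table. The function name most_frequent_ip, the bucket count of 4096, and the choice of Fibonacci hashing for the bucket index are all assumptions; log parsing is left out entirely.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define IP_BUCKETS 4096   /* assumed bucket count; a real log would want far more */

/* One counter per distinct IP, chained within its bucket. */
typedef struct ip_node {
    uint32_t ip;
    unsigned long count;
    struct ip_node *next;
} ip_node;

/* Returns the most frequent address in ips[0..n-1] and stores its count in
   *best_count. If n is 0 or the table cannot be allocated, returns 0 with
   *best_count set to 0. Counting stops early if a node allocation fails. */
uint32_t most_frequent_ip(const uint32_t *ips, size_t n, unsigned long *best_count) {
    assert(best_count != NULL);
    uint32_t best_ip = 0;
    *best_count = 0;

    ip_node **buckets = calloc(IP_BUCKETS, sizeof *buckets);
    if (buckets == NULL) return 0;

    for (size_t i = 0; i < n; i++) {
        /* Fibonacci hashing: the top 12 bits give a bucket in 0..4095. */
        size_t b = (uint32_t)(ips[i] * 2654435769u) >> 20;
        ip_node *node = buckets[b];
        while (node != NULL && node->ip != ips[i]) node = node->next;

        if (node == NULL) {                 /* first time this IP is seen */
            node = malloc(sizeof *node);
            if (node == NULL) break;
            node->ip = ips[i];
            node->count = 0;
            node->next = buckets[b];
            buckets[b] = node;
        }
        if (++node->count > *best_count) {
            *best_count = node->count;
            best_ip = node->ip;
        }
    }

    for (size_t b = 0; b < IP_BUCKETS; b++) {   /* release all chains */
        while (buckets[b] != NULL) {
            ip_node *next = buckets[b]->next;
            free(buckets[b]);
            buckets[b] = next;
        }
    }
    free(buckets);
    return best_ip;
}
```

Each log entry costs one hash computation and a short chain walk, so the whole pass is roughly linear in the number of entries, provided the distinct IPs fit in memory.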
Part III: The fastest hash table algorithm
Next, let's take a concrete look at one of the fastest hash table algorithms.
Let's start with a simple question: there is a huge array of strings, and you are given a single string and asked to find out whether it is in the array and, if so, where. How would you do it? The simplest way is to plod through the array from head to tail, comparing one string at a time, until the target is found. I think anyone who has studied programming can write such a program, but if a programmer handed a program like that to users, I would be left speechless. Maybe it really does work, but that is about all that can be said for it.
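For reference, the head-to-tail approach just described amounts to something like the following sketch (the function name linear_find is made up for this example):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the index of target in strings[0..count-1], or -1 if it is absent.
   A lookup may have to compare against every entry, so it costs O(count). */
long linear_find(const char *const *strings, size_t count, const char *target) {
    assert(strings != NULL && target != NULL);
    for (size_t i = 0; i < count; i++)
        if (strcmp(strings[i], target) == 0) return (long)i;
    return -1;
}
```

Every lookup walks the array and runs a full string comparison per entry, which is exactly the cost the hash table is meant to avoid.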
The most appropriate approach is of course a hash table. First, some basics: a so-called hash is generally an integer; through some algorithm, a string can be "compressed" into an integer. Of course, a 32-bit integer can never be mapped back to a unique string, but in a program the probability that two different strings produce equal hash values is very small. Let's look at the hash algorithm used in MPQ:
Function one: the following function generates crypttable[0x500], a table of length 0x500 (1280 in decimal).
unsigned long crypttable[0x500];   /* filled by preparecrypttable() */

void preparecrypttable()
{
    unsigned long seed = 0x00100001, index1 = 0, index2 = 0, i;

    for ( index1 = 0; index1 < 0x100; index1++ )
    {
        /* each pass fills one entry in each of the five 0x100-entry blocks */
        for ( index2 = index1, i = 0; i < 5; i++, index2 += 0x100 )
        {
            unsigned long temp1, temp2;

            seed = (seed * 125 + 3) % 0x2AAAAB;
            temp1 = (seed & 0xFFFF) << 0x10;   /* high 16 bits */

            seed = (seed * 125 + 3) % 0x2AAAAB;
            temp2 = (seed & 0xFFFF);           /* low 16 bits */

            crypttable[index2] = ( temp1 | temp2 );
        }
    }
}
Function two: the following function computes the hash value of the string lpszfilename. dwHashType is the type of hash and can take the values 0, 1, or 2. Function three below, gethashtablepos, calls this function. The function returns the hash value of the string lpszfilename:
#include <ctype.h>   /* toupper */

/* Written for a 32-bit unsigned long; crypttable must already be filled in. */
unsigned long hashstring( char *lpszfilename, unsigned long dwHashType )
{
    unsigned char *key = (unsigned char *)lpszfilename;
    unsigned long seed1 = 0x7fed7fed;
    unsigned long seed2 = 0xeeeeeeee;
    int ch;

    while ( *key != 0 )
    {
        ch = toupper( *key++ );   /* the hash is case-insensitive */

        seed1 = crypttable[(dwHashType << 8) + ch] ^ (seed1 + seed2);
        seed2 = ch + seed1 + seed2 + (seed2 << 5) + 3;
    }
    return seed1;
}
Blizzard's algorithm is very efficient and is known as a "one-way hash" (a one-way hash is an algorithm that is constructed in such a way that deriving the original string (set of strings, actually) is virtually impossible). For example, the string "UNITNEUTRALACRITTER.GRP" run through this algorithm gives 0xA26067F3.
Is it enough, then, to improve the first algorithm by comparing the hash values of the strings one by one instead? The answer is: far from it. To get the fastest algorithm, you cannot compare one by one at all. The usual solution is to build a hash table. A hash table is a large array whose capacity is chosen according to the needs of the program, for example 1024. Each hash value is mapped by a modulo operation (mod) to a position in the array, so simply checking whether the position corresponding to the string's hash value is occupied gives the final result. Think about what speed that is. Yes, it is the fastest possible: O(1). Now let's take a closer look at this algorithm:
typedef struct { int nhasha; int nhashb; char bexists; ...... } somestructure;
One possible definition of the struct.
Function three: the following function looks up whether the target string exists in the hash table. If it does, the function returns the hash value of the string; if not, it returns -1.
int gethashtablepos( char *lpszstring, somestructure *lptable )
// lpszstring is the string to look up in the hash table; lptable is the hash table that stores the hash values of the strings.
{
    int nhash = hashstring (lpszstring);