[Algorithm] A thorough walk through hash table algorithms from start to end

Last Update:2018-12-04 Source: Internet

Author: User

Tags blizzard


Note: This article is divided into three parts.
The first part explains the top-K algorithm behind a Baidu interview question; the second part describes hash table algorithms in detail; the third part builds the fastest hash table algorithm.
------------------------------------

Part 1: Explanation of the top K Algorithm
Problem description
Baidu interview questions:
The search engine uses log files to record every query string that users search for. Each query string is 1 to 255 bytes long.
Suppose there are currently 10 million records. These query strings are highly repetitive: although the total is 10 million, no more than 3 million remain once duplicates are removed. The more often a query string repeats, the more users have searched for it, that is, the more popular it is. Count the 10 most popular query strings, using no more than 1 GB of memory.

Required knowledge:
What is a hash table?
A hash table (also called a hash map) is a data structure that is accessed directly by key. That is, it maps each key to a location in the table and accesses the record there, which speeds up lookup. The mapping function is called a hash function, and the array that stores the records is called a hash table.

The hash table method is actually very simple: a fixed algorithm, the hash function, converts the key into an integer; that integer is taken modulo the array length; the remainder is used as the array subscript; and the value is stored in the array slot with that subscript.
To query a hash table, the hash function is applied again to convert the key into the corresponding array subscript, and the value is read from that slot. In this way the constant-time indexing of an array is fully exploited to locate data (Parts 2 and 3 of this article describe hash tables in detail).

Problem Analysis:
To find the most popular queries, we must first count how many times each query appears and then pick the top 10 from those counts. We can therefore design the algorithm in two steps:

Step 1: Query statistics
Query statistics can be computed in either of the following two ways:
1. Direct sorting
The first algorithm that comes to mind is sorting. We sort all the queries in the log and then traverse the sorted queries, counting the number of times each one appears.

However, the question clearly requires that memory use not exceed 1 GB. With 10 million records of up to 255 bytes each, the data can occupy about 2.55 GB (roughly 2.37 GiB), so an in-memory sort does not meet the requirement.

Recall from the data structures course that when the data is too large to load into memory, we can use external sorting. Here we can use merge sort, because it has a good time complexity of O(n log n).

After sorting, we traverse the sorted query file, count the number of times each query appears, and write the counts to a file again.

Overall, sorting takes O(n log n) time and the traversal takes O(n), so the total time complexity of this algorithm is O(n + n log n) = O(n log n).
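The sort-then-scan idea can be sketched in C as follows. This is only an in-memory illustration: the standard library qsort stands in for the external merge sort, and the helper name most_frequent_by_sorting is an assumption made for this example, not something from the original article.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C strings. */
static int cmp_str(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/* Sorts the queries in place, then scans them once to find the longest run
 * of equal strings. Returns the length of that run and stores the query in
 * *best. (Hypothetical helper; an in-memory stand-in for the external sort.) */
size_t most_frequent_by_sorting(const char **queries, size_t n,
                                const char **best)
{
    assert(n > 0);
    qsort(queries, n, sizeof queries[0], cmp_str);

    size_t best_count = 0, run = 1;
    for (size_t i = 1; i <= n; i++) {
        if (i < n && strcmp(queries[i], queries[i - 1]) == 0) {
            run++;                      /* same query as before: extend the run */
        } else {
            if (run > best_count) {     /* run ended: keep it if it is the longest */
                best_count = run;
                *best = queries[i - 1];
            }
            run = 1;
        }
    }
    return best_count;
}
```

After sorting, equal queries sit next to each other, so a single O(n) pass is enough to count every run.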

2. Hash Table Method
In method 1 we used sorting to count the occurrences of each query, with a time complexity of O(n log n). Is there a better way to store the data that also has a lower time complexity?

The question states that although there are 10 million queries, the high repetition means there are really only 3 million distinct queries of at most 255 bytes each (at most about 765 MB of string data), so we can consider keeping them all in memory. We now only need a suitable data structure, and a hash table is clearly the first choice, because hash table lookups are very fast, with almost O(1) time complexity.

The algorithm is then: maintain a hash table whose key is the query string and whose value is the number of occurrences of that query. Read one query at a time. If the string is not in the table, add it with a value of 1; if it is already in the table, add one to its count. In the end we have processed this massive data set in O(n) time.
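A minimal sketch of this counting step in C is shown below. It assumes a small chained hash table with a fixed bucket count; the names NBUCKETS, count_query and lookup_count, and the multiply-by-33 string hash, are choices made for this illustration rather than part of the original article.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1021   /* small prime, chosen only for this sketch */

struct entry {
    char *query;          /* owned copy of the query string */
    unsigned long count;  /* number of times it has been seen */
    struct entry *next;   /* next entry in the same bucket */
};

static struct entry *buckets[NBUCKETS];

/* Simple multiply-by-33 string hash (an assumption for this sketch). */
static unsigned long hash_str(const char *s)
{
    unsigned long h = 0;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Records one occurrence of query and returns its updated count
 * (0 if memory allocation fails). */
unsigned long count_query(const char *query)
{
    assert(query != NULL);
    unsigned long idx = hash_str(query) % NBUCKETS;

    for (struct entry *e = buckets[idx]; e != NULL; e = e->next) {
        if (strcmp(e->query, query) == 0)
            return ++e->count;          /* already in the table: add one */
    }

    struct entry *e = malloc(sizeof *e);
    if (e == NULL)
        return 0;
    e->query = malloc(strlen(query) + 1);
    if (e->query == NULL) {
        free(e);
        return 0;
    }
    strcpy(e->query, query);
    e->count = 1;                       /* first occurrence */
    e->next = buckets[idx];
    buckets[idx] = e;
    return 1;
}

/* Returns how many times query has been counted so far (0 if never). */
unsigned long lookup_count(const char *query)
{
    assert(query != NULL);
    unsigned long idx = hash_str(query) % NBUCKETS;
    for (struct entry *e = buckets[idx]; e != NULL; e = e->next) {
        if (strcmp(e->query, query) == 0)
            return e->count;
    }
    return 0;
}
```

Calling count_query once per log line gives the O(n) pass described above; a real implementation would size the table for about 3 million distinct keys.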

Compared with algorithm 1, this method improves the time complexity from O(n log n) to O(n), but the gain is not only in time complexity: this method reads the data file only once, whereas algorithm 1 performs many I/O passes. Algorithm 2 is therefore also more practical than algorithm 1 from an engineering point of view.

Step 2: Find the top 10
Algorithm 1: normal sorting
We will not go into the details of sorting algorithms here. Note only that sorting has a time complexity of O(n log n), and that in this problem the 3 million records can be held in 1 GB of memory.

Algorithm 2: Partial sorting
The question only asks for the top 10, so we do not need to sort all the queries. We only need to maintain an array of size 10: initialize it with 10 queries sorted by count from largest to smallest, then traverse the 3 million records. Each record read is compared with the last (smallest) query in the array. If its count is smaller, continue traversing; otherwise, evict the last element of the array and insert the current query at its sorted position. After all the data has been traversed, the 10 queries in the array are the top 10 we are looking for.

The worst-case time complexity of this algorithm is O(N * K), where K is the number of top entries wanted.
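The partial-sorting idea can be sketched as follows. To keep the example short it works on the counts alone (a real program would carry the query string alongside each count), and the function name top_k_by_insertion is an assumption for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Keeps the k largest values of counts[0..n-1] in top[0..k-1], sorted from
 * largest to smallest. Requires n >= k >= 1. (Hypothetical helper.) */
void top_k_by_insertion(const unsigned long *counts, size_t n,
                        unsigned long *top, size_t k)
{
    assert(k >= 1 && n >= k);
    size_t filled = 0;                  /* number of valid entries in top[] */

    for (size_t i = 0; i < n; i++) {
        unsigned long c = counts[i];

        /* Array is full and c does not beat the current minimum: skip it. */
        if (filled == k && c <= top[k - 1])
            continue;

        /* Start at a new slot while filling, otherwise overwrite the last
         * (smallest) element, then shift smaller elements right by one. */
        size_t j = (filled < k) ? filled++ : k - 1;
        while (j > 0 && top[j - 1] < c) {
            top[j] = top[j - 1];
            j--;
        }
        top[j] = c;
    }
}
```

Each accepted element may shift up to K entries, which is where the O(N * K) worst case comes from.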

Algorithm 3: heap
In algorithm 2 we improved the time complexity from O(n log n) to O(N * K), which is a big improvement. But is there a better way?

Analysis: in algorithm 2, each insertion costs O(K), because the element is inserted into a linear list using sequential comparison. Note, however, that the array is ordered, so we can use binary search for each lookup, which reduces the search cost to O(log K). The remaining problem is data movement: the elements after the insertion point still have to be shifted. Even so, this variant is better than the original algorithm 2.

Given this analysis, is there a data structure that supports both fast search and fast movement of elements? The answer is yes: the heap.
With a heap we can search, adjust, and move elements in logarithmic time. So the algorithm can be improved as follows: maintain a min-heap of size K (10 in this question), traverse the 3 million queries, and compare each with the root element.

The idea is the same as in algorithm 2, except that algorithm 3 replaces the array with a min-heap, which reduces the cost of handling each element from O(K) to O(log K).
With the heap, algorithm 3 therefore reduces the final time complexity to O(N' * log K), where N' is the 3 million distinct queries, a great improvement over algorithm 2.
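A minimal sketch of the min-heap variant is given below. As before it tracks only the counts, and the names sift_down and top_k_by_heap are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Restores the min-heap property for the subtree rooted at index i. */
static void sift_down(unsigned long *heap, size_t k, size_t i)
{
    for (;;) {
        size_t left = 2 * i + 1, right = left + 1, smallest = i;
        if (left < k && heap[left] < heap[smallest])
            smallest = left;
        if (right < k && heap[right] < heap[smallest])
            smallest = right;
        if (smallest == i)
            return;
        unsigned long tmp = heap[i];
        heap[i] = heap[smallest];
        heap[smallest] = tmp;
        i = smallest;
    }
}

/* Leaves the k largest values of counts[0..n-1] in heap[0..k-1], arranged
 * as a min-heap, so heap[0] is the k-th largest value.
 * Requires n >= k >= 1. (Hypothetical helper.) */
void top_k_by_heap(const unsigned long *counts, size_t n,
                   unsigned long *heap, size_t k)
{
    assert(k >= 1 && n >= k);

    /* Build a min-heap from the first k counts. */
    for (size_t i = 0; i < k; i++)
        heap[i] = counts[i];
    for (size_t i = k / 2; i-- > 0; )
        sift_down(heap, k, i);

    /* Any later count larger than the root replaces it: O(log k) each. */
    for (size_t i = k; i < n; i++) {
        if (counts[i] > heap[0]) {
            heap[0] = counts[i];
            sift_down(heap, k, 0);
        }
    }
}
```

The root of the min-heap is always the smallest of the current top K, so one comparison decides whether a new element belongs in the result.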

Summary:
This completes the algorithm. In step 1, a hash table counts the number of times each query appears in O(N). In step 2, a heap finds the top 10 in N' * O(log K). The final time complexity is therefore O(N) + N' * O(log K), where N is 10 million and N' is 3 million. If you have a better algorithm, please leave a comment. This concludes Part 1.

Part 2: detailed analysis of Hash Table Algorithms

What is hash?
A hash function (the term is sometimes translated by meaning, as "scatter", and sometimes simply transliterated) takes an input of any length (also called the pre-image) and converts it, through a hash algorithm, into an output of fixed length. That output is the hash value. The conversion is a compressing mapping: the space of hash values is usually much smaller than the space of inputs, so different inputs may hash to the same output, and it is impossible to uniquely determine the input from the hash value. Simply put, it is a function that compresses a message of any length into a fixed-length message digest.

Hashing is widely used in encryption algorithms in the information security field, where it converts information of varying length into a scrambled fixed-length code (for example, 128 bits) called the hash value. One can also say that hashing finds a mapping between the content of the data and the address where the data is stored.

Arrays are easy to address but hard to insert into and delete from; linked lists are hard to address but easy to insert into and delete from. Can we combine the two to get a data structure that is both easy to address and easy to insert into and delete from? The answer is yes: this is the hash table. Hash tables can be implemented in many ways. What I explain next is the most commonly used one, the zipper method (separate chaining), which can be understood as an "array of linked lists".

In the original figure (not reproduced here), the left side is an array. Each member of the array holds a pointer to the head of a linked list, which may be empty or may contain many elements. We distribute elements among the linked lists according to some feature of each element, and we use the same feature to find the right linked list and then find the element within it.

The method that converts an element's feature into an array subscript is the hash method. There is of course more than one; several are listed below:

1. Division hash
The most intuitive is the division hash method. The formula is:
Index = Value % 16
Anyone who has studied assembly knows that the modulus is actually obtained through a division operation, which is why this is called the "division hash method".

2. Square hash method
Computing the index is a very frequent operation, and multiplication is cheaper than division (although on current CPUs the difference is hardly noticeable), so we would like to replace the division with a multiplication and a shift. The formula is:
Index = (value * value) >> 28 (a right shift by 28 bits, that is, division by 2^28. Note: a left shift makes the value larger and is a multiplication; a right shift makes it smaller and is a division.)
If the values are distributed fairly evenly, this method gives good results, but for the element values in the figure I drew above, every index comes out as 0, which is a complete failure. You may also wonder whether value * value overflows when value is large. It does, but we do not care about the overflow, because we are not after the product itself, only the index.

3. Fibonacci hash

The disadvantage of the square hash method is obvious, so can we find an ideal multiplier instead of using the value itself as the multiplier? The answer is yes.

1. For a 16-bit integer, the multiplier is 40503.
2. For a 32-bit integer, the multiplier is 2654435769.
3. For a 64-bit integer, the multiplier is 11400714819323198485.

How are these "ideal multipliers" obtained? They come from the golden ratio: each multiplier is approximately 2^w divided by the golden ratio 1.618..., where w is the word size in bits. The most classic expression of the golden ratio is undoubtedly the famous Fibonacci sequence: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, .... The Fibonacci sequence is also said to match the ratios of the orbital radii of the eight planets of the solar system surprisingly well.

For our common 32-bit integer, the formula is as follows:
Index = (value * 2654435769) >> 28

If the Fibonacci hash is used, the figure above changes (the original figure is not reproduced here).

Clearly, the distribution after switching to the Fibonacci hash is much better than with the original division hash.
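The three index formulas can be compared with a short C sketch. It assumes 32-bit values and a 16-slot table, as in the formulas above; the function names and the extra bits parameter of fibonacci_hash (which generalizes the shift to tables of 2^bits slots) are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Division hash: remainder modulo a table size of 16. */
uint32_t division_hash(uint32_t value)
{
    return value % 16;
}

/* Square hash: top 4 bits of the low 32 bits of value * value. */
uint32_t square_hash(uint32_t value)
{
    return (uint32_t)((uint64_t)value * value) >> 28;
}

/* Fibonacci hash: multiply by 2654435769 (about 2^32 divided by the golden
 * ratio), keep the low 32 bits, and take the top `bits` bits as the index
 * into a table of 2^bits slots. bits = 4 gives the ">> 28" formula. */
uint32_t fibonacci_hash(uint32_t value, unsigned bits)
{
    assert(bits >= 1 && bits <= 32);
    return (uint32_t)(value * 2654435769u) >> (32 - bits);
}
```

For the small values 1, 2, 3 and 4, square_hash returns 0 every time, while fibonacci_hash with bits = 4 returns 9, 3, 13 and 7, which illustrates why the multiplier spreads clustered keys across the table.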

Applicability
A basic data structure for fast lookup and deletion; it usually requires that the whole data set fit in memory.

Basic principles and key points
Choice of hash function: there are specific hash methods for strings, for integers, and for permutations.
Collision handling: one approach is open hashing, also known as the zipper method (separate chaining); the other is closed hashing, also known as open addressing.

Extension
The d in d-left hashing means "multiple". To simplify the problem, let us first look at 2-left hashing. 2-left hashing divides a hash table into two halves of equal length, T1 and T2, and gives each its own hash function, H1 for T1 and H2 for T2. When a new key is stored, both hash functions are computed, giving two addresses, H1[key] and H2[key]. We then check location H1[key] in T1 and location H2[key] in T2 to see which already holds more keys (that is, has more collisions), and store the new key in the less loaded location. If both locations hold the same number of keys, for example both are empty or both hold one key, the new key is stored in the left subtable T1, which is where the name 2-left comes from. When looking up a key, both hashes must be computed and both locations queried.
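The 2-left rule can be sketched in C as follows. The table sizes, the bucket capacity, and the two hash functions h1 and h2 are stand-ins chosen only for this illustration; any two reasonably independent hash functions would serve.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HALF_SIZE  8   /* buckets per subtable (assumed for this sketch) */
#define BUCKET_CAP 4   /* keys per bucket (assumed for this sketch) */

struct bucket {
    uint32_t keys[BUCKET_CAP];
    int used;                       /* number of keys stored in the bucket */
};

static struct bucket t1[HALF_SIZE], t2[HALF_SIZE];

/* Two stand-in hash functions, one per subtable. */
static uint32_t h1(uint32_t key)
{
    return ((uint32_t)(key * 2654435769u) >> 16) % HALF_SIZE;
}

static uint32_t h2(uint32_t key)
{
    return (key ^ (key >> 3)) % HALF_SIZE;
}

/* Stores key in the less loaded of its two candidate buckets, preferring
 * the left subtable T1 on a tie. Returns false if both buckets are full. */
bool two_left_insert(uint32_t key)
{
    struct bucket *b1 = &t1[h1(key)];
    struct bucket *b2 = &t2[h2(key)];
    struct bucket *target = (b2->used < b1->used) ? b2 : b1;

    assert(target->used <= BUCKET_CAP);
    if (target->used == BUCKET_CAP)
        return false;               /* the less loaded bucket is already full */
    target->keys[target->used++] = key;
    return true;
}

/* A lookup must examine both candidate buckets. */
bool two_left_contains(uint32_t key)
{
    const struct bucket *cand[2] = { &t1[h1(key)], &t2[h2(key)] };
    for (int t = 0; t < 2; t++) {
        for (int i = 0; i < cand[t]->used; i++) {
            if (cand[t]->keys[i] == key)
                return true;
        }
    }
    return false;
}
```

Choosing the less loaded of two candidate locations keeps the buckets more evenly filled than a single hash function would, at the cost of checking two locations on every lookup.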

Example problem (massive data processing)
We know that hash tables are widely used in massive data processing. Consider another Baidu interview question:
Question: from massive log data, extract the IP address that visited Baidu the most times on a given day.
Solution: the number of IP addresses is limited, at most 2^32, so the IP addresses can be hashed and counted directly in memory.

Part 3: fastest Hash Table Algorithm

Next, let us analyze the fastest hash table algorithm.
Let us start, step by step, with a simple question: given a huge array of strings and a single separate string, how do you find out whether that string exists in the array, and where? The simplest method is to compare honestly, one by one, from beginning to end until the string is found. Anyone who has learned programming can write such a program, but if a programmer handed it to users, I would be left speechless. It may well work, but that is about all that can be said for it.

The most suitable approach is naturally a hash table. First, some basics: a hash is generally an integer, and a hash algorithm compresses a string into that integer. Of course, a 32-bit integer cannot correspond one-to-one with strings, but in practice the probability that two strings produce the same hash value is very small. Let us look at the hash algorithm used in MPQ:

Function 1. The following function generates the array crypttable[0x500], of length 0x500 (1280 in decimal):

unsigned long crypttable[0x500];   /* filled by preparecrypttable() */

void preparecrypttable(void)
{
    unsigned long seed = 0x00100001, index1 = 0, index2 = 0, i;

    for (index1 = 0; index1 < 0x100; index1++)
    {
        for (index2 = index1, i = 0; i < 5; i++, index2 += 0x100)
        {
            unsigned long temp1, temp2;

            /* Two steps of a linear congruential generator; each step
             * contributes 16 bits of the 32-bit table entry. */
            seed = (seed * 125 + 3) % 0x2aaaab;
            temp1 = (seed & 0xFFFF) << 0x10;

            seed = (seed * 125 + 3) % 0x2aaaab;
            temp2 = (seed & 0xFFFF);

            crypttable[index2] = (temp1 | temp2);
        }
    }
}

Function 2. The following function computes the hash value of the string lpszfilename. dwhashtype is the hash type, which can be 0, 1, or 2; the function is called from gethashtablepos (Functions 3 and 4). It returns the hash value of lpszfilename:

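/* Requires <ctype.h> for toupper() and the crypttable[] filled by Function 1. */
unsigned long hashstring(char *lpszfilename, unsigned long dwhashtype)
{
    unsigned char *key = (unsigned char *)lpszfilename;
    unsigned long seed1 = 0x7fed7fed;
    unsigned long seed2 = 0xeeeeeeee;
    int ch;

    while (*key != 0)
    {
        ch = toupper(*key++);

        seed1 = crypttable[(dwhashtype << 8) + ch] ^ (seed1 + seed2);
        seed2 = ch + seed1 + seed2 + (seed2 << 5) + 3;
    }
    /* unsigned long may be wider than 32 bits; keep the low 32 bits so the
     * result matches the original 32-bit arithmetic. */
    return seed1 & 0xFFFFFFFFUL;
}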

Blizzard's algorithm is very efficient. It is a "one-way hash" (a one-way hash is an algorithm constructed in such a way that deriving the original string (a set of strings, actually) is virtually impossible). For example, the string "unit\neutral\acritter.grp" hashes to 0xa26067f3 with this algorithm.

Can we improve on the first approach simply by comparing the hash values of the strings one by one? That is still far from enough. To get the fastest algorithm, we must not compare one by one at all; the usual solution is to build a hash table. A hash table is a large array whose size is chosen according to the needs of the program, for example 1024. Each hash value is mapped to a position in the array by a modulo operation, so we only need to check the position corresponding to the string's hash value to get the final result. How fast is that? It is the fastest possible, O(1). Now let us look at this algorithm more closely:

typedef struct
{
    int nhasha;
    int nhashb;
    char bexists;
    /* ...... */
} somestructure;
(One possible definition of the struct.)

Function 3. The following function looks up the target string in the hash table. If the string is found, it returns the string's position in the table (its hash value modulo the table size); otherwise it returns -1.

/* lpszstring is the string to look for; lptable is the hash table.
 * This version assumes that ntablesize (the table length) is visible here
 * and that the struct also stores the string itself in a pstring field;
 * neither is shown in the original listing. Requires <string.h>. */
int gethashtablepos(char *lpszstring, somestructure *lptable)
{
    /* Call Function 2 to get the hash of the string; hash type 0 is assumed. */
    unsigned long nhash = hashstring(lpszstring, 0);
    int nhashpos = (int)(nhash % ntablesize);

    /* The slot is occupied and holds exactly the string we are looking for. */
    if (lptable[nhashpos].bexists && !strcmp(lptable[nhashpos].pstring, lpszstring))
    {
        return nhashpos;   /* position of the string in the table */
    }
    else
    {
        return -1;
    }
}

At this point everyone is probably thinking of a serious problem: "What if two strings map to the same position in the hash table?" After all, an array has limited capacity, so this is quite likely. There are many ways to solve this problem. The first that comes to mind is a linked list. Thanks to the data structures course I took at university, which taught me this tried-and-tested tool, many of the algorithmic problems I run into can be reduced to linked lists. If a linked list hangs off each entry of the hash table and holds all the strings that map to that entry, the problem is solved. That looks like a perfect ending, and if the problem were left to me alone, I would probably start defining data structures and writing code at this point.

However, Blizzard's programmers use a more sophisticated method. The basic principle is that they verify a string with three hash values rather than one.

MPQ uses a file name hash table to track all of its internal files, but the format of this table differs somewhat from a normal hash table. Rather than using a hash as the subscript and storing the actual file name in the table for verification, it does not store the file name at all. Instead it uses three different hashes: one as the subscript into the hash table and two for verification. The two verification hashes take the place of the actual file name.
Of course, it is still possible for two different file names to produce the same three hashes, but the average probability of this is 1 in 18889465931478580854784, which should be small enough for anyone. Returning to the data structure: Blizzard's hash table does not use linked lists; it resolves collisions by moving on to the following slot (linear probing). Let us look at the algorithm:

Function 4. lpszstring is the string to look for in the hash table; lptable is the hash table that stores the strings' hash values; ntablesize is the length of the hash table:

int gethashtablepos(char *lpszstring, mpqhashtable *lptable, int ntablesize)
{
    const int hash_offset = 0, hash_a = 1, hash_b = 2;

    /* The position hash stays unsigned so the remainder is never negative. */
    unsigned long nhash = hashstring(lpszstring, hash_offset);
    int nhasha = (int)hashstring(lpszstring, hash_a);
    int nhashb = (int)hashstring(lpszstring, hash_b);
    int nhashstart = (int)(nhash % ntablesize);
    int nhashpos = nhashstart;

    while (lptable[nhashpos].bexists)
    {
        /* To decide only whether the string exists in the table, comparing
         * these two hash values is enough; the strings stored in the struct
         * need not be compared. Does this speed things up? Does it reduce
         * the space the hash table occupies? Where is this method
         * generally used? */
        if (lptable[nhashpos].nhasha == nhasha
            && lptable[nhashpos].nhashb == nhashb)
        {
            return nhashpos;
        }
        else
        {
            nhashpos = (nhashpos + 1) % ntablesize;
        }

        if (nhashpos == nhashstart)
            break;
    }
    return -1;
}

Explanation of the procedure above:

1. Compute the three hash values of the string (one determines the position; the other two are used for verification).
2. Look at the corresponding position in the hash table.
3. Is that position empty? If so, the string does not exist in the table; return -1.
4. If it is not empty, check whether the other two hash values match. If they do, the string has been found; return its position.
5. Otherwise move to the next position, wrapping around to the start of the table after the last entry.
6. Check whether we are back at the starting position. If so, the string was not found; return -1.
7. Go back to step 3.

OK, that is the fastest hash table algorithm covered in this article. What? Not fast enough? :D Criticism and corrections are welcome.

--------------------------------------------
Supplement 1: a simple hash function:

/* key is a string and ntablelength is the length of the hash table.
 * The hash values produced by this function are distributed fairly evenly. */
unsigned long gethashindex(const char *key, int ntablelength)
{
    unsigned long nhash = 0;

    while (*key)
    {
        /* nhash * 33 + next character */
        nhash = (nhash << 5) + nhash + *key++;
    }

    return (nhash % ntablelength);
}

Supplement 2: a complete test program:
The array behind a hash table has a fixed length. If it is too large, space is wasted; if it is too small, the table loses its efficiency. A suitable array size is key to hash table performance, and the size is preferably a prime number. Different data volumes call for different table sizes. For applications whose data volume varies, the best design is a hash table whose size can change dynamically: when the table turns out to be too small, for example when the number of elements reaches twice the table size, the table is enlarged, generally to about double its size.

The following are possible values of the hash table size:

17, 37, 79, 163, 331,
673, 1361, 2729, 5471, 10949,
21911, 43853, 87719, 175447, 350899,
701819, 1403641, 2807303, 5614657, 11229331,
22458671, 44917381, 89834777, 179669557, 359339171,
718678369, 1437356741, 2147483647
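As a rough sketch of the growth rule, the helper below picks the next prime from the start of this list once the element count reaches twice the current size. The function name next_table_size and the decision to include only the first ten sizes are assumptions for this example; rehashing the stored elements into the new array is left out.

```c
#include <assert.h>
#include <stddef.h>

/* First ten sizes from the list above (truncated for this sketch). */
static const unsigned long table_sizes[] = {
    17, 37, 79, 163, 331, 673, 1361, 2729, 5471, 10949
};

/* Returns the size the table should have when it currently has `size`
 * slots and holds `count` elements: the next listed prime once count
 * reaches twice the size, otherwise the size unchanged. */
unsigned long next_table_size(unsigned long size, unsigned long count)
{
    assert(size > 0);
    if (count < 2 * size)
        return size;                        /* still roomy enough */

    size_t n = sizeof table_sizes / sizeof table_sizes[0];
    for (size_t i = 0; i < n; i++) {
        if (table_sizes[i] > size)
            return table_sizes[i];          /* roughly double, and prime */
    }
    return size;                            /* already at the largest listed size */
}
```

Each size in the list is a prime a little more than double the previous one, so growing by one step keeps the "double when too small" policy while preserving a prime table length.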

The complete source code of the program is as follows, which has been tested in Linux:

#include <stdio.h>
#include <ctype.h>   // for toupper(); thanks to citylove for pointing this out

// crypttable[] holds the data used by the hashstring function;
// it is initialized by preparecrypttable().
unsigned long crypttable[0x500];

// Generates the array crypttable[0x500], of length 0x500 (1280 in decimal).
void preparecrypttable(void)
{
    unsigned long seed = 0x00100001, index1 = 0, index2 = 0, i;

    for (index1 = 0; index1 < 0x100; index1++)
    {
        for (index2 = index1, i = 0; i < 5; i++, index2 += 0x100)
        {
            unsigned long temp1, temp2;

            seed = (seed * 125 + 3) % 0x2aaaab;
            temp1 = (seed & 0xFFFF) << 0x10;

            seed = (seed * 125 + 3) % 0x2aaaab;
            temp2 = (seed & 0xFFFF);

            crypttable[index2] = (temp1 | temp2);
        }
    }
}

// Computes the hash value of the string lpszfilename. dwhashtype is the
// hash type, which can be 0, 1, or 2; the function is called with each of
// these values in main below. Returns the hash value of lpszfilename.
unsigned long hashstring(char *lpszfilename, unsigned long dwhashtype)
{
    unsigned char *key = (unsigned char *)lpszfilename;
    unsigned long seed1 = 0x7fed7fed;
    unsigned long seed2 = 0xeeeeeeee;
    int ch;

    while (*key != 0)
    {
        ch = toupper(*key++);

        seed1 = crypttable[(dwhashtype << 8) + ch] ^ (seed1 + seed2);
        seed2 = ch + seed1 + seed2 + (seed2 << 5) + 3;
    }
    // unsigned long may be wider than 32 bits; keep only the low 32 bits.
    return seed1 & 0xFFFFFFFFUL;
}

// Prints the three hash values of argv[1], for example:
//   ./hash "arr\units.dat"
//   ./hash "unit\neutral\acritter.grp"
int main(int argc, char **argv)
{
    unsigned long ulhashvalue;
    int i = 0;

    if (argc != 2)
    {
        printf("Please input exactly one argument\n");
        return -1;
    }

    /* Initialize the array crypttable[0x500]. */
    preparecrypttable();

    /* Print the values in the array crypttable[0x500]. */
    for (; i < 0x500; i++)
    {
        if (i % 10 == 0)
        {
            printf("\n");
        }

        printf("%-12lx", crypttable[i]);
    }

    ulhashvalue = hashstring(argv[1], 0);
    printf("\n----%lx ----\n", ulhashvalue);

    ulhashvalue = hashstring(argv[1], 1);
    printf("----%lx ----\n", ulhashvalue);

    ulhashvalue = hashstring(argv[1], 2);
    printf("----%lx ----\n", ulhashvalue);

    return 0;
}

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
