Author: July, wuliming, pkuoliver
Source: http://blog.csdn.net/v_july_v.
Note: This article has three parts. Part 1 explains the Top K algorithm behind a Baidu interview question; Part 2 describes hash table algorithms in detail; Part 3 builds the fastest hash table algorithm.
------------------------------------
Part 1: Explanation of the Top K Algorithm
Problem description
Baidu interview question:
A search engine uses log files to record every query string that users submit. Each query string is 1 to 255 bytes long.
Suppose there are currently 10 million records. The query strings are highly repetitive: although the total is 10 million, no more than 3 million distinct strings remain once duplicates are removed. The more often a query string is repeated, the more users searched for it, and so the more popular it is. Find the 10 most popular query strings, using no more than 1 GB of memory.
Required knowledge:
What is a hash table?
A hash table is a data structure that is accessed directly by key. That is, it maps each key to a location in a table and accesses the record there, which speeds up lookup. The mapping function is called a hash function, and the array that stores the records is called a hash table.
The hash table method is actually very simple: a fixed algorithm, the hash function, converts the key into an integer; that integer is then taken modulo the array length, and the remainder is used as the array subscript. The value is stored in the array slot at that subscript.
To look a key up, the hash function is applied again to convert the key to the same array subscript, and the value is read from that slot. In this way the constant-time positioning of an array is put to full use for locating data (Parts 2 and 3 of this article cover hash tables in detail).
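The store-then-look-up cycle described above can be sketched in a few lines of Python. This is only an illustration: the function name `bucket_index`, the multiplier 31, and the table size 8 are assumptions made for the example, and collisions are ignored here.

```python
def bucket_index(key: str, table_size: int) -> int:
    """Convert a key to an integer, then take the remainder by the array length."""
    h = 0
    for byte in key.encode("utf-8"):
        # Simple polynomial hash kept within 32 bits; 31 is an illustrative choice.
        h = (h * 31 + byte) & 0xFFFFFFFF
    return h % table_size


table_size = 8
slots = [None] * table_size

# Store: hash the key to a subscript and write the value into that slot.
slots[bucket_index("baidu", table_size)] = ("baidu", 1)

# Look up: hash the same key again and read the same slot.
found = slots[bucket_index("baidu", table_size)]
```

Two different keys can land on the same subscript; handling that case is the subject of Part 2.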
Problem Analysis:
To find the most popular queries, we must first count how many times each query appears and then pick the top 10 from those counts. The algorithm can therefore be designed in two steps:
Step 1: Query statistics
The counting can be done in either of two ways:
1. Direct sorting
The first algorithm that comes to mind is sorting. We sort all the queries in the log, then traverse the sorted queries and count how many times each one appears.
But the problem states a clear requirement: memory use cannot exceed 1 GB. With 10 million records of up to 255 bytes each, the data occupies about 2.55 GB, which does not meet the requirement.
Recall from the data structures course that when the data is too large to fit in memory, we can use external sorting. Here we can use merge sort, because it has a good time complexity of O(N log N).
After sorting, we traverse the sorted query file, count the number of times each query appears, and write the counts to a file.
Overall, sorting costs O(N log N) and the traversal costs O(N), so the total time complexity of this algorithm is O(N + N log N) = O(N log N).
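The sort-then-scan idea can be sketched as follows. Note the assumption: the example sorts in memory with Python's built-in `sorted`, which stands in for the external merge sort that the real data size would require. The function name `count_by_sorting` is made up for this illustration.

```python
from itertools import groupby


def count_by_sorting(queries):
    """Sort the queries, then count each run of identical neighbours."""
    ordered = sorted(queries)  # O(N log N); a real solution would sort on disk
    # After sorting, equal queries are adjacent, so one linear pass counts them.
    return [(query, sum(1 for _ in run)) for query, run in groupby(ordered)]


counts = count_by_sorting(["b", "a", "b", "c", "a", "b"])
```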
2. Hash Table Method
In the first method we used sorting to count the occurrences of each query, at a time complexity of O(N log N). Is there a better way to store the data that also has a lower time complexity?
The problem states that although there are 10 million queries, the high repetition means there are really only 3 million distinct ones, each at most 255 bytes (about 765 MB of string data at most), so we can consider holding them all in memory. All we need is a suitable data structure. A hash table is the natural first choice, because its lookups are very fast, close to O(1).
The algorithm is then: maintain a hash table whose key is the query string and whose value is the number of times that query has appeared. Read one query at a time. If the string is not in the table, add it with a value of 1; if it is already in the table, add one to its count. In the end we have processed this large data set in O(N) time.
Compared with algorithm 1, this method lowers the time complexity from O(N log N) to O(N). The gain is not only in time complexity: this method reads the data file only once, whereas algorithm 1 performs many I/O passes. Algorithm 2 is therefore also more practical than algorithm 1 in engineering terms.
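A minimal sketch of the hash table counting step, using Python's built-in `dict` as the hash table. The function name `count_queries` and the choice to skip blank lines are assumptions for the example, not part of the original problem.

```python
def count_queries(lines):
    """Count occurrences of each query in a single pass over the log lines."""
    counts = {}
    for line in lines:
        query = line.strip()
        if not query:
            continue  # assumption: blank lines are not queries
        if query in counts:
            counts[query] += 1  # seen before: add one to its count
        else:
            counts[query] = 1   # first occurrence: insert with count 1
    return counts


sample_log = ["weather\n", "news\n", "weather\n", "maps\n", "weather\n"]
sample_counts = count_queries(sample_log)
```

Because the function accepts any iterable of lines, an open file object can be passed directly, so the log is read once and never held in memory as a whole.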
Step 2: Find the Top 10
Algorithm 1: normal sorting
Sorting algorithms need no detailed introduction here; note only that their time complexity is O(N log N). In this problem the 3 million distinct records fit within 1 GB of memory, so an in-memory sort is possible.
Algorithm 2: Partial sorting
The problem asks only for the top 10, so we do not need to sort all the queries. We maintain an array of size 10, initialize it with 10 queries sorted by count from largest to smallest, and then traverse the 3 million records. Each record read is compared with the last query in the array. If its count is smaller, we continue traversing; otherwise the last element of the array is evicted and the current query is inserted at its sorted position. After all the data has been traversed, the 10 queries in the array are the top 10 we are looking for.
The worst-case time complexity of this algorithm is O(N * K), where K is the number of top entries wanted (10 here).
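The partial-sorting procedure can be sketched as below. It is an illustrative version under a few assumptions: the input is an iterable of hypothetical (query, count) pairs, the array is filled from the first K records rather than pre-initialized, and a record that ties with the smallest kept entry is skipped.

```python
def top_k_array(counts, k):
    """Keep a small array sorted by count, largest first, while scanning."""
    if k <= 0:
        return []
    top = []
    for query, count in counts:
        if len(top) < k:
            top.append((query, count))   # still filling the first K slots
        elif count > top[-1][1]:
            top[-1] = (query, count)     # evict the current smallest entry
        else:
            continue                     # not larger than the smallest: skip
        # Move the new entry towards the front; at most K comparisons.
        i = len(top) - 1
        while i > 0 and top[i][1] > top[i - 1][1]:
            top[i], top[i - 1] = top[i - 1], top[i]
            i -= 1
    return top


sample = [("a", 5), ("b", 1), ("c", 9), ("d", 3), ("e", 7)]
best_three = top_k_array(sample, 3)
```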
Algorithm 3: heap
In algorithm 2 we improved the time complexity from O(N log N) to O(N * K), which is a considerable gain. But is there an even better way?
Analysis: in algorithm 2, each replacement costs O(K), because the element has to be inserted into a linear table and its position is found by sequential comparison. Note, however, that the array is ordered, so we can use binary search to find the position, which reduces the search cost to O(log K). The remaining problem is data movement: inserting into the array still requires shifting elements. Even so, this is an improvement over algorithm 2.
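The binary-search refinement can be sketched with the standard `bisect` module. This is an illustration under stated assumptions: entries are stored as (count, query) tuples in ascending order so that the smallest sits at index 0, and ties on count fall back to comparing the query strings.

```python
import bisect


def top_k_bisect(counts, k):
    """Keep a sorted list of the best K entries, locating positions by binary search."""
    top = []  # ascending (count, query) tuples; top[0] is the smallest kept entry
    for query, count in counts:
        entry = (count, query)
        if len(top) < k:
            bisect.insort(top, entry)
        elif top and entry > top[0]:
            top.pop(0)                 # removing the smallest shifts the list: O(K)
            bisect.insort(top, entry)  # O(log K) search, but the insert still shifts
    return [(query, count) for count, query in reversed(top)]


ranked = top_k_bisect([("a", 5), ("b", 1), ("c", 9), ("d", 3), ("e", 7)], 3)
```

The comments mark where the data movement mentioned above still occurs: the search is logarithmic, but `pop(0)` and the insertion both shift elements.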
Given this analysis, is there a data structure that supports both fast search and fast movement of elements? The answer is yes: the heap.
With a heap, both searching and adjusting take logarithmic time. So the algorithm can be improved as follows: maintain a min-heap of size K (10 in this problem), traverse the 3 million queries, and compare each one with the root element. If the current query's count is larger than the root's, replace the root with it and restore the heap; otherwise move on.
The idea is the same as in the two algorithms above; the only difference is that algorithm 3 uses a min-heap instead of an array, which reduces the cost of finding the target position from O(K) to O(log K).
Using the heap, algorithm 3 thus brings the final time complexity down to O(N' log K), a large improvement over algorithm 2.
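A minimal sketch of the min-heap approach, using Python's standard `heapq` module. The function name `top_k_heap` and the (query, count) input format are assumptions for the example.

```python
import heapq


def top_k_heap(counts, k):
    """Keep a min-heap of the K largest counts seen so far."""
    heap = []  # (count, query) tuples; heap[0] is the smallest of the current top K
    for query, count in counts:
        if len(heap) < k:
            heapq.heappush(heap, (count, query))      # O(log K)
        elif heap and count > heap[0][0]:
            # Larger than the root: replace the root and restore the heap, O(log K).
            heapq.heapreplace(heap, (count, query))
    # Sort only the K survivors for presentation, largest first.
    return [(query, count) for count, query in sorted(heap, reverse=True)]


heap_result = top_k_heap([("a", 5), ("b", 1), ("c", 9), ("d", 3), ("e", 7)], 3)
```

Each record costs one comparison with the root in the common case, and at most one O(log K) heap adjustment, which is where the O(N' log K) bound comes from.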
Summary:
This completes the algorithm. In step 1, a hash table counts the occurrences of each query in O(N). In step 2, a heap finds the top 10 in N' * O(log K). The final time complexity is therefore O(N) + N' * O(log K), where N is 10 million and N' is 3 million. If you know of a better algorithm, please leave a comment. This concludes Part 1.
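For comparison, the two steps can be combined using only Python's standard library: `collections.Counter` is a dictionary-based counter, and `heapq.nlargest` selects the largest items without a full sort. This is a sketch under the assumption that the log is supplied as an iterable of lines; the name `top_queries` is made up for the example.

```python
from collections import Counter
import heapq


def top_queries(lines, k=10):
    """Step 1: count with a hash table. Step 2: select the K largest counts."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return heapq.nlargest(k, counts.items(), key=lambda item: item[1])


top_two = top_queries(["a", "b", "a", "c", "a", "b"], k=2)
```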
Part 2: Detailed Analysis of Hash Table Algorithms
What is Hash?
A hash (the term is sometimes translated and sometimes simply transliterated) takes an input of any length, also called the pre-image, and uses a hash algorithm to convert it into an output of fixed length. That output is the hash value. The conversion is a compressing mapping: the space of hash values is usually much smaller than the space of inputs, so different inputs may hash to the same output, and it is therefore impossible to determine the input uniquely from the hash value. Put simply, a hash is a function that compresses a message of any length into a fixed-length message digest.
In the information security field, hashing is used mainly in cryptographic algorithms: it converts messages of differing lengths into a scrambled fixed-length code (128 bits, for example), called the hash value. It can also be said that hashing finds a mapping between the content of the data and the address where the data is stored.
Arrays are easy to address but make insertion and deletion difficult; linked lists are difficult to address but make insertion and deletion easy. Can we combine the strengths of both into a data structure that is easy to address and also easy to insert into and delete from? The answer is yes: this is the hash table. Hash tables can be implemented in many ways. The one explained next is the most commonly used, the zipper method (separate chaining), which can be thought of as an "array of linked lists".
In the figure, the left side is an array. Each member of the array holds a pointer to the head of a linked list, which may be empty or may contain many elements. We distribute elements among the linked lists according to some feature of each element, and we use the same feature to find the correct linked list and then locate the element within it.
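The "array of linked lists" can be sketched as follows. This is a minimal illustration, not a tuned implementation: Python lists stand in for the linked lists, the built-in `hash` stands in for the hash function, and the class and method names are assumptions made for the example.

```python
class ChainedHashTable:
    """An array of chains; each chain holds the entries that share a subscript."""

    def __init__(self, num_buckets=16):
        self.buckets = [[] for _ in range(num_buckets)]

    def _chain(self, key):
        # Use a feature of the element (its hash) to pick the chain.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        chain = self._chain(key)
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                chain[i] = (key, value)  # key already present: overwrite
                return
        chain.append((key, value))       # new key: add to the chain

    def get(self, key, default=None):
        for existing_key, value in self._chain(key):
            if existing_key == key:
                return value
        return default

    def remove(self, key):
        chain = self._chain(key)
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                del chain[i]
                return True
        return False


table = ChainedHashTable()
table.put(1, "one")
table.put(17, "seventeen")  # 1 and 17 share subscript 1 when there are 16 buckets
```

Both keys end up in the same chain, yet each can still be found, because the lookup walks the chain comparing keys.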
The method that converts an element's feature into an array subscript is the hash method. There is more than one hash method; several are listed below:
1. Division hash
The most intuitive method is the division hash. The formula is:
Index = value % 16
Anyone who has learned assembly knows that the modulus is actually obtained through a division operation, which is why this is called the "division hash method".
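A small sketch of the division hash, assuming non-negative integer values and the table size of 16 from the formula. The function name `division_hash` is made up for the example.

```python
def division_hash(value: int, table_size: int = 16) -> int:
    """Index = value % table_size."""
    return value % table_size


indices = [division_hash(v) for v in (3, 16, 35, 48)]

# When the table size is a power of two, the remainder equals the low bits,
# so for non-negative values `value % 16` and `value & 15` agree.
same_as_mask = all(division_hash(v) == (v & 15) for v in range(1000))
```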
2. Square hash
Computing the index is a very frequent operation, and multiplication is faster than division (although on today's CPUs the difference is hardly noticeable), so we would like to replace the division with a multiplication and a shift. The formula is:
Index = (value * value) >> 28 (a right shift by 28 bits, that is, division by 2^28. Note: a left shift makes a value larger, which is multiplication; a right shift makes it smaller, which is division.)
If the values are fairly evenly distributed, this method gives good results. But for the element values in the figure I drew above, every computed index is 0, which is a complete failure. You may have another question: if value is large, won't value * value overflow?
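The behaviour of the square hash can be explored with a short sketch. It reads the formula as a right shift by 28 bits and assumes 32-bit unsigned arithmetic, where an overflowing product simply wraps around; Python integers do not overflow, so the wraparound is emulated with a mask. The names `MASK32` and `square_hash` are made up for the example.

```python
MASK32 = 0xFFFFFFFF  # assumption: values are treated as 32-bit unsigned integers


def square_hash(value: int) -> int:
    """Index = (value * value) >> 28, keeping only the low 32 bits of the product."""
    product = (value * value) & MASK32  # emulate 32-bit wraparound on overflow
    return product >> 28                # the top 4 bits give an index from 0 to 15


small_indices = [square_hash(v) for v in (1, 10, 100, 1000)]
large_index = square_hash(50000)
```

Under this assumption any value below 16384 has a square smaller than 2^28 and so maps to index 0, which matches the failure described above for small values.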