Source: Researcher July
Description: This article is divided into three parts: the first part covers a Baidu interview question on the Top K algorithm; the second part is a detailed discussion of hash table algorithms; the third part builds a fast hash table algorithm.
Part One: The Top K algorithm in detail
Problem description
Baidu Interview Question:
The search engine logs, in a log file, every query string the user submits on each retrieval; each query string is 1-255 bytes long.
Assume there are currently 10 million records (the query strings are highly repetitive: although the total is 10 million, there are no more than 3 million distinct strings after removing duplicates. The more repeated a query string is, the more users queried it, and the more popular it is). Please report the 10 most popular query strings, using no more than 1 GB of memory.
Required Knowledge:
What is a hash table?
A hash table (also known as a hash map) is a data structure that is accessed directly by key value. That is, it speeds up lookups by mapping the key to a position in the table and accessing the record at that position. The mapping function is called the hash function, and the array that holds the records is called the hash table.
The hash table is actually very simple: the key is turned into an integer by a fixed algorithm, the so-called hash function, and that number is then taken modulo the length of the array; the remainder is treated as the subscript, and the value is stored in the array slot with that subscript.
When querying with a hash table, the hash function is used again to convert the key into the corresponding array subscript and locate that slot to get the value; in this way the positioning performance of the array can be fully exploited for locating data (the second and third parts of this article elaborate on hash tables).
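A minimal sketch of this idea in C; the toy hash function and the table size are illustrative assumptions, not part of the problem:

```c
#include <stdio.h>

#define TABLE_SIZE 16  /* illustrative table length */

/* Toy hash function (an assumption for illustration): sum the bytes of the key. */
static unsigned int hash(const char *key) {
    unsigned int h = 0;
    while (*key) h += (unsigned char)*key++;
    return h;
}

int main(void) {
    const char *key = "query";
    unsigned int index = hash(key) % TABLE_SIZE; /* the remainder becomes the subscript */
    printf("\"%s\" -> slot %u\n", key, index);
    return 0;
}
```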
Problem analysis:
To count the most popular queries, we first count the number of occurrences of each query, and then find the Top 10 from the statistics. Based on this idea, we can design the algorithm in two steps.
That is, solving this problem takes the following two steps:
Step one: Query statistics
There are two methods to choose from for the query statistics:
1. Direct sorting method
The first algorithm that comes to mind is sorting: first sort all the queries in the log, then traverse the sorted queries and count how many times each one appears.
But the problem has an explicit requirement: memory must not exceed 1 GB. With 10 million records at 255 bytes each, the data occupies about 2.375 GB of memory (10,000,000 × 255 bytes), so this condition is not satisfied.
Let us recall the data structures course: when the data volume is fairly large and cannot fit in memory, we can sort externally. Here we can use merge sort, since merge sort has a good time complexity of O(N log N).
After sorting, we traverse the now-ordered query file, count the number of occurrences of each query, and write the counts back to the file.
Overall, the sort costs O(N log N) and the traversal costs O(N), so the total time complexity of the algorithm is O(N + N log N) = O(N log N).
2. Hash Table method
In method 1, we used sorting to count the number of times each query appears, at a cost of O(N log N). Is there a better way to store the counts, with lower time complexity?
The problem states that although there are 10 million queries, because of the high repetition there are really only 3 million distinct queries, each at most 255 bytes, so we can consider putting them all into memory (3,000,000 × 255 bytes is about 0.75 GB, within the 1 GB limit). Now we just need a suitable data structure, and here a hash table is definitely our first choice, because a hash table query is very fast, with almost O(1) time complexity.
So our algorithm is: maintain a hash table whose key is the query string and whose value is that query's occurrence count. Each time a query is read, if the string is not in the table, add it and set the value to 1; if the string is already in the table, add one to its count. In this way we finish processing this mass of data in O(N) time.
Compared with algorithm 1, the time complexity improves by an order of magnitude, to O(N). And it is not just the time complexity that is optimized: this method reads the data file only once, while algorithm 1 needs many more I/O passes, so algorithm 2 also has better engineering operability than algorithm 1.
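A minimal sketch of this counting step in C, using the chaining ("zipper") method described in Part Two; the bucket count, hash function, and I/O handling are illustrative assumptions:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 3000017  /* illustrative prime near the 3 million distinct queries */

typedef struct Node {
    char *query;
    long count;
    struct Node *next;
} Node;

static Node *buckets[NBUCKETS];

/* djb2 string hash (an assumed choice; any good string hash works). */
static unsigned long hash_str(const char *s) {
    unsigned long h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Increment the count of `query`, inserting it with count 1 if absent. */
static void count_query(const char *query) {
    unsigned long i = hash_str(query) % NBUCKETS;
    for (Node *p = buckets[i]; p; p = p->next) {
        if (strcmp(p->query, query) == 0) { p->count++; return; }
    }
    Node *n = malloc(sizeof *n);
    n->query = strdup(query);
    n->count = 1;
    n->next = buckets[i];
    buckets[i] = n;
}

int main(void) {
    char line[512];  /* each query string is at most 255 bytes */
    while (fgets(line, sizeof line, stdin)) {
        line[strcspn(line, "\n")] = '\0';
        count_query(line);  /* one pass over the log: O(N) overall */
    }
    return 0;
}
```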
Step two: Find Top 10
Algorithm one: Ordinary sort
I think no one is unfamiliar with sorting algorithms, so there is no need to repeat them here. Just note that the time complexity of a full sort is O(N' log N'), where N' is the 3 million distinct records, which 1 GB of memory can hold.
Algorithm two: Partial sorting
The problem asks for the Top 10, so we do not need to sort all the queries; we only need to maintain an array of size 10, initialize it with 10 queries sorted by count from largest to smallest, and then traverse the remaining records of the 3 million. Each record is compared with the last query in the array: if its count is smaller than that query's, we continue the traversal; otherwise, the last element of the array is evicted and the current query is inserted in its sorted position. After all the data has been traversed, the 10 queries in this array are the Top 10 we are looking for.
It is not hard to see that the worst-case time complexity of this algorithm is N'·K, where K is the number of top elements requested (here K = 10).
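A sketch of this partial sort in C, assuming step one produced (query, count) pairs; the Entry type and names are illustrative:

```c
#define K 10

typedef struct { const char *query; long count; } Entry;

/* top[0..K-1] is kept sorted by count, descending. */
static Entry top[K];
static int topn = 0;  /* how many slots are filled so far */

/* Offer one entry to the Top-10 array. O(K) worst case per call. */
static void offer(Entry e) {
    if (topn == K && e.count <= top[K - 1].count)
        return;                           /* not larger than the last: skip */
    int i = (topn < K) ? topn++ : K - 1;  /* evict the last element if full */
    /* shift smaller-counted neighbours down until e fits in sorted order */
    while (i > 0 && top[i - 1].count < e.count) {
        top[i] = top[i - 1];
        i--;
    }
    top[i] = e;
}
```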
Algorithm three: Heap
In algorithm two we optimized the time complexity from N' log N' down to N'·K, which has to be called a fairly big improvement. But is there a better way?
Analyzing algorithm two: each time a comparison succeeds, up to K operations are needed, because the element is inserted into a linear table by sequential comparison. Noting that the array is ordered, we could locate the insertion position with binary search each time, cutting the cost of the search down to log K; but the accompanying problem is data movement, since elements still have to be shifted to make room. Even so, that variant already improves on algorithm two.
Based on this analysis, we ask: is there a data structure that can both find quickly and move/adjust elements quickly? The answer is yes: the heap.
With a heap structure, we can find, and adjust or move, in logarithmic time. So here our algorithm improves to: maintain a min-heap of size K (10), then traverse the 3 million queries, comparing each in turn with the root element.
The idea is consistent with algorithm two above; only, in algorithm three we use the min-heap data structure in place of the array, which lowers the time to find the target element from O(K) to O(log K).
Thus, using the heap data structure, the final time complexity of algorithm three falls to N' log K, which is a fairly big improvement over algorithm two.
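A sketch of the min-heap variant under the same illustrative assumptions; the heap holds the K largest counts seen so far, with the smallest of them at the root:

```c
#define K 10

typedef struct { const char *query; long count; } Entry;

static Entry heap[K];
static int heapn = 0;

/* Restore the min-heap property downward from index i. */
static void sift_down(int i) {
    for (;;) {
        int smallest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < heapn && heap[l].count < heap[smallest].count) smallest = l;
        if (r < heapn && heap[r].count < heap[smallest].count) smallest = r;
        if (smallest == i) return;
        Entry t = heap[i]; heap[i] = heap[smallest]; heap[smallest] = t;
        i = smallest;
    }
}

/* Restore the min-heap property upward from index i. */
static void sift_up(int i) {
    while (i > 0 && heap[(i - 1) / 2].count > heap[i].count) {
        Entry t = heap[i]; heap[i] = heap[(i - 1) / 2];
        heap[(i - 1) / 2] = t;
        i = (i - 1) / 2;
    }
}

/* Offer one entry: O(log K) per call, N' * O(log K) overall. */
static void offer(Entry e) {
    if (heapn < K) {                 /* heap not yet full: just insert */
        heap[heapn] = e;
        sift_up(heapn++);
    } else if (e.count > heap[0].count) {
        heap[0] = e;                 /* replace the smallest of the top K */
        sift_down(0);
    }
}
```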
Summary:
At this point the algorithm is complete. First, in step one above, use a hash table to count the occurrences of each query in O(N); then, in step two, use the heap data structure to find the Top 10 in N'·O(log K). So our final time complexity is O(N) + N'·O(log K), where N is 10 million and N' is 3 million. If you have a better algorithm, please leave a comment. End of Part One.
Part Two: Hash table algorithms in detail
What is a hash?
Hash, generally rendered as "hashing", transforms an input of arbitrary length (also called a pre-image) into a fixed-length output via a hash algorithm; that output is the hash value. This transformation is a compressing map: the space of hash values is usually much smaller than the space of inputs, different inputs may hash to the same output, and the input cannot be uniquely determined from its hash value. Simply put, it is a function that compresses a message of arbitrary length to a message digest of fixed length.
Hash is mainly used in encryption algorithms in the information-security field: it turns pieces of information of different lengths into jumbled 128-bit codes, which are called hash values. It can also be said that a hash finds a mapping between data content and data storage address.
The characteristics of an array are easy addressing but difficult insertion and deletion, while a linked list is the opposite: difficult addressing but easy insertion and deletion. So can we combine the strengths of both and make a data structure that is easy to address and also easy to insert into and delete from? The answer is yes: it is the hash table we are about to discuss. Hash tables have a number of different implementations; what I explain next is the most common method, the zipper (chaining) method, which we can understand as an "array of linked lists", as shown in the figure:
On the left is obviously the array; each member of the array is a pointer to the head of a linked list, and of course a list may be empty or may contain many elements. We assign elements to the different linked lists according to some characteristic of each element; when searching, we find the correct linked list from the same characteristic, and then find the element within that list.
The method that turns an element's characteristic into an array subscript is the hashing method. There is of course more than one hashing method; the following lists three fairly common ones:
1. Division hashing
The most intuitive one; the figure above uses this hashing method. The formula:
index = value % 16
Anyone who has learned assembly knows that a modulus is actually computed with a division instruction, hence the name "division hashing".
2. Square hashing
Computing the index is a very frequent operation, and multiplication saves time compared with division (on current CPUs the difference is probably imperceptible), so we consider replacing the division with a multiplication and a shift. The formula:
index = (value * value) >> 28 (right shift, i.e. divide by 2^28. Mnemonic: shifting left grows the value, i.e. multiplies; shifting right shrinks it, i.e. divides.)
This method gets good results if the values are fairly uniformly distributed, but the element values in the figure I drew above would all map to 0 - very unsuccessful. Perhaps you also wonder: if value is large, won't value * value overflow? The answer is yes, but our multiplication does not care about the overflow, because we are not after the full product at all, only the index.
3. Fibonacci hashing
The drawback of square hashing is obvious, so can we find an ideal multiplier, instead of using value itself as the multiplier? The answer is yes.
1. For 16-bit integers, the multiplier is 40503.
2. For 32-bit integers, the multiplier is 2654435769.
3. For 64-bit integers, the multiplier is 11400714819323198485.
How were these "ideal multipliers" obtained? They are related to a law called the golden ratio: for n-bit integers the multiplier is 2^n divided by the golden ratio φ ≈ 1.618 (for example, 2^32 / 1.618 ≈ 2654435769). The most classical expression of the golden ratio is undoubtedly the famous Fibonacci sequence: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, ... (It is also said that the ratios of Fibonacci numbers coincide surprisingly well with the ratios of the orbital radii of the eight planets of the solar system.)
For our common 32-bit integers, the formula:
Index = (value * 2654435769) >> 28
If this Fibonacci hashing is used, the figure above becomes:
Obviously, the Fibonacci hashing method scatters the values much better than the original modulo-based hashing method.
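The three methods side by side, sketched in C for the 16-slot table of the figures; note that >> 28 keeps only the top 4 bits of the 32-bit product, giving an index in 0..15:

```c
#include <stdint.h>

/* Division hashing: a modulo, i.e. a division under the hood. */
static uint32_t div_hash(uint32_t value) { return value % 16; }

/* Square hashing: multiply, then shift; overflow is deliberately ignored. */
static uint32_t sq_hash(uint32_t value)  { return (value * value) >> 28; }

/* Fibonacci hashing for 32-bit values: 2654435769 ~ 2^32 / golden ratio. */
static uint32_t fib_hash(uint32_t value) { return (value * 2654435769u) >> 28; }
```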
Scope of application
A basic data structure for quick lookup and deletion; usually it requires that the total amount of data fit into memory.
Fundamentals and key points
Choice of hash function: for strings, integers, permutations, and so on, choose the specific corresponding hashing method.
Collision handling: one approach is open hashing, also known as the zipper (chaining) method; the other is closed hashing, also known as open addressing.
Extensions
The "d" in d-left hashing stands for "multiple". Let us first simplify the problem and look at 2-left hashing. 2-left hashing means splitting a hash table into two halves of equal length, called T1 and T2, and giving T1 and T2 their own hash functions, h1 and h2. When a new key is stored, it is computed with both hash functions, yielding two addresses, h1[key] and h2[key]. We then check position h1[key] in T1 and position h2[key] in T2 to see which already stores more (colliding) keys, and store the new key in the less loaded position. If the two sides are equally loaded - for example, both slots are empty or both already hold one key - the new key is stored in the left table T1, which is where the "left" of 2-left comes from. When looking up a key, both positions must be hashed and checked.
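A minimal sketch of 2-left hashing in C; the table size, bucket capacity, and the two hash functions are illustrative assumptions:

```c
#include <stdint.h>

#define HALF 1024   /* slots per half-table (assumption) */
#define CAP  4      /* keys stored per slot (assumption) */

typedef struct { int n; uint32_t keys[CAP]; } Slot;

static Slot T1[HALF], T2[HALF];

/* Two independent hash functions (illustrative multiplicative mixes). */
static uint32_t h1(uint32_t k) { return (k * 2654435769u) >> 22; } /* top 10 bits */
static uint32_t h2(uint32_t k) { return (k * 0x85ebca6bu) >> 22; } /* different mix */

/* Store a key in the less loaded of its two candidate slots; ties go left. */
static int insert(uint32_t key) {
    Slot *a = &T1[h1(key)], *b = &T2[h2(key)];
    Slot *dst = (a->n <= b->n) ? a : b;   /* tie -> left T1, hence "2-left" */
    if (dst->n == CAP) return 0;          /* slot full: insertion fails */
    dst->keys[dst->n++] = key;
    return 1;
}

/* Lookup: both positions must be hashed and checked. */
static int contains(uint32_t key) {
    Slot *a = &T1[h1(key)], *b = &T2[h2(key)];
    for (int i = 0; i < a->n; i++) if (a->keys[i] == key) return 1;
    for (int i = 0; i < b->n; i++) if (b->keys[i] == key) return 1;
    return 0;
}
```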
Problem example (mass data processing)
We know that hash tables have a wide range of applications in mass data processing. Below, please look at another Baidu interview question:
Problem: Given massive log data, extract the IP that visited Baidu the most times on a certain day.
Solution: The number of IPs is still limited - at most 2^32 - so we can consider hashing the IPs directly into memory and then doing the statistics.
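A sketch of that idea in C, using open addressing (the closed hashing mentioned above) on 32-bit IPs; the table size is an assumption about how many distinct IPs actually occur:

```c
#include <stdint.h>

#define TSIZE (1u << 24)   /* 16M slots; assumes fewer distinct IPs than this */

static uint32_t ips[TSIZE];    /* 0 marks an empty slot (IP 0.0.0.0 not handled) */
static uint32_t counts[TSIZE];

/* Count one IPv4 address (as a 32-bit integer) with linear probing. */
static void count_ip(uint32_t ip) {
    uint32_t i = (ip * 2654435769u) >> 8;  /* Fibonacci hash to 24 bits */
    while (ips[i] != 0 && ips[i] != ip)
        i = (i + 1) & (TSIZE - 1);         /* collision: probe the next slot */
    ips[i] = ip;
    counts[i]++;
}

/* After one pass over the log, scan for the maximum count. */
static uint32_t most_frequent(void) {
    uint32_t best = 0, besti = 0;
    for (uint32_t i = 0; i < TSIZE; i++)
        if (counts[i] > best) { best = counts[i]; besti = i; }
    return ips[besti];
}
```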
Part Three: The fastest hash table algorithm
Next, let us take a concrete look at one of the fastest hash table algorithms.
Let us start with a simple problem: there is a huge array of strings, and then you are given a separate string; you must find out whether this string is in the array. The simplest way is to honestly compare from start to finish, one string at a time, until it is found. Anyone who has studied programming can write such a routine, but if a programmer handed such a program to users, I could only be left speechless in evaluating it. Maybe it really works, but... that is about all that can be said for it.
The most appropriate algorithm uses a Hashtable (hash table). First some basic knowledge: a so-called hash value is generally an integer, and through an algorithm a string can be "compressed" into an integer. Of course, in no case can a 32-bit integer be mapped back to the original string, but in a program the probability that two different strings yield equal hash values can be made very small. Let us look at the hash algorithm in MPQ:
Function one: the following function fills cryptTable[0x500] (0x500 is 1280 in decimal).
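The generator itself, as it appears in the widely circulated MPQ sources (the nested loops fill five 256-entry segments of the 0x500-entry table using a simple linear congruential generator):

```c
unsigned long cryptTable[0x500];

void prepareCryptTable(void)
{
    unsigned long seed = 0x00100001, index1 = 0, index2 = 0, i;

    for (index1 = 0; index1 < 0x100; index1++)
    {
        for (index2 = index1, i = 0; i < 5; i++, index2 += 0x100)
        {
            unsigned long temp1, temp2;

            /* a linear congruential generator drives the table values */
            seed  = (seed * 125 + 3) % 0x2AAAAB;
            temp1 = (seed & 0xFFFF) << 0x10;

            seed  = (seed * 125 + 3) % 0x2AAAAB;
            temp2 = (seed & 0xFFFF);

            cryptTable[index2] = (temp1 | temp2);
        }
    }
}
```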