Given a text file, find the top 10 most frequently appearing words; this time the file is much longer, say hundreds of millions or even 1 billion lines, so in short it cannot be read into memory all at once.

Source: Internet
Author: User
Tags: hash, repetition

The Top K Algorithm in Detail
Application Scenarios:

A search engine records every query string a user submits in a log file; each query string is 1-255 bytes long.
Assume there are currently 10 million records. These query strings are highly repetitive: although the total is 10 million, there are no more than 3 million distinct strings once duplicates are removed. The more often a query string repeats, the more users have searched for it, and the more popular it is. Count the 10 most popular query strings, using no more than 1 GB of memory.

Required Knowledge:
What is a hash table?
A hash table (also known as a hash map) is a data structure that is accessed directly by key.

That is, it speeds up lookups by mapping a key to a location in a table and accessing the record stored there. The mapping function is called a hash function, and the array that holds the records is called the hash table.

A hash table is actually very simple: a fixed algorithm (the so-called hash function) converts the key into an integer, that integer is taken modulo the length of the array, and the remainder is used as a subscript; the value is then stored in the array slot at that subscript.
When querying, the hash function is applied again to convert the key into the corresponding array subscript and locate the slot holding the value, so we can take full advantage of the array's constant-time indexing to locate the data.
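To make the idea concrete, here is a minimal sketch of such a hash table in Python; the class name, the fixed array size, and the use of chaining for collisions are illustrative assumptions, not part of the original text.

```python
class SimpleHashTable:
    """Toy hash table: hash the key, take it modulo the array length,
    and store (key, value) pairs in that slot; collisions are chained."""

    def __init__(self, size=1024):
        self.size = size
        self.slots = [[] for _ in range(size)]  # each slot is a small list (chaining)

    def _index(self, key):
        return hash(key) % self.size            # hash function + modulo = array subscript

    def put(self, key, value):
        slot = self.slots[self._index(key)]
        for i, (k, _) in enumerate(slot):
            if k == key:
                slot[i] = (key, value)          # overwrite an existing key
                return
        slot.append((key, value))

    def get(self, key, default=None):
        for k, v in self.slots[self._index(key)]:
            if k == key:
                return v
        return default

# Usage: lookups stay close to O(1) regardless of how many keys are stored.
t = SimpleHashTable()
t.put("hello", 3)
print(t.get("hello"))  # 3
```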
Problem Resolution:

To find the most popular queries, we first count how many times each query occurs, and then find the top 10 based on those counts. So we can design the algorithm in two steps following this idea.

That is, the solution to this problem is divided into the following two steps:

Step one: query statistics (count the number of occurrences of each query)
There are two methods to choose from for this statistics step:
1. Direct sorting (this is the method often used for log-file statistics, e.g. cat file | [extract key] | sort | uniq -c | sort -nr | head -n 10)
The first algorithm that comes to mind is sorting: first sort all the queries in the log, then traverse the sorted queries and count how many times each one appears.

But the problem has a clear requirement: memory must not exceed 1 GB. With 10 million records at 255 bytes each, they would clearly occupy about 2.375 GB of memory, which does not satisfy the requirement.

Recall from the data structures course that when the data volume is too large to fit in memory, we can use external sorting; here we can use merge sort, because merge sort has a relatively good time complexity of O(N log N).

After sorting, we iterate over the now-ordered query file, counting the number of occurrences of each query and writing the counts back to a file.

Overall, the time complexity of the sort is O(N log N) and that of the traversal is O(N), so the total time complexity of this algorithm is O(N + N log N) = O(N log N).
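As a rough sketch of that counting pass, assuming the queries have already been externally sorted into a file with one query per line (the file names and format are assumptions for illustration):

```python
def count_sorted_queries(sorted_path, counts_path):
    """Single pass over a sorted query file: identical queries are adjacent,
    so each run can be counted and written out as we go."""
    with open(sorted_path, "r", encoding="utf-8") as src, \
         open(counts_path, "w", encoding="utf-8") as dst:
        current, count = None, 0
        for line in src:
            query = line.rstrip("\n")
            if query == current:
                count += 1
            else:
                if current is not None:
                    dst.write(f"{current}\t{count}\n")
                current, count = query, 1
        if current is not None:
            dst.write(f"{current}\t{count}\n")
```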

2. Hash table method (this method is very good at counting how many times each string occurs)
In the first approach we used sorting to count how many times each query appears, with time complexity O(N log N). Is there a better way to store the counts, with lower time complexity?

The problem states that although there are 10 million queries, because of the high repetition there are in fact only 3 million distinct queries of at most 255 bytes each (roughly 3,000,000 × 255 bytes ≈ 0.7 GB, within the 1 GB limit), so we can consider putting them all into memory. Now we just need a suitable data structure, and here a hash table is definitely our first choice, because hash table lookups are very fast, with almost O(1) time complexity.

So, here's our algorithm:

Maintain a hash table whose key is the query string and whose value is the number of times that query occurs. Each time a query is read, if the string is not in the table, add it with a value of 1; if the string is already in the table, increment its count by 1. In this way we process the entire mass of data with time complexity O(N).
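A minimal sketch of this counting step in Python, assuming the log holds one query per line (the function and file names are illustrative):

```python
def count_queries(log_path):
    """Read the log once, keeping a hash table (dict) of query -> occurrence count."""
    counts = {}
    with open(log_path, "r", encoding="utf-8") as f:
        for line in f:
            query = line.rstrip("\n")
            # O(1) average lookup/update per query, O(N) for the whole pass
            counts[query] = counts.get(query, 0) + 1
    return counts
```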

Compared with algorithm 1, the time complexity improves by an order of magnitude, to O(N). And the improvement is not only in time complexity: this method needs to read the data file only once, while algorithm 1 requires many more I/O passes, so algorithm 2 is more practical than algorithm 1 in engineering terms.

Step two: find the top 10 (find the 10 queries that occur most often)
Algorithm one: ordinary sorting (we only need the top 10, so sorting everything is redundant)
Everyone is familiar with sorting algorithms, so I will not repeat them here. Just note that the time complexity of sorting is O(N log N); in this problem, the 3 million records can be held within 1 GB of memory.

Algorithm two: Partial sorting
The problem asks only for the top 10, so we do not need to sort all the queries. We just need to maintain an array of size 10, initialize it with 10 queries, and sort them by count from largest to smallest. Then we traverse the 3 million records: each record read is compared with the last (smallest) query in the array; if its count is smaller, we continue the traversal, otherwise the last element is evicted and the current query is inserted at the appropriate position to keep the array ordered. After all the data has been traversed, the 10 queries in this array are the top 10 we are looking for. A sketch of this is shown after the complexity note below.

It is not difficult to see that the worst-case time complexity of this algorithm is O(N × K), where K is the number of top elements required (here 10).
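A minimal sketch of this partial-sort idea, reusing the counts produced in step one (the function names are assumptions):

```python
def top_k_partial_sort(counts, k=10):
    """Maintain an array of the k largest (query, count) pairs, ordered from
    largest to smallest count; each record is compared with the smallest entry."""
    top = []  # kept sorted descending by count, at most k entries
    for query, count in counts.items():
        if len(top) < k or count > top[-1][1]:
            if len(top) == k:
                top.pop()                       # evict the current smallest of the top k
            # linear insertion keeps the array ordered: O(k) per update
            pos = 0
            while pos < len(top) and top[pos][1] >= count:
                pos += 1
            top.insert(pos, (query, count))
    return top

# Usage with the counts from step one:
# print(top_k_partial_sort(count_queries("queries.log")))
```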

Algorithm three: Heap
In algorithm two, we optimized the time complexity from O(N log N) down to O(N × K), which is a fairly big improvement; but is there an even better way?

Analyzing algorithm two: each time a comparison succeeds, the required operation costs O(K), because the element is inserted into a linear table using sequential comparisons. Notice, however, that the array is ordered, so each search for the insertion point can use binary search, reducing the search cost to O(log K). The accompanying problem is data movement, because more elements must be shifted; even so, this variant is still better than algorithm two. A sketch of the refinement follows.
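A small sketch of that binary-search refinement, using Python's bisect module on an ascending list of counts (names are illustrative; note that the insert itself still shifts elements):

```python
import bisect

def top_k_binary_insert(counts, k=10):
    """Keep the current top-k counts in an ascending list; binary search finds
    the insertion point in O(log k), but inserting still moves elements."""
    top_counts = []   # ascending counts
    top_queries = []  # queries aligned with top_counts
    for query, count in counts.items():
        if len(top_counts) < k or count > top_counts[0]:
            if len(top_counts) == k:
                top_counts.pop(0)    # drop the smallest
                top_queries.pop(0)
            pos = bisect.bisect_left(top_counts, count)  # O(log k) search
            top_counts.insert(pos, count)                # O(k) data movement
            top_queries.insert(pos, query)
    return list(zip(top_queries, top_counts))[::-1]      # largest count first
```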

Based on the above analysis, we wonder: is there a data structure that can both find and move elements quickly?

The answer is yes, that's the heap.
With a heap, we can both find and adjust/move elements in logarithmic time. So our algorithm can be improved as follows: maintain a min-heap of size K (here 10), then traverse the 3 million queries, comparing each one with the root element of the heap.

The idea is consistent with the algorithm above, except that in algorithm three we use a min-heap instead of an array, so the time complexity of finding and replacing the target element drops from O(K) to O(log K).
So, using the heap data structure, algorithm three reduces the final time complexity to O(N log K), a fairly large improvement over algorithm two.
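A minimal sketch of the min-heap approach using Python's heapq (the names carried over from the earlier sketches are assumptions):

```python
import heapq

def top_k_heap(counts, k=10):
    """Maintain a size-k min-heap of (count, query); each new query is compared
    with the heap root, so every update costs O(log k)."""
    heap = []
    for query, count in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, query))
        elif count > heap[0][0]:                  # beats the smallest count in the heap
            heapq.heapreplace(heap, (count, query))
    return sorted(heap, reverse=True)             # largest count first

# Usage with the counts from step one:
# print(top_k_heap(count_queries("queries.log")))
```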

Summary:

At this point the algorithm is complete. In the first step above, we use a hash table to count the occurrences of each query in O(N); in the second step, we use the heap data structure to find the top 10 in O(N' log K). So our final time complexity is O(N) + O(N' log K), where N is 10 million and N' is 3 million.
