A text file: find the top 10 most frequently occurring words, but this time the file is longer, say hundreds of millions of lines or 1 billion lines; in short, it cannot be read into memory at once

Source: Internet
Author: User
Tags: repetition

Top K Algorithm Explained
Application Scenarios:

The search engine records, in a log file, every query string a user submits; each query string is 1 to 255 bytes long.
Assume there are currently 10 million records. These query strings are highly repetitive: although the total is 10 million, there are no more than 3 million distinct strings once duplicates are removed. The more often a query string repeats, the more users searched for it, and the more popular it is. Count the 10 most popular query strings, using no more than 1 GB of memory.

Required Knowledge:
What is a hash table?

A hash table (also called a hash map) is a data structure that is accessed directly by key value.

That is, it accesses records by mapping keys to positions in a table, which speeds up lookups. This mapping function is called the hash function, and the array that holds the records is called the hash table.

A hash table is actually very simple: the key is run through a fixed hash function to produce an integer, that integer is taken modulo the length of an array, and the remainder is used as the subscript; the value is then stored in the array slot at that subscript.
When querying, the hash function is applied to the key again to obtain the corresponding array subscript, and the value is read from that slot. In this way we get the full benefit of an array's constant-time positioning when locating data.
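
To make the idea concrete, here is a minimal hash-table sketch in Python using separate chaining. It is illustrative only: the class name, bucket count, and use of Python's built-in hash() are arbitrary choices, and in practice Python's dict already provides exactly this behavior.

    # A minimal hash table with separate chaining, for illustration only.
    class SimpleHashTable:
        def __init__(self, num_buckets=1024):
            self.buckets = [[] for _ in range(num_buckets)]

        def _index(self, key):
            # hash() turns the key into an integer; the modulo maps it to a slot.
            return hash(key) % len(self.buckets)

        def put(self, key, value):
            bucket = self.buckets[self._index(key)]
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[i] = (key, value)   # key already present: overwrite
                    return
            bucket.append((key, value))        # new key

        def get(self, key, default=None):
            for k, v in self.buckets[self._index(key)]:
                if k == key:
                    return v
            return default

    table = SimpleHashTable()
    table.put("hello", 1)
    table.put("hello", table.get("hello") + 1)
    print(table.get("hello"))   # prints 2
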
Problem Analysis:

To count the most popular queries, the first thing to do is to count the occurrences of each query, and then find the top 10 from those statistics. So we can design the algorithm in two steps based on this idea.

That is, the solution to this problem is divided into the following two steps:

Step one: query statistics (count the number of occurrences of each query)
There are two methods to choose from for the query statistics:
1. Direct sorting method (for ad-hoc log-file statistics this is often done with cat file | <format key> | sort | uniq -c | sort -nr | head -n 10)
The first algorithm we think of is sorting: first sort all the queries in this log, then traverse the sorted queries and count how many times each one appears.

But the problem has a clear requirement: memory must not exceed 1 GB. With 10 million records at 255 bytes each, the data would occupy about 2.375 GB of memory, so this condition is not satisfied.

Recall the data structures course: when the data volume is large and cannot be loaded into memory, we can use external sorting, and merge sort is a good choice here because it has a relatively good time complexity of O(N log N).

After sorting, we traverse the already ordered query file, count the number of occurrences of each query, and write the counts back to a file.

Putting it together, the time complexity of the sort is O(N log N) and that of the traversal is O(N), so the overall time complexity of the algorithm is O(N + N log N) = O(N log N).
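
As a minimal sketch of the counting pass, assuming the sorting itself has already been done by an external sort (for example the Unix sort command), and with made-up file names:

    import itertools

    # Counting pass over a file that has already been sorted externally,
    # e.g. by `sort queries.txt -o queries.sorted`; one query per line.
    with open("queries.sorted") as src, open("query_counts.txt", "w") as dst:
        lines = (line.rstrip("\n") for line in src)
        for query, group in itertools.groupby(lines):
            dst.write(f"{sum(1 for _ in group)}\t{query}\n")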

2. Hash table method (this method is very good for counting occurrences of strings)
In the first approach we used sorting to count how many times each query appears, with time complexity O(N log N). Can we find a way to store the data that gives a lower time complexity?

The problem states that although there are 10 million queries, because of the high repetition there are in fact only 3 million distinct queries, each at most 255 bytes, so we can consider putting them all into memory (3 million × 255 bytes is roughly 0.75 GB, within the 1 GB limit). Now we just need a suitable data structure, and here a hash table is definitely our first choice, because hash table lookups are very fast, with almost O(1) time complexity.

So, here's our algorithm:

Maintain a hash table whose key is the query string and whose value is the number of times that query occurs. Each time a query is read, if the string is not in the table, add it and set its value to 1; if the string is already in the table, add 1 to its count. In the end, we have processed this massive amount of data with a time complexity of O(N).
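
A minimal sketch of this counting step in Python, using the built-in dict as the hash table; the log file name query.log is an assumption for illustration:

    # Count query occurrences with a hash table (Python dict).
    counts = {}  # query string -> number of occurrences
    with open("query.log") as f:
        for line in f:
            query = line.rstrip("\n")
            counts[query] = counts.get(query, 0) + 1  # expected O(1) per update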

Compared with algorithm 1, the time complexity improves by an order of magnitude, to O(N). And it is not only the time complexity that improves: this method needs to read the data file only once, whereas algorithm 1 requires more I/O passes, so algorithm 2 also has better operability than algorithm 1 in engineering terms.

Step two: find the Top 10 (find the 10 queries with the most occurrences)
Algorithm one: ordinary sorting (we only need the Top 10, so sorting everything is redundant)
I think everyone is familiar with sorting algorithms, so I will not repeat them here. What we should note is that the time complexity of sorting is O(N log N), and for this problem the 3 million records can be held in 1 GB of memory.

Algorithm two: partial sorting
The problem asks for the Top 10, so we do not need to sort all the queries. We only need to maintain an array of size 10, initialize it with the first 10 queries, and keep it sorted by each query's count from largest to smallest. Then we traverse the 3 million records: each record is compared with the last (smallest) query in the array; if its count is smaller, we continue the traversal; otherwise the last element of the array is eliminated and the current query is placed at its appropriate position so the array stays ordered. After all the data has been traversed, the 10 queries in this array are the Top 10 we are looking for.

It is not hard to see that the worst-case time complexity of this algorithm is O(N*K), where K is the number of top elements we want (here K = 10).
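
A sketch of this partial-sort idea, assuming counts is the query-to-occurrences dict produced in step one and K = 10:

    K = 10
    top = []  # at most K (count, query) pairs, kept ordered from largest to smallest count

    for query, cnt in counts.items():
        if len(top) < K:
            top.append((cnt, query))
            top.sort(reverse=True)            # keep the small array ordered
        elif cnt > top[-1][0]:                # beats the current smallest of the top K
            top[-1] = (cnt, query)            # evict the last element
            top.sort(reverse=True)            # move the new entry to its proper position

    for cnt, query in top:
        print(cnt, query)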

Algorithm three: heap
In algorithm two we optimized the time complexity from O(N log N) to O(N*K); this is a fairly big improvement, but is there an even better way?

Analysis: in algorithm two, each comparison may require up to K operations, because the element is inserted into a linear table using sequential comparison. Notice, however, that the array is ordered, so each search could use binary search, which reduces the search cost to O(log K); but the accompanying problem is data movement, since more elements have to be shifted. Even so, this is still better than plain algorithm two.

Based on the above analysis, can we think of a data structure that can both find and move/adjust elements quickly?

The answer is yes: the heap.
With a heap structure, we can find and adjust/move elements in logarithmic time. So here our algorithm can be improved to: maintain a min-heap of size K (K = 10 for this problem), then traverse the 3 million queries and compare each one with the root element of the heap.

The idea is consistent with the algorithm above; algorithm three merely replaces the array with a min-heap, which reduces the time complexity of finding and replacing the target element from O(K) to O(log K).
So, using the heap data structure, the final time complexity of algorithm three drops to O(N log K), which is a fairly large improvement over algorithm two.
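
The same Top-10 pass written with a min-heap of size K, using Python's heapq module and again assuming the counts dict from step one:

    import heapq

    K = 10
    heap = []  # min-heap of (count, query); the root is the smallest of the current top K

    for query, cnt in counts.items():
        if len(heap) < K:
            heapq.heappush(heap, (cnt, query))
        elif cnt > heap[0][0]:                      # compare against the heap root
            heapq.heapreplace(heap, (cnt, query))   # pop the root and push, O(log K)

    for cnt, query in sorted(heap, reverse=True):
        print(cnt, query)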

Summary:

At this point the algorithm is complete. In the first step above, we use a hash table to count the occurrences of each query in O(N) time; then in the second step, we use the heap data structure to find the Top 10 in O(N' log K) time. So our final time complexity is O(N) + O(N' log K), where N is 10 million and N' is 3 million.
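
Putting the two steps together, a compact end-to-end sketch; the file name is an assumption, and collections.Counter and heapq.nlargest play the roles of the hash-table count and the size-K heap respectively:

    import heapq
    from collections import Counter

    counts = Counter()
    with open("query.log") as f:                    # illustrative file name
        for line in f:
            counts[line.rstrip("\n")] += 1          # step one: hash-table count, O(N)

    # step two: heap selection over the N' distinct queries, O(N' log K)
    top10 = heapq.nlargest(10, counts.items(), key=lambda kv: kv[1])
    for query, cnt in top10:
        print(cnt, query)

Counter.most_common(10) would return the same selection.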

--------------------------------------------------------------------------------

Question one:

Find the largest K numbers in an unordered array. Algorithm idea 1:

Sort the array in descending order, then return the first K elements; these are the required K largest numbers.

There are many sorting algorithms to choose from. Considering that the array is unordered, we can choose quicksort, whose average time complexity is O(N log N). Concrete code implementations can be found in data structures and algorithms books.
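
A tiny sketch of this idea (Python's built-in sort stands in for the quicksort mentioned above; the function name is just for illustration):

    def top_k_by_sorting(nums, k):
        # Sort in descending order and take the first k elements.
        return sorted(nums, reverse=True)[:k]

    print(top_k_by_sorting([7, 1, 9, 3, 5], 3))   # [9, 7, 5]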

Algorithm idea 2 (better):

Looking at the first algorithm: the problem only asks for the K largest numbers in the array, but the first algorithm performs a full sort, finding not just the K largest numbers but the order of all N numbers (N being the array size), so the algorithm clearly contains redundancy. For this reason, an improved algorithm two is proposed.

First create a temporary array of size K, read K of the N numbers into it, and sort it fully in descending order (the sorting algorithm can be chosen freely; considering the array is unordered, quicksort is a reasonable choice). Then read the remaining N-K numbers one by one and compare each with the K-th (smallest) element of the temporary array: if it is larger, insert it at the appropriate position and let the last element of the array overflow; if it is smaller than or equal to the K-th element, do not insert it. When the loop finishes, the K elements of the temporary array are the required K largest numbers. The average time complexity is roughly O(K log K + (N-K)): each of the remaining N-K numbers is first compared only with the K-th element, and the costlier insertions are comparatively rare. The concrete code implementation is left as an exercise.
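
A sketch under these assumptions, keeping the temporary array in ascending order so that Python's bisect module can do a binary-search insertion; the function name and random test data are illustrative:

    import bisect
    import random

    def top_k_partial(nums, k):
        # Temporary array of size k, kept in ascending order so bisect can
        # binary-search the insertion point; returns the k largest, largest first.
        buf = sorted(nums[:k])
        for x in nums[k:]:
            if x > buf[0]:                 # larger than the current k-th largest
                buf.pop(0)                 # the smallest element overflows
                bisect.insort(buf, x)      # insert x at its proper position
        return buf[::-1]

    data = [random.randint(0, 1000) for _ in range(100)]
    print(top_k_partial(data, 10))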

Original:

http://blog.csdn.net/wwang196988/article/details/6618746

Question two: There are 100 million floating-point numbers; find the largest 10,000. Hint: assuming each floating-point number is 4 bytes, 100 million floating-point numbers occupy a fairly large amount of space (on the order of 400 MB), so they cannot all be sorted in memory at once.

It follows that if the machine's memory is not sufficient, we have to think of other ways to solve the problem; to remain efficient, we may accept a solution that is correct with a certain probability. The result is not necessarily exact, but it should handle the vast majority of the data.

Algorithm idea 1:
1. Divide the 100 million floating-point numbers into 100 groups by hashing (the same value always lands in the same group);

2. First pass: in each group, find the largest 10,000 numbers; with 100 groups this gives 1,000,000 candidates;

3. Second pass: among those 1,000,000 candidates, find the largest 10,000.
PS: finding the largest 10,000 among 1,000,000 numbers can be done with an idea similar to quicksort partitioning (quickselect).
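
A scaled-down, in-memory sketch of this two-level selection; in the real setting each group would live in its own file on disk, and heapq.nlargest stands in here for the quickselect pass mentioned in the note:

    import heapq
    import random

    def top_n_two_level(nums, n, groups):
        # Route each value to a group by hashing, take the top n of each group,
        # then take the top n of the combined candidates.
        buckets = [[] for _ in range(groups)]
        for x in nums:
            buckets[hash(x) % groups].append(x)
        candidates = []
        for b in buckets:
            candidates.extend(heapq.nlargest(n, b))   # first pass
        return heapq.nlargest(n, candidates)          # second pass

    data = [random.random() for _ in range(1_000_000)]   # scaled-down stand-in
    print(top_n_two_level(data, n=10, groups=100)[:3])
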
Algorithm idea 2 (better):
1. Read the first 10,000 numbers and build a binary sort tree from them directly; with respect to N this is a constant amount of work. O(1)
2. For each number read subsequently, compare it with the smallest of the current 10,000 numbers (N comparisons in total); if it is smaller, read the next number. O(N)
3. If it is larger, search the binary sort tree for the position where it should be inserted, and insert it.
4. Delete the current minimum node.
5. Repeat step 2 until all 100 million numbers have been read.
6. Output all 10,000 numbers in the current binary sort tree with an in-order traversal.
The time complexity of the algorithm is essentially O(N) comparisons,
and the space complexity is 10,000 nodes (a constant).

Based on the idea above, this can also be implemented with a min-heap, so that replacing the smallest of the 10,000 retained numbers, whenever a larger number arrives, costs only O(log 10000).
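
A sketch of the min-heap variant, with a random generator standing in for the floats streamed from disk and the sizes scaled down for the demo:

    import heapq
    import random

    def top_n_streaming(stream, n):
        # Keep a min-heap of the n largest values seen so far;
        # each replacement of the current minimum costs O(log n).
        heap = []
        for x in stream:
            if len(heap) < n:
                heapq.heappush(heap, x)
            elif x > heap[0]:                   # larger than the smallest retained value
                heapq.heapreplace(heap, x)
        return sorted(heap, reverse=True)

    stream = (random.random() for _ in range(1_000_000))
    print(top_n_streaming(stream, 10)[:3])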

Related similar problems:

1. A text file with about 10,000 lines and one word per line: count the 10 most frequent words; give the idea and a time complexity analysis.

Solution 1: This question is about time efficiency. Use a trie (prefix tree) to count the occurrences of each word; the time complexity is O(N*Le), where Le denotes the average length of a word. Then find the 10 most frequent words, which can be done with a heap as mentioned in the previous problem, with time complexity O(N*log 10). So the total time complexity is the larger of O(N*Le) and O(N*log 10).
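
A minimal trie-based word counter with a heap pass for the top words; the class and function names are illustrative:

    import heapq

    class TrieNode:
        __slots__ = ("children", "count")
        def __init__(self):
            self.children = {}
            self.count = 0

    def count_words(words):
        # Insert each word in O(length) time; the node at the end of a word keeps its count.
        root = TrieNode()
        for w in words:
            node = root
            for ch in w:
                node = node.children.setdefault(ch, TrieNode())
            node.count += 1
        return root

    def collect(node, prefix="", out=None):
        # Walk the trie and gather (word, count) pairs.
        if out is None:
            out = []
        if node.count:
            out.append((prefix, node.count))
        for ch, child in node.children.items():
            collect(child, prefix + ch, out)
        return out

    words = ["apple", "banana", "apple", "cherry", "banana", "apple"]
    top2 = heapq.nlargest(2, collect(count_words(words)), key=lambda wc: wc[1])
    print(top2)   # [('apple', 3), ('banana', 2)]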

2. A text file: find the 10 most frequently occurring words, but this time the file is longer, say hundreds of millions of lines or 1 billion lines; in short, it cannot be read into memory. Give the best solution.

Solution 1: First, using a hash of each word and the modulo operation, split the file into a number of small files, so that all occurrences of the same word land in the same small file. For each single file, use the method of the previous problem to find its 10 most frequent words. Then merge these candidates to find the 10 most frequent words overall.
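
A sketch of this split-and-merge idea, assuming the partition directory already exists; the file names and the number of partitions are made up, and Python's built-in hash() is used only because both passes run in the same process:

    import heapq
    import os
    from collections import Counter

    NUM_PARTS = 32   # illustrative number of small files

    def split_by_hash(big_file, out_dir):
        # Pass 1: route each word to a partition file by hash(word) % NUM_PARTS,
        # so every occurrence of a given word ends up in the same small file.
        # (Python randomizes string hashes per process; a stable hash such as
        # hashlib would remove that caveat.)
        parts = [open(os.path.join(out_dir, f"part_{i}.txt"), "w") for i in range(NUM_PARTS)]
        try:
            with open(big_file) as f:
                for line in f:
                    word = line.strip()
                    if word:
                        parts[hash(word) % NUM_PARTS].write(word + "\n")
        finally:
            for p in parts:
                p.close()

    def top10_overall(out_dir):
        # Pass 2: each partition fits in memory, so count it with a hash table,
        # keep its local top 10, then merge the candidates for the global top 10.
        # Merging local top-10s is safe because a word never spans two partitions.
        candidates = []
        for i in range(NUM_PARTS):
            with open(os.path.join(out_dir, f"part_{i}.txt")) as f:
                counts = Counter(line.strip() for line in f)
            candidates.extend(counts.most_common(10))
        return heapq.nlargest(10, candidates, key=lambda wc: wc[1])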

3. Find the largest 100 numbers among 1,000,000 (100w) numbers.

      • Option 1: local elimination. Take the first 100 elements and sort them as a sequence L. Then scan the remaining elements one at a time, comparing each element x with the smallest element of the sorted 100; if x is larger than that smallest element, delete the smallest element and insert x into the sequence L using the idea of insertion sort. Continue in this way until all the elements have been scanned. The complexity is O(100w*100).
      • Option 2: use the idea of quicksort. After each partition, only the part larger than the pivot needs to be considered; once the part larger than the pivot comes down to about 100 elements, sort it with a traditional sorting algorithm and take the first 100 (a quickselect-style sketch follows this list). The complexity is O(100w*100).
      • Option 3: as mentioned in the previous problems, this can be done with a min-heap of 100 elements. The complexity is O(100w*lg100).
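
As referenced in option 2, a quickselect-style sketch that partitions in descending order until the 100 largest values occupy the front of the array; names and the random test data are illustrative:

    import random

    def top_k_quickselect(nums, k):
        # Partially sort nums in place so that its k largest values end up in nums[:k].
        def partition(lo, hi):
            pivot = nums[random.randint(lo, hi)]
            i, j = lo, hi
            while i <= j:
                while nums[i] > pivot:        # descending partition
                    i += 1
                while nums[j] < pivot:
                    j -= 1
                if i <= j:
                    nums[i], nums[j] = nums[j], nums[i]
                    i, j = i + 1, j - 1
            return i, j

        lo, hi = 0, len(nums) - 1
        while lo < hi:
            i, j = partition(lo, hi)
            if k - 1 <= j:        # the k-th largest lies in the left part
                hi = j
            elif k - 1 >= i:      # the k-th largest lies in the right part
                lo = i
            else:
                break             # elements between j and i are already in place
        return nums[:k]

    data = [random.randint(0, 10_000) for _ in range(1_000_000)]   # stand-in for the 100w numbers
    top100 = top_k_quickselect(data, 100)
    print(sorted(top100, reverse=True)[:5])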

