1. Problem Description
In large-scale data processing, a common task is to find the K items with the highest frequency in massive data, or to find the K largest values in massive data; this class of problem is usually called the "top K" problem. For example, a search engine counts the 10 most popular query words, and a music library counts the 10 songs with the highest download rate.
2. Current solution
For top K problems, a commonly recommended solution is [partitioning + trie/hash map + min-heap]. That is, the dataset is first divided into multiple small datasets by hashing; within each small dataset, a trie or hash map is used to compute the frequency of each query word; a min-heap of size K is then used to find the K most frequent words in each small dataset; finally, the overall top K is selected from all the per-partition top K candidates, as sketched below.
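As a concrete illustration of this [partitioning + hash map + min-heap] pipeline, here is a minimal in-memory sketch in Python; the partition count and K are arbitrary assumptions, and a hash map stands in for the trie:

```python
import heapq
from collections import Counter

def top_k(words, k=10, num_partitions=16):
    """Partition by hash, count each partition, keep a size-k min-heap per
    partition, then merge the per-partition candidates into the final top k."""
    # 1. Partition: the same word always lands in the same partition,
    #    so per-partition counts are exact.
    partitions = [[] for _ in range(num_partitions)]
    for w in words:
        partitions[hash(w) % num_partitions].append(w)

    # 2. Count each partition and keep its k most frequent words
    #    using a min-heap of size k (heapq is a min-heap).
    candidates = []
    for part in partitions:
        counts = Counter(part)
        heap = []  # holds (count, word), smallest count on top
        for word, cnt in counts.items():
            if len(heap) < k:
                heapq.heappush(heap, (cnt, word))
            elif cnt > heap[0][0]:
                heapq.heapreplace(heap, (cnt, word))
        candidates.extend(heap)

    # 3. Merge: the global top k must be among the per-partition top k.
    return heapq.nlargest(k, candidates)

if __name__ == "__main__":
    sample = ["foo", "bar", "foo", "baz", "foo", "bar"]
    print(top_k(sample, k=2))
```

Because every occurrence of a word hashes to the same partition, the per-partition counts are exact, so the global top K is guaranteed to appear among the per-partition candidates.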
In fact, the optimal solution is simply the one that best meets the actual design requirements. In a real application there may be enough memory to process the whole dataset in memory in one pass, or the machine may have multiple cores, so that multithreading can be used to process the entire dataset in parallel.
This article introduces solutions suitable for different application scenarios.
3. Solution
3.1 single-host + single-core + large enough memory
If each query word occupies 8 bytes on average, 1 billion query words require about 10^9 × 8 bytes = 8 GB of memory. With that much memory available, you can sort the query words directly in memory and traverse them sequentially to find the 10 most frequent words. This method is simple, fast, and practical. Alternatively, you can use a hash map to count the occurrences of each word and then find the 10 words with the highest counts.
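A minimal sketch of the hash-map variant, assuming the query words are stored one per line in a file that fits in memory (the path argument is a placeholder):

```python
import heapq
from collections import Counter

def top_10_in_memory(path):
    # Count the frequency of every query word with a hash map,
    # then take the 10 most frequent with a size-10 min-heap.
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts[line.strip()] += 1
    # nlargest uses a min-heap of size 10 internally: O(n log 10).
    return heapq.nlargest(10, counts.items(), key=lambda kv: kv[1])
```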
3.2 single-host + multi-core + large enough memory
In this case, the hash method can be used directly in memory to divide the data into N partitions, with each partition handed to one thread. Each thread's processing logic is similar to Section 3.1, and a final thread merges the per-thread results.
This method has a bottleneck that can significantly affect efficiency: data skew. The processing speed of each thread may differ; fast threads have to wait for slow ones, so the overall speed is determined by the slowest thread. The solution is to divide the data into C × N partitions (C > 1). After a thread finishes its current partition, it proactively takes the next unprocessed partition, and so on until all partitions have been processed; finally, one thread merges the results, as sketched below.
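The following sketch illustrates the C × N idea with assumed values for C, N, and K. It uses processes rather than the threads described above, since CPython threads cannot run CPU-bound counting in parallel; the pool hands the next partition to whichever worker becomes free, which is exactly the skew mitigation just described:

```python
import heapq
from collections import Counter
from multiprocessing import Pool

K = 10          # assumed top-K size
N_WORKERS = 4   # assumed number of cores
C = 4           # over-partitioning factor: C * N_WORKERS partitions

def count_partition(words):
    """Per-partition work: count frequencies and return the local top K."""
    counts = Counter(words)
    return counts.most_common(K)

def top_k_multicore(words):
    num_parts = C * N_WORKERS
    # Hash-partition so every occurrence of a word goes to one partition.
    parts = [[] for _ in range(num_parts)]
    for w in words:
        parts[hash(w) % num_parts].append(w)
    # imap_unordered lets an idle worker pull the next partition as soon as
    # it finishes one, so a slow partition does not stall the others.
    with Pool(N_WORKERS) as pool:
        candidates = []
        for local_top in pool.imap_unordered(count_partition, parts):
            candidates.extend(local_top)
    # Final merge by a single process.
    return heapq.nlargest(K, candidates, key=lambda kv: kv[1])

if __name__ == "__main__":  # guard required when using multiprocessing
    print(top_k_multicore(["foo", "bar", "foo", "baz", "foo", "bar"]))
```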
3.3 single-host + single-core + restricted memory
In this case, you need to split the original data file into smaller files. For example, use hash(x) % M to scatter the records of the original file into M small files; if a small file is still larger than memory, keep splitting it with a different hash function (or a different modulus) until every file fits in memory. Then process each small file in turn with the method in Section 3.1; because every occurrence of a word lands in the same file, the per-file counts are exact, and the overall top K can be taken from the union of the per-file top K results (see the sketch below).
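A minimal sketch of the splitting step, with M, the file paths, and the bucket directory as assumptions; the recursive re-split of oversized buckets is omitted:

```python
import heapq
import os
from collections import Counter

M = 64  # assumed number of bucket files

def split_by_hash(src_path, bucket_dir):
    """Scan the big file once and append each word to bucket hash(word) % M."""
    os.makedirs(bucket_dir, exist_ok=True)
    buckets = [open(os.path.join(bucket_dir, f"bucket_{i}.txt"), "w",
                    encoding="utf-8") for i in range(M)]
    try:
        with open(src_path, encoding="utf-8") as f:
            for line in f:
                word = line.strip()
                buckets[hash(word) % M].write(word + "\n")
    finally:
        for b in buckets:
            b.close()

def top_k_external(src_path, bucket_dir, k=10):
    split_by_hash(src_path, bucket_dir)
    candidates = []
    # Each bucket is assumed to fit in memory and holds every occurrence
    # of the words it contains, so its counts are exact.
    for i in range(M):
        with open(os.path.join(bucket_dir, f"bucket_{i}.txt"),
                  encoding="utf-8") as f:
            counts = Counter(line.strip() for line in f)
        candidates.extend(counts.most_common(k))
    return heapq.nlargest(k, candidates, key=lambda kv: kv[1])
```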
3.4 multi-host + limited memory
In this case, the data can be distributed across multiple machines to make good use of their combined resources. Each machine applies the strategy in Section 3.3 to its local data, and hash + socket can be used for the data distribution, as sketched below.
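A minimal sketch of the sender side of hash + socket distribution; the addresses are placeholders, and each machine is assumed to run a receiver that appends the incoming lines to a local file and then applies Section 3.3:

```python
import socket

# Assumed receiver addresses (placeholders).
MACHINES = [("10.0.0.1", 9000), ("10.0.0.2", 9000), ("10.0.0.3", 9000)]

def distribute(src_path):
    conns = [socket.create_connection(addr) for addr in MACHINES]
    try:
        with open(src_path, encoding="utf-8") as f:
            for line in f:
                word = line.strip()
                # The same word always goes to the same machine, so each
                # machine can compute exact counts for the words it receives.
                conns[hash(word) % len(conns)].sendall((word + "\n").encode())
    finally:
        for c in conns:
            c.close()
```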
From a practical point of view, the solutions in Sections 3.1 through 3.4 are not always feasible, because in a large-scale data processing environment job efficiency is not the primary consideration; scalability and fault tolerance are. An algorithm should scale well, so that as the business grows and the data volume increases, the framework does not need to be modified and throughput grows roughly linearly; and it should be fault tolerant, so that when processing of a file fails, the file is automatically handed to another task for reprocessing rather than the whole job starting over.
The top K problem is well suited to the MapReduce framework. You only need to write one map function and two reduce functions and submit them to Hadoop as chained map/reduce stages. The map function hashes each word so that records with the same hash value are delivered to the same reduce task; the first reduce function uses a hash map to count the occurrences of each word; the second reduce function computes the top K over the output of all first-stage reduce tasks, as sketched below.
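The following is a Hadoop Streaming-style sketch of the three functions in Python. The original describes a chained Hadoop job; this is only an equivalent-in-spirit sketch, with K, the one-word-per-line input format, and the script name assumed:

```python
#!/usr/bin/env python3
"""topk.py: run as `topk.py map`, `topk.py sum`, or `topk.py topk`."""
import heapq
import sys

K = 10  # assumed top-K size

def map_stage():
    # Stage-1 map: emit each query word with a count of 1. Hadoop Streaming
    # partitions by key, so every occurrence of a word reaches one reducer.
    for line in sys.stdin:
        word = line.strip()
        if word:
            print(f"{word}\t1")

def sum_stage():
    # Stage-1 reduce: input arrives sorted by word, so counts for one word
    # are contiguous and can be summed with a running total (a hash map
    # would work equally well).
    current, total = None, 0
    for line in sys.stdin:
        line = line.rstrip("\n")
        if not line:
            continue
        word, cnt = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(cnt)
    if current is not None:
        print(f"{current}\t{total}")

def topk_stage():
    # Stage-2 reduce (single reduce task): keep a size-K min-heap over all
    # (word, count) pairs to obtain the global top K.
    heap = []
    for line in sys.stdin:
        line = line.rstrip("\n")
        if not line:
            continue
        word, cnt = line.rsplit("\t", 1)
        cnt = int(cnt)
        if len(heap) < K:
            heapq.heappush(heap, (cnt, word))
        elif cnt > heap[0][0]:
            heapq.heapreplace(heap, (cnt, word))
    for cnt, word in sorted(heap, reverse=True):
        print(f"{word}\t{cnt}")

if __name__ == "__main__":
    {"map": map_stage, "sum": sum_stage, "topk": topk_stage}[sys.argv[1]]()
```

Such a sketch would be submitted as two chained streaming jobs: the first runs the map stage as the mapper and the sum stage as the reducer across many reduce tasks; the second runs an identity mapper and the top-K stage as a single reduce task.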
4. Summary
Top K is a very common problem. Companies generally do not write one-off programs for this kind of computation; instead, the job is submitted to the company's core data processing platform. The platform may be less efficient than a hand-written program, but it offers good scalability and fault tolerance, which is what matters most to an enterprise.
Original article; when reposting, please note: reposted from the official website of the book "Programmer Interview Test Book".
Link: How can we find the top 10 most frequently occurring words among 1 billion query words?