Take the 10 most frequently occurring items out of 1000w (10 million) data records
The general form of this problem: find the K most frequently occurring items in a large data set.
This is a frequency-counting problem. Because the data set is too large to fit on a single machine, it must be processed in a distributed fashion. Processing steps:
Step 1: partition the data into groups, say M groups;
Step 2: for each group, count how many times each item occurs, using a data structure such as a HashMap or TreeMap to store the counts;
Step 3: take the K*M most frequent items out of each group, merge them into a new data set, and repeat step 2;
Step 4: the final result is the first K items of the result computed in step 3.
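The steps above can be sketched as follows. This is a minimal single-process illustration, not a real distributed implementation: the hypothetical `groups` argument stands in for the M partitions that would live on separate machines, and the per-group counts are carried along in step 3 so the merged recount stays exact.

```python
from collections import Counter

def top_k_frequent(groups, k):
    """Sketch of the grouped top-k-frequent pipeline described above."""
    m = len(groups)
    merged = Counter()
    for group in groups:
        counts = Counter(group)                 # step 2: per-group frequencies
        # step 3: keep the k*m most frequent items from each group,
        # carrying their counts into the merged candidate set
        for item, freq in counts.most_common(k * m):
            merged[item] += freq
    # step 4: the answer is the top k of the merged counts
    return [item for item, _ in merged.most_common(k)]

print(top_k_frequent([["a", "b", "a", "c"], ["b", "b", "d", "a"]], 2))
```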
The steps are now briefly explained, without formal proof:
Step 1: the data is grouped because the volume is too large for a single machine to hold;
Step 2: count the frequency of each item. A TreeMap keeps its entries naturally sorted, but note that it sorts by key (the item), not by count, so ranking by frequency still requires a sort on the counts;
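For illustration, counting and ranking in Python, where a `Counter` plays the HashMap role; the sort on the count values mirrors what a key-ordered map such as Java's TreeMap would still require:

```python
from collections import Counter

data = ["x", "y", "x", "z", "x", "y"]
counts = Counter(data)          # item -> number of occurrences
print(counts["x"])              # 3
# a key-ordered map (e.g. Java's TreeMap) sorts by item, not by count;
# ranking by frequency needs an explicit sort on the count values:
by_freq = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(by_freq[0])               # ('x', 3)
```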
Step 3: the number of items taken from each group is K*M, not K. The reason is that when the data is not partitioned by item and the frequency distribution is fairly uniform, an item's occurrences are split across groups; taking only the top K from each group could then drop an item that is frequent overall;
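A small worked example of the miss this guards against, assuming k = 1 and two arbitrarily split groups (hash-partitioning by item would avoid the issue, since all occurrences of an item would then land in one group):

```python
from collections import Counter

k = 1
g1 = ["b", "b", "b", "a", "a"]
g2 = ["c", "c", "c", "a", "a"]

# globally, "a" is the single most frequent item (4 occurrences)...
total = Counter(g1) + Counter(g2)
print(total.most_common(1))     # [('a', 4)]

# ...but it is not the top-1 of either group, so taking only the
# top k from each group would drop it entirely
top_g1 = [x for x, _ in Counter(g1).most_common(k)]   # ['b']
top_g2 = [x for x, _ in Counter(g2).most_common(k)]   # ['c']
print("a" in top_g1 + top_g2)   # False
```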
Step 4: merge the candidates into a new set, repeat step 2 on it, and take the first K of the result as the answer.

Take the 10 largest items out of 1000w (10 million) data records
The general form of this problem: find the K largest items in a large data set.
This is a selection/sorting problem, and the approach is relatively simple:
Step 1: partition the data into groups, say M groups;
Step 2: for a single group, build an array of length K; read each item and, if it belongs among the K largest seen so far, place it into the array, re-sorting as items are inserted;
Step 3: repeat step 2 for every group;
Step 4: merge the per-group arrays and repeat the selection; the data left in the final array is the result set.