Problem: given a massive log file, extract the IP that visited Baidu the most times in a single day. Since this is massive data processing, we can assume the data handed to us is far too large to fit in memory. How do we get started? The answer is the standard recipe: divide-and-conquer/hash mapping + hash_map statistics + heap/quick/merge sort. In plain terms: first map, then count, and finally sort:
- Divide-and-conquer/hash mapping: the data is too large and memory is limited, so the only option is to split the large file into small files by modulo mapping, e.g. hash(ip) % N. This is the sixteen-character strategy: turn the large into the small, divide and conquer, reduce the scale, solve one piece at a time (see the sketch after this list).
- hash_map statistics: once the large file has been split into small files that each fit in memory, an ordinary hash_map(ip, count) can tally the frequency of each IP within each small file.
- Heap/quick sort: after the statistics are done, sort the counts (e.g., with heap sort) to obtain the IP with the highest frequency; for a single top IP, tracking a running maximum is enough.
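To make the three phases concrete, here is a minimal C++ sketch. It uses `std::unordered_map` as the modern standard counterpart of the article's `hash_map`. The file names (`access.log`, `part_*.tmp`), the one-IP-per-line input format, and the partition count of 1000 are illustrative assumptions, not part of the original problem:

```cpp
#include <cstdint>
#include <fstream>
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    const int kParts = 1000;  // assumed number of small files; tune to fit memory

    // Phase 1: divide and conquer / hash mapping.
    // Scatter each IP into one of kParts small files via hash(ip) % kParts,
    // so every occurrence of the same IP lands in the same small file.
    // (Opening 1000 files at once may require a raised file-descriptor limit.)
    {
        std::ifstream log("access.log");  // hypothetical input: one IP per line
        std::vector<std::ofstream> parts(kParts);
        for (int i = 0; i < kParts; ++i)
            parts[i].open("part_" + std::to_string(i) + ".tmp");
        std::string ip;
        while (log >> ip)
            parts[std::hash<std::string>{}(ip) % kParts] << ip << '\n';
    }

    // Phases 2 + 3: hash_map statistics per small file, then keep the global max.
    std::string best_ip;
    uint64_t best_count = 0;
    for (int i = 0; i < kParts; ++i) {
        std::ifstream part("part_" + std::to_string(i) + ".tmp");
        std::unordered_map<std::string, uint64_t> freq;  // (ip, count)
        std::string ip;
        while (part >> ip) ++freq[ip];
        for (const auto& [k, v] : freq)  // top-1 needs only a running max,
            if (v > best_count) { best_count = v; best_ip = k; }  // not a full sort
    }
    std::cout << best_ip << " appeared " << best_count << " times\n";
}
```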
In short, massive data processing = divide-and-conquer/hash mapping + hash_map statistics + heap/quick/merge sort.
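The sketch above only tracks a running maximum, since the question asks for a single IP. The heap named in the summary matters when the question is generalized to the top K IPs, a common variant of this problem. A hedged sketch of that variant, where `top_k` is a hypothetical helper fed with (count, ip) pairs drawn from the per-file frequency maps:

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Keep the K most frequent IPs using a size-K min-heap: the root is always
// the smallest count among the current top K, so anything larger displaces it.
std::vector<std::pair<uint64_t, std::string>>
top_k(const std::vector<std::pair<uint64_t, std::string>>& counts, size_t k) {
    std::priority_queue<std::pair<uint64_t, std::string>,
                        std::vector<std::pair<uint64_t, std::string>>,
                        std::greater<>> heap;  // min-heap ordered by count first
    for (const auto& c : counts) {
        heap.push(c);
        if (heap.size() > k) heap.pop();  // evict the smallest, keeping K items
    }
    std::vector<std::pair<uint64_t, std::string>> result;
    while (!heap.empty()) { result.push_back(heap.top()); heap.pop(); }
    return result;  // ascending by count; reverse for descending order
}
```

This costs O(n log K) instead of the O(n log n) of a full sort, which is why the heap is the tool of choice when K is small relative to the data.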