1. Given a massive amount of log data, extract the IP that visited Baidu the most times on a given day.
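Solution: An IP address consists of 4 numbers, each from 0 to 255, so there are at most 2^32 distinct IPs.
Scan the log and write every IP whose first number is n into file n. This gives 256 files.
For each small file, find the IP that visited Baidu most often (the counts can be kept in a dictionary). Since all occurrences of the same IP land in the same file, the per-file counts are exact. This yields 256 candidate IPs; the one with the largest count among them is the answer. Overall efficiency is O(N).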
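The partition-then-count steps above can be sketched as follows. This is a minimal sketch, not a definitive implementation: it assumes the log has already been reduced to one IP per line for the day in question, and the names `most_frequent_ip` and `workdir` are made up for illustration.

```python
import os
import tempfile
from collections import Counter


def most_frequent_ip(ips, workdir):
    """Return (ip, count) for the IP that occurs most often in `ips`.

    `ips` is any iterable of dotted IPv4 strings (assumed input format);
    `workdir` is a directory where the bucket files are written.
    """
    # Pass 1: append each IP to the bucket file named after its first number.
    buckets = {}
    try:
        for ip in ips:
            ip = ip.strip()
            if not ip:
                continue
            first = ip.split(".", 1)[0]
            if first not in buckets:
                path = os.path.join(workdir, first + ".txt")
                buckets[first] = open(path, "w")
            buckets[first].write(ip + "\n")
    finally:
        for f in buckets.values():
            f.close()

    # Pass 2: count one bucket at a time with a dictionary (Counter),
    # keeping only the best candidate seen so far.
    best = (None, 0)
    for first in buckets:
        with open(os.path.join(workdir, first + ".txt")) as f:
            counts = Counter(line.strip() for line in f)
        ip, n = counts.most_common(1)[0]
        if n > best[1]:
            best = (ip, n)
    return best


if __name__ == "__main__":
    sample = ["10.0.0.1", "10.0.0.2", "192.168.1.5",
              "10.0.0.1", "192.168.1.5", "10.0.0.1"]
    with tempfile.TemporaryDirectory() as d:
        print(most_frequent_ip(sample, d))
```

Only one bucket is held in memory at a time, which is the point of the split. If a single bucket is still too large, the same idea can be applied again using the second number of the IP.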
2. Assume there are currently 10 million records (the query strings are highly repetitive: although the total is 10 million, no more than 3 million remain once duplicates are removed. The more often a query string is repeated, the more users queried it and the more popular it is). Count the 10 most popular query strings, using no more than 1G of memory.
Solution: Use a min-heap of size 10 together with a trie. Each string record is inserted into the trie, and the value stored for it is its number of occurrences
(a string that was already seen during the scan simply has its count incremented). This structure is fast to search. While building the trie, maintain a min-heap of size 10: given a string and its frequency, if the frequency is larger than the top of the heap, the string goes into the heap; otherwise it is discarded. At the end, the contents of the min-heap are the result.
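A minimal sketch of the trie plus size-10 min-heap idea is shown below. It simplifies the description slightly by counting everything in the trie first and running the heap over the distinct strings afterwards, which avoids updating heap entries in place. The names `TrieNode`, `add`, `iter_counts` and `top_k` are illustrative, not taken from any library.

```python
import heapq


class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.count = 0      # number of records that end at this node


def add(root, s):
    """Insert one record, incrementing its count if already present."""
    node = root
    for ch in s:
        child = node.children.get(ch)
        if child is None:
            child = TrieNode()
            node.children[ch] = child
        node = child
    node.count += 1


def iter_counts(root):
    """Yield (string, count) for every distinct string stored in the trie."""
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if node.count:
            yield prefix, node.count
        for ch, child in node.children.items():
            stack.append((child, prefix + ch))


def top_k(queries, k=10):
    """Return the k most frequent strings as (count, string), largest first."""
    root = TrieNode()
    for q in queries:
        add(root, q)

    heap = []  # min-heap: heap[0] is the weakest of the current top k
    for s, c in iter_counts(root):
        if len(heap) < k:
            heapq.heappush(heap, (c, s))
        elif c > heap[0][0]:
            heapq.heapreplace(heap, (c, s))  # evict the current minimum
        # otherwise the string cannot be in the top k and is discarded
    return sorted(heap, reverse=True)


if __name__ == "__main__":
    print(top_k(["a", "b", "a", "c", "b", "a"], k=2))
```

In Python, a plain dictionary would do the counting just as well and with less overhead per string; the trie is kept here only to mirror the solution as stated.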
3. Find the integers that appear only once among 250 million integers; note that memory is not large enough to hold all 250 million integers.
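Classic bitmap problem. A bitmap is the fastest approach: keep 2 bits per possible value (00 = not seen, 01 = seen once, 10 = seen more than once), scan the integers once to update the states, then output every value whose state is 01. For 32-bit integers this takes 2^32 × 2 bits = 1 GB.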
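One common way to apply a bitmap here is to keep a 2-bit state per possible value. The sketch below assumes non-negative integers below a bound `universe` (a hypothetical parameter; 2^32 for unsigned 32-bit input) and uses a small bound so it runs quickly. The function name `find_unique` is made up for illustration.

```python
def find_unique(numbers, universe):
    """Return the values in [0, universe) that occur exactly once.

    Each value owns 2 bits in `bits`:
    0 = not seen, 1 = seen once, 2 = seen more than once.
    """
    bits = bytearray((universe * 2 + 7) // 8)

    for x in numbers:
        byte, shift = divmod(x * 2, 8)  # shift is 0, 2, 4 or 6
        state = (bits[byte] >> shift) & 0b11
        if state < 2:
            # Clear the 2-bit slot, then store state + 1.
            cleared = bits[byte] & ~(0b11 << shift) & 0xFF
            bits[byte] = cleared | ((state + 1) << shift)

    # Second pass over the bitmap: collect the values whose state is 1.
    result = []
    for x in range(universe):
        byte, shift = divmod(x * 2, 8)
        if (bits[byte] >> shift) & 0b11 == 1:
            result.append(x)
    return result


if __name__ == "__main__":
    print(find_unique([5, 3, 5, 7, 3, 9, 0], universe=16))
```

Memory depends only on the range of the values, not on how many integers are scanned. Signed input could be handled by adding an offset before indexing.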
Review of common interview algorithms for Big Data