I. Find the most frequent number among 2 billion 32-bit integers using only 2 GB of memory. Solution idea:
To find the number that occurs most often in a large set of integers, the usual approach is to count occurrences with a hash table whose key is the integer and whose value is its occurrence count. Here there are 2 billion numbers in total; even if a single number appeared all 2 billion times, a 32-bit integer could still hold that count without overflow. So each key takes 4 B and each value takes 4 B, for 8 B per record. In the worst case every number is distinct, giving 2 billion records that need at least 2 billion * 8 B = 16 GB > 2 GB.
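The estimate above can be checked with a few lines of arithmetic. This is only a sketch: the constant names are illustrative, decimal gigabytes are assumed, and the hash table's own bookkeeping overhead is ignored, so a real table would need even more memory.

```python
# Back-of-the-envelope memory estimate for one in-memory hash table.
TOTAL_NUMBERS = 2_000_000_000   # 2 billion integers in the input
KEY_BYTES = 4                   # 32-bit integer key
VALUE_BYTES = 4                 # 32-bit counter; 2 billion < 2**32, so no overflow
MEMORY_LIMIT = 2 * 10**9        # 2 GB budget (decimal GB assumed)

record_bytes = KEY_BYTES + VALUE_BYTES
worst_case_bytes = TOTAL_NUMBERS * record_bytes  # worst case: every number distinct

print(record_bytes)                     # 8
print(worst_case_bytes / 10**9)         # 16.0
print(worst_case_bytes > MEMORY_LIMIT)  # True
```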
The solution is to use a hash function to split the large file of 2 billion numbers into 16 small files. Because a hash function always maps the same input to the same output, all occurrences of a given number land in the same small file; the same number can never end up in two different files. With a reasonably uniform hash, each small file holds at most about 125 million distinct numbers, i.e. about 1 GB of records, which fits in 2 GB. We then use a hash table on each small file to count the occurrences of each number in it, which gives the most frequent number in each of the 16 files together with its count. Finally, we pick the one with the largest count among these 16 winners.
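The split-then-count procedure might look roughly like the sketch below. Several things here are assumptions made for illustration: the function names, the binary file layout (one little-endian unsigned 32-bit value per number), and the use of `n % NUM_PARTS` as the hash function. Skewed data may call for a stronger hash.

```python
import os
import struct
import tempfile
from collections import Counter

NUM_PARTS = 16  # number of small files, as in the text


def partition(numbers, directory):
    """Route every number to one of NUM_PARTS files; equal numbers share a file."""
    paths = [os.path.join(directory, f"part_{i}.bin") for i in range(NUM_PARTS)]
    files = [open(p, "wb") for p in paths]
    try:
        for n in numbers:
            idx = n % NUM_PARTS  # assumed hash: simple modulo
            files[idx].write(struct.pack("<I", n))
    finally:
        for f in files:
            f.close()
    return paths


def most_frequent_in_file(path):
    """Count one small file with a hash table; return (number, count) or None."""
    counts = Counter()
    with open(path, "rb") as f:
        while chunk := f.read(4):
            counts[struct.unpack("<I", chunk)[0]] += 1
    if not counts:
        return None
    return counts.most_common(1)[0]


def most_frequent(numbers):
    """Partition, count each part separately, then compare the per-part winners."""
    with tempfile.TemporaryDirectory() as directory:
        winners = [most_frequent_in_file(p) for p in partition(numbers, directory)]
    winners = [w for w in winners if w is not None]
    return max(winners, key=lambda w: w[1])


print(most_frequent([5, 7, 5, 21, 7, 5]))  # (5, 3)
```

Only one small file's hash table is in memory at a time, which is the point of the split. The sketch reads four bytes per call for clarity; a practical version would read in larger blocks, and it raises `ValueError` on empty input.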
II. Massive log data: extract the IP that visited a given website the most times on a given day
First, extract from the logs the IPs that accessed the specified website on that day and write them to one large file. Note that an IP is 32 bits, so there are at most 2^32 distinct IPs. The same mapping approach applies: for example, take each IP modulo 1000 to map the large file into 1000 small files, then find the most frequent IP in each small file along with its frequency (a hash_map can be used for the frequency counts, followed by a scan for the largest). Finally, among these 1000 per-file winners, the IP with the highest frequency is the answer.
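As a small illustration of the modulo mapping, the sketch below converts a dotted IPv4 string to its 32-bit integer value and takes it modulo 1000. The function name `bucket_for` and the assumption that the log stores IPs as dotted-quad text are illustrative, not taken from the original.

```python
import ipaddress

NUM_FILES = 1000  # modulus used in the text


def bucket_for(ip_text):
    """Map a dotted-quad IPv4 string to a small-file index in [0, NUM_FILES)."""
    ip_as_int = int(ipaddress.IPv4Address(ip_text))  # 32-bit value, 0 .. 2**32 - 1
    return ip_as_int % NUM_FILES


print(bucket_for("192.168.0.1"))  # 521
print(bucket_for("10.0.0.1"))     # 161
```

Because the index depends only on the IP itself, every log entry for the same IP is written to the same small file.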
Algorithm idea: divide and conquer + hash
1. There are at most 2^32 = 4G distinct IP addresses, so they cannot all be loaded into memory and processed at once;
2. Apply "divide and conquer": using the value of hash(IP) % 1024, distribute the massive IP log across 1024 small files. Each small file then contains at most about 4M (2^32 / 1024) distinct IP addresses;
3. For each small file, build a hash map with the IP as the key and its occurrence count as the value, while keeping track of the IP that currently has the most occurrences;
4. This yields the most frequent IP in each of the 1024 small files; the overall most frequent IP is then found among these 1024 candidates with an ordinary sorting algorithm.
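Steps 3 and 4 can be sketched as follows. The function names are illustrative, and each small file is represented here by an in-memory iterable of IP strings to keep the example short. For the final step, this sketch uses a single `max` pass in place of a full sort, since only the top entry is needed.

```python
def top_ip_in_bucket(ips):
    """Count one bucket's IPs and track the current leader while counting."""
    counts = {}
    best_ip, best_count = None, 0
    for ip in ips:
        counts[ip] = counts.get(ip, 0) + 1
        if counts[ip] > best_count:
            best_ip, best_count = ip, counts[ip]
    return best_ip, best_count


def top_ip_overall(buckets):
    """Compare the (ip, count) winner of every bucket and return the best one."""
    winners = [top_ip_in_bucket(bucket) for bucket in buckets]
    return max(winners, key=lambda winner: winner[1])


buckets = [
    ["1.1.1.1", "2.2.2.2", "1.1.1.1"],
    ["3.3.3.3"],
    [],  # an empty small file yields (None, 0) and never wins
]
print(top_ip_overall(buckets))  # ('1.1.1.1', 2)
```

Tracking the leader during the counting pass avoids a second scan over the hash map once a small file has been read.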
[Algorithm] One of the massive data problems