[Algorithm] One of the massive data problems

Source: Internet
Author: User
One. With only 2 GB of memory, find the number that occurs most often among 2 billion 32-bit integers. Solution outline:

To find the most frequent value among many integers, the usual approach is a hash table that counts occurrences: the key is the integer and the value is its count. Here there are 2 billion numbers in total. Even if a single number appeared all 2 billion times, a 32-bit integer could hold that count without overflow, since 2 billion is below 2^31 - 1 = 2,147,483,647. So the key takes 4 B and the value takes 4 B, and one record takes 8 B. In the extreme case where all 2 billion numbers are distinct, the table holds 2 billion records and needs 2 billion * 8 B = 16 GB > 2 GB, even before counting any overhead of the hash table itself.
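The arithmetic above can be checked with a few lines of Python. This is only a sketch of the estimate: the names `table_bytes` and `MEMORY_LIMIT` are made up for illustration, and the 8 B record size ignores whatever overhead a real hash table adds.

```python
RECORDS = 2_000_000_000          # 2 billion integers in the input
KEY_BYTES = 4                    # a 32-bit integer key
VALUE_BYTES = 4                  # a 32-bit occurrence counter
MEMORY_LIMIT = 2 * 1024**3       # 2 GB of available memory

def table_bytes(distinct_keys, key_bytes=KEY_BYTES, value_bytes=VALUE_BYTES):
    """Lower bound on hash table size: one key plus one counter per record."""
    return distinct_keys * (key_bytes + value_bytes)

# The largest possible count (2 billion) still fits in a signed 32-bit integer.
counter_fits = RECORDS <= 2**31 - 1

# Worst case: every input number is distinct, so there is one record per number.
worst_case = table_bytes(RECORDS)

print(counter_fits)               # whether a 32-bit counter is enough
print(worst_case / 10**9, "GB")   # estimated table size in decimal gigabytes
print(worst_case > MEMORY_LIMIT)  # whether the table exceeds the memory limit
```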

The solution is to use a hash function to split the large file of 2 billion numbers into 16 small files. Because a hash function is deterministic, the same number can never be hashed into two different small files, so all occurrences of a number end up in the same file. If the hash spreads distinct values evenly, each small file holds at most about 2 billion / 16 = 125 million distinct values, which is about 1 GB of records and fits within 2 GB. We then use a hash table on each small file to count the occurrences of each number in it, which gives the most frequent number in each of the 16 small files together with its count. Finally, we pick the one with the largest count among these 16 per-file winners.
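The split-then-count procedure can be sketched in Python as follows. This is a minimal illustration under assumptions, not the original author's code: the names `bucket_of` and `most_frequent` are hypothetical, the multiplicative constant is just one possible deterministic hash, and the small files are plain text files in a temporary directory.

```python
import os
import tempfile
from collections import Counter

def bucket_of(n, num_parts):
    """Deterministic hash: equal numbers always land in the same bucket."""
    return ((n * 2654435761) & 0xFFFFFFFF) % num_parts

def most_frequent(numbers, num_parts=16):
    """Return (number, count) for the most frequent value, or None if empty."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"part_{i}.txt") for i in range(num_parts)]

        # Pass 1: scatter the numbers into num_parts small files.
        files = [open(p, "w") for p in paths]
        try:
            for n in numbers:
                files[bucket_of(n, num_parts)].write(f"{n}\n")
        finally:
            for f in files:
                f.close()

        # Pass 2: count one small file at a time, keeping only its winner.
        best = None
        for p in paths:
            counts = Counter()
            with open(p) as f:
                for line in f:
                    counts[int(line)] += 1
            if counts:
                candidate = counts.most_common(1)[0]
                if best is None or candidate[1] > best[1]:
                    best = candidate
        return best
```

For example, `most_frequent([3, 1, 3, 2, 3, 1])` returns `(3, 3)`. Only one `Counter` is alive at a time, so peak memory is bounded by the number of distinct values in the largest small file rather than in the whole input.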

Two. From massive log data, extract the IP that visited a given website the most times on a given day

First, take the logs for that day, extract the IPs that accessed the specified website, and write them to one large file. Note that an IP (IPv4) address is 32 bits, so there are at most 2^32 distinct IPs. The same mapping approach applies: for example, take each IP modulo 1000 to map the large file into 1000 small files, then find the most frequent IP in each small file and its frequency (a hash_map can do the frequency counting, followed by a scan for the largest count). Finally, among these 1000 per-file winners, the IP with the highest frequency is the answer.
Algorithm idea: divide and conquer + hash

1. There are at most 2^32 = 4G (about 4.3 billion) distinct IP addresses, so they cannot all be loaded into memory for processing;
2. Apply the idea of "divide and conquer": according to the value of hash(IP) % 1024, store the massive IP log in 1024 small files. Each small file then contains at most about 4M distinct IP addresses (2^32 / 1024), provided the hash distributes addresses evenly;
3. For each small file, build a hash map with the IP as key and its number of occurrences as value, while keeping track of the IP address with the most occurrences so far;
4. This gives the most frequent IP in each of the 1024 small files; a simple scan over these 1024 candidates (or any ordinary sorting algorithm) then gives the overall most frequent IP.
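The four steps above can be sketched in Python. This is an illustration under stated assumptions: the function names `flush` and `top_ip` are hypothetical, each log line is assumed to start with an IPv4 address followed by whitespace (a real log format may differ), and the integer value of the address stands in for hash(IP). Lines are buffered per bucket and appended in batches so that the process never needs 1024 files open at once, which can exceed the open-file limit on some systems.

```python
import ipaddress
import os
import tempfile
from collections import Counter, defaultdict

def flush(buffers, tmp):
    """Append each bucket's buffered IPs to its small file, then clear."""
    for bucket, ips in buffers.items():
        with open(os.path.join(tmp, f"part_{bucket}.txt"), "a") as f:
            f.write("\n".join(ips) + "\n")
    buffers.clear()

def top_ip(log_lines, num_parts=1024, buffer_limit=100_000):
    """Return (ip, count) for the most frequent IP, or None if no IPs."""
    with tempfile.TemporaryDirectory() as tmp:
        # Step 2: partition by (integer value of IP) % num_parts.
        buffers = defaultdict(list)
        buffered = 0
        for line in log_lines:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            ip = fields[0]  # assumption: the IP is the first field
            bucket = int(ipaddress.IPv4Address(ip)) % num_parts
            buffers[bucket].append(ip)
            buffered += 1
            if buffered >= buffer_limit:
                flush(buffers, tmp)
                buffered = 0
        flush(buffers, tmp)

        # Steps 3 and 4: count each small file, keep the overall winner.
        best = None
        for name in os.listdir(tmp):
            counts = Counter()
            with open(os.path.join(tmp, name)) as f:
                for ip in f:
                    counts[ip.strip()] += 1
            ip, count = counts.most_common(1)[0]
            if best is None or count > best[1]:
                best = (ip, count)
        return best
```

A malformed address makes `ipaddress.IPv4Address` raise an exception; a production version would need to decide whether to skip or report such lines.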
