Given two files, a and b, each storing 5 billion URLs of 64 bytes each, and a memory limit of 4 GB, how can we find the URLs common to both files?

Source: Internet
Author: User

Each file is about 5 × 10^9 × 64 bytes = 320 GB (roughly 298 GiB), far more than the 4 GB of memory available, so neither file can be loaded into memory in full. A divide-and-conquer approach works instead.
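The size estimate can be checked with a few lines of arithmetic. This is only a sketch of the calculation; the constant names are illustrative, and the bucket count of 1000 is the one used in the partitioning step of this answer.

```python
# Back-of-the-envelope sizing for the problem as stated.
URL_COUNT = 5_000_000_000      # URLs per file
URL_BYTES = 64                 # bytes per URL
MEMORY_LIMIT = 4 * 10**9       # 4 GB memory limit
NUM_BUCKETS = 1000             # number of small files per input file

file_bytes = URL_COUNT * URL_BYTES           # total size of one input file
bucket_bytes = file_bytes // NUM_BUCKETS     # average size of one small file

print(f"one input file : {file_bytes / 10**9:.0f} GB ({file_bytes / 2**30:.0f} GiB)")
print(f"one small file : {bucket_bytes / 10**6:.0f} MB on average")
print(f"fits in memory : {bucket_bytes < MEMORY_LIMIT}")
```

In decimal units one file comes to 320 GB; the figure is a little under 300 when counted in GiB.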
Traverse file a, compute hash(url) % 1000 for each URL, and append the URL to one of 1000 small files (a0, a1, ..., a999) according to that value. Each small file then holds about 320 MB on average (320 GB / 1000). Traverse file b in the same way, with the same hash function, to produce 1000 small files (b0, b1, ..., b999). Because identical URLs always produce the same hash value, any URL present in both files must land in a pair of small files with the same index (a0 vs b0, a1 vs b1, ..., a999 vs b999); small files with different indices (such as a0 vs b99) cannot share a URL. The problem therefore reduces to finding the common URLs within each of the 1000 pairs.
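The partitioning step might look like the following minimal sketch. It assumes one URL per line in the input files, which the original does not specify, and the names bucket_of and partition are hypothetical. zlib.crc32 stands in for hash(url) because Python's built-in hash() for strings and bytes is randomized per process, so it would not give the same bucket across separate runs.

```python
import os
import zlib


def bucket_of(url: bytes, num_buckets: int) -> int:
    """Map a URL to a bucket index; the same URL always gets the same index."""
    return zlib.crc32(url) % num_buckets


def partition(src_path: str, out_dir: str, prefix: str, num_buckets: int = 1000) -> None:
    """Stream src_path once and append each URL to out_dir/<prefix><index>."""
    os.makedirs(out_dir, exist_ok=True)
    outs = [open(os.path.join(out_dir, f"{prefix}{i}"), "wb")
            for i in range(num_buckets)]
    try:
        with open(src_path, "rb") as src:
            for line in src:                 # one line in memory at a time
                url = line.rstrip(b"\r\n")
                if url:                      # skip blank lines
                    outs[bucket_of(url, num_buckets)].write(url + b"\n")
    finally:
        for f in outs:
            f.close()
```

Calling partition("a.txt", "parts", "a") and then partition("b.txt", "parts", "b") would produce parts/a0 ... parts/a999 and parts/b0 ... parts/b999 (the file names here are placeholders). Keeping 1000 files open at once comes close to the default per-process file descriptor limit on some systems, so that limit may need raising.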
For example, for the pair a0 and b0, traverse a0 and store each URL in an in-memory hash table (hash_map). Then traverse b0: if a URL is found in the hash table, it exists in both a and b, so write it to the output file.
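Processing one pair of small files can be sketched as below, using a Python set as the hash table. The function name common_urls and the one-URL-per-line layout are assumptions for illustration. The set holds only the URLs of one small file, although Python's per-object overhead makes the in-memory size noticeably larger than the file size.

```python
def common_urls(a_path: str, b_path: str, out_path: str) -> int:
    """Append URLs found in both small files to out_path; return how many."""
    # Load every URL of the first small file into a hash set.
    with open(a_path, "rb") as fa:
        seen = {line.rstrip(b"\r\n") for line in fa}
    seen.discard(b"")                        # ignore blank lines

    count = 0
    # Stream the second small file and probe the set for each URL.
    with open(b_path, "rb") as fb, open(out_path, "ab") as out:
        for line in fb:
            url = line.rstrip(b"\r\n")
            if url in seen:
                out.write(url + b"\n")
                seen.remove(url)             # report each common URL only once
                count += 1
    return count
```

Running this for every index from 0 to 999 with the same out_path collects all common URLs in one file, since the output is opened in append mode.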
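If the split is uneven and some small files are too large (for example, larger than 2 GB), split those files again in a similar way. The second split needs a different hash function, since every URL in such a file already has the same value of hash(url) % 1000, and it must be applied to the matching small file from the other input as well so that the pairs still line up.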
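A second-level split might be sketched as follows. The 2 GB threshold comes from the text; the helper names, the sub-bucket count of 16, and the choice of a salted BLAKE2b hash from hashlib as the second hash function are assumptions for illustration. The second hash has to be independent of the one used for the first split, otherwise the oversized file would not spread out.

```python
import hashlib
import os


def needs_resplit(path: str, limit_bytes: int = 2 * 10**9) -> bool:
    """True if a small file is larger than the chosen threshold."""
    return os.path.getsize(path) > limit_bytes


def sub_bucket_of(url: bytes, num_sub: int, salt: bytes = b"level2") -> int:
    """Second-level bucket index, computed with a salted BLAKE2b hash."""
    digest = hashlib.blake2b(url, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big") % num_sub


def resplit(path: str, num_sub: int = 16) -> list:
    """Split one oversized small file into path_0 ... path_<num_sub-1>."""
    parts = [f"{path}_{j}" for j in range(num_sub)]
    outs = [open(p, "wb") for p in parts]
    try:
        with open(path, "rb") as src:
            for line in src:
                url = line.rstrip(b"\r\n")
                if url:
                    outs[sub_bucket_of(url, num_sub)].write(url + b"\n")
    finally:
        for f in outs:
            f.close()
    return parts
```

If a7 is too large, both a7 and b7 would be passed through resplit with the same num_sub, and the pairs a7_0 vs b7_0, a7_1 vs b7_1, and so on are then compared. A file that is large because a single URL repeats many times will not shrink under any hash, so such a case needs separate handling, such as removing duplicates first.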

A Baidu interviewer asked this question yesterday, so I studied it today.
