Given two files, a and b, each storing 5 billion URLs, where each URL occupies 64 bytes and the memory limit is 4 GB, how can we find the URLs common to both files?
Each file is roughly 5 billion × 64 bytes ≈ 320 GB, far larger than the 4 GB memory limit, so neither file can be loaded into memory in full. Consider a divide-and-conquer approach.
Traverse file a and, for each URL, compute hash(url) % 1000; write the URL into one of 1000 small files (call them a0, a1, ..., a999) according to that value. Each small file is then about 320 MB. Traverse file b and scatter its URLs into 1000 small files (b0, b1, ..., b999) using the same hash function. After this step, any URL that appears in both files must land in a corresponding pair of small files (a0 with b0, a1 with b1, ..., a999 with b999); non-corresponding pairs (such as a0 and b99) cannot share a URL. We therefore only need to find the common URLs within each of the 1000 pairs of small files, as sketched below.
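A minimal sketch of the partitioning step, assuming the URLs are stored one per line in plain-text files named `a` and `b`; the `buckets/` directory, the `partition` helper, and the choice of MD5 are all illustrative, not part of the original problem:

```python
import hashlib
import os

NUM_BUCKETS = 1000

def bucket_of(url: str) -> int:
    # MD5 is stable across runs, unlike Python's built-in hash(), so the same
    # URL always lands in the same bucket for both input files.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def partition(input_path: str, prefix: str, out_dir: str = "buckets") -> None:
    """Stream one large URL file (one URL per line) into NUM_BUCKETS small files."""
    os.makedirs(out_dir, exist_ok=True)
    # Keeps 1000 output files open at once; raise the OS file-descriptor limit if needed.
    outputs = [open(os.path.join(out_dir, f"{prefix}{i}"), "w") for i in range(NUM_BUCKETS)]
    try:
        with open(input_path) as f:
            for line in f:
                url = line.strip()
                if url:
                    outputs[bucket_of(url)].write(url + "\n")
    finally:
        for out in outputs:
            out.close()

partition("a", "a")  # writes buckets/a0 ... buckets/a999
partition("b", "b")  # writes buckets/b0 ... buckets/b999
```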
For a pair such as a0 and b0, traverse a0 and insert its URLs into an in-memory hash table (hash_map). Then traverse b0: whenever a URL is found in the table, it exists in both a and b, so write it to the output file. A single small file holds only about 320 MB of URLs, which fits comfortably within the 4 GB limit.
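A sketch of the per-pair matching step, continuing with the bucket layout assumed above; the `find_common` helper and the output file name are illustrative:

```python
def find_common(path_a: str, path_b: str, out) -> None:
    """Load one small file from a into a set, then stream its partner from b."""
    with open(path_a) as f:
        urls_a = {line.strip() for line in f if line.strip()}
    with open(path_b) as f:
        for line in f:
            url = line.strip()
            if url in urls_a:
                out.write(url + "\n")
                urls_a.discard(url)  # report each common URL only once

with open("common_urls.txt", "w") as out:
    for i in range(1000):
        find_common(f"buckets/a{i}", f"buckets/b{i}", out)
```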
If the hash split is uneven and some small files end up too large (for example, larger than 2 GB), those oversized files can be split again in the same way, using a different hash function or more buckets, until every piece fits in memory; see the sketch below.
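A sketch of that second-level split under the same assumptions; the salt string, the sub-bucket count, the example pair a7/b7, and the `repartition` name are all hypothetical:

```python
import hashlib
import os

def repartition(path: str, prefix: str, sub_buckets: int = 100, out_dir: str = "buckets") -> None:
    """Re-split one oversized bucket file into smaller second-level buckets."""
    outputs = [open(os.path.join(out_dir, f"{prefix}_{i}"), "w") for i in range(sub_buckets)]
    try:
        with open(path) as f:
            for line in f:
                url = line.strip()
                if url:
                    # A salted hash gives a split independent of the first-level bucketing.
                    digest = hashlib.md5(("level2:" + url).encode("utf-8")).hexdigest()
                    outputs[int(digest, 16) % sub_buckets].write(url + "\n")
    finally:
        for out in outputs:
            out.close()

# Re-split both halves of a skewed pair with the same salted hash, then match
# buckets/a7_0 with buckets/b7_0, buckets/a7_1 with buckets/b7_1, and so on.
repartition("buckets/a7", "a7")
repartition("buckets/b7", "b7")
```

Both halves of a pair must be re-split with the same hash, otherwise matching URLs could end up in different sub-buckets.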
(A Baidu interviewer asked this question yesterday, so it seemed worth studying today.)