The size of each file can be estimated as 5 billion URLs × 64 bytes ≈ 320 GB, far larger than the 4 GB of available memory, so neither file can be loaded entirely into memory. Consider a divide-and-conquer approach.
Traverse file a; for each URL, compute hash(URL) % 1000 and write the URL into one of 1000 small files (call them a0, a1, ..., a999). Each small file will then be roughly 320 MB. Traverse file b and distribute its URLs into 1000 small files (b0, b1, ..., b999) in the same way. After this step, any URL that appears in both files must land in a corresponding pair of small files (a0 with b0, a1 with b1, ..., a999 with b999), because the same URL always hashes to the same bucket; non-corresponding pairs (such as a0 and b99) cannot share any URL. So we only need to find the common URLs within each of the 1000 pairs of small files. A sketch of the splitting step is shown below.
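Here is a minimal sketch of the splitting step, assuming one URL per line and that the input and bucket file names ("a", "a_0" ... "a_999", etc.) are placeholders; std::hash stands in for the Hash(URL) mentioned above.

```cpp
#include <fstream>
#include <functional>
#include <string>
#include <vector>

// Distribute each URL of input_path into one of num_buckets files,
// chosen by hash(url) % num_buckets.
void split_into_buckets(const std::string& input_path,
                        const std::string& prefix,
                        std::size_t num_buckets = 1000) {
    std::ifstream in(input_path);
    std::vector<std::ofstream> buckets(num_buckets);
    for (std::size_t i = 0; i < num_buckets; ++i) {
        // Keeping 1000 files open at once may require raising the
        // OS file-descriptor limit.
        buckets[i].open(prefix + "_" + std::to_string(i));
    }
    std::hash<std::string> hasher;
    std::string url;
    while (std::getline(in, url)) {
        // The same URL always hashes to the same bucket index, so matching
        // URLs from file a and file b end up in corresponding bucket files.
        buckets[hasher(url) % num_buckets] << url << '\n';
    }
}

int main() {
    split_into_buckets("a", "a");  // produces a_0 ... a_999
    split_into_buckets("b", "b");  // produces b_0 ... b_999
}
```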
For example, for the pair a0 and b0, we can traverse a0 and store its URLs in a hash_map, then traverse b0; whenever a URL from b0 is found in the hash_map, it exists in both a and b, so we write it to the output file.
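A minimal sketch of comparing one bucket pair, under the same assumptions as above (one URL per line, placeholder file names); std::unordered_set plays the role of the hash_map, since we only need membership tests.

```cpp
#include <fstream>
#include <string>
#include <unordered_set>

// Write to `out` every URL that appears in both bucket files.
void intersect_pair(const std::string& a_path,
                    const std::string& b_path,
                    std::ofstream& out) {
    std::unordered_set<std::string> seen;

    // Load the a-side bucket into memory (~320 MB easily fits in 4 GB).
    std::ifstream a_in(a_path);
    std::string url;
    while (std::getline(a_in, url)) {
        seen.insert(url);
    }

    // Probe with the b-side bucket; a hit means the URL is in both a and b.
    std::ifstream b_in(b_path);
    while (std::getline(b_in, url)) {
        if (seen.erase(url)) {  // erase returns 1 on a hit and also
            out << url << '\n'; // prevents duplicate output lines
        }
    }
}

int main() {
    std::ofstream out("common_urls");  // placeholder output file name
    for (int i = 0; i < 1000; ++i) {
        intersect_pair("a_" + std::to_string(i),
                       "b_" + std::to_string(i), out);
    }
}
```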
If the split is uneven and some small files are still too large to fit in memory (say, larger than 2 GB), those oversized files can be split again in the same way, using a different hash function or modulus (re-applying the original hash % 1000 would put every URL back into a single bucket).
Based on a blog post whose original source I could not trace...