First of all, before tackling big-data interview questions, a few basic concepts should be clear:
(1) 1 GB ≈ 10^9 bytes (about 1 billion bytes): 1 GB = 1024 MB, 1 MB = 1024 KB, 1 KB = 1024 bytes;
(2) The basic approach is to decompose the big problem, solve the small problems, and pick the global optimum from the local optima. (Of course, if the data fits in memory, solve it there directly without any decomposition.)
(3) A common way to decompose: hash(x) % m, where x is the string / url / ip and m is the number of small problems; for example, to split a large file into 1000 pieces, take m = 1000;
(4) Auxiliary data structures for these problems: hash_map, Trie (prefix tree), bitmap, balanced binary search tree (AVL, SBT, red-black tree);
(5) Top-K problems: to find the largest K elements use a min-heap, and to find the smallest K use a max-heap. (Why? Work through a small example on paper, or see the sketch after this list.)
(6) Common sorting algorithms for big data: quick sort / heap sort / merge sort / bucket sort.
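To see why a min-heap of size K yields the K largest elements, here is a minimal Python sketch (the function name and sample data are only illustrative):

```python
import heapq

def top_k_largest(stream, k):
    """Keep the k largest items seen so far in a min-heap of size k."""
    heap = []
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            # The root is the smallest of the current top-k candidates;
            # anything larger evicts it, anything smaller is dropped in O(1).
            heapq.heapreplace(heap, x)
    return sorted(heap, reverse=True)

print(top_k_largest([5, 1, 9, 3, 7, 8, 2], 3))  # [9, 8, 7]
```

The heap root is always the smallest of the K current candidates, so each of the N elements costs at most O(log K), giving O(N log K) time with only K items held in memory.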
Here are a few examples (no question has a unique solution; only one of many possible approaches is shown for each):
1. Files a and b each store 5 billion urls, each url is 64 bytes, and the memory limit is 4 GB. Find the urls common to files a and b.
Each file holds 5 billion urls of 64 bytes each, so each file is about 5 × 10^9 × 64 bytes ≈ 320 GB, far larger than the 4 GB memory limit; the files cannot be loaded into memory in full, so we use divide and conquer.
Step 1: Traverse file a, compute hash(url) % 1000 for each url, and append the url to one of 1000 small files according to that value (denoted a0, a1, ..., a999; each small file is about 320 MB);
Step 2: Traverse file b and distribute its urls into 1000 small files in the same way (denoted b0, b1, ..., b999);
Key point: after this partitioning, any url common to both inputs must land in a pair of small files with the same index (a0 vs b0, a1 vs b1, ..., a999 vs b999), because identical urls hash to the same value; small files with different indices cannot contain the same url. So we only need to find the common urls within each of the 1000 pairs.
Step 3: To find the common urls of a pair ai and bi, load the urls of ai into a hash_set/hash_map, then iterate over the urls of bi and check whether each one is in the set just built; if it is, it is a common url and can be written to the output file. A sketch follows.
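A compact Python sketch of Steps 1-3, assuming the urls sit one per line in files named a and b (the file names, the stable-hash helper, and the partition count are assumptions for illustration):

```python
import hashlib

M = 1000  # number of small-file pairs

def bucket(url, m=M):
    # Stable hash: the same url always maps to the same index
    # (Python's built-in hash() is salted per process).
    return int(hashlib.md5(url.encode()).hexdigest(), 16) % m

def partition(path, prefix):
    """Steps 1/2: split a huge url file into M small files by hash(url) % M."""
    # A real implementation might open buckets in batches to respect
    # OS file-handle limits; kept simple here.
    outs = [open(f"{prefix}{i}", "w") for i in range(M)]
    with open(path) as f:
        for line in f:
            outs[bucket(line.strip())].write(line)
    for o in outs:
        o.close()

def common_urls(out_path="common.txt"):
    """Step 3: intersect each pair (a_i, b_i) with an in-memory set."""
    partition("a", "a_part")
    partition("b", "b_part")
    with open(out_path, "w") as out:
        for i in range(M):
            with open(f"a_part{i}") as fa:
                seen = set(line.strip() for line in fa)  # ~320 MB, fits in 4 GB
            with open(f"b_part{i}") as fb:
                for line in fb:
                    if line.strip() in seen:
                        out.write(line)
```

Using hashlib instead of the built-in hash() keeps bucket assignment stable across runs; within a single run either works, since a and b are partitioned by the same process.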
2. There is a 1 GB file with one word per line; each word is at most 16 bytes and the memory limit is 1 MB. Return the 100 most frequent words.
Step 1: Read the file sequentially and, for each word x, compute hash(x) % 5000, then append the word to one of 5000 small files (denoted f0, f1, ..., f4999) according to that value, so each file is about 200 KB. If some file still exceeds the 1 MB limit, keep splitting it the same way until no small file is larger than 1 MB.
Step 2: For each small file, count the words it contains and their frequencies (a trie or hash_map works), take the 100 most frequent words (a min-heap of 100 nodes works), and write those 100 words with their frequencies to a result file; this yields 5000 result files;
Step 3: Merge these 5000 result files (similar to merge sort), as sketched below.
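A hedged Python sketch of Steps 2-3, assuming the small files from Step 1 are named f0 ... f4999 and hold one word per line (the file names and helper functions are illustrative, not part of the original statement):

```python
import heapq
from collections import Counter

K = 100
NUM_FILES = 5000  # f0 ... f4999 produced by Step 1

def top_k_of_file(path, k=K):
    """Count word frequencies in one small file and return its top k."""
    counts = Counter()
    with open(path) as f:
        for word in f:
            counts[word.strip()] += 1
    # nlargest maintains a min-heap of size k internally
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

def global_top_k():
    """Merge the per-file top-k lists into the overall top 100."""
    # Exact because Step 1 partitions by hash(word): every occurrence of
    # a word falls into the same small file, so each per-file count is
    # that word's true global count.
    merged = []
    for i in range(NUM_FILES):
        merged.extend(top_k_of_file(f"f{i}"))
    return heapq.nlargest(K, merged, key=lambda kv: kv[1])
```

Note that merging per-file top-100 lists is only correct because the hash partitioning sends all occurrences of a word to the same file.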
3. Massive log data is stored in one huge file that cannot be read into memory directly; extract the IP that accessed Baidu most often on a given day.
Step 1: Extract the IPs that accessed Baidu from that day's log data and write them, one by one, into a large file;
Step 2: Note that an IP is 32 bits, so there are at most 2^32 distinct IPs. Use the same mapping trick, e.g. hash(ip) % 1000, to split the large file into 1000 small files;
Step 3: Find the most frequent IP in each small file (use a hash_map for the frequency counts, then take the maximum) together with its frequency;
Step 4: Among these 1000 candidate IPs, the one with the highest frequency is the answer; a sketch follows.
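A minimal Python sketch of Steps 2-4, assuming the IPs extracted in Step 1 sit one per line in a file named ips.txt (the file names and partition count are assumptions for illustration):

```python
import hashlib
from collections import Counter

M = 1000

def bucket(ip, m=M):
    # Stable hash so every occurrence of an IP lands in the same small file.
    return int(hashlib.md5(ip.encode()).hexdigest(), 16) % m

def split_ips(path):
    """Step 2: map the big IP file into M small files."""
    outs = [open(f"ip_part{i}", "w") for i in range(M)]
    with open(path) as f:
        for line in f:
            outs[bucket(line.strip())].write(line)
    for o in outs:
        o.close()

def most_frequent_ip(path="ips.txt"):
    """Steps 3-4: take each file's winner, then the global winner."""
    split_ips(path)
    best_ip, best_count = None, 0
    for i in range(M):
        with open(f"ip_part{i}") as f:
            counts = Counter(line.strip() for line in f)
        if counts:
            ip, cnt = counts.most_common(1)[0]
            if cnt > best_count:
                best_ip, best_count = ip, cnt
    return best_ip, best_count
```

Because partitioning is by hash(ip), every occurrence of a given IP ends up in the same small file, so each per-file count is that IP's true total and comparing the 1000 winners gives the global answer.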