Big data processing interview problems: a summary

Source: Internet
Author: User

1. Given two files a and b, each storing 5 billion URLs, with each URL taking 64 bytes and a memory limit of 4 GB, how do you find the URLs common to a and b?

Scenario 1: The size of each file can be estimated as 5 billion × 64 bytes = 320 GB, far larger than the 4 GB memory limit, so the files cannot be fully loaded into memory. Consider a divide-and-conquer approach.

Traverse file a, compute a hash of each URL, and distribute the URLs into 1000 small files (call them a0, a1, ..., a999) according to the hash value modulo 1000. Each small file is then about 300 MB.

Traverse file b and store its URLs into 1000 small files (b0, b1, ..., b999) in the same way. After this step, all URLs that could possibly be equal end up in corresponding pairs of small files (ai, bi), and non-corresponding small files cannot share any URL. So we only need to find the common URLs within each of the 1000 pairs of small files.

To find the common URLs in a pair of small files, load all URLs of one file into a hash_set, then iterate over each URL of the other file and check whether it is in that hash_set; if it is, it is a common URL and can be written to an output file.
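A minimal sketch of this divide-and-conquer step in Python (the file layout, the a_part/b_part prefixes, and the use of Python's built-in hash are illustrative assumptions; any stable hash over URLs would do):

```python
def partition(path, prefix, buckets=1000):
    """Split a large URL file (one URL per line) into `buckets` small files by hash."""
    outs = [open(f"{prefix}_{i}", "w") for i in range(buckets)]
    with open(path) as f:
        for line in f:
            url = line.strip()
            outs[hash(url) % buckets].write(url + "\n")
    for out in outs:
        out.close()

def common_urls(path_a, path_b, buckets=1000):
    partition(path_a, "a_part", buckets)
    partition(path_b, "b_part", buckets)
    common = []
    for i in range(buckets):
        # Only the pair (a_part_i, b_part_i) can contain the same URL,
        # and each pair is small enough to fit in memory.
        with open(f"a_part_{i}") as fa:
            seen = {line.strip() for line in fa}      # the "hash_set" of one side
        with open(f"b_part_{i}") as fb:
            common.extend(u for u in (line.strip() for line in fb) if u in seen)
    return common
```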

Scenario 2: If a certain error rate is allowed, a Bloom filter can be used: 4 GB of memory can represent roughly 34 billion bits. Map the URLs of one file into these 34 billion bits with the Bloom filter, then read the URLs of the other file one by one and test them against the Bloom filter; any URL that tests positive should be a common URL (note that there is a certain error rate from false positives).
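A toy Bloom filter illustrating the idea (the bit-array size, the number of hash functions, and the md5-based hashing are arbitrary choices for this sketch, not tuned parameters):

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions from one item by salting the hash input.
        for i in range(self.num_hashes):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all((self.bits[pos // 8] >> (pos % 8)) & 1 for pos in self._positions(item))

# Usage: insert every URL of file a, then test each URL of file b;
# a positive answer may be a false positive, matching the stated error rate.
bf = BloomFilter(num_bits=10_000_000)   # ~34 billion bits in the real setting
bf.add("http://example.com/x")
print(bf.might_contain("http://example.com/x"))   # True
print(bf.might_contain("http://example.com/y"))   # almost certainly False
```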

2. There are 10 files of 1 GB each; every line of every file stores a user query, and the queries in each file may repeat. You are asked to sort the queries by their frequency.

Programme 1:

Read the 10 files sequentially and write each query into one of 10 new files (call them a0, a1, ..., a9) according to hash(query) % 10. Each newly generated file is about 1 GB in size (assuming the hash function is roughly uniform).

Find a machine with around 2 GB of memory and use a hash_map(query, query_count) to count the occurrences of each query in each file. Then sort by occurrence count using quick sort, heap sort, or merge sort, and output the sorted queries with their query_count to a file. This yields 10 sorted files.

Finally, merge the 10 sorted files (internal sorting combined with an external merge).
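A sketch of the final merge step, assuming each of the 10 intermediate files has been written as one "count<TAB>query" pair per line, already sorted by descending count (the file names and format are assumptions made for illustration):

```python
import heapq

def read_sorted(path):
    """Yield (count, query) pairs from a file sorted by descending count."""
    with open(path) as f:
        for line in f:
            count, query = line.rstrip("\n").split("\t", 1)
            yield int(count), query

def merge_by_frequency(paths, out_path):
    streams = [read_sorted(p) for p in paths]
    with open(out_path, "w") as out:
        # heapq.merge keeps only one pending line per input file in memory.
        for count, query in heapq.merge(*streams, key=lambda cq: -cq[0]):
            out.write(f"{count}\t{query}\n")

# merge_by_frequency([f"sorted_{i}.txt" for i in range(10)], "all_sorted.txt")
```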

Programme 2:

In general the total number of distinct queries is limited, but the number of repetitions is high, so it is likely that all distinct queries can be held in memory at once. In that case we can use a trie tree or hash_map directly to count the number of times each query appears, and then sort the queries by count with a quick/heap/merge sort.

Programme 3:

Similar to Programme 1, but after hashing the queries into multiple files, the files can be handed to multiple machines and processed with a distributed architecture (such as MapReduce), with a final merge step.

3. There is a 1 GB file in which each line is a word; no word exceeds 16 bytes, and the memory limit is 1 MB. Return the 100 most frequent words.

Scenario 1: Read the file sequentially; for each word x, compute hash(x) % 5000 and write x to the corresponding one of 5000 small files. Each file is then about 200 KB. If some of these files exceed the 1 MB limit, keep splitting them in the same way until no piece is larger than 1 MB. For each small file, count the words it contains and their frequencies (a trie tree or hash_map can be used), take the 100 most frequent words (a min-heap with 100 nodes works), and write those 100 words and their frequencies to a file, giving 5000 result files. The last step is to merge these 5000 files (similar to a merge sort).
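A sketch of the per-small-file step, using a hash map (Counter) in place of the trie and heapq.nlargest standing in for the bounded 100-element min-heap; the file format (one word per line) is an assumption:

```python
import heapq
from collections import Counter

def top_100_words(path):
    counts = Counter()
    with open(path) as f:
        for line in f:
            counts[line.strip()] += 1
    # nlargest maintains a 100-element min-heap internally, as in the description.
    return heapq.nlargest(100, counts.items(), key=lambda kv: kv[1])

# Run this on each of the 5000 small files, write out the 100 (word, count)
# pairs per file, then merge the partial results to get the global top 100.
```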

4. From massive log data, extract the IP that visited Baidu the most times on a given day.

Scenario 1: First, take that day's logs, extract the IPs that accessed Baidu, and write them one by one to a large file. Note that an IP is 32 bits, so there are at most 2^32 distinct IPs. We can then use a mapping method, for example taking each IP modulo 1000, to map the whole large file into 1000 small files, and find the most frequent IP in each small file (a hash_map can be used for the frequency statistics, after which the most frequent few are picked out) together with its frequency. Finally, among these 1000 candidate IPs, find the one with the largest frequency; that is the answer.

5. Find the integers that appear only once among 250 million integers; there is not enough memory to hold all 250 million integers.

Scenario 1: Use a 2-bit bitmap (allocate 2 bits per possible value: 00 means not seen, 01 means seen once, 10 means seen more than once, 11 is unused). This needs 2^32 × 2 bits = 1 GB of memory in total, which is acceptable. Then scan the 250 million integers and update the corresponding 2-bit entries: 00 becomes 01, 01 becomes 10, and 10 stays unchanged. After the scan, output the integers whose entry is 01.
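A small sketch of the 2-bit bitmap bookkeeping (the value_range parameter is illustrative; with the full 32-bit range the table is the 1 GB mentioned above):

```python
def find_unique(numbers, value_range=2**32):
    # 2 bits per possible value: 00 = unseen, 01 = seen once, 10 = seen more than once.
    table = bytearray(value_range // 4 + 1)   # four 2-bit entries per byte
    for x in numbers:
        byte, slot = divmod(x, 4)
        shift = slot * 2
        state = (table[byte] >> shift) & 0b11
        if state == 0b00:
            table[byte] |= 0b01 << shift                                   # 00 -> 01
        elif state == 0b01:
            table[byte] = (table[byte] & ~(0b11 << shift)) | (0b10 << shift)  # 01 -> 10
        # state 10 stays 10
    return [v for v in range(value_range)
            if (table[v // 4] >> ((v % 4) * 2)) & 0b11 == 0b01]

print(find_unique([1, 2, 2, 3, 3, 3, 7], value_range=16))   # [1, 7]
```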

Scenario 2: A small-file splitting approach similar to question 1 can also be used. Find the non-repeated integers in each small file and sort them, then merge the results, taking care to remove duplicate elements.

6. Massive data is distributed across 100 computers; think of an efficient way to compute the TOP10 of this whole batch of data.

Programme 1:

On each computer, find its local TOP10. This can be done with a heap of 10 elements (use a max-heap for the 10 smallest, a min-heap for the 10 largest). For the 10 largest, for example, first take the first 10 elements and build a min-heap; then scan the remaining data, comparing each element with the heap top. If an element is larger than the heap top, replace the top with it and re-heapify. The elements left in the heap at the end are that machine's TOP10.

After each computer has produced its TOP10, gather the 100 computers' TOP10 lists, 1000 numbers in total, and apply the same method to find the overall TOP10, as shown in the sketch below.
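A single-process sketch of this two-level TOP10, with each "machine" simulated as a plain list and heapq.nlargest standing in for the hand-rolled 10-element min-heap; the random test data is only for illustration:

```python
import heapq
import random

def local_top10(numbers):
    return heapq.nlargest(10, numbers)     # keeps a 10-element min-heap internally

def global_top10(per_machine_data):
    candidates = []
    for numbers in per_machine_data:       # in reality, each machine computes this locally
        candidates.extend(local_top10(numbers))
    return heapq.nlargest(10, candidates)  # 100 * 10 = 1000 candidates in total

machines = [[random.randint(0, 10**9) for _ in range(10_000)] for _ in range(100)]
print(global_top10(machines))
```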

7. How do you find the element with the most repetitions in massive data?

Scenario 1: Hash first, then map the data into small files by taking the hash modulo some number; find the most repeated element in each small file and record its count. Then, among the candidates from the previous step, the one with the largest count is the answer (see the previous questions for details).

8. Tens of millions or hundreds of millions of data items (with duplicates); find the n most frequent items among them.

Scenario 1: Tens of millions or hundreds of millions of items should fit in the memory of current machines. So consider using a hash_map, binary search tree, or red-black tree to count occurrences, and then take the n items with the highest counts, which can be done with the heap mechanism mentioned in question 6.

9. There are 10 million strings, some of which are duplicates. You need to remove all the duplicates so that no duplicate strings remain. How do you design and implement this?

Option 1: A trie tree is well suited to this problem; a hash_map should also work.

10. A text file has about 10,000 lines, one word per line. Count the 10 most frequently occurring words; give your idea and a time complexity analysis.

Scenario 1: This question is about time efficiency. Use a trie tree to count the number of occurrences of each word; the time complexity is O(n·le) (le denotes the average word length). Then find the 10 most frequent words, which can be done with a heap as mentioned in the previous questions, with time complexity O(n·lg10). The total time complexity is the larger of O(n·le) and O(n·lg10).
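A sketch of the trie-plus-heap idea (the trie here is a minimal dictionary-of-children implementation; in practice a hash map would work just as well for 10,000 lines):

```python
import heapq

class TrieNode:
    __slots__ = ("children", "count")
    def __init__(self):
        self.children = {}
        self.count = 0

def count_words(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:                          # O(le) per word
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1
    # Collect (word, count) pairs by walking the trie.
    pairs, stack = [], [("", root)]
    while stack:
        prefix, node = stack.pop()
        if node.count:
            pairs.append((prefix, node.count))
        stack.extend((prefix + ch, child) for ch, child in node.children.items())
    return pairs

def top10(words):
    return heapq.nlargest(10, count_words(words), key=lambda wc: wc[1])   # O(n * lg 10)

print(top10(["the", "a", "the", "of", "a", "the"]))   # [('the', 3), ('a', 2), ('of', 1)]
```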

11. Find the 10 most frequent words in a text file, but this time the file is very long, say hundreds of millions or a billion lines; in short, it cannot be read into memory. What is the best solution?

Scenario 1: First decompose the file into a number of small files by hashing and taking the modulus. For each small file, use the method above to find its 10 most frequent words. Then merge the partial results to find the final 10 most frequent words.

12. Find the largest 100 numbers among 1,000,000 numbers.

Scenario 1: As mentioned in the previous questions, this can be done with a min-heap containing 100 elements. The complexity is O(1,000,000 × lg100).

Scenario 2: Use the idea of quick sort: after each partition, only recurse into the part larger than the pivot; once the part larger than the pivot contains only slightly more than 100 elements, sort it with a traditional sorting algorithm and take the first 100. The complexity is O(1,000,000 × 100).

Programme 3: Partial elimination. Take the first 100 elements and sort them; call this sequence L. Then scan the remaining elements one at a time: compare each element x with the smallest element in L, and if x is larger than that smallest element, delete the smallest element and insert x into L using the idea of insertion sort. Repeat until all elements have been scanned. The complexity is O(1,000,000 × 100).
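A sketch of Scenario 1's bounded min-heap over 1,000,000 numbers (the random test data is only for illustration):

```python
import heapq
import random

def top_100(numbers):
    heap = []
    for x in numbers:
        if len(heap) < 100:
            heapq.heappush(heap, x)
        elif x > heap[0]:                 # larger than the current 100th-largest value
            heapq.heapreplace(heap, x)    # O(lg 100) per replacement
        # otherwise x cannot be in the top 100, skip it
    return sorted(heap, reverse=True)

data = [random.randint(0, 10**9) for _ in range(1_000_000)]
print(top_100(data)[:5])
```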

13. Find popular queries:

A search engine logs every query string that users submit in a log file; each query string is 1 to 255 bytes long. Suppose there are currently 10 million records. These query strings have a fairly high repetition rate: although the total is 10 million, there are no more than 3 million distinct strings after removing duplicates. The more often a query string repeats, the more popular it is with users. Count the 10 most popular query strings, using no more than 1 GB of memory.

(1) Please describe your idea of solving this problem;

(2) Please give the main processing flow, algorithm, and the complexity of the algorithm.

Scenario 1: Use a trie tree, with a count field in each node storing the number of times the corresponding query string appears (0 if it never appears as a complete query). Finally, sort the frequencies with a min-heap of 10 elements to obtain the 10 most popular query strings.

14. There are n machines in total, each holding n numbers. Each machine can store and operate on at most O(n) numbers. How do you find the median of all the n² numbers?

Scenario 1: First make a rough estimate of the range of these numbers; for example, assume they are all 32-bit unsigned integers (2^32 possible values in total). Divide the integers from 0 to 2^32 − 1 into n range segments, each containing 2^32 / n integers: the first segment covers 0 to 2^32/n − 1, the second covers 2^32/n to 2·2^32/n − 1, ..., and the nth covers (n−1)·2^32/n to 2^32 − 1. Then scan the n numbers on each machine and send the numbers belonging to the first segment to the first machine, those belonging to the second segment to the second machine, ..., and those belonging to the nth segment to the nth machine. Note that in this process each machine should still store O(n) numbers. Next, count the numbers on each machine in turn and accumulate the counts until we reach the kth machine, where the accumulated count becomes greater than or equal to n²/2, while the accumulated count on the first k−1 machines, call it x, is still less than n²/2. The median we are looking for is then on the kth machine, at position n²/2 − x within it. Sort the numbers on the kth machine and take the number at that position; it is the median. The complexity is O(n²).
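A single-process simulation of Scenario 1, with each "machine" represented as a plain list; the 32-bit range, the segment width, and the random test data are assumptions made for illustration:

```python
import random

def distributed_median(machines):
    n = len(machines)                       # n machines, n numbers each
    total = sum(len(m) for m in machines)   # n * n numbers overall
    seg = (2**32 + n - 1) // n              # width of each range segment

    # Re-partition: numbers in [i*seg, (i+1)*seg) go to "machine" i.
    buckets = [[] for _ in range(n)]
    for m in machines:
        for x in m:
            buckets[min(x // seg, n - 1)].append(x)

    # Accumulate counts bucket by bucket until the bucket holding the median is found.
    k, cumulative = (total + 1) // 2, 0     # k = rank of the (lower) median
    for bucket in buckets:
        if cumulative + len(bucket) >= k:
            bucket.sort()                   # only this one bucket needs sorting
            return bucket[k - cumulative - 1]
        cumulative += len(bucket)

machines = [[random.randrange(2**32) for _ in range(101)] for _ in range(101)]
print(distributed_median(machines))
```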

Scenario 2: First sort the numbers on each machine. Then use the idea of merge sort to merge the numbers from these n machines into one sorted order, and take the n²/2-th number, which is the answer. The complexity is O(n² · lg n²).

15. Maximum gap problem

Given n real numbers, find the maximum difference between 2 adjacent numbers when the n real numbers are placed on the real axis; a linear-time algorithm is required.

Scenario 1: The first idea is to sort the n numbers and then scan them once to find the largest adjacent gap, but this method cannot meet the linear-time requirement. Therefore, the following method is used instead:

Find the largest and smallest of the n numbers, max and min.

Use n − 2 equally spaced points to partition the interval [min, max], i.e. divide [min, max] into n − 1 equal intervals (half-open, closed on the left). Treat these intervals as buckets, numbered 1 to n − 1, with the upper bound of bucket i equal to the lower bound of bucket i + 1, so every bucket has the same size, namely (max − min)/(n − 1). In fact, the bucket boundaries form an arithmetic progression (first term min, common difference (max − min)/(n − 1)); min is considered to fall in the first bucket and max in the (n − 1)th bucket.

Put the n numbers into the n − 1 buckets: assign each element x to the bucket with index ⌊(x − min)·(n − 1)/(max − min)⌋ + 1 (clamping max into the last bucket), and record the maximum and minimum value that falls into each bucket.

Find the maximum gap: apart from max and min, the remaining n − 2 numbers are placed into n − 1 buckets, so by the pigeonhole principle at least one bucket is empty. Since every bucket has the same size, the maximum gap cannot occur inside a single bucket; it must be the gap between the upper bound (the maximum element) of some bucket i and the lower bound (the minimum element) of some later bucket j, with all the buckets between them (there may be several) being empty. In other words, the maximum gap arises between the maximum element of bucket i and the minimum element of bucket j, with j ≥ i + 1. One scan over the buckets completes the search.
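A sketch of this bucket-based linear-time algorithm (the sample input is illustrative):

```python
def maximum_gap(nums):
    n = len(nums)
    if n < 2:
        return 0
    lo, hi = min(nums), max(nums)
    if lo == hi:
        return 0
    # n-1 buckets of equal width covering [lo, hi].
    width = (hi - lo) / (n - 1)
    bucket_min = [None] * (n - 1)
    bucket_max = [None] * (n - 1)
    for x in nums:
        i = min(int((x - lo) / width), n - 2)       # bucket index, max clamped to last bucket
        bucket_min[i] = x if bucket_min[i] is None else min(bucket_min[i], x)
        bucket_max[i] = x if bucket_max[i] is None else max(bucket_max[i], x)
    # The answer is the largest distance between the max of one non-empty bucket
    # and the min of the next non-empty bucket (empty buckets are skipped).
    best, prev_max = 0, bucket_max[0]
    for i in range(1, n - 1):
        if bucket_min[i] is None:
            continue
        best = max(best, bucket_min[i] - prev_max)
        prev_max = bucket_max[i]
    return best

print(maximum_gap([3.0, 6.1, 9.2, 1.5, 8.7]))   # largest adjacent gap is 6.1 - 3.0
```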

16. Merge multiple sets into sets with no intersection. Given a collection of string sets (for example, sets such as {aaa, bbb, ccc} and {bbb, ddd}), merge all sets whose intersection is non-empty, so that after merging no two of the resulting sets intersect; in this example, the two sets sharing bbb would merge into {aaa, bbb, ccc, ddd}.

(1) Please describe your idea of solving this problem;

(2) give the main processing flow, algorithm, and the complexity of the algorithm;

(3) Please describe possible improvements.

Scenario 1: Use a union-find (disjoint-set) structure. Initially, put every string in its own singleton set in the union-find. Then scan each input set and merge its adjacent elements in sequence. For example, for {aaa, bbb, ccc}, first check whether aaa and bbb are already in the same union-find set, and if not, merge their sets; then check whether bbb and ccc are in the same set, and if not, merge theirs. Scan the remaining input sets the same way; when all sets have been scanned, the sets represented in the union-find structure are the answer. The complexity should be O(n·lg n). As improvements, you can record the root node of each element to speed up queries (path compression), and when merging, attach the smaller set to the larger one to reduce the complexity.
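A sketch of the union-find approach with the two improvements mentioned (path compression and union by size); the input sets are illustrative:

```python
def merge_sets(sets):
    parent, size = {}, {}

    def find(x):
        parent.setdefault(x, x)
        size.setdefault(x, 1)
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path compression (halving)
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return
        if size[ra] < size[rb]:              # attach the smaller tree to the larger
            ra, rb = rb, ra
        parent[rb] = ra
        size[ra] += size[rb]

    for s in sets:
        items = list(s)
        for x in items:
            find(x)                          # register singleton sets too
        for a, b in zip(items, items[1:]):   # merge adjacent elements of each set
            union(a, b)

    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

print(merge_sets([{"aaa", "bbb", "ccc"}, {"bbb", "ddd"}, {"eee", "fff"}, {"ggg"}]))
# e.g. [{'aaa', 'bbb', 'ccc', 'ddd'}, {'eee', 'fff'}, {'ggg'}] (order may vary)
```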

17. Maximum subsequence and maximum submatrix problems. Array problem: given an array whose elements may be positive or negative, find a contiguous subsequence whose sum is maximal.

Scenario 1: This problem can be solved with dynamic programming. Let b[i] denote the maximum sum of a contiguous subsequence ending at the ith element a[i]; then clearly b[i+1] = max(b[i] + a[i+1], a[i+1]), and the answer is the maximum over all b[i]. Based on this recurrence, it can be implemented quickly in code.
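A direct implementation of this recurrence (often called Kadane's algorithm); the sample array is illustrative:

```python
def max_subarray_sum(a):
    best = current = a[0]
    for x in a[1:]:
        current = max(current + x, x)   # best sum of a contiguous subsequence ending here
        best = max(best, current)
    return best

print(max_subarray_sum([1, -2, 3, 10, -4, 7, 2, -5]))   # 3 + 10 - 4 + 7 + 2 = 18
```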

Maximum submatrix problem: given a matrix (a two-dimensional array) whose entries may be positive or negative, find a submatrix whose sum is maximal, and output this sum.

Scenario 1: It can be solved with an idea similar to the maximum subsequence. If we fix the submatrix to span the columns from i to j, then summing each row over those columns reduces the task within that range to a one-dimensional maximum subsequence problem.
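A sketch of this column-fixing reduction, reusing the same 1-D routine as above (the sample matrix is illustrative):

```python
def max_subarray_sum(a):                      # same 1-D routine as in the previous sketch
    best = current = a[0]
    for x in a[1:]:
        current = max(current + x, x)
        best = max(best, current)
    return best

def max_submatrix_sum(matrix):
    rows, cols = len(matrix), len(matrix[0])
    best = matrix[0][0]
    for left in range(cols):
        row_sums = [0] * rows
        for right in range(left, cols):
            for r in range(rows):
                row_sums[r] += matrix[r][right]   # each row summed over columns left..right
            best = max(best, max_subarray_sum(row_sums))
    return best

print(max_submatrix_sum([[1, -2, 3],
                         [-4, 5, -6],
                         [7, -8, 9]]))   # 9 (the single cell in the bottom-right corner)
```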
