Big Data Sorting and Deduplication Problems

Source: Internet
Author: User

 

1. Given two files A and B, each storing 5 billion URLs of 64 bytes each, and a memory limit of 4 GB, find the URLs common to A and B.
Solution 1: each file is about 5 billion × 64 bytes = 320 GB, far more than the 4 GB memory limit, so it cannot be loaded into memory in full. Use a divide-and-conquer approach.
• Traverse file A, compute hash(URL) % 1000 for each URL, and write the URL to one of 1000 small files (denoted a0, a1, ..., a999) according to that value. Each small file is then about 300 MB.
• Traverse file B and split its URLs into 1000 small files (denoted b0, b1, ..., b999) in the same way. After this step, any URLs that could be equal are in corresponding small files (a0 and b0, a1 and b1, ..., a999 and b999), and non-corresponding small files cannot share a URL. It then remains to find the common URLs in each of the 1000 pairs of small files.
• For each pair, load the URLs of one small file into a hash_set, then traverse the URLs of the other small file and check whether each one is in the hash_set. If it is, it is a common URL and is written to the output file.
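The partition-then-intersect steps above can be sketched as follows. This is a minimal in-memory illustration, not a definitive implementation: the function names `partition` and `common_urls` are hypothetical, each bucket stands in for one small file on disk, and `zlib.crc32` is an assumed choice of hash function.

```python
import zlib
from collections import defaultdict


def partition(urls, num_buckets):
    """Group URLs by crc32(url) % num_buckets; each bucket stands in for one small file."""
    buckets = defaultdict(list)
    for url in urls:
        buckets[zlib.crc32(url.encode("utf-8")) % num_buckets].append(url)
    return buckets


def common_urls(urls_a, urls_b, num_buckets=1000):
    """Return URLs present in both inputs, comparing only buckets with the same index."""
    buckets_a = partition(urls_a, num_buckets)
    buckets_b = partition(urls_b, num_buckets)
    common = set()
    for index, bucket_a in buckets_a.items():
        seen = set(bucket_a)  # plays the role of the hash_set for one small file
        common.update(url for url in buckets_b.get(index, ()) if url in seen)
    return common
```

Because equal URLs always hash to the same bucket index, only buckets with matching indexes need to be compared, which keeps the working set to one pair of small files at a time.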
Solution 2: if a certain error rate is acceptable, use a Bloom filter. 4 GB of memory holds about 34 billion bits. Map the URLs of one file onto these 34 billion bits with the Bloom filter, then read the URLs of the other file one by one and test each against the filter. If the test succeeds, the URL is probably a common URL (false positives are possible, so there is a certain error rate).
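A minimal Bloom filter sketch for the approach in solution 2. The class name `BloomFilter`, the helper `probable_common_urls`, the use of SHA-256 with double hashing to derive bit positions, and the default sizes are all assumptions made for illustration; a real run would size the bit array to the available memory.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from two 64-bit halves of one digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def probable_common_urls(urls_a, urls_b, num_bits=1 << 20, num_hashes=5):
    """URLs of B that the filter built from A reports as present (may include false positives)."""
    bloom = BloomFilter(num_bits, num_hashes)
    for url in urls_a:
        bloom.add(url)
    return [url for url in urls_b if url in bloom]
```

The filter never misses a URL that was added, so every true common URL is reported; the error is one-sided, in the form of occasional false positives.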

2. There are 10 files of 1 GB each. Each line of each file stores a user query, and queries may be repeated within and across files. Sort the queries by frequency.
Solution 1:
• Read the 10 files in sequence and write each query to one of 10 new files according to hash(query) % 10. Each new file is then about 1 GB (assuming the hash function distributes queries uniformly).
• On a machine with about 2 GB of memory, process each new file in turn: use hash_map(query, query_count) to count the occurrences of each query, sort by count using quicksort, heapsort, or merge sort, and write the sorted queries with their query_count to a file. This yields 10 sorted files.
• Merge the 10 sorted files (a combination of internal and external sorting).
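The three steps of solution 1 can be sketched in memory as below. The function name `sort_queries_by_frequency` is hypothetical, the lists stand in for the intermediate files, and `zlib.crc32` is an assumed hash function.

```python
import heapq
import zlib
from collections import Counter


def sort_queries_by_frequency(queries, num_parts=10):
    # Step 1: split queries by hash(query) % num_parts (lists stand in for files).
    parts = [[] for _ in range(num_parts)]
    for query in queries:
        parts[zlib.crc32(query.encode("utf-8")) % num_parts].append(query)

    # Step 2: count each part and sort it by descending count.
    sorted_runs = []
    for part in parts:
        counts = Counter(part)
        sorted_runs.append(sorted(counts.items(), key=lambda kv: kv[1], reverse=True))

    # Step 3: k-way merge of the sorted runs, still in descending count order.
    return list(heapq.merge(*sorted_runs, key=lambda kv: kv[1], reverse=True))
```

Since every occurrence of a query lands in the same part, the per-part counts are already final and the merge step only has to interleave the runs.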
Solution 2:
Usually the number of distinct queries is limited even though each is repeated many times, so all distinct queries may fit in memory at once. In that case, use a trie, hash_map, or similar structure to count the occurrences of each query directly, then sort by count using quicksort, heapsort, or merge sort.
Solution 3:
Similar to solution 1, but after hashing the queries into multiple files, hand the files to multiple machines for processing under a distributed architecture (such as MapReduce), and then merge the results.

3. There is a 1 GB file with one word per line. No word exceeds 16 bytes, and memory is limited to 1 MB. Return the 100 most frequent words.
Solution 1: read the file sequentially; for each word x, compute hash(x) % 5000 and write the word to one of 5000 small files (denoted x0, x1, ..., x4999) according to that value. Each file is then about 200 KB. If some files exceed 1 MB, keep splitting them in the same way until no small file exceeds 1 MB. For each small file, count the words and their frequencies (a trie or hash_map can be used), take the 100 most frequent words (a min-heap with 100 nodes can be used), and save those 100 words with their frequencies to a file, which yields another 5000 files. The last step is to merge these 5000 files (similar to merge sort).
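A compact sketch of this split, count, and merge pipeline. The function name `top_k_words` is hypothetical, a dictionary of lists stands in for the small files, and `zlib.crc32` is an assumed hash; `heapq.nlargest` maintains a bounded heap internally, which matches the min-heap idea.

```python
import heapq
import zlib
from collections import Counter


def top_k_words(words, k=100, num_parts=5000):
    # Split words by hash so that all copies of a word share one part.
    parts = {}
    for word in words:
        parts.setdefault(zlib.crc32(word.encode("utf-8")) % num_parts, []).append(word)

    # Keep only the k most frequent (word, count) pairs of each part.
    candidates = []
    for part in parts.values():
        counts = Counter(part)
        candidates.extend(heapq.nlargest(k, counts.items(), key=lambda kv: kv[1]))

    # The global top k must be among the per-part candidates.
    return heapq.nlargest(k, candidates, key=lambda kv: kv[1])
```

For example, `top_k_words(["x"] * 5 + ["y"] * 3 + ["z"], k=2, num_parts=3)` returns `[("x", 5), ("y", 3)]`.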

4. Given massive log data, extract the IP address that visited Baidu most often on a certain day.
Solution 1: first, extract the IP addresses from that day's Baidu access logs and write them one by one to a large file. Note that an IP address is 32 bits, so there are at most 2^32 distinct IP addresses. Use a mapping method, such as modulo 1000, to split the large file into 1000 small files, then find the most frequent IP address in each small file (use a hash_map to count frequencies, then pick the address with the highest count) together with its frequency. Finally, among these 1000 candidate IP addresses, pick the one with the highest frequency, which is the answer.

5. Find the integers that appear only once among 250 million integers. Memory is insufficient to hold all 250 million integers.
Solution 1: use a 2-bit bitmap (2 bits per possible number: 00 means not seen, 01 means seen once, 10 means seen multiple times, 11 is unused). For 32-bit integers this requires 2^32 × 2 bits = 1 GB of memory, which is acceptable. Scan the 250 million integers and update the corresponding bits in the bitmap: 00 becomes 01, 01 becomes 10, and 10 stays unchanged. After the scan, walk the bitmap and output every integer whose bits are 01.
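A minimal sketch of the 2-bit bitmap, assuming non-negative integers below 2^universe_bits. The function name `integers_appearing_once` and the `universe_bits` parameter are hypothetical; with the default of 32 the bitmap takes 1 GB, so small values are more practical for experimenting.

```python
def integers_appearing_once(numbers, universe_bits=32):
    """Return, in increasing order, the integers that occur exactly once."""
    # Four 2-bit counters per byte: 00 = unseen, 01 = once, 10 = many.
    bitmap = bytearray(((1 << universe_bits) * 2 + 7) // 8)
    for n in numbers:
        byte, shift = n >> 2, (n & 3) * 2
        state = (bitmap[byte] >> shift) & 0b11
        if state < 2:  # 00 -> 01, 01 -> 10, 10 stays unchanged
            cleared = bitmap[byte] & ~(0b11 << shift) & 0xFF
            bitmap[byte] = cleared | ((state + 1) << shift)
    return [
        n
        for n in range(1 << universe_bits)
        if (bitmap[n >> 2] >> ((n & 3) * 2)) & 0b11 == 1
    ]
```

The memory use depends only on the size of the value range, not on how many integers are scanned.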
Solution 2: alternatively, split the data into small files as in question 1. Find the non-repeated integers in each small file and sort them, then merge the results, taking care to remove duplicates.

6. Massive amounts of data are distributed across 100 computers. Find an efficient way to compute the top 10 of this data.
Solution 1:
• Find the top 10 on each computer using a heap of 10 elements (a max-heap for the 10 smallest, a min-heap for the 10 largest). For example, to find the 10 largest, first build a min-heap from the first 10 elements, then scan the remaining data and compare each element with the heap top. If the element is larger than the heap top, replace the heap top with it and restore the min-heap. The elements left in the heap at the end are the top 10.
• After finding the top 10 on each computer, combine the results from the 100 computers (1000 records in total) and apply the same method to find the overall top 10.
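The two steps above can be sketched with Python's `heapq`, which implements a min-heap. The function names `top_k_largest` and `global_top_k` are hypothetical, and each inner list stands in for the data on one computer.

```python
import heapq


def top_k_largest(values, k=10):
    """Keep the k largest values seen so far in a min-heap of size k."""
    heap = []
    for value in values:
        if len(heap) < k:
            heapq.heappush(heap, value)
        elif value > heap[0]:  # larger than the smallest of the current top k
            heapq.heapreplace(heap, value)
    return sorted(heap, reverse=True)


def global_top_k(machines, k=10):
    """Compute a local top k per machine, then the top k of the combined results."""
    local_tops = [top_k_largest(data, k) for data in machines]
    return top_k_largest((v for tops in local_tops for v in tops), k)
```

Only k values per machine have to be sent to the coordinating step, regardless of how much data each machine holds.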

7. How can you find the item that is repeated the most?
Solution 1: hash each item and use the hash modulo to map it to a small file, find the most repeated item in each small file, and record its number of repetitions. Then pick the item with the most repetitions among the results of the previous step (for details, see the previous questions).

8. Given tens of millions or hundreds of millions of data records (with duplicates), find the n records that occur most frequently.
Solution 1: tens or hundreds of millions of records should fit in the memory of a current machine, so use a hash_map, binary search tree, or red-black tree to count occurrences. Then extract the n most frequent records with the heap mechanism described in question 6.

9. There are 10 million strings, some of which are duplicates. Remove all the duplicates and keep only distinct strings. How would you design and implement this?
Solution 1: a trie is well suited to this problem, and a hash_map should also work.

10. A text file contains about 10 thousand lines with one word per line. Find the 10 most frequent words. Describe your approach and analyze the time complexity.
Solution 1: this question is about time efficiency. Use a trie to count the occurrences of each word, which takes O(n × le) time, where le is the average word length. Then find the 10 most frequent words with a heap, as described in the previous questions, which takes O(n × lg 10) time. The total time complexity is therefore the larger of O(n × le) and O(n × lg 10).
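A small sketch of counting with a trie and then selecting the most frequent words with a heap. The names `TrieNode`, `count_words`, `iter_counts`, and `top_words` are hypothetical, and a dictionary of children per node is one assumed representation among several.

```python
import heapq


class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0  # number of words that end at this node


def count_words(words):
    """Insert every word into a trie; each insertion costs O(len(word))."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            child = node.children.get(ch)
            if child is None:
                child = node.children[ch] = TrieNode()
            node = child
        node.count += 1
    return root


def iter_counts(node, prefix=""):
    """Yield (word, count) for every word stored in the trie."""
    if node.count:
        yield prefix, node.count
    for ch, child in node.children.items():
        yield from iter_counts(child, prefix + ch)


def top_words(words, k=10):
    return heapq.nlargest(k, iter_counts(count_words(words)), key=lambda kv: kv[1])
```

Words that share a prefix share trie nodes, which is where the trie saves memory compared with storing each word separately.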

11. Find the 10 most frequent words in a text file. The file is long, possibly hundreds of millions or billions of lines, so it cannot be read into memory at once. Give the optimal solution.
Solution 1: first split the file into multiple small files by hashing each word and taking the modulo. For each small file, use the method above to find its 10 most frequent words. Then merge the results to find the overall 10 most frequent words.

12. Find the 100 largest numbers among 1 million numbers.
Solution 1: as mentioned in the previous questions, use a min-heap of 100 elements. The complexity is O(10^6 × lg 100).
Solution 2: use the partitioning idea of quicksort. After each partition, consider only the part larger than the pivot. Once the part larger than the pivot holds only slightly more than 100 elements, sort it with a conventional sorting algorithm and take the first 100. The complexity is O(10^6 × 100).
Solution 3: use local elimination. Take the first 100 elements and sort them into a sequence L. Then scan the remaining elements one at a time, comparing each element x with the smallest of the 100 sorted elements. If x is larger than the smallest element, delete the smallest element and insert x into L using the insertion-sort idea. Repeat until all elements have been scanned. The complexity is O(10^6 × 100).
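The partitioning idea of solution 2 can be sketched as a quickselect-style loop. This is a variant for illustration rather than the exact procedure described: it uses a random pivot, three-way partitioning, and collects the answer as it goes instead of sorting a final remainder. The function name `largest_k` is hypothetical.

```python
import random


def largest_k(values, k=100):
    """Return the k largest values in descending order using pivot partitioning."""
    items = list(values)
    result = []
    while k > 0 and items:
        pivot = random.choice(items)
        larger = [v for v in items if v > pivot]
        equal = [v for v in items if v == pivot]
        smaller = [v for v in items if v < pivot]
        if len(larger) >= k:
            items = larger  # every answer lies above the pivot
        else:
            result.extend(larger)  # all of these belong to the answer
            k -= len(larger)
            take = min(k, len(equal))
            result.extend(equal[:take])
            k -= take
            items = smaller  # look for the rest below the pivot
    return sorted(result, reverse=True)
```

Each round discards the pivot and at least one side of the partition, so the loop always terminates.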

13. Finding popular queries:
The search engine records every search string a user submits in a log file. The length of each query string is 1-bytes. Suppose there are currently 10 million records, and these query strings are highly repetitive: although the total is 10 million, there are no more than 3 million distinct strings after duplicates are removed. The more often a query string is repeated, the more users searched for it and the more popular it is. Find the 10 most popular query strings, using no more than 1 GB of memory.
(1) Describe your solution to this problem;
(2) Provide the main processing steps, the algorithms, and their complexity.
Solution 1: use a trie in which each node's key field stores the number of times the query string ending at that node occurs, or 0 if it does not occur. Finally, use a min-heap of 10 elements to rank the strings by frequency.

14. There are N machines, each holding N numbers. Each machine can store at most O(N) numbers and operate on them. How do you find the median of the N^2 numbers?
Solution 1: first estimate the range of the numbers. For example, assume they are all 32-bit unsigned integers (2^32 possible values in total). Divide the integers 0 to 2^32 - 1 into N segments of 2^32/N integers each: the first segment is 0 to 2^32/N - 1, the second is 2^32/N to 2 × 2^32/N - 1, ..., and the Nth is (N - 1) × 2^32/N to 2^32 - 1. Then scan the N numbers on each machine, sending the numbers in the first segment to the first machine, the numbers in the second segment to the second machine, ..., and the numbers in the Nth segment to the Nth machine. Note that each machine should still hold O(N) numbers during this process. Next, count the numbers on each machine in order and accumulate the counts until reaching machine k, where the running total is greater than or equal to N^2/2 while the running total through machine k - 1 is smaller than N^2/2; denote the latter total by x. The median is then on machine k, at position N^2/2 - x. Sort the numbers on machine k and take the (N^2/2 - x)th number, which is the median. The complexity is O(N^2).
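A single-process simulation of the range-partitioning approach in solution 1. The function name `distributed_median` and the `value_bits` parameter are hypothetical, the inner lists stand in for machines, and for an even count the sketch returns the lower median. It assumes the values are spread evenly enough that no simulated machine receives far more than O(N) numbers.

```python
def distributed_median(machines, value_bits=32):
    """Return the (lower) median of all numbers held by the simulated machines."""
    n = len(machines)
    total = sum(len(m) for m in machines)
    if total == 0:
        raise ValueError("no numbers given")

    # Segment width: ceil(2^value_bits / N) values per machine.
    segment = -(-(1 << value_bits) // n)

    # Redistribute: machine i receives the values of segment i.
    redistributed = [[] for _ in range(n)]
    for machine in machines:
        for value in machine:
            redistributed[value // segment].append(value)

    # Accumulate counts until the machine containing the median rank is found.
    target = (total + 1) // 2  # 1-based rank of the lower median
    seen = 0
    for bucket in redistributed:
        if seen + len(bucket) >= target:
            return sorted(bucket)[target - seen - 1]
        seen += len(bucket)
```

Only the one bucket that contains the median rank is sorted; all other buckets are merely counted.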
Solution 2: first sort the numbers on each machine. Then, in the manner of merge sort, merge the numbers from the N machines into one sorted sequence and take the (N^2/2)th number, which is the median. The complexity is O(N^2 lg N^2).

15. Maximum gap problem
Given n real numbers, find the maximum difference between two adjacent numbers on the real axis. A linear-time algorithm is required.
Solution 1: the first method that comes to mind is to sort the n numbers and then scan them once to find the maximum adjacent gap. However, this does not meet the linear-time requirement, so use the following method instead:
• Find the largest and smallest values, max and min, among the n numbers.
• Use n - 2 equally spaced points to divide [min, max] into n - 1 intervals (closed on the left and open on the right) and treat these intervals as buckets numbered 1 to n - 1. The upper bound of bucket i equals the lower bound of bucket i + 1, and all buckets have the same size, (max - min)/(n - 1). The bucket boundaries thus form an arithmetic sequence (first term min, common difference (max - min)/(n - 1)). By convention, min is placed in the first bucket and max in bucket n - 1.
• Put the n numbers into the n - 1 buckets: assign each element to the bucket whose interval contains it, and record the maximum and minimum value assigned to each bucket.
• Find the maximum gap: apart from max and min, the remaining n - 2 numbers go into n - 1 buckets, so by the pigeonhole principle at least one bucket is empty. Because all buckets have the same size, the maximum gap cannot occur inside a single bucket. It must be the gap between the largest value in some bucket i and the smallest value in a later bucket j, with every bucket between them (if any) empty. A single scan over the buckets finds it.
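The bucket steps above can be sketched as follows. The function name `max_gap` is hypothetical, buckets are numbered from 0 rather than 1, and the two-element case is handled separately as an assumption of this sketch, because both numbers then share the single bucket.

```python
def max_gap(xs):
    """Largest difference between neighbours in sorted order, without sorting."""
    n = len(xs)
    if n < 2:
        return 0.0
    lo, hi = min(xs), max(xs)
    if n == 2 or lo == hi:
        return hi - lo

    size = (hi - lo) / (n - 1)  # width of each of the n - 1 buckets
    bucket_min = [None] * (n - 1)
    bucket_max = [None] * (n - 1)
    for x in xs:
        i = min(int((x - lo) / size), n - 2)  # max goes into the last bucket
        if bucket_min[i] is None or x < bucket_min[i]:
            bucket_min[i] = x
        if bucket_max[i] is None or x > bucket_max[i]:
            bucket_max[i] = x

    # Scan buckets: a gap runs from the max of one non-empty bucket
    # to the min of the next non-empty bucket.
    best = 0.0
    prev_max = bucket_max[0]  # bucket 0 always contains the minimum
    for i in range(1, n - 1):
        if bucket_min[i] is None:
            continue
        best = max(best, bucket_min[i] - prev_max)
        prev_max = bucket_max[i]
    return best
```

Every pass over the data is linear, so the whole computation is O(n) in time and space.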

16. Merge multiple sets into sets with no intersection: a collection of string sets is given. The format is as follows :. Sets whose intersection is non-empty must be merged so that no two of the resulting sets intersect. For example, the preceding example should be output.
(1) Describe your solution to this problem;
(2) Provide the main processing steps, the algorithms, and their complexity;
(3) Describe possible improvements.
Solution 1: use a union-find (disjoint-set) structure. Initially, every string is in its own disjoint set. Then scan each input set and union its adjacent elements in order. For example, first check whether AAA and BBB are in the same disjoint set; if not, union the sets containing them. Then check whether BBB and CCC are in the same disjoint set; if not, union the sets containing them as well. Continue with the other input sets. When all input sets have been scanned, the disjoint sets are the answer. The complexity should be O(n lg n). As improvements, record the root of each node to speed up lookups, and when merging, attach the smaller set to the larger one, which also reduces complexity.
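A minimal union-find sketch of this approach, including the two improvements mentioned (path compression and union by size). The function name `merge_sets` and the sample input in the usage note are made up for illustration and are not the example from the original problem statement.

```python
def merge_sets(sets):
    """Merge input sets that share any element; return the resulting disjoint sets."""
    parent, size = {}, {}

    def find(x):
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:  # path compression: point nodes at the root
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return
        if size[ra] < size[rb]:  # union by size: attach smaller tree to larger
            ra, rb = rb, ra
        parent[rb] = ra
        size[ra] += size[rb]

    for s in sets:
        items = list(s)
        for x in items:
            if x not in parent:
                parent[x], size[x] = x, 1
        for a, b in zip(items, items[1:]):  # union adjacent elements of one set
            union(a, b)

    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())
```

With the illustrative input `[{"aaa", "bbb", "ccc"}, {"bbb", "ddd"}, {"eee", "fff"}, {"ggg"}, {"ddd", "hhh"}]`, the first, second, and last sets are linked through `bbb` and `ddd` and end up merged into one, while the other two stay separate.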

17. Maximum subsequence sum and maximum submatrix sum problems
Maximum subsequence sum of an array: given an array with both positive and negative elements, find a contiguous subsequence whose sum is maximal.
Solution 1: this can be solved with dynamic programming. Let b[i] denote the maximum sum of a contiguous subsequence ending at element i; then b[i + 1] = max(b[i], 0) + a[i + 1], where a is the array. This is straightforward to implement in code.
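A short sketch of the dynamic programming solution, assuming a non-empty list of numbers. The function name `max_subarray_sum` is hypothetical; the variable `current` holds the best sum of a contiguous run ending at the current element.

```python
def max_subarray_sum(a):
    """Maximum sum over all non-empty contiguous subsequences of a."""
    best = current = a[0]
    for x in a[1:]:
        # Either extend the best run ending at the previous element or start anew.
        current = max(current, 0) + x
        best = max(best, current)
    return best
```

For example, `max_subarray_sum([-2, 1, -3, 4, -1, 2, 1, -5, 4])` returns 6, from the run `4, -1, 2, 1`.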
Maximum submatrix problem: given a matrix (a two-dimensional array) whose values vary in size, find the submatrix with the largest sum and output that sum.
Solution 1: use the same idea as for the maximum subsequence. Once the range from column i to column j is fixed, the problem within that range is a maximum subsequence problem over the per-row sums. The choice of column i and column j can be searched by brute force.
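The column-pair search can be sketched as below, assuming a non-empty rectangular matrix given as a list of rows. The function name `max_submatrix_sum` is hypothetical. For each pair of columns, the per-row sums are maintained incrementally and the one-dimensional dynamic programming step is applied to them.

```python
def max_submatrix_sum(matrix):
    """Maximum sum over all non-empty contiguous submatrices."""
    rows, cols = len(matrix), len(matrix[0])
    best = matrix[0][0]
    for left in range(cols):
        row_sums = [0] * rows  # sum of each row between columns left..right
        for right in range(left, cols):
            for r in range(rows):
                row_sums[r] += matrix[r][right]
            # One-dimensional maximum subsequence sum over the row sums.
            current = row_sums[0]
            best = max(best, current)
            for s in row_sums[1:]:
                current = max(current, 0) + s
                best = max(best, current)
    return best
```

With m rows and n columns this takes O(n^2 × m) time, since there are O(n^2) column pairs and each is handled in O(m).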

 

From: http://hi.baidu.com/jiaxiaobosuper/blog/item/5715981c8c7f54d3a686694b.html
