Java Big Data processing problem

Source: Internet
Author: User

1. Given two files a and b, each storing 5 billion URLs with each URL taking 64 bytes, and a memory limit of 4GB, how do you find the URLs common to both files?
Scenario 1: The size of each file can be estimated as 5G x 64 = 320GB, far larger than the 4GB memory limit, so a whole file cannot be loaded into memory at once. Consider a divide-and-conquer approach.
- Traverse file a; for each URL compute hash(url) % 1000 and, according to that value, store the URL into one of 1000 small files (call them a0, a1, ..., a999). Each small file is then about 320MB.
- Traverse file b and store its URLs into 1000 small files (call them b0, b1, ..., b999) in the same way. After this step, any URLs that may be identical must lie in the corresponding pair of small files (ai and bi); small files with different indices cannot contain the same URL. So we only need to find the common URLs within each of the 1000 pairs of small files.
To find the common URLs within a pair of small files, store the URLs of one small file in a hash_set, then traverse each URL of the other small file and check whether it is in the hash_set just built; if it is, it is a common URL and can be written to an output file (a code sketch is given after Scenario 2 below).
Scenario 2: If a certain error rate is acceptable, you can use a Bloom filter: 4GB of memory can represent roughly 34 billion bits. Map the URLs of one file onto these 34 billion bits with the Bloom filter, then read the URLs of the other file one by one and test each against the Bloom filter; any URL that tests positive should be a common URL (note that there will be a certain false-positive rate).
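As a rough illustration of Scenario 1, the following minimal Java sketch partitions both big files by hash(url) % 1000 and then intersects each pair of small files with a HashSet. The file names ("a.txt", "b.txt", "common.txt"), the use of String.hashCode(), and the partition count are illustrative assumptions rather than part of the original problem.

    import java.io.*;
    import java.util.HashSet;
    import java.util.Set;

    // Sketch of Scenario 1: partition both big files by hash(url) % 1000,
    // then intersect each pair of small files with a HashSet.
    public class CommonUrls {

        // Split a big URL file into `parts` small files named prefix0 ... prefix(parts-1).
        static void partition(String bigFile, String prefix, int parts) throws IOException {
            BufferedWriter[] writers = new BufferedWriter[parts];
            for (int i = 0; i < parts; i++) {
                writers[i] = new BufferedWriter(new FileWriter(prefix + i));
            }
            try (BufferedReader in = new BufferedReader(new FileReader(bigFile))) {
                String url;
                while ((url = in.readLine()) != null) {
                    int bucket = Math.floorMod(url.hashCode(), parts);   // same URL -> same bucket
                    writers[bucket].write(url);
                    writers[bucket].newLine();
                }
            }
            for (BufferedWriter w : writers) w.close();
        }

        // Intersect one pair of small files: load a_i into a HashSet, stream b_i against it.
        static void intersect(String aPart, String bPart, BufferedWriter out) throws IOException {
            Set<String> seen = new HashSet<>();
            try (BufferedReader in = new BufferedReader(new FileReader(aPart))) {
                String url;
                while ((url = in.readLine()) != null) seen.add(url);
            }
            try (BufferedReader in = new BufferedReader(new FileReader(bPart))) {
                String url;
                while ((url = in.readLine()) != null) {
                    if (seen.contains(url)) {
                        out.write(url);
                        out.newLine();
                    }
                }
            }
        }

        public static void main(String[] args) throws IOException {
            int parts = 1000;
            partition("a.txt", "a_", parts);   // hypothetical input files, one URL per line
            partition("b.txt", "b_", parts);
            try (BufferedWriter out = new BufferedWriter(new FileWriter("common.txt"))) {
                for (int i = 0; i < parts; i++) {
                    intersect("a_" + i, "b_" + i, out);
                }
            }
        }
    }

Each pair (a_i, b_i) is small enough for its HashSet to fit in memory, which is the whole point of the hash partitioning.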

2. There are 10 files of 1GB each; every line of every file stores a user query, and queries may be repeated across the files. Sort the queries by their frequency.
Scenario 1:
- Read the 10 files sequentially and, according to the value of hash(query) % 10, write each query into one of another 10 files (call them a0, a1, ..., a9). Each newly generated file is about 1GB in size (assuming the hash function is reasonably uniform).
- Find a machine with about 2GB of memory and use a hash_map(query, query_count) to count the number of occurrences of each query in each file. Then sort by occurrence count using quick/heap/merge sort, and output the sorted queries and their corresponding query_count values to a file. This yields 10 sorted files (call them b0, b1, ..., b9).
- Merge the 10 sorted files (an external merge sort).
Scenario 2:
In general the total number of distinct queries is limited; it is only the repetition count that is large, so it may well be possible to hold all distinct queries in memory at once. In that case we can use a trie tree or hash_map to count the occurrences of each query directly, and then sort by occurrence count with quick/heap/merge sort (a counting sketch is given after Scenario 3 below).
Scenario 3:
Similar to Scenario 1, except that after the data has been split into multiple files by the hash, those files can be handed to multiple machines for processing with a distributed framework (such as MapReduce), with the results merged at the end.
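A minimal sketch of the in-memory counting step from Scenario 2, assuming all distinct queries fit in memory; the input file name "queries.txt" is a placeholder.

    import java.io.*;
    import java.util.*;

    // Sketch of Scenario 2: count every query with a HashMap, then sort by frequency.
    public class QueryFrequency {
        public static void main(String[] args) throws IOException {
            Map<String, Long> counts = new HashMap<>();
            try (BufferedReader in = new BufferedReader(new FileReader("queries.txt"))) {
                String query;
                while ((query = in.readLine()) != null) {
                    counts.merge(query, 1L, Long::sum);   // increment this query's count
                }
            }
            // Sort the (query, count) pairs by descending count.
            List<Map.Entry<String, Long>> sorted = new ArrayList<>(counts.entrySet());
            sorted.sort(Map.Entry.<String, Long>comparingByValue().reversed());
            for (Map.Entry<String, Long> e : sorted) {
                System.out.println(e.getValue() + "\t" + e.getKey());
            }
        }
    }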

3. There is a 1GB file in which every line is a word; no word exceeds 16 bytes, and the memory limit is 1MB. Return the 100 most frequent words.
Scenario 1: Read the file sequentially; for each word x, compute hash(x) % 5000 and, according to that value, write the word into one of 5000 small files (call them x0, x1, ..., x4999), so that each file is about 200KB on average. If any file exceeds the 1MB limit, keep splitting it in the same way until no resulting small file exceeds 1MB. For each small file, count the words that appear in it and their frequencies (a trie tree or hash_map can be used) and take the 100 most frequent words (a min-heap of 100 nodes works, as sketched below), then write those 100 words and their frequencies to a file, giving 5000 such files. The final step is to merge these 5000 files, in a process similar to merge sort.
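A minimal sketch of the min-heap step for one small file, assuming its word counts already fit in a HashMap; the heap size of 100 matches the problem statement and the sample counts are made up.

    import java.util.*;

    // Sketch: pick the 100 most frequent words from a word -> count map
    // using a min-heap that never holds more than 100 entries.
    public class Top100Words {
        static List<Map.Entry<String, Long>> top100(Map<String, Long> counts) {
            PriorityQueue<Map.Entry<String, Long>> heap =
                    new PriorityQueue<>(Map.Entry.<String, Long>comparingByValue());   // smallest count on top
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                heap.offer(e);
                if (heap.size() > 100) {
                    heap.poll();   // drop the current smallest, keeping only the 100 largest
                }
            }
            List<Map.Entry<String, Long>> result = new ArrayList<>(heap);
            result.sort(Map.Entry.<String, Long>comparingByValue().reversed());
            return result;
        }

        public static void main(String[] args) {
            Map<String, Long> counts = new HashMap<>();
            counts.put("hadoop", 5L);
            counts.put("java", 12L);
            counts.put("data", 7L);
            top100(counts).forEach(e -> System.out.println(e.getKey() + " " + e.getValue()));
        }
    }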

4. Given massive log data, extract the IP that visited Baidu the most times on a particular day.
Scenario 1: First take that day's log entries, extract the IPs that visited Baidu, and write them to one large file. Note that an IP address is 32 bits, so there are at most 2^32 distinct IPs. We can again use the mapping method, for example take each IP modulo 1000 to map the whole large file into 1000 small files, then find the most frequent IP in each small file (a hash_map can be used for the frequency counting) together with its frequency. Finally, among these 1000 candidate IPs, pick the one with the largest frequency; that is the answer.

5. Find the integers that occur only once among 250 million integers; memory is not sufficient to hold all 250 million integers.
Scenario 1: Use a 2-bit bitmap (allocate 2 bits per possible value: 00 means not seen, 01 means seen once, 10 means seen more than once, 11 is unused). This requires 2^32 x 2 bits = 1GB of memory, which is acceptable. Then scan the 250 million integers and update the corresponding entry in the bitmap: 00 becomes 01, 01 becomes 10, and 10 stays 10. After the scan, walk through the bitmap and output every integer whose entry is 01 (a bitmap sketch is given after Scenario 2 below).
Scenario 2: You can also use the small-file partitioning method from the earlier problems. Within each small file, find the integers that occur only once and sort them, then merge the results, taking care to remove duplicate elements.
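A minimal sketch of Scenario 1's 2-bit bitmap, using a long[] of 2^27 words (1GB) as the bit store and treating each input as an unsigned 32-bit value; this is an illustrative layout, not the only possible one.

    // Sketch of Scenario 1: a 2-bits-per-value bitmap over the whole 32-bit range.
    // 00 = not seen, 01 = seen once, 10 = seen more than once.
    public class TwoBitmap {
        // 2^32 values * 2 bits = 2^33 bits = 2^27 longs = 1GB in total.
        private final long[] bits = new long[1 << 27];

        private int get(long value) {
            int word = (int) (value >>> 5);           // 32 two-bit slots per 64-bit long
            int shift = (int) ((value & 31) << 1);
            return (int) ((bits[word] >>> shift) & 3L);
        }

        private void set(long value, int state) {
            int word = (int) (value >>> 5);
            int shift = (int) ((value & 31) << 1);
            bits[word] = (bits[word] & ~(3L << shift)) | ((long) state << shift);
        }

        public void add(int n) {
            long value = n & 0xFFFFFFFFL;             // treat the integer as unsigned
            int state = get(value);
            if (state == 0) set(value, 1);            // 00 -> 01
            else if (state == 1) set(value, 2);       // 01 -> 10; 10 stays 10
        }

        public void printUnique() {
            for (long v = 0; v < (1L << 32); v++) {
                if (get(v) == 1) System.out.println(v);   // entry 01: occurred exactly once
            }
        }
    }

Feeding every integer through add() and then calling printUnique() reproduces the scan-then-walk procedure described above.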

6. A massive data set is distributed across 100 computers. Find a way to efficiently compute the TOP 10 of this batch of data.
Scenario 1:
- Find the TOP 10 on each computer; this can be done with a heap of 10 elements (a max-heap for the 10 smallest, a min-heap for the 10 largest). For example, for the 10 largest, first put the first 10 elements into a min-heap, then scan the remaining data, comparing each value with the top of the heap; if it is larger than the heap top, replace the heap top with it and re-adjust the heap. The 10 elements left in the heap at the end are that computer's TOP 10.
- After finding the TOP 10 on each computer, combine the TOP 10 lists from the 100 computers, 1000 values in total, and then apply the same method as above to find the overall TOP 10.

7. How do you find the item with the largest number of repetitions in a huge amount of data?
Scenario 1: First hash each item and take the result modulo some value to map the data into a set of small files; in each small file, find the item with the most repetitions and record it along with its count. Then, among the candidates produced in the previous step, find the one with the largest repetition count; that is the answer (refer to the previous questions for details).

8. Tens of millions or hundreds of millions of data items (with duplicates): find the N items that occur most often.
Scenario 1: Tens of millions or hundreds of millions of items should fit in the memory of a current machine, so consider using a hash_map, binary search tree, red-black tree, or similar structure to count the occurrences. Then extracting the N items with the most occurrences can be done with the heap mechanism mentioned in problem 6.

9. There are 10 million strings, some of which are duplicates. Remove all duplicates so that no duplicate strings remain. How would you design and implement this?
Scenario 1: A trie tree is well suited to this problem; a hash_map should also work.
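A minimal trie sketch for the deduplication, assuming for simplicity that the strings consist of lowercase letters a-z; real input would need a wider child array or a map of children.

    // Sketch of Scenario 1: deduplicate strings with a trie (lowercase a-z assumed).
    public class TrieDedup {
        private static class Node {
            Node[] children = new Node[26];
            boolean isWord;                 // true if some string ends at this node
        }

        private final Node root = new Node();

        // Returns true if the string has not been seen before (keep it), false if it is a duplicate.
        public boolean addIfAbsent(String s) {
            Node cur = root;
            for (int i = 0; i < s.length(); i++) {
                int c = s.charAt(i) - 'a';
                if (cur.children[c] == null) cur.children[c] = new Node();
                cur = cur.children[c];
            }
            if (cur.isWord) return false;   // already present: a duplicate
            cur.isWord = true;
            return true;
        }

        public static void main(String[] args) {
            TrieDedup trie = new TrieDedup();
            for (String s : new String[]{"spark", "hive", "spark"}) {
                if (trie.addIfAbsent(s)) System.out.println(s);   // prints spark, hive
            }
        }
    }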

10. A text file has about 10,000 lines, one word per line. Find the 10 most frequent words; give the idea and a time complexity analysis.
Scenario 1: This question is about time efficiency. Count the number of occurrences of each word with a trie tree; the time complexity is O(n*le), where le denotes the average length of a word. Then find the 10 most frequent words, which can be done with a heap as mentioned in the previous questions in O(n*lg10) time. So the total time complexity is the larger of O(n*le) and O(n*lg10).

11. Find the 10 most frequent words in a text file, but this time the file is very long, say hundreds of millions or a billion lines; in short, it cannot be read into memory. What is the best solution?
Scenario 1: First use hashing and modulo arithmetic to decompose the file into a number of small files; for each single file, use the method of the previous problem to find the 10 most frequent words in that file. Then merge the per-file results to find the overall 10 most frequent words.

12. Find the largest 100 numbers among 1,000,000 numbers.
Scenario 1: As mentioned in the previous questions, this can be done with a min-heap of 100 elements. The complexity is O(1,000,000 * lg100).
Scenario 2: Use the partitioning idea of quicksort: after each partition, only keep working on the part larger than the pivot, until the part larger than the pivot has just over 100 elements; then sort that part with a traditional sorting algorithm and take its first 100 (a partition sketch is given after Scenario 3 below). The complexity is O(1,000,000 * 100).
Scenario 3: Use a partial-elimination approach. Take the first 100 elements and sort them, calling the result sequence L. Then scan the remaining elements one at a time: compare each element x with the smallest element of the sorted 100; if x is larger than that smallest element, remove the smallest element and insert x into sequence L using the idea of insertion sort. Repeat until all elements have been scanned. The complexity is O(1,000,000 * 100).
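A minimal sketch of the partitioning idea from Scenario 2, written as an in-place top-k selection; the random pivot, k = 100, and the randomly generated input are illustrative assumptions.

    import java.util.Arrays;
    import java.util.Random;

    // Sketch of Scenario 2: quickselect-style partitioning so that the largest k
    // elements end up in the first k positions of the array.
    public class Top100Select {
        private static final Random RAND = new Random();

        // Rearranges a[lo..hi] so that the k largest elements of that range occupy a[lo..lo+k-1].
        static void selectTopK(int[] a, int lo, int hi, int k) {
            if (k <= 0 || hi <= lo) return;
            int pivot = a[lo + RAND.nextInt(hi - lo + 1)];
            int i = lo, j = hi;
            while (i <= j) {                        // partition: larger elements go to the left
                while (a[i] > pivot) i++;
                while (a[j] < pivot) j--;
                if (i <= j) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; j--; }
            }
            int leftSize = i - lo;                  // elements >= pivot now sitting on the left
            if (leftSize >= k) {
                selectTopK(a, lo, j, k);            // enough large elements on the left: narrow down
            } else {
                selectTopK(a, i, hi, k - leftSize); // keep the whole left part, select the rest on the right
            }
        }

        public static void main(String[] args) {
            int[] data = new int[1_000_000];
            for (int i = 0; i < data.length; i++) data[i] = RAND.nextInt();
            selectTopK(data, 0, data.length - 1, 100);
            int[] top100 = Arrays.copyOfRange(data, 0, 100);
            Arrays.sort(top100);                    // sort just the 100 survivors for readability
            System.out.println("largest value: " + top100[99]);
        }
    }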

13. Search for popular queries:
Search engines log, via log files, every query string a user submits; each query string is 1-255 bytes long. Suppose there are currently 10 million records. These query strings have a high degree of repetition: although the total is 10 million, there are no more than 3 million distinct strings after removing duplicates. The higher the repetition of a query string, the more users queried it and the more popular it is. Please find the 10 most popular query strings, using no more than 1GB of memory.
(1) Please describe your idea of solving this problem;
(2) Please give the main processing flow, algorithm, and the complexity of the algorithm.
Scenario 1: Use a trie tree whose key field in each node stores the number of times the corresponding query string occurs (0 if it never occurs). Finally, use a min-heap of 10 elements to rank the occurrence frequencies.

14. There are n machines in total, and each machine holds n numbers. Each machine can store and operate on at most O(n) numbers. How do you find the median of all n^2 numbers?
Scenario 1: First roughly estimate the range of these numbers; for example, assume they are all 32-bit unsigned integers (2^32 possible values in total). Divide the integers 0 to 2^32 - 1 into n range segments, each containing 2^32/n integers; for example, the first segment is 0 to 2^32/n - 1, the second is 2^32/n to 2*2^32/n - 1, ..., and the nth segment is (n-1)*2^32/n to 2^32 - 1. Then scan the n numbers on each machine and place the numbers belonging to the first segment on the first machine, those belonging to the second segment on the second machine, ..., and those belonging to the nth segment on the nth machine. Note that the number of values stored on each machine should still be O(n). Next, count how many numbers each machine holds and accumulate these counts machine by machine until we reach the kth machine, where the accumulated count becomes greater than or equal to n^2/2 while the accumulated count up to the (k-1)th machine, which we denote x, is still less than n^2/2. The median we are looking for is therefore on the kth machine, at position n^2/2 - x within it. We then sort the numbers on the kth machine and find the (n^2/2 - x)th number, which is the median we seek. The complexity is O(n^2).
Scenario 2: First sort the numbers on each machine. Then, using the idea of merge sort, merge the numbers from the n machines into one sorted sequence and take the (n^2/2)th number, which is the one we want. The complexity is O(n^2 * lg n^2).

15. Maximum Gap problem
Given n real numbers, find the maximum difference between two adjacent numbers when the n real numbers are placed on the real axis; a linear-time algorithm is required.
Scenario 1: The first method that comes to mind is to sort the n numbers and then scan them once to determine the maximum adjacent gap, but this cannot satisfy the linear-time requirement. Therefore the following method is used:
- Find the largest and smallest values max and min among the n numbers.
- Use n-2 points to divide the interval [min, max] evenly, i.e. split [min, max] into n-1 equal intervals (closed on the left, open on the right). Treat these intervals as buckets, numbered 1 through n-1, where the upper bound of bucket i equals the lower bound of bucket i+1, so every bucket has the same size, namely (max - min)/(n - 1). In fact, the bucket boundaries form an arithmetic progression (first term min, common difference (max - min)/(n - 1)); min is regarded as belonging to the first bucket and max to the (n-1)th bucket.
- Put the n numbers into the n-1 buckets: each element x is assigned to the bucket with index floor((x - min)/((max - min)/(n - 1))) + 1, and the maximum and minimum values falling into each bucket are recorded.
- Maximum gap: apart from max and min, the remaining n-2 numbers are placed into the n-1 buckets, so by the pigeonhole principle at least one bucket is empty. Since every bucket has the same size, the maximum gap cannot occur between two numbers inside the same bucket; it must be the gap between the maximum value stored in some bucket i and the minimum value stored in some later bucket j, where every bucket strictly between i and j (there may be one or several) is empty. In other words, the maximum gap arises between the upper boundary of bucket i's contents and the lower boundary of bucket j's contents. One more scan over the buckets completes the job.
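A minimal sketch of this bucket method, assuming a double[] input with at least two values; zero-based bucket indices are used instead of the 1-based numbering above, but the idea is the same.

    // Sketch: maximum adjacent gap in linear time via n-1 equal-width buckets.
    public class MaximumGap {
        static double maximumGap(double[] nums) {
            int n = nums.length;
            if (n < 2) return 0.0;
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
            for (double x : nums) { min = Math.min(min, x); max = Math.max(max, x); }
            if (min == max) return 0.0;                     // all values equal: no gap

            int buckets = n - 1;
            double width = (max - min) / buckets;           // the size of each bucket
            double[] bucketMin = new double[buckets];
            double[] bucketMax = new double[buckets];
            java.util.Arrays.fill(bucketMin, Double.POSITIVE_INFINITY);
            java.util.Arrays.fill(bucketMax, Double.NEGATIVE_INFINITY);

            for (double x : nums) {
                int idx = (int) ((x - min) / width);        // bucket index in [0, n-2]
                if (idx == buckets) idx = buckets - 1;      // max itself lands in the last bucket
                bucketMin[idx] = Math.min(bucketMin[idx], x);
                bucketMax[idx] = Math.max(bucketMax[idx], x);
            }

            // The maximum gap lies between the max of one non-empty bucket and the min of the next non-empty one.
            double gap = 0.0, prevMax = bucketMax[0];       // bucket 0 always contains min
            for (int i = 1; i < buckets; i++) {
                if (bucketMin[i] == Double.POSITIVE_INFINITY) continue;   // skip empty buckets
                gap = Math.max(gap, bucketMin[i] - prevMax);
                prevMax = bucketMax[i];
            }
            return gap;
        }

        public static void main(String[] args) {
            double[] data = {3.1, 9.4, 1.2, 7.7, 2.0};
            System.out.println(maximumGap(data));           // about 4.6, the gap between 3.1 and 7.7
        }
    }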

16. Merge multiple sets into sets with no intersections: a collection of string sets is given, e.g. {aaa, bbb, ccc}, {bbb, ddd}, {eee, fff}. Sets whose intersection is not empty must be merged, and the resulting sets must not overlap one another; in this example, {aaa, bbb, ccc} and {bbb, ddd} share bbb and so would be merged into {aaa, bbb, ccc, ddd}, while {eee, fff} stays on its own.
(1) Please describe your idea of solving this problem;
(2) give the main processing flow, algorithm, and the complexity of the algorithm;
(3) Please describe the possible improvements.
Scenario 1: Use a union-find (disjoint-set) structure. First, put every string in its own set. Then, for each input set, scan its elements and union adjacent pairs: for example, for {aaa, bbb, ccc}, first check whether aaa and bbb are in the same set, and if not, union their sets; then check whether bbb and ccc are in the same set, and if not, union those sets as well. Continue scanning the remaining input sets; when all sets have been scanned, the sets represented by the union-find structure are the answer. The complexity should be O(n lg n). As improvements, you can record each node's root (path compression) to speed up later queries, and merge the smaller set into the larger one (union by size) to reduce the complexity.
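A minimal union-find sketch with path compression and union by size, mapping strings to indices; the sample input sets are illustrative.

    import java.util.*;

    // Sketch of Scenario 1: merge overlapping string sets with a union-find (disjoint-set) structure.
    public class MergeSets {
        private final Map<String, Integer> ids = new HashMap<>();   // string -> element index
        private final List<Integer> parent = new ArrayList<>();
        private final List<Integer> size = new ArrayList<>();

        private int idOf(String s) {
            return ids.computeIfAbsent(s, k -> {
                parent.add(parent.size());   // a new element is its own root
                size.add(1);
                return parent.size() - 1;
            });
        }

        private int find(int x) {
            while (parent.get(x) != x) {
                parent.set(x, parent.get(parent.get(x)));   // path compression (halving)
                x = parent.get(x);
            }
            return x;
        }

        private void union(int a, int b) {
            int ra = find(a), rb = find(b);
            if (ra == rb) return;
            if (size.get(ra) < size.get(rb)) { int t = ra; ra = rb; rb = t; }   // union by size
            parent.set(rb, ra);
            size.set(ra, size.get(ra) + size.get(rb));
        }

        public Collection<Set<String>> merge(List<List<String>> inputSets) {
            for (List<String> set : inputSets) {
                idOf(set.get(0));                                     // register singleton sets too
                for (int i = 1; i < set.size(); i++) {
                    union(idOf(set.get(i - 1)), idOf(set.get(i)));    // union adjacent elements
                }
            }
            Map<Integer, Set<String>> groups = new HashMap<>();
            for (Map.Entry<String, Integer> e : ids.entrySet()) {
                groups.computeIfAbsent(find(e.getValue()), k -> new TreeSet<>()).add(e.getKey());
            }
            return groups.values();
        }

        public static void main(String[] args) {
            List<List<String>> input = Arrays.asList(
                    Arrays.asList("aaa", "bbb", "ccc"),
                    Arrays.asList("bbb", "ddd"),
                    Arrays.asList("eee", "fff"));
            System.out.println(new MergeSets().merge(input));   // e.g. [[aaa, bbb, ccc, ddd], [eee, fff]]
        }
    }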

17. Maximum subsequence and maximum submatrix problems
Maximum subsequence problem for an array: given an array whose elements may be positive or negative, find a contiguous subsequence whose sum is maximal.
Scenario 1: This problem can be solved with dynamic programming. Let b[i] denote the maximum sum of a subsequence ending at the ith element a[i]; then clearly b[i+1] = max(b[i] + a[i+1], a[i+1]), and the answer is the maximum of all b[i]. Based on this recurrence, it can be implemented quickly in code.
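A minimal sketch of this recurrence (often called Kadane's algorithm); the sample array is made up.

    // Sketch: maximum sum of a contiguous subsequence using the recurrence
    // b[i+1] = max(b[i] + a[i+1], a[i+1]), keeping only the previous b value.
    public class MaxSubsequence {
        static int maxSubsequenceSum(int[] a) {
            int endingHere = a[0];   // b[i]: best sum of a subsequence ending at index i
            int best = a[0];
            for (int i = 1; i < a.length; i++) {
                endingHere = Math.max(endingHere + a[i], a[i]);
                best = Math.max(best, endingHere);
            }
            return best;
        }

        public static void main(String[] args) {
            int[] a = {-2, 11, -4, 13, -5, -2};
            System.out.println(maxSubsequenceSum(a));   // 20, from the subsequence 11, -4, 13
        }
    }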
Maximum submatrix problem: given a matrix (two-dimensional array) whose entries may be large or small, positive or negative, find a submatrix whose sum is maximal and output that sum.
Scenario 1: This can be solved with an idea similar to the maximum subsequence. If we fix column i and column j and only select elements between these two columns, then within that range the task is really a maximum subsequence problem over the per-row sums. Which columns i and j to choose can be determined by brute-force search over all column pairs.
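A minimal sketch of this column-pair brute force combined with the one-dimensional recurrence above; the sample matrix is made up, and the running time is O(cols^2 * rows).

    // Sketch: maximum-sum submatrix by fixing a pair of columns (brute force over all pairs)
    // and running the maximum-subsequence recurrence over the per-row sums between them.
    public class MaxSubmatrix {
        static int maxSubmatrixSum(int[][] m) {
            int rows = m.length, cols = m[0].length;
            int best = m[0][0];
            for (int left = 0; left < cols; left++) {
                int[] rowSums = new int[rows];                 // sum of each row between columns left..right
                for (int right = left; right < cols; right++) {
                    for (int r = 0; r < rows; r++) rowSums[r] += m[r][right];
                    // One-dimensional maximum subsequence over rowSums.
                    int endingHere = rowSums[0], localBest = rowSums[0];
                    for (int r = 1; r < rows; r++) {
                        endingHere = Math.max(endingHere + rowSums[r], rowSums[r]);
                        localBest = Math.max(localBest, endingHere);
                    }
                    best = Math.max(best, localBest);
                }
            }
            return best;
        }

        public static void main(String[] args) {
            int[][] m = {
                    { 1, -2,  3},
                    {-4,  5, -6},
                    { 7, -8,  9}
            };
            System.out.println(maxSubmatrixSum(m));   // 9 for this sample (the bottom-right cell)
        }
    }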
