Job search-How big data is handled

Source: Internet
Author: User

1. Top K problem: Finding the k most frequent numbers in a massive data set, or the k largest numbers in a massive data set. Such problems are collectively called top K problems.

For top K problems, a good general approach is usually divide and conquer + hash + min-heap (small top heap).

E.g.: Find the largest 10,000 of 100 million floating-point numbers.

Method One: Sort everything and take the top 10,000. Each float takes 4 bytes, so 100 million floats take 400 MB. With less than 400 MB of memory, the data cannot all be read into memory at once. A full sort also orders every element, which does a lot of unnecessary work.

Method Two: Local elimination. Save the first 10,000 numbers in a container, then compare each remaining number with the smallest number in the container. If the new number is not larger than that smallest number, skip it. If it is larger, remove the smallest element from the container and insert the new number. Once all 100 million numbers have been traversed, the numbers left in the container are the final result.
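A minimal sketch of local elimination, assuming the container is a plain Python list and the input is any iterable of numbers. The function name top_k_by_elimination is hypothetical. Finding the smallest kept value is a linear scan here, which is what makes this method slower than the heap-based one.

```python
def top_k_by_elimination(numbers, k):
    """Return the k largest values, keeping only k candidates in memory."""
    if k <= 0:
        return []
    container = []
    for x in numbers:
        if len(container) < k:
            # Fill the container with the first k numbers.
            container.append(x)
            continue
        # Linear scan for the position of the smallest kept value.
        smallest_index = min(range(k), key=container.__getitem__)
        if x > container[smallest_index]:
            # Evict the smallest kept value and keep x instead.
            container[smallest_index] = x
    return sorted(container, reverse=True)
```

Each incoming number costs a scan over k kept values, so the total work grows with n * k.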

Method Three: Divide and conquer. Divide the 100 million numbers into 100 parts of 1 million each, find the largest 10,000 in each part, and finally find the largest 10,000 among the resulting 100 × 10,000 = 1 million candidates. This filters out 99% of the 100 million numbers before the final step.
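A minimal sketch of the divide-and-conquer step, assuming the data arrives as an iterable that is consumed one chunk at a time. The function name and the chunk_size parameter are hypothetical. heapq.nlargest from the standard library is used here to pick the top k of each part, which is one possible choice, not necessarily the one the original author had in mind.

```python
import heapq
from itertools import islice


def top_k_divide_and_conquer(numbers, k, chunk_size):
    """Take the top k of each chunk, then the top k of all chunk winners."""
    iterator = iter(numbers)
    candidates = []
    while True:
        # Only one chunk is held in memory at a time.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        candidates.extend(heapq.nlargest(k, chunk))
    # The global top k must be among the per-chunk top k values.
    return heapq.nlargest(k, candidates)
```

With 100 million numbers, k = 10,000 and chunk_size = 1,000,000, the final step works on at most 1 million candidates.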

Method Four: Min-heap. Read the first 10,000 numbers and build a min-heap (small top heap) of size 10,000, then traverse the subsequent numbers, comparing each with the top of the heap (the minimum). If the number is smaller than the heap top, continue with the next number. If it is larger, replace the heap top with it and re-adjust the heap to restore the min-heap property. Continue until all 100 million numbers have been traversed.
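The min-heap method can be sketched with Python's standard heapq module, which implements a min-heap on a list. The function name top_k_min_heap is hypothetical. heap[0] is always the smallest kept value, so each comparison is O(1) and each replacement is O(log k).

```python
import heapq


def top_k_min_heap(numbers, k):
    """Return the k largest values using a min-heap of size k."""
    if k <= 0:
        return []
    heap = []
    for x in numbers:
        if len(heap) < k:
            # Build the heap from the first k numbers.
            heapq.heappush(heap, x)
        elif x > heap[0]:
            # x beats the smallest kept value: pop the top and push x.
            heapq.heapreplace(heap, x)
    return sorted(heap, reverse=True)
```

Only k values are ever held in memory, so the input can be streamed from disk.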

If you want the 100 most frequent numbers among 100 million numbers, using MapReduce to simply split the data across different machines will not give correct results, because the occurrences of one number may be scattered across different hosts while those of another may all land on a single machine. Partitioning by a hash of the value, so that every occurrence of a number goes to the same machine, avoids this.
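A minimal single-process sketch of hash partitioning for the most-frequent case. The in-memory Counter objects stand in for separate machines or files, and the names top_k_frequent and num_partitions are hypothetical. Because every occurrence of a value lands in the same partition, each partition's counts are complete and its local top k can be merged safely.

```python
import heapq
from collections import Counter


def top_k_frequent(numbers, k, num_partitions=4):
    """Return the k (value, count) pairs with the highest counts."""
    # One counter per partition; in a real job each would be a machine.
    partitions = [Counter() for _ in range(num_partitions)]
    for x in numbers:
        # Equal values always hash to the same partition.
        partitions[hash(x) % num_partitions][x] += 1
    candidates = []
    for counts in partitions:
        # The local top k of each partition are the only possible winners.
        candidates.extend(counts.most_common(k))
    return heapq.nlargest(k, candidates, key=lambda item: item[1])
```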

2. Duplicate problem: Finding duplicate elements in a large amount of data, or removing duplicates. This kind of problem can generally be solved with a bitmap. For example, a file is known to contain some phone numbers, each 8 digits long, and the task is to count how many distinct numbers there are.

Solution: An 8-digit number has a maximum decimal value of 99,999,999. If each possible number corresponds to one bit in the bitmap, the bitmap needs about 100 million bits. Since 1 byte = 8 bits, that is 100,000,000 / 8 = 12,500,000 bytes, so about 12.5 MB of memory is enough to record which 8-digit phone numbers appear.

3. Sorting problem: Sorting massive data. For example, sort the numbers in a file that contains 900 million distinct integers.
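A minimal sketch of the bitmap count, assuming the phone numbers have already been parsed into integers in the range 0 to 99,999,999. The function name and the max_value parameter are hypothetical. A bytearray serves as the bitmap, with one bit per possible number.

```python
def count_distinct_numbers(numbers, max_value=99_999_999):
    """Count distinct integers in [0, max_value] using one bit per value."""
    # Round up to whole bytes: 12,500,000 bytes for 8-digit numbers.
    bitmap = bytearray((max_value + 1 + 7) // 8)
    distinct = 0
    for n in numbers:
        byte_index, bit_offset = divmod(n, 8)
        mask = 1 << bit_offset
        if not bitmap[byte_index] & mask:
            # First time this number is seen: set its bit and count it.
            bitmap[byte_index] |= mask
            distinct += 1
    return distinct
```

Numbers whose bit is already set are duplicates, so the same loop can also be used to report or drop them.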

Method One: Bitmap method
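A minimal sketch of sorting with a bitmap, assuming the integers are distinct, non-negative, and no larger than a known max_value. The function name bitmap_sort is hypothetical. Each number sets one bit, and scanning the bits in order yields the numbers in ascending order. For a full 32-bit range the bitmap would need 2^32 bits, which is 512 MiB.

```python
def bitmap_sort(numbers, max_value):
    """Yield distinct non-negative integers in ascending order."""
    bitmap = bytearray(max_value // 8 + 1)
    for n in numbers:
        # Set the bit that represents n.
        bitmap[n >> 3] |= 1 << (n & 7)
    for byte_index, byte in enumerate(bitmap):
        if not byte:
            continue  # skip bytes with no numbers present
        for bit in range(8):
            if byte & (1 << bit):
                yield byte_index * 8 + bit
```

Because a bit records only presence, this works as a sort only when the input has no repeated values.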
