Finding the Median in Massive Data

Source: Internet
Author: User

Question: a file contains 10G integers in no particular order, and the median must be found. The memory limit is 2 GB. Just describe the idea. (The 2 GB limit means the program may use 2 GB of space, without counting the memory occupied by other software on the machine.)

Median: the value in the middle once the data is sorted. It divides the data into two halves, one with values no smaller than it and one with values no larger. Median position: when the number of samples n is odd, the median is the value at position (n + 1)/2; when n is even, the median is the mean of the values at positions n/2 and n/2 + 1 (so for 10G numbers, the median is the mean of the 5G-th and the (5G + 1)-th values).
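
As a small illustration of the position rule above, the following Python sketch returns the 1-based rank or ranks whose values define the median. The function name `median_ranks` is made up for this example.

```python
def median_ranks(n: int) -> tuple:
    """Return the 1-based sorted positions that define the median of n samples."""
    if n < 1:
        raise ValueError("n must be positive")
    if n % 2 == 1:
        # Odd count: a single middle element.
        return ((n + 1) // 2,)
    # Even count: the median is the mean of these two positions.
    return (n // 2, n // 2 + 1)


G = 2 ** 30
print(median_ranks(10 * G))  # ranks 5G and 5G + 1
```

For the 10G integers in the question, this gives the 5G-th and (5G + 1)-th values, so any solution has to locate two neighboring ranks rather than one.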

Analysis: this is clearly an engineering-flavored question, different from the usual problem of finding a median.
1. The raw data cannot all be read into memory; if it could, quickselect would do. If the value range were suitable, bucket sort or counting sort could also be considered, but assuming 32-bit integers there are still 4G possible values, and counting them would need a 16 GB array.

2. Viewed as the problem of finding the K-th largest of N numbers: if K numbers fit in memory, a min-heap or max-heap can be used. But here K = N/2, which is 5G numbers, and that still cannot be read into memory.

3. Next, if neither N nor K numbers can be read into memory at once, here is one approach: choose k < K such that k numbers fit entirely in memory. First build a heap of k numbers and find the 1st through k-th numbers, then scan the data again to find the (k + 1)-th through 2k-th, and keep scanning until the K-th number is found. Each scan costs about N log(k), and ceil(K/k) scans are needed; for example, five scans when k is about 1G.
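
The repeated-scan idea in item 3 can be sketched in Python as follows. This is only an illustration under stated assumptions: `scan` is a hypothetical callable that re-reads the data from the start each time it is called, the function name is invented, and the sketch looks for the K-th smallest value (the K-th largest is symmetric). Duplicates are handled by remembering how many copies of the current boundary value have already been accounted for.

```python
import heapq
from typing import Callable, Iterable


def kth_smallest_multipass(scan: Callable[[], Iterable[int]], K: int, k: int) -> int:
    """K-th smallest value (1-based), holding at most k values in memory per scan."""
    if K < 1 or k < 1:
        raise ValueError("K and k must be positive")
    found = 0        # how many of the smallest values are accounted for so far
    last = None      # largest value accounted for so far
    last_used = 0    # copies of `last` already accounted for
    while True:
        want = min(k, K - found)
        heap = []            # max-heap (negated values) of the `want` smallest candidates
        skip = last_used
        for x in scan():
            if last is not None:
                if x < last:
                    continue         # already accounted for in an earlier scan
                if x == last and skip > 0:
                    skip -= 1        # this copy of `last` was already accounted for
                    continue
            if len(heap) < want:
                heapq.heappush(heap, -x)
            elif x < -heap[0]:
                heapq.heapreplace(heap, -x)
        if len(heap) < want:
            raise ValueError("K exceeds the number of values")
        top = -heap[0]
        same = sum(1 for v in heap if -v == top)
        last_used = same + (last_used if top == last else 0)
        last = top
        found += want
        if found == K:
            return last


data = [5, 3, 3, 1, 4, 3, 2]
print(kth_smallest_multipass(lambda: iter(data), K=4, k=2))
```

Each pass of the `while` loop is one full scan of the data, so the number of scans is ceil(K/k), matching the estimate in item 3.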

Solution: assume the values are 32-bit unsigned integers.
1. Read the 10G integers once, mapping each integer to one of 256M segments, and use a 64-bit unsigned integer as the counter for each segment.
Explanation: the integers range from 0 to 2^32 - 1, a total of 4G possible values. Mapped onto 256M segments, each segment covers 16 values (4G / 256M = 16). Every 16 consecutive values form one segment: 0 ~ 15 is the 1st segment, 16 ~ 31 is the 2nd segment, ..., and 2^32 - 16 ~ 2^32 - 1 is the 256M-th segment. A 64-bit unsigned counter can hold far more than 10G, so overflow need not be considered here. The total memory used is 256M × 8 B = 2 GB.

2. Accumulate the segment counts from the first segment onward, and stop when the running total reaches 5G. Find the value range of that segment (the segment reached when accumulation stops, which is the one containing the median), call it [a, a + 15], and record the total accumulated up to the previous segment, call it m. Then release the memory occupied by the segment counters.

3. Read the 10G integers again and count the occurrences of each value in [a, a + 15], that is, 16 counters.

4. Accumulate the new counts in order, calling the running sum n. When m + n reaches 5G, the value whose count was just added is the 5G-th number in sorted order, one of the two middle values that define the median.
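
The four steps above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: `scan` is a hypothetical callable that re-reads the data on each call, the function names and the `value_bits` and `shift` parameters are assumptions introduced so the sketch can be run on small inputs, and a plain Python list stands in for the counter array (for the real 256M counters a compact array type would be needed to stay near 2 GB).

```python
from typing import Callable, Iterable


def kth_smallest_two_pass(scan: Callable[[], Iterable[int]], k: int,
                          value_bits: int = 32, shift: int = 4) -> int:
    """k-th smallest (1-based) unsigned value using two scans and segment counters."""
    # Step 1: count how many values fall into each segment of width 2**shift.
    seg_counts = [0] * (1 << (value_bits - shift))
    for x in scan():
        seg_counts[x >> shift] += 1
    # Step 2: walk the segments until the running total reaches k.
    below = 0  # total count in all earlier segments (m in the text)
    for seg, count in enumerate(seg_counts):
        if below + count >= k:
            break
        below += count
    else:
        raise ValueError("k exceeds the number of values")
    del seg_counts  # release the segment counters
    # Step 3: second scan, counting each individual value inside that segment.
    base = seg << shift
    value_counts = [0] * (1 << shift)
    for x in scan():
        if x >> shift == seg:
            value_counts[x - base] += 1
    # Step 4: accumulate per-value counts until the total reaches k.
    for offset, count in enumerate(value_counts):
        below += count
        if below >= k:
            return base + offset
    raise RuntimeError("input changed between scans")


def median_two_pass(scan, n: int, value_bits: int = 32, shift: int = 4):
    """Median of n values; for even n, the mean of the two middle values."""
    if n % 2 == 1:
        return kth_smallest_two_pass(scan, (n + 1) // 2, value_bits, shift)
    lo = kth_smallest_two_pass(scan, n // 2, value_bits, shift)
    hi = kth_smallest_two_pass(scan, n // 2 + 1, value_bits, shift)
    return (lo + hi) / 2


data = [3, 200, 17, 17, 5, 250, 16, 31]
print(median_two_pass(lambda: iter(data), len(data), value_bits=8))
```

For simplicity the even case here just runs the two-scan search twice, once per middle rank. Both ranks could also be located in the same pair of scans by tracking 5G and 5G + 1 together, at the cost of a little more bookkeeping when they fall in different segments.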

Summary:
1. The method above reads the integers only twice and does a constant amount of work per integer, so overall it runs in linear time.

2. Other cases. For signed integers, only the mapping needs to change. For 64-bit integers, each segment covers a wider range, and more counters are needed during the second read. If a counter overflows, the segment or integer it belongs to can be taken as the one being sought; it just needs corresponding handling. Also, the (5G + 1)-th number still has to be found; with the results above, that is not difficult.

3. Space-time trade-off. Using 256M segments only just fits in 2 GB of memory (and in practice it will not, since the counters alone take the full 2 GB). Widening each segment reduces the number of segments and saves memory. This increases the number of per-value counters in the second pass but speeds up the per-segment counting in the first pass (the overall effect remains to be tested).

4. Use bitwise operations for the mapping wherever possible. Because the segment width is a power of 2, every segment starts at a multiple of that width, so the mapping is a simple shift.
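
To make the mapping and the memory trade-off from the summary concrete, here is a short Python sketch. All names (`SHIFT`, `segment_of`, `offset_in_segment`, `signed32_to_unsigned`, `first_pass_bytes`) are invented for illustration, and the signed mapping shown is just one order-preserving choice, assuming 32-bit two's-complement inputs.

```python
SHIFT = 4                    # segment width 2**4 = 16 values, as in the solution
MASK = (1 << SHIFT) - 1      # low bits select the value inside a segment


def segment_of(x: int) -> int:
    """Index of the segment containing unsigned value x (a right shift)."""
    return x >> SHIFT


def offset_in_segment(x: int) -> int:
    """Position of x inside its segment (a bit mask)."""
    return x & MASK


def signed32_to_unsigned(x: int) -> int:
    """Order-preserving map from signed 32-bit values onto 0 .. 2**32 - 1."""
    return x + (1 << 31)


def first_pass_bytes(shift: int, value_bits: int = 32, counter_bytes: int = 8) -> int:
    """Memory needed for the first-pass segment counters at a given segment width."""
    return (1 << (value_bits - shift)) * counter_bytes


for shift in (4, 6, 8):
    print(shift, first_pass_bytes(shift) // 2 ** 20, "MiB")
```

With 16-value segments the first-pass counters take the full 2 GiB; widening the segments to 256 values drops that to 128 MiB while the second pass needs only 256 counters, which suggests wider segments are the safer choice under a hard 2 GB limit.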
