Finding the median of 10G integers stored in a file in random order

Source: Internet
Author: User

Problem: A file contains 10G integers in random order. Find the median. The memory limit is 2G. Only the idea needs to be written down (a memory limit of 2G means that the program may use 2G of space to run, regardless of the memory used by other software on the machine).

About the median: after the data is sorted, the median is the number in the middle position. It divides the data into two parts, one with values no smaller than it and one with values no larger. Median position: when the number of samples n is odd, the median is the number at position (n+1)/2; when n is even, the median is the mean of the numbers at positions n/2 and n/2+1 (so the median of 10G numbers is the mean of the 5G-th and the (5G+1)-th numbers in sorted order).
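As a small illustration of the position rule above, here is a minimal sketch that reads the median from a list that is already sorted and fits in memory. The function name `median_of_sorted` is chosen here for illustration only.

```python
def median_of_sorted(values):
    """Return the median of an already sorted, non-empty list."""
    n = len(values)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    if n % 2 == 1:
        # Odd count: the element at 1-based position (n + 1) / 2.
        return values[(n + 1) // 2 - 1]
    # Even count: mean of the elements at 1-based positions n/2 and n/2 + 1.
    return (values[n // 2 - 1] + values[n // 2]) / 2
```

The rest of the problem is about obtaining the numbers at those positions without being able to sort the data in memory.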

Analysis: This is clearly a strongly engineering-oriented problem, and it differs from the usual find-the-median problem in several ways.
1. The original data cannot be read into memory; otherwise quickselect could be used. If the value range were suitable, bucket sort or counting sort could also be considered, but even assuming 32-bit integers there are still 4G possible values, which would need a 16G array of counters.

2. When looking for the K-th number among N numbers, a min-heap or max-heap can be used if K numbers fit into memory, but here K = N/2, which is 5G numbers and still cannot be read into memory.

3. For the case where neither N nor K numbers fit into memory, "The Beauty of Programming" gives a solution: choose k < K such that k numbers fit into memory. Build a heap of k numbers and first find the 1st to k-th largest numbers, then scan the data again to find the (k+1)-th to 2k-th largest, and keep scanning until the K-th largest number is found. Each scan costs roughly N log(k), but ceil(K/k) scans are needed; here that is 5 scans. (See "The Beauty of Programming", finding the K largest numbers.)
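The repeated-scan idea in item 3 can be sketched as follows. This is only an illustration under assumptions: `read_numbers` is a hypothetical callable that returns a fresh iterator over the data (standing in for re-reading the file), and duplicates are handled by pairing each value with its position, which is a choice made here and not taken from the book.

```python
import heapq


def kth_largest_multipass(read_numbers, K, k):
    """Return the K-th largest value (1-based), holding at most k items at once.

    read_numbers is a hypothetical callable returning a fresh iterator
    over the data each time it is called.
    """
    if K < 1 or k < 1:
        raise ValueError("K and k must be positive")
    bound = None       # smallest (value, index) pair already accounted for
    remaining = K      # how many more ranks still have to be skipped
    while True:
        take = min(k, remaining)
        heap = []      # min-heap with the `take` largest pairs below `bound`
        for index, value in enumerate(read_numbers()):
            pair = (value, index)  # the index makes equal values distinct
            if bound is not None and pair >= bound:
                continue           # already counted in an earlier scan
            if len(heap) < take:
                heapq.heappush(heap, pair)
            elif pair > heap[0]:
                heapq.heapreplace(heap, pair)
        if len(heap) < take:
            raise ValueError("K exceeds the number of items")
        remaining -= take
        bound = heap[0]
        if remaining == 0:
            return bound[0]


data = [5, 1, 9, 3, 7, 9]
third_largest = kth_largest_multipass(lambda: iter(data), K=3, k=2)
```

Each pass of the `while` loop is one full scan, so the number of scans is ceil(K/k), which is the cost that makes this approach unattractive for K = 5G.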

Solution: First assume that the integers are 32-bit unsigned integers.
1. Read the 10G integers, map each integer to one of 256M segments, and use a 64-bit unsigned integer as the counter for each segment.
Explanation: the integer range is 0 to 2^32-1, a total of 4G values. Mapped to 256M segments, each segment covers 4G/256M = 16 values: 0~15 is the 1st segment, 16~31 is the 2nd segment, ..., and 2^32-16 ~ 2^32-1 is the 256M-th segment. A 64-bit unsigned integer can hold values from 0 to 2^64-1, far more than 10G, so a segment counter cannot overflow here. The total memory consumption is 256M x 8B = 2GB.

2. Accumulate the segment counts from front to back and stop when the running sum reaches or exceeds 5G. The segment at which the accumulation stops is the segment that contains the median; call it [a, a+15], and record the total count of all segments before it as m. Then release the memory occupied by the segment counters.

3. Read the 10G integers again, counting each individual value within [a, a+15], which takes 16 counters.

4. Accumulate the new counts in order, calling the running sum n. When m+n reaches or exceeds 5G, the value corresponding to the current counter is the 5G-th number in sorted order.
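Steps 1 to 4 can be sketched as below. This is a scaled-down illustration, not a tuned implementation: `read_numbers` is a hypothetical callable that returns a fresh iterator (standing in for re-reading the file), `bits` and `shift` are parameters introduced here so the example can run on 8-bit values, and a plain Python list is used for the counters even though it is far less compact than the packed 64-bit counters the text assumes.

```python
def kth_smallest_two_pass(read_numbers, rank, bits, shift=4):
    """Return the value at 1-based `rank` in sorted order using two scans.

    Assumes unsigned integers below 2**bits. Each segment covers
    2**shift consecutive values (16 when shift is 4).
    """
    if rank < 1:
        raise ValueError("rank must be positive")

    # Pass 1: count how many inputs fall into each segment.
    segment_counts = [0] * (1 << (bits - shift))
    for x in read_numbers():
        segment_counts[x >> shift] += 1

    # Accumulate segment counts until the running total reaches `rank`.
    m = 0  # total count of all segments before the target segment
    for segment, count in enumerate(segment_counts):
        if m + count >= rank:
            break
        m += count
    else:
        raise ValueError("rank exceeds the number of inputs")
    del segment_counts  # the segment counters are no longer needed

    # Pass 2: count each individual value inside the target segment.
    a = segment << shift
    value_counts = [0] * (1 << shift)
    for x in read_numbers():
        if x >> shift == segment:
            value_counts[x - a] += 1

    # Accumulate the per-value counts until m + n reaches `rank`.
    n = 0
    for offset, count in enumerate(value_counts):
        n += count
        if m + n >= rank:
            return a + offset


data = [37, 5, 200, 5, 18, 90, 255, 0, 40, 64]
lower = kth_smallest_two_pass(lambda: iter(data), len(data) // 2, bits=8)
upper = kth_smallest_two_pass(lambda: iter(data), len(data) // 2 + 1, bits=8)
median = (lower + upper) / 2
```

For an even number of inputs the function is called for both middle ranks and the two results are averaged, matching the definition of the median given earlier.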

Summary:
1. The method above reads the integers only twice, and each integer takes only a constant-time operation, so the overall running time is linear.

2. Other cases to consider.
For signed integers, only the mapping needs to change. For 64-bit integers, the range of each segment has to grow, so more counters are needed in the second pass. If a counter whose capacity is above 5G ever overflows, the segment or value it represents holds more than half of the data and must contain the median, so it is enough to handle that case accordingly. The (5G+1)-th number still has to be found as well; with the results above, finding it is not difficult.

3. Space-time tradeoffs.
Using 256M segments takes exactly 2GB of memory for the counters alone (in practice this leaves no room for anything else). You can widen each segment and reduce the number of segments to save memory. The second pass then has to count more individual values, but the first pass over the segment counters becomes faster. The overall effect still needs to be tested.

4. Use bit operations for the mapping. Since the width of each segment is an integer power of 2, every segment starts at a multiple of that power, and the mapping is a simple shift.
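The mapping mentioned in items 2 and 4 might look like the sketch below. The names `SHIFT`, `MASK`, `segment_index`, `offset_in_segment` and `signed_to_unsigned_key` are introduced here for illustration. The signed mapping simply adds 2^(bits-1), which is one order-preserving choice and is assumed here, since the text does not spell out which mapping it has in mind.

```python
SHIFT = 4                 # 2**4 = 16 values per segment
MASK = (1 << SHIFT) - 1   # selects the low 4 bits


def segment_index(x):
    """Segment that the unsigned value x falls into (0-based)."""
    return x >> SHIFT


def offset_in_segment(x):
    """Position of x inside its segment, from 0 to 15."""
    return x & MASK


def signed_to_unsigned_key(x, bits=32):
    """Map a signed `bits`-bit integer to an unsigned key, preserving order."""
    return x + (1 << (bits - 1))
```

With this mapping the smallest signed value becomes key 0 and the largest becomes 2^32-1, so the segment counting can proceed exactly as in the unsigned case and the result is mapped back by subtracting the same offset.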

Problem: Design a data structure that supports two operations: inserting data and obtaining the median.

Use a max-heap and a min-heap, where the max-heap holds the smaller half of the data and the min-heap holds the larger half.

Then, depending on the case, move elements between the two heaps so that their sizes stay balanced (they differ by at most one). Time complexity: O(log n) per insertion.
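A minimal sketch of the two-heap structure, assuming numeric data. The class name `MedianTracker` is chosen here for illustration, and because Python's `heapq` only provides a min-heap, the max-heap is simulated by storing negated values.

```python
import heapq


class MedianTracker:
    """Insert numbers one at a time and query the running median."""

    def __init__(self):
        self._low = []   # max-heap of the smaller half (values stored negated)
        self._high = []  # min-heap of the larger half

    def insert(self, x):
        # Route x to the half it belongs to.
        if self._low and x > -self._low[0]:
            heapq.heappush(self._high, x)
        else:
            heapq.heappush(self._low, -x)
        # Rebalance so that len(low) is len(high) or len(high) + 1.
        if len(self._low) > len(self._high) + 1:
            heapq.heappush(self._high, -heapq.heappop(self._low))
        elif len(self._high) > len(self._low):
            heapq.heappush(self._low, -heapq.heappop(self._high))

    def median(self):
        if not self._low:
            raise ValueError("no data inserted yet")
        if len(self._low) > len(self._high):
            return -self._low[0]          # odd count: top of the max-heap
        return (-self._low[0] + self._high[0]) / 2  # even count: mean of tops
```

Each insertion does a constant number of heap pushes and pops, and the median is read from the heap tops without any further work.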

Extension

Design a stack that, in addition to the common stack operations, has an operation that returns the median.

Again, use a max-heap and a min-heap to maintain the median. Time complexity: O(log n).
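The sketch below shows the interface of such a stack. Note that it swaps in a simpler technique than the one described: popping from a stack removes an arbitrary value from the heaps, which needs extra bookkeeping, so this version keeps a sorted list with `bisect` instead. That makes push and pop O(n) in the worst case, not O(log n). The class name `MedianStack` is chosen here for illustration.

```python
import bisect


class MedianStack:
    """Stack with a median query, backed by a sorted list (not two heaps)."""

    def __init__(self):
        self._stack = []   # values in push order
        self._sorted = []  # the same values kept in sorted order

    def push(self, x):
        self._stack.append(x)
        bisect.insort(self._sorted, x)

    def pop(self):
        x = self._stack.pop()
        # bisect_left finds the first occurrence of x in the sorted list.
        del self._sorted[bisect.bisect_left(self._sorted, x)]
        return x

    def median(self):
        n = len(self._sorted)
        if n == 0:
            raise ValueError("stack is empty")
        if n % 2 == 1:
            return self._sorted[n // 2]
        return (self._sorted[n // 2 - 1] + self._sorted[n // 2]) / 2
```

Reaching the O(log n) bound with two heaps would require removing the popped value from whichever heap holds it, for example by marking it for lazy deletion.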
