Title: A file contains 10G integers in no particular order; find the median. The memory limit is 2G. Only the idea needs to be written out (a 2G memory limit means the program itself may use 2G of space, regardless of how much memory other software on the machine consumes).
About the median: it is the value in the middle position after the data is sorted. It divides the data into two parts, one greater than it and one less than it. Position of the median: when the number of samples N is odd, the median is the ((N+1)/2)-th value; when N is even, the median is the mean of the (N/2)-th and (N/2+1)-th values (so for 10G numbers, the median is the mean of the 5G-th and the (5G+1)-th numbers in sorted order).
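As a small illustration of the position rule above, here is a minimal sketch. The function names `median_ranks` and `median_of_sorted` are made up for this example, and ranks are counted from 1.

```python
def median_ranks(n):
    """Return the 1-based rank(s) in sorted order that define the median of n values."""
    if n % 2 == 1:
        return ((n + 1) // 2,)
    return (n // 2, n // 2 + 1)


def median_of_sorted(values):
    """Median of an already sorted, non-empty list, using the ranks above."""
    picked = [values[r - 1] for r in median_ranks(len(values))]
    return sum(picked) / len(picked)
```

For 10G values (taking G as 2^30), `median_ranks(10 * 2**30)` gives the 5G-th and (5G+1)-th positions, matching the text.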
Analysis: this is clearly a heavily engineering-oriented problem, and it differs from the usual find-the-median problem in several ways.
1. The original data cannot be read into memory; otherwise quickselect would do. If the range of values is suitable, bucket sort or counting sort is also an option, but assuming 32-bit integers there are still 4G possible values, and counting each of them needs an array of 16G (at 4 bytes per counter).
2. Viewed as finding the K-th largest of N numbers: if K numbers fit in memory, a min-heap or max-heap can be used, but here K = N/2, which is 5G numbers and still cannot be read into memory.
3. Continuing from the previous point, for the case where neither N nor K numbers fit in memory at once, "The Beauty of Programming" gives a scheme: choose k < K such that k numbers fit entirely in memory. First build a heap of k numbers and find the 1st through k-th largest, then scan the data again to find the (k+1)-th through 2k-th largest, and keep scanning until the K-th largest is found. Each scan costs about N log(k), but ceil(K/k) scans are needed (5 scans here if k = 1G).
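The multi-scan heap scheme in point 3 can be sketched roughly as follows. This is an illustration under assumptions, not the book's code: it selects the K-th smallest rather than the K-th largest (symmetric for the median), `scan` is a hypothetical callable that re-reads the data from the start, and duplicates are handled by remembering how many copies of the boundary value earlier scans already consumed.

```python
import heapq


def kth_smallest_multipass(scan, K, k):
    """K-th smallest value (1-based) while holding at most k values in memory.

    scan: zero-argument callable returning a fresh iterator over the data
          (a stand-in for re-reading the file from disk).
    """
    lo, skip = None, 0      # values below lo, plus `skip` copies of lo, are already consumed
    remaining = K
    while True:
        heap = []           # max-heap via negation: the k smallest not-yet-consumed values
        seen_lo = 0
        for x in scan():
            if lo is not None:
                if x < lo:
                    continue
                if x == lo:
                    seen_lo += 1
                    if seen_lo <= skip:
                        continue
            if len(heap) < k:
                heapq.heappush(heap, -x)
            elif x < -heap[0]:
                heapq.heapreplace(heap, -x)
        if not heap:
            raise ValueError("K exceeds the number of values")
        if remaining <= len(heap):
            return sorted(-v for v in heap)[remaining - 1]
        # Consume this batch and move the boundary up to its largest value.
        remaining -= len(heap)
        top = -heap[0]
        copies = sum(1 for v in heap if -v == top)
        skip = copies + (skip if top == lo else 0)
        lo = top
```

Each pass of the `while` loop is one full scan of the data, so the number of scans is about ceil(K/k), as the text says.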
Solution: first assume 32-bit unsigned integers.
1. Read the 10G integers, map each integer to one of 256M segments, and count each segment with a 64-bit unsigned integer.
Description: the integer range is 0 to 2^32-1, 4G values in total. Mapped onto 256M segments, each segment covers 4G/256M = 16 values: 0~15 is the 1st segment, 16~31 is the 2nd segment, ..., 2^32-16~2^32-1 is the 256M-th segment. A 64-bit unsigned counter holds far more than 10G, so overflow need not be considered here. The counters consume 256M x 8B = 2GB of memory in total.
2. Accumulate the segment counts from front to back and stop when the running sum reaches or exceeds 5G. The segment at which the accumulation stops is the segment containing the median; call its value range [a, a+15], and call the total count of all segments before it M. Then release the memory used by the segment counters.
3. Read the 10G integers again, counting each value in [a, a+15], which takes 16 counters.
4. Accumulate the new counts in order, calling the running sum n. When M+n reaches or exceeds 5G, the value that counter corresponds to is the 5G-th number, the first of the two median positions.
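The four steps above can be sketched as follows, scaled down so it runs on small inputs. The names `kth_smallest_two_pass`, `scan`, `bits` and `seg_bits` are hypothetical; the setting in the text corresponds to `bits=32`, `seg_bits=4` and `rank=5G`, where a plain Python list would be far too large and a compact array of 64-bit counters would be needed instead.

```python
def kth_smallest_two_pass(scan, rank, bits, seg_bits):
    """rank-th smallest value (1-based) of unsigned `bits`-bit integers, in two scans.

    scan: zero-argument callable returning a fresh iterator over the data.
    Each segment covers 2**seg_bits consecutive values.
    """
    # Pass 1: one counter per segment.
    counts = [0] * (1 << (bits - seg_bits))
    for x in scan():
        counts[x >> seg_bits] += 1

    # Accumulate until the running sum reaches the rank; `before` plays the role of M.
    before = 0
    for seg, c in enumerate(counts):
        if before + c >= rank:
            break
        before += c
    else:
        raise ValueError("rank exceeds the number of values")
    del counts                      # release the segment counters

    # Pass 2: one counter per value inside the chosen segment [a, a + 2**seg_bits - 1].
    a = seg << seg_bits
    fine = [0] * (1 << seg_bits)
    for x in scan():
        if x >> seg_bits == seg:
            fine[x - a] += 1

    running = before
    for offset, c in enumerate(fine):
        running += c
        if running >= rank:
            return a + offset
```

For example, with 8-bit values and 16-value segments, `kth_smallest_two_pass(lambda: iter(data), 5, bits=8, seg_bits=4)` returns the 5th smallest element of `data`.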
1. The method above reads the integers only twice and performs a constant-time operation on each one, so overall it runs in linear time.
2. Other cases to consider.
For signed integers, just change the mapping. For 64-bit integers, widen the range of each segment and use more counters in the second read. If a counter overflows, the segment or the integer it represents can be taken as the one sought; this only needs appropriate handling. Oh, and I forgot about finding the (5G+1)-th number; with the result above, that should not be hard.
3. Balancing time and space.
Using 256M segments takes up exactly the whole 2GB of memory (so it does not really fit, hehe). You can widen each segment and reduce the number of segments to save some memory. The second pass then has more individual values to count, but the scan over the segment counts in the first part gets faster (the overall effect needs to be measured).
4. Do the mapping with bit operations wherever possible. Because the segment size is an integer power of 2, the mapping is very convenient.
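Points 2 and 4 above mention changing the mapping for signed integers and doing the mapping with bit operations. A minimal sketch of what that could look like, assuming 16-value segments (`SEG_BITS = 4`) and 32-bit two's-complement input; all names here are illustrative.

```python
SEG_BITS = 4                      # each segment covers 2**4 = 16 values


def segment_index(x):
    """Segment that an unsigned value falls into (a shift instead of a division by 16)."""
    return x >> SEG_BITS


def offset_in_segment(x):
    """Position of the value inside its segment (a mask instead of a modulo)."""
    return x & ((1 << SEG_BITS) - 1)


def segment_start(seg):
    """Smallest value covered by a segment."""
    return seg << SEG_BITS


def signed_to_ordered(x):
    """Map a signed 32-bit integer to an unsigned one with the same ordering.

    Flipping the sign bit sends -2**31 to 0 and 2**31 - 1 to 2**32 - 1.
    """
    return (x & 0xFFFFFFFF) ^ 0x80000000
```

With `signed_to_ordered` applied on read, the unsigned procedure can be reused unchanged, and the result is mapped back by flipping the same bit.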
1. Divide the integer range into 256M segments. Each segment uses a 64-bit integer to hold its count, 256M x 8 = 2G of memory, cleared to 0 first.
2. Read the 10G integers, map each integer to one of the 256M segments, and increment the count of that segment.
3. Scan the 256M segment counts to find the segment containing the median and the total count of all segments before it. The memory of the other segments can then be released.
4. The possible values within the median segment are now few (for 32-bit integers; for 64-bit integers you can segment again). Keep one counter per value, read the 10G integers again, look only at the integers that fall in the median segment, and update the counters.
5. Scan the new counts once to find the median.
For 32-bit integers this reads the 10G integers twice and scans the 256M counts once; the final scan covers so few counters that it can be ignored.
Assume 32-bit integers, treated as unsigned.
Dividing the integers into 256M segments: the range is 0 to 2^32-1, 4G values in total, and 4G/256M = 16, so every 16 values form one segment: 0-15 is the first, 16-31 is the second, ...
Mapping an integer to a segment: if the integer is in 0-15, increment the count of the first segment; if it is in 16-31, increment the count of the second segment; ...
In fact there is no need to use as many as 256M segments. Using somewhat fewer makes the scan over the segment counts faster and also saves some memory.
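To make the trade-off in the last remark concrete, here is a small sketch that computes the counter memory for both passes at different segment widths. It assumes 32-bit values and 8-byte counters as in the text; `counter_memory` is a name invented for this example.

```python
def counter_memory(seg_bits, value_bits=32, counter_bytes=8):
    """Bytes of counters needed in pass 1 and pass 2 when a segment spans 2**seg_bits values."""
    first_pass = (1 << (value_bits - seg_bits)) * counter_bytes   # one counter per segment
    second_pass = (1 << seg_bits) * counter_bytes                 # one counter per value in a segment
    return first_pass, second_pass


if __name__ == "__main__":
    for seg_bits in (4, 8, 12, 16):
        first, second = counter_memory(seg_bits)
        print(f"segment width 2^{seg_bits}: pass 1 {first:,} B, pass 2 {second:,} B")
```

With 16-value segments the first pass needs 2^31 bytes (the full 2GB) and the second only 128 bytes; with 65536-value segments both passes need 512 KiB.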
Count by segment, find the range in which the median lies, and then zoom in on it. The specific algorithm is as follows:
1. An int on a 32-bit machine takes 4 bytes and can represent 4G different values. The raw data holds 10G numbers, so a counter needs 8 bytes to be sure of holding the count. Memory is 2G, so the values are divided into 2G/8 bytes = 256M groups, each covering 4G/256M = 16 adjacent values. That is, build an array count of 8-byte elements with 256M elements, occupying 8 bytes x 256M = 2G, exactly equal to the 2G of memory, so it fits. The first element counts how many numbers fall in the interval 0-15, the second counts how many fall in 16-31, ..., and the last counts how many fall in (4G-16) to (4G-1). One traversal of the 10G raw data fills the array.
2. Define a variable sum, initialized to 0. Traverse the array from its first element, adding each element's value to sum. If sum < 5G before an element is added and sum >= 5G after it is added, the median lies among the 16 adjacent numbers counted by that element. Record the value of sum before the addition (the largest running sum below 5G). If this is the m-th element of the array (m counted from 0), the corresponding interval is [16m, 16m+15].
3. Define another array of 8-byte counters with 16 elements, one for each number in the interval 16m to 16m+15; all other numbers are ignored. Traverse the 10G raw data again to fill this array.
4. Define a variable sum2 whose initial value is sum (the largest running sum below 5G, recorded in step 2). Traverse the new array from its first element, adding each element's value to sum2. If sum2 < 5G before an element is added and sum2 >= 5G after it is added, the median is the number that element corresponds to. If this is the n-th element of the new array (n counted from 0), the number is 16m+n, which is the median of the 10G numbers.
The algorithm above traverses the raw data twice, O(2N), and also traverses the two counter arrays once each, O(k). Total time complexity: O(2N+k).
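The steps above locate the 5G-th number; for an even count the median also needs the (5G+1)-th, which may lie in a later group. The following is a hedged sketch that streams a file twice and returns the mean of both. The assumptions are mine, not the source's: the file is raw little-endian unsigned 32-bit integers with a size that is a multiple of 4 bytes, groups default to 2^16 values wide (instead of 16) so the Python lists stay small, and `scan_file` and `median_from_file` are hypothetical names.

```python
import struct


def scan_file(path, chunk_bytes=1 << 20):
    """Yield unsigned 32-bit integers from a raw little-endian binary file, chunk by chunk."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_bytes)      # chunk_bytes is a multiple of 4
            if not chunk:
                return
            for (x,) in struct.iter_unpack("<I", chunk):
                yield x


def median_from_file(path, seg_bits=16):
    # Pass 1: count per group and count the total.
    groups = [0] * (1 << (32 - seg_bits))
    total = 0
    for x in scan_file(path):
        groups[x >> seg_bits] += 1
        total += 1
    if total == 0:
        raise ValueError("no data")
    lo_rank = (total + 1) // 2               # first median position (1-based)
    hi_rank = total // 2 + 1                 # second one; equal to lo_rank when total is odd

    total_before = 0                         # "sum" in step 2
    m = 0
    while total_before + groups[m] < lo_rank:
        total_before += groups[m]
        m += 1
    del groups

    # Pass 2: count each value in group m, and remember the smallest value above the group.
    base = m << seg_bits
    fine = [0] * (1 << seg_bits)
    next_above = None
    for x in scan_file(path):
        g = x >> seg_bits
        if g == m:
            fine[x - base] += 1
        elif g > m and (next_above is None or x < next_above):
            next_above = x

    running = total_before                   # "sum2" in step 4
    found = []
    for n, c in enumerate(fine):
        running += c
        while len(found) < 2 and running >= (lo_rank, hi_rank)[len(found)]:
            found.append(base + n)
    if len(found) < 2:                       # second position lies beyond group m
        found.append(next_above)
    return (found[0] + found[1]) / 2
```

Tracking `next_above` during the second pass avoids a third scan when the two median positions straddle a group boundary.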
The problem is as follows: on a PC with only 2G of memory, a file contains 10G integers; find the median among them and write an algorithm.
Algorithms:
1. Sort the data, then take the median.
2. Use a heap: first find the 1G-th element, then use that element to find the 2G-th, then use the 2G-th to find the 3G-th, and so on. This way no full sort is needed, but there are more disk operations; whether it is less efficient than sorting, and whether its disk I/O exceeds that of sorting, needs specific analysis. Build a max-heap of 1G integers and insert an element whenever it is smaller than the maximum of the heap; this yields the 1G-th element in ascending order. Then use this element and rebuild the heap, with the condition that only elements greater than this 1G-th element are added; this way the heap yields the 2G-th element, and so on.
3. Borrowing the idea of radix sort, count by bits, from the highest bit to the lowest. For ease of exposition assume unsigned integers, i.e. increasing from 0x00000000 to 0xFFFFFFFF. Traverse all the data and record how many numbers have highest bit 0 and how many have highest bit 1 (a number with highest bit 0 is certainly smaller than one with highest bit 1); call these N0 and N1.
Comparing N0 and N1 then tells whether the highest bit of the median is 0 or 1.
Assuming N0 > N1, next count N00 and N01;
if N00 > (N01+N1), the top two bits of the median are 00;
then count N000 and N001, and so on. Proceeding in turn finds the median. As an improvement, set up multiple counters:
it seems that N0, N00, ... can then be counted in a single pass of disk I/O.
4. Borrowing the idea of bucket sort
Assume the integers are 32-bit unsigned numbers.
The first scan divides 0~2^32-1 into 2^16 intervals and records the number of integers in each interval,
which locates the specific interval containing the median: 65536*i ~ 65536*(i+1)-1.
The second scan finds the exact median value.
Since the first scan has identified the interval 65536*i ~ 65536*(i+1)-1,
the second scan counts the number of occurrences of each number in that interval.
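Approach 3 above, the bit-by-bit counting borrowed from radix sort, can be sketched as follows. This is a simplified illustration, not the author's code: it makes one scan per bit with a single counter (the multi-counter improvement is left out), it compares the zero-count against the target rank rather than N0 against N1, and `kth_smallest_by_bits`, `scan` and `bits` are hypothetical names.

```python
def kth_smallest_by_bits(scan, rank, bits):
    """rank-th smallest value (1-based) of unsigned `bits`-bit integers, one scan per bit.

    scan: zero-argument callable returning a fresh iterator over the data.
    """
    prefix = 0                                  # bits of the answer decided so far
    for bit in range(bits - 1, -1, -1):
        zeros = 0
        for x in scan():
            # Only values that agree with the decided high bits still matter.
            same_prefix = (x >> (bit + 1)) == (prefix >> (bit + 1))
            if same_prefix and not (x >> bit) & 1:
                zeros += 1
        if rank > zeros:
            # The target is not among the values with a 0 here, so this bit is 1.
            rank -= zeros
            prefix |= 1 << bit
    return prefix
```

For 32-bit values this needs 32 scans with constant memory, which is why the text suggests keeping several counters per scan to cut the number of passes.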