Finding the Median in a 10 GB Integer File

Source: Internet
Author: User

Http://hxraid.iteye.com/blog/649831

Question: A 10 GB file contains integers in no particular order, and the median must be found. The memory limit is 2 GB. Only the idea needs to be written out. (The 2 GB limit means the program itself may use 2 GB of memory, regardless of the memory occupied by other software on the machine.)

Analysis: Finding a median naturally suggests sorting. A byte-based bucket sort is one feasible method (see bucket sort):

Idea: Treat each byte of an integer as a key, so a 32-bit integer splits into four keys. The integer with the larger most significant key is the larger integer (assuming non-negative values). If the most significant keys are equal, the next lower keys are compared, and so on. The whole comparison works like the lexicographic ordering of strings.
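As a small illustration of the byte-key idea, the sketch below splits a value into four keys and checks that comparing the keys in order agrees with comparing the integers. It assumes unsigned 32-bit values; the source does not say whether the integers are signed, and for signed values the sign bit of the top byte would have to be flipped first. The name `byte_keys` is made up for this example.

```python
def byte_keys(value):
    """Split an unsigned 32-bit integer into four byte keys,
    most significant byte first."""
    return (
        (value >> 24) & 0xFF,  # bits 31-24
        (value >> 16) & 0xFF,  # bits 23-16
        (value >> 8) & 0xFF,   # bits 15-8
        value & 0xFF,          # bits 7-0
    )


a = 0x12FF0000
b = 0x13000001

# Python compares tuples lexicographically, so the key order
# matches the integer order for unsigned values.
same_order = (byte_keys(a) < byte_keys(b)) == (a < b)
```

Here `a` has the smaller top byte (0x12 against 0x13), so it is the smaller integer even though its second byte is larger.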

Step 1: Read the 10 GB file into memory 2 GB at a time, so that each batch holds 536,870,912 integers, and traverse each batch once. For every integer, use the right-shift operator ">>" to extract the highest 8 bits (bits 31-24). These 8 bits take the values 0-255, giving at most 256 buckets, and the value decides which bucket the integer is dropped into. Finally, write each bucket to its own disk file and keep a count of the integers in each bucket in memory. The counts need only 256 integer variables.

Cost: (1) The I/O cost of reading 10 GB of data sequentially into memory (this is unavoidable, since the CPU cannot compute directly on the disk). (2) Traversing the 536,870,912 integers of each batch in memory, which is linear, O(n). (3) Writing the 256 buckets to 256 disk files. This cost is extra: it doubles the 10 GB data transfer time.
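Step 1 could be sketched roughly as follows. Everything about the file layout is an assumption for illustration: the input is taken to be a binary file of little-endian unsigned 32-bit integers, the bucket files are named `bucket_NNN.bin`, and the helper names `bucket_path` and `distribute_by_top_byte` are invented here. A tiny chunk size stands in for the 2 GB batches.

```python
import os
import struct
import tempfile


def bucket_path(out_dir, top_byte):
    """Hypothetical naming scheme for the 256 bucket files."""
    return os.path.join(out_dir, f"bucket_{top_byte:03d}.bin")


def distribute_by_top_byte(src_path, out_dir, chunk_bytes=1 << 20):
    """Read the source file batch by batch, append every integer to the
    bucket file chosen by bits 31-24, and return the 256 bucket counts."""
    if chunk_bytes % 4:
        raise ValueError("chunk_bytes must be a multiple of 4")
    counts = [0] * 256
    with open(src_path, "rb") as src:
        while True:
            chunk = src.read(chunk_bytes)
            if not chunk:
                break
            # Group the current batch in memory before touching the disk.
            pending = [bytearray() for _ in range(256)]
            for (value,) in struct.iter_unpack("<I", chunk):
                top = value >> 24
                counts[top] += 1
                pending[top] += struct.pack("<I", value)
            # Append each non-empty group to its bucket file.
            for top, data in enumerate(pending):
                if data:
                    with open(bucket_path(out_dir, top), "ab") as out:
                        out.write(data)
    return counts


# Small demonstration with four integers and 8-byte batches.
with tempfile.TemporaryDirectory() as tmp:
    src_file = os.path.join(tmp, "input.bin")
    with open(src_file, "wb") as f:
        f.write(struct.pack("<4I", 0x01000000, 0xFF000005, 0x01FFFFFF, 0x00000007))
    counts = distribute_by_top_byte(src_file, tmp, chunk_bytes=8)
    bucket_1_size = os.path.getsize(bucket_path(tmp, 1))
```

Buffering one batch in memory and then appending to the bucket files keeps only one output file open at a time. A real run would also need to budget the 2 GB limit between the input batch and the output buffers.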

Step 2: Use the 256 bucket counts held in memory to work out which bucket contains the median. The file holds 2,684,354,560 integers, so the median is the 1,342,177,280th smallest. Suppose the counts of the first 127 buckets add up to less than 1,342,177,280, while the counts of the first 128 buckets add up to at least 1,342,177,280. Then the median must be in the 128th bucket file on disk, and its rank within that bucket is 1,342,177,280 - N, where N is the total count of the first 127 buckets. Next, read the integers of the 128th file into memory. (On average each file is about 10 GB / 256 = 40 MB. This is not guaranteed, but a file larger than 2 GB is unlikely.)

Cost: (1) Accumulating the 256 bucket counts in a loop costs O(M), where M <= 256. (2) The I/O cost of reading one file of about 40 MB.

Note: In the unusual case that the 128th file to be read is still larger than 2 GB, it can be read in batches, just as in step 1.
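The bucket lookup in step 2 is a running sum over the counts. Here is a minimal sketch, with `locate_bucket` as an invented name and ranks counted from 1. The example input assumes perfectly uniform data, which the source does not claim.

```python
def locate_bucket(counts, rank):
    """Return (bucket_index, rank_within_bucket) for the rank-th smallest
    element, where rank is 1-based and bucket_index is 0-based."""
    seen = 0  # total count of all buckets before the current one
    for index, count in enumerate(counts):
        if seen + count >= rank:
            return index, rank - seen
        seen += count
    raise ValueError("rank exceeds the total number of elements")


# 2,684,354,560 integers spread evenly over 256 buckets.
uniform_counts = [10_485_760] * 256
index, inner_rank = locate_bucket(uniform_counts, 1_342_177_280)
```

With these uniform counts the lookup returns index 127, which is the 128th bucket when counting from 1, as in the example above.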

Step 3: Bucket the integers now in memory again, this time by the next 8 bits (bits 23-16). The process is the same as in step 1, again with 256 buckets.

Step 4: Continue in the same way until the lowest byte (bits 7-0) has been processed. By this point the remaining data should fit in memory, so a quicksort can be used instead.
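Once the data fits in memory, steps 2 to 4 amount to narrowing the candidates one byte at a time. The sketch below does this on an in-memory list of unsigned 32-bit integers and always goes through all four bytes rather than switching to quicksort. `select_by_bytes` is a name invented for this example.

```python
def select_by_bytes(values, rank):
    """Return the rank-th smallest (1-based) unsigned 32-bit integer by
    narrowing the candidates byte by byte, most significant byte first."""
    candidates = list(values)
    if not 1 <= rank <= len(candidates):
        raise ValueError("rank out of range")
    for shift in (24, 16, 8, 0):
        # Count how many candidates fall into each of the 256 buckets.
        counts = [0] * 256
        for v in candidates:
            counts[(v >> shift) & 0xFF] += 1
        # Find the bucket that holds the wanted rank.
        seen = 0
        key = 0
        while seen + counts[key] < rank:
            seen += counts[key]
            key += 1
        rank -= seen  # rank relative to the chosen bucket
        candidates = [v for v in candidates if (v >> shift) & 0xFF == key]
    # All four bytes are now fixed, so every remaining candidate is equal.
    return candidates[0]


data = [5, 0x01000000, 3, 0xFFFFFFFF, 0x01000001, 3, 70000]
median = select_by_bytes(data, (len(data) + 1) // 2)
```

For the seven values in `data` the wanted rank is 4, and the result agrees with `sorted(data)[3]`.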

The whole process runs in linear time, O(N), with no nested loops. Most of the time, however, goes to the second memory-disk transfer in step 1, that is, writing the 10 GB of data back to disk as 256 files. In general, if the file containing the median fits in memory after step 2, it can simply be quicksorted directly. For more on the efficiency of quicksort, see the article "comparison-based internal sorting summary" on my blog.
