Finding the median or the K-th largest number in a file of 10G integers

Source: http://hxraid.iteye.com/blog/649831

Problem: A file contains 10G integers in no particular order, and the task is to find the median. The memory limit is 2G. Only the idea needs to be written down (a memory limit of 2G means the program may use 2G of memory to run, regardless of the memory used by other software on the machine).

Analysis: Since we are looking for the median, the simplest idea is to sort the data. A byte-based bucket sort is a viable option (see bucket sort):

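Idea: Treat each byte of an integer as a key, so that a 4-byte integer splits into 4 keys. The larger the key from the highest byte, the larger the integer. If the highest keys are equal, compare the keys from the second-highest byte, and so on. The whole comparison is similar to the lexicographic (dictionary) ordering of strings. (This holds as stated for unsigned integers; for signed integers the sign bit in the highest byte needs separate handling.)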
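As a small illustration of the byte-key idea, the sketch below splits an integer into four keys and checks that comparing the key tuples matches comparing the integers. It assumes unsigned 32-bit values; the helper name `byte_keys` is made up for this example.

```python
def byte_keys(x):
    """Split an unsigned 32-bit integer into 4 byte keys, highest byte first."""
    return ((x >> 24) & 0xFF, (x >> 16) & 0xFF, (x >> 8) & 0xFF, x & 0xFF)


values = [0x01020304, 0x01020400, 0xFF000000, 0x00FFFFFF]

# Tuples compare element by element, like strings in dictionary order,
# so sorting by the key tuples gives the same order as sorting the integers.
sorted_by_keys = sorted(values, key=byte_keys)
sorted_directly = sorted(values)
```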

First step: Read the 10G of integers into memory 2G at a time, traversing 536,870,912 integers per batch (2G of memory holds that many 4-byte integers). For each integer, use the bitwise shift operator ">>" to extract its highest 8 bits (bits 31-24). These 8 bits take values 0-255, so there are at most 256 buckets, and the 8-bit value decides which bucket the integer goes into. Finally, write each bucket to a disk file and keep a count of the items in each bucket in memory; these counts need only 256 integer variables.

Cost: (1) The IO cost of reading 10G of data into memory sequentially (this is unavoidable, since the CPU cannot operate directly on disk). (2) Traversing 536,870,912 integers in memory per batch, which is linear time, O(n). (3) Writing the 256 buckets to 256 disk files; this cost is extra, roughly the time to transfer the 10G of data one more time.
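The first step might be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes the file stores little-endian unsigned 32-bit integers, that its size is a multiple of 4 bytes, and that the output directory holds no bucket files from an earlier run. The names `partition_by_top_byte` and `bucket_path`, the file naming scheme, and the tiny `chunk_ints` value (standing in for the 2G batch) are all made up for the example.

```python
import os
import struct
import tempfile


def bucket_path(out_dir, b):
    """Hypothetical naming scheme for the file of bucket b (0-255)."""
    return os.path.join(out_dir, "bucket_%03d.bin" % b)


def partition_by_top_byte(path, out_dir, chunk_ints=1 << 16):
    """Split a file of uint32 values into up to 256 bucket files by top byte.

    Returns the list of 256 per-bucket counts kept in memory.
    """
    counts = [0] * 256
    with open(path, "rb") as f:
        while True:
            data = f.read(4 * chunk_ints)  # one batch
            if not data:
                break
            pending = [bytearray() for _ in range(256)]
            for (x,) in struct.iter_unpack("<I", data):
                b = x >> 24  # highest 8 bits (bits 31-24)
                counts[b] += 1
                pending[b] += struct.pack("<I", x)
            # Append this batch's share to each non-empty bucket file.
            for b, buf in enumerate(pending):
                if buf:
                    with open(bucket_path(out_dir, b), "ab") as out:
                        out.write(buf)
    return counts


# Tiny demo on a temporary file with five integers.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.bin")
    demo_values = [5, 0x01000000, 0xFF000001, 7, 0x01FFFFFF]
    with open(src, "wb") as f:
        f.write(struct.pack("<5I", *demo_values))
    demo_counts = partition_by_top_byte(src, tmp, chunk_ints=2)
    with open(bucket_path(tmp, 1), "rb") as f:
        demo_bucket_1 = [x for (x,) in struct.iter_unpack("<I", f.read())]
```

Appending per batch keeps only one output file open at a time, which sidesteps per-process limits on open files; keeping all 256 files open would save the repeated open/close calls.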

Second step: From the 256 bucket counts kept in memory, work out which bucket contains the median. The file holds 2,684,354,560 integers (10G of data at 4 bytes each), so the median is the 1,342,177,280th number. Suppose the counts of the first 127 buckets add up to less than 1,342,177,280, and adding the 128th bucket's count brings the sum to 1,342,177,280 or more. Then the median must be in the 128th bucket on disk, at position 1,342,177,280 - N(0-127) within that bucket, where N(0-127) is the total count of the first 127 buckets. The integers in the 128th file are then read into memory. (With 256 buckets, the average file size is about 10G/256 = 40M. The actual size depends on the data, but it is unlikely to exceed 2G.)

Cost: (1) A loop that accumulates the counts of the 256 buckets, costing O(M) where M <= 256. (2) The IO cost of reading one file of about 40M.
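The bucket lookup in the second step can be sketched like this. The function name `locate_bucket` and the small demo counts are invented for illustration; ranks are 1-based, and the median is taken to be the element of rank n/2, as in the text.

```python
def locate_bucket(counts, k):
    """Return (bucket index, 1-based rank inside that bucket) for the
    k-th smallest value, given the per-bucket counts in bucket order."""
    seen = 0  # total count of all earlier buckets
    for b, c in enumerate(counts):
        if seen + c >= k:
            return b, k - seen
        seen += c
    raise ValueError("k exceeds the total number of values")


# Rank of the median for the file size used in the text.
total = 2_684_354_560
median_rank = total // 2

# Small demo: 9 values spread over 4 buckets; the 5th smallest value
# is the 2nd element (in sorted order) of bucket 2.
demo_location = locate_bucket([3, 0, 4, 2], 5)
```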

Note that in a pathological case the 128th file may itself still be larger than 2G; it can then be read in batches, just as in the first step.

Third step: Bucket the integers now in memory by their second-highest 8 bits (bits 23-16). The process is the same as in the first step, again with 256 buckets.

Fourth step: Continue in the same way until the bucket sort on the lowest byte (bits 7-0) is finished. I believe that by this point a single quicksort in memory is entirely sufficient.
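The byte-by-byte narrowing of the third and fourth steps can be sketched entirely in memory as below. This is an illustration under the assumption of unsigned 32-bit values, with the made-up name `select_by_bytes`; each pass keeps only the bucket that contains the wanted rank.

```python
import random


def select_by_bytes(values, k):
    """Return the k-th smallest (1-based) unsigned 32-bit value by
    bucketing on one byte at a time, from highest to lowest."""
    if not 1 <= k <= len(values):
        raise ValueError("k out of range")
    for shift in (24, 16, 8, 0):
        buckets = [[] for _ in range(256)]
        for x in values:
            buckets[(x >> shift) & 0xFF].append(x)
        seen = 0
        for bucket in buckets:
            if seen + len(bucket) >= k:
                # The wanted value is in this bucket; adjust the rank.
                values, k = bucket, k - seen
                break
            seen += len(bucket)
    # After all four bytes, every value left in the bucket is identical.
    return values[0]


# Demo with seeded pseudo-random data.
rng = random.Random(0)
demo_data = [rng.getrandbits(32) for _ in range(1000)]
demo_k = len(demo_data) // 2
demo_result = select_by_bytes(demo_data, demo_k)
```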

The time complexity of the whole process is linear, O(n) (there are no nested loops). The main time, however, is spent on the second memory-disk transfer of the first step, that is, writing the 10G of data back to disk as 256 files. In general, if the file containing the median fits in memory after the second step, it can simply be quicksorted directly. For the efficiency of quicksort, see the data in my blog post "Summary of internal sorting based on comparison".
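The shortcut of sorting the selected bucket directly might look like the sketch below. It assumes the bucket file holds little-endian unsigned 32-bit integers and fits in memory; the function name `kth_in_bucket_file` is made up, and Python's built-in sort (Timsort) stands in for the quicksort the text mentions.

```python
import os
import struct
import tempfile


def kth_in_bucket_file(path, rank):
    """Load a bucket file, sort it in memory, and return the value
    at the given 1-based rank within the bucket."""
    with open(path, "rb") as f:
        data = f.read()
    values = [x for (x,) in struct.iter_unpack("<I", data)]
    values.sort()  # comparison sort in memory, in place of quicksort
    return values[rank - 1]


# Demo: a small bucket file written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "bucket.bin")
    with open(demo_path, "wb") as f:
        f.write(struct.pack("<5I", 9, 3, 7, 1, 5))
    demo_kth = kth_in_bucket_file(demo_path, 3)
```

The rank passed in is the position inside the bucket, that is, the overall rank minus the total count of all earlier buckets.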
