The Art of Programming for Programmers (Algorithms Volume), Chapter 10: How to Sort a Disk File Containing 10^7 Records

Source: Internet
Author: User
Tags: bitset, coding standards

Prelude

After a few days of hard reflection, we finally decided to officially rename the original series to the Programmer's Art of Programming series. At the same time, our writing group was renamed to a studio. There are three reasons for the renaming: 1. Serving interview preparation cannot be our ultimate or main goal. 2. We would like to treat the process of solving each interview question, ACM problem, or other programming problem as an art. 3. Distilling that art is itself a very difficult process, but we are happy to accept the challenge.

The algorithms volume of this series is roughly divided into three parts. Part 1, programming: interview questions, ACM problems, POJ problems, and other programming problems. As long as a problem is good and worth designing for or exploring, we will not turn it down, and we keep looking for more efficient algorithms to solve practical problems. Part 2, algorithm research: based mainly on my earlier original series on 13 classic algorithms, it aims to analyze each classic algorithm in detail, in an easy-to-understand way, and to implement it in code. Part 3, coding literacy: mainly coding standards and other issues programmers need to pay attention to while coding.

If possible, this TAOPP series will follow the model of TAOCP and appear as Volume 1, Volume 2, and so on. Where does the art of programming come from? From choosing the proper data structures? From the search for more efficient algorithms? Or from good coding standards? Hopefully, this TAOPP series will eventually give you a complete answer.

OK. If you have any comments on this series, or find any problems, flaws, or bugs in it, please feel free to point them out. We will accept them with humility and gratitude, and use them to create better value and better service for others.

Section 1: How to Sort Disk Files
Problem description:
Input: a file containing at most n distinct positive integers, where each number is less than n and n = 10^7.
Output: a list of all the input integers in ascending order.
Constraints: at most 1 MB of memory is available, but disk space is plentiful. The running time must be under 5 minutes; about 10 seconds would be ideal.

Analysis: let's solve this problem step by step.
1. Merge sort. You may first think of merge sorting the disk file, but the problem allows only 1 MB of memory, so the data cannot simply be loaded and merge sorted in memory. This method cannot be used directly.
2. Bitmap solution. Readers familiar with bitmaps may think of using a bitmap to represent the set of integers in the file. For example, as described in the book Programming Pearls, a string of 20 bits can represent a small set of non-negative integers that are all smaller than 20. The set {1, 2, 3, 5, 8, 13} is represented by the following string:

0 1 1 1 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0

The bits at the positions corresponding to numbers in the set are 1, and all other bits are 0.

Following the bitmap solution in Programming Pearls, we can approach the sorting of a disk file with 10^7 records in the same way. Each 7-digit decimal integer in the file is an integer smaller than 10 million, so we can use a string of 10 million bits to represent the file, where the i-th bit is 1 if and only if integer i is in the file. The bitmap scheme exploits three special properties of this problem: 1. The input data is limited to a relatively small range. 2. The data contains no duplicates. 3. Each record is a single integer with no other data associated with it.
Therefore, the bitmap solution is divided into the following three steps to solve this problem:

Step 1: Set all the bits to 0, so that the set is initially empty.
Step 2: Build the set by reading each integer in the file and setting the corresponding bit to 1.
Step 3: Check each bit in order. If the bit is 1, output the corresponding integer.
After these three steps, a sorted output file has been produced. If n is the number of bits in the bitmap vector (10,000,000 in this example), the program can be expressed in pseudocode as follows:
// Pseudocode of the bitmap solution for disk file sorting
// Copyright @ Jon Bentley
// July, updated, 2011.05.29.
// Step 1: Initialize all bits to 0
for i = [0, n)
    bit[i] = 0

// Step 2: Build the set by reading each integer in the file and setting the corresponding bit to 1.
for each i in the input file
    bit[i] = 1

// Step 3: Check each bit. If the bit is 1, output the corresponding integer.
for i = [0, n)
    if bit[i] == 1
        write i on the output file

The above is only an abstract pseudocode description of the bitmap algorithm. Obviously, the problem we face is not that simple: a bitmap of 10^7 bits takes about 1.25 MB, which exceeds the 1 MB memory limit. Next, we write the complete code for sorting the disk file in two passes, each pass covering half of the value range, as shown below.

// Copyright @ yansha
// July, 2010.05.30.

// Bitmap solution for sorting a file containing 10^7 records.
// If the input contains duplicates, each value is written only once.
#include <iostream>
#include <bitset>
#include <cassert>
#include <cstdio>
#include <ctime>
using namespace std;

const int max_each_scan = 5000000;

int main()
{
    clock_t begin = clock();
    // static: 5,000,000 bits (625,000 bytes) may not fit on a small default stack
    static bitset<max_each_scan> bit_map;
    bit_map.reset();

    // Open the file with the unsorted data
    FILE *fp_unsort_file = fopen("data.txt", "r");
    assert(fp_unsort_file);

    int num;
    // First scan: sort the data in the range 0-4999999
    while (fscanf(fp_unsort_file, "%d", &num) == 1)
    {
        if (num >= 0 && num < max_each_scan)
            bit_map.set(num, 1);
    }

    FILE *fp_sort_file = fopen("sort.txt", "w");
    assert(fp_sort_file);

    int i;
    // Write the sorted data to the output file
    for (i = 0; i < max_each_scan; i++)
    {
        if (bit_map[i] == 1)
            fprintf(fp_sort_file, "%d ", i);
    }

    // Second scan: sort the data in the range 5000000-9999999
    int result = fseek(fp_unsort_file, 0, SEEK_SET);
    if (result)
        cout << "fseek failed!" << endl;
    else
    {
        bit_map.reset();
        while (fscanf(fp_unsort_file, "%d", &num) == 1)
        {
            if (num >= max_each_scan && num < 10000000)
            {
                num -= max_each_scan;
                bit_map.set(num, 1);
            }
        }

        // The source listing breaks off at this loop header; everything from
        // here on is completed by symmetry with the first pass.
        for (i = 0; i < max_each_scan; i++)
        {
            if (bit_map[i] == 1)
                fprintf(fp_sort_file, "%d ", i + max_each_scan);
        }
    }

    clock_t end = clock();
    cout << "elapsed: " << (double)(end - begin) / CLOCKS_PER_SEC << " s" << endl;
    fclose(fp_sort_file);
    fclose(fp_unsort_file);
    return 0;
}
