External sorting of large files (implementation and comparison of two-way merge and K-way merge)

Source: Internet
Author: User

Both sorting methods begin the same way: the large unsorted external file is split into blocks small enough to fit in memory, each block is read in and sorted, and every sorted block (a "run") is written out as its own external file.

  1. Two-way merge repeatedly merges pairs of sorted files into sorted files of double the size, then merges those, until everything has been merged into a single sorted file.
  2. K-way merge merges all the sorted sub-files in a single pass. First, the first value of each file is read and pushed into a priority queue; the smallest value is popped and appended to the result file, and the next value is read from the file the popped value came from. The key to this method is to keep each value in the queue associated with its source file, so that reading can continue from the right place. Equally important is end-of-file handling: when a file is exhausted, the maximum value Max is pushed into the priority queue as a sentinel in its place. When the value popped from the queue is Max, all files have been fully read and the result file is complete.

The C++ code for two-way merge (the helpers push_rand, write_data_to_file, write_data_append_file, read_data_to_array, merge_file, print_file and SingletonTimer are defined elsewhere in the author's project):

// Sort N (10000000) integers by two-way merging: always merge two
// sorted files into one until a single sorted file remains.

string get_file_name(int count_file)
{
    stringstream s;
    s << count_file;
    string count_file_string;
    s >> count_file_string;
    string file_name = "data";
    file_name += count_file_string;
    return file_name;
}

// Create `number` random integers in a file, at most 100000 at a time.
void push_random_data_to_file(const string& filename, unsigned long number)
{
    if (number < 100000)
    {
        vector<int> a;
        push_rand(a, number, 0, number);
        write_data_to_file(a, filename.c_str());
    }
    else
    {
        vector<int> a;
        const int per = 100000, n = number / per;
        push_rand(a, number % per, 0, number);
        write_data_to_file(a, filename.c_str());
        for (int i = 0; i < n; i++)
        {
            a.clear();
            push_rand(a, 100000, 0, 100000);
            write_data_append_file(a, filename.c_str());
        }
    }
}

// Split the input file into runs of `per` integers, sort each run in
// memory, and write it to its own file.
void split_data(const string& datafrom, deque<string>& file_name_array,
                unsigned long per, int& count_file)
{
    unsigned long position = 0;
    while (true)
    {
        vector<int> a;
        // Read one block from the file; returns true when no data is left.
        if (read_data_to_array(datafrom, a, position, per) == true)
        {
            break;
        }
        position += per;
        sort(a.begin(), a.end());  // in-memory sort of this run
        string filename = get_file_name(count_file++);
        file_name_array.push_back(filename);
        // Output the sorted run to its own external file.
        write_data_to_file(a, filename.c_str());
        print_file(filename);  // debug: show the run
    }
}

void sort_big_file_with_binary_merge(unsigned long n, unsigned long per)
{
    unsigned long traverse = n / per;
    cout << n << " integers are sorted by two-way merging; each run holds "
         << per << " integers, so all data is divided into "
         << traverse << " files" << endl;
    SingletonTimer::instance();
    // Split the file to be sorted into small files, sort each in memory,
    // and put them on disk.
    string datafrom = "data.txt";
    deque<string> file_name_array;
    int count_file = 0;
    split_data(datafrom, file_name_array, per, count_file);
    SingletonTimer::instance()->Print("split the file into sorted runs on disk");
    // Two-way merge: take the two files at the front of the queue, merge
    // them into one ordered file, and push the result to the back, until
    // only one ordered file remains.
    while (file_name_array.size() >= 2)
    {
        string file1 = file_name_array.front();
        file_name_array.pop_front();
        string file2 = file_name_array.front();
        file_name_array.pop_front();
        string fileout = get_file_name(count_file++);
        file_name_array.push_back(fileout);
        merge_file(file1, file2, fileout);
    }
    SingletonTimer::instance()->Print("merged the runs into one ordered file");
    cout << "the final file stores all sorted data, of which the first one hundred are:" << endl;
    print_file(file_name_array.back(), 100);
}

The C++ code for K-way merge:

// K-way merge sort of a large file (1000 * 10000 integers).

void write_random_data_to_file(unsigned long number)
{
    cout << "writing " << number << " integers to file data..." << endl;
    unsigned long traverse = number / 100000;
    cout << traverse << " writes are needed." << endl;
    // Create a large amount of data in the file.
    vector<int> a;
    if (number < 100000)
    {
        push_rand(a, number, 0, number);
        write_data_to_file(a, "data");
    }
    else
    {
        push_rand(a, 100000, 0, 1000000);
        write_data_to_file(a, "data");
        cout << "the 0th write finished." << endl;
        for (unsigned long i = 1; i < traverse; i++)
        {
            a.clear();
            push_rand(a, 100000, 0, 100000);
            write_data_append_file(a, "data");
            cout << "the " << i << "th write finished, "
                 << (traverse - 1 - i) << " left." << endl;
        }
    }
    cout << number << " integers written to file data." << endl;
}

// Split the big file into sorted runs of `number` integers each and
// return the list of run file names.
list<string> divide_big_file_to_small_sorted_files(long number)
{
    vector<int> a;
    long position = 0;
    int count_file = 0;
    list<string> file_name_array;
    while (true)
    {
        a.clear();
        // Returns true when no data is left to read.
        if (read_data_to_array("data.txt", a, position, number) == true)
        {
            break;
        }
        position += number;
        sort(a.begin(), a.end());
        string filename = get_file_name(count_file++);
        file_name_array.push_back(filename);
        write_data_to_file(a, filename.c_str());
        cout << "sorted file " << (count_file - 1) << " built." << endl;
    }
    return file_name_array;
}

void k_way_merge_sort(const list<string>& file_name_array)
{
    // Open an ifstream for each run file (vector<ifstream> requires
    // C++11, where streams are movable).
    vector<ifstream> readfiles(file_name_array.size());
    vector<ifstream>::size_type n = 0;
    for (list<string>::const_iterator i = file_name_array.begin();
         i != file_name_array.end(); ++i)
    {
        readfiles[n++].open(i->c_str(), ios::binary | ios::in);
    }
    // Initialize the min-heap with the first integer of each file;
    // `second` is the index of the file the value came from.
    priority_queue<pair<int, int>, vector<pair<int, int> >,
                   greater<pair<int, int> > > prioritydata;
    for (vector<ifstream>::size_type i = 0; i < readfiles.size(); i++)
    {
        int temp;
        readfiles[i].read(reinterpret_cast<char*>(&temp), sizeof(int));
        prioritydata.push(make_pair(temp, (int)i));
    }
    // Merge: pop the minimum, write it to the result file, then refill
    // the heap from the file that supplied it.
    ofstream fout;
    fout.open("result", ios::binary);
    while (true)
    {
        int onedata = prioritydata.top().first;
        if (onedata == numeric_limits<int>::max())
        {
            break;  // the sentinel has reached the top: all files are done
        }
        fout.write(reinterpret_cast<const char*>(&onedata), sizeof(int));
        int i = prioritydata.top().second;
        prioritydata.pop();
        int temp;
        readfiles[i].read(reinterpret_cast<char*>(&temp), sizeof(int));
        if (readfiles[i].eof())
        {
            // File i is exhausted: push the maximum value as a sentinel.
            prioritydata.push(make_pair(numeric_limits<int>::max(), i));
        }
        else
        {
            prioritydata.push(make_pair(temp, i));
        }
    }
    // Close all open files.
    fout.close();
    for (vector<ifstream>::size_type i = 0; i < readfiles.size(); i++)
    {
        readfiles[i].close();
    }
}

void sort_big_file_with_k_way_merge(unsigned long n, unsigned long partitionfilesize)
{
    write_random_data_to_file(n);
    Timer t;
    // Split the file into sorted runs (assume memory holds only about
    // 1 M integers at a time), then merge them in one pass.
    k_way_merge_sort(divide_big_file_to_small_sorted_files(partitionfilesize));
    cout << n / partitionfilesize << "-way merge sort of " << n
         << " integers, sorting " << partitionfilesize
         << " in memory at a time" << endl;
    print(t.elapsed());
    print(" seconds");
    print_file("result", 1000);
}

 

Output results and comparison:

K-way merge:
  4-way:   209 seconds
  8-way:   190 seconds
  16-way:  223 seconds

Two-way merge:
  4 sub-files: 257 seconds
  8 sub-files: 281 seconds

The experiments above show that two-way merge is not optimal for external sorting. Because it merges only two files at a time, the entire data set is traversed many times, and with large data volumes the number of traversals directly determines the sorting time. K-way merge instead merges all K sorted sub-files into the final result file in a single pass, so the data is traversed only twice: once to read and once to write. The remaining time is mainly spent popping from and re-adjusting the priority queue. The value of K therefore cannot be too large or too small: too large, and adjusting the heap takes too much time; too small, and the in-memory sorting of each run consumes too much memory. In these results, 8-way merge sort was the fastest.

 
