External sorting of large files (implementation and comparison of two-way merge and K-way merge)

Source: Internet
Author: User

Both sorting methods begin the same way: the large unsorted external file is split into blocks small enough to fit in memory, each block is read in and sorted, and every sorted block (a "run") is written out as its own external file.

  1. Two-way merge repeatedly merges pairs of sorted files into sorted files of double the size, then merges those, until everything has been merged into a single sorted file.
  2. K-way merge merges all the sorted sub-files in a single pass. First, the first value of each file is read and pushed into a priority queue; the smallest value is popped and appended to the result file, and the next value is read from the file the popped value came from. The key to this method is to keep each value in the queue associated with its source file, so that reading can continue from the right place. Equally important is end-of-file handling: when a file is exhausted, the maximum value Max is pushed into the priority queue as a sentinel in its place. When the value popped from the queue is Max, all files have been fully read and the result file is complete.

The C++ code for two-way merge (the helpers push_rand, write_data_to_file, write_data_append_file, read_data_to_array, merge_file, print_file and SingletonTimer are defined elsewhere in the author's project):

// Sort N (10000000) integers by two-way merging: always merge two
// sorted files into one until a single sorted file remains.

string get_file_name(int count_file)
{
    stringstream s;
    s << count_file;
    string count_file_string;
    s >> count_file_string;
    string file_name = "data";
    file_name += count_file_string;
    return file_name;
}

// Create `number` random integers in a file, at most 100000 at a time.
void push_random_data_to_file(const string& filename, unsigned long number)
{
    if (number < 100000)
    {
        vector<int> a;
        push_rand(a, number, 0, number);
        write_data_to_file(a, filename.c_str());
    }
    else
    {
        vector<int> a;
        const int per = 100000, n = number / per;
        push_rand(a, number % per, 0, number);
        write_data_to_file(a, filename.c_str());
        for (int i = 0; i < n; i++)
        {
            a.clear();
            push_rand(a, 100000, 0, 100000);
            write_data_append_file(a, filename.c_str());
        }
    }
}

// Split the input file into runs of `per` integers, sort each run in
// memory, and write it to its own file.
void split_data(const string& datafrom, deque<string>& file_name_array,
                unsigned long per, int& count_file)
{
    unsigned long position = 0;
    while (true)
    {
        vector<int> a;
        // Read one block from the file; returns true when no data is left.
        if (read_data_to_array(datafrom, a, position, per) == true)
        {
            break;
        }
        position += per;
        sort(a.begin(), a.end());  // in-memory sort of this run
        string filename = get_file_name(count_file++);
        file_name_array.push_back(filename);
        // Output the sorted run to its own external file.
        write_data_to_file(a, filename.c_str());
        print_file(filename);  // debug: show the run
    }
}

void sort_big_file_with_binary_merge(unsigned long n, unsigned long per)
{
    unsigned long traverse = n / per;
    cout << n << " integers are sorted by two-way merging; each run holds "
         << per << " integers, so all data is divided into "
         << traverse << " files" << endl;
    SingletonTimer::instance();
    // Split the file to be sorted into small files, sort each in memory,
    // and put them on disk.
    string datafrom = "data.txt";
    deque<string> file_name_array;
    int count_file = 0;
    split_data(datafrom, file_name_array, per, count_file);
    SingletonTimer::instance()->Print("split the file into sorted runs on disk");
    // Two-way merge: take the two files at the front of the queue, merge
    // them into one ordered file, and push the result to the back, until
    // only one ordered file remains.
    while (file_name_array.size() >= 2)
    {
        string file1 = file_name_array.front();
        file_name_array.pop_front();
        string file2 = file_name_array.front();
        file_name_array.pop_front();
        string fileout = get_file_name(count_file++);
        file_name_array.push_back(fileout);
        merge_file(file1, file2, fileout);
    }
    SingletonTimer::instance()->Print("merged the runs into one ordered file");
    cout << "the final file stores all sorted data, of which the first one hundred are:" << endl;
    print_file(file_name_array.back(), 100);
}

The C++ code for K-way merge:

// K-way merge sort of a large file (1000 * 10000 integers).

void write_random_data_to_file(unsigned long number)
{
    cout << "writing " << number << " integers to file data..." << endl;
    unsigned long traverse = number / 100000;
    cout << traverse << " writes are needed." << endl;
    // Create a large amount of data in the file.
    vector<int> a;
    if (number < 100000)
    {
        push_rand(a, number, 0, number);
        write_data_to_file(a, "data");
    }
    else
    {
        push_rand(a, 100000, 0, 1000000);
        write_data_to_file(a, "data");
        cout << "the 0th write finished." << endl;
        for (unsigned long i = 1; i < traverse; i++)
        {
            a.clear();
            push_rand(a, 100000, 0, 100000);
            write_data_append_file(a, "data");
            cout << "the " << i << "th write finished, "
                 << (traverse - 1 - i) << " left." << endl;
        }
    }
    cout << number << " integers written to file data." << endl;
}

// Split the big file into sorted runs of `number` integers each and
// return the list of run file names.
list<string> divide_big_file_to_small_sorted_files(long number)
{
    vector<int> a;
    long position = 0;
    int count_file = 0;
    list<string> file_name_array;
    while (true)
    {
        a.clear();
        // Returns true when no data is left to read.
        if (read_data_to_array("data.txt", a, position, number) == true)
        {
            break;
        }
        position += number;
        sort(a.begin(), a.end());
        string filename = get_file_name(count_file++);
        file_name_array.push_back(filename);
        write_data_to_file(a, filename.c_str());
        cout << "sorted file " << (count_file - 1) << " built." << endl;
    }
    return file_name_array;
}

void k_way_merge_sort(const list<string>& file_name_array)
{
    // Open an ifstream for each run file (vector<ifstream> requires
    // C++11, where streams are movable).
    vector<ifstream> readfiles(file_name_array.size());
    vector<ifstream>::size_type n = 0;
    for (list<string>::const_iterator i = file_name_array.begin();
         i != file_name_array.end(); ++i)
    {
        readfiles[n++].open(i->c_str(), ios::binary | ios::in);
    }
    // Initialize the min-heap with the first integer of each file;
    // `second` is the index of the file the value came from.
    priority_queue<pair<int, int>, vector<pair<int, int> >,
                   greater<pair<int, int> > > prioritydata;
    for (vector<ifstream>::size_type i = 0; i < readfiles.size(); i++)
    {
        int temp;
        readfiles[i].read(reinterpret_cast<char*>(&temp), sizeof(int));
        prioritydata.push(make_pair(temp, (int)i));
    }
    // Merge: pop the minimum, write it to the result file, then refill
    // the heap from the file that supplied it.
    ofstream fout;
    fout.open("result", ios::binary);
    while (true)
    {
        int onedata = prioritydata.top().first;
        if (onedata == numeric_limits<int>::max())
        {
            break;  // the sentinel has reached the top: all files are done
        }
        fout.write(reinterpret_cast<const char*>(&onedata), sizeof(int));
        int i = prioritydata.top().second;
        prioritydata.pop();
        int temp;
        readfiles[i].read(reinterpret_cast<char*>(&temp), sizeof(int));
        if (readfiles[i].eof())
        {
            // File i is exhausted: push the maximum value as a sentinel.
            prioritydata.push(make_pair(numeric_limits<int>::max(), i));
        }
        else
        {
            prioritydata.push(make_pair(temp, i));
        }
    }
    // Close all open files.
    fout.close();
    for (vector<ifstream>::size_type i = 0; i < readfiles.size(); i++)
    {
        readfiles[i].close();
    }
}

void sort_big_file_with_k_way_merge(unsigned long n, unsigned long partitionfilesize)
{
    write_random_data_to_file(n);
    Timer t;
    // Split the file into sorted runs (assume memory holds only about
    // 1 M integers at a time), then merge them in one pass.
    k_way_merge_sort(divide_big_file_to_small_sorted_files(partitionfilesize));
    cout << n / partitionfilesize << "-way merge sort of " << n
         << " integers, sorting " << partitionfilesize
         << " in memory at a time" << endl;
    print(t.elapsed());
    print(" seconds");
    print_file("result", 1000);
}

 

Output results and comparison:

K-way merge:
  4-way:   209 seconds
  8-way:   190 seconds
  16-way:  223 seconds

Two-way merge:
  4 sub-files: 257 seconds
  8 sub-files: 281 seconds

The experiments above show that two-way merge is not optimal for external sorting. Because it merges only two files at a time, the entire data set is traversed many times, and with large data volumes the number of traversals directly determines the sorting time. K-way merge instead merges all K sorted sub-files into the final result file in a single pass, so the data is traversed only twice: once to read and once to write. The remaining time is mainly spent popping from and re-adjusting the priority queue. The value of K therefore cannot be too large or too small: too large, and adjusting the heap takes too much time; too small, and the in-memory sorting of each run consumes too much memory. In these results, 8-way merge sort was the fastest.

 
