External sorting: multiway merge sort and the loser tree

Source: Internet
Author: User

I. Definition

External sorting is the sorting of large files: the records to be sorted are stored on external storage, the file cannot be loaded into memory all at once, and data must be exchanged between memory and external storage several times before the whole file is sorted. The most commonly used external sorting algorithm is multiway merge sort. It splits the original file into parts that each fit in memory, loads each part into memory and sorts it, and then combines the sorted sub-files with a multiway merge.

II. Procedure

(1) Based on the amount of available memory, the file of N records on external storage is divided into sub-files of length L. Each sub-file is read into memory in turn and sorted with an efficient internal sorting method, and the resulting ordered sub-file is written back to external storage;

(2) The ordered sub-files are merged step by step, so that the merge segments grow from small to large, until the whole file is ordered.
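Step (1) can be sketched in C as follows. This is only an illustration under simplifying assumptions: an in-memory array stands in for the file on external storage, and the helper name make_runs and its parameters are made up for this example.

```c
#include <stdlib.h>

/* Comparison callback for qsort: ascending order of int keys. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/*
 * Hypothetical helper: sorts each consecutive chunk of at most run_len
 * records in place and returns the number of sorted sub-files (runs).
 * In a real external sort each chunk would be read from disk, sorted,
 * and written back as its own file; here the array plays that role.
 */
size_t make_runs(int *records, size_t n, size_t run_len)
{
    size_t runs = 0;
    if (run_len == 0)
        return 0;
    for (size_t start = 0; start < n; start += run_len) {
        size_t len = (n - start < run_len) ? n - start : run_len;
        qsort(records + start, len, sizeof records[0], cmp_int);
        runs++;
    }
    return runs;
}
```

Calling make_runs on 10 records with run_len = 4 yields three runs of 4, 4 and 2 records, each sorted on its own; step (2) is then responsible for merging them.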

Let's start with an example of how the merge in external sorting is done.
Suppose a file contains 10,000 records. Ten internal sorts first produce 10 initial merge segments R1~R10, each containing 1,000 records. The segments are then merged pairwise (2-way merge), as shown in Figure 10.11, until a single ordered file is obtained. With pairwise merging the number of segments goes 10, 5, 3, 2, 1, so four merge passes are needed.
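The number of merge passes in an example like this can be counted with a few lines of C. The function name merge_passes is an assumption made for this sketch; it simply repeats "k segments become one" until a single segment is left.

```c
/* Number of merge passes needed to reduce `runs` initial merge segments
 * to one when each pass merges up to k segments at a time (k >= 2). */
int merge_passes(int runs, int k)
{
    int passes = 0;
    while (runs > 1) {
        runs = (runs + k - 1) / k;   /* ceil(runs / k) segments remain */
        passes++;
    }
    return passes;
}
```

For 10 initial segments, merge_passes(10, 2) is 4, merge_passes(10, 5) is 2, and merge_passes(10, 10) is 1, which shows how a larger k cuts down the number of passes over the data.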

III. Multiway merge sorting and the loser tree

Multiway merge sorting is covered in most data structure textbooks. Moving from 2-way to k-way merging reduces the number of merge passes, and therefore the time spent reading and writing external storage. However, selecting the smallest record among k merge segments by direct comparison takes k-1 comparisons, so producing an ordered segment of u records takes (u-1)(k-1) comparisons. If the file of n records starts out as m initial merge segments and s merge passes are needed, the total number of comparisons during internal merging is s(k-1)(n-1), that is,

⌈log_k m⌉ (k-1)(n-1) = ⌈log_2 m / log_2 k⌉ (k-1)(n-1).

Since (k-1)/log_2 k increases with k, the internal merge time grows as k grows, which offsets the time saved on external reads and writes. This is the motivation for the loser tree. With a loser tree, selecting the smallest record among k merge segments takes only ⌈log_2 k⌉ comparisons, so the total number of comparisons becomes ⌈log_k m⌉ ⌈log_2 k⌉ (n-1), which is approximately ⌈log_2 m⌉ (n-1) and essentially independent of k.
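To make these comparison counts concrete, the two formulas can be evaluated directly. The sketch below assumes n records, m initial merge segments and k-way merging; the function names are invented for illustration.

```c
/* Smallest p with base^p >= x, i.e. ceil(log_base(x)), for x >= 1, base >= 2. */
static long ceil_log(long base, long x)
{
    long p = 0, v = 1;
    while (v < x) {
        v *= base;
        p++;
    }
    return p;
}

/* Total comparisons when every output record costs k-1 comparisons. */
long naive_comparisons(long n, long m, long k)
{
    return ceil_log(k, m) * (k - 1) * (n - 1);
}

/* Total comparisons when every output record costs ceil(log2 k)
 * comparisons, as with a loser tree. */
long loser_tree_comparisons(long n, long m, long k)
{
    return ceil_log(k, m) * ceil_log(2, k) * (n - 1);
}
```

With n = 10000 and m = 10, the direct method gives 39996, 79992 and 89991 comparisons for k = 2, 5 and 10, so the cost climbs with k. The loser tree gives 39996, 59994 and 39996: the rounding keeps it from being exactly constant, but it stays close to ⌈log_2 m⌉(n-1) = 39996 instead of growing with k.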

The loser tree is a complete binary tree, so it can be stored in a one-dimensional array. It has k leaf nodes, k-1 comparison nodes and 1 champion node, 2k nodes in total. ls[0] is the champion node, ls[1]..ls[k-1] are the comparison nodes, and ls[k]..ls[2k-1] are the leaf nodes (in practice the leaves are kept in a separate array b[0]..b[k-1]). In addition, b[k] is an auxiliary slot that does not belong to the loser tree; it is initialized to MINKEY.
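The array layout implies simple index arithmetic, sketched below. The helper names leaf_parent and path_length are assumptions for this example; the formula (s + k) / 2 is the one used by the Adjust routine in the pseudo-code section.

```c
/* Index in ls[] of the parent of leaf b[s] in a tree with k leaves.
 * Leaf b[s] conceptually sits at position s + k, so its parent is
 * at (s + k) / 2. */
int leaf_parent(int s, int k)
{
    return (s + k) / 2;
}

/* Number of comparison nodes visited when walking from leaf b[s]
 * up to ls[1]; this is the number of comparisons one adjustment costs. */
int path_length(int s, int k)
{
    int steps = 0;
    for (int t = leaf_parent(s, k); t > 0; t /= 2)
        steps++;
    return steps;
}
```

For k = 5, leaves b[0]..b[4] have parents ls[2], ls[3], ls[3], ls[4], ls[4], and no leaf is more than 3 = ⌈log_2 5⌉ comparison nodes away from the root.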

The multiway merge sorting algorithm proceeds roughly as follows:

1) Read the first element of each of the k merge segments into the leaf slots b[0]..b[k-1], then call CreateLoserTree to build the loser tree. Once the tree is built, the index of the smallest key (that is, the number of the merge segment it came from) is stored in ls[0]. Then loop:

2) Let q be the merge segment number stored in ls[0]. Output the current first element of segment q to the ordered output segment, read the next key of segment q into the leaf b[q] that held the previous element, and call Adjust to adjust the loser tree upward from leaf b[q] until the new smallest key has been selected; its index again ends up in ls[0]. Repeat this step until all elements have been written to the ordered output segment.

IV. Pseudo-code

void Adjust(int ls[], int s)
/* Adjust the loser tree along the path from leaf b[s] up to the root ls[0]. */
/* b[] and k (the number of merge segments) are assumed to be declared at file scope. */
{
    int t, temp;
    t = (s + k) / 2;             /* t is the index in ls of the parent of leaf b[s] */
    while (t > 0) {              /* continue until the root has been passed */
        if (b[s] > b[ls[t]]) {   /* compare with the record indicated by the parent node */
            /* ls[t] records the segment number of the loser; s becomes the new
               winner, which goes on to the comparison one level up */
            temp = s;
            s = ls[t];
            ls[t] = temp;
        }
        t = t / 2;               /* move up to the parent node, toward the root */
    }
    ls[0] = s;                   /* ls[0] records the segment that holds the minimum key */
}


void K_merge(int ls[])
/* ls[0]~ls[k-1] are the internal nodes of the loser tree; b[0]~b[k-1] hold the
   current record of each of the k initial merge segments. */
/* get_next(i) reads and returns the current record of merge segment i; it is
   assumed to return MAXKEY once the segment is exhausted. */
/* b[0..k] is shared with Adjust, so it is assumed to be declared at file scope
   instead of locally. */
{
    int i, q;
    for (i = 0; i < k; i++)
        b[i] = get_next(i);          /* read the first key of each of the k merge segments */
    b[k] = MINKEY;                   /* create the loser tree */
    for (i = 0; i < k; i++)          /* set the initial value of the loser in every ls node */
        ls[i] = k;
    for (i = k - 1; i >= 0; i--)     /* adjust the losers starting from b[k-1] down to b[0] */
        Adjust(ls, i);
    /* the loser tree is now built; the index of the minimum key is in ls[0] */
    while (b[ls[0]] != MAXKEY) {
        q = ls[0];                   /* q is the merge segment holding the current minimum key */
        printf("%d ", b[q]);
        b[q] = get_next(q);
        Adjust(ls, q);               /* adjust from leaf q to select the new minimum key */
    }
}

In detail: whenever two child nodes are compared, the loser is stored in their parent node, and the winner moves up to be compared at the parent's parent. That is why the structure is called a loser tree. ls[0] holds the overall winner.
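The loser-tree merge can also be written as self-contained C that compiles as it stands. The sketch below is one possible arrangement, not the only one: in-memory arrays stand in for the merge segments, K is fixed at 3, INT_MIN and INT_MAX serve as MINKEY and MAXKEY (so real keys must not equal them), and the Merger struct and its field names are invented for this example.

```c
#include <limits.h>
#include <stddef.h>

#define K 3                      /* number of segments merged at once (assumed) */
#define MINKEY INT_MIN           /* sentinel smaller than every real key */
#define MAXKEY INT_MAX           /* sentinel marking an exhausted segment */

typedef struct {
    const int *data[K];          /* the K sorted input segments */
    size_t len[K], pos[K];       /* segment lengths and read positions */
    int b[K + 1];                /* current key of each segment; b[K] = MINKEY */
    int ls[K];                   /* ls[0] = winner, ls[1..K-1] = losers */
} Merger;

/* Return the next key of segment i, or MAXKEY when it is exhausted. */
static int get_next(Merger *m, int i)
{
    return m->pos[i] < m->len[i] ? m->data[i][m->pos[i]++] : MAXKEY;
}

/* Replay the comparisons on the path from leaf b[s] up to the root. */
static void adjust(Merger *m, int s)
{
    for (int t = (s + K) / 2; t > 0; t /= 2) {
        if (m->b[s] > m->b[m->ls[t]]) {   /* s loses and stays at node t */
            int tmp = s;
            s = m->ls[t];
            m->ls[t] = tmp;
        }
    }
    m->ls[0] = s;                         /* overall winner */
}

/* Merge the K segments into out[]; returns the number of keys written. */
size_t k_merge(Merger *m, int *out)
{
    size_t n = 0;
    for (int i = 0; i < K; i++) {
        m->pos[i] = 0;
        m->b[i] = get_next(m, i);         /* first key of each segment */
    }
    m->b[K] = MINKEY;
    for (int i = 0; i < K; i++)
        m->ls[i] = K;                     /* every node starts at the MINKEY slot */
    for (int i = K - 1; i >= 0; i--)
        adjust(m, i);                     /* build the tree leaf by leaf */
    while (m->b[m->ls[0]] != MAXKEY) {
        int q = m->ls[0];                 /* segment holding the minimum key */
        out[n++] = m->b[q];
        m->b[q] = get_next(m, q);
        adjust(m, q);
    }
    return n;
}
```

To use it, fill in data and len for the K segments, provide an output array large enough for all keys, and call k_merge. Merging {1, 4, 7}, {2, 5, 8, 10} and {3} produces 1 2 3 4 5 7 8 10. In a real external sort, get_next would read from the segment files and the output would be written to a file instead of an array.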

V. Summary

Finally, here is a rough description of the whole process of external sorting with multiway merge sort. Based on the limited memory available, the large file is divided into a number of segments. The segments are read into memory one at a time, each is sorted with an efficient internal sorting algorithm, and the ordered result is written straight back to external storage as an initial ordered merge segment. When choosing the internal sorting algorithm, the auxiliary space it needs must be weighed against the limited memory, and this determines how many segments the large file is divided into. Next, an appropriate number of ways k is chosen and the initial merge segments are merged k at a time: each merge combines k merge segments into one larger merge segment that is written to a file, and after several passes the whole file is ordered.

During a multiway merge, memory only has to hold a loser tree of size 2k, and every record taken in or put out corresponds to a read or write on external storage. Reading a large block of data into memory at once, and writing a large block from memory to the file at once, would save time by comparison. I did not know whether this has to be programmed explicitly or whether the OS can do it directly through the virtual-memory page file. After going back to a computer organization textbook, I think virtual-memory paging is completely irrelevant to this problem: segmented-paged virtual memory manages the logical address space of the program, and the file to be sorted is not part of the program's own logical address space.

In fact, this issue should be considered in terms of the cache provided by the disk itself. Disks today generally have a cache of a few MB to more than 10 MB. The cache exploits the spatial and temporal locality of data access with a read-ahead strategy: a block of data is read into the cache at once, and later reads and writes first check whether they hit the cache; on a hit, the disk itself does not have to be accessed. If the cache is not large enough to improve the read/write rate, the programmer has to write code that reads the data in large blocks.
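Reading in large blocks under program control might look like the sketch below. It assumes each merge segment is a binary file of int records; the RunReader type, its functions and the block size of 4096 records are made up for illustration. Note that C stdio streams are normally buffered already (the buffer can be configured with setvbuf), so an explicit buffer like this mainly gives direct control over the block size per segment.

```c
#include <stdio.h>

#define BLOCK_RECORDS 4096       /* records fetched per read (assumed size) */

typedef struct {
    FILE  *fp;                   /* segment file, opened in binary mode */
    int    buf[BLOCK_RECORDS];   /* one block of records held in memory */
    size_t count, next;          /* records in buf, index of next unread one */
} RunReader;

void run_reader_init(RunReader *r, FILE *fp)
{
    r->fp = fp;
    r->count = r->next = 0;
}

/* Store the next record in *out and return 1, or return 0 at the end of
 * the segment. The file is read only once per BLOCK_RECORDS records. */
int run_reader_next(RunReader *r, int *out)
{
    if (r->next == r->count) {   /* buffer used up: refill with one fread */
        r->count = fread(r->buf, sizeof r->buf[0], BLOCK_RECORDS, r->fp);
        r->next = 0;
        if (r->count == 0)
            return 0;
    }
    *out = r->buf[r->next++];
    return 1;
}
```

In a k-way merge there would be one such reader per merge segment, and the get_next routine of the loser tree would call run_reader_next on the reader of segment q.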
