Sorting (Part 2): key-indexed counting, bucket sort, bitmap, loser tree (illustrated explanation of the loser tree)

Source: Internet
Author: User
Tags: sorting

Sorting (Part 2)



The sorting algorithms discussed so far share one property: the final sorted order is determined only by comparisons between elements. We call such algorithms comparison sorts.

No matter how clever the implementation, the lower bound on the time complexity of any comparison sort is n lg n.

The sorting algorithms below determine the sorted order using arithmetic rather than comparisons, so the n lg n lower bound does not apply to them.

Key-indexed counting (counting sort)


Counting sort assumes that each of the n input elements is an integer in the range 0 to k, for some integer k.

Idea: for each input element x, determine the number of elements less than x. With this information, x can be placed directly into its position in the output array.

For example:

Students are divided into several groups, numbered 1, 2, 3, 4, and so on; in some situations we want to sort the whole class by group number.



1. Frequency statistics:


The first step uses an int array count[] to count how often each key appears.

For each element in the array, use its key to index into count[] and add 1; that is, if the key value is r, increment count[r+1]. (Why r+1 rather than r? Explained below.)

for (i = 0; i < n; i++)
    count[a[i].key() + 1]++;

count[0..5]: 0 0 3 5 6 6

2. Convert frequencies to indices:


Next, we use count[] to compute the starting index of each key in the sorted result.

In this example, because there are 3 students in group 1 and 5 students in group 2, the students of group 3 start at position 8 (3 + 5) in the sorted array.

For each key value r, the total frequency of keys less than r+1 equals the total frequency of keys less than r plus the frequency of key r, so it is easy to transform count[] from left to right into an index table for sorting.

for (int r = 0; r < R; r++)
    count[r + 1] += count[r];

count[0..5]: 0 0 3 8 14 20

3. Distribute the data:


After count[] has been converted into an index table, each element (student) is moved into an auxiliary array aux[] in sorted order. The position of each element in aux[] is determined by the count[] entry for its key (group), and that entry is incremented after the move, so that count[r] is always the index in aux[] where the next element with key r belongs. This produces the sorted result with a single pass over the data.

(The stability of this implementation is critical: elements with equal keys end up grouped together after sorting, and their relative order is unchanged.)

for (int i = 0; i < n; i++)
    aux[count[a[i].key()]++] = a[i];

4. Write Back:


The elements are sorted as we move them into the auxiliary array, so the final step is simply to copy the sorted result back into the original array.

for (int i = 0; i < n; i++)
    a[i] = aux[i];

Features: key-indexed counting is an efficient and often overlooked sorting method for keys that are small integers.

Key-indexed counting uses no comparisons; whenever the key range R is within a constant factor of n, it is a linear-time sorting method.
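Putting the four steps together, here is a minimal self-contained sketch in C++ (purely for illustration it assumes the keys are small non-negative ints stored directly in the array, rather than objects with a key() method as in the fragments above):

#include <cstdio>
#include <vector>

// Key-indexed counting, assuming every key is an int in [0, R).
void keyIndexedCounting(std::vector<int>& a, int R) {
    int n = a.size();
    std::vector<int> count(R + 1, 0), aux(n);
    for (int i = 0; i < n; i++) count[a[i] + 1]++;          // 1. frequency statistics
    for (int r = 0; r < R; r++)  count[r + 1] += count[r];  // 2. frequencies -> start indices
    for (int i = 0; i < n; i++)  aux[count[a[i]]++] = a[i]; // 3. distribute into aux (stable)
    for (int i = 0; i < n; i++)  a[i] = aux[i];             // 4. write back
}

int main() {
    std::vector<int> groups = {3, 1, 2, 3, 2, 1, 4, 2, 3, 4}; // student group numbers
    keyIndexedCounting(groups, 5);
    for (int g : groups) printf("%d ", g);                    // prints: 1 1 2 2 2 3 3 3 4 4
    printf("\n");
}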

Radix Sort


Sometimes we need to sort strings that all have the same length.

Such situations arise frequently in sorting applications: phone numbers, bank account numbers, IP addresses, and so on are typical fixed-length strings.

Sorting such strings can be done by least-significant-digit-first (LSD) string sorting. Assuming every string has length W, sort the strings W times, from right to left, using the character at each position as the key and key-indexed counting (or insertion sort) for each pass.

(To ensure that radix sort is correct, the per-digit sorting algorithm must be stable, for example counting sort or insertion sort.)
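As a rough sketch of this procedure (assuming all strings have the same length W and single-byte characters; lsdSort is just a name chosen for this example):

#include <cstdio>
#include <string>
#include <vector>

// LSD (least-significant-digit-first) radix sort for fixed-length strings.
void lsdSort(std::vector<std::string>& a, int W) {
    const int R = 256;                          // extended ASCII alphabet
    int n = a.size();
    std::vector<std::string> aux(n);
    for (int d = W - 1; d >= 0; d--) {          // sort on character d, right to left
        std::vector<int> count(R + 1, 0);       // key-indexed counting on position d
        for (int i = 0; i < n; i++) count[(unsigned char)a[i][d] + 1]++;
        for (int r = 0; r < R; r++)  count[r + 1] += count[r];
        for (int i = 0; i < n; i++)  aux[count[(unsigned char)a[i][d]]++] = a[i];
        for (int i = 0; i < n; i++)  a[i] = aux[i];
    }
}

int main() {
    std::vector<std::string> plates = {"4PGC938", "2IYE230", "3CIO720", "1ICK750"};
    lsdSort(plates, 7);
    for (auto& s : plates) printf("%s\n", s.c_str());
}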

Features: is radix sort better than a comparison-based sorting algorithm such as quicksort?

The time complexity of radix sort is linear, O(n), which looks better than quicksort's expected running time of O(n lg n); however, the constants hidden in these two expressions differ.

When processing n keys, radix sort makes fewer passes than quicksort, but each pass takes much longer, and quicksort usually makes more effective use of the hardware cache than radix sort.

In addition, radix sort that uses counting sort as its intermediate stable sort is not in-place, whereas many of the O(n lg n) comparison sorts are. So when main memory is at a premium, we may prefer an in-place algorithm such as quicksort.

Bucket Sort


Bucket sort assumes that the input data are drawn independently and uniformly from the interval [0, m). Under that assumption its average time cost is O(n).

Idea: bucket sort divides the interval [0, m) into n equal-sized sub-intervals, or buckets.

Each of the n input numbers is then placed into its bucket.

Because the input data are uniformly and independently distributed across [0, m), it is unlikely that a large fraction of them fall into the same bucket.

To produce the output, we sort the numbers within each bucket and then visit the buckets in order, listing the elements of each bucket.

(Bucket sort also needs a temporary array b[0..n-1] to hold the linked lists, i.e. the buckets, plus some mechanism for maintaining these lists.)

(This is a bit like separate chaining in a hash table.)
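A minimal sketch of this idea in C++, using one vector per bucket in place of linked lists (the function name bucketSort and the uniform-on-[0, m) assumption are only for illustration):

#include <algorithm>
#include <cstdio>
#include <vector>

// Bucket sort for doubles assumed to be uniformly distributed in [0, m).
void bucketSort(std::vector<double>& a, double m) {
    int n = a.size();
    std::vector<std::vector<double>> buckets(n);     // n equal-width buckets over [0, m)
    for (double x : a)
        buckets[(int)(x / m * n)].push_back(x);      // place x into its bucket
    int idx = 0;
    for (auto& bucket : buckets) {
        std::sort(bucket.begin(), bucket.end());     // each bucket is small on average
        for (double x : bucket) a[idx++] = x;        // concatenate buckets in order
    }
}

int main() {
    std::vector<double> a = {0.78, 0.17, 0.39, 0.26, 0.72, 0.94, 0.21, 0.12, 0.23, 0.68};
    bucketSort(a, 1.0);
    for (double x : a) printf("%.2f ", x);
    printf("\n");
}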




Bitmap


Idea: represent a numeric value by the position (index) of a bit.

That is, it is like using an array subscript to represent a value, except that to save memory we use a single bit position to mark each number.

For example, we can store the set {1, 2, 3, 5, 8, 13} in the bit string 0 1 1 1 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0: the bits at the positions of the numbers in the set are 1, and all other bits are 0.

Features: the bitmap method applies to problems where (a situation that is not common in sorting problems):

the input range is relatively small, the input contains no duplicates, and no additional data is associated with each value.

"Application Examples"

Consider a problem: sorting a disk file. (The problem is described in detail below.)

Input:

The input is a file containing at most n distinct positive integers, each less than n, where n = 10^7. No other data is associated with these integers.

(That is, sort these integers only)

Output:

Output the integers as a sorted list.

Constraints:

At most 1 MB of main memory is available, but there is plenty of free disk space. An execution time of about 10 seconds is acceptable.

Faced with a disk-file sort, we first think of the classic multiway merge sort.

(I'll talk about it later)

With 32-bit integers, roughly 250,000 numbers fit in 1 MB of memory, so the program makes 40 passes over the input file. On the first pass it reads the integers between 0 and 249,999 into memory, sorts those (at most) 250,000 integers, and writes them to an output file.

The second pass sorts the integers from 250,000 to 499,999, and so on, until the 40th pass sorts the integers between 9,750,000 and 9,999,999. In memory we use quicksort, then merge the sorted runs to obtain the overall sorted order.

However, this approach is inefficient: it reads the input file 40 times and also pays the I/O cost of the external merge.

How can we reduce the number of I/O operations and improve the program's efficiency? Could we read all 10 million numbers into memory at once?

Using a bitmap, we represent the file with a string of 10 million bits, in which bit i is turned on (set to 1) if and only if the integer i appears in the file.

Given a bitmap data structure representing the set of integers in the file, the program divides naturally into three phases. The first phase turns all bits off, initializing the set to empty.

The second phase reads each integer in the file and turns on the corresponding bit, building the set.

The third phase examines each bit in order; if a bit is 1, it writes the corresponding integer, producing the sorted output file.
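A compact sketch of the three phases in C++ (the file names are placeholders and std::vector<bool> stands in for the bit string; note that 10^7 bits is about 1.25 MB, so meeting a strict 1 MB budget would actually require splitting the range into two passes, a detail this sketch ignores):

#include <cstdio>
#include <vector>

int main() {
    const int N = 10000000;                    // assumed upper bound on the integers
    std::vector<bool> bits(N, false);          // phase 1: all bits off (empty set)

    FILE* in = fopen("input.txt", "r");        // placeholder input file name
    if (!in) return 1;
    int x;
    while (fscanf(in, "%d", &x) == 1)
        bits[x] = true;                        // phase 2: turn on bit i for each integer i
    fclose(in);

    FILE* out = fopen("sorted.txt", "w");      // placeholder output file name
    for (int i = 0; i < N; i++)
        if (bits[i]) fprintf(out, "%d\n", i);  // phase 3: scan bits, emit set positions in order
    fclose(out);
}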

Summary of internal sorting methods


Stability


If a sorting algorithm preserves the relative order of equal elements in the array, it is called stable.

This property is important in many situations.

(For example:

Consider an Internet commerce application that must process a large number of events, each carrying a geographic location and a timestamp.

We first store them in an array in the order the events occur, so the array is sorted chronologically.

Now we sort by geographic location; if the sorting algorithm is not stable, the transactions within each city may no longer be in chronological order.)

Algorithm              Stable?
Selection sort         no
Insertion sort         yes
Shellsort              no
Quicksort              no
Three-way quicksort    no
Merge sort             yes
Heapsort               no
Key-indexed counting   yes
Radix sort             yes

Quicksort is the fastest general-purpose sorting algorithm.

Quicksort is fast because its inner loop has very few instructions (and it exploits the cache well, since it accesses the data sequentially), so its running time grows as ~ c n lg n, where c is smaller than the corresponding constant of the other linearithmic algorithms.

Moreover, with three-way partitioning, quicksort becomes linear for certain key distributions that arise in practice, while the other sorting algorithms still need linearithmic time.

If stability is important and space is not an issue, merge sort may be the best choice.

External Sort


(Why do we need external sorting at all? Why not organize the data in a suitable data structure as it is inserted, so that it is already convenient to search and keep in order? This is like the case of a static search tree: that option is not always available or useful.)

External sorting basically consists of two relatively independent stages.

First, according to the size of available memory, the file of n records on external storage is divided into sub-files of length L. These are read into memory one at a time, sorted with an efficient internal sorting method, and the sorted sub-files are written back to external storage. These sorted sub-files are usually called merge segments (runs).

Then the runs are merged with one another, pass by pass, so that the runs grow from short to long until the whole file is sorted.

"Example" if you have a file that contains 10,000 records. First, 10 initial merge segments are r1~r10 by 10 internal sorting. Each section contains 1000 records. Then they were 22 merged. Until an ordered file is obtained.


(figure: 2-way balanced merge of the 10 initial runs)


Each pass merges m runs into ⌈m/2⌉ runs.

This method of merging is called 2-way balanced merging.

If the 10 initial runs from the example above are merged with a 5-way balanced merge, only two passes are needed, and the total number of external I/O reads and writes drops significantly.

(figure: 5-way balanced merge of the 10 initial runs, two passes)


In general, when m initial runs are merged with a k-way balanced merge, the number of merge passes is s = ⌈log_k m⌉.

Clearly, increasing k or decreasing m reduces s.

With a straightforward merge, selecting each record to move into the output run requires k−1 comparisons, so producing a run of u records takes (u−1)(k−1) comparisons.

The total comparison time spent in the internal merge process is therefore:

⌈log_k m⌉ (k−1)(u−1) t_mg = (⌈log_2 m⌉ / ⌈log_2 k⌉)(k−1)(u−1) t_mg

where t_mg is the time of one key comparison.

Therefore, simply increasing k increases the internal merge time, which offsets the benefit of fewer external reads and writes gained by increasing k.

However, if a loser tree (tree of losers) is used in the k-way merge, selecting the record with the smallest key among the k candidates needs only ⌈log_2 k⌉ comparisons.

The total merge time then becomes ⌈log_2 m⌉(u−1) t_mg. This expression is independent of k: it no longer grows as k grows.
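To make the trade-off concrete (numbers invented purely for illustration): with m = 16 initial runs, a plain 2-way merge costs about ⌈log_2 16⌉(2−1)(u−1) t_mg = 4(u−1) t_mg of comparison time over 4 passes. Raising k to 16 cuts the passes to ⌈log_16 16⌉ = 1, but each selection now needs 15 comparisons, giving about 15(u−1) t_mg, so the internal work grows even though the I/O shrinks. With a loser tree, the 16-way merge needs only about ⌈log_2 16⌉(u−1) t_mg = 4(u−1) t_mg, the same comparison work as the 2-way merge, while keeping the single I/O pass.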


Loser Tree


It is a variant of the tree selection sort.

Each non-terminal node represents the "loser" in its left and right child nodes.

The winner moves up to play at the next level, and the result is a "loser tree" (the so-called "winner" at the top is the element we want to select).

Take as an example a loser tree for a 5-way merge (k = 5):

The array ls[0..k-1] holds the non-leaf nodes of the loser tree. The root entry ls[0] records the "champion", while the other entries record the index of the "loser" of the match between their two subtrees. b[0..k-1] holds the element of each merge sequence that is currently taking part in the comparison.

Apart from ls[0], the elements of ls[] form a complete binary tree.

(figure: loser tree for a 5-way merge)

How do the internal nodes of ls[] correspond to the leaves b[]?

The parent of leaf node b[x] is ls[(x+k)/2] (integer division).
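(For instance, with k = 5 and integer division: b[0] maps to ls[(0+5)/2] = ls[2]; b[1] and b[2] both map to ls[3]; b[3] and b[4] both map to ls[4].)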



Building the loser tree:


1. Initialize the loser tree: make every entry of ls[0..k-1] point to MINKEY (the smallest possible value, the "absolute winner").

(figure: loser tree after initialization)

We set b[k] = MINKEY. The ls[] entries record index values into the b array, so each ls[i] is initially k, i.e. 5 here.

2. Starting from each leaf node and moving upward, adjust the loser tree.

The current winner s (initially the leaf's index) is compared with the node referenced by its parent: the loser (the larger key) stays in the parent node, and the winner is kept in s and continues upward. (Each match decides a loser, which is recorded, while the winner moves up.)

For leaf node b[4], the result of the adjustment is as follows:

(figure: loser tree after adjusting from leaf b[4])

For leaf node b[3], the results of the adjustment are as follows

(figure: loser tree after adjusting from leaf b[3])

Similarly, for leaf node b[2], the results of the adjustment are as follows



Similarly, for leaf node b[1], the results of the adjustment are as follows

(figure: loser tree after adjusting from leaf b[1])


Similarly, for leaf node b[0], the result of the adjustment is as follows:



void CreateLoserTree(LoserTree &ls) {
    b[k].key = MINKEY;               // b[k] holds the "absolute winner"
    for (i = 0; i < k; ++i)          // set the initial value of every "loser" in ls
        ls[i] = k;
    for (i = k - 1; i >= 0; --i)     // adjust from b[k-1] ... b[0]
        Adjust(ls, i);
}

void Adjust(LoserTree &ls, int m) {
    // Adjust the loser tree along the path from leaf b[m] to the root ls[0].
    for (i = (m + k) / 2; i > 0; i = i / 2) {   // ls[i] is the parent of b[m]
        if (b[m].key > b[ls[i]].key)
            Exch(m, ls[i]);          // the loser stays in ls[i]; m keeps the new winner's index
    }
    ls[0] = m;                       // the overall winner
}
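The code above relies on globals (b, k, MINKEY, Exch). As a self-contained complement, here is a runnable C++ sketch of a k-way merge driven by the same loser-tree layout; the names LoserTree and kWayMerge and the use of LONG_MAX as an "exhausted run" sentinel are choices made for this example, not part of the original:

#include <climits>
#include <cstdio>
#include <utility>
#include <vector>

struct LoserTree {
    int k;
    std::vector<int> ls;    // ls[0]: overall winner; ls[1..k-1]: losers (indices into b)
    std::vector<long> b;    // b[0..k-1]: current head of each run; b[k] = MINKEY sentinel

    explicit LoserTree(int k) : k(k), ls(k, k), b(k + 1, 0) { b[k] = LONG_MIN; }

    // Replay the matches along the path from leaf b[m] up to the root ls[0].
    void adjust(int m) {
        for (int i = (m + k) / 2; i > 0; i /= 2) {      // ls[i] is an ancestor of b[m]
            if (b[m] > b[ls[i]]) std::swap(m, ls[i]);   // loser stays in ls[i], winner moves up
        }
        ls[0] = m;
    }

    void build() {
        for (int i = 0; i < k; ++i) ls[i] = k;          // everyone initially "loses" to MINKEY
        for (int i = k - 1; i >= 0; --i) adjust(i);
    }
};

// Merge k sorted runs into one sorted sequence using the loser tree.
std::vector<long> kWayMerge(const std::vector<std::vector<long>>& runs) {
    int k = runs.size();
    const long MAXKEY = LONG_MAX;                       // fed in when a run is exhausted
    LoserTree t(k);
    std::vector<int> pos(k, 0);
    for (int i = 0; i < k; ++i)
        t.b[i] = runs[i].empty() ? MAXKEY : runs[i][0];
    t.build();

    std::vector<long> out;
    while (t.b[t.ls[0]] != MAXKEY) {                    // stop once every run is exhausted
        int w = t.ls[0];                                // run currently holding the smallest key
        out.push_back(t.b[w]);
        ++pos[w];
        t.b[w] = pos[w] < (int)runs[w].size() ? runs[w][pos[w]] : MAXKEY;
        t.adjust(w);                                    // only about log2(k) comparisons
    }
    return out;
}

int main() {
    std::vector<std::vector<long>> runs = {
        {10, 15, 16}, {9, 18, 20}, {20, 22, 40}, {6, 15, 25}, {8, 15, 58}};
    for (long x : kWayMerge(runs)) printf("%ld ", x);
    printf("\n");
}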

PostScript

To select the smallest of n numbers, why use a loser tree?

Our first thought is a priority queue (heap), but for this kind of multiway merging it is not the most efficient choice.

Heap: the pending elements are stored in all of the tree's nodes (both leaf and non-leaf nodes).

Loser tree: the pending elements sit in the leaf nodes, while the non-leaf nodes record the results of the matches between their subtrees.

In a heap, a node does not correspond to a fixed merge sequence. Once the extreme value is removed, the next head element from that sequence is inserted and the heap must be re-adjusted, so previous comparison results cannot be reused directly.

In the loser tree, each leaf node is bound to one merge sequence. So when the head element of a sequence is selected, the next element of that sequence can be placed directly into the same leaf, and only the path from that leaf up the tree needs to be compared again.

Summary: the heap is suited to selecting the extreme value when insertions arrive in no particular pattern.

The loser tree is suited to selecting the extreme value when the inserted elements come from multiple sorted sequences.

