Use "merge" to improve "quicksort"

Last Update:2018-12-04 Source: Internet

Author: User

[Time: 2003-11-2 Source: plainsong]

Sorting and searching are the two most commonly used algorithms in programming. C++ programmers are lucky, because the C++ standard library provides a variety of general-purpose sorting functions. Delphi programmers have only TList.Sort and TStringList.Sort available, so they usually add sorting functions to their own utility units.

A general-purpose sorting function naturally has to be chosen from the comparison-based sorting algorithms, and quicksort is often the programmer's first choice because of its efficiency. Quicksort has one well-known weakness, however: in the worst case its time complexity is O(N^2). Pessimistic programmers therefore tend to use heapsort instead (in the C++ standard library, partial_sort is often recommended for this reason, because sort may be implemented with quicksort, while partial_sort is usually implemented with heapsort).

To have it both ways, many programmers have tried to improve quicksort so that it achieves tolerable performance in the worst case without sacrificing its excellent average performance. The most successful attempt is probably introsort; here I describe another method.

We know that the main idea of quicksort is to partition the sequence recursively, using a pivot value to split it into a "large" subsequence and a "small" subsequence. If every partition splits the sequence into two equal parts, quicksort achieves its best performance. If every partition is a "malformed partition" that splits the sequence into two parts of lengths 1 and N-1, quicksort degenerates into a recursive implementation of bubble sort, and the performance can be imagined. To improve on this, we have to deal with the malformed case. One idea is to improve the partitioning so that it does not produce malformed partitions; the other is to check the result of each partition and apply special handling if it is malformed. This article takes the second approach.
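To put a number on the malformed case: assuming, purely for illustration, that partitioning k elements costs about k - 1 comparisons, a run in which every partition splits off a single element costs in total

```latex
C(N) = (N-1) + (N-2) + \dots + 1 = \frac{N(N-1)}{2} = O(N^2)
```

For the 500,000-element inputs used in the tests later in this article, that would be on the order of 1.25 x 10^11 comparisons, against roughly 10^7 for a well-balanced run.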

The basic idea is to check the lengths of the two subsequences produced by each partition. If the ratio of the longer subsequence to the shorter one exceeds a certain limit, the partition is considered malformed: we continue to use quicksort on the shorter subsequence, while the longer subsequence is split into two halves that are sorted separately and then merged. Merging two ordered sequences can be done in linear time, so a time complexity of O(N log N) is still obtained even when every partition is malformed.

The basic algorithm is described as follows:

procedure Sort(Data, Size);
begin
  if Size <= 1 then Exit;          { nothing left to sort }
  M := Partition(Data, Size);
  MoreData := @Data[M + 1];
  MoreSize := Size - M - 1;
  Size := M;
  if Size > MoreSize then          { make Data the shorter part }
  begin
    Swap(MoreData, Data);
    Swap(MoreSize, Size);
  end;
  Sort(Data, Size);
  if MoreSize div MAX_RATIO > Size then
  begin                            { malformed partition: halve, sort, merge }
    Sort(MoreData, MoreSize div 2);
    Sort(@MoreData[MoreSize div 2], MoreSize - MoreSize div 2);
    Merge(MoreData, MoreSize div 2, MoreSize);
  end
  else
    Sort(MoreData, MoreSize);
end;

Partition is the well-known partitioning subroutine of quicksort. Merge(Data, First, Size) merges the two ordered sequences [0, First) and [First, Size) in Data into a single ordered sequence and stores it back in Data. The algorithm above assumes that the position M returned by Partition holds the pivot value, so that the sequence is divided into [0, M-1], [M, M] and [M+1, Size-1]. If the implementation of Partition cannot guarantee this, MoreData should be @Data[M] and MoreSize should be Size - M.
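The description above can be tried out with a small self-contained program. This is only a sketch under simplifying assumptions that are not in the original: it sorts a global array of Integer instead of a pointer list, uses a deliberately naive first-element Partition so that malformed partitions are easy to provoke, merges through a second global buffer, and counts comparisons in a helper called Less. All of these names are illustrative.

```pascal
program MergeImprovedQuickSort;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

const
  N = 1000;
  MAX_RATIO = 8;

var
  A, Buf: array[0..N - 1] of Integer;
  Comparisons: Integer;

{ Counting comparison, so the effect of the merge step is visible. }
function Less(X, Y: Integer): Boolean;
begin
  Inc(Comparisons);
  Result := X < Y;
end;

{ Deliberately naive partition of A[Lo .. Lo + Size - 1]: the pivot is always
  the first element. Returns the final index of the pivot. }
function Partition(Lo, Size: Integer): Integer;
var
  I, Last, T: Integer;
begin
  Last := Lo;
  for I := Lo + 1 to Lo + Size - 1 do
    if Less(A[I], A[Lo]) then
    begin
      Inc(Last);
      T := A[I]; A[I] := A[Last]; A[Last] := T;
    end;
  T := A[Lo]; A[Lo] := A[Last]; A[Last] := T;
  Result := Last;
end;

{ Merges the sorted runs A[Lo .. Mid - 1] and A[Mid .. Hi - 1] through Buf. }
procedure Merge(Lo, Mid, Hi: Integer);
var
  I, J, K: Integer;
begin
  for K := Lo to Hi - 1 do
    Buf[K] := A[K];
  I := Lo;
  J := Mid;
  for K := Lo to Hi - 1 do
    if (J >= Hi) or ((I < Mid) and not Less(Buf[J], Buf[I])) then
    begin
      A[K] := Buf[I]; Inc(I);
    end
    else
    begin
      A[K] := Buf[J]; Inc(J);
    end;
end;

procedure Sort(Lo, Size: Integer);
var
  M, SmallLo, SmallSize, BigLo, BigSize, Half: Integer;
begin
  if Size < 2 then Exit;
  M := Partition(Lo, Size);
  { The parts are [Lo, M) and (M, Lo + Size); the pivot stays at M. }
  SmallLo := Lo;      SmallSize := M - Lo;
  BigLo := M + 1;     BigSize := Lo + Size - M - 1;
  if SmallSize > BigSize then
  begin
    SmallLo := M + 1; SmallSize := Lo + Size - M - 1;
    BigLo := Lo;      BigSize := M - Lo;
  end;
  Sort(SmallLo, SmallSize);
  if BigSize div MAX_RATIO > SmallSize then
  begin
    { Malformed partition: halve the longer part, sort both halves, merge. }
    Half := BigSize div 2;
    Sort(BigLo, Half);
    Sort(BigLo + Half, BigSize - Half);
    Merge(BigLo, BigLo + Half, BigLo + BigSize);
  end
  else
    Sort(BigLo, BigSize);
end;

var
  I: Integer;
  Ok: Boolean;
begin
  { Reverse-sorted input: a worst case for plain quicksort with this partition. }
  for I := 0 to N - 1 do
    A[I] := N - 1 - I;
  Comparisons := 0;
  Sort(0, N);
  Ok := True;
  for I := 1 to N - 1 do
    if A[I - 1] > A[I] then
      Ok := False;
  Writeln('sorted: ', Ok, ', comparisons: ', Comparisons);
  Writeln('plain quicksort with this partition: ', N * (N - 1) div 2);
end.
```

With this input a plain quicksort built on the same Partition would peel off one element per call and need N * (N - 1) / 2 comparisons; the count printed for the merge-assisted version should come out far lower.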

Now a brief analysis of this sort. In the best case, every partition is even and no merge ever happens, so the algorithm is equivalent to quicksort. In the worst case, every partition splits the sequence into two parts of lengths 1 and N-1; the algorithm then becomes a recursive implementation of two-way merge sort, except that about N comparisons are performed before each merge in order to split off a single element. The time complexity of this sort is therefore still O(N log N), and the number of comparisons is about twice that of merge sort.
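The "about twice" estimate can be checked with a rough recurrence. As a simplification, assume that both the partition step and the merge step cost about N comparisons, and ignore lower-order terms and the one element split off at each level:

```latex
\begin{aligned}
T_{\text{merge sort}}(N) &\approx 2\,T_{\text{merge sort}}(N/2) + N
  &&\Rightarrow\quad T_{\text{merge sort}}(N) \approx N \log_2 N \\
T_{\text{worst}}(N) &\approx \underbrace{N}_{\text{partition}} + 2\,T_{\text{worst}}(N/2) + \underbrace{N}_{\text{merge}}
  &&\Rightarrow\quad T_{\text{worst}}(N) \approx 2N \log_2 N
\end{aligned}
```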

As for the choice of MAX_RATIO, in my experience a value between 4 and 16 works well. I chose 8.
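Because the check uses integer division, the threshold is slightly different from a plain "more than 8 times" test. The following sketch isolates the test in a helper, IsMalformed, which is a hypothetical name introduced here for illustration:

```pascal
program RatioCheck;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

const
  MAX_RATIO = 8;

{ Same integer-division test as in the algorithm description.
  It is equivalent to LongSize >= MAX_RATIO * (ShortSize + 1). }
function IsMalformed(ShortSize, LongSize: Integer): Boolean;
begin
  Result := LongSize div MAX_RATIO > ShortSize;
end;

begin
  Writeln(IsMalformed(10, 90));  { 90 div 8 = 11 > 10: prints TRUE }
  Writeln(IsMalformed(12, 88));  { 88 div 8 = 11 is not > 12: prints FALSE }
  Writeln(IsMalformed(0, 8));    { 8 div 8 = 1 > 0: prints TRUE }
  Writeln(IsMalformed(0, 7));    { 7 div 8 = 0 is not > 0: prints FALSE }
end.
```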

The actual implementation is somewhat more complicated than the description above. First, insertion sort is used when the sequence length is not greater than 16; second, one of the recursive calls is removed. I then ran some efficiency tests. The data consists of 500,000 strings of length 10, and the results are roughly as follows ("Sort" is the algorithm described in this article):

Random distribution, 500000 elements:
    Algorithm    Time    Comparisons
    Sort         3.31    10843175
    Quicksort    3.28    10936357
    Introsort    3.35    10958355
    Mergesort    4.20    13502620

Forward (sorted) distribution, 500000 elements:
    Algorithm    Time    Comparisons
    Sort         1.71    8401712
    Quicksort    1.91    9262161
    Introsort    1.80    8401712
    Mergesort    1.72    4766525

Backward (reversed) distribution, 500000 elements:
    Algorithm    Time    Comparisons
    Sort         2.38    11737937
    Quicksort    2.54    12619014
    Introsort    2.38    11293745
    Mergesort    1.69    4192495

All elements with the same value, 500000 elements:
    Algorithm    Time    Comparisons
    Sort         1.41    8401712
    Quicksort    1.47    9262161
    Introsort    1.40    8401712
    Mergesort    1.43    4766525

Waveform distribution (wavelength 1000), 500000 elements:
    Algorithm    Time    Comparisons
    Sort         2.52    10658948
    Quicksort    2.97    12971845
    Introsort    3.02    12672744
    Mergesort    2.71    7978745

Peak distribution (the first half is in forward order, the second half is the reverse of the first half), 500000 elements:
    Algorithm    Time    Comparisons
    Sort         2.42    10401407
    Introsort    5.13    19211813
    Mergesort    1.88    5176855

Valley distribution (the reverse of the peak distribution), 500000 elements:
    Algorithm    Time    Comparisons
    Sort         2.29    10944792
    Introsort    5.29    17898801
    Mergesort    1.90    5282136

Because the worst-case distribution for this sort is not easy to find, I modified the partition function so that forward (sorted) input becomes its unfavorable distribution, and then ran the test again on the same machine:

Forward (sorted) distribution, 500000 elements:
    Algorithm    Time    Comparisons
    Sort         2.77    12011738
    Introsort    4.31    19212426
    Mergesort    1.73    4766525

According to my analysis, in this case the sort consists of two parts, a quicksort part and a mergesort part, and the time spent in the two parts is independent. So I subtracted Mergesort's time from the time measured above and then added Mergesort's time on the random distribution (4.20); the result is 5.24.
Judging from this result, the worst case of this sort is roughly on a par with the worst case of Introsort.
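Written out with the figures from the tables above (times in the units reported there), the estimate is:

```latex
\underbrace{2.77}_{\text{Sort, modified partition}}
- \underbrace{1.73}_{\text{Mergesort, forward}}
+ \underbrace{4.20}_{\text{Mergesort, random}}
= 5.24
```

For comparison, the slowest Introsort measurements above are 5.13 (peak distribution) and 5.29 (valley distribution).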

Here I would like to thank leemars and zhanyv of CSDN for their great help. They discussed improvements to sorting with me on the forum, and leemars provided me with an excellent partition function.

The complete implementation code is as follows:

type
  TPointerList = array [0..32767] of Pointer;
  PPointerList = ^TPointerList;
  PPointer = ^Pointer;
  TLessThenProc = function(Left, Right: Pointer): Boolean;

const
  SORT_MAX = 16;
  MAX_RATIO = 8;

{ Swap is called by the routines below but is not defined in this listing.
  The two overloads here are an assumed minimal version, added only so that
  the code can compile on its own. }
procedure Swap(var A, B: Pointer); overload;
var
  T: Pointer;
begin
  T := A; A := B; B := T;
end;

procedure Swap(var A, B: Integer); overload;
var
  T: Integer;
begin
  T := A; A := B; B := T;
end;

{**************************************************************************
  Function: Partition
  Purpose: divides a sequence into two subsequences such that no value in the
    second subsequence is less than any value in the first, and returns the
    index of the pivot that separates them.
  Parameters:
    Data: PPointerList, the source sequence.
    Size: Integer, the sequence length.
    LessThen: TLessThenProc, the comparison function that defines the order.
  Notes:
    Used by quicksort.
    Pivot strategy: the median of the elements at positions 0, Size div 2
    and Size - 1.
    Guarantee on the return value:
      for A < Result: not LessThen(Data[Result], Data[A]);
      for A > Result: not LessThen(Data[A], Data[Result]).
**************************************************************************}
function Partition(Data: PPointerList; Size: Integer; LessThen: TLessThenProc): Integer;
var
  M: Integer;
  Value: Pointer;
begin
  M := Size div 2;
  Dec(Size);
  { Move the median of Data[0], Data[M] and Data[Size] into Data[0]. }
  if LessThen(Data[0], Data[M]) then
    Swap(Data[M], Data[0]);
  if LessThen(Data[Size], Data[0]) then
    Swap(Data[Size], Data[0]);
  if LessThen(Data[0], Data[M]) then
    Swap(Data[M], Data[0]);
  Value := Data[0];
  Result := 0;
  while Result < Size do
  begin
    { Skip values greater than the pivot on the right. }
    while (Result < Size) and LessThen(Value, Data[Size]) do
      Dec(Size);
    if Result < Size then
    begin
      Data[Result] := Data[Size];
      Inc(Result);
    end;
    { Skip values less than the pivot on the left. }
    while (Result < Size) and LessThen(Data[Result], Value) do
      Inc(Result);
    if Result < Size then
    begin
      Data[Size] := Data[Result];
      Dec(Size);
    end;
  end;
  Data[Result] := Value;
end;

{**************************************************************************
  Function: Merge
  Purpose: merges two ordered sequences into one ordered sequence.
  Parameters:
    SrcFirst, SrcSecond: PPointerList, the two source sequences. If equal
      values are encountered during the merge, the value from SrcSecond is
      placed after the value from SrcFirst.
    Dest: PPointerList, receives the merged result; it must have enough space.
    SizeFirst, SizeSecond: Integer, the lengths of the two source sequences.
    LessThen: TLessThenProc, the comparison function that defines the order.
**************************************************************************}
procedure Merge(SrcFirst, SrcSecond, Dest: PPointerList;
  SizeFirst, SizeSecond: Integer; LessThen: TLessThenProc);
var
  I: Integer;
  IsFirst: Boolean;  { True while SrcFirst refers to the original first sequence }
begin
  IsFirst := True;
  { Start with the sequence whose head comes first. The SizeSecond check
    avoids reading SrcSecond[0] when the second sequence is empty. }
  if (SizeFirst = 0) or
    ((SizeSecond > 0) and LessThen(SrcSecond[0], SrcFirst[0])) then
  begin
    Swap(Pointer(SrcFirst), Pointer(SrcSecond));
    Swap(SizeFirst, SizeSecond);
    IsFirst := not IsFirst;
  end;
  while SizeFirst > 0 do
  begin
    if SizeSecond = 0 then
      I := SizeFirst
    else
    begin
      { Count how many leading elements can be copied before the head of
        the other sequence has to be taken. }
      I := 0;
      while (I < SizeFirst) and
        ((IsFirst and not LessThen(SrcSecond[0], SrcFirst[I])) or
         (not IsFirst and LessThen(SrcFirst[I], SrcSecond[0]))) do
        Inc(I);
    end;
    Move(SrcFirst^, Dest^, SizeOf(Pointer) * I);
    Dec(SizeFirst, I);
    SrcFirst := @SrcFirst[I];
    Dest := @Dest[I];
    { Switch to the other sequence. }
    Swap(Pointer(SrcFirst), Pointer(SrcSecond));
    Swap(SizeFirst, SizeSecond);
    IsFirst := not IsFirst;
  end;
end;

{**************************************************************************
  Function: SortInsert
  Purpose: inserts a value into an ordered sequence so that the sequence is
    still ordered after the insertion.
  Parameters:
    Data: PPointerList, the ordered sequence; it must have room for
      Size + 1 elements.
    Size: Integer, the original sequence length.
    Value: Pointer, the value to insert.
    LessThen: TLessThenProc, the comparison function that defines the order.
**************************************************************************}
procedure SortInsert(Data: PPointerList; Size: Integer; Value: Pointer;
  LessThen: TLessThenProc);
var
  J: Integer;
begin
  if LessThen(Value, Data[0]) then
    J := 0
  else
  begin
    J := Size;
    while (J > 0) and LessThen(Value, Data[J - 1]) do
      Dec(J);
  end;
  { Shift Data[J .. Size - 1] up by one position to make room. }
  Move(Data[J], Data[J + 1], SizeOf(Pointer) * (Size - J));
  Data[J] := Value;
end;

{**************************************************************************
  Function: MergePart
  Purpose: merges two adjacent ordered subsequences of a sequence into one
    ordered subsequence and stores it in the original position.
  Parameters:
    Data: PPointerList, the source sequence.
    First: Integer, the length of the first ordered subsequence.
    Size: Integer, the total sequence length.
    LessThen: TLessThenProc, the comparison function that defines the order.
  Notes:
    If a temporary buffer can be allocated, Merge is used; otherwise the
    elements are inserted one by one with SortInsert.
**************************************************************************}
procedure MergePart(Data: PPointerList; First: Integer; Size: Integer;
  LessThen: TLessThenProc);
var
  Buffer: PPointerList;
  I: Integer;
begin
  { Note: in Delphi, AllocMem typically raises an exception on failure rather
    than returning nil, so the fallback branch below may rarely be reached. }
  Buffer := AllocMem(Size * SizeOf(Pointer));
  if Buffer <> nil then
  begin
    Move(Data^, Buffer^, Size * SizeOf(Pointer));
    Merge(@Buffer[0], @Buffer[First], Data, First, Size - First, LessThen);
    FreeMem(Buffer);
  end
  else
  begin
    Dec(Size);
    for I := First to Size do
      SortInsert(Data, I, Data[I], LessThen);
  end;
end;

{**************************************************************************
  Function: InsertionSort
  Purpose: simple insertion sort.
  Parameters:
    Data: PPointerList, the source sequence.
    Size: Integer, the sequence length.
    LessThen: TLessThenProc, the comparison function that defines the order.
**************************************************************************}
procedure InsertionSort(Data: PPointerList; Size: Integer;
  LessThen: TLessThenProc);
var
  I: Integer;
begin
  Dec(Size);
  for I := 1 to Size do
    SortInsert(Data, I, Data[I], LessThen);
end;

{**************************************************************************
  Function: Sort
  Purpose: sorts a sequence.
  Parameters:
    Data: PPointerList, the source sequence.
    Size: Integer, the sequence length.
    LessThen: TLessThenProc, the comparison function that defines the order.
  Notes:
    Uses quicksort.
    Insertion sort is used when the length of a subsequence is not greater
    than SORT_MAX.
    When the ratio of the subsequence lengths exceeds MAX_RATIO, the longer
    subsequence is split into two halves, which are sorted separately and
    then merged.
**************************************************************************}
procedure Sort(Data: PPointerList; Size: Integer; LessThen: TLessThenProc);
var
  M: Integer;
  OtherData: PPointerList;
  OtherSize: Integer;
begin
  Assert(Data <> nil);
  while Size > SORT_MAX do
  begin
    M := Partition(Data, Size, LessThen);
    { Keep the shorter part in Data/Size; the longer part goes to Other. }
    if M <= Size div 2 then
    begin
      OtherData := @Data[M + 1];
      OtherSize := Size - M - 1;
      Size := M;
    end
    else
    begin
      OtherData := Data;
      OtherSize := M;
      Data := @OtherData[M + 1];
      Size := Size - M - 1;
    end;

    if OtherSize div MAX_RATIO > Size then
    begin
      { Malformed partition: halve the longer part, sort both halves, merge. }
      M := OtherSize div 2;
      Sort(OtherData, M, LessThen);
      Sort(@OtherData[M], OtherSize - M, LessThen);
      MergePart(OtherData, M, OtherSize, LessThen);
    end
    else
      Sort(OtherData, OtherSize, LessThen);
    { The shorter part is handled by the next loop iteration. }
  end;
  InsertionSort(Data, Size, LessThen);
end;
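
The listing does not show how Sort is meant to be called. The following usage sketch illustrates the calling convention: an array of pointers plus a comparison callback of type TLessThenProc. To keep the example runnable on its own, it uses StandInSort, a hypothetical plain insertion sort with the same signature as Sort; the LessInteger callback and the sample data are likewise assumptions made for illustration.

```pascal
program PointerSortUsage;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  TPointerList = array[0..32767] of Pointer;
  PPointerList = ^TPointerList;
  TLessThenProc = function(Left, Right: Pointer): Boolean;

{ Comparison callback: both pointers are assumed to point at Integer values. }
function LessInteger(Left, Right: Pointer): Boolean;
begin
  Result := PInteger(Left)^ < PInteger(Right)^;
end;

{ Hypothetical stand-in with the same signature as Sort in the listing:
  a plain insertion sort, used only so that this example runs by itself. }
procedure StandInSort(Data: PPointerList; Size: Integer; LessThen: TLessThenProc);
var
  I, J: Integer;
  Value: Pointer;
begin
  for I := 1 to Size - 1 do
  begin
    Value := Data[I];
    J := I;
    while (J > 0) and LessThen(Value, Data[J - 1]) do
    begin
      Data[J] := Data[J - 1];
      Dec(J);
    end;
    Data[J] := Value;
  end;
end;

const
  COUNT = 6;
  Values: array[0..COUNT - 1] of Integer = (42, 7, 19, 7, 3, 25);

var
  Items: array[0..COUNT - 1] of Pointer;
  I: Integer;
begin
  { The sort rearranges pointers; the Integer values themselves do not move. }
  for I := 0 to COUNT - 1 do
    Items[I] := @Values[I];
  StandInSort(@Items, COUNT, LessInteger);
  for I := 0 to COUNT - 1 do
    Write(PInteger(Items[I])^, ' ');
  Writeln;  { prints: 3 7 7 19 25 42 }
end.
```

If the listing above is compiled into the same program, replacing the call to StandInSort with Sort should give the same output.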

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
