In the previous chapters, bubble sort, insertion sort, selection sort, and merge sort were each described. When introducing hashing, bucket sort, counting sort, and radix sort were also covered. When discussing the priority queue, combining that structure with the heap led to heapsort and the more general tournament sort.
This chapter covers some more advanced sorting algorithms and discusses several problems derived from them.
Quick Sort
The core idea of quick sort is partitioning. The sequence to be sorted is divided into a front subsequence and a rear subsequence, and each of these two smaller subsequences is then sorted recursively.
Quick sort and merge sort are both typical cases of divide and conquer, yet they differ greatly. In quick sort, the independence of the subproblems is more pronounced: no element of the front subsequence may exceed any element of the rear subsequence. Once this condition holds, after recursively sorting the two subsequences we obtain the ordered whole simply by concatenating them, completing the original sorting task.
The core task, and the core difficulty, of quick sort is therefore how to divide the sequence into such subsequences. Merge sort is exactly the opposite: its work and its difficulty lie entirely in merging the solutions of the subtasks.
Pivot point
To divide a subsequence we need a "pivot point": an element that is already in its final sorted position and that splits the sequence into the two subsequences still to be sorted.
The steps to construct the pivot point are as follows:
- Select the first element m of the sequence as the candidate pivot.
- Set two pointers lo and hi, dividing the sequence into three parts L, U, G: every element of L is less than or equal to m, every element of G is greater than or equal to m, and the elements of U have not yet been compared. Initially the whole sequence is U, while L and G are empty.
- Move lo and hi alternately toward each other. Each time lo or hi moves, one element of U is compared with m: if it is less than m it is placed into L, otherwise into G. Repeat this process until lo and hi coincide, at which point U is empty and every element lies in L or G.
- Place m at the position where lo and hi coincide. m has now become a true pivot: it already occupies the position it will hold after the sort.
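The steps above can be sketched in Python as follows (a minimal illustration; the function name `partition` and the zero-based ranks are assumptions of this sketch, not the author's code):

```python
def partition(s, lo, hi):
    """Construct a pivot within s[lo..hi]: afterwards every element left of
    the returned rank is <= the pivot and every element right of it is >= it."""
    m = s[lo]                          # pivot candidate; s[lo] is now logically vacant
    while lo < hi:
        while lo < hi and s[hi] >= m:  # shrink G from the right
            hi -= 1
        s[lo] = s[hi]                  # element < m joins L; s[hi] becomes vacant
        while lo < hi and s[lo] <= m:  # shrink L from the left
            lo += 1
        s[hi] = s[lo]                  # element > m joins G; s[lo] becomes vacant
    s[lo] = m                          # lo == hi: the pivot lands in its final place
    return lo
```

The returned rank is exactly the pivot's position in the final sorted order.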
Sorting process
The sorting process consists mainly of the movement of lo and hi and the accompanying growth of L and G.
First select 6 as the pivot candidate; its original position is now logically vacant. Then G is extended: the element 7 pointed to by hi is greater than 6, so 7 joins G and hi moves to the new element 1. Since 1 is less than 6, 1 joins L and lo moves to the new element 3. Since 3 is less than 6, 3 joins L and lo moves to 8. Since 8 is greater than 6, 8 joins G and hi moves left to 5. This process repeats until lo and hi both point to the same vacant position, into which 6 is placed. The partition is complete: every element before 6 is no greater than 6, and every element after it is no less than 6.
Recursion
After the first pivot is constructed, pivots are then constructed within each of the two subsequences. This process recurses until a subsequence contains only a single element, at which point the entire sequence is sorted and the recursion terminates. In this view, quick sort can be seen as the process of moving elements one by one into their pivot positions.
Performance analysis
- As the sorting process shows, quick sort is unstable: equal elements may end up in reversed order after sorting.
- Aside from the recursion itself, quick sort needs only constant auxiliary space: two pointers plus one slot for the pivot candidate.
- For comparison, the time complexity of merge sort is O(nlogn) because its subtasks are always of equal size, close to n/2, so the recursion is only O(logn) levels deep. Quick sort cannot guarantee subtasks of size n/2, because the pivot's final rank is effectively random. In the worst case, every partition produces one subsequence of size 0 and another of size n-1, and the time complexity becomes O(n^2). On average, though, the time complexity of quick sort is still O(nlogn).
Algorithm implementation
There are many implementations of quick sort. The one chosen here differs from the process described above: instead of the L, U, G arrangement, it keeps the partial result arranged as L, G, U. The basic steps are otherwise the same.
Code:
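A possible Python sketch of this L, G, U arrangement (often called the Lomuto partition scheme; the names `quick_sort` and `mi` are illustrative, not from the original text):

```python
def quick_sort(s, lo=0, hi=None):
    """Quick sort using the L, G, U arrangement: s[lo] is the pivot
    candidate, the scanned prefix is kept as L followed by G, and U
    shrinks as the scan pointer k advances."""
    if hi is None:
        hi = len(s) - 1
    if lo >= hi:                        # recursion base: at most one element
        return s
    m = s[lo]                           # pivot candidate
    mi = lo                             # boundary between L and G
    for k in range(lo + 1, hi + 1):     # examine U one element at a time
        if s[k] < m:                    # s[k] belongs to L:
            mi += 1                     #   extend L by one slot
            s[k], s[mi] = s[mi], s[k]   #   and swap s[k] into it
    s[lo], s[mi] = s[mi], s[lo]         # place the pivot between L and G
    quick_sort(s, lo, mi - 1)           # sort L recursively
    quick_sort(s, mi + 1, hi)           # sort G recursively
    return s
```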
Quick sort GIF demo (source: Wikipedia)
Selection
Selection problems are derived from and generalize sorting. Their common trait is the need to find some special element among a set of mutually comparable elements: for instance, the element of a particular rank in ascending order, or the median that would sit exactly in the middle of the sorted order. If we had the sorted sequence of the whole data set, such problems would be solved immediately; but because sorting itself is relatively expensive, we try to bypass it and look for more efficient algorithms.
- K-selection: in any set of mutually comparable elements, find the element of rank k in ascending order.
- Median: in a sequence S of length n, find the element s[⌊n/2⌋] (n/2 rounded down), called the median.
The median is a special case of k-selection, and one of its hardest cases.
- Majority: in an unordered vector, if more than half of the elements are equal to some element m, then m is called the majority.
If a majority m exists, it must be the median of the vector. So we can first find the median as a candidate, and then traverse the vector counting occurrences to decide whether the candidate really is a majority.
Selecting the majority
Consider a prefix P of vector A satisfying one condition: exactly half of its elements equal some element x. After removing P, the remaining vector A - P has the same majority as the original vector A, if A has one at all. We can therefore shrink the problem by repeatedly discarding such prefixes.
The idea is realized as follows: scan A from its first element, and as soon as the scanned prefix satisfies the condition of P, discard it and continue scanning the rest. Iterating this process to the end of the vector yields a candidate x which, if A has a majority at all, must be that majority.
The algorithm can be implemented as follows.
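A minimal Python sketch of this prefix-elimination idea (it coincides with the well-known Boyer-Moore majority vote; the names are illustrative):

```python
def majority(a):
    """Majority by prefix elimination: whenever the scanned prefix contains
    the current candidate exactly half the time, that prefix is discarded.
    Returns the majority element, or None if no majority exists."""
    candidate, count = None, 0
    for x in a:                  # one linear scan, O(1) extra space
        if count == 0:           # previous prefix eliminated; restart
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1           # pair x off against the candidate
    # the surviving candidate must still be verified by a second scan
    if candidate is not None and a.count(candidate) * 2 > len(a):
        return candidate
    return None
```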
The time complexity of this algorithm is O(n) and the space complexity is O(1).
K-selection: quickselect
Borrowing the idea of quick sort, we construct a pivot in the sequence; the pivot's position is exactly its rank in the sorted order. If the pivot's rank is greater than the target k, then s[k] lies in L and we can discard G; conversely, if it is smaller, s[k] lies in G and we can discard L. This is also a decrease-and-conquer process that keeps shrinking the problem.
The expected time complexity of this algorithm is O(n), but in the worst case it degrades to O(n^2), which clearly cannot meet our requirement. We need to optimize it.
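For reference, a hypothetical Python sketch of quickselect along these lines (zero-based rank k):

```python
def quick_select(s, k):
    """Return the element of rank k (k = 0 .. n-1) by repeatedly
    partitioning and keeping only the side that contains rank k.
    Expected O(n), but O(n^2) in the worst case."""
    s = list(s)                  # work on a copy
    lo, hi = 0, len(s) - 1
    while True:
        # partition s[lo..hi] around the pivot candidate s[lo]
        m, mi = s[lo], lo
        for j in range(lo + 1, hi + 1):
            if s[j] < m:
                mi += 1
                s[j], s[mi] = s[mi], s[j]
        s[lo], s[mi] = s[mi], s[lo]  # pivot now sits at its sorted rank mi
        if k == mi:
            return s[mi]
        elif k < mi:
            hi = mi - 1          # s[k] lies in L: discard G
        else:
            lo = mi + 1          # s[k] lies in G: discard L
```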
Linearselect
Linearselect is an optimized version of quickselect whose efficiency is linear even in the worst case. The algorithm fixes a relatively small constant Q and proceeds by the following steps:
- Recursion base: if the length of vector A is less than Q, find s[k] by any trivial selection algorithm (for instance, sort A and pick the element of rank k)
- Divide vector A into n/Q subsequences of length Q each
- Sort each subsequence and take its median
- Recursively call linearselect to find the median M of these n/Q medians
- Partition vector A into three parts: elements less than M go into subset L, elements equal to M into subset E, and elements greater than M into subset G
- If the target s[k] lies in L, discard E and G and recursively call linearselect on L; symmetrically, if it lies in G, discard L and E and recurse on G
- If the target s[k] lies in E, return the element M directly; the search ends.
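The steps above can be sketched in Python as follows (a simplified illustration using list comprehensions rather than in-place partitioning; the names and the choice Q = 5 are assumptions of this sketch):

```python
Q = 5  # the small constant from the steps above

def linear_select(a, k):
    """Element of rank k (k = 0 .. n-1) by the median-of-medians scheme;
    worst-case linear time."""
    if len(a) < Q:                                 # recursion base: trivial
        return sorted(a)[k]
    # split into groups of Q and take each group's median
    medians = [sorted(a[i:i + Q])[len(a[i:i + Q]) // 2]
               for i in range(0, len(a), Q)]
    m = linear_select(medians, len(medians) // 2)  # median of medians
    L = [x for x in a if x < m]
    E = [x for x in a if x == m]
    G = [x for x in a if x > m]
    if k < len(L):
        return linear_select(L, k)                 # discard E and G
    elif k < len(L) + len(E):
        return m                                   # target sits inside E
    else:
        return linear_select(G, k - len(L) - len(E))  # discard L and E
```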
Linearselect performance analysis:
Step 2 is O(n).
Step 3 is O(n) = O(1) * (n/Q): each subsequence has constant length Q, so sorting one takes O(1).
Step 4 is T(n/Q): this step recurses, with the problem size reduced to n/Q.
Step 5 is O(n): vector A is scanned once.
Step 6 is T(3n/4): this step also recurses. Since M is the median of the group medians, at least a quarter of the elements are no greater than M and at least a quarter are no less, so neither L nor G can exceed 3/4 of the original size.
Therefore T(n) = cn + T(n/Q) + T(3n/4), where c is a constant.
Taking Q = 5 gives T(n) = cn + T(n/5) + T(3n/4) = O(20cn) = O(n), since 1/5 + 3/4 < 1.
Shell sort
Shell sort, also called the diminishing increment sort, is a more efficient improved version of insertion sort. Shell sort is an unstable sorting algorithm.
Shell sort proposes its improvement based on the following two properties of insertion sort:
- Insertion sort is efficient on almost-sorted data, where it approaches linear time
- But insertion sort is generally inefficient, because it can move data only one position at a time
--Wikipedia
The idea of Shell sort is to treat the sequence as a matrix, sort it column by column, and keep reducing the number of columns while the rows grow, until a matrix with a single column is obtained.
- Sorting column by column at width w is called w-sorting, and a sequence whose columns are all sorted is called w-ordered. The final pass over a single column is 1-sorting
- The matrix widths used in this process, w_k, w_{k-1}, and so on down to w_3, w_2, w_1, together form the step sequence. The step sequence must be monotonically increasing and its first term must be 1; the sort applies the widths from largest to smallest
- Beyond monotonic increase and a first term of 1, the step sequence has no further requirements. The performance of Shell sort therefore varies with the step sequence chosen, so Shell sort is best regarded as a family of algorithms
Sorting process
Take the step sequence {8, 5, 3, 2, 1} as an example. In each pass the sequence is laid out as a matrix whose width is w_k and sorted column by column; once the columns are sorted, it is read back row by row into a single-line sequence.
As you can see, every pass improves the order of the sequence, and the final pass completes the sort.
Notice also that the final pass is itself an insertion sort of the entire sequence, so what did the preceding k-1 passes accomplish? They leave the sequence nearly sorted, so the final pass runs fast; this is the core idea of Shell sort.
Matrix Conversion
In the matrix conversion we do not actually need a two-dimensional vector. The one-dimensional vector can be viewed as a matrix simply by computing each element's logical rank within the matrix.
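As a small illustration of this logical view (a hypothetical helper, not from the original text): viewed as a matrix of width w, the element at rank i lies in row i // w and column i % w, so a column is simply a strided slice.

```python
def column(a, w, c):
    # Viewed as a matrix of width w, column c consists of
    # the elements at ranks c, c + w, c + 2w, ...
    # No two-dimensional structure is ever materialized.
    return a[c::w]
```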
W-sorting: insertion sort
Each w-sorting pass improves the order of the sequence, and insertion sort is the most suitable way to sort the matrix column by column, because insertion sort is sensitive to its input: the fewer inversions the input contains, the better it performs.
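Putting the pieces together, a minimal Python sketch of Shell sort with an explicit step sequence (the example sequence {8, 5, 3, 2, 1} from the text is assumed as a default; the gap-indexed inner loop plays the role of the column-wise insertion sort):

```python
def shell_sort(a, steps=(8, 5, 3, 2, 1)):
    """Shell sort: for each width w, run an insertion sort restricted to
    ranks w apart, i.e. w-sort the vector, finishing with w = 1."""
    for w in sorted(steps, reverse=True):  # apply widths largest to smallest
        if w >= len(a):                    # a width this large does nothing
            continue
        for i in range(w, len(a)):         # insertion sort with stride w
            x, j = a[i], i
            while j >= w and a[j - w] > x:
                a[j] = a[j - w]            # shift larger elements right
                j -= w
            a[j] = x
    return a
```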
Drawbacks of the Shell sequence
The Shell sequence is the step sequence proposed by the inventor of Shell sort: a geometric sequence with ratio 2, namely {1, 2, 4, 8, ...}.
With this sequence the worst-case time complexity is O(n^2). The cause of the inefficiency is that every w_k except the last is even, so in each pass (except the final one) the elements at odd ranks never interact with those at even ranks: odd-rank elements are compared only with odd-rank elements, and even-rank elements only with even-rank elements. This can leave a large number of inversions for the final pass, which degrades efficiency.
The analysis of the Shell sequence shows that the inefficiency comes from the design of the step sequence. What makes a good step sequence? The answer is that adjacent terms should be coprime, which minimizes the work that the final pass has to repeat.
Postage issues
Suppose that in some country an ordinary letter costs 50 cents to send and a postcard costs just 35 cents, while the only stamps issued are worth 4 cents and 13 cents. Can these two kinds of stamps alone always make up the required postage exactly?
A simple calculation gives the answer: six 4-cent stamps plus two 13-cent stamps add up to exactly the 50-cent letter rate. For the postcard, however, no combination adds up to exactly 35 cents. So with a fixed pair of stamp values, some amounts of postage can be made up exactly, while others cannot be made up at all.
Linear combination
In the example above, every amount the stamps can pay takes the form 4m + 13n. Such an expression is called a linear combination, and it generates all the amounts of postage that can be paid.
More generally, write f = mg + nh with m, n ≥ 0, where g and h are coprime. Here we care about the numbers that this linear combination cannot produce; the set of such numbers is denoted N(g,h), and its maximum value is written x(g,h).
A result from number theory tells us that when g and h are coprime, x(g,h) = (g-1)(h-1) - 1 = gh - g - h.
For example, x(4,13) = 35: the 35-cent postcard rate cannot be paid with 4-cent and 13-cent stamps, while every amount greater than 35 cents can.
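This can be checked with a short script (a hypothetical helper that brute-forces the stamp counts; the stamp values 4 and 13 are taken from the example above):

```python
def representable(f, g=4, h=13):
    """Can postage f be paid exactly with stamps of value g and h,
    i.e. does f = m*g + n*h have a solution in nonnegative m, n?"""
    return any((f - n * h) >= 0 and (f - n * h) % g == 0
               for n in range(f // h + 1))

# For coprime g and h, the largest unreachable value is x(g,h) = g*h - g - h.
```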
Theorem K
- h-ordered: in a sequence, every pair of elements at distance h is in nondecreasing order; that is, the subsequences taken at intervals of h are each sorted.
Observe that when Shell sort sorts by column with w_k = h, the result is exactly h-ordered; in Shell sort this pass is called h-sorting. Any sequence, once h-sorted, is necessarily h-ordered.
Suppose the previous pass made the sequence h-ordered, and the next pass makes it g-ordered. Is the sequence still h-ordered after the g-sorting? The answer is yes, as Knuth proves in the famous TAOCP (Vol. 3, p. 90).
A sequence that is both h-ordered and g-ordered is called (g,h)-ordered. In such a sequence, every pair of elements at distance g + h is in order; further, the same holds for every linear combination of g and h, so the sequence is (mg + nh)-ordered. We thus reach a conclusion: any pair of elements whose distance can be expressed as a linear combination of g and h is necessarily in order.
Next we analyze the element of rank i in such a sequence.
If the sequence is already (g,h)-ordered, with g and h coprime, then every element at distance (g-1)(h-1) or more from rank i is in order with it. Conversely, inversions involving rank i can exist only within distance (g-1)(h-1).
As Shell sort progresses, g and h decrease, so (g-1)(h-1) decreases, meaning that the number of inversions keeps falling. This is the core reason for using insertion sort in each pass.
From this analysis we conclude: choosing step-sequence terms that are coprime keeps the number of inversions as small as possible, and thereby improves the efficiency of Shell sort.
Better step sequences designed for Shell sort include the PS (Papernov-Stasevich) sequence, the Pratt sequence, and the Sedgewick sequence.
Shell sort animation demo: http://student.zjzk.cn/course_ware/data_structure/web/flashhtml/shell.htm
Shell sort implementation code reference: https://zh.wikipedia.org/wiki/%E5%B8%8C%E5%B0%94%E6%8E%92%E5%BA%8F#.E6.AD.A5.E9.95.BF.E5.BA.8F.E5.88.97
Chapter 12 • Sorting