Sorting Overview
Sorting is an important operation that is frequently used in data processing.
In computers and their application systems, sorting accounts for a large proportion of system running time; sorting has also played a major role in advancing the analysis of algorithms. Hundreds of sorting methods have been devised,
but none is ideal in every situation. This chapter introduces the following commonly used sorting methods, then analyzes and compares them.
1. Insertion sort (direct insertion sort, binary insertion sort, and Shell sort);
2. Exchange sort (bubble sort and quicksort);
3. Selection sort (direct selection sort and heap sort);
4. Merge sort;
5. Radix sort.
Learning focus
1. Master the basic concepts of sorting and the characteristics of the various sorting methods, and be able to apply them flexibly;
2. Master the methods and performance analysis of insertion sort (direct insertion sort, binary insertion sort, and Shell sort), exchange sort (bubble sort and quicksort), selection sort (direct selection sort and heap sort), and two-way merge sort;
3. Understand radix sort and its performance analysis.
Sorting (SORT) or classification
Sorting arranges the records of a file in ascending (or descending) order of their keys. The precise definition is as follows:
Input: n records R1, R2, ..., Rn, whose corresponding keys are K1, K2, ..., Kn.
Output: Ri1, Ri2, ..., Rin, such that Ki1 ≤ Ki2 ≤ ... ≤ Kin (or Ki1 ≥ Ki2 ≥ ... ≥ Kin).
1. Object being sorted: the file
The object to be sorted is a file, which consists of a group of records.
A record is composed of several data items (or fields). One of these items can be used to identify a record and is called the key item; the value of that data item is called the key.
Note:
When no confusion can arise, the key item is simply called the key.
2. Basis of the sorting operation: the key
The key used as the basis for sorting can be of numeric or character type.
The choice of key should depend on the requirements of the problem.
[Example] In the score statistics for the college entrance examination, each candidate is one record. Each record contains the admission ticket number, name, score in each subject, and total score. To uniquely identify a candidate's record, the admission ticket number must be used as the key. To rank candidates by total score, the total score must be used as the key.
Sorting Stability
If the keys of the records to be sorted are all distinct, the sorting result is unique; otherwise it is not.
If the file to be sorted contains several records with the same key and the relative order of those records is unchanged after sorting, the sorting method is called stable. If the relative order of records with equal keys may change, the sorting method is called unstable.
Note:
The stability of a sorting algorithm is a property that must hold for all input instances. That is, if even one possible input instance violates the stability requirement, the sorting algorithm is unstable.
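The difference between a stable and an unstable method can be seen on a three-record input. The following is a minimal sketch, not a definitive implementation: the record type Rec, its tag field (used only to mark input order), and both function names are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical record: an integer key plus a tag that marks input order. */
typedef struct {
    int key;
    char tag;
} Rec;

/* Direct insertion sort: only records with strictly larger keys are
   shifted, so records with equal keys never pass each other (stable). */
void insertion_sort(Rec *r, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        Rec tmp = r[i];
        size_t j = i;
        while (j > 0 && r[j - 1].key > tmp.key) {
            r[j] = r[j - 1];
            j--;
        }
        r[j] = tmp;
    }
}

/* Direct selection sort: the minimum is swapped into place, and that
   long-distance swap can carry a record past another record with an
   equal key (unstable). */
void selection_sort(Rec *r, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++) {
        size_t min = i;
        for (size_t j = i + 1; j < n; j++)
            if (r[j].key < r[min].key)
                min = j;
        if (min != i) {
            Rec tmp = r[i];
            r[i] = r[min];
            r[min] = tmp;
        }
    }
}
```

For the input (2,a), (2,b), (1,c), insertion_sort yields (1,c), (2,a), (2,b), preserving the order of the two records with key 2, while selection_sort yields (1,c), (2,b), (2,a). That single input is enough to show that direct selection sort is unstable.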
Classification of sorting methods
1. Classification by whether data is exchanged between internal and external storage
If the entire file is held in memory during sorting, so that no data is exchanged between internal and external storage, the sort is called internal sorting. If data must be exchanged between internal and external storage during sorting, it is called external sorting.
Note:
① Internal sorting is suitable for small files with few records.
② External sorting is suitable for large files with too many records to hold in memory at once.
2. Classification of internal sorting methods by strategy
Internal sorting methods fall into five categories: insertion sort, selection sort, exchange sort, merge sort, and distribution sort.
Sort Algorithm Analysis
1. Basic operations of sorting algorithms
Most sorting algorithms have two basic operations:
(1) Compare two keys;
(2) Change a pointer to a record, or move the record itself.
Note:
How basic operation (2) is implemented depends on how the records to be sorted are stored.
2. Common storage methods for files to be sorted
(1) A sequential list (or simply a vector) as the storage structure
Sorting process: the records themselves are physically rearranged (that is, keys are compared and records are moved to their proper positions).
(2) A linked list as the storage structure
Sorting process: records need not be moved; only pointers are modified. This kind of sorting is usually called linked-list (or chained) sorting.
(3) The records are stored sequentially, and an auxiliary table is built alongside them (for example, an index table whose entries each hold a key and a pointer to the record's position)
Sorting process: only the entries of the auxiliary table are physically rearranged; the records themselves are not moved. This approach suits sorting methods that are hard to implement on a linked list but must still avoid moving records during sorting.
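Storage method (3) can be sketched as follows. This is an illustrative example under stated assumptions: the types Record and IndexEntry and the function sort_index are hypothetical names, and the standard library qsort stands in for whichever sorting method is actually chosen.

```c
#include <stdlib.h>

/* Hypothetical large record; payload stands in for the other data items. */
typedef struct {
    int key;
    char payload[64];
} Record;

/* One entry of the auxiliary (index) table: a key and a pointer to the record. */
typedef struct {
    int key;
    const Record *rec;
} IndexEntry;

/* Three-way comparison of two index entries by key, as qsort requires. */
static int cmp_entry(const void *a, const void *b)
{
    const IndexEntry *x = a;
    const IndexEntry *y = b;
    return (x->key > y->key) - (x->key < y->key);
}

/* Fill idx[0..n-1] from recs[0..n-1], then sort idx by key.
   Only the small index entries are moved; recs is left untouched. */
void sort_index(const Record *recs, IndexEntry *idx, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        idx[i].key = recs[i].key;
        idx[i].rec = &recs[i];
    }
    qsort(idx, n, sizeof idx[0], cmp_entry);
}
```

After sort_index returns, idx[0].rec, idx[1].rec, ... point to the records in ascending key order. Note that the C standard does not guarantee that qsort is stable, so a different method would be needed if the order of equal keys matters.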
3. Performance Evaluation of sorting algorithms
(1) Criteria for Evaluating sorting algorithms
There are two criteria for evaluating sorting algorithms:
① Execution time and necessary auxiliary space
② Complexity of the algorithm itself
(2) Space complexity of sorting algorithms
If the auxiliary space a sorting algorithm requires does not depend on the problem size n, that is, the auxiliary space is O(1), the algorithm is called an in-place sort.
The auxiliary space required by a non-in-place sort is generally O(n).
(3) Time cost of sorting algorithms
The time cost of most sorting algorithms consists mainly of key comparisons and record moves. The execution time of some sorting algorithms depends not only on the problem size but also on the state of the data in the input instance.
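The dependence of running time on the input state can be made concrete by counting key comparisons. The sketch below instruments a direct insertion sort on plain int keys; the function name insertion_sort_count is an assumption for illustration.

```c
#include <stddef.h>

/* Direct insertion sort on a[0..n-1].
   Returns the number of key comparisons performed. */
unsigned long insertion_sort_count(int *a, size_t n)
{
    unsigned long comparisons = 0;
    for (size_t i = 1; i < n; i++) {
        int tmp = a[i];
        size_t j = i;
        while (j > 0) {
            comparisons++;          /* compare a[j-1] with the key being inserted */
            if (a[j - 1] <= tmp)
                break;
            a[j] = a[j - 1];        /* shift the larger key one place right */
            j--;
        }
        a[j] = tmp;
    }
    return comparisons;
}
```

On an input that is already in ascending order the function performs n - 1 comparisons, while on an input in descending order it performs n(n - 1)/2. For n = 5 these are 4 and 10 respectively.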
Sequential storage representation of the file to be sorted
#define n 100 // file length, that is, the number of records to be sorted
typedef int KeyType; // assumed key type
typedef struct { // record type
    KeyType key; // key item
    InfoType otherinfo; // other data items; the type InfoType depends on the specific application
} RecType;
typedef RecType SeqList[n + 1]; // SeqList is the sequential list type; unit 0 of the table is generally used as a sentinel
Note:
If the key type has no built-in comparison operator, a macro or function can be defined in advance to express the comparison.
[Example] When the key is a string, the macro "#define LT(a, b) (strcmp((a), (b)) < 0)" can be defined. Then "a < b" in an algorithm can be replaced by "LT(a, b)". In C++, it is more convenient to overload the operator "<".
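As a sketch of how the comparison macro and the sentinel in unit 0 work together, the following direct insertion sort operates on string keys stored in r[1..n]. The function name insert_sort_str and the use of a bare array of string pointers (rather than the full record type) are simplifying assumptions.

```c
#include <string.h>

/* Comparison macro for string keys. */
#define LT(a, b) (strcmp((a), (b)) < 0)

/* Direct insertion sort on the string keys r[1..n].
   r[0] is reserved as the sentinel slot and is overwritten. */
void insert_sort_str(const char *r[], int n)
{
    for (int i = 2; i <= n; i++) {
        if (LT(r[i], r[i - 1])) {
            r[0] = r[i];                /* sentinel holds the key being inserted */
            int j = i - 1;
            do {
                r[j + 1] = r[j];        /* shift the larger key one place right */
                j--;
            } while (LT(r[0], r[j]));   /* stops at j == 0 at the latest */
            r[j + 1] = r[0];
        }
    }
}
```

Because r[0] holds a copy of the key being inserted, the test LT(r[0], r[j]) is false when j reaches 0, so the inner loop needs no separate bounds check.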
By average time, sorting methods fall into four classes:
(1) Quadratic order, O(n^2)
Generally called simple sorts, such as direct insertion, direct selection, and bubble sort;
(2) Linearithmic order, O(n lg n)
Such as quicksort, heap sort, and merge sort;
(3) Order O(n^(1+ε))
ε is a constant between 0 and 1, that is, 0 < ε < 1; Shell sort is an example;
(4) Linear order, O(n)
Such as bucket sort, bin sort, and radix sort.
Comparison of sorting methods
Among the simple sorts, direct insertion is the best; overall, quicksort is the fastest on average. When the file is already in order, direct insertion and bubble sort both perform best.
Factors Affecting sorting performance
Because different sorting methods adapt to different application environments and requirements, the following factors should be taken into account when selecting an appropriate sorting method:
① The number n of records to be sorted;
② The size of each record;
③ The structure of the keys and their initial state;
④ Stability requirements;
⑤ The facilities of the programming language;
⑥ The storage structure;
⑦ Time and auxiliary space complexity.
Selection of sorting methods under Different Conditions
(1) If n is small (such as n ≤ 50), direct insertion sort or direct selection sort can be used.
When the records are small, direct insertion sort is better; otherwise direct selection sort should be chosen, because it moves fewer records than direct insertion.
(2) If the file is initially almost in order, direct insertion sort, bubble sort, or randomized quicksort should be chosen;
(3) If n is large, a method with O(n lg n) time complexity should be used: quicksort, heap sort, or merge sort.
Quicksort is generally regarded as the best comparison-based internal sorting method. When the keys to be sorted are randomly distributed, quicksort has the shortest average time;
Heap sort requires less auxiliary space than quicksort and does not exhibit quicksort's possible worst case. Both of these sorts are unstable.
If stability is required, merge sort can be chosen. However, the algorithm presented in this chapter that starts by merging single records in pairs is not recommended. It is usually better combined with direct insertion sort: first use direct insertion sort to obtain longer ordered subfiles, then merge those subfiles pairwise. Because direct insertion sort is stable, the improved merge sort is still stable.
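The combination of direct insertion sort and merging described in (3) might look like the sketch below. It is one possible arrangement, not the chapter's own algorithm: the run length RUN, the function names, and the use of plain int keys are all assumptions.

```c
#include <stdlib.h>
#include <string.h>

#define RUN 16  /* hypothetical length of the initial ordered subfiles */

/* Stable direct insertion sort on a[lo..hi-1]. */
static void insertion_run(int *a, size_t lo, size_t hi)
{
    for (size_t i = lo + 1; i < hi; i++) {
        int tmp = a[i];
        size_t j = i;
        while (j > lo && a[j - 1] > tmp) {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = tmp;
    }
}

/* Merge sorted a[lo..mid-1] and a[mid..hi-1] through buf.
   On equal keys the left run is taken first, which keeps the merge stable. */
static void merge(int *a, int *buf, size_t lo, size_t mid, size_t hi)
{
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        buf[k++] = (a[j] < a[i]) ? a[j++] : a[i++];
    while (i < mid)
        buf[k++] = a[i++];
    while (j < hi)
        buf[k++] = a[j++];
    memcpy(a + lo, buf + lo, (hi - lo) * sizeof a[0]);
}

/* Returns 0 on success, -1 if the auxiliary buffer cannot be allocated. */
int hybrid_merge_sort(int *a, size_t n)
{
    if (n < 2)
        return 0;
    int *buf = malloc(n * sizeof a[0]);
    if (buf == NULL)
        return -1;

    /* Step 1: build ordered subfiles of length RUN by direct insertion. */
    for (size_t lo = 0; lo < n; lo += RUN)
        insertion_run(a, lo, lo + RUN < n ? lo + RUN : n);

    /* Step 2: merge neighbouring subfiles pairwise, doubling the width. */
    for (size_t width = RUN; width < n; width *= 2) {
        for (size_t lo = 0; lo + width < n; lo += 2 * width) {
            size_t mid = lo + width;
            size_t hi = mid + width < n ? mid + width : n;
            merge(a, buf, lo, mid, hi);
        }
    }
    free(buf);
    return 0;
}
```

Both steps preserve the relative order of equal keys, which is why the combined method remains stable. The O(n) buffer means it is not an in-place sort.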
(4) In comparison-based sorting, each comparison of two keys has only two possible outcomes, so the comparison process can be described by a binary decision tree. The tree must have at least n! leaves, one for each possible ordering, so its height is at least lg(n!), which is on the order of n lg n.
It follows that when the n keys of a file are randomly distributed, any comparison-based sorting algorithm needs time at least on the order of n lg n.
In bin sort and radix sort, a single step can branch m ways, that is, place a record into one of m bins, so in general bin sort and radix sort can sort n records in O(n) time. However, bin sort and radix sort apply only to keys with obvious structure, such as strings and integers. When the value range of the key is an infinite set (such as real-number keys), bin sort and radix sort cannot be used, and only comparison-based sorting is possible.
If n is large and the keys have few digits and can be decomposed, radix sort is preferable. Although bucket sort places no requirement on the structure of the keys, its average time reaches linear order only when the keys are randomly distributed; otherwise it is of quadratic order. Note also that bin, bucket, and radix sort, as distribution sorts, all assume that numeric keys are non-negative; otherwise, mapping keys to bins (buckets) takes additional time.
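A radix sort on non-negative integer keys can be sketched as follows. This is a minimal least-significant-digit version using counting passes in a vector rather than linked bins; the base RADIX and the function name radix_sort are assumptions, and negative keys are deliberately not handled, in line with the assumption noted above.

```c
#include <stdlib.h>
#include <string.h>

#define RADIX 10  /* number of bins m; base 10 is chosen only for readability */

/* Least-significant-digit radix sort of non-negative ints.
   Returns 0 on success, -1 if the auxiliary buffer cannot be allocated. */
int radix_sort(int *a, size_t n)
{
    if (n < 2)
        return 0;
    int *out = malloc(n * sizeof a[0]);
    if (out == NULL)
        return -1;

    int max = a[0];
    for (size_t i = 1; i < n; i++)
        if (a[i] > max)
            max = a[i];

    /* One distribution pass per decimal digit of the largest key. */
    for (long long exp = 1; max / exp > 0; exp *= RADIX) {
        size_t count[RADIX] = {0};
        for (size_t i = 0; i < n; i++)
            count[(a[i] / exp) % RADIX]++;
        /* Prefix sums: count[d] becomes the end position of bin d. */
        for (int d = 1; d < RADIX; d++)
            count[d] += count[d - 1];
        /* Walk backwards so keys with equal digits keep their order. */
        for (size_t i = n; i-- > 0; )
            out[--count[(a[i] / exp) % RADIX]] = a[i];
        memcpy(a, out, n * sizeof a[0]);
    }
    free(out);
    return 0;
}
```

Each pass touches every key once, so with d digits the total work is on the order of d(n + RADIX). This is linear in n when the number of digits is small and fixed.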
(5) Some languages (such as Fortran, COBOL, or BASIC) provide neither pointers nor recursion, which makes merge sort and quicksort (both easy to implement with recursion) and radix sort (which uses pointers) complicated to implement. In that case, other sorting methods can be considered.
(6) The sorting algorithms given in this chapter all store the input data in a vector. When records are large, a linked list can be used as the storage structure to avoid spending a great deal of time moving records. For example, insertion sort, merge sort, and radix sort are all easy to implement on a linked list, which reduces the number of record moves. However, some sorting methods, such as quicksort and heap sort, are difficult to implement on a linked list. In that case, the keys can be extracted to build an index table, and the index table is then sorted. A simpler method is to introduce an integer vector t as an auxiliary table. Before sorting, set t[i] = i (0 ≤ i < n). Whenever the sorting algorithm needs to exchange R[i] and R[j], it only exchanges t[i] and t[j]. After sorting, the vector t gives the order of the records:
R[t[0]].key ≤ R[t[1]].key ≤ ... ≤ R[t[n-1]].key
If the final result is required to be:
R[0].key ≤ R[1].key ≤ ... ≤ R[n-1].key
then after sorting, the records can be rearranged in the order given by the auxiliary table. This rearrangement takes O(n) time.
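The auxiliary vector t and the final rearrangement can be sketched as below. The record type BigRec and both function names are hypothetical, and direct insertion sort is used on t only to keep the example short; any sorting method that compares R[t[i]].key and exchanges entries of t would serve.

```c
#include <stddef.h>

/* Hypothetical large record; payload stands in for the other data items. */
typedef struct {
    int key;
    char payload[64];
} BigRec;

/* Sort t so that r[t[0]].key <= r[t[1]].key <= ... <= r[t[n-1]].key.
   Only the entries of t are moved; the records in r are not touched. */
void sort_by_index(const BigRec *r, size_t *t, size_t n)
{
    for (size_t i = 0; i < n; i++)
        t[i] = i;
    for (size_t i = 1; i < n; i++) {
        size_t cur = t[i];
        size_t j = i;
        while (j > 0 && r[t[j - 1]].key > r[cur].key) {
            t[j] = t[j - 1];
            j--;
        }
        t[j] = cur;
    }
}

/* Rearrange r in place so that the new r[i] is the old r[t[i]].
   Each cycle of the permutation is followed once, so the number of
   record moves is O(n). As a side effect t is reset to the identity. */
void rearrange(BigRec *r, size_t *t, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (t[i] == i)
            continue;
        BigRec tmp = r[i];          /* save the record that starts the cycle */
        size_t j = i;
        while (t[j] != i) {
            size_t next = t[j];
            r[j] = r[next];         /* pull the correct record into slot j */
            t[j] = j;
            j = next;
        }
        r[j] = tmp;                 /* close the cycle */
        t[j] = j;
    }
}
```

A typical use is to call sort_by_index(r, t, n) first and then, only if the records themselves must end up in key order, rearrange(r, t, n).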