Chapter 11. Sorting and Searching
1. The Concept of Algorithms
An algorithm is a series of computational steps that transform a set of inputs into a set of outputs, and each step must complete within a finite amount of time. For example, sorting a group of numbers from smallest to largest: the input is the set of raw data, the output is the sorted data, and the computational steps include actions such as comparing and moving data.
An algorithm is used to solve a class of computational problems; note that it is a class of problems, not one specific problem. Because an algorithm is meant to solve a class of problems, it is correct only if it correctly solves every instance of that class. There are two ways an algorithm can be incorrect: for some input instances it computes forever and never terminates, or for some input instances it terminates but outputs a wrong result. Sometimes even an incorrect algorithm is useful: if it is hard to find a correct algorithm for a problem, while an incorrect algorithm terminates in finite time and its error can be kept within a certain range, then such an algorithm still has practical value. For example, the cost of finding the optimal solution is sometimes very high, so an algorithm that gives a suboptimal solution is often chosen instead.
2. Insertion Sort
Let us program an insertion sort on an array: each new element is inserted into the already-sorted part by moving the data after the insertion point back one cell. The sorting algorithm is as follows:
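The original listing is not reproduced here; below is a minimal sketch consistent with the description. The sample data, the length LEN, and the name insertion_sort are assumptions made for illustration:

```c
#include <stdio.h>

#define LEN 5
int a[LEN] = { 10, 5, 2, 4, 7 };	/* sample data, assumed for illustration */

void insertion_sort(void)
{
	int i, j, key;
	for (j = 1; j < LEN; j++) {
		/* print the array at the start of each iteration */
		printf("%d, %d, %d, %d, %d\n", a[0], a[1], a[2], a[3], a[4]);
		key = a[j];
		/* move every element larger than key back one cell */
		i = j - 1;
		while (i >= 0 && a[i] > key) {
			a[i+1] = a[i];
			i--;
		}
		a[i+1] = key;	/* insert key at the insertion point */
	}
	/* print the final, sorted array */
	printf("%d, %d, %d, %d, %d\n", a[0], a[1], a[2], a[3], a[4]);
}

int main(void)
{
	insertion_sort();
	return 0;
}
```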
To see the sorting process more clearly, a print statement was inserted at the beginning of each iteration of the loop, and another one after the sort completes. The result of running the program is:
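With the assumed initial data { 10, 5, 2, 4, 7 }, the sketch above prints:

```
10, 5, 2, 4, 7
5, 10, 2, 4, 7
2, 5, 10, 4, 7
2, 4, 5, 10, 7
2, 4, 5, 10, 7
2, 4, 5, 7, 10
```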
How can the algorithm be rigorously proven correct? In other words: as long as the body of the algorithm's for loop is executed len-1 times, array a must end up sorted, no matter what the original data of array a is. How do we prove this? We can use the concept of a loop invariant, together with mathematical induction, to reason about algorithms with loop structure. A condition is called a loop invariant if it satisfies the following three criteria:
- The condition is true before the first execution of the loop body.
- If the condition is true after the (n-1)-th iteration (that is, before the n-th iteration), then there is a way to prove that it is still true after the n-th iteration.
- If the condition is true when all iterations have finished, then there is a way to prove that the algorithm solves the problem correctly.
Once we find such a loop invariant, we can prove that an algorithm with a loop structure is correct. For the sorting algorithm above, the loop invariant is this condition: before the j-th iteration, the subsequence a[0..j-1] is sorted. (In the print output above, the sorted subsequence a[0..j-1] is the leading part of each line.) Below we verify the three criteria for this loop invariant:
- Before the first iteration, j=1, and the subsequence a[0..j-1] consists of the single element a[0]; a sequence of only one element is obviously sorted.
- Before the j-th iteration, assume the premise "the subsequence a[0..j-1] is sorted" holds. Now key=a[j] is to be inserted: following the steps of the algorithm, a[j-1], a[j-2], a[j-3] and so on, every element larger than key, is moved back one position, until a suitable insertion point is found for key. It follows that at the end of this iteration the subsequence a[0..j] is sorted.
- When the loop ends, j=len. If the premise "the subsequence a[0..j-1] is sorted" holds, this means a[0..len-1] is sorted, that is, all len elements of array a are in order.
With these three criteria established, mathematical induction proves that the loop is correct.
3. Time Complexity Analysis of Algorithms
Many different algorithms can solve the same problem, and an important criterion for comparing and evaluating them is the algorithm's time complexity. Now consider the execution time of the insertion sort algorithm above. Following convention, the input length len is written as n below. Assume that the execution times of the five statements in the loop are the constants c1, c2, c3, c4 and c5, respectively:
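Annotating the loop of the sketch above (with the print statement omitted), one possible assignment of the constants is:

```c
	for (j = 1; j < LEN; j++) {
		key = a[j];		/* c1 */
		i = j - 1;		/* c2 */
		while (i >= 0 && a[i] > key) {
			a[i+1] = a[i];	/* c3 */
			i--;		/* c4 */
		}
		a[i+1] = key;		/* c5 */
	}
```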
Obviously, the outer for loop executes n-1 times. Assuming the inner while loop body executes m times, the total execution time is roughly (n-1)*(c1+c2+c5+m*(c3+c4)). Of course, the assignments and tests in the parentheses after for and while also take time, for which I have not introduced constants; this does not affect our rough estimate.
There is a problem here: m is not a constant, and it does not depend on the input length n alone, but on the specific input data. In the best case, the original data of array a is already sorted, the while loop body never executes, and the total execution time is (c1+c2+c5)*n-(c1+c2+c5). This can be written in the form an+b: a linear function of n. What about the worst case? The worst case is that the original data of array a happens to be sorted from large to small; then in the j-th iteration the while loop body executes j-1 times. Substituting this for m and summing over j (the sum of j-1 for j from 1 to n-1 is n(n-1)/2) shows that the total execution time is a quadratic function of n.
The best case and the worst case are both rare for the original data of array a; if the original data is random, we have what is called the average case. If the original data is random, then each iteration compares the sorted subsequence a[0..j-1] with the newly inserted element key, and on average half of the elements of the subsequence are larger than key while the other half are smaller, so we can substitute m=(j-1)/2 above and work out the execution time. The final conclusion: in both the worst case and the average case, the total execution time is a quadratic function of n.
When analyzing the time complexity of an algorithm, we care more about the worst case than the best case, for the following reasons:
- The worst case gives an upper bound on the algorithm's execution time: we can be sure that, no matter what input is given, the execution time will not exceed this bound, which provides a baseline for comparison and analysis.
- For some algorithms, the worst case is the most common case. For example, in an algorithm that searches a database for a particular piece of information, the worst case is that the information is not in the database and the search fails, and some applications frequently search for information that does not exist in the database.
- Although the worst case is a pessimistic estimate, for many problems the average-case time complexity is close to the worst case. In the insertion sort example, the average-case and worst-case time complexities are both quadratic functions of the input length n.
Comparing the values of two polynomials a₁n+b₁ and a₂n²+b₂n+c₂ (with n a positive integer), we can conclude that the highest power of n is the most important determining factor, while the constant term, the lower-order terms and the coefficients are secondary. For example, compare 100n+1 and n²+1: although the latter has the smaller coefficient, and for small n the former has the larger value, once n>100 the latter far exceeds the former. If the same problem can be solved by two algorithms, one with linear time complexity and the other with quadratic time complexity, then when the input length of the problem is large enough, the former is clearly better than the latter. Therefore we can express the time complexity of an algorithm in a coarser way, omitting the coefficients and the lower-order terms: a linear function an+b is written Θ(n), and a quadratic function an²+bn+c is written Θ(n²). Θ(g(n)) denotes the class of functions of the same order of magnitude as g(n). For example, every quadratic function f(n) is of the same order as g(n)=n² and can be written Θ(n²); even some functions that are not quadratic, such as 2n²+3lg n, are of the same order as n². The concept of "the same order of magnitude" can be illustrated with a figure (the figure comes from [Introduction to Algorithms]):
If we can find two positive constants c₁ and c₂ such that, when n is large enough (that is, when n≥n₀), f(n) is always sandwiched between c₁g(n) and c₂g(n), then f(n) and g(n) are of the same order of magnitude, and f(n) can be written as Θ(g(n)). Take a quadratic function as an example, say n²/2-3n. To prove that it belongs to the set Θ(n²), we must determine c₁, c₂ and n₀ (constants that do not change with n) such that for all n≥n₀, c₁n² ≤ n²/2-3n ≤ c₂n² always holds. To find them, we divide each side of the inequality by n² and get c₁ ≤ 1/2-3/n ≤ c₂. Looking at the graph of the function 1/2-3/n:
It is easy to see that, no matter how large n is, the function 1/2-3/n is always less than 1/2, so we can take c₂=1/2. When n=6 the function's value is 0, and for n>6 it is greater than 0, so we can take n₀=7 and c₁=1/14; then 1/14 ≤ 1/2-3/n ≤ 1/2 holds for all n≥7. This proof shows that when n is large enough, n²/2-3n is always sandwiched between c₁n² and c₂n². Relative to the n² term, the influence of the lower-order part bn+c can be neglected, and the coefficient a can be compensated for by choosing suitable c₁ and c₂.
Several common time-complexity functions, ordered from smaller to larger orders of magnitude, are: Θ(lg n) < Θ(√n) < Θ(n) < Θ(n·lg n) < Θ(n²) < Θ(n³) < Θ(2ⁿ) < Θ(n!).
Here lg n usually denotes the base-10 logarithm of n; but since logarithms with different bases differ only by a constant factor, Θ(log₂n) and Θ(log₁₀n) are no different, and in algorithm analysis lg n usually denotes the base-2 logarithm of n.
Besides Θ-notation, big-O notation is also commonly used to express the time complexity of an algorithm. We know that insertion sort is Θ(n²) in the worst case and in the average case, while in the best case it is Θ(n), a smaller order of magnitude; summing up all the cases, the time complexity of insertion sort can be written O(n²). The meaning of Θ is similar to "equals", while the meaning of big O is similar to "less than or equal to".
4. Merge Sort
The insertion sort algorithm takes an incremental (Incremental) strategy: each step adds one element to the sorted subsequence, gradually sorting the entire array, with time complexity Θ(n²). Below is another typical sorting algorithm, merge sort, which takes a divide-and-conquer (Divide-and-Conquer) strategy and has better time complexity than insertion sort. The steps of merge sort are as follows:
1) Divide: divide the input sequence of length n into two subsequences of length n/2 each.
2) Conquer: sort the two subsequences recursively with merge sort.
3) Combine: merge the two sorted subsequences into the final sorted sequence.
Since describing the steps of merge sort involves calling merge sort itself, it is a recursive process.
The merge sort program:
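As above, the original listing is lost; the following is a sketch under the same assumptions. The sample data { 5, 2, 4, 7, 1, 3, 2, 6 }, the scratch array b, and the print format are illustrative choices:

```c
#include <stdio.h>

#define LEN 8
int a[LEN] = { 5, 2, 4, 7, 1, 3, 2, 6 };	/* sample data, assumed */
int b[LEN];					/* scratch space for merging */

void merge(int start, int mid, int end)
{
	int i = start, j = mid + 1, k = start;

	/* repeatedly take the smaller of the two head elements */
	while (i <= mid && j <= end) {
		if (a[i] <= a[j])
			b[k++] = a[i++];
		else
			b[k++] = a[j++];
	}
	/* one subsequence is exhausted: copy the rest of the other */
	while (i <= mid)
		b[k++] = a[i++];
	while (j <= end)
		b[k++] = a[j++];
	/* copy the merged result back into a[start..end] */
	for (k = start; k <= end; k++)
		a[k] = b[k];
}

void sort(int start, int end)
{
	int mid;
	if (start < end) {
		mid = (start + end) / 2;
		printf("sort  a[%d..%d], a[%d..%d]\n", start, mid, mid + 1, end);
		sort(start, mid);
		sort(mid + 1, end);
		merge(start, mid, end);
		printf("merge a[%d..%d], a[%d..%d]\n", start, mid, mid + 1, end);
	}
}

int main(void)
{
	sort(0, LEN - 1);
	return 0;
}
```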
The execution results are:
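Under the assumptions of the sketch above (including its print format), the trace is:

```
sort  a[0..3], a[4..7]
sort  a[0..1], a[2..3]
sort  a[0..0], a[1..1]
merge a[0..0], a[1..1]
sort  a[2..2], a[3..3]
merge a[2..2], a[3..3]
merge a[0..1], a[2..3]
sort  a[4..5], a[6..7]
sort  a[4..4], a[5..5]
merge a[4..4], a[5..5]
sort  a[6..6], a[7..7]
merge a[6..6], a[7..7]
merge a[4..5], a[6..7]
merge a[0..3], a[4..7]
```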
The sort function divides a[start..end] evenly into two subsequences, a[start..mid] and a[mid+1..end], recursively calls itself on the two subsequences, and then calls the merge function to merge the two now-sorted subsequences. Since the two subsequences are already sorted, the merging process is very simple: each iteration compares the head elements of the two subsequences and takes the smaller one out into the final sorted sequence; when the elements of one subsequence are exhausted, the remaining elements of the other subsequence are appended to the final sorted sequence. To make the program easier to understand, I inserted print statements at the beginning and end of the sort function; the calling process is shown in the figure below:
In the figure, s represents the sort function and m represents the merge function, and the whole control flow is called and returns along the directions indicated by the dashed lines. Because the sort function recursively calls itself twice, the calling relationship between the functions has a tree-like structure. This diagram is drawn only to show the process of merge sort more clearly; the reader must not try to understand a recursive function by tracing it completely from the start, but should instead grasp the base case and the recursive step.
Merge sort is a better algorithm than insertion sort. Although the merge function has many steps and so introduces larger constants, coefficients, and low-order terms, for a large input length n these are not the main factors: merge sort is Θ(n·lg n), while insertion sort is Θ(n²) on average (intuitively, the recursion divides the array through about lg n levels, and the merging at each level takes Θ(n) time in total), and this determines that merge sort is the faster algorithm.
5. Linear Search
Some problems can be solved with very simple algorithms. For example, write an indexof function that finds, in an arbitrary input string, the position of the first occurrence of a given letter and returns that position, returning -1 if the letter is not found:
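A minimal sketch of such a function (the parameter names and the test strings are assumptions):

```c
#include <stdio.h>

/* return the index of the first occurrence of letter in str, or -1 */
int indexof(const char *str, char letter)
{
	int i;
	for (i = 0; str[i] != '\0'; i++)
		if (str[i] == letter)
			return i;
	return -1;
}

int main(void)
{
	printf("%d %d\n", indexof("hello world", 'o'), indexof("hello world", 'z'));
	return 0;
}
```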
6. Binary Search
If the array is already sorted, we can search much faster: compare the target with the middle element, and depending on the comparison, rule out half of the array at once. The idea of "narrowing the search range by half each time" is called binary search (Binary Search):
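A sketch of binary search over a sorted global array, mirroring the setup used earlier in this chapter (the data shown is assumed):

```c
#include <stdio.h>

#define LEN 8
int a[LEN] = { 1, 2, 2, 2, 5, 6, 8, 9 };	/* must be sorted in ascending order */

/* return an index i with a[i] == number, or -1 if number is not in a */
int binarysearch(int number)
{
	int mid, start = 0, end = LEN - 1;

	while (start <= end) {
		mid = (start + end) / 2;
		if (a[mid] < number)
			start = mid + 1;	/* number can only be in the right half */
		else if (a[mid] > number)
			end = mid - 1;		/* number can only be in the left half */
		else
			return mid;
	}
	return -1;
}

int main(void)
{
	printf("%d %d\n", binarysearch(5), binarysearch(7));
	return 0;
}
```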
Generally speaking, there is a contract (Contract) between the caller of a function (the Caller) and the function being called (the Callee). Before the function is called, the Caller must fulfill certain obligations toward the Callee, for example guaranteeing that array a is sorted and that a[start..end] are valid array elements with no out-of-bounds access; this is called the Precondition. During its execution the Callee maintains (Maintenance) certain invariants, and these invariants guarantee that, at the end, the Callee can fulfill certain obligations toward the Caller, for example guaranteeing that "if number exists in array a, its position is found and returned; if number does not exist in array a, -1 is returned"; this is called the Postcondition. If the documentation of every function clearly records its Precondition, Maintenance and Postcondition, then every function can be written and tested independently, and the whole system becomes easy to maintain.
To test whether a function is correct, the Precondition, Maintenance, and Postcondition all need testing. Take the binarysearch function: even if the function itself is written perfectly correctly, both maintaining the invariants and guaranteeing the Postcondition, if the Caller that calls it does not guarantee the Precondition, the final result will still be wrong. We write two predicate functions for testing, and insert the relevant tests into the binary search function:
assert is a macro defined in the header file assert.h. When execution reaches assert(is_sorted()), if the return value of is_sorted() is true, it is as if nothing happened and execution continues; if the return value of is_sorted() is false (for example, because the caller disturbed the order of the array), an error is reported and the program exits:
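The exact message is platform-dependent; with glibc, a failed assertion looks something like this (the program name, file name, and line number are illustrative):

```
a.out: main.c:27: binarysearch: Assertion `is_sorted()' failed.
Aborted (core dumped)
```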
Using assertions in appropriate places in the code can help us test the program effectively.
The test code is useful only during development and debugging; if released software still had to run all this test code, performance could suffer seriously. Therefore, C specifies that if the macro NDEBUG (meaning No Debug) is defined before assert.h is included, the assert macro definition in assert.h is disabled and the asserts in the code have no effect:
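```c
#define NDEBUG		/* must come before assert.h to disable assert() */
#include <assert.h>
```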
There is another way that does not require modifying the source file: pass the option -DNDEBUG at compile time, which is equivalent to defining the NDEBUG macro at the beginning of the file:
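For example (main.c is a placeholder file name):

```
$ gcc -DNDEBUG main.c
```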