This post mainly introduces the principle of the Adaptive Boosting (AdaBoost) algorithm. Because the algorithm is built on classifiers, and classifiers are usually based on sample features, the post presents AdaBoost in combination with the most commonly used Haar-like features. This combination is widely used in face detection and has since been applied to other related object-detection tasks.
1. The AdaBoost Algorithm
1.1 The predecessor of the AdaBoost algorithm
Before getting to know AdaBoost, let us first look at the boosting algorithm. For a yes-or-no question, random guessing already achieves 50% accuracy. If a method can achieve accuracy only slightly higher than random guessing, the process of obtaining that method is called weak learning; if a method can raise the accuracy significantly, the process of obtaining it is called strong learning. "Weak" and "strong" are vivid descriptions of these two processes. In 1994, Kearns and Valiant proved that, in Valiant's PAC (Probably Approximately Correct) model, a weak learning algorithm can be boosted to arbitrary precision through combination as long as enough data is available. In fact, as early as 1990 Schapire constructed the first polynomial-time algorithm that promotes a weak learning algorithm to a strong one; this was the original boosting algorithm. "Boosting" means lifting or strengthening, and nowadays it generally refers to the class of algorithms that improve weak learning into strong learning. In 1993, Drucker and Schapire used neural networks as weak learners for the first time and applied the boosting algorithm to a practical problem. As noted above, the result that weak learning can be boosted to arbitrary precision was proved by Kearns and Valiant in 1994. So although the boosting method was proposed in 1990, it truly matured and took off only after 1994.
1.2 The emergence of the AdaBoost algorithm
In 1995, Freund and Schapire proposed the AdaBoost algorithm, a great improvement on the boosting algorithm. AdaBoost is a representative algorithm of the boosting family; its full name is Adaptive Boosting. "Adaptive" means that the algorithm adjusts the assumed error rate according to feedback from the weak learner, so AdaBoost does not need to know a lower bound on the error rate in advance. Because of this, it requires no prior knowledge about the performance of the weak learner, yet it is as efficient as the boosting algorithm, so it has been widely used.
1.3 The principle of the AdaBoost algorithm
A traditional boosting algorithm has to solve two problems: (1) for the same training set, how to change the sample distribution so that the set can be reused for repeated training; (2) how to combine the weak classifiers organically. For both problems, AdaBoost gives an "adaptive" answer.

First, for the same training set, AdaBoost assigns a weight to each sample and changes these weights in each round according to the classification results, thereby obtaining different sample distributions over the same set. Concretely, each training sample carries a weight indicating its importance: a sample with a larger weight has a greater chance of being classified correctly in that round, so the samples that each round of training focuses on differ, which achieves the goal of different distributions over the same sample set. The weight update depends on how the weak learner classifies the samples of the current round; specifically, the weights of samples misclassified by the previous weak classifier are increased and the weights of correctly classified samples are decreased, so that the next weak classifier pays more attention to the misclassified samples. In this way the classification problem is handled by the weak classifiers in a "divide and conquer" manner.

Second, the weak classifiers are combined by weighted majority voting. Specifically, a weak classifier with a small classification error rate receives a larger combination weight, so it has greater "influence" in the vote, while a weak classifier with a large error rate receives a smaller combination weight. In this way the weak classifiers, each of which focused on different characteristics of different samples during training, are weighted by their classification error rates and combined into a final classifier with much stronger classification performance (the strong classifier). As can be seen, the AdaBoost algorithm is able to pick out the features that matter most for classification and to concentrate on the key training data.

The AdaBoost algorithm is described below.

Input: a data set D = {(x1, y1), (x2, y2), ..., (xn, yn)}, where yi = 0, 1 marks negative and positive samples respectively; the number of learning rounds is T.

Procedure:
1. Initialize the sample weights: for samples with yi = 0, 1 initialize the weights to ω1,i = 1/m, 1/l respectively, where m and l are the numbers of negative and positive samples.
2. For t = 1, ..., T:
① Normalize the weights: ωt,i ← ωt,i / Σj ωt,j
② For each feature j, train a weak classifier hj (how a weak classifier is trained is discussed later) and compute its weighted error rate over all samples: εj = Σi ωi·|hj(xi) − yi|
③ Among the weak classifiers obtained in ②, pick the weak classifier ht with the smallest error εt.
④ Update the weight of every sample: ωt+1,i = ωt,i·βt^(1−ei)
Here ei = 0 if sample xi is classified correctly and ei = 1 otherwise, and βt = εt / (1 − εt).
3. The final strong classifier is: C(x) = 1 if Σt=1..T αt·ht(x) ≥ (1/2)·Σt=1..T αt, and C(x) = 0 otherwise,
where αt = log(1/βt).

Before training a classifier with the AdaBoost algorithm, positive and negative samples must be prepared, and the feature set selected and constructed according to the characteristics of the samples. The training procedure above shows that when a weak classifier classifies a sample correctly its weight is decreased, and when it misclassifies a sample the weight is increased, so that subsequent classifiers strengthen the training on the misclassified samples. Finally all weak classifiers are combined into a strong classifier, which decides by comparing the weighted sum of the weak classifiers' votes with half of the total vote weight.
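To make the procedure concrete, here is a minimal Python sketch of the training loop, under the assumption that every weak classifier is a simple threshold ("stump") on one precomputed feature column; the function names and the brute-force threshold search are mine, not from any particular library:

```python
import numpy as np

def train_weak_classifier(f, y, w):
    """Brute-force search for the threshold and parity that minimize the
    weighted error of a decision stump on one feature column f."""
    best = (None, None, np.inf)                 # (threshold, parity, error)
    for theta in np.unique(f):
        for parity in (+1, -1):
            pred = (parity * f < parity * theta).astype(int)
            err = np.sum(w * np.abs(pred - y))
            if err < best[2]:
                best = (theta, parity, err)
    return best

def adaboost_train(F, y, T):
    """F: (n_samples, n_features) matrix of precomputed feature values.
    y: 0/1 labels.  Returns a list of (feature_index, theta, parity, alpha)."""
    n, n_feat = F.shape
    m, l = np.sum(y == 0), np.sum(y == 1)
    w = np.where(y == 0, 1.0 / m, 1.0 / l)       # step 1: 1/m for negatives, 1/l for positives
    strong = []
    for t in range(T):
        w = w / w.sum()                          # ① normalize the weights
        best = (None, None, None, np.inf)
        for j in range(n_feat):                  # ② one stump per feature
            theta, parity, err = train_weak_classifier(F[:, j], y, w)
            if err < best[3]:
                best = (j, theta, parity, err)   # ③ keep the lowest-error stump
        j, theta, parity, eps = best
        beta = max(eps, 1e-10) / (1.0 - eps)     # guard against eps == 0
        pred = (parity * F[:, j] < parity * theta).astype(int)
        e = (pred != y).astype(int)
        w = w * beta ** (1 - e)                  # ④ update the sample weights
        strong.append((j, theta, parity, np.log(1.0 / beta)))
    return strong

def adaboost_predict(strong, f_row):
    """Strong classifier: weighted vote compared with half the total vote weight."""
    votes = sum(a * int(p * f_row[j] < p * th) for j, th, p, a in strong)
    return int(votes >= 0.5 * sum(a for _, _, _, a in strong))
```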
2. Cascade Classifiers
In practical applications a single strong classifier often cannot solve a complex classification problem accurately while achieving both a high detection rate and a low false-detection rate, so a cascade of strong classifiers (a cascade classifier) is usually used. The cascade strategy arranges a number of strong classifiers from simple to complex; each stage is trained to have a very high detection rate, while the requirement on its false-detection rate can be relaxed. Assuming each stage has a 99.9% detection rate and a 50% false-detection rate, a 15-stage cascade achieves a detection rate of 0.999^15 ≈ 0.9851 and a false-detection rate of 0.5^15 ≈ 0.00003, which meets practical requirements.
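A quick check of these numbers, using the per-stage rates assumed above:

```python
stages = 15
detection_rate = 0.999 ** stages         # ≈ 0.9851
false_positive_rate = 0.5 ** stages      # ≈ 0.0000305
print(detection_rate, false_positive_rate)
```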
During training, each stage of the cascade is trained with the AdaBoost algorithm. For the first stage, the training data is the whole training sample set; a high detection rate is specified and the false-detection rate only has to be better than random, so only a small number of highly effective features are needed to meet the requirement. For the second stage, the negative samples of the training data are replaced by those original negative samples that the first stage misclassified (its false positives), so each following stage is trained on the samples the previous stages found difficult and therefore uses somewhat more features and weak classifiers. Continuing in this way yields the final cascade of strong classifiers arranged from simple to complex.
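This stage-by-stage bootstrapping of negatives might look like the following sketch, reusing the hypothetical `adaboost_train` and `adaboost_predict` helpers from the earlier sketch:

```python
def train_cascade(F_pos, F_neg, n_stages, rounds_per_stage):
    """F_pos / F_neg: feature matrices of positive / negative samples.
    Each stage is an AdaBoost strong classifier; the negatives a stage still
    accepts (its false positives) become the negative set of the next stage."""
    cascade = []
    neg = F_neg
    for s in range(n_stages):
        F = np.vstack([F_pos, neg])
        y = np.concatenate([np.ones(len(F_pos), dtype=int),
                            np.zeros(len(neg), dtype=int)])
        stage = adaboost_train(F, y, rounds_per_stage)
        cascade.append(stage)
        # keep only the negatives this stage misclassifies as positive
        neg = np.array([row for row in neg if adaboost_predict(stage, row) == 1])
        if len(neg) == 0:                        # no hard negatives left
            break
    return cascade
```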
When an input image is detected, multi-position and multi-scale scanning of the image is generally required. Multi-position scanning means translating the detection sub-window so that every region of the image is examined. Since the positive samples used in training are normalized to a fixed size, multi-scale detection is needed to find targets that are larger than the training sample size. There are generally two strategies for multi-scale detection. One keeps the sub-window size fixed and repeatedly scales the image; obviously this requires rescaling the image and recomputing the feature values each time, so it is not efficient. The other keeps the image fixed and repeatedly enlarges the detection window, starting from the training sample size, which avoids the weakness of the first method. However, when the window is enlarged, the same target may be detected several times, so the detected regions have to be merged.
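The second strategy, enlarging the window, might be sketched as follows; the step size and scale factor here are illustrative choices, not values from the text:

```python
def sliding_windows(img_w, img_h, base=24, scale=1.25, step_frac=0.1):
    """Yield (x, y, size) sub-windows at all positions and scales,
    starting from the training sample size `base` and enlarging by `scale`."""
    size = base
    while size <= min(img_w, img_h):
        step = max(1, int(size * step_frac))     # shift proportional to window size
        for y in range(0, img_h - size + 1, step):
            for x in range(0, img_w - size + 1, step):
                yield x, y, size
        size = int(size * scale)
```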
Whichever search method is used, a large number of sub-window images are sampled from the image. Each sub-window first passes through the first stage of the cascade; only the sub-windows judged positive by a stage enter the next stage, while the rest are discarded as non-target regions. Only a region that passes all of the cascaded strong classifiers and is judged positive becomes a finally detected target region. The detection process of the cascade classifier is illustrated below:
Figure 1: Cascading classifier detection
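In code, the per-window filtering could look like this minimal sketch, again reusing the hypothetical `adaboost_predict` and `sliding_windows` helpers from above, with `features_of` standing in for whatever routine extracts the feature vector of a window:

```python
def detect(img, cascade, features_of):
    """Return the sub-windows that survive every stage of the cascade."""
    h, w = img.shape                              # grayscale image assumed
    hits = []
    for x, y, size in sliding_windows(w, h):
        f_row = features_of(img, x, y, size)      # feature vector of this window
        # all() short-circuits, so a window is dropped at the first stage it fails
        if all(adaboost_predict(stage, f_row) == 1 for stage in cascade):
            hits.append((x, y, size))             # passed all stages: a detection
    return hits
```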
3. Haar-like Features
The most commonly used Haar-like features include the original rectangle features that Paul Viola and Michael Jones used in face detection, first presented by Papageorgiou C., and the extended rectangle features proposed by Rainer Lienhart and Jochen Maydt.
Figure 2: Haar-like features
The Haar-like feature value is computed by placing the rectangle template on the image and taking the sum of all pixel values covered by the black rectangles minus the sum of all pixel values covered by the white rectangles. Haar-like features can effectively capture the texture of an image, and each template generates feature values at different positions and scales by translating and scaling. The number of Haar-like features is therefore huge: for a given W×H image, the number of rectangle features generated by one template is
XY·(W + 1 − w(X + 1)/2)·(H + 1 − h(Y + 1)/2),
where w×h is the size of the feature template and X = ⌊W/w⌋, Y = ⌊H/h⌋ are the maximum numbers of times the template can be scaled in the horizontal and vertical directions. A corresponding formula exists for the 45-degree rotated features.
The derivation of this formula is not easy to follow at first. Here is my understanding.
First, two points need to be understood:
1. A feature in Figure 2 can itself be scaled horizontally and vertically: horizontally in multiples of w and vertically in multiples of h. The aspect ratio after scaling may therefore differ from the original aspect ratio of the feature, but the width and height remain integer multiples of the original width and height. Hence a w×h rectangle in Figure 2 has X×Y possible scalings.
2. A feature rectangle in Figure 2 and each of its scaled versions give different Haar features at different positions in the image, so the template also has to slide over the window. For example, feature 1(a) in Figure 2 has size 2×1; in a 24×24 image it can slide to 23 positions horizontally and 24 positions vertically, so at this scale there are 23×24 features.
With these two points understood, how do we get to the formula without the rather convoluted original derivation?
Here I quote a relatively simple and clear derivation from a fellow netizen:
The formula can be obtained by counting. Because the width and height of a Haar feature box are scaled independently (there is no constraint that they scale proportionally), the choices along the two sides are independent, so the counts along each side can simply be multiplied. Take the height as an example: the feature box height is h and the training image height is H. Then:
1) Feature box scaled 1× (not enlarged): there are (H − h + 1) positions.
2) Feature box scaled 2× (only the height side is enlarged, likewise below): there are (H − 2h + 1) positions.
3) Feature box scaled 3×: there are (H − 3h + 1) positions.
... and so on, up to a scaling factor of H/h.
4) Feature box scaled (H/h)×: there is (H − (H/h)·h + 1) position, i.e. exactly 1 when h divides H.
Adding all of the above:
(H − h + 1) + (H − 2h + 1) + (H − 3h + 1) + ... + (H − (H/h)·h + 1) = (H/h)(H + 1) − h·(H/h)·(1 + H/h)/2.
Letting Y = H/h, this becomes Y·(H + 1 − h(1 + Y)/2). Treating the width in the same way gives X·(W + 1 − w(1 + X)/2) with X = W/w. Since the choices along the two sides are independent, the total number is the product of the two terms, which is exactly the formula given above.
Reading this explanation gave me a sudden feeling of enlightenment! Clearly, once the form of a feature is defined, the number of rectangle features depends only on the size of the sub-window. A 24×24 detection window already produces more than one hundred thousand features, and computing all of these feature values is very expensive.
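As a sanity check on the formula, the following sketch counts the features generated by five basic upright templates in a 24×24 window, both with the closed form and by brute-force enumeration; the template list is an illustrative choice covering the common two-, three- and four-rectangle upright features:

```python
def count_closed_form(W, H, w, h):
    """Number of positions and scales of a w*h template in a W*H window."""
    X, Y = W // w, H // h
    return int(X * Y * (W + 1 - w * (X + 1) / 2) * (H + 1 - h * (Y + 1) / 2))

def count_brute_force(W, H, w, h):
    """Enumerate every integer scaling (i, j) and every placement."""
    total = 0
    for i in range(1, W // w + 1):
        for j in range(1, H // h + 1):
            total += (W - i * w + 1) * (H - j * h + 1)
    return total

templates = [(2, 1), (1, 2), (3, 1), (1, 3), (2, 2)]   # basic upright Haar templates
for w, h in templates:
    assert count_closed_form(24, 24, w, h) == count_brute_force(24, 24, w, h)
print(sum(count_closed_form(24, 24, w, h) for w, h in templates))   # 162336
```

For these five templates the total comes out to 162,336 features in a 24×24 window, consistent with the "more than one hundred thousand" figure above.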
Because the number of features is so large, computing them quickly is essential. Paul Viola and Michael Jones proposed the integral image to achieve fast feature computation. Each point of the constructed integral image stores the sum of all pixel values above and to its left, namely:
SAT(x, y) = Σ_{x'≤x, y'≤y} I(x', y')
where I(x, y) is the pixel value at position (x, y) of the image. Obviously, the integral image can be built incrementally by the recurrence:
SAT(x, y) = SAT(x, y − 1) + SAT(x − 1, y) − SAT(x − 1, y − 1) + I(x, y)
with the boundary initialized to SAT(−1, y) = SAT(x, −1) = SAT(−1, −1) = 0.
You can try drawing a small grid on a piece of paper, take for example x = 2, y = 2, compute SAT(x, y) with the formula above and compare it with what you get by summing directly on paper; it will suddenly become clear. Once the integral image has been obtained, the sum of the pixels in any rectangular region can be computed with only four lookups and additions/subtractions, as shown in Figure 3.
Figure 3: Computing the pixel sum of a rectangular region with the integral image
Assuming the four vertices of region D are a (top-left), b (top-right), c (bottom-left) and d (bottom-right), the sum of the pixels inside D is:
sum(D) = SAT(d) − SAT(b) − SAT(c) + SAT(a)
It can be seen that the integral image greatly accelerates the computation of feature values.
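A minimal NumPy sketch of the integral image and of evaluating one two-rectangle Haar feature with it; the specific feature layout in `haar_two_rect_horizontal` is just an illustrative example:

```python
import numpy as np

def integral_image(img):
    """SAT(x, y) = sum of img[0..y, 0..x]; built with two cumulative sums."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(sat, x, y, w, h):
    """Sum of pixels in the rectangle with top-left (x, y), width w, height h,
    using four lookups: sum(D) = SAT(d) - SAT(b) - SAT(c) + SAT(a)."""
    a = sat[y - 1, x - 1] if x > 0 and y > 0 else 0
    b = sat[y - 1, x + w - 1] if y > 0 else 0
    c = sat[y + h - 1, x - 1] if x > 0 else 0
    d = sat[y + h - 1, x + w - 1]
    return d - b - c + a

def haar_two_rect_horizontal(sat, x, y, w, h):
    """Example two-rectangle feature: left (black) half minus right (white) half."""
    left = rect_sum(sat, x, y, w // 2, h)
    right = rect_sum(sat, x + w // 2, y, w // 2, h)
    return left - right

img = np.arange(36, dtype=np.int64).reshape(6, 6)
sat = integral_image(img)
print(rect_sum(sat, 1, 1, 3, 2) == img[1:3, 1:4].sum())   # True
print(haar_two_rect_horizontal(sat, 0, 0, 4, 4))
```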
4. How does the AdaBoost algorithm combine with Haar-like features?
The introduction of the AdaBoost algorithm above only explained how to train a strong classifier, and the strong classifier is in fact composed of a number of weak classifiers. How is a weak classifier trained?
4.1 Weak classifiers
Once the number of rectangle features in the training sub-window and their feature values have been determined, a weak classifier h(x, f, p, θ) has to be trained for each feature f. Such a weak classifier outputs 1 when p·f(x) < p·θ and 0 otherwise, where θ is a threshold and p is a parity indicating the direction of the inequality. Training the weak classifier therefore amounts to choosing the threshold: the feature values of all samples are sorted, and by scanning this sorted table from beginning to end we can select the threshold (the optimal threshold) that gives the weak classifier the smallest classification error, i.e. obtain the best weak classifier for this feature.
Figure 4: Training and selecting the best classifier algorithm
A special note: when the training samples are prepared, they need to be normalized to a specified size and converted to grayscale, so that every sample is a grayscale image of the same size. This guarantees that each Haar feature (which describes a position inside the window) exists in every sample.
For the rectangle features in this algorithm, the feature value f(x) used by a weak classifier is simply the value of the rectangle feature on sample x. Because the size of the training samples equals the size of the detection sub-window, and the sub-window size determines the number of rectangle features, every sample in the training set has the same features in the same number, and a given feature has one fixed value on a given sample.
In this way, for each Haar feature we traverse all the positive and negative samples, compute the corresponding feature value on every image, and sort these feature values from small to large. Then, under the current distribution of sample weights, the optimal threshold for f is determined so that this weak classifier (feature f) has the lowest classification error over all training samples.
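One common way to do this single-pass threshold search over the sorted feature values could be sketched as follows; the helper name and return convention are mine, not from the text:

```python
import numpy as np

def best_threshold(f, y, w):
    """Single pass over the sorted values of one feature.
    At each candidate split the weighted error is the smaller of
      S_neg + (T_pos - S_pos): samples below the threshold called positive,
      S_pos + (T_neg - S_neg): samples below the threshold called negative,
    where T_* are total weights and S_* are weights seen so far."""
    order = np.argsort(f)
    f, y, w = f[order], y[order], w[order]
    T_pos, T_neg = w[y == 1].sum(), w[y == 0].sum()
    S_pos = S_neg = 0.0
    best_err, best_theta, best_parity = np.inf, f[0] - 1.0, 1
    for i in range(len(f)):
        err_pos_below = S_neg + (T_pos - S_pos)   # parity +1: f(x) < theta means positive
        err_neg_below = S_pos + (T_neg - S_neg)   # parity -1: f(x) < theta means negative
        err = min(err_pos_below, err_neg_below)
        if err < best_err:
            best_err = err
            best_theta = f[i]
            best_parity = 1 if err_pos_below < err_neg_below else -1
        if y[i] == 1:
            S_pos += w[i]
        else:
            S_neg += w[i]
    return best_theta, best_parity, best_err
```

In practice this single scan would replace the brute-force stump search used in the earlier AdaBoost sketch, since it finds the optimal threshold for one feature in one pass after sorting.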
Of course, not every Haar feature can separate positive from negative samples well, and a single Haar feature used as a classifier has a fairly high error rate. However, as mentioned earlier, many weak classifiers are combined into a strong classifier: although a single weak classifier has a high error rate, after layer-by-layer filtering the error rate can still be reduced enough to meet practical requirements. The only requirement on a weak learner is that it separates positive and negative samples with an error rate slightly below 50%.
According to this requirement, in each training round all the rectangle features (weak classifiers) with an error rate below 50% can be found. Choosing the best weak classifier means selecting, among all weak classifiers, the one with the lowest classification error over all the training samples. In the AdaBoost algorithm, iterating T times selects T best weak classifiers, which are finally combined into a strong classifier in the way described above.
5. Summary
This post first described the history and principle of the AdaBoost algorithm, which leads to how a strong classifier is trained. Then, based on the requirements of practical applications, the cascade classifier strategy was introduced, in which every stage of the cascade is trained with the AdaBoost algorithm. Next we discussed how the weak classifiers in the AdaBoost idea of "combining multiple weak classifiers into a strong classifier" are trained, which brought in the Haar-like features. The post introduced the Haar features, how to find all the Haar features of a sub-window of a given size, and, given their huge number, the integral image method for computing Haar feature values quickly. Finally it explained the method and process of training weak classifiers and strong classifiers with the AdaBoost algorithm combined with Haar-like features. Inevitably some material is referenced and borrowed from others, and those sources were very helpful to me in learning this topic. I wrote this post only to deepen my own understanding and to record a few points worth noting, nothing more; if it can also help friends who are learning this topic, that is the icing on the cake.