AdaBoost Algorithm Combined with Haar-like Features

I. Characteristics of Haar-like Features
The most commonly used Haar-like features include the original rectangular features that Paul Viola and Michael Jones used for face detection, which were first introduced by Papageorgiou et al., and the extended set of rectangular features later proposed by Rainer Lienhart and Jochen Maydt.
Figure 1. Haar-like Features
The value of a Haar-like feature is obtained by summing all pixel values inside the black rectangles of the template and subtracting the sum of all pixel values inside the white rectangles. Haar-like features can effectively capture the texture of an image, and each template yields feature values at different positions and scales through translation and scaling. The number of Haar-like features is therefore huge: for a given W x H image, the number of upright rectangular features generated by one template is

X * Y * (W + 1 - w(X + 1)/2) * (H + 1 - h(Y + 1)/2)

where w x h is the size of the feature template and X = W/w, Y = H/h (rounded down) are the maximum factors by which the template can be scaled in the horizontal and vertical directions. For the 45-degree rotated features, Lienhart and Maydt give the count as

X * Y * (W + 1 - z(X + 1)/2) * (H + 1 - z(Y + 1)/2), with z = w + h.
The derivation of this formula is rather difficult to understand. Here's what I understand:
First you need to know two points:
1. A feature in Figure 2 can be scaled independently in the horizontal and vertical directions: the width is scaled by an integer multiple of w and the height by an integer multiple of h. The aspect ratio after scaling may therefore differ from the original aspect ratio, but the scaled width and height always remain integer multiples of the original width and height. So for a w*h template there are X*Y = (W/w)*(H/h) possible scalings.
2. A feature rectangle (or one of its scaled versions) placed at different positions in the image gives different Haar features, so the template must be slid across the detection window. For example, the feature 1(a) in Figure 2 has size 2*1; in a 24*24 image it can be placed at 23 horizontal positions and 24 vertical positions, giving 23*24 features.
Once you understand these two points, can you work your way through the derivation of the original formula?
Here I quote a netizen's comparatively simple and clear derivation:
This formula can be derived by counting. Because the Haar feature box places no constraint linking its width and height, the choices along the two sides are independent of each other (I am not sure how best to describe this; if it is unclear, just follow the calculation steps). Taking the height as an example, let the feature box height be h and the training image height be H. Then:
1) Feature box scaled by a factor of 1 (i.e., not scaled): (H - h + 1) features
2) Feature box scaled by a factor of 2: (H - 2h + 1) features
3) Feature box scaled by a factor of 3: (H - 3h + 1) features
...and so on, until the box is scaled by a factor of H/h:
4) Feature box scaled by a factor of H/h: 1 feature, i.e. (H - (H/h)h + 1) features
Adding all of these up:
(H - h + 1) + (H - 2h + 1) + (H - 3h + 1) + ... + (H - (H/h)h + 1) = (H/h)(H + 1) - h(H/h)(1 + H/h)/2
Let Y = H/h; the expression above becomes Y(H + 1 - h(1 + Y)/2). Treating the width in the same way gives X(W + 1 - w(1 + X)/2) with X = W/w. Because the two choices are independent, the total count is simply the product of the two, which yields the formula quoted above.
Reading this explanation was quite enlightening! Clearly, once the form of a feature is fixed, the number of rectangular features depends only on the size of the detection subwindow. A 24x24 detection window can generate on the order of 100,000 features, so computing all of their values is very expensive.
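As a quick sanity check of this counting argument, the short Python sketch below (my own illustrative code, with hypothetical helper names and template sizes) enumerates every position and scale of a single w x h template inside a W x H window by brute force and compares the result with the closed-form formula quoted above:

```python
def count_features_bruteforce(W, H, w, h):
    """Count all placements of a w x h template (scaled by integer
    factors in x and y independently) inside a W x H window."""
    count = 0
    for sx in range(1, W // w + 1):          # horizontal scale factor
        for sy in range(1, H // h + 1):      # vertical scale factor
            bw, bh = w * sx, h * sy          # scaled box size
            count += (W - bw + 1) * (H - bh + 1)   # all translations
    return count

def count_features_formula(W, H, w, h):
    """Closed-form count: XY(W+1 - w(X+1)/2)(H+1 - h(Y+1)/2)."""
    X, Y = W // w, H // h
    return int(X * Y * (W + 1 - w * (X + 1) / 2) * (H + 1 - h * (Y + 1) / 2))

if __name__ == "__main__":
    W = H = 24
    # prototype sizes of the basic upright templates (illustrative choice)
    for (w, h) in [(2, 1), (1, 2), (3, 1), (1, 3), (2, 2)]:
        bf = count_features_bruteforce(W, H, w, h)
        cf = count_features_formula(W, H, w, h)
        print(f"template {w}x{h}: brute force {bf}, formula {cf}")
```

For the 2x1 template in a 24x24 window both methods give 43,200 features, and summing the five basic upright templates listed above gives 162,336, which is where the figure of roughly 160,000 quoted later comes from.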
Because the number of features is so large, computing them quickly is essential. Paul Viola and Michael Jones proposed the integral image (summed-area table) to evaluate features rapidly. Each point of the integral image stores the sum of all pixel values above it and to its left, namely:

SAT(x, y) = sum of I(x', y') over all x' <= x and y' <= y,
where I(x, y) is the pixel value at position (x, y) of the image. The integral image can obviously be built incrementally with the recurrence:

SAT(x, y) = SAT(x, y-1) + SAT(x-1, y) - SAT(x-1, y-1) + I(x, y)

with the boundary initialized to SAT(-1, y) = SAT(x, -1) = 0.
You can try this on paper: draw a small grid of squares, take x = 2, y = 2, work out SAT(x, y) with the recurrence above, and compare it with the value you get by summing the pixels directly.
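As a concrete illustration of the recurrence, here is a minimal NumPy sketch (my own code, not from any particular library) that builds the integral image and checks it against a direct cumulative sum:

```python
import numpy as np

def integral_image(img):
    """Build the summed-area table: sat[y, x] = sum of img[:y+1, :x+1]."""
    img = np.asarray(img, dtype=np.int64)
    h, w = img.shape
    sat = np.zeros((h, w), dtype=np.int64)
    for y in range(h):
        for x in range(w):
            sat[y, x] = img[y, x]
            if x > 0:
                sat[y, x] += sat[y, x - 1]          # SAT(x-1, y)
            if y > 0:
                sat[y, x] += sat[y - 1, x]          # SAT(x, y-1)
            if x > 0 and y > 0:
                sat[y, x] -= sat[y - 1, x - 1]      # SAT(x-1, y-1)
    return sat

# quick check against a direct cumulative sum
img = np.arange(16).reshape(4, 4)
assert np.array_equal(integral_image(img), img.cumsum(0).cumsum(1))
```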
Once the integral image is available, the sum of the pixels in any rectangular region can be computed with only four lookups and a few additions and subtractions, as shown in Figure 2:
Figure 2: Calculating the pixel sum of a rectangular region with the integral image
Suppose the four corner points of region D are a (top-left), b (top-right), c (bottom-left) and d (bottom-right). Then the sum of the pixels within region D is:

Sum(D) = SAT(d) - SAT(b) - SAT(c) + SAT(a)
It can be seen that the integral image greatly accelerates the computation of the feature values.
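Building on the integral image sketch above, the following code (again my own illustrative sketch) computes the sum of an arbitrary rectangle with four lookups and then evaluates a simple two-rectangle Haar feature as black sum minus white sum; the specific left-black/right-white layout is just an assumed example:

```python
def rect_sum(sat, x, y, w, h):
    """Sum of pixels in the rectangle with top-left corner (x, y),
    width w and height h, using four lookups in the integral image."""
    total = sat[y + h - 1, x + w - 1]
    if x > 0:
        total -= sat[y + h - 1, x - 1]
    if y > 0:
        total -= sat[y - 1, x + w - 1]
    if x > 0 and y > 0:
        total += sat[y - 1, x - 1]
    return total

def haar_edge_feature(sat, x, y, w, h):
    """Two-rectangle edge feature: left half black, right half white
    (illustrative layout). Value = black sum - white sum."""
    half = w // 2
    black = rect_sum(sat, x, y, half, h)
    white = rect_sum(sat, x + half, y, half, h)
    return black - white
```

For a 24x24 grayscale sample, haar_edge_feature(integral_image(sample), 0, 0, 12, 24) gives one feature value; sliding (x, y) and rescaling (w, h) enumerates all the others.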
II. The AdaBoost Algorithm

1. The boosting algorithm
Before getting to know AdaBoost, let us first look at the boosting algorithm.
For a yes/no question, random guessing achieves a 50% accuracy. If a method can do slightly better than random guessing, the process of obtaining that method is called weak learning; if a method can raise the accuracy significantly, the process of obtaining it is called strong learning. "Weak" and "strong" are quite vivid descriptions of these two processes.
In 1994, Kearns and Valiant proved that, in Valiant's PAC (Probably Approximately Correct) model, a weak learning algorithm can be boosted to arbitrary accuracy through combination as long as enough data are available. In fact, as early as 1990 Schapire constructed the first polynomial-time algorithm that promotes a weak learner into a strong learner, which was the original boosting algorithm. "Boosting" means lifting or strengthening; today it generally refers to the class of algorithms that promote weak learning to strong learning. In 1993, Drucker and Schapire used neural networks as weak learners for the first time and applied boosting to practical problems. Although the boosting method was proposed in 1990, it only became truly mature after 1994.
2. The proposal of the AdaBoost algorithm
In 1995, Freund and Schapire proposed the AdaBoost algorithm, a major improvement on the boosting algorithm. AdaBoost is a representative member of the boosting family; its full name is Adaptive Boosting. "Adaptive" means that the algorithm adjusts itself according to the feedback from the weak learners, so AdaBoost does not need to know a lower bound on the error of the weak hypotheses in advance. For this reason it requires no prior knowledge about the performance of the weak learner, yet achieves the same efficiency as the original boosting algorithm, and it has therefore been widely used.
3. The principle of the AdaBoost algorithm
A traditional boosting algorithm must solve two problems: (1) for the same training data set, how to change its sample distribution so that it can be trained on repeatedly; (2) how to combine the weak classifiers effectively.
AdaBoost gives an "adaptive" answer to both problems. First, for the same training set, it assigns each sample a weight and changes these weights in every round according to the classification results, which effectively produces different distributions of the same training set. Concretely, each training sample carries a weight that indicates its importance; samples with larger weights have a greater influence on training, so the samples emphasized in each round differ, achieving different distributions over the same sample set. The weight update is based on how the weak learner classified the samples in the current round: the weights of samples misclassified by the previous round's weak classifier are increased, and the weights of correctly classified samples are decreased, so the next weak classifier pays more attention to the hard samples. In this way the classification problem is "divided and conquered" by a sequence of weak classifiers.
Second, the weak classifiers are combined by a weighted majority vote. Specifically, a weak classifier with a small classification error receives a larger combination weight, giving it more "influence" in the vote, while a weak classifier with a large error receives a smaller weight. In this way the weak classifiers, each of which focused on different characteristics of the samples during training, are weighted by their classification errors and combined into a final classifier with much stronger performance (the strong classifier). It can be seen that AdaBoost extracts the features that matter most for classification and concentrates on the key training samples.
The AdaBoost algorithm is described below:
Input: a data set D = {(x1, y1), (x2, y2), ..., (xn, yn)}, where yi = 0, 1 marks negative and positive samples respectively, and the number of training rounds is T.
Process:
1. Initialize the sample weights: for samples with yi = 0, 1 the weights are initialized to w1,i = 1/m, 1/l respectively, where m and l are the numbers of negative and positive samples.
2. For t = 1, ..., T:
① Normalize the weights: wt,i ← wt,i / Σj wt,j;
② For each feature j, train a weak classifier hj (how a weak classifier is trained is described later) and compute its weighted error rate εj = Σi wt,i |hj(xi) - yi|;
③ Among the weak classifiers obtained in ②, select the weak classifier ht with the minimum error εt;
④ Update the weight of each sample: wt+1,i = wt,i * βt^(1-ei)
where ei = 0 if sample xi is classified correctly and ei = 1 otherwise, and βt = εt / (1 - εt).
3. The final strong classifier is:
C(x) = 1 if Σt αt ht(x) ≥ (1/2) Σt αt, and C(x) = 0 otherwise,
where αt = log(1/βt).
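To make the training loop above concrete, here is a minimal NumPy sketch of this AdaBoost variant. It assumes a precomputed feature matrix (one row per sample, one column per Haar feature value) and uses a simple threshold stump per feature as the weak classifier; all function names are my own, and the stump search here is the naive version (the efficient sorted scan is described in section III):

```python
import numpy as np

def train_stump(feature_vals, labels, weights):
    """Best threshold/polarity for one feature under the current weights."""
    best = (np.inf, 0.0, 1)                       # (error, threshold, polarity)
    for thr in np.unique(feature_vals):
        for polarity in (1, -1):
            pred = (polarity * feature_vals < polarity * thr).astype(int)
            err = np.sum(weights * np.abs(pred - labels))
            if err < best[0]:
                best = (err, thr, polarity)
    return best

def adaboost(features, labels, T):
    """features: (n_samples, n_features) array; labels: 0/1 array."""
    m, l = np.sum(labels == 0), np.sum(labels == 1)
    w = np.where(labels == 0, 1.0 / m, 1.0 / l)            # step 1
    classifiers = []
    for _ in range(T):
        w = w / w.sum()                                     # step ①
        # steps ② and ③: pick the feature whose stump has the lowest error
        stumps = [train_stump(features[:, j], labels, w)
                  for j in range(features.shape[1])]
        j = int(np.argmin([s[0] for s in stumps]))
        eps, thr, pol = stumps[j]
        eps = max(eps, 1e-10)                               # guard against eps = 0
        pred = (pol * features[:, j] < pol * thr).astype(int)
        beta = eps / (1.0 - eps)
        w = w * beta ** (1 - np.abs(pred - labels))         # step ④
        classifiers.append((j, thr, pol, np.log(1.0 / beta)))
    return classifiers

def strong_classify(classifiers, feature_row):
    """Weighted vote of the selected stumps (step 3)."""
    score = sum(a for (j, thr, pol, a) in classifiers
                if pol * feature_row[j] < pol * thr)
    total = sum(a for (_, _, _, a) in classifiers)
    return int(score >= 0.5 * total)
```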
Before training a classifier with the AdaBoost algorithm, positive and negative samples must be prepared and the feature set selected and constructed according to the characteristics of the samples. The training procedure shows that when a weak classifier classifies a sample correctly its weight is decreased, and when it misclassifies the sample its weight is increased, so later classifiers concentrate their training on the misclassified samples. Finally, all weak classifiers are combined into a strong classifier, which makes its decision by comparing the weighted sum of the weak classifiers' outputs against half of the total vote weight.
4. Classifiers in the AdaBoost algorithm
Weak classifiers are called "weak" because we do not expect the selected classifier to have strong classification power on its own. For example, for a given problem, the weak classifier obtained in some training round may classify only 51% of the training samples correctly; all that is required is that it do slightly better than random prediction (which achieves a 50% classification rate). After each round of training, the weight of every sample is updated according to whether the weak classifier selected in that round classified it correctly, so that the weights of misclassified samples increase. When all training rounds are finished, the resulting strong classifier is formed by a weighted vote of the weak classifiers produced in the individual rounds.
How do we guarantee that a classifier with good performance gets a large weight and one with poor performance gets a small weight? AdaBoost provides a powerful mechanism that links weak classifiers to features, selects classifiers that perform well, and assigns them corresponding weights. A straightforward way to link weak classifiers and features is a one-to-one correspondence: each weak classifier depends on exactly one feature. To implement this, each round of classifier training selects the single rectangular feature that best separates the positive and negative training samples. The weak learning step in each round must, for every feature (and there are many: for a 24*24 image the count reaches about 160,000), determine an optimal threshold that classifies the samples as well as possible. Each round therefore yields the feature with the best classification performance (i.e., the one with the lowest error among the roughly 160,000 candidates), and the weak classifier built on that feature is the weak classifier selected for that round.
III. Combining the AdaBoost Algorithm with Haar-like Features

1. The composition of the weak classifier
After the rectangular features in the training subwindow and their values have been determined, a weak classifier h(x, f, p, θ) must be trained for each feature f.
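In the Viola-Jones formulation (not shown explicitly above), such a single-feature weak classifier has the form

h(x, f, p, θ) = 1 if p·f(x) < p·θ, and 0 otherwise,

where f(x) is the feature value computed on subwindow x, θ is the threshold, and p is a polarity of +1 or -1 that indicates the direction of the inequality.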
2. Training and selecting weak classifiers
A weak classifier is the combination of a feature f(x) and a threshold θ. Training a weak classifier means determining, under the current weight distribution, the optimal threshold for f(x) so that the weak classifier has the lowest classification error over all training samples. Selecting the best weak classifier means choosing, among all weak classifiers (features), the one with the lowest classification error over all training samples.
For each feature, the training samples are first sorted by their feature value; as the sorted list is scanned, the classification error of placing the threshold at each position can be updated incrementally from the accumulated weights of the positive and negative samples seen so far. By scanning the sorted table from start to finish in this way, the threshold (the optimal threshold) that minimizes the classification error of the weak classifier can be selected, as shown in Figure 3.
Figure 3: An algorithm for training a weak classifier
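Below is a sketch of this single-pass threshold search, following the usual Viola-Jones bookkeeping where T+ and T- denote the total weights of the positive and negative samples and S+ and S- the weights accumulated below the current candidate threshold; the function and variable names are my own:

```python
import numpy as np

def train_weak_classifier(feature_vals, labels, weights):
    """Find the threshold/polarity with minimum weighted error for one
    feature by a single scan over samples sorted by feature value."""
    order = np.argsort(feature_vals)
    f_sorted = feature_vals[order]
    y_sorted = labels[order]
    w_sorted = weights[order]

    t_pos = np.sum(weights[labels == 1])   # total positive weight (T+)
    t_neg = np.sum(weights[labels == 0])   # total negative weight (T-)
    s_pos = 0.0                            # positive weight below threshold (S+)
    s_neg = 0.0                            # negative weight below threshold (S-)

    best_err, best_thr, best_pol = np.inf, 0.0, 1
    for i in range(len(f_sorted)):
        # error if everything below the threshold is labeled negative ...
        err_neg_below = s_pos + (t_neg - s_neg)
        # ... or if everything below the threshold is labeled positive
        err_pos_below = s_neg + (t_pos - s_pos)
        if err_neg_below < best_err:
            best_err, best_thr, best_pol = err_neg_below, f_sorted[i], -1
        if err_pos_below < best_err:
            best_err, best_thr, best_pol = err_pos_below, f_sorted[i], 1
        if y_sorted[i] == 1:
            s_pos += w_sorted[i]
        else:
            s_neg += w_sorted[i]
    return best_err, best_thr, best_pol
```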
Special note: when preparing the training samples, each sample should be converted to grayscale and normalized to the specified size, so that all samples are grayscale images of identical dimensions and every Haar feature (which is defined by its position and size) appears at the same location in every sample.
IV. The Cascade Classifier
In practical applications a single strong classifier usually cannot solve complex detection problems accurately while simultaneously achieving a high detection rate and a low false positive rate, so a cascade of strong classifiers (a cascade classifier) is normally used instead. The strategy of the cascade classifier is to arrange a number of strong classifiers from simple to complex and to train each stage to have a very high detection rate, while the requirement on its false positive rate can be relaxed. If, for example, every stage has a detection rate of 99.9% and a false positive rate of 50%, a 15-stage cascade has an overall detection rate of 0.999^15 ≈ 0.9851 and an overall false positive rate of 0.5^15 ≈ 0.00003, which meets practical requirements.
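A quick worked check of those numbers (the per-stage rates are the illustrative values from the text, not measured results):

```python
# overall rates of a cascade are the products of the per-stage rates
stages = 15
d_stage, f_stage = 0.999, 0.5    # per-stage detection / false positive rate
print(d_stage ** stages)          # ~0.9851   overall detection rate
print(f_stage ** stages)          # ~0.0000305 overall false positive rate
```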
During training, every stage of the cascade is trained with the AdaBoost algorithm. For the first stage the training data is the entire training set; a high detection rate is specified, while the false positive rate requirement can be as loose as the level of random guessing, so only a small number of well-performing features is needed to reach the specified targets. For the second stage, the negative training samples are replaced by those original negatives that the first stage misclassified as positive, so the next stage is trained on samples that are harder to classify and will need somewhat more features and weak classifiers. Continuing in this way eventually yields a cascade of strong classifiers arranged from simple to complex.
Special Note: Each strong classifier in the Cascade classifier contains several weak classifiers, and each weak classifier is trained using the aforementioned AdaBoost algorithm combined with the Haar-like feature.
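A high-level sketch of this bootstrapping procedure; the function names are hypothetical, and `train_stage` stands for any routine (for example one built on the AdaBoost sketch earlier) that fits a strong classifier and returns a 0/1 predictor:

```python
def train_cascade(positives, negatives, n_stages, train_stage):
    """Train a cascade of strong classifiers. `train_stage` fits a strong
    classifier on (positives, negatives) and returns a callable
    predict(sample) -> 0/1."""
    cascade = []
    for _ in range(n_stages):
        if not negatives:                        # ran out of hard negatives
            break
        predict = train_stage(positives, negatives)
        cascade.append(predict)
        # bootstrap: the next stage only sees the negatives that this
        # stage wrongly accepts as positive (its false positives)
        negatives = [x for x in negatives if predict(x) == 1]
    return cascade

def cascade_predict(cascade, sample):
    """A window is accepted only if every stage accepts it."""
    return int(all(predict(sample) == 1 for predict in cascade))
```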
When detecting objects in an input image, the image generally has to be scanned over multiple regions and at multiple scales. Multi-region scanning means translating the detection window so that every region of the image is examined. Since the positive samples used during training were normalized to a fixed size, multi-scale detection is needed to find targets larger than the training sample size. There are two common strategies for multi-scale detection. One keeps the subwindow size fixed and repeatedly rescales the image itself; this requires rescaling the image and recomputing the feature values each time, so it is not very efficient. The other gradually enlarges the detection window starting from the training sample size, which avoids that weakness. However, when the window is enlarged, the same target is usually detected several times, so the overlapping detections must be merged.
Whichever search method is used, a large number of subwindow images are sampled from the input image. These subwindows are filtered stage by stage by the cascade classifier: a subwindow passes to the next stage only if the current stage does not reject it as negative, otherwise it is discarded as a non-target region. Only the regions that pass all cascaded strong classifiers and are judged positive are reported as the final detected targets. The detection process of the cascade classifier is as follows:
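In practice this whole pipeline (Haar features, AdaBoost-trained stages, cascading and multi-scale window search) is available off the shelf, for example in OpenCV. A minimal usage sketch, assuming OpenCV is installed and a pre-trained frontal-face cascade XML file is available at the path shown:

```python
import cv2

# path to a pre-trained Haar cascade (assumed to exist; OpenCV ships several)
cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")

img = cv2.imread("test.jpg")                     # assumed input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# multi-scale sliding-window detection; scaleFactor controls the window
# enlargement step, minNeighbors how many overlapping hits are merged
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=3)

for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("result.jpg", img)
```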
The strong classifier trained by the AdaBoost algorithm minimizes the overall error rate; it does not directly maximize the detection rate. In general a high detection rate comes at the cost of a high false positive rate, which increases the error rate. A simple and effective way to raise the detection rate of a stage is to lower the threshold of its strong classifier; conversely, a simple and effective way to reduce the false positive rate of the i-th stage to its target fi is to raise the threshold, which conflicts with the goal of a high detection rate. Analysis of experimental results shows that increasing the number of weak classifiers raises the detection rate of the strong classifier and lowers its error rate, but it also increases the computation time. Two trade-offs must therefore be considered when constructing a cascade classifier:
- Increasing the number of weak classifiers lowers the false positive rate but increases the computation time.
- Decreasing the threshold of the strong classifier increases the detection rate but also increases the false positive rate.
The right balance between these two trade-offs has to be found when constructing the strong classifier of each stage of the cascade.