From: http://blog.csdn.net/weixingstudio/article/details/7631241
Haar features and the integral image
1. Introduction to the AdaBoost method
1.1 Proposal and development of the boosting method
Before learning about the AdaBoost method, let's take a look at the boosting method.
For a yes-or-no question, random guessing achieves 50% accuracy. If a method achieves accuracy only slightly better than random guessing, the process of obtaining it is called weak learning; if a method achieves significantly higher accuracy, the process of obtaining it is called strong learning. In 1994, Kearns and Valiant proved that in Valiant's PAC (probably approximately correct) model, given enough data, a weak learning algorithm can be boosted to arbitrary accuracy by combining learners. In fact, Schapire had already constructed a polynomial-time algorithm in 1990 that turns a weak learning algorithm into a strong one; this was the original boosting algorithm. Boosting is thus a family of algorithms that promote and strengthen weak learners. In 1993, Drucker and Schapire used neural networks as weak learners for the first time and applied boosting to a practical problem. So although boosting was proposed in 1990, it only began to mature after 1994. In 1995, Freund proposed a more efficient boosting algorithm.
1.2 Proposal of the AdaBoost algorithm
In 1995, Freund and Schapire proposed the AdaBoost algorithm, a major improvement on boosting. AdaBoost, short for adaptive boosting, is one of the representative algorithms of the boosting family. "Adaptive" means that the method adjusts to the error rates of the hypotheses fed back by the weak learner. AdaBoost therefore does not need to know a lower bound on the hypothesis error rate in advance. It requires no prior knowledge about the performance of the weak learner, yet it is as efficient as the earlier boosting algorithm, so it has been widely used since it was proposed.
The AdaBoost detector described here is built on a cascade classification model. The cascade classification model can be expressed as follows:
Introduction to the cascade classifier: a cascade classifier chains multiple strong classifiers together. Each strong classifier is a weighted combination of several weak classifiers. For example, some strong classifiers may contain 10 weak classifiers and others 20; typically, a strong classifier inside a cascade contains about 20 weak classifiers. Chaining 10 such strong classifiers gives a cascade with about 200 weak classifiers in total. Each strong classifier rejects negative samples very reliably, so once a window is classified as negative, the subsequent strong classifiers are not evaluated, which reduces detection time. Because most of the regions examined in an image are negative samples, the cascade discards many of them in its first stages without running the more complex later ones, which is why it is so fast. Only windows classified as positive are passed to the next strong classifier for re-testing, which keeps the probability of a false positive in the final output very low.
In some cases a cascade is not applicable, and a single strong classifier is used instead. Generally, a single strong classifier needs about 200 weak classifiers to achieve the best results. A cascade classifier achieves similar accuracy to a single strong classifier, but it is much faster.
A cascade-structured classifier consists of multiple classifiers (stages), each more complex than the one before it. Each stage lets almost all positive examples pass while filtering out most negative examples. As a result, each level has fewer candidate windows to examine than the previous level, and a large number of non-target windows are excluded early, greatly improving detection speed.
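The early-rejection behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Stage`, `stage_passes`, `cascade_detect`, and the two toy stages are hypothetical and are not taken from the original article or from any library.

```python
from typing import Callable, List, Sequence, Tuple

# A weak classifier maps a window to 0 or 1. A stage (strong classifier) is a
# list of (alpha, weak_classifier) pairs plus a stage threshold.
Weak = Callable[[Sequence[float]], int]
Stage = Tuple[List[Tuple[float, Weak]], float]


def stage_passes(stage: Stage, window: Sequence[float]) -> bool:
    """Strong classifier: weighted vote of weak classifiers against a threshold."""
    weighted, threshold = stage
    score = sum(alpha * h(window) for alpha, h in weighted)
    return score >= threshold


def cascade_detect(stages: List[Stage], window: Sequence[float]) -> Tuple[bool, int]:
    """Return (is_positive, number_of_stages_evaluated)."""
    for i, stage in enumerate(stages, start=1):
        if not stage_passes(stage, window):
            return False, i  # rejected early; later stages are never evaluated
    return True, len(stages)


# Toy stages (hypothetical): a "window" is just a short list of numbers.
stage1: Stage = ([(1.0, lambda w: int(sum(w) > 0))], 0.5)
stage2: Stage = ([(0.6, lambda w: int(w[0] > 1)),
                  (0.4, lambda w: int(w[-1] > 1))], 0.5)
toy_cascade = [stage1, stage2]

print(cascade_detect(toy_cascade, [-1, -1]))  # rejected by the first stage
print(cascade_detect(toy_cascade, [2, 0]))    # passes both stages
```

The second element of the result shows how many stages were actually run, which is where the speed gain comes from: most negative windows cost only one cheap stage.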
Secondly, AdaBoost is an iterative algorithm. Initially, all training samples are given equal weights, and a weak classifier is trained under this sample distribution. In the t-th iteration (t = 1, 2, 3, ..., T, where T is the number of iterations), the sample weights are determined by the result of the (t-1)-th iteration. At the end of each iteration the weights are adjusted so that misclassified samples receive higher weights. This emphasizes the misclassified samples and yields a new sample distribution, under which a new weak classifier is trained. After T iterations, T weak classifiers have been obtained, and combining them with appropriate weights gives the final strong classifier.
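The iteration can be sketched as follows. This is a minimal textbook version of discrete AdaBoost, not the exact procedure of the article: it assumes labels of +1/-1, one-dimensional inputs, threshold "stumps" as weak classifiers, and the common weight alpha = 0.5 * ln((1 - err) / err). All function names are hypothetical.

```python
import math
from typing import List, Tuple

Stump = Tuple[float, int]  # (theta, p): predict +1 if p * x < p * theta, else -1


def stump_predict(stump: Stump, x: float) -> int:
    theta, p = stump
    return 1 if p * x < p * theta else -1


def best_stump(xs: List[float], ys: List[int], w: List[float]) -> Tuple[Stump, float]:
    """Exhaustively pick the stump with the lowest weighted error."""
    best, best_err = (0.0, 1), float("inf")
    for theta in sorted(set(xs)) + [max(xs) + 1.0]:
        for p in (1, -1):
            err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                      if stump_predict((theta, p), xi) != yi)
            if err < best_err:
                best, best_err = (theta, p), err
    return best, best_err


def adaboost(xs: List[float], ys: List[int], rounds: int) -> List[Tuple[float, Stump]]:
    n = len(xs)
    w = [1.0 / n] * n                      # start with equal sample weights
    ensemble = []
    for _ in range(rounds):
        stump, err = best_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1.0 - 1e-10)   # avoid division by zero
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, stump))
        # Misclassified samples (y * h = -1) are scaled up, correct ones down.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]       # renormalize to a distribution
    return ensemble


def strong_predict(ensemble: List[Tuple[float, Stump]], x: float) -> int:
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


# Toy data that no single stump can separate.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
model = adaboost(xs, ys, rounds=3)
print([strong_predict(model, x) for x in xs])
```

On this toy data a single stump always misclassifies two points, while the weighted vote of three stumps classifies all six correctly, which is the promotion from weak to strong described above.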
2. Rectangular features
2.1 Haar features (rectangular features)
The AdaBoost detector uses rectangular features of the input image, also known as Haar features. The properties of rectangular features are described below.
Feature selection and feature value computation are the two key factors affecting the speed of AdaBoost training and detection. Some characteristics of a face can be described simply with rectangular features, as Figure 2 demonstrates:
The two rectangular features in the figure capture certain characteristics of the face. The middle one expresses that the eye region is darker than the cheek region, and the right one expresses that the two sides of the nose bridge are darker than the bridge itself. Other targets, such as eyes, can likewise be represented by rectangular features. Using features is preferable to using raw pixels, and it is also faster.
Given a limited amount of data, feature-based detection can encode the state of a specific region, and feature-based systems are much faster than pixel-based systems.
Rectangular features are sensitive to simple graphical structures such as edges and line segments, but they can only describe structures in specific orientations (horizontal, vertical, and diagonal), so they are relatively coarse. On a face, for example, the eyes are usually darker than the cheeks, the sides of the nose bridge are darker than the bridge, and the mouth is darker than its surroundings.
A 24×24 detector contains more than 160,000 rectangular features. A dedicated algorithm is needed to select suitable rectangular features and combine them into a strong classifier that can detect faces.
Common rectangular features include two-rectangle, three-rectangle, and four-rectangle features.
From the figure, we can see that two-rectangle features respond to edges, three-rectangle features respond to lines, and four-rectangle features respond to structures in a specific direction.
The feature value of a feature template is defined as the sum of the pixels under the white rectangles minus the sum of the pixels under the black rectangles. Two problems then need to be solved: 1. find the number of features in each subwindow to be detected; 2. compute the value of each feature.
The number of features in a subwindow is the number of feature rectangles. During training, each feature template is slid over the training image subwindow to obtain rectangular features at every position. Rectangular features of the same type at different positions in the subwindow count as different features. It can be shown that once the feature form is fixed, the number of rectangular features depends only on the size of the subwindow [11]. A 24×24 detection window contains about 160,000 rectangular features.
A feature template can be placed in a subwindow at "any" size and "any" position, and each such placement is called a feature. Finding all the features in a subwindow is the basis for training weak classifiers.
2.2 Number of condition rectangles and rectangular features in a subwindow
For a subwindow of size m × m, the number of rectangular features it contains can be calculated.
Taking a detector with a resolution of m × m pixels as an example, the total number of rectangles in it that satisfy specific conditions can be calculated as follows:
In an m × m subwindow, a rectangle is determined by its upper-left vertex A(x1, y1) and its lower-right vertex B(x2, y2). Suppose the rectangle must also satisfy the following two conditions (together called the (s, t) condition; a rectangle that satisfies it is called a condition rectangle):
1) The side length in the x direction is divisible by the natural number s (it can be evenly divided into s segments);
2) The side length in the y direction is divisible by the natural number t (it can be evenly divided into t segments).
Then the minimum size of the rectangle is s × t or t × s, and the maximum size is [m/s]·s × [m/t]·t or [m/t]·t × [m/s]·s, where [ ] denotes taking the integer part.
2.3 Number of condition rectangles
In the following two steps, we can locate a rectangle that meets the conditions:
From the analysis above, the number of rectangles in an m × m subwindow that satisfy the (s, t) condition is:
In effect, the (s, t) condition describes the shape of a rectangular feature. The (s, t) conditions corresponding to the different rectangular features are listed below:
The following uses a 24x24 subwindow as an example to calculate the total number of features:
The total number of features in different subwindows is listed below:
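The counting argument can be checked with a short script. This is a sketch under assumptions: it counts, for each allowed width (a multiple of s) and height (a multiple of t), the number of positions the rectangle can take inside the m × m window, and it assumes five templates with (s, t) conditions (1, 2), (2, 1), (1, 3), (3, 1) and (2, 2). That template set is an assumption, chosen because it reproduces the 78,460 features quoted in this article for a 20×20 window; the function names are hypothetical.

```python
from typing import Iterable, Tuple


def count_condition_rectangles(m: int, s: int, t: int) -> int:
    """Rectangles in an m x m window whose width is a multiple of s
    and whose height is a multiple of t."""
    # A rectangle of width w can start at (m - w + 1) horizontal positions;
    # the same reasoning applies to the height.
    x_choices = sum(m - w + 1 for w in range(s, m + 1, s))
    y_choices = sum(m - h + 1 for h in range(t, m + 1, t))
    return x_choices * y_choices


# Assumed template set: two-rectangle (1,2) and (2,1), three-rectangle (1,3)
# and (3,1), and four-rectangle (2,2).
TEMPLATES = [(1, 2), (2, 1), (1, 3), (3, 1), (2, 2)]


def count_features(m: int, templates: Iterable[Tuple[int, int]] = TEMPLATES) -> int:
    return sum(count_condition_rectangles(m, s, t) for s, t in templates)


print(count_features(24))  # 162336
print(count_features(20))  # 78460
```

With these five templates the 24×24 window yields 162,336 features, consistent with the "more than 160,000" figure given earlier.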
3. Integral image
3.1 Concept of the integral image
Once the rectangular features are defined, their values must be computed. Viola et al. proposed computing feature values with an integral image. The concept of the integral image is illustrated in Figure 3:
The integral image value at point A(x, y) is the sum of all pixels above and to the left of that point (the shaded area in the figure). It is defined as:
where ii(x, y) denotes the integral image and i(x, y) denotes the original image: for a color image it is the color value at that point, and for a grayscale image it is the gray value, in the range 0 to 255.
In the figure, A(x, y) denotes the integral image value at point (x, y), and s(x, y) denotes the cumulative sum of the original image's pixels at point (x, y) along the y direction. The integral image can also be computed with formulas (2) and (3):
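As a sketch, assuming the usual Viola-Jones formulation, the integral image is ii(x, y) = the sum of i(x', y') over all x' ≤ x and y' ≤ y, and it can be built in a single pass with the recurrences s(x, y) = s(x, y-1) + i(x, y) and ii(x, y) = ii(x-1, y) + s(x, y), taking s(x, -1) = 0 and ii(-1, y) = 0. The pure-Python function below follows those recurrences; the function name and the row-major `img[y][x]` layout are assumptions for illustration.

```python
from typing import List


def integral_image(img: List[List[int]]) -> List[List[int]]:
    """ii[y][x] = sum of img[y2][x2] for all y2 <= y and x2 <= x."""
    height, width = len(img), len(img[0])
    ii = [[0] * width for _ in range(height)]
    for x in range(width):
        col_sum = 0  # s(x, y): cumulative sum down column x
        for y in range(height):
            col_sum += img[y][x]                 # s(x, y) = s(x, y-1) + i(x, y)
            left = ii[y][x - 1] if x > 0 else 0  # ii(x-1, y), with ii(-1, y) = 0
            ii[y][x] = left + col_sum            # ii(x, y) = ii(x-1, y) + s(x, y)
    return ii


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
for row in integral_image(image):
    print(row)
# [1, 3, 6]
# [5, 12, 21]
# [12, 27, 45]
```

Each pixel is visited once, so building the integral image costs time proportional to the number of pixels.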
3.2 Using the integral image to calculate the feature value
3.3 Calculating the feature value
From the previous section, the pixel sum of a region can be computed from the integral image values at the region's corner points. Since the value of a feature template is defined as the white-rectangle pixel sum minus the black-rectangle pixel sum, the value of a rectangular feature can be computed from the integral image values at the feature's corner points. Take the second of the two-rectangle features as an example and use the integral image to compute its feature value:
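A minimal sketch of this computation follows. It assumes an integral image padded with a leading row and column of zeros, so that any rectangle sum needs exactly four lookups, and it assumes a two-rectangle feature made of a white top half over a black bottom half. The exact layout of the "second" two-rectangle feature in the article is not recoverable here, so that layout and all function names are illustrative assumptions.

```python
from typing import List


def padded_integral(img: List[List[int]]) -> List[List[int]]:
    """Integral image with an extra zero row and column: ii[y][x] is the sum
    of img over rows < y and columns < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii: List[List[int]], x: int, y: int, w: int, h: int) -> int:
    """Pixel sum of the w x h rectangle with top-left corner (x, y):
    four corner lookups, independent of the rectangle's size."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii: List[List[int]], x: int, y: int, w: int, h: int) -> int:
    """White (top half) pixel sum minus black (bottom half) pixel sum.
    The top/bottom layout is an assumption; h is expected to be even."""
    half = h // 2
    white = rect_sum(ii, x, y, w, half)
    black = rect_sum(ii, x, y + half, w, half)
    return white - black


# Bright rows on top, dark rows below: the feature value is strongly positive.
image = [[10, 10],
         [10, 10],
         [1, 1],
         [1, 1]]
ii = padded_integral(image)
print(rect_sum(ii, 0, 0, 2, 4))          # 44
print(two_rect_feature(ii, 0, 0, 2, 4))  # 36
```

Because every rectangle sum costs four lookups, a two-rectangle feature costs a constant number of operations regardless of its size, which is what makes evaluating many features per window affordable.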
Cascade classifier and detection process
1. Weak Classifier
After determining the number of rectangular features in the training subwindow and how to compute their values, we need to train a weak classifier h(x, f, p, θ) for each feature f.
Editing formulas on CSDN is too difficult, so the formula is given as an image here.
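For reference, a weak classifier of this form can be sketched as a one-line decision rule. The sketch assumes the definition commonly given in the Viola-Jones paper: h(x, f, p, θ) = 1 if p·f(x) < p·θ, and 0 otherwise, where f(x) is the feature value on window x, θ is the threshold and p ∈ {+1, -1} sets the direction of the inequality. The function name is hypothetical.

```python
def weak_classify(feature_value: float, p: int, theta: float) -> int:
    """h(x, f, p, theta): 1 (face) if p * f(x) < p * theta, else 0 (non-face).

    feature_value is f(x), already computed for the window x.
    p = +1 means "face if the value is below the threshold";
    p = -1 means "face if the value is above the threshold".
    """
    return 1 if p * feature_value < p * theta else 0


print(weak_classify(3.0, 1, 5.0))   # 1: 3 < 5
print(weak_classify(3.0, -1, 5.0))  # 0: -3 < -5 is false
print(weak_classify(7.0, -1, 5.0))  # 1: -7 < -5
```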
Note: when preparing training samples, normalize them to 20×20 grayscale images. Every sample is then a grayscale image of the same size, which ensures that each Haar feature (which is defined by its position within the window) is present in every sample.
2. Train a strong classifier
When training a strong classifier, T denotes the number of weak classifiers it contains. If a cascade is used, each strong classifier may contain relatively few weak classifiers, with several strong classifiers chained together.
In step C (2), "each feature f" refers to every possible rectangular feature in a 20×20 training sample, roughly 80,000 in total, and all of them must be evaluated. In other words, about 80,000 weak classifiers are computed and the best-performing one is selected.
Steps for training a strong classifier
3. More on the weak classifier: why Haar features can be used for classification
For the rectangular features in this algorithm, the feature value f(x) of a weak classifier is the value of the rectangular feature. Because the training samples are the same size as the detection subwindow, and the subwindow size determines the number of rectangular features, every sample in the training set has the same features in the same number, and each feature has one fixed value on a given sample.
For images whose pixel values follow an ideal random distribution, the mean value of the same rectangular feature across different images tends to a fixed value k.
The same should hold for non-face samples. However, since non-face samples are not necessarily random-pixel images, this judgment may deviate considerably.
For each feature, compute the mean of its values over all samples of one class (face or non-face), which gives the distribution of the per-class means over all features.
Figure 6 shows the distribution of the mean feature values of all 78,460 rectangular features in the 20×20 subwindow over all 2,706 face samples and 4,381 non-face samples. The distribution shows that the means of most features fall in a range around 0. Surprisingly, the distribution curves for face samples and non-face samples differ little. Note, however, that once the feature values are above or below certain values, the curves show a consistent difference. This shows that most features have very little power to tell faces from non-faces, but there are some features, with corresponding thresholds, that can effectively separate face samples from non-face samples.
To illustrate the point, two features A and B were drawn at random from the 78,460 rectangular features. For each of them, the feature value was computed on all 2,706 face samples and 4,381 non-face samples, and the values were sorted in ascending order. The following figure shows the distribution plotted from the sorted values:
It can be seen that the feature value distributions of rectangular feature A on face samples and non-face samples are very similar, so its ability to distinguish faces from non-faces is poor.
The following shows the feature value distribution of rectangular feature B on face samples and non-face samples:
We can see that the feature value distribution of rectangular feature B differs significantly between face samples and non-face samples, so it classifies faces much better.
From the above analysis, the meaning of the threshold θ is clear. The direction indicator p is used to change the direction of the inequality sign.
A weak learner (a feature) is only required to distinguish face from non-face images with an error rate slightly below 50%, so it is enough for the separation described above to be accurate within a certain probability range. By this criterion, all rectangular features with an error rate below 50% qualify (with a suitably chosen threshold, almost all rectangular features meet this requirement on a fixed training set). In each round of training, the best weak classifier of that round is selected (in the algorithm, T iterations select T best weak classifiers), and finally the best weak classifiers from all rounds are promoted to a strong classifier by a certain method (boosting).
4. Training and selection of weak Classifiers
Training a weak classifier (feature f) means determining the optimal threshold for f under the current weight distribution, so that the weak classifier has the lowest classification error over all training samples.
Selecting the best weak classifier means choosing, among all weak classifiers (features), the one with the lowest classification error over all training samples.
For each feature f, compute its value on all training samples and sort the values. A single scan through the sorted values then determines the optimal threshold for that feature, which yields a trained weak classifier. Specifically, the following four values are computed for each element in the sorted list:
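The scan can be sketched as below. It assumes the four quantities used in the Viola-Jones paper: T+ (total weight of face samples), T- (total weight of non-face samples), S+ (weight of face samples below the current position) and S- (weight of non-face samples below the current position), with the error at each position taken as min(S+ + (T- - S-), S- + (T+ - S+)). Placing candidate thresholds midway between neighbouring sorted values is a simplification chosen for this sketch, and the function name is hypothetical.

```python
from typing import List, Optional, Tuple


def best_threshold(values: List[float], labels: List[int],
                   weights: List[float]) -> Tuple[Optional[float], int, float]:
    """Return (theta, p, error) for one feature.

    labels are 1 (face) or 0 (non-face). The resulting weak classifier
    predicts face when p * value < p * theta.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(order)
    t_pos = sum(w for w, y in zip(weights, labels) if y == 1)  # T+
    t_neg = sum(w for w, y in zip(weights, labels) if y == 0)  # T-
    s_pos = s_neg = 0.0                                        # S+, S-
    best_theta, best_p, best_err = None, 1, float("inf")
    for k in range(n + 1):
        # k sorted samples lie below the candidate threshold.
        if k > 0:
            i = order[k - 1]
            if labels[i] == 1:
                s_pos += weights[i]
            else:
                s_neg += weights[i]
        if 0 < k < n and values[order[k - 1]] == values[order[k]]:
            continue  # equal feature values cannot be split by a threshold
        if k == 0:
            theta = values[order[0]] - 1.0
        elif k == n:
            theta = values[order[-1]] + 1.0
        else:
            theta = (values[order[k - 1]] + values[order[k]]) / 2.0
        err_face_below = s_neg + (t_pos - s_pos)  # p = +1: below is "face"
        err_face_above = s_pos + (t_neg - s_neg)  # p = -1: above is "face"
        for p, err in ((1, err_face_below), (-1, err_face_above)):
            if err < best_err:
                best_theta, best_p, best_err = theta, p, err
    return best_theta, best_p, best_err


print(best_threshold([5, 1, 4, 2], [0, 1, 0, 1], [0.25] * 4))  # (3.0, 1, 0.0)
```

After the sort, the scan is a single pass, so the cost of training one weak classifier is dominated by sorting the feature values.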
5. Strong Classifier
Note that T = 200 weak classifiers refers to a non-cascaded strong classifier. If a cascade of strong classifiers is used, each strong classifier contains relatively few weak classifiers.
In academic usage, a cascade classifier generally means a cascade of strong classifiers. Typically there are about 10 strong classifiers, each with 10 to 20 weak classifiers. The number of weak classifiers can differ from layer to layer: the early layers can contain fewer weak classifiers as needed, with the number gradually increasing in later layers.
6. image detection process
At detection time, the input image is generally much larger than the 20×20 training samples. The AdaBoost detector handles this by enlarging the detection window rather than shrinking the image.
Why enlarge the detection window instead of shrinking the image? Earlier detection methods usually scaled the image down through eleven successive levels, ran detection on each level, and then merged the results from all levels. The problem is that a cascade-based AdaBoost face detector is very fast, and image scaling cannot keep up with it: scaling the image to 11 levels alone takes at least one second, which cannot meet the real-time requirements of AdaBoost detection.
Because Haar features are independent of the size of the detection window (see the original authors' paper for details), the detection window itself can be scaled instead.
At the start of detection, the detection window is the same size as the training samples. It is then moved across and down the image by a given step parameter (the number of pixels moved each time) until the whole image has been traversed, and possible face regions are marked. After each traversal, the detection window is enlarged by the specified scale factor and the image is traversed again. The window keeps growing in this way, and the traversal stops once the detection window exceeds half the size of the original image. Because the whole procedure is very fast, even with this many traversals, processing one image takes from a few dozen to about one hundred milliseconds, depending on the computer.
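The traversal can be sketched as a generator of window positions. This is only an illustration: the parameter names `base`, `scale_factor` and `step`, their default values, and the use of a fixed pixel step at every scale are assumptions, not values taken from the article.

```python
from typing import Iterator, Tuple


def detection_windows(img_w: int, img_h: int, base: int = 20,
                      scale_factor: float = 1.25,
                      step: int = 2) -> Iterator[Tuple[int, int, int]]:
    """Yield (x, y, size) for every square detection window.

    The window starts at the training-sample size, sweeps the whole image,
    then grows by scale_factor; it stops once it would exceed half of the
    smaller image dimension.
    """
    size = float(base)
    while size <= min(img_w, img_h) / 2:
        s = int(round(size))
        for y in range(0, img_h - s + 1, step):
            for x in range(0, img_w - s + 1, step):
                yield x, y, s
        size *= scale_factor  # enlarge the window, not the image


windows = list(detection_windows(100, 100, base=20, scale_factor=2.0, step=20))
print(len(windows))                      # 41
print(sorted({s for _, _, s in windows}))  # [20, 40]
```

Each yielded window would then be passed through the cascade; only the window geometry changes between scales, while the image and its integral image are computed once.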
After the detection windows have traversed the image, the overlapping detected face regions are merged and otherwise post-processed.