Principle and derivation of Adaboost
0 Introduction
I had always wanted to write about AdaBoost but kept putting it off. Although the idea behind the algorithm is simple ("listen to the opinions of many people and make a comprehensive final decision"), the way the algorithm is usually described is rather obscure. On the afternoon of November 1, Zou Bo covered decision trees and AdaBoost in the 8th session of the machine learning class I organized. He explained AdaBoost very well, and after the lecture I knew I could finally write this post.
This article is based on Zou Bo's PPT on decision trees and AdaBoost, together with the book Statistical Learning Methods. It can be regarded as course notes, reading notes, or a record of my learning experience. If you have any questions or comments, please feel free to point them out at any time. Thanks.
1 Adaboost principle
1.1 What is Adaboost
AdaBoost, short for "Adaptive Boosting", was proposed by Yoav Freund and Robert Schapire in 1995. Its adaptivity lies in the fact that samples misclassified by the previous basic classifier have their weights increased, and the re-weighted samples are then used to train the next basic classifier.
AdaBoost is an iterative algorithm that adds a new weak classifier in each round until a predetermined error rate is reached. Each training sample is assigned a weight indicating the probability of it being selected into the training set of the next classifier. If a sample has been classified correctly, its probability of being selected for the next training set is reduced; conversely, if it has been misclassified, its weight is increased.
In a concrete implementation, the weights of all samples are initially equal. In the k-th iteration, sample points are selected according to these weights and a classifier is trained on them. Based on this classifier's results, the weights of misclassified samples are increased and the weights of correctly classified samples are decreased. The re-weighted sample set is then used to train the next classifier, and the whole training process continues iteratively in this way.
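To make the re-weighting idea concrete, here is a tiny toy round in Python. The five samples and their weights are invented purely for illustration, and the coefficient formula used is the one introduced in section 1.2 below:

```python
import math

# Toy illustration of one AdaBoost re-weighting round (numbers are made up
# for illustration, not taken from the article's example).
weights = [0.2] * 5                                # five samples, equal initial weights
correct = [True, True, True, True, False]          # suppose only the last sample is misclassified

error = sum(w for w, ok in zip(weights, correct) if not ok)  # weighted error = 0.2
alpha = 0.5 * math.log((1 - error) / error)                  # classifier coefficient (see section 1.2)

# Misclassified samples are multiplied by e^alpha (weight grows),
# correctly classified ones by e^(-alpha) (weight shrinks); then normalize.
new_weights = [w * math.exp(alpha if not ok else -alpha) for w, ok in zip(weights, correct)]
z = sum(new_weights)
new_weights = [w / z for w in new_weights]

print(round(alpha, 4), [round(w, 3) for w in new_weights])
# alpha = 0.6931; the misclassified sample's weight rises from 0.2 to 0.5,
# while each of the other four drops to 0.125.
```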
1.2 Adaboost algorithm flow
Given a training dataset T = {(x1, y1), (x2, y2), ..., (xN, yN)}, where each instance xi belongs to the instance space X ⊆ R^n and each label yi belongs to the label set {-1, +1}, AdaBoost aims to learn a series of weak (basic) classifiers from the training data and then combine these weak classifiers into a strong classifier.
The Adaboost algorithm process is as follows:
- 1. First, initialize the weight distribution of the training data. Each training sample is initially assigned the same weight 1/N, i.e. D1 = (w_11, ..., w_1N) with w_1i = 1/N for i = 1, 2, ..., N.
- 2. Then iterate over the basic classifiers, m = 1, 2, ..., M. If a sample point has been classified correctly, its probability of being selected for the next training set is reduced; conversely, if it has been misclassified, its weight is increased. Specifically:
a. Use the training dataset with weight distribution Dm to learn a basic binary classifier Gm(x): X → {-1, +1}.
b. Calculate the classification error rate of Gm(x) on the training dataset: em = P(Gm(xi) ≠ yi) = Σ_{i=1}^{N} w_mi I(Gm(xi) ≠ yi), that is, the sum of the weights of the samples misclassified by Gm(x).
c. Calculate the coefficient of Gm(x); αm indicates the importance of Gm(x) in the final classifier: αm = (1/2) ln((1 - em)/em).
According to the formula above, when em ≤ 1/2 we have αm ≥ 0, and αm increases as em decreases, which means that basic classifiers with smaller classification error rates play a greater role in the final classifier.
d. Update the weight distribution of the training dataset: D_{m+1} = (w_{m+1,1}, ..., w_{m+1,N}), with w_{m+1,i} = (w_mi / Zm) exp(-αm yi Gm(xi)) for i = 1, 2, ..., N.
This increases the weights of the samples misclassified by the basic classifier Gm(x) and decreases the weights of the correctly classified samples. In this way, AdaBoost can "focus" on the samples that are harder to classify.
Zm is a normalization factor that makes D_{m+1} a probability distribution: Zm = Σ_{i=1}^{N} w_mi exp(-αm yi Gm(xi)).
- 3. Build a linear combination of the basic classifiers: f(x) = Σ_{m=1}^{M} αm Gm(x).
The final classifier is obtained as: G(x) = sign(f(x)) = sign(Σ_{m=1}^{M} αm Gm(x)).
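The three steps above can be condensed into a short program. The sketch below is a minimal illustration, assuming one-dimensional inputs and simple threshold classifiers ("decision stumps") as the basic classifiers; the stump learner and all function names are my own choices, not part of the algorithm description above.

```python
import math

def train_stump(x, y, w):
    """Pick the threshold/direction stump with the lowest weighted error."""
    best = None
    thresholds = [min(x) - 0.5] + [v + 0.5 for v in sorted(x)]
    for t in thresholds:
        for d in (1, -1):                        # d = 1 means: predict +1 when x < t
            pred = [d if xi < t else -d for xi in x]
            err = sum(wi for wi, pi, yi in zip(w, pred, y) if pi != yi)
            if best is None or err < best[0]:
                best = (err, t, d)
    return best                                  # (weighted error em, threshold, direction)

def adaboost(x, y, M):
    n = len(x)
    w = [1.0 / n] * n                            # step 1: D1 = (1/N, ..., 1/N)
    classifiers = []
    for _ in range(M):                           # step 2: m = 1, ..., M
        e, t, d = train_stump(x, y, w)           # 2a, 2b: fit Gm and compute its error em
        e = max(e, 1e-12)                        # guard against a perfect stump
        alpha = 0.5 * math.log((1 - e) / e)      # 2c: coefficient alpha_m
        pred = [d if xi < t else -d for xi in x]
        w = [wi * math.exp(-alpha * yi * pi) for wi, yi, pi in zip(w, y, pred)]
        z = sum(w)                               # 2d: normalizer Zm
        w = [wi / z for wi in w]
        classifiers.append((alpha, t, d))
    return classifiers

def predict(classifiers, xi):
    f = sum(a * (d if xi < t else -d) for a, t, d in classifiers)  # step 3: f(x)
    return 1 if f >= 0 else -1                                     # G(x) = sign(f(x))
```

Run on the 10-point example in section 1.3 below, this sketch selects the same three thresholds (2.5, 8.5, 5.5); note that the first round actually has a tie between 2.5 and 8.5, which the sketch breaks by scanning order.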
1.3 Adaboost example
For the following training samples (the classic 10-point example with a single feature x and label y), use the AdaBoost algorithm to learn a strong classifier.

x:  0   1   2   3   4   5   6   7   8   9
y:  1   1   1  -1  -1  -1   1   1   1  -1
Solution process: initialize the weight distribution of the training data so that each weight is w_1i = 1/N = 0.1, where N = 10 and i = 1, 2, ..., 10, and then iterate for m = 1, 2, 3, ...
Iteration 1: For m = 1, on the training data with weight distribution D1, the error rate is lowest when the threshold v is 2.5. Therefore, the basic classifier is: G1(x) = 1 if x < 2.5, and -1 if x > 2.5.
In this way, the error rate of G1(x) on the training dataset is obtained: e1 = P(G1(xi) ≠ yi) = 0.3 (three samples, each of weight 0.1, are misclassified).
Then calculate the coefficient of G1: α1 = (1/2) ln((1 - e1)/e1) = 0.4236.
Then update the distribution of the weights of the training data:
Finally, we obtain the updated weight distribution: D2 = (0.0715, 0.0715, 0.0715, 0.0715, 0.0715, 0.0715, 0.0715, 0.1666, 0.1666, 0.1666). The classification function is f1(x) = 0.4236 G1(x), so the resulting classifier sign(f1(x)) has three misclassified points on the training dataset.
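The figures in this first round can be reproduced from e1 = 0.3 alone (seven samples of weight 0.1 classified correctly, three misclassified); the small check below is my own, not part of the original text:

```python
import math

e1 = 0.3
alpha1 = 0.5 * math.log((1 - e1) / e1)
print(round(alpha1, 4))                          # 0.4236

# Seven correctly classified samples (weight 0.1 each) and three misclassified ones.
z1 = 7 * 0.1 * math.exp(-alpha1) + 3 * 0.1 * math.exp(alpha1)
w_correct = 0.1 * math.exp(-alpha1) / z1
w_wrong = 0.1 * math.exp(alpha1) / z1
print(round(w_correct, 4), round(w_wrong, 4))
# 0.0714 and 0.1667, i.e. the 0.0715 / 0.1666 values of D2 quoted above, up to rounding.
```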
Iteration 2: For m = 2, on the training data with weight distribution D2, the error rate is lowest when the threshold is 8.5. Therefore, the basic classifier is: G2(x) = 1 if x < 8.5, and -1 if x > 8.5.
The error rate of G2(x) on the training dataset is e2 = P(G2(xi) ≠ yi) = 0.2143.
Calculate the coefficient of G2: α2 = (1/2) ln((1 - e2)/e2) = 0.6496.
Update the distribution of weights of training data:
D3 = (0.0455, 0.0455, 0.0455, 0.1667, 0.1667, 0.1667, 0.1060, 0.1060, 0.1060, 0.0455)
f2(x) = 0.4236 G1(x) + 0.6496 G2(x)
The classifier sign(f2(x)) has three misclassified points on the training dataset.
Iteration 3: For m = 3, on the training data with weight distribution D3, the error rate is lowest when the threshold is 5.5. Therefore, the basic classifier is: G3(x) = 1 if x > 5.5, and -1 if x < 5.5.
The error rate of G3(x) on the training dataset is e3 = P(G3(xi) ≠ yi) = 0.1820.
Calculate the coefficient of G3: α3 = (1/2) ln((1 - e3)/e3) = 0.7514.
Update the distribution of weights of training data:
D4 = (0.125, 0.125, 0.125, 0.102, 0.102, 0.102, 0.065, 0.065, 0.065, 0.125), f3(x) = 0.4236 G1(x) + 0.6496 G2(x) + 0.7514 G3(x), and the classifier sign(f3(x)) has 0 misclassified points on the training dataset.
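As a cross-check of the whole example, the short script below (my own; it hard-codes the three stumps and coefficients quoted above rather than re-learning them, with a (threshold, direction) encoding of my choosing) confirms that sign(f3(x)) makes no mistake on the 10 training points:

```python
import math

x = list(range(10))
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

# The three stumps found above: (threshold, direction), where direction = +1
# means "predict +1 when x < threshold" and -1 means the opposite.
stumps = [(2.5, 1), (8.5, 1), (5.5, -1)]
alphas = [0.4236, 0.6496, 0.7514]

def f(xi):
    """f3(x) = alpha1*G1(x) + alpha2*G2(x) + alpha3*G3(x)."""
    return sum(a * (d if xi < t else -d) for a, (t, d) in zip(alphas, stumps))

errors = sum(1 for xi, yi in zip(x, y) if (1 if f(xi) >= 0 else -1) != yi)
print(errors)   # 0 -- sign(f3(x)) misclassifies no training point
```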
2 Adaboost's error bound
From the example above, we can see that AdaBoost keeps reducing the training error during learning. But what is the bound on this training error?
In fact, the upper bound of AdaBoost's training error is:

(1/N) Σ_{i=1}^{N} I(G(xi) ≠ yi) ≤ (1/N) Σ_{i=1}^{N} exp(-yi f(xi)) = Π_{m=1}^{M} Zm

Next, let us prove this formula.
For the first half (the inequality): when G(xi) ≠ yi, we have yi f(xi) < 0, so exp(-yi f(xi)) ≥ 1 = I(G(xi) ≠ yi); when G(xi) = yi, the indicator is 0 while the exponential is still positive. Hence the inequality holds term by term.
For the second half (the equality), don't forget that the weight-update rule of step 2d can be rewritten as Zm w_{m+1,i} = w_mi exp(-αm yi Gm(xi)).
The entire derivation process is as follows:

(1/N) Σ_i exp(-yi f(xi))
= (1/N) Σ_i exp(-Σ_{m=1}^{M} αm yi Gm(xi))
= Σ_i w_1i Π_{m=1}^{M} exp(-αm yi Gm(xi))          (since w_1i = 1/N)
= Z1 Σ_i w_2i Π_{m=2}^{M} exp(-αm yi Gm(xi))
= Z1 Z2 Σ_i w_3i Π_{m=3}^{M} exp(-αm yi Gm(xi))
= ...
= Z1 Z2 ... Z_{M-1} Σ_i w_Mi exp(-αM yi GM(xi))
= Π_{m=1}^{M} Zm
The results show that Zm can be minimized by selecting the appropriate Gm in each round, so as to minimize the training error. Next, let's continue to evaluate the upper bound of the above results.
For binary classification, the following result holds:

Π_{m=1}^{M} Zm = Π_{m=1}^{M} [2√(em(1 - em))] = Π_{m=1}^{M} √(1 - 4γm^2) ≤ exp(-2 Σ_{m=1}^{M} γm^2)
where γm = 1/2 - em.
Let us now prove this conclusion. From the definition of Zm and the definition of em given earlier:

Zm = Σ_{i=1}^{N} w_mi exp(-αm yi Gm(xi))
   = Σ_{yi = Gm(xi)} w_mi e^(-αm) + Σ_{yi ≠ Gm(xi)} w_mi e^(αm)
   = (1 - em) e^(-αm) + em e^(αm)
   = 2√(em(1 - em))            (substituting αm = (1/2) ln((1 - em)/em))
   = √(1 - 4γm^2)
The remaining inequality, √(1 - 4γm^2) ≤ exp(-2γm^2), can be derived from the Taylor expansions of e^x and √(1 - x) at the point x = 0.
It is worth mentioning that if we take the minimum of γ1, γ2, ..., γM and denote it by γ (so that γm ≥ γ > 0 for all m), then:

(1/N) Σ_{i=1}^{N} I(G(xi) ≠ yi) ≤ exp(-2Mγ^2)
This conclusion shows that the training error of AdaBoost decreases at an exponential rate. Moreover, AdaBoost does not need to know the lower bound γ in advance; it is adaptive, adapting to the training error rates of the individual weak classifiers.
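As a quick numerical illustration of this bound (the check is my own, using only the error rates e1, e2, e3 quoted in the example of section 1.3): each round contributes a factor Zm = 2√(em(1 - em)) < 1, so the product, and hence the bound on the training error, shrinks with every round.

```python
import math

e = [0.3, 0.2143, 0.1820]                       # e1, e2, e3 from the example above
z = [2 * math.sqrt(em * (1 - em)) for em in e]  # Zm = 2*sqrt(em*(1 - em))
bound = math.prod(z)                            # product of the Zm: the training error bound
print([round(v, 4) for v in z], round(bound, 4))
# Zm ≈ [0.9165, 0.8207, 0.7717]; the bound is ≈ 0.5804, and the actual training
# error after three rounds is 0, consistent with the bound.
```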
Finally, there is another way to understand AdaBoost: it can be viewed as an additive model with an exponential loss function, learned by the forward stagewise algorithm as a binary classification method. If you are interested, see section 8.3 of Statistical Learning Methods or other relevant materials.
3 References and recommended reading
- Zou Bo's PPT on decision trees and AdaBoost (from the machine learning class mentioned in the introduction);
- Statistical Learning Methods, in particular the chapter on boosting (section 8.3 for the additive-model view).
Q&A from the comments:

Q: I have recently been studying AdaBoost. Are there multi-class variants such as AdaBoost.M1, M2, MH, MR and MO? I still don't understand the principle of the MH algorithm, especially the MH code.
A: The M1 algorithm is available at Www.informedia.cs.cmu.edu/...1.html, though as a foreign site it may not always be reachable. If you cannot get to Google, the AdaBoost toolbox from CMU should have it.

Q: You said that in OpenCV, face detection can be turned into eye detection simply by swapping in a different classifier. Do they work on the same principle, i.e. AdaBoost with rectangular (Haar) features?
A: Yes, the principle is the same: both use Haar-feature classifiers trained with the AdaBoost algorithm. In your program, load two classifiers, one for the face and one for the eyes; after detecting the face, set the ROI to the face region and then run eye detection inside it.