Transferred from: http://blog.csdn.net/v_july_v/article/details/40718799
The Principle and Derivation of the AdaBoost Algorithm
1 AdaBoost Principles
1.1 What is AdaBoost
AdaBoost, short for "Adaptive Boosting", was proposed by Yoav Freund and Robert Schapire in 1995. Its adaptivity lies in the fact that samples misclassified by the previous basic classifier are given larger weights, and the reweighted samples are used to train the next basic classifier. At the same time, a new weak classifier is added in each round, until a predetermined, sufficiently small error rate or a predetermined maximum number of iterations is reached.
Specifically, the whole AdaBoost iterative algorithm consists of 3 steps:
- Initialize the weight distribution of the training data. If there are N samples, each training sample is given the same weight at the very beginning: 1/N.
- Train weak classifiers. During training, if a sample point has been classified accurately, its weight is decreased when the next training set is constructed; conversely, if a sample point is classified inaccurately, its weight is increased. The reweighted sample set is then used to train the next classifier, and the whole training process continues iteratively in this way.
- Combine the weak classifiers obtained in each round into a strong classifier. After the training of the weak classifiers is finished, the weights of weak classifiers with small classification error rates are enlarged so that they play a larger role in the final classification function, while the weights of weak classifiers with large classification error rates are reduced so that they play a smaller role. In other words, a weak classifier with a low error rate occupies a larger weight in the final classifier, and vice versa.
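For readers who want to try this out right away, below is a minimal sketch using scikit-learn's AdaBoostClassifier (which boosts depth-1 decision trees, i.e. decision stumps, by default), run on the toy data of the worked example in section 1.3. Note that scikit-learn implements the SAMME family of AdaBoost variants, so its internal classifier weights may differ slightly from the hand calculation later in this article.
```python
# A quick off-the-shelf run of AdaBoost on the toy data from section 1.3.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

X = np.arange(10).reshape(-1, 1)                  # feature values 0..9
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])  # class labels of the toy example

clf = AdaBoostClassifier(n_estimators=3)          # 3 boosting rounds, mirroring the worked example
clf.fit(X, y)
print(clf.score(X, y))                            # training accuracy of the boosted ensemble
```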
1.2 AdaBoost Algorithm Flow
Given a training data set T = {(x1, y1), (x2, y2), ..., (xN, yN)}, where each instance xi belongs to the instance space X and each label yi belongs to the label set {-1, +1}, AdaBoost learns a series of weak classifiers (also called basic classifiers) from the training data and then combines these weak classifiers into a strong classifier.
The AdaBoost algorithm flow is as follows:
- Step 1. First, initialize the weight distribution of the training data. Each training sample is given the same weight at the very beginning, so the initial distribution is D1 = (w11, ..., w1N) with w1i = 1/N.
- Step 2. Iterate for m = 1, 2, ..., M, where M is the number of boosting rounds.
a. Use the training data set with weight distribution Dm to learn a basic classifier (select the threshold with the lowest weighted error rate to design it):
Gm(x): X -> {-1, +1}
b. Calculate the classification error rate of Gm(x) on the training data set:
em = P(Gm(xi) ≠ yi) = Σ(i=1..N) wmi I(Gm(xi) ≠ yi)
From the above formula it is clear that the error rate of Gm(x) on the training data set is simply the sum of the weights of the samples misclassified by Gm(x).
c. Calculate the coefficient of Gm(x), which represents the importance of Gm(x) in the final classifier (purpose: to obtain the weight of this basic classifier in the final classifier):
am = (1/2) ln((1 - em) / em)
From the above formula, when em ≤ 1/2 we have am ≥ 0, and am grows as em shrinks, which means that basic classifiers with smaller classification error rates play a larger role in the final classifier.
d. Update the weight distribution of the training data set (purpose: to obtain a new weight distribution over the samples) for the next iteration:
Dm+1 = (wm+1,1, wm+1,2, ..., wm+1,N), with wm+1,i = (wmi / Zm) exp(-am yi Gm(xi)), i = 1, 2, ..., N
In this way the weights of the samples misclassified by the basic classifier Gm(x) are increased, while the weights of the correctly classified samples are decreased, so the AdaBoost method "focuses on" the samples that are harder to classify.
Here Zm is a normalization factor that makes Dm+1 a probability distribution:
Zm = Σ(i=1..N) wmi exp(-am yi Gm(xi))
- Step 3. Combine the weak classifiers into a weighted sum
f(x) = Σ(m=1..M) am Gm(x).
Thus the final classifier is obtained, as follows:
G(x) = sign(f(x)) = sign(Σ(m=1..M) am Gm(x))
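To make the flow above concrete, here is a small illustrative implementation in Python/NumPy, assuming the weak learners are one-dimensional threshold classifiers (decision stumps) as in the example of section 1.3; the function and variable names are illustrative choices, not prescribed by the algorithm, and every round's weighted error is assumed to stay strictly between 0 and 0.5.
```python
# Illustrative sketch of Steps 1-3, with decision stumps as the weak classifiers.
import numpy as np

def train_stump(x, y, w):
    """Steps 2a-2b: pick the threshold/direction with the lowest weighted error."""
    best = None
    for thresh in np.unique(x) + 0.5:              # candidate thresholds between sample points
        for direction in (1, -1):
            pred = np.where(x < thresh, direction, -direction)
            err = w[pred != y].sum()               # weighted error rate e_m
            if best is None or err < best[0]:
                best = (err, thresh, direction)
    return best                                    # (e_m, threshold, direction)

def adaboost(x, y, M):
    w = np.full(len(x), 1.0 / len(x))              # Step 1: uniform initial weights D_1
    ensemble = []
    for m in range(M):
        e, thresh, direction = train_stump(x, y, w)
        alpha = 0.5 * np.log((1 - e) / e)          # Step 2c: coefficient a_m (assumes 0 < e < 0.5)
        pred = np.where(x < thresh, direction, -direction)
        w = w * np.exp(-alpha * y * pred)          # Step 2d: re-weight the samples
        w = w / w.sum()                            # divide by the normalization factor Z_m
        ensemble.append((alpha, thresh, direction))
    return ensemble

def predict(ensemble, x):
    """Step 3: G(x) = sign(sum_m a_m * G_m(x))."""
    f = sum(a * np.where(x < t, d, -d) for a, t, d in ensemble)
    return np.sign(f)
```
Running adaboost(np.arange(10), np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1]), 3) should reproduce the coefficients 0.4236, 0.6496 and 0.7514 that are computed by hand in the next section.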
1.3 An Example of AdaBoost
Below, given the following training samples, use the AdaBoost algorithm to learn a strong classifier (x is the feature value and y the class label):
x: 0 1 2 3 4 5 6 7 8 9
y: 1 1 1 -1 -1 -1 1 1 1 -1
Solution: Initialize the weight distribution of the training data so that each weight is w1i = 1/N = 0.1, where N = 10 and i = 1, 2, ..., 10; then iterate for m = 1, 2, 3, ....
Looking at these 10 training samples and the correspondence between x and y, the data fall into two classes, one labeled "1" and the other labeled "-1". Examining the data: "0 1 2" correspond to class "1", "3 4 5" correspond to class "-1", "6 7 8" correspond to class "1", and 9, which stands somewhat alone, corresponds to class "-1". Setting the lone 9 aside, "0 1 2", "3 4 5" and "6 7 8" are three groups of data with classes 1, -1 and 1 respectively, so one can intuitively guess that suitable split points, such as 2.5, 5.5 or 8.5, could divide the data into two classes. Of course, this is only a subjective conjecture; the concrete calculation follows below.
Iterative Process 1
For m = 1, on the training data with weight distribution D1 (10 data points, each with its weight initialized to 0.1), the following can be calculated:
- When the threshold v is 2.5, the error rate is 0.3 (take y = 1 for x < 2.5 and y = -1 for x > 2.5; then 6, 7, 8 are misclassified, so the error rate is 0.3);
- When the threshold v is 5.5, the error rate is 0.4 (taking y = 1 for x < 5.5 and y = -1 for x > 5.5 misclassifies 3, 4, 5, 6, 7, 8, giving an error rate of 0.6 > 0.5, which is unacceptable; so take y = 1 for x > 5.5 and y = -1 for x < 5.5, which misclassifies 0, 1, 2, 9, giving an error rate of 0.4);
- When the threshold v is 8.5, the error rate is 0.3 (take y = 1 for x < 8.5 and y = -1 for x > 8.5; then 3, 4, 5 are misclassified, so the error rate is 0.3).
As can be seen, whether the threshold v is taken as 2.5 or 8.5, exactly 3 samples are misclassified, so either one can be chosen; taking 2.5, for example, gives the first basic classifier:
G1(x) = 1 if x < 2.5, and G1(x) = -1 if x > 2.5.
As stated above, with threshold v = 2.5 the samples 6, 7, 8 are misclassified, so the error rate is 0.3. In more detail, for the sample set:
- 0, 1, 2 have class (y) 1; since they are less than 2.5, G1(x) assigns them to class "1", so they are classified correctly;
- 3, 4, 5 have class (y) -1; since they are greater than 2.5, G1(x) assigns them to class "-1", so they are classified correctly;
- but 6, 7, 8 have class (y) 1, and because they are greater than 2.5, G1(x) assigns them to class "-1", so these 3 samples are misclassified;
- 9 has class (y) -1; since it is greater than 2.5, G1(x) assigns it to class "-1", so it is classified correctly.
Thus the error rate of G1(x) on the training data set (the sum of the weights of the samples "6 7 8" misclassified by G1(x)) is e1 = P(G1(xi) ≠ yi) = 3 * 0.1 = 0.3.
The coefficient of G1 is then calculated from the error rate e1:
a1 = (1/2) ln((1 - e1) / e1) = (1/2) ln(0.7 / 0.3) ≈ 0.4236.
This a1 represents the weight of G1(x) in the final classification function, namely 0.4236.
Then update the weight distribution of the training data for the next iteration:
w2,i = (w1,i / Z1) exp(-a1 yi G1(xi)), i = 1, 2, ..., 10.
It is worth noting that the weight-update formula determines whether each sample's new weight becomes larger or smaller according to whether the sample was misclassified or classified correctly. That is, if a sample is misclassified, yi G1(xi) is negative, so the exponent -a1 yi G1(xi) is positive and the exponential factor exceeds 1, which makes the sample's weight larger; otherwise the weight becomes smaller.
After the first round of iteration we thus obtain the new weight distribution of the data, D2 = (0.0715, 0.0715, 0.0715, 0.0715, 0.0715, 0.0715, 0.1666, 0.1666, 0.1666, 0.0715). As can be seen, because the samples "6 7 8" were misclassified by G1(x), their weights increase from the previous 0.1 to 0.1666; conversely, the other data points were classified correctly, so their weights decrease from the previous 0.1 to 0.0715.
The classification function is f1(x) = a1 G1(x) = 0.4236 G1(x).
At this point, the combined classifier sign(f1(x)) has 3 misclassified samples (namely 6, 7, 8) on the training data set.
The whole first round of iteration can thus be summarized as follows: the sum of the weights of the misclassified samples determines the error rate e, and the error rate determines the weight a of the basic classifier in the final classifier.
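The numbers of this first round are easy to verify with a few lines of NumPy; the snippet below is only a sanity check of the values quoted above (0.3, 0.4236, and the new weights), with rounding explaining the small differences from the article's figures.
```python
# Reproducing the quantities of iteration 1.
import numpy as np

x = np.arange(10)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])
D1 = np.full(10, 0.1)

pred = np.where(x < 2.5, 1, -1)        # G1(x) with threshold v = 2.5
e1 = D1[pred != y].sum()               # 0.3 (samples 6, 7, 8 are misclassified)
a1 = 0.5 * np.log((1 - e1) / e1)       # about 0.4236
D2 = D1 * np.exp(-a1 * y * pred)
D2 = D2 / D2.sum()                     # normalize by Z1
print(round(e1, 4), round(a1, 4))      # 0.3 0.4236
print(np.round(D2, 4))                 # misclassified samples rise to ~0.1667, the rest fall to ~0.0714
```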
Iterative Process 2
For m = 2, on the training data with weight distribution D2 = (0.0715, 0.0715, 0.0715, 0.0715, 0.0715, 0.0715, 0.1666, 0.1666, 0.1666, 0.0715), the following can be calculated:
- When the threshold v is 2.5, the error rate is 0.1666 * 3 (take y = 1 for x < 2.5 and y = -1 for x > 2.5; then 6, 7, 8 are misclassified, so the error rate is 0.1666 * 3 = 0.4998);
- When the threshold v is 5.5, the lower error rate of the two directions is 0.0715 * 4 (take y = 1 for x > 5.5 and y = -1 for x < 5.5; then 0, 1, 2, 9 are misclassified, so the error rate is 0.0715 * 3 + 0.0715 = 0.2860);
- When the threshold v is 8.5, the error rate is 0.0715 * 3 (take y = 1 for x < 8.5 and y = -1 for x > 8.5; then 3, 4, 5 are misclassified, so the error rate is 0.0715 * 3 = 0.2143).
Therefore the error rate is lowest when the threshold v is 8.5, so the second basic classifier is:
G2(x) = 1 if x < 8.5, and G2(x) = -1 if x > 8.5.
The samples we face are still the same as before. Obviously, G2(x) misclassifies the samples "3 4 5"; according to D2 their weights are 0.0715, 0.0715 and 0.0715, so the error rate of G2(x) on the training data set is e2 = P(G2(xi) ≠ yi) = 0.0715 * 3 = 0.2143.
Calculate the coefficient of G2:
a2 = (1/2) ln((1 - e2) / e2) = (1/2) ln(0.7857 / 0.2143) ≈ 0.6496.
Update the weight distribution of the training data, obtaining
D3 = (0.0455, 0.0455, 0.0455, 0.1667, 0.1667, 0.1667, 0.1060, 0.1060, 0.1060, 0.0455).
The weights of the misclassified samples "3 4 5" become larger, while the weights of the correctly classified samples become smaller.
f2(x) = 0.4236 G1(x) + 0.6496 G2(x)
At this point, the combined classifier sign(f2(x)) has 3 misclassified samples (namely 3, 4, 5) on the training data set.
Iterative Process 3
For m = 3, on the training data with weight distribution D3 = (0.0455, 0.0455, 0.0455, 0.1667, 0.1667, 0.1667, 0.1060, 0.1060, 0.1060, 0.0455), the following can be calculated:
- When the threshold v is 2.5, the error rate is 0.1060 * 3 (take y = 1 for x < 2.5 and y = -1 for x > 2.5; then 6, 7, 8 are misclassified, so the error rate is 0.1060 * 3 = 0.3180);
- When the threshold v is 5.5, the lower error rate of the two directions is 0.0455 * 4 (take y = 1 for x > 5.5 and y = -1 for x < 5.5; then 0, 1, 2, 9 are misclassified, so the error rate is 0.0455 * 4 = 0.1820);
- When the threshold v is 8.5, the error rate is 0.1667 * 3 (take y = 1 for x < 8.5 and y = -1 for x > 8.5; then 3, 4, 5 are misclassified, so the error rate is 0.1667 * 3 = 0.5001).
Therefore the error rate is lowest when the threshold v is 5.5, so the third basic classifier is:
G3(x) = 1 if x > 5.5, and G3(x) = -1 if x < 5.5.
The training samples are still the same. This time the misclassified samples are 0, 1, 2 and 9, each with weight 0.0455 under D3, so the error rate of G3(x) on the training data set is e3 = P(G3(xi) ≠ yi) = 0.0455 * 4 = 0.1820.
Calculate the coefficient of G3:
a3 = (1/2) ln((1 - e3) / e3) = (1/2) ln(0.8180 / 0.1820) ≈ 0.7514.
Update the weight distribution of the training data, obtaining
D4 = (0.125, 0.125, 0.125, 0.102, 0.102, 0.102, 0.065, 0.065, 0.065, 0.125).
The weights of the misclassified samples "0 1 2 9" become larger, while the weights of the correctly classified samples become smaller.
f3(x) = 0.4236 G1(x) + 0.6496 G2(x) + 0.7514 G3(x)
At this point, the combined classifier sign(f3(x)) has 0 misclassified samples on the training data set, and the whole training process is over.
Now, let us summarize how the sample weights and error rates change over the 3 iterations (in each weight distribution D below, the weights that increase are those of the samples misclassified in that round):
- Before training, the weights of the samples were initialized to D1 = (0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1);
- In the first iteration, the samples "6 7 8" are misclassified, the corresponding error rate is e1 = P(G1(xi) ≠ yi) = 3 * 0.1 = 0.3, and the first basic classifier receives weight a1 = 0.4236 in the final classifier. After the first iteration, the new sample weights are D2 = (0.0715, 0.0715, 0.0715, 0.0715, 0.0715, 0.0715, 0.1666, 0.1666, 0.1666, 0.0715);
- In the second iteration, the samples "3 4 5" are misclassified, the corresponding error rate is e2 = P(G2(xi) ≠ yi) = 0.0715 * 3 = 0.2143, and the second basic classifier receives weight a2 = 0.6496 in the final classifier. After the second iteration, the new sample weights are D3 = (0.0455, 0.0455, 0.0455, 0.1667, 0.1667, 0.1667, 0.1060, 0.1060, 0.1060, 0.0455);
- In the third iteration, the samples "0 1 2 9" are misclassified, the corresponding error rate is e3 = P(G3(xi) ≠ yi) = 0.0455 * 4 = 0.1820, and the third basic classifier receives weight a3 = 0.7514 in the final classifier. After the third iteration, the new sample weights are D4 = (0.125, 0.125, 0.125, 0.102, 0.102, 0.102, 0.065, 0.065, 0.065, 0.125).
From the above process it can be seen that if a sample is misclassified, its weight is increased for the next iteration, while the weights of correctly classified samples are decreased for the next iteration. In this way, by selecting in each round the threshold with the lowest weighted error rate to design the basic classifier, and by increasing the weights of misclassified samples while decreasing the weights of correctly classified ones, the error rate e (the sum of the weights of the samples misclassified by Gm(x)) keeps decreasing.
In conclusion, plugging the values a1, a2 and a3 computed above into G(x) = sign(f3(x)) = sign(a1 G1(x) + a2 G2(x) + a3 G3(x)) gives the final classifier:
G(x) = sign(f3(x)) = sign(0.4236 G1(x) + 0.6496 G2(x) + 0.7514 G3(x)).
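As a small check of this result, the snippet below encodes the three threshold classifiers and the coefficients computed above and verifies that sign(f3(x)) labels all 10 training samples correctly (an illustrative verification, not part of the derivation).
```python
# Verifying that G(x) = sign(f3(x)) has 0 misclassifications on the training set.
import numpy as np

x = np.arange(10)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])

G1 = np.where(x < 2.5, 1, -1)
G2 = np.where(x < 8.5, 1, -1)
G3 = np.where(x > 5.5, 1, -1)

f3 = 0.4236 * G1 + 0.6496 * G2 + 0.7514 * G3
print(np.all(np.sign(f3) == y))        # True
```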
2 Error Bounds of AdaBoost
The example above shows that AdaBoost keeps reducing the training error e during learning until the weak classifiers are combined into the final classifier. How large, then, is the error bound of this final classifier?
In fact, the training error of the AdaBoost final classifier has the following upper bound:
(1/N) Σ(i=1..N) I(G(xi) ≠ yi) ≤ (1/N) Σ(i=1..N) exp(-yi f(xi)) = Π(m=1..M) Zm.
Now, let us prove this bound.
For the first inequality: when G(xi) ≠ yi, we have yi f(xi) < 0, so exp(-yi f(xi)) ≥ 1 = I(G(xi) ≠ yi); when G(xi) = yi, exp(-yi f(xi)) > 0 = I(G(xi) ≠ yi). Summing over i proves the first half.
For the second part (the equality), do not forget that the weight-update rule gives
wmi exp(-am yi Gm(xi)) = Zm wm+1,i.
The entire derivation then proceeds as follows:
(1/N) Σi exp(-yi f(xi))
= (1/N) Σi exp(-Σm am yi Gm(xi))
= Σi w1i Πm exp(-am yi Gm(xi))          (since w1i = 1/N)
= Z1 Σi w2i Π(m≥2) exp(-am yi Gm(xi))
= Z1 Z2 Σi w3i Π(m≥3) exp(-am yi Gm(xi))
= ...
= Z1 Z2 ... ZM-1 Σi wMi exp(-aM yi GM(xi))
= Π(m=1..M) Zm.
This result shows that an appropriate Gm can be selected in each round to make Zm as small as possible, so that the training error decreases fastest. Next, let us derive an upper bound for this result.
For two-class classification, the following result also holds:
Π(m=1..M) Zm = Π(m=1..M) [2 sqrt(em (1 - em))] = Π(m=1..M) sqrt(1 - 4 γm^2) ≤ exp(-2 Σ(m=1..M) γm^2),
where γm = 1/2 - em.
Let us prove this conclusion as well. From the definition of Zm given earlier and the first conclusion of this section, we know that
Zm = Σ(i=1..N) wmi exp(-am yi Gm(xi))
= Σ{yi = Gm(xi)} wmi e^(-am) + Σ{yi ≠ Gm(xi)} wmi e^(am)
= (1 - em) e^(-am) + em e^(am)
= 2 sqrt(em (1 - em)) = sqrt(1 - 4 γm^2).
As for the inequality sqrt(1 - 4 γm^2) ≤ exp(-2 γm^2), it can be derived from the Taylor expansions of e^x and sqrt(1 - x) at the point x = 0.
It is worth mentioning that if there exists γ > 0 such that γm ≥ γ for all m (for instance, take γ to be the smallest of γ1, γ2, ..., γM, assuming all of them are positive), then for all M:
(1/N) Σ(i=1..N) I(G(xi) ≠ yi) ≤ exp(-2 M γ^2).
This conclusion shows that the training error of AdaBoost decreases at an exponential rate. Moreover, the AdaBoost algorithm does not need to know the lower bound γ in advance; this is precisely what makes AdaBoost adaptive: it adapts to the training error rates of the individual weak classifiers.
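To relate this bound back to the worked example of section 1.3, the short snippet below computes the product of the normalization factors Zm = 2*sqrt(em*(1-em)) from the three error rates found there; the final training error (0) is indeed below this product. The figures are rough because the error rates themselves are rounded as in the text.
```python
# Numerical illustration of the bound: training error <= product of Z_m.
import math

errors = [0.3, 0.2143, 0.1820]                     # e1, e2, e3 from the example
Z = [2 * math.sqrt(e * (1 - e)) for e in errors]   # Z_m = 2 * sqrt(e_m * (1 - e_m))
print([round(z, 4) for z in Z])                    # roughly [0.9165, 0.8207, 0.7717]
print(round(math.prod(Z), 4))                      # roughly 0.58, and the actual training error is 0
```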
Finally, there is yet another way to understand AdaBoost: it can be regarded as a two-class classification learning method whose model is an additive model, whose loss function is the exponential function, and whose learning algorithm is the forward stagewise algorithm. This view is derived in the next section; before reading it, it is worth consulting section 8.3 of "Statistical Learning Methods" or other relevant material.
3 AdaBoost Exponential Loss Function Derivation
In fact, in Step 3 of the algorithm flow in section 1.2 above, AdaBoost constructs a linear combination of the basic classifiers,
f(x) = Σ(m=1..M) am Gm(x),
which is an additive model, and the AdaBoost algorithm is actually a special case of the forward stagewise algorithm. So the question is: what is an additive model, and what is the forward stagewise algorithm?
3.1 Additive Model and Forward Stagewise Algorithm
An additive model has the form
f(x) = Σ(m=1..M) βm b(x; γm),
where b(x; γm) is called the base function, γm is the parameter of the base function, and βm is the coefficient of the base function.
Given training data and a loss function L(y, f(x)), learning the additive model f(x) becomes a problem of empirical risk minimization, that is, minimizing the loss function
min over {βm, γm} of Σ(i=1..N) L(yi, Σ(m=1..M) βm b(xi; γm)).
This problem can be simplified as follows: working from front to back, each step learns only one base function and its coefficient, gradually approaching the objective above; that is, each step optimizes only the following loss:
min over (β, γ) of Σ(i=1..N) L(yi, fm-1(xi) + β b(xi; γ)),
where fm-1(x) is the model learned in the previous steps. This optimization approach is called the forward stagewise algorithm.
Let us look at the flow of the forward stagewise algorithm in detail:
- Input: training data set T = {(x1, y1), (x2, y2), ..., (xN, yN)}
- Loss function: L(y, f(x))
- Set of base functions: {b(x; γ)}
- Output: additive model f(x)
- Algorithm steps:
- 1. Initialize f0(x) = 0.
- 2. For m = 1, 2, ..., M:
- a) Minimize the loss function, (βm, γm) = argmin over (β, γ) of Σ(i=1..N) L(yi, fm-1(xi) + β b(xi; γ)), to get the parameters βm and γm;
- b) Update fm(x) = fm-1(x) + βm b(x; γm).
- 3. Finally obtain the additive model f(x) = fM(x) = Σ(m=1..M) βm b(x; γm).
In this way, the forward stagewise algorithm simplifies the problem of simultaneously optimizing all the parameters βm, γm for m = 1 to M into a sequence of problems, each solving for a single pair βm, γm (1 ≤ m ≤ M). A schematic code sketch of this flow follows below.
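In the sketch, loss and fit_base are placeholders for whichever loss function and base-function family one chooses; they are assumptions of the sketch, not part of the algorithm statement above.
```python
# Schematic sketch of forward stagewise additive modeling: each round fits only one
# new base function and its coefficient, keeping everything learned so far fixed.
import numpy as np

def forward_stagewise(X, y, M, loss, fit_base):
    f = np.zeros(len(y))                   # step 1: initialize f_0(x) = 0
    model = []
    for m in range(M):                     # step 2: for m = 1, ..., M
        # step 2a: fit_base is assumed to return (beta_m, gamma_m, b(x_i; gamma_m) on the data),
        # chosen to minimize sum_i loss(y_i, f(x_i) + beta * b(x_i; gamma))
        beta, gamma, b_values = fit_base(X, y, f, loss)
        f = f + beta * b_values            # step 2b: f_m = f_{m-1} + beta_m * b(x; gamma_m)
        model.append((beta, gamma))
    return model, f                        # step 3: the additive model sum_m beta_m * b(x; gamma_m)
```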
3.2 The Relationship between the Forward Stagewise Algorithm and AdaBoost
At the end of section 2 above, we said that there is another way to understand AdaBoost: it can be regarded as a two-class classification learning method whose model is an additive model, whose loss function is the exponential function, and whose learning algorithm is the forward stagewise algorithm. In fact, the AdaBoost algorithm is a special case of the forward stagewise algorithm: in AdaBoost, each basic classifier is equivalent to a base function in the additive model, and the loss function is the exponential function.
In other words, when the base functions of the forward stagewise algorithm are taken to be the basic classifiers of AdaBoost, the additive model is exactly the final classifier of AdaBoost,
f(x) = Σ(m=1..M) am Gm(x).
One can even say that this final classifier is itself an additive model, one made up of the basic classifiers Gm(x) and their coefficients am, m = 1, 2, ..., M. The forward stagewise algorithm learns the base functions one by one, which is consistent with the way the AdaBoost algorithm learns its basic classifiers one by one.
Now let us prove that when the loss function of the forward stagewise algorithm is the exponential loss function
L(y, f(x)) = exp(-y f(x)),
its learning process is equivalent to the concrete operations of the AdaBoost learning algorithm.
Assume that after m - 1 rounds of iteration the forward stagewise algorithm has already obtained fm-1(x):
fm-1(x) = fm-2(x) + am-1 Gm-1(x) = a1 G1(x) + ... + am-1 Gm-1(x).
In the m-th round of iteration we then obtain am, Gm(x) and fm(x), where
fm(x) = fm-1(x) + am Gm(x),
with fm-1(x) already known and am, Gm(x) still unknown. Based on the forward stagewise algorithm, our goal is therefore to find the am and Gm(x) that minimize the exponential loss on the training data set T, that is:
(am, Gm(x)) = argmin over (a, G) of Σ(i=1..N) exp(-yi (fm-1(xi) + a G(xi))).
To solve this kind of problem, we can fix the other parameters, solve for one or two parameters at a time, and then solve the remaining parameters one after another. For example, we can fix a1, G1(x), ..., am-1, Gm-1(x) and optimize only over am and Gm(x). In other words, facing the 2m unknown parameters a1, G1(x), ..., am, Gm(x), we can:
- first assume a1, G1(x), ..., am-1, Gm-1(x) are known and solve for am and Gm(x);
- then solve the other unknown parameters one by one.
Also, note that the quantity exp(-yi fm-1(xi)) depends neither on a nor on G, so as far as the minimization is concerned it is a fixed value; denote it w̄mi (this shorthand will be used repeatedly below). The objective above can then be written as
(am, Gm(x)) = argmin over (a, G) of Σ(i=1..N) w̄mi exp(-a yi G(xi)).
It is worth mentioning that although w̄mi does not depend on a or G, it does depend on fm-1(x), and therefore changes with each round of iteration.
Next, we need to prove that the am and Gm(x) that minimize the expression above are exactly the am and Gm(x) obtained by the AdaBoost algorithm. To solve for them, we first find Gm*(x) and then am*.
First, find Gm*(x). For any a > 0, the G(x) that minimizes the expression above is obtained from:
Gm*(x) = argmin over G of Σ(i=1..N) w̄mi I(yi ≠ G(xi)),
where, do not forget, w̄mi = exp(-yi fm-1(xi)).
Comparing this with the formula for the classification error rate described in section 1.2,
em = P(Gm(xi) ≠ yi) = Σ(i=1..N) wmi I(Gm(xi) ≠ yi),
we see that this Gm*(x) is exactly the basic classifier Gm(x) of the AdaBoost algorithm, because it is the basic classifier that minimizes the weighted classification error rate on the training data in the m-th round. In other words, it is exactly what AdaBoost seeks; do not forget that in each round of iteration the AdaBoost algorithm chooses the threshold with the lowest error rate to design the basic classifier.
Then find am*. Going back to the previous objective
Σ(i=1..N) w̄mi exp(-a yi G(xi)),
the sum can be split and simplified as
Σ{yi = G(xi)} w̄mi e^(-a) + Σ{yi ≠ G(xi)} w̄mi e^(a) = (e^(a) - e^(-a)) Σ(i=1..N) w̄mi I(yi ≠ G(xi)) + e^(-a) Σ(i=1..N) w̄mi.
Substituting the Gm*(x) obtained above into this expression, differentiating with respect to a and setting the derivative to 0 gives the minimizing a, namely:
am* = (1/2) ln((1 - em) / em),
which is exactly the same as the formula for am in section 1.2 above.
Here the error rate em in the formula above is:
em = Σ(i=1..N) w̄mi I(yi ≠ Gm(xi)) / Σ(i=1..N) w̄mi = Σ(i=1..N) wmi I(yi ≠ Gm(xi)),
that is, the sum of the (normalized) weights of the samples misclassified by Gm(x).
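As a quick numerical cross-check of this closed form, one can also minimize the weighted exponential loss directly and compare; the snippet below does so for the error rate e1 = 0.3 of iteration 1 of the example (an illustration only).
```python
# Checking numerically that a = 1/2 * ln((1 - e) / e) minimizes (1 - e) * exp(-a) + e * exp(a).
import numpy as np
from scipy.optimize import minimize_scalar

e = 0.3
closed_form = 0.5 * np.log((1 - e) / e)                            # about 0.4236
numeric = minimize_scalar(lambda a: (1 - e) * np.exp(-a) + e * np.exp(a))
print(round(closed_form, 4), round(numeric.x, 4))                  # both about 0.4236
```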
Finally, for the weight update: from the model combination fm(x) = fm-1(x) + am Gm(x) and from w̄mi = exp(-yi fm-1(xi)), it follows that
w̄m+1,i = w̄mi exp(-am yi Gm(xi)).
Compared with the weight-update formula described in section 1.2 above,
wm+1,i = (wmi / Zm) exp(-am yi Gm(xi)),
the only difference is the normalization factor Zm, so the two updates are equivalent.
So, going through the whole process, we can see that the way the forward stagewise algorithm learns its base functions is indeed consistent with the way the AdaBoost algorithm learns its basic classifiers; the two are completely equivalent.
In summary, this section not only gives another way to understand AdaBoost (the model is an additive model, the loss function is the exponential function, and the learning algorithm is the forward stagewise algorithm), but also explains where the basic classifiers and their coefficients in section 1.2 come from, as well as the meaning of the weight-update formula. One can even regard this whole section as an explanation of section 1.2 above.
4 References and recommended readings
- Wikipedia's introduction to AdaBoost: http://zh.wikipedia.org/zh-cn/AdaBoost;
- Shambo's PPT on decision trees and AdaBoost: http://pan.baidu.com/s/1hqepkdy;
- Shambo's PPT on the derivation of the AdaBoost exponential loss function (pages 85-98): HTTP://PAN.BAIDU.COM/S/1KTKKEPD;
- "Statistical Learning Methods" by Hang Li, chapter 8;
- Some humble opinions on AdaBoost: http://blog.sina.com.cn/s/blog_6ae183910101chcg.html;
- A Short Introduction to Boosting: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.5148&rep=rep1&type=pdf;
- Professor Zhou Zhihua's report PPT on 25 years of boosting: http://vdisk.weibo.com/s/fciltuai9m111;
- "The Top Ten Algorithms in Data Mining", chapter 7, AdaBoost;
- http://summerbell.iteye.com/blog/532376;
- Those stories about statistical learning: http://cos.name/2011/12/stories-about-statistical-learning/;
- Study notes on The Elements of Statistical Learning: http://www.loyhome.com/%E2%89%AA%E7%BB%9F%E8%AE%A1%E5%AD%A6%E4%B9%A0%E7%B2%BE%E8%A6%81the-elements-of-statistical-learning%e2%89%ab%e8%af%be%e5%a0%82%e7%ac%94%e8%ae%b0%ef%bc%88%e5%8d%81%e5%9b%9b%ef%bc%89/;
- PRML chapter 14, Combining Models, reading notes: http://vdisk.weibo.com/s/DmxNcM5_IaUD;
- Incidentally, a very useful web page for editing LaTeX formulas online: http://www.codecogs.com/latex/eqneditor.php?lang=zh-cn.