The principle and derivation of the AdaBoost algorithm


(Original link: http://blog.csdn.net/v_july_v/article/details/40718799)

0 Introduction

I have always wanted to write about AdaBoost, but kept putting it off. The idea behind the algorithm is simple: listen to many opinions and then synthesize a decision. Yet the descriptions of its flow in most books are far too obscure. Yesterday afternoon (November 1), in the 8th session of the machine learning class I organize, lecturer Zou covered decision trees and AdaBoost; his AdaBoost explanation was so clear that, by the time it ended, I knew I could write this blog post.

This article combines the decision tree and AdaBoost slides from that machine learning class, Zou's slides on the derivation of AdaBoost's exponential loss function (pages 85-98), and references such as "Statistical Learning Methods". It can be read as course notes, reading notes, or a learning summary. If you have any questions or comments, please feel free to point them out in the comments. Thanks.



1 The principle of AdaBoost

1.1 What is AdaBoost?

AdaBoost, short for "Adaptive Boosting", was proposed by Yoav Freund and Robert Schapire in 1995. It is adaptive in the sense that the samples misclassified by the previous basic classifier are given more weight, and the re-weighted samples are used to train the next basic classifier. At the same time, a new weak classifier is added in each round, until a predefined, sufficiently small error rate or a predefined maximum number of iterations is reached.

Specifically, the AdaBoost iterative algorithm has 3 steps:

1. Initialize the weight distribution of the training data. If there are N samples, each training sample is initially given the same weight: 1/N.

2. Train weak classifiers. During training, if a sample is classified correctly, its weight is decreased when constructing the next training set; conversely, if a sample is misclassified, its weight is increased. The re-weighted sample set is then used to train the next classifier, and the whole training process continues iteratively in this way.

3. Combine the trained weak classifiers into a strong classifier. After training, weak classifiers with small classification error rates are given larger weights, so they play a larger role in the final classification function, while weak classifiers with large classification error rates are given smaller weights and play a smaller role. In other words, a weak classifier with a low error rate occupies a larger weight in the final classifier, and vice versa.

1.2 The AdaBoost algorithm flow

Given a training data set T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, where each x_i belongs to the instance space X and each label y_i belongs to the label set {-1, +1}, AdaBoost learns a series of weak classifiers (basic classifiers) from the training data and then combines these weak classifiers into a strong classifier.

The algorithm flow of AdaBoost is as follows:

Step 1. First, initialize the weight distribution of the training data. Each training sample is initially given the same weight 1/N, i.e.

D_1 = (w_{11}, w_{12}, ..., w_{1N}),  w_{1i} = 1/N,  i = 1, 2, ..., N

Step 2. Iterate for m = 1, 2, ..., M, where M is the number of iteration rounds.

a. Use the training data set with weight distribution D_m to learn a basic classifier (select the threshold with the lowest error rate to design the basic classifier):

G_m(x): X → {−1, +1}

b. Calculate the classification error rate of G_m(x) on the training data set:

e_m = P(G_m(x_i) ≠ y_i) = Σ_{i=1}^{N} w_{mi} · I(G_m(x_i) ≠ y_i)

From this formula, the error rate of G_m(x) on the training data set is simply the sum of the weights of the samples misclassified by G_m(x).

c. Calculate the coefficient of G_m(x); α_m represents the importance of G_m(x) in the final classifier (purpose: to obtain the weight of the basic classifier in the final classifier):

α_m = (1/2) · ln((1 − e_m) / e_m)

From this formula, when e_m ≤ 1/2 we have α_m ≥ 0, and α_m increases as e_m decreases, which means that the smaller the classification error rate, the greater the role the basic classifier plays in the final classifier.

d. Update the weight distribution of the training data set (purpose: to obtain a new weight distribution of the samples for the next iteration):

D_{m+1} = (w_{m+1,1}, ..., w_{m+1,i}, ..., w_{m+1,N}),
w_{m+1,i} = (w_{mi} / Z_m) · exp(−α_m · y_i · G_m(x_i)),  i = 1, 2, ..., N

In this way, the weights of the samples misclassified by the basic classifier G_m(x) are increased, while the weights of the correctly classified samples are decreased. AdaBoost can therefore "focus on" the samples that are harder to classify.

Here Z_m is a normalization factor that makes D_{m+1} a probability distribution:

Z_m = Σ_{i=1}^{N} w_{mi} · exp(−α_m · y_i · G_m(x_i))

Step 3. Combine the weak classifiers into a linear combination

f(x) = Σ_{m=1}^{M} α_m · G_m(x)

and obtain the final classifier:

G(x) = sign(f(x)) = sign( Σ_{m=1}^{M} α_m · G_m(x) )
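To make steps 1-3 above concrete, here is a minimal Python sketch of the flow, using one-dimensional decision stumps as the basic classifiers. It is an illustration only: the names stump_predict, train_stump, adaboost_train and adaboost_predict, as well as the choice of candidate thresholds (midpoints next to each sample), are my own assumptions and not from the original article.

```python
import math

def stump_predict(x, threshold, polarity):
    """A decision stump G(x): predict `polarity` left of the threshold, `-polarity` right of it."""
    return polarity if x < threshold else -polarity

def train_stump(X, y, w):
    """Step 2a: pick the threshold/polarity with the lowest weighted error under weights w."""
    best = None
    for threshold in [x + 0.5 for x in X]:        # candidate split points (an assumption of this sketch)
        for polarity in (1, -1):
            # weighted error e_m: sum of the weights of the misclassified samples
            err = sum(wi for xi, yi, wi in zip(X, y, w)
                      if stump_predict(xi, threshold, polarity) != yi)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best                                   # (e_m, threshold, polarity)

def adaboost_train(X, y, M):
    """Run M rounds of AdaBoost; returns a list of (alpha_m, threshold, polarity)."""
    N = len(X)
    w = [1.0 / N] * N                             # step 1: uniform initial weights D_1
    classifiers = []
    for _ in range(M):
        e, threshold, polarity = train_stump(X, y, w)         # steps 2a/2b
        alpha = 0.5 * math.log((1 - e) / e)                   # step 2c (assumes 0 < e < 1)
        # step 2d: re-weight the samples and normalize with Z_m
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, threshold, polarity))
             for xi, yi, wi in zip(X, y, w)]
        Z = sum(w)
        w = [wi / Z for wi in w]
        classifiers.append((alpha, threshold, polarity))
    return classifiers

def adaboost_predict(classifiers, x):
    """Step 3: G(x) = sign(sum_m alpha_m * G_m(x))."""
    f = sum(alpha * stump_predict(x, threshold, polarity)
            for alpha, threshold, polarity in classifiers)
    return 1 if f >= 0 else -1
```

Calling adaboost_train(list(range(10)), [1, 1, 1, -1, -1, -1, 1, 1, 1, -1], M=3) should reproduce the three rounds worked through in section 1.3 below, with classifier weights of approximately 0.4236, 0.6496 and 0.7514.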

1.3 An example of AdaBoost

Below, given the following training samples, use the AdaBoost algorithm to learn a strong classifier:

Sample (x):  0  1  2  3  4  5  6  7  8  9
Label  (y):  1  1  1 -1 -1 -1  1  1  1 -1

Solution process: Initialize the weight distribution of the training data, giving each sample the same weight w_{1i} = 1/N = 0.1, where N = 10 and i = 1, 2, ..., 10. Then iterate for m = 1, 2, 3, ...

Looking at the 10 training samples and the correspondence between x and y, the data fall into two classes, "1" and "-1". Examining the data: "0 1 2" correspond to class "1", "3 4 5" correspond to class "-1", "6 7 8" correspond to class "1", and 9 stands alone with class "-1". Setting aside the lone 9, the three groups "0 1 2", "3 4 5" and "6 7 8" correspond to classes 1, -1 and 1 respectively, so one can intuitively guess that split points such as 2.5, 5.5 or 8.5 would divide the data into two classes. Of course, this is only a subjective guess; the concrete calculation follows below.

Iterative Process 1

For m = 1, the weight distribution of the training data is D1 (10 data points, each initialized to weight 0.1). After calculation: with threshold v = 2.5 the error rate is 0.3 (take y = 1 when x < 2.5 and y = -1 when x > 2.5; then 6, 7, 8 are misclassified, so the error rate is 0.3); with threshold v = 5.5 the error rate is 0.4 (take y = 1 when x < 5.5 and y = -1 when x > 5.5; then 3, 4, 5, 6, 7, 8 are misclassified, giving an error rate of 0.6 > 0.5, which is unacceptable; taking y = 1 when x > 5.5 and y = -1 when x < 5.5 instead, 0, 1, 2, 9 are misclassified, so the error rate is 0.4); with threshold v = 8.5 the error rate is 0.3 (take y = 1 when x < 8.5 and y = -1 when x > 8.5; then 3, 4, 5 are misclassified, so the error rate is 0.3).

As can be seen, whether the threshold v is 2.5 or 8.5, exactly 3 samples are misclassified, so either may be chosen. Taking 2.5, the first basic classifier is:

G_1(x) = 1 if x < 2.5,  −1 if x > 2.5

As said above, with threshold v = 2.5, samples 6, 7, 8 are misclassified, so the error rate is 0.3. In more detail: samples 0, 1, 2 have class (y) 1; since they are less than 2.5, G_1(x) assigns them class "1", correctly. Samples 3, 4, 5 have class (y) -1; since they are greater than 2.5, G_1(x) assigns them class "-1", correctly. But samples 6, 7, 8 have class (y) 1; since they are greater than 2.5, G_1(x) assigns them class "-1", so these 3 samples are misclassified. Sample 9 has class (y) -1; since it is greater than 2.5, G_1(x) assigns it class "-1", correctly.

Thus the error rate of G_1(x) on the training data set (the sum of the weights of the samples 6, 7, 8 misclassified by G_1(x)) is e_1 = P(G_1(x_i) ≠ y_i) = 3 × 0.1 = 0.3.

The coefficient of G_1 is then calculated from the error rate e_1:

α_1 = (1/2) · ln((1 − e_1) / e_1) = (1/2) · ln(0.7 / 0.3) ≈ 0.4236

This α_1 represents the weight of G_1(x) in the final classification function, namely 0.4236.

Then update the weight distribution of the training data for the next iteration:

w_{2,i} = (w_{1,i} / Z_1) · exp(−α_1 · y_i · G_1(x_i)),  i = 1, 2, ..., 10

It is worth noting that the weight-update formula shows that whether each sample's new weight becomes larger or smaller depends on whether the sample was misclassified or classified correctly.

That is, if a sample is misclassified, y_i · G_m(x_i) is negative; negating it gives a positive exponent, so the whole expression becomes larger (the sample's weight increases). Otherwise, the exponent is negative and the weight decreases.
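Written out case by case (my own restatement of the update formula from section 1.2, not part of the original text), the update reads:

\[
w_{m+1,i} =
\begin{cases}
\dfrac{w_{mi}}{Z_m}\, e^{-\alpha_m}, & \text{if } G_m(x_i) = y_i \\[6pt]
\dfrac{w_{mi}}{Z_m}\, e^{\alpha_m},  & \text{if } G_m(x_i) \neq y_i
\end{cases}
\]

Since α_m > 0 whenever e_m < 1/2, we have e^{α_m} > 1 > e^{−α_m}, so misclassified samples gain weight and correctly classified samples lose weight, exactly as described above.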

After the first iteration, we obtain the new weight distribution D2 = (0.0715, 0.0715, 0.0715, 0.0715, 0.0715, 0.0715, 0.1666, 0.1666, 0.1666, 0.0715). As can be seen, because the samples 6, 7, 8 were misclassified by G_1(x), their weights increased from 0.1 to 0.1666; conversely, the other samples were classified correctly, so their weights decreased from 0.1 to 0.0715.

The classification function so far is f_1(x) = α_1 · G_1(x) = 0.4236 · G_1(x).

At this point, the classifier sign(f_1(x)) obtained so far has 3 misclassified samples (6, 7, 8) on the training data set.

The whole first round of iteration can thus be seen as follows: the sum of the weights of the misclassified samples determines the error rate, and the error rate determines the weight of the basic classifier in the final classifier.
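As a quick numerical check of this first round (my own addition, not part of the original article), the values 0.4236, 0.0715 and 0.1666 follow directly from the formulas of section 1.2:

```python
import math

e1 = 0.3                                   # weighted error of G1 on D1
a1 = 0.5 * math.log((1 - e1) / e1)         # alpha_1 ≈ 0.4236

# In D1 every weight is 0.1; 7 samples are classified correctly, 3 (6, 7, 8) are not.
unnormalized = [0.1 * math.exp(-a1)] * 7 + [0.1 * math.exp(a1)] * 3
Z1 = sum(unnormalized)                     # normalization factor Z_1 ≈ 0.9165

w_correct = 0.1 * math.exp(-a1) / Z1       # ≈ 1/14 ≈ 0.0714
w_wrong = 0.1 * math.exp(a1) / Z1          # ≈ 1/6  ≈ 0.1667

print(round(a1, 4), round(w_correct, 4), round(w_wrong, 4))
```

Up to rounding, the printed values match the weights 0.0715 and 0.1666 reported above (the exact values are 1/14 and 1/6).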

Iterative Process 2

For m = 2, on the training data with weight distribution D2 = (0.0715, 0.0715, 0.0715, 0.0715, 0.0715, 0.0715, 0.1666, 0.1666, 0.1666, 0.0715), the calculation gives: with threshold v = 2.5 the error rate is 0.1666 × 3 (take y = 1 when x < 2.5 and y = -1 when x > 2.5; then 6, 7, 8 are misclassified, error rate 0.1666 × 3); with threshold v = 5.5 the error rate is 0.0715 × 4 (take y = 1 when x > 5.5 and y = -1 when x < 5.5; then 0, 1, 2, 9 are misclassified, error rate 0.0715 × 3 + 0.0715); with threshold v = 8.5 the error rate is 0.0715 × 3 (take y = 1 when x < 8.5 and y = -1 when x > 8.5; then 3, 4, 5 are misclassified, error rate 0.0715 × 3).

Therefore, the error rate is lowest when the threshold v is 8.5, so the second basic classifier is:

G_2(x) = 1 if x < 8.5,  −1 if x > 8.5

We are still facing the same training samples as above.

Clearly, G_2(x) misclassifies the samples 3, 4, 5. According to D2, their weights are 0.0715, 0.0715, 0.0715, so the error rate of G_2(x) on the training data set is e_2 = P(G_2(x_i) ≠ y_i) = 0.0715 × 3 = 0.2143.

Calculate the coefficient of G_2:

α_2 = (1/2) · ln((1 − e_2) / e_2) = (1/2) · ln(0.7857 / 0.2143) ≈ 0.6496

Update the weight distribution of the training data:

D3 = (0.0455, 0.0455, 0.0455, 0.1667, 0.1667, 0.1667, 0.1060, 0.1060, 0.1060, 0.0455). The weights of the misclassified samples 3, 4, 5 become larger, and the weights of the other, correctly classified samples become smaller.

f_2(x) = 0.4236 · G_1(x) + 0.6496 · G_2(x)

At this point, the classifier sign(f_2(x)) obtained so far has 3 misclassified samples (3, 4, 5) on the training data set.

Iterative Process 3

For m = 3, on the training data with weight distribution D3 = (0.0455, 0.0455, 0.0455, 0.1667, 0.1667, 0.1667, 0.1060, 0.1060, 0.1060, 0.0455), the calculation gives: with threshold v = 2.5 the error rate is 0.1060 × 3 (take y = 1 when x < 2.5 and y = -1 when x > 2.5; then 6, 7, 8 are misclassified, error rate 0.1060 × 3); with threshold v = 5.5 the error rate is 0.0455 × 4 (take y = 1 when x > 5.5 and y = -1 when x < 5.5; then 0, 1, 2, 9 are misclassified, error rate 0.0455 × 4); with threshold v = 8.5 the error rate is 0.1667 × 3 (take y = 1 when x < 8.5 and y = -1 when x > 8.5; then 3, 4, 5 are misclassified, error rate 0.1667 × 3).

Therefore, the error rate is lowest when the threshold v is 5.5, so the third basic classifier is:

G_3(x) = 1 if x > 5.5,  −1 if x < 5.5

Still facing the same training samples as above.

At this point, the misclassified samples are 0, 1, 2, 9, and each of these 4 samples has weight 0.0455.

So the error rate of G_3(x) on the training data set is e_3 = P(G_3(x_i) ≠ y_i) = 0.0455 × 4 = 0.1820.

Calculate the coefficient of G_3:

α_3 = (1/2) · ln((1 − e_3) / e_3) = (1/2) · ln(0.8180 / 0.1820) ≈ 0.7514

Update the weight distribution of the training data:

D4 = (0.125, 0.125, 0.125, 0.102, 0.102, 0.102, 0.065, 0.065, 0.065, 0.125). The weights of the misclassified samples 0, 1, 2, 9 become larger, and the weights of the other, correctly classified samples become smaller.

f_3(x) = 0.4236 · G_1(x) + 0.6496 · G_2(x) + 0.7514 · G_3(x)

At this point, the classifier sign(f_3(x)) has 0 misclassified samples on the training data set, so the whole training process ends.

Now let us summarize how the sample weights and error rates changed over the 3 iterations (in the weight distributions D below, the weights that increase are those of the samples misclassified in the previous round):

Before training, the weights of the samples were initialized to D1 = (0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1).

In the first round, samples "6 7 8" were misclassified, with error rate e_1 = P(G_1(x_i) ≠ y_i) = 3 × 0.1 = 0.3; the first basic classifier received weight α_1 = 0.4236 in the final classifier. After the first round, the new sample weights were D2 = (0.0715, 0.0715, 0.0715, 0.0715, 0.0715, 0.0715, 0.1666, 0.1666, 0.1666, 0.0715).

In the second round, samples "3 4 5" were misclassified, with error rate e_2 = P(G_2(x_i) ≠ y_i) = 0.0715 × 3 = 0.2143; the second basic classifier received weight α_2 = 0.6496 in the final classifier. After the second round, the new sample weights were D3 = (0.0455, 0.0455, 0.0455, 0.1667, 0.1667, 0.1667, 0.1060, 0.1060, 0.1060, 0.0455).

In the third round, samples "0 1 2 9" were misclassified, with error rate e_3 = P(G_3(x_i) ≠ y_i) = 0.0455 × 4 = 0.1820; the third basic classifier received weight α_3 = 0.7514 in the final classifier. After the third round, the new sample weights were D4 = (0.125, 0.125, 0.125, 0.102, 0.102, 0.102, 0.065, 0.065, 0.065, 0.125).

From the above process it can be seen that if a sample is misclassified, its weight in the next iteration is increased, while the weights of the correctly classified samples in the next iteration are decreased. In this way, by increasing the weights of misclassified samples, decreasing the weights of correctly classified samples, and always selecting the threshold with the lowest error rate to design the basic classifier, the error rate e (the sum of the weights of the samples misclassified by G_m(x)) keeps decreasing round after round.

In conclusion, substituting the values α_1, α_2 and α_3 computed above into G(x), i.e. G(x) = sign[f_3(x)] = sign[α_1 · G_1(x) + α_2 · G_2(x) + α_3 · G_3(x)], we obtain the final classifier:

G(x) = sign[f_3(x)] = sign[0.4236 · G_1(x) + 0.6496 · G_2(x) + 0.7514 · G_3(x)].
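As a quick check (my own addition), the following sketch evaluates this final classifier on the 10 training samples; the helpers g1, g2 and g3 simply encode the three thresholds 2.5, 8.5 and 5.5 found above:

```python
def g1(x): return 1 if x < 2.5 else -1     # first basic classifier (threshold 2.5)
def g2(x): return 1 if x < 8.5 else -1     # second basic classifier (threshold 8.5)
def g3(x): return 1 if x > 5.5 else -1     # third basic classifier (threshold 5.5)

def G(x):
    """Final strong classifier G(x) = sign(f_3(x))."""
    f3 = 0.4236 * g1(x) + 0.6496 * g2(x) + 0.7514 * g3(x)
    return 1 if f3 >= 0 else -1

X = list(range(10))
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
print(sum(G(xi) != yi for xi, yi in zip(X, y)))   # number of misclassified samples; prints 0
```

It prints 0, confirming that the combined classifier misclassifies none of the training samples, as stated above.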



2 The error bound of AdaBoost

The example above shows that AdaBoost keeps reducing the training error e during learning, until the weak classifiers are combined into the final classifier. So how large, exactly, is the training error bound of that final classifier?

In fact, the training error of the final AdaBoost classifier has the following upper bound:

(1/N) · Σ_{i=1}^{N} I(G(x_i) ≠ y_i) ≤ (1/N) · Σ_{i=1}^{N} exp(−y_i · f(x_i)) = Π_{m=1}^{M} Z_m

Now, let's prove the above equation by derivation.

When G(x_i) ≠ y_i, we have y_i · f(x_i) < 0, and thus exp(−y_i · f(x_i)) ≥ 1; when G(x_i) = y_i, the indicator is 0 while the exponential term is positive. So the first half (the inequality) is proven.

For the second part (the equality), do not forget that

w_{1i} = 1/N  and  Z_m · w_{m+1,i} = w_{mi} · exp(−α_m · y_i · G_m(x_i)).

The whole derivation is as follows:

(1/N) · Σ_i exp(−y_i · f(x_i))
= (1/N) · Σ_i exp(−Σ_{m=1}^{M} α_m · y_i · G_m(x_i))
= Σ_i w_{1i} · Π_{m=1}^{M} exp(−α_m · y_i · G_m(x_i))
= Z_1 · Σ_i w_{2i} · Π_{m=2}^{M} exp(−α_m · y_i · G_m(x_i))
= Z_1 · Z_2 · Σ_i w_{3i} · Π_{m=3}^{M} exp(−α_m · y_i · G_m(x_i))
= ...
= Z_1 · Z_2 · ... · Z_{M−1} · Σ_i w_{Mi} · exp(−α_M · y_i · G_M(x_i))
= Π_{m=1}^{M} Z_m
This result shows that at each round an appropriate G_m can be selected to minimize Z_m, thereby reducing the training error fastest. Next, let us further bound the result above.

For binary classification, the following result holds:

Π_{m=1}^{M} Z_m = Π_{m=1}^{M} [ 2 · sqrt(e_m · (1 − e_m)) ] = Π_{m=1}^{M} sqrt(1 − 4γ_m²) ≤ exp(−2 · Σ_{m=1}^{M} γ_m²)

where γ_m = 1/2 − e_m.

Let us now prove this conclusion.

From the definition of Z_m and the conclusion at the start of this section:

Z_m = Σ_{i=1}^{N} w_{mi} · exp(−α_m · y_i · G_m(x_i))
    = Σ_{y_i = G_m(x_i)} w_{mi} · e^{−α_m} + Σ_{y_i ≠ G_m(x_i)} w_{mi} · e^{α_m}
    = (1 − e_m) · e^{−α_m} + e_m · e^{α_m}
    = 2 · sqrt(e_m · (1 − e_m))
    = sqrt(1 − 4γ_m²)

As for the inequality sqrt(1 − 4γ_m²) ≤ exp(−2γ_m²), it can be derived from the Taylor expansions of e^x and sqrt(1 − x) at x = 0.

It is worth mentioning that if we take the minimum of γ_1, γ_2, ..., γ_M and denote it by γ (so that γ_m ≥ γ > 0 for m = 1, 2, ..., M), then for all M:

(1/N) · Σ_{i=1}^{N} I(G(x_i) ≠ y_i) ≤ exp(−2Mγ²)

This conclusion shows that the training error of AdaBoost decreases at an exponential rate. Moreover, the AdaBoost algorithm does not need to know the lower bound γ in advance: AdaBoost is adaptive, in that it adapts to the training error rates of the individual weak classifiers.
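For the worked example of section 1.3, this bound can be checked numerically (my own addition; it uses the relation Z_m = 2 · sqrt(e_m · (1 − e_m)) derived above):

```python
import math

e = [0.3, 0.2143, 0.1820]                       # e_1, e_2, e_3 from section 1.3
Z = [2 * math.sqrt(em * (1 - em)) for em in e]  # Z_m = 2 * sqrt(e_m * (1 - e_m))
print([round(z, 4) for z in Z], round(math.prod(Z), 4))
# Z ≈ [0.9165, 0.8207, 0.7717], product ≈ 0.5805
```

The product of Z_1 ≈ 0.9165, Z_2 ≈ 0.8207 and Z_3 ≈ 0.7717 is about 0.58, and the training error of the final classifier in that example is 0, which is indeed below this bound.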

Finally, AdaBoost has another interpretation: it can be viewed as a binary classification learning method whose model is an additive model, whose loss function is the exponential loss, and whose learning algorithm is the forward stagewise algorithm. I will work through this derivation next month (December) and then update this article. Until then, you can refer to section 8.3 of "Statistical Learning Methods" or other relevant material.



3 Derivation of AdaBoost's exponential loss function

In fact, in step 3 of the AdaBoost algorithm flow in section 1.2 above, we construct a linear combination of the basic classifiers

f(x) = Σ_{m=1}^{M} α_m · G_m(x)

which is an additive model, and the AdaBoost algorithm is actually a special case of the forward stagewise algorithm. So the question is: what is an additive model, and what is the forward stagewise algorithm?

3.1 Additive model and forward stagewise algorithm

An additive model has the following form:

f(x) = Σ_{m=1}^{M} β_m · b(x; γ_m)

where b(x; γ_m) is called a base function, γ_m is the parameter of the base function, and β_m is the coefficient of the base function.

Given training data and a loss function L(y, f(x)), learning the additive model becomes a problem of empirical risk minimization, i.e. minimization of the loss function:

min_{β_m, γ_m} Σ_{i=1}^{N} L( y_i, Σ_{m=1}^{M} β_m · b(x_i; γ_m) )

This problem can be simplified as follows: working forward step by step, each step learns only one base function and its coefficient, gradually approaching the objective above. That is, each step optimizes only the following loss function:

min_{β, γ} Σ_{i=1}^{N} L( y_i, f_{m−1}(x_i) + β · b(x_i; γ) )

This optimization method is called the forward stagewise algorithm.

Below, let us look at the algorithm flow of the forward stagewise algorithm, with a code sketch after the list:

Input: training data set T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}; loss function L(y, f(x)); set of base functions {b(x; γ)}.
Output: additive model f(x).
Algorithm steps:
1. Initialize f_0(x) = 0.
2. For m = 1, 2, ..., M:
   a) Minimize the loss function

      (β_m, γ_m) = argmin_{β, γ} Σ_{i=1}^{N} L( y_i, f_{m−1}(x_i) + β · b(x_i; γ) )
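To make the forward stagewise idea concrete, here is a minimal Python skeleton (my own sketch, with the loss and the base-learner fitting left abstract as a callable; the names forward_stagewise and fit_base are illustrative assumptions, not from the original text):

```python
def forward_stagewise(X, y, fit_base, M):
    """Forward stagewise additive modeling (a sketch).

    fit_base(X, y, pred) must solve step (a): return (beta_m, base_fn_m) minimizing
    sum_i L(y_i, pred_i + beta * base_fn(x_i)) over the base-function family for the
    chosen loss L, with the current predictions pred held fixed.
    """
    pred = [0.0] * len(X)                # f_0(x) = 0
    model = []                           # list of (beta_m, base_fn_m)
    for _ in range(M):
        beta, base_fn = fit_base(X, y, pred)                          # step (a)
        pred = [p + beta * base_fn(xi) for p, xi in zip(pred, X)]     # f_m = f_{m-1} + beta * b
        model.append((beta, base_fn))
    return model

def predict(model, x):
    """Evaluate the learned additive model f(x) = sum_m beta_m * b_m(x)."""
    return sum(beta * base_fn(x) for beta, base_fn in model)
```

With the exponential loss L(y, f(x)) = exp(−y · f(x)) and decision stumps as base functions, this skeleton specializes to the AdaBoost procedure of section 1.2, which is precisely the interpretation mentioned at the end of section 2.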
