1. Ensemble methods (bagging and boosting)
Ensemble methods combine a series of individual algorithms (such as decision trees, SVMs, etc.) to make the model more accurate. Here we first introduce two kinds: bagging (representative algorithm: random forest) and boosting (representative algorithm: AdaBoost, the core of this chapter).
The bagging idea, taking random forest as an example:
Assume the sample set contains 100 samples in total, and each sample has 10 features (that is, the dimension is 10); the sampling ratio is generally 60%-80%.
Step 1: We randomly draw 60 samples (note: sampling with replacement) to build one decision tree; repeating this random sampling 60 times eventually yields 60 decision trees.
Step 2: When building each decision tree, we also randomly sample the features, selecting 6 of the 10 features for that tree.
Step 3: Using Steps 1 and 2 we build 60 different decision tree models, and the final result is decided by combining the 60 trees (for example, by majority vote), as illustrated below (image from the Internet):
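A minimal sketch of these three steps in Python (my own illustrative code, not part of the original text; the data here is random toy data, and scikit-learn's DecisionTreeClassifier stands in for the per-tree learner):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    # Toy data matching the setting above: 100 samples, 10 features, binary labels.
    X = rng.normal(size=(100, 10))
    y = rng.integers(0, 2, size=100)

    n_trees, n_samples, n_features = 60, 60, 6
    trees, feature_sets = [], []
    for _ in range(n_trees):
        idx = rng.choice(len(X), size=n_samples, replace=True)          # Step 1: bootstrap sample (with replacement)
        feats = rng.choice(X.shape[1], size=n_features, replace=False)  # Step 2: random feature subset for this tree
        tree = DecisionTreeClassifier(max_depth=3).fit(X[np.ix_(idx, feats)], y[idx])
        trees.append(tree)
        feature_sets.append(feats)

    # Step 3: aggregate the 60 trees by majority vote.
    votes = np.stack([t.predict(X[:, f]) for t, f in zip(trees, feature_sets)])
    y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
    print("training accuracy:", (y_pred == y).mean())

In practice one would simply use sklearn.ensemble.RandomForestClassifier, which implements the same idea (with the feature subsampling done per split rather than per tree).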
The boosting idea: in a classification problem, boosting learns multiple classifiers by changing the weights of the training samples and then combines these classifiers linearly to improve classification performance (AdaBoost is explained later).
The similarities and differences between bagging and boosting:
Similarity: both are ensemble algorithms, that is, they combine multiple classifiers to improve classification accuracy.
Difference, at the classifier level: suppose a 20-person working group has to reach a resolution on a particular issue in a meeting. Bagging treats everyone the same: no matter how much work experience or ability each person has, the majority opinion wins (or the opinions are averaged). Boosting, on the other hand, gives each engineer a weight based on a comprehensive assessment of ability and experience; that is, the opinions of stronger, more experienced members carry more weight, and the decision is made on that basis (note that this example is about the classifier level only).
At the sample level: boosting additionally gives each training sample a weight, while bagging treats all samples uniformly.
At the model structure level: bagging makes its decisions in parallel (analogous to a parallel circuit), while boosting makes its decisions serially.
2. AdaBoost algorithm
2.1 AdaBoost principle and formation process
For my notes on the AdaBoost algorithm, I want to organize and explain things in a somewhat reversed order. When I studied the algorithm directly from its mathematical expressions, I was always left with many doubts, so it took me quite a while to accept it. Below I formally begin to organize my understanding of AdaBoost.
AdaBoost is a kind of boosting algorithm. Its role is to linearly combine a series of weak classifiers into a strong classifier. You can think of AdaBoost as a boss and the weak classifiers (for example, single-level decision trees, i.e., decision stumps) as the employees. Each employee has their own strengths, and AdaBoost's job as the boss is to combine these employees in some way so that the work gets done better; in machine learning terms, it makes classification or regression tasks perform better, which is exactly what a boosting method does. So how does AdaBoost accomplish this?
In Li Hang's Statistical Learning Methods, the author raises two questions about boosting methods, and the principle of AdaBoost is essentially its answer to these two questions.
Question 1: How does each round change the weights (or probability distribution) of the training data?
AdaBoost: It increases the weights of the samples that were misclassified by the previous round's classifier and reduces the weights of the samples that were classified correctly. In this way, the data that were not classified correctly receive greater attention in the next round because their weights have been increased.
Question 2: How are the weak classifiers combined into a strong classifier?
AdaBoost: By weighted majority voting, that is, it increases the weight of weak classifiers with a small classification error rate so that they play a larger role in the vote, and reduces the weight of weak classifiers with a large classification error rate so that they play a smaller role in the vote.
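Quantitatively (these are the standard AdaBoost update rules, which are derived in Section 3 below), the two answers correspond to:

    \alpha_m = \frac{1}{2}\ln\frac{1 - e_m}{e_m}, \qquad
    w_{m+1,\,i} = \frac{w_{m,\,i}}{Z_m}\,\exp\big(-\alpha_m\, y_i\, G_m(x_i)\big)

where e_m is the weighted classification error rate of the m-th weak classifier G_m and Z_m is a normalization factor; a small e_m gives a large voting weight alpha_m, and misclassified samples (those with y_i G_m(x_i) = -1) get their weights enlarged.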
Understanding AdaBoost comes down to understanding the mathematical expressions behind these two answers.
The final mathematical expression of AdaBoost is:
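Written out in the usual notation (this is the standard form from Li Hang's book, with G_m(x) the m-th weak classifier and alpha_m its voting weight):

    f(x) = \sum_{m=1}^{M} \alpha_m\, G_m(x), \qquad G(x) = \mathrm{sign}\big(f(x)\big)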
where m denotes the m-th classifier and M is the total number of classifiers;
x: represents a sample from the sample set.
The formation process of formula (1) above is as follows:
Input: training data set T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, where x_i ∈ X ⊆ R^n and y_i ∈ {-1, +1}; a weak learning algorithm;
Output: the final classifier G(x).
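Before walking through the book's example, here is a minimal sketch of the whole procedure in Python (my own illustrative code, not from the book; the names best_stump and adaboost are mine, and the weak learner is assumed to be a one-dimensional decision stump of the form x < v vs. x > v, matching the example in Section 2.2):

    import numpy as np

    def best_stump(x, y, w):
        """Weak learner: pick the decision stump (x < v -> s, x > v -> -s) with the
        lowest weighted classification error under the sample weights w."""
        best = None
        for v in np.arange(x.min() + 0.5, x.max() + 0.5, 1.0):   # candidate thresholds 0.5, 1.5, ...
            for s in (1, -1):
                pred = np.where(x < v, s, -s)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, v, s)
        err, v, s = best
        return err, (lambda x_new, v=v, s=s: np.where(x_new < v, s, -s))

    def adaboost(x, y, M=3):
        """Plain AdaBoost for labels in {-1, +1}; assumes the weighted error stays in (0, 1)."""
        N = len(x)
        w = np.full(N, 1.0 / N)                    # start from uniform sample weights
        alphas, stumps = [], []
        for _ in range(M):
            e, g = best_stump(x, y, w)             # classifier with the lowest weighted error
            alpha = 0.5 * np.log((1 - e) / e)      # classifier weight (natural logarithm)
            w = w * np.exp(-alpha * y * g(x))      # enlarge weights of misclassified samples
            w /= w.sum()                           # normalize (the Z_m factor)
            alphas.append(alpha)
            stumps.append(g)
        # Final strong classifier: sign of the weighted sum of the weak classifiers.
        return lambda x_new: np.sign(sum(a * g(x_new) for a, g in zip(alphas, stumps)))

Run on the 10-point data set of Section 2.2, this sketch should reproduce the three stumps at thresholds 2.5, 8.5 and 5.5 found below; for general feature vectors, sklearn.ensemble.AdaBoostClassifier implements the same algorithm.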
2.2 A step-by-step understanding of the AdaBoost algorithm above
When m = 1, the first classifier starts learning on the training data set.
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
Step 1: For the first classifier, we assume that the weights of all samples are equal. We use the example from Li Hang's Statistical Learning Methods to understand this step.
Example: for the following data set, assume the weak classifiers are generated by x < v or x > v, where the threshold v is chosen so that the classifier has the lowest classification error rate on the training data set. Use the AdaBoost algorithm to learn a strong classifier.
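For reference, the data set of that example (Example 8.1 in Li Hang's book; it is consistent with all the thresholds and weights computed below) is:

    x:  0   1   2   3   4   5   6   7   8   9
    y:  1   1   1  -1  -1  -1   1   1   1  -1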
Step 2: Determine the basic classifier G_1(x).
Since the sample size here is small, we can compute by hand: for example, compute the classification error rate when the threshold is 1.5, 2.5, 3.5, ..., 9.5. We find that the classification error rate is lowest when v = 2.5:
Step 3: Compute the classification error rate of this classifier on the training data.
Step 4: Compute the weight of the classifier (note that the logarithm here is the natural logarithm, base e).
Step 6: Determine the final classifier.
Classifying the training data set with the final classifier obtained so far, 3 samples are still misclassified.
Step 7: Compute the sample weights for the next round.
D2= (0.0715, 0.0715, 0.0715, 0.0715, 0.0715, 0.0715, 0.1666, 0.1666, 0.1666, 0.0715)
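Spelled out, the numbers behind this first round (they follow from the standard formulas and are consistent with the D2 above) are:

    G_1(x) = 1 \text{ for } x < 2.5,\; -1 \text{ for } x > 2.5 \quad \text{(misclassifies } x = 6, 7, 8\text{)}
    e_1 = 3 \times 0.1 = 0.3, \qquad \alpha_1 = \frac{1}{2}\ln\frac{1 - 0.3}{0.3} \approx 0.4236
    w_{2i} = \frac{w_{1i}\exp(-\alpha_1 y_i G_1(x_i))}{Z_1} \approx 0.0715 \text{ (correct)}, \; 0.1666 \text{ (misclassified)}
    f_1(x) = 0.4236\, G_1(x)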
When m = 2, combine the second weak classifier.
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
Repeat Steps 2 through 7 above.
The threshold with the lowest classification error rate is 8.5
At this point, classifying the training data set with the combined classifier obtained so far, 3 points are still misclassified.
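For the record, the corresponding numbers in this round (computed the same way) are approximately:

    G_2(x) = 1 \text{ for } x < 8.5,\; -1 \text{ for } x > 8.5 \quad \text{(misclassifies } x = 3, 4, 5\text{, each with weight } 0.0715 \text{ in } D_2\text{)}
    e_2 \approx 3 \times 0.0715 = 0.2143, \qquad \alpha_2 \approx 0.6496, \qquad f_2(x) = 0.4236\,G_1(x) + 0.6496\,G_2(x)
    D_3 \approx (0.0455, 0.0455, 0.0455, 0.1667, 0.1667, 0.1667, 0.1060, 0.1060, 0.1060, 0.0455)

and sign(f_2(x)) still misclassifies the three points x = 3, 4, 5.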
When m = 3, combine the third weak classifier.
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
Repeat Steps 2 through 7 above once more.
The threshold with the lowest classification error is 5.5
Classifying the sample set with the classifier obtained at this point, the training data set has 0 misclassified points, that is, the error rate is 0. We can stop here, and this G(x) can be used as the strong classifier for this data set.
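Again for the record, the numbers of this last round (computed the same way) are approximately:

    G_3(x) = -1 \text{ for } x < 5.5,\; 1 \text{ for } x > 5.5 \quad \text{(misclassifies } x = 0, 1, 2, 9\text{, each with weight } 0.0455 \text{ in } D_3\text{)}
    e_3 \approx 4 \times 0.0455 = 0.1820, \qquad \alpha_3 \approx 0.7514
    f_3(x) = 0.4236\,G_1(x) + 0.6496\,G_2(x) + 0.7514\,G_3(x), \qquad G(x) = \mathrm{sign}\big(f_3(x)\big)

and sign(f_3(x)) classifies all 10 training points correctly.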
2.3 Summary
By observing the numerical patterns in the case above, we can now understand the algorithm from a qualitative perspective.
(1) Why will a sample that was misclassified by the previous classifier tend not to be misclassified by the next classifier?
Observing the numbers, the weights of those samples are enlarged for the next classifier (when m = 2, the misclassified samples carry weight 0.1666). And, as mentioned in Step 2 of the procedure, when selecting a classifier we must choose the one with the lowest weighted classification error rate (the rationale for this selection is given in formula (11) below). Clearly, if the heavily weighted samples were misclassified again, the requirement of Step 2 could not be met.
(2) Why is the error rate 0 when m = 3, and how can this be understood qualitatively?
The weights alpha of the classifiers combined by m = 3 increase gradually; that is, we give a higher weight to the classifiers with a lower classification error rate. Judging from the values, whether the final f(x) > 0 or f(x) < 0 is determined jointly by the three classifiers. If the first classifier wrongly assigns a sample of class 1 to class -1, it makes a negative contribution to the final classifier, but the subsequent classifiers, with their larger weights, offset that negative contribution, so the final result is still class 1.
3. Understanding the AdaBoost algorithm
As mentioned in Li Hang's Statistical Learning Methods, the AdaBoost algorithm is equivalent to the binary classification learning method whose model is the additive model, whose loss function is the exponential loss, and whose learning algorithm is the forward stagewise algorithm. This statement actually explains where the principle of the AdaBoost algorithm comes from.
3.1 The general expression of the additive model is as follows:
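Written out in the book's notation, the additive model is:

    f(x) = \sum_{m=1}^{M} \beta_m\, b(x; \gamma_m)

where b(x; \gamma_m) is a basis function, \gamma_m its parameters, and \beta_m its coefficient.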
3.2 The forward stagewise algorithm:
After choosing the model, our goal is to train it on the training data set (essentially, to learn the parameters of the model). How do we judge whether the model is well trained? Usually by empirical risk minimization, that is, by minimizing a loss function. Given the training data set and the loss function L(y, f(x)), our goal is to minimize this loss function, in the following form:
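In this notation, the empirical risk minimization problem (formula (3)) should read:

    \min_{\beta_m,\, \gamma_m} \; \sum_{i=1}^{N} L\Big(y_i,\; \sum_{m=1}^{M} \beta_m\, b(x_i; \gamma_m)\Big)   (3)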
Since f(x) is an additive model, our task is to optimize the model parameters so that the loss function is minimized. We can optimize the terms of the additive model one at a time, from front to back, minimizing the loss for one term at each step, so the problem is transformed into the following form:
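The per-step subproblem (formula (4)) should then read: at step m, given f_{m-1}(x), solve

    (\beta_m, \gamma_m) = \arg\min_{\beta,\, \gamma} \sum_{i=1}^{N} L\big(y_i,\; f_{m-1}(x_i) + \beta\, b(x_i; \gamma)\big)   (4)

and update f_m(x) = f_{m-1}(x) + \beta_m\, b(x; \gamma_m).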
The conversion from the above formula (3) to formula (4) is the core of the forward stagewise algorithm.
3.3 Deriving the AdaBoost model using the additive model and the forward stagewise algorithm above
The AdaBoost model is:
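In the standard notation, this is the additive model over the weak classifiers:

    f(x) = \sum_{m=1}^{M} \alpha_m\, G_m(x)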
The loss function is:
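namely the exponential loss, in its usual form:

    L\big(y, f(x)\big) = \exp\big(-y\, f(x)\big)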
Suppose that after m - 1 rounds of the forward stagewise algorithm we have already obtained f_{m-1}(x), as in formula (7); in the m-th round we iterate via the following formula (8).
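Presumably these two formulas are:

    f_{m-1}(x) = \alpha_1 G_1(x) + \cdots + \alpha_{m-1} G_{m-1}(x)   (7)
    f_m(x) = f_{m-1}(x) + \alpha_m\, G_m(x)   (8)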
With formula (8), we obtain alpha_m and G_m(x) by minimizing the corresponding loss function, which is formula (9) below.
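In the book's form, formula (9) is:

    \big(\alpha_m, G_m(x)\big) = \arg\min_{\alpha,\, G} \sum_{i=1}^{N} \exp\Big(-y_i\,\big(f_{m-1}(x_i) + \alpha\, G(x_i)\big)\Big)   (9)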
To minimize formula (9), the factor that does not involve alpha or G can be pulled out of the exponent and treated as a weight, which gives formula (10).
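Formula (10) should be the rewritten objective with this weight pulled out:

    \sum_{i=1}^{N} \bar{w}_{mi}\, \exp\big(-y_i\, \alpha\, G(x_i)\big), \qquad \bar{w}_{mi} = \exp\big(-y_i\, f_{m-1}(x_i)\big)   (10)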
That is, the weight w̄_mi depends neither on alpha nor on G, so minimizing the loss function with respect to G is equivalent to the following formula (11):
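Formula (11) should then be:

    G_m^{*}(x) = \arg\min_{G} \sum_{i=1}^{N} \bar{w}_{mi}\, I\big(y_i \neq G(x_i)\big)   (11)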
Formula (11) explains Step 2 of the procedure above: each time we look for a classifier, we need to find the one with the lowest weighted classification error rate, which is exactly the one that minimizes the loss function.
For alpha, take the derivative of formula (9) (in the rewritten form of formula (10)) with respect to alpha and set the derivative equal to 0. Solving this, and using the weights w̄_mi already known from formula (10), we obtain the classifier weight alpha_m.
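Spelled out (this is the standard derivation): splitting the sum in formula (10) over correctly and incorrectly classified samples, differentiating with respect to alpha, and setting the derivative to 0 gives

    \alpha_m^{*} = \frac{1}{2}\ln\frac{1 - e_m}{e_m}, \qquad
    e_m = \frac{\sum_{i=1}^{N} \bar{w}_{mi}\, I\big(y_i \neq G_m(x_i)\big)}{\sum_{i=1}^{N} \bar{w}_{mi}}

and the weight update \bar{w}_{m+1,\,i} = \bar{w}_{mi}\,\exp\big(-y_i\,\alpha_m\,G_m(x_i)\big) then follows from the definition of \bar{w}_{mi}.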
The resulting weights w̄ differ from the weights w in the AdaBoost algorithm only by the normalization factor.