Ensemble learning is broadly divided into two categories: sequential (serial) methods, in which the learners are generated one after another, such as boosting; and parallel methods, in which the learners are generated independently, such as bagging and random forests.
These are described in turn below:
1. Boosting
The method first trains a base learner, then looks at how it performs on the training samples and pays extra attention to the samples it got wrong: the distribution of the training samples is adjusted accordingly, and the next learner is trained on the new sample distribution. Repeating this, the several base learners obtained are finally combined by a weighted vote.
The best-known representative of boosting is the famous AdaBoost.
To tell a story: our team sits in meeting room 1308 with a pile of problems to solve. The first person tries them; some problems he solves well and some he does not. We then pay more attention to the problems he could not solve. Concretely, we increase the weights of those problems so that the next person pays more attention to them, and we decrease the weights of the problems he did solve. We also give each person a different say according to his or her ability to solve problems: the higher the ability, the larger the weight. When, one day, a problem comes along that nobody knows the answer to, everyone expresses an opinion, and the final answer synthesizes those opinions, each one counted according to that person's weight. Here, every person is a weak classifier, and all of them together form a strong classifier. The mathematical expression is the weighted combination

    H(x) = Σ_{m=1}^{M} α_m h_m(x)    (1)

where the h_m are the individual learners, i.e. those people, and the α_m in front are their weights, i.e. how much say each one has (a small code sketch of this weighted vote follows the list of questions below).

I will not say too much more here; only these points need to be settled:
1. Since this is also a model, what is the optimization objective of the model?
2. We said each problem's weight changes according to whether a person solved it; how exactly does that weight change?
3. Each person is given a certain say (weight) according to his or her ability; what is the formula for it?
These are the main points I want to explain.
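As a minimal illustrative sketch (not code from the original post), the weighted vote in formula (1) can be written in a few lines of Python; the names base_clfs and alphas are made up for illustration:

```python
import numpy as np

def weighted_vote(x, base_clfs, alphas):
    """Combine base classifiers h_m(x) in {-1, +1} with weights alpha_m.

    Implements sign(H(x)) with H(x) = sum_m alpha_m * h_m(x), formula (1).
    base_clfs : list of callables, each mapping a sample x to -1 or +1
    alphas    : list of floats, the say (weight) of each classifier
    """
    score = sum(a * h(x) for a, h in zip(alphas, base_clfs))
    return np.sign(score)
```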
1.1 Minimizing the exponential loss function
The name tells us that we use the exponential loss function as the optimization objective of our model. Why? First, take the model H(x) trained on the training set and plug it into the optimization function, the exponential loss

    ℓ_exp(H | D) = E_{x~D}[ exp(−f(x) H(x)) ]    (2)

where f(x) ∈ {−1, +1} is the true label and D is the training distribution. If (2) is minimized over D, the desired H is obtained.
To minimize (2), take the partial derivative of (2) with respect to H(x). The label f(x) can take two values, f(x) = 1 or f(x) = −1, each with its own conditional probability; substituting both cases into the expectation gives

    ∂ℓ_exp(H | D) / ∂H(x) = −exp(−H(x)) P(f(x)=1 | x) + exp(H(x)) P(f(x)=−1 | x)    (3)

Setting (3) to 0 yields

    H(x) = (1/2) ln[ P(f(x)=1 | x) / P(f(x)=−1 | x) ]    (4)
Therefore

    sign(H(x)) = argmax_{y ∈ {−1, +1}} P(f(x)=y | x)    (5)
Ok! Formula (5) means that sign(H(x)) achieves the Bayes optimal error rate: if the exponential loss is minimized, the classification error rate is minimized as well. When we design a classifier, the criterion for whether it is any good is precisely whether it can minimize the classification error rate, and the exponential loss expresses the performance of the classification task consistently with that 0/1 error. In addition, the exponential loss function is continuous and differentiable. Taking the exponential loss as the optimization objective is therefore very reasonable.
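A quick numerical sanity check of (4) and (5), assuming nothing beyond the formulas above: for a fixed conditional probability p = P(f(x)=1 | x), minimize the pointwise exponential loss p·exp(−H) + (1−p)·exp(H) over H by brute force and compare with the closed form (1/2)·ln(p/(1−p)); its sign agrees with the Bayes decision.

```python
import numpy as np

def exp_loss(H, p):
    """Pointwise exponential loss at one x: E[exp(-f(x) H) | x]."""
    return p * np.exp(-H) + (1 - p) * np.exp(H)

for p in (0.2, 0.7, 0.9):
    grid = np.linspace(-5, 5, 100001)
    H_numeric = grid[np.argmin(exp_loss(grid, p))]   # brute-force minimizer
    H_closed = 0.5 * np.log(p / (1 - p))             # formula (4)
    bayes = 1 if p > 0.5 else -1                     # Bayes decision, formula (5)
    print(p, round(H_numeric, 3), round(H_closed, 3), np.sign(H_closed) == bayes)
```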
1.2 Determining the base classifiers and their weights
This section makes two things clear. First, how the base classifier of the current round is determined, that is, how one person forms his or her judgement. Second, how much say each person has, that is, how large the weight of each base classifier is. The goal throughout is to minimize the exponential loss function identified in Section 1.1.
That is,

    (α_m, h_m) = argmin_{α, h} ℓ_exp( H_{m−1} + α h | D )    (6)

In the formula, m denotes the m-th round, i.e. the m-th person making a judgement, and H_{m−1} is the strong classifier accumulated over the previous m−1 rounds; the total number of rounds M may be larger than the current m.
The strong classifier of round m is

    H_m(x) = H_{m−1}(x) + α_m h_m(x)    (7)

Substituting (7) into (6), formula (6) becomes

    ℓ_exp( H_{m−1} + α h | D ) = Σ_i w_mi · exp( −α f(x_i) h(x_i) )    (8)
where

    w_mi = exp( −f(x_i) H_{m−1}(x_i) )    (9)

w_mi is the weight of training sample i in round m (described in detail later). It has nothing to do with the α_m and h_m we are about to solve for; it is determined only by the strong classifier H_{m−1}(x) of the previous round and the sample label f(x_i), so we can treat it as a constant in the minimization.
First we find h_m(x), that is, how the m-th person forms his or her judgement:

    h_m(x) = argmin_h Σ_i w_mi · I( f(x_i) ≠ h(x_i) ),    e_m = Σ_i w_mi · I( f(x_i) ≠ h_m(x_i) ) / Σ_i w_mi    (10)

where I(·) is the indicator function and e_m is the weighted classification error rate. In other words, the m-th person tries to solve the current problems and leans toward the ones that have been hard to solve: the problems the predecessors struggled with carry larger weights, so solving those high-weight problems makes the classification error rate in (10) smaller. h_m(x) is exactly the base classifier with the smallest error rate on the weighted training data.
Once the base classifier of the m-th round is known, its weight α_m is required. Substituting (10) into (8), formula (8) becomes

    Σ_i w_mi · exp( −α f(x_i) h_m(x_i) ) = e^{−α} Σ_{f(x_i)=h_m(x_i)} w_mi + e^{α} Σ_{f(x_i)≠h_m(x_i)} w_mi = (1 − e_m) e^{−α} + e_m e^{α}    (11)

(up to the normalizing constant Σ_i w_mi). The middle steps are just splitting the sum into correctly and incorrectly classified samples plus some simple algebra. Follow them carefully, take the derivative of (11) with respect to α and set it to 0, and you get

    α_m = (1/2) ln( (1 − e_m) / e_m )    (12)

which is the weight, the say, of the base classifier.
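The two quantities in (10) and (12) are easy to compute once the round-m predictions are known. A minimal sketch (variable names are mine, not from the post), assuming labels in {−1, +1} and an error strictly between 0 and 1:

```python
import numpy as np

def error_and_alpha(y_true, y_pred, w):
    """Weighted error e_m (formula (10)) and classifier weight alpha_m (formula (12)).

    y_true, y_pred : arrays of labels in {-1, +1}
    w              : sample weights w_mi of the current round
    """
    w = w / w.sum()                              # normalize so the weights sum to 1
    e_m = np.sum(w * (y_true != y_pred))         # weighted classification error rate
    alpha_m = 0.5 * np.log((1 - e_m) / e_m)      # smaller error -> larger say
    return e_m, alpha_m
```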
1.3 Sample Weight Update
In fact much of this was already mentioned in 1.2 (I strongly feel that 1.2 and 1.3 cannot really be separated!). Here we comb through it once more. From (7) and (9) we can obtain

    w_{m+1,i} = w_mi · exp( −α_m f(x_i) h_m(x_i) )    (13)

(usually renormalized so that the weights sum to 1). This is the update of the sample weights, that is, the update of the problem weights in our story.
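The weight update (13) in code form; this sketch multiplies each weight by exp(−α_m f(x_i) h_m(x_i)) and renormalizes (the renormalization is a common convention rather than something derived above):

```python
import numpy as np

def update_weights(w, alpha_m, y_true, y_pred):
    """Formula (13): raise the weight of misclassified samples, lower the rest."""
    w_new = w * np.exp(-alpha_m * y_true * y_pred)   # y_true*y_pred is +1 if correct, -1 if wrong
    return w_new / w_new.sum()                       # keep the weights a distribution
```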
1.4 Algorithmic Flow
Input: training set D = {(x_1, f(x_1)), ..., (x_N, f(x_N))}, number of rounds M.
Procedure:
1) Initialize the weight distribution of the training data: w_1i = 1/N, i = 1, ..., N.
2) for m = 1, ..., M do
Learn the base classifier h_m(x) on the training data weighted by the distribution w_m; this is the first half of formula (10).
Compute the weighted classification error rate e_m of h_m(x) on the training set; the second half of formula (10).
Compute the coefficient α_m of h_m(x) with formula (12).
Update the weight distribution of the training data with formula (13).
end for
3) Construct the linear combination of the base classifiers, formula (1).
Output: the final classifier sign( Σ_{m=1}^{M} α_m h_m(x) ).
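Putting steps 1)-3) together, here is a compact, self-contained sketch of the whole flow using depth-1 decision trees (stumps) as base classifiers. It is an illustration of the procedure above under my own naming, not production code:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    """Train AdaBoost on labels y in {-1, +1}; returns base classifiers and their weights."""
    N = len(y)
    w = np.full(N, 1.0 / N)                     # 1) initialize sample weights
    clfs, alphas = [], []
    for _ in range(M):                          # 2) the M rounds
        h = DecisionTreeClassifier(max_depth=1)
        h.fit(X, y, sample_weight=w)            # base classifier on weighted data, formula (10)
        pred = h.predict(X)
        e = np.sum(w * (pred != y))             # weighted error rate, formula (10)
        if e == 0 or e >= 0.5:                  # stop if the stump is perfect or no better than chance
            break
        alpha = 0.5 * np.log((1 - e) / e)       # classifier weight, formula (12)
        w = w * np.exp(-alpha * y * pred)       # sample weight update, formula (13)
        w /= w.sum()
        clfs.append(h)
        alphas.append(alpha)
    return clfs, alphas

def adaboost_predict(X, clfs, alphas):
    """3) Weighted combination of the base classifiers, formula (1)."""
    score = sum(a * h.predict(X) for a, h in zip(alphas, clfs))
    return np.sign(score)
```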
1.5 Concluding remarks
Boosting is something I have only taken a quick look at; I would not dare to use it in a project yet, because I feel it overfits too easily. Many problems in life simply have no solution, yet this model insists on solving every problem: it forcibly asks the base classifiers to do what they cannot do, and the result is that the model blows up and overfits. From the angle of bias-variance decomposition, boosting mainly focuses on reducing bias. One more remark: bias roughly corresponds to prediction accuracy, and variance to prediction stability, which gives an intuitive way to explain bias and variance.
In short, boosting can build a strong ensemble from learners whose generalization ability is fairly weak; decision trees, for example, are well suited as base learners. That is all for boosting.
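For completeness, scikit-learn ships a ready-made AdaBoost implementation whose default base learner is a depth-1 decision tree (a stump). A minimal usage example; the dataset and parameter values here are arbitrary choices of mine:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```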