1. Ensemble Learning Overview
Ensemble learning is arguably the most popular family of machine learning algorithms; anyone who has taken part in a Kaggle competition will have felt how powerful ensembles are. An ensemble method is not a standalone machine learning algorithm: it builds multiple learners with other machine learning algorithms and combines them. Ensembles can be divided into homogeneous and heterogeneous ensembles. In a homogeneous ensemble all individual learners are of the same type, for example all decision trees; in a heterogeneous ensemble the individual learners are of different types. (At present the more popular ensembles are homogeneous, and are mostly built on decision trees or neural networks.)
An ensemble is composed of a number of weak learners. We want each weak learner to be reasonably accurate, and at the same time we want large differences between the learners; an ensemble built this way performs better. In practice, accuracy and diversity tend to conflict, so we have to find a good trade-off to keep the ensemble effective. Depending on how the individual learners are generated, ensemble methods fall into two categories:
1) There are strong dependencies between the individual learners, so they must be generated sequentially; the representative of this class is boosting (common algorithms include AdaBoost and GBDT).
2) There are no strong dependencies between the individual learners, so they can be generated in parallel; the representative of this class is bagging (a common algorithm is random forest).
Next, let's introduce these two kinds of ensemble algorithms.
2. Ensemble Algorithm: Boosting
For a classification problem, given a training set, it is much easier to find a weak learner than a strong one. The boosting approach obtains a series of weak learners and then combines them into a strong learner. Its working mechanism can be summarized as follows: train a base learner on the initial training set; adjust the distribution of the training samples according to how this base learner performed, so that the samples it got wrong receive more attention (controlled through sample weights); train the next base learner on the re-weighted sample set; and repeat until the number of base learners reaches a preset value T (T is a hyperparameter that we tune, usually by cross-validation). Finally, the T base learners are given different weights according to their predictive performance and combined into our strong learner.
The most representative boosting algorithm is AdaBoost. Its specific flow is as follows:
1) Initialize the weight distribution over the training set, giving every sample the same weight: $D_1 = (w_{11}, \dots, w_{1N})$ with $w_{1i} = 1/N$.
2) For $m = 1, \dots, T$: train a model on the training data set weighted by the distribution $D_m$ to obtain the base learner $G_m(x)$.
Compute the classification error rate of $G_m(x)$ on the training data set: $e_m = \sum_{i=1}^{N} w_{mi}\, I\big(G_m(x_i) \neq y_i\big)$.
Compute the coefficient of $G_m(x)$, $\alpha_m = \frac{1}{2} \ln \frac{1 - e_m}{e_m}$, which is then used as the weight of this base learner.
Update the weight distribution of the training data set to obtain $D_{m+1}$: $w_{m+1,i} = \frac{w_{mi}}{Z_m} \exp\big(-\alpha_m\, y_i\, G_m(x_i)\big)$,
where $Z_m$ is the normalization factor, $Z_m = \sum_{i=1}^{N} w_{mi} \exp\big(-\alpha_m\, y_i\, G_m(x_i)\big)$.
3) Construct a linear combination of the base learners: $f(x) = \sum_{m=1}^{T} \alpha_m G_m(x)$.
The final model is $G(x) = \operatorname{sign}\big(f(x)\big)$.
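To make the flow concrete, here is a minimal from-scratch sketch that mirrors the formulas above. It assumes numpy and scikit-learn are available, depth-1 CART trees ("stumps") as base learners, and labels coded as -1/+1; it is an illustration, not a reference implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Train up to T weighted stumps following the update rules above (y in {-1, +1})."""
    X, y = np.asarray(X), np.asarray(y)
    w = np.full(len(y), 1.0 / len(y))              # D_1: uniform sample weights
    learners, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)           # base learner G_m trained under D_m
        pred = stump.predict(X)
        e_m = np.sum(w[pred != y])                 # weighted classification error rate
        if e_m >= 0.5:                             # no better than random guessing: stop
            break
        alpha = 0.5 * np.log((1.0 - e_m) / max(e_m, 1e-12))  # alpha_m
        w = w * np.exp(-alpha * y * pred)          # emphasize misclassified samples
        w = w / w.sum()                            # normalize (divide by Z_m)
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    """G(x) = sign(sum_m alpha_m * G_m(x))."""
    scores = sum(a * g.predict(np.asarray(X)) for a, g in zip(alphas, learners))
    return np.sign(scores)
```

In practice one would normally use a library implementation (for example scikit-learn's AdaBoostClassifier); the point of the sketch is only to show where $\alpha_m$ and $Z_m$ enter.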
3. Ensemble Algorithm: Bagging
The core of bagging is to sample the original data set to obtain many different sample sets and use them to train the base learners. In the first section we mentioned that base learners need to balance accuracy and diversity, so the samples we draw need to serve both goals. Bagging draws samples with bootstrap sampling, whose rule is as follows: given a data set containing m samples, we randomly draw one sample and then put it back, so that it still has a chance of being drawn next time; after m such random draws we obtain a sample set of size m (such a set contains roughly 63.2% of the samples in the original data set). In this way we draw T sample sets and use them to train T base learners, and the final output is usually decided by voting over the predictions of these T base learners. Unlike AdaBoost, which by itself only handles binary classification, bagging can be applied to multi-class problems without modification.
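As a small illustration of the bootstrap rule and the roughly 63.2% figure, here is a sketch assuming numpy is available:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000                               # size of the original data set
idx = rng.integers(0, m, size=m)         # draw m indices with replacement
unique_fraction = len(np.unique(idx)) / m
print(f"fraction of original samples present: {unique_fraction:.3f}")  # about 0.632
```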
The most common bagging algorithm is random forest. Random forest is a variant of bagging that adds an attribute-perturbation strategy on top of bagging, mainly to increase the diversity among the base learners. The rule is: a traditional decision tree chooses the optimal attribute from the whole attribute set when splitting the samples, whereas random forest first randomly selects k attributes from the attribute set to form a sub-attribute set, and then chooses the optimal attribute within this sub-attribute set to split the samples. The parameter k controls how much randomness is introduced; the general recommendation is k = log2(d), where d is the size of the attribute set.
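As a sketch of this rule, scikit-learn's RandomForestClassifier (assumed available here, with toy data) exposes the sub-attribute-set size through max_features, and "log2" corresponds to the k = log2(d) recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)  # toy data
rf = RandomForestClassifier(
    n_estimators=200,       # T trees, each grown on a bootstrap sample
    max_features="log2",    # k = log2(d) features considered at each split
    random_state=0,
)
print(cross_val_score(rf, X, y, cv=5).mean())
```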
4. Ensemble Learning Output Combination
The output of the final strong learner depends on every weak learner, but in what ways can we combine the output values of these weak learners?
1) Averaging method
For numerical regression problems, the averaging method is usually used: the outputs of the weak learners are summed and averaged, and that average is the final output. Sometimes each learner is also given a weight, in which case a weighted average produces the final output value.
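A tiny numpy sketch of simple and weighted averaging, with made-up prediction values and weights:

```python
import numpy as np

preds = np.array([2.9, 3.1, 3.4])     # outputs of three weak regressors (illustrative)
weights = np.array([0.2, 0.3, 0.5])   # learner weights, summing to 1

simple_average = preds.mean()         # plain averaging
weighted_average = weights @ preds    # weighted averaging
print(simple_average, weighted_average)
```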
2) Voting method
For classification problems the voting method is usually used. The simplest vote takes the class predicted by the most weak learners as our output. A stricter vote requires not only that the class be predicted the most, but also that it account for more than half of the predictions. Finally, each weak learner can be given a weight, and the votes are multiplied by the corresponding weight when tallied (for example, in a class election the monitor's vote counts as five votes while an ordinary student's counts as one).
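A tiny sketch of plurality and weighted voting over class predictions, again with made-up labels and weights:

```python
import numpy as np

preds = np.array([0, 1, 1, 2, 1])     # class predicted by each of five weak learners
weights = np.array([5, 1, 1, 1, 1])   # e.g. the "monitor" casts five votes

plurality_winner = np.bincount(preds).argmax()                  # most frequent class
weighted_winner = np.bincount(preds, weights=weights).argmax()  # weight-summed votes
print(plurality_winner, weighted_winner)
```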
3) Learning method
The averaging and voting methods are relatively simple and can sometimes produce prediction errors, which is why the learning method, such as stacking, was developed. With stacking, the outputs of all weak learners are used as inputs and a model is built on top of them, so that a machine learns how to combine the outputs. Sometimes we train several strong learners, for example a random forest and an AdaBoost model, then take the outputs of these two learners as inputs (so we have two input features) and train a further learner on them to produce the final prediction.
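One possible stacking sketch, assuming scikit-learn is available and using toy data: a random forest and an AdaBoost model produce the inputs for a logistic-regression meta-learner.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, random_state=0)  # toy data
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("ada", AdaBoostClassifier(n_estimators=100, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # the learner trained on the two outputs
    cv=5,                                  # out-of-fold predictions become its inputs
)
stack.fit(X, y)
print(stack.score(X, y))
```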
5. AdaBoost and Random Forest Compared
The weak learners commonly used with AdaBoost are decision trees and neural networks (in theory any learner can serve as the base learner). When decision trees are used, AdaBoost classification uses CART classification trees and AdaBoost regression uses CART regression trees. Like other boosting algorithms, AdaBoost focuses on the model's bias: the iterative process aims to reduce bias, so the bias is generally small. This does not mean the variance of AdaBoost is necessarily large or that it overfits easily; in fact overfitting can be avoided by adjusting the complexity of the base model, for example by limiting the tree depth or the number of samples per leaf node when a decision tree is the base learner. AdaBoost itself only handles binary classification, with the final output determined by the sign function; with some modification the algorithm can also be used for regression problems, but multi-class problems are more complicated.
Key Benefits of AdaBoost:
1) High classification accuracy when AdaBoost is used as a classifier.
2) Under the AdaBoost framework, a wide variety of classification and regression models can be used to construct the base learner.
3) For simple binary classification problems the construction is simple and the results are easy to interpret.
4) Not prone to overfitting.
AdaBoost's main drawbacks:
1) Sensitive to abnormal samples: outliers may receive ever larger weights during the iterations, which hurts the prediction accuracy of the final strong learner.
2) AdaBoost itself only handles binary classification; multi-class problems require further modifications.
The weak learners commonly used in the random forest algorithm are also decision trees and neural networks. With decision trees, random forest uses CART classification and regression trees to handle classification and regression, and usually uses voting or averaging to determine the final output. The voting rule for classification also means random forest can be applied to multi-class problems without modification. Like other bagging algorithms, random forest emphasizes reducing the variance of the model, so its generalization ability is strong, but it can sometimes show a large training error, which can be addressed by increasing the complexity of the model. In addition, random forest uses random sub-feature selection, and the size of the sub-feature set also affects the variance and bias of the model; it is generally considered that the larger the sub-feature set, the smaller the bias of the model. Because the random forest model is simple and works well, the algorithm has many variants; these variants can handle classification and regression problems, and can also be used for feature transformation, anomaly detection, and so on.
For example, extra trees is a generalized form of random forest. It changes two things: first, it no longer uses bootstrap sampling to draw random samples but takes the original set as the training sample; second, it selects the split feature directly at random (equivalent to a sub-feature set containing only one element). The variance of extra trees is smaller than that of random forest, so its generalization ability is stronger, but its bias is larger.
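A short comparison sketch, assuming scikit-learn is available and using toy data; note that scikit-learn's ExtraTreesClassifier randomizes split thresholds rather than literally using a single candidate feature per split, but it still trains on the full set (bootstrap=False by default) and illustrates the contrast:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)  # toy data
for model in (RandomForestClassifier(n_estimators=200, random_state=0),
              ExtraTreesClassifier(n_estimators=200, random_state=0)):
    print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean())
```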
Key benefits of random forests:
1) Training can be highly parallelized, so training is faster than AdaBoost.
2) Because splits use a random sub-feature set, the algorithm remains efficient even with high-dimensional features.
3) After training, the importance of each feature for the output can be reported (see the sketch after this list).
4) Thanks to random sampling, the variance of the trained model is small and the generalization ability is strong.
5) The algorithm is easier to implement than boosting.
6) Insensitive to partially missing features.
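A sketch of benefit 3), reading feature_importances_ from a fitted forest (scikit-learn assumed available, toy data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=3, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
for i, importance in enumerate(rf.feature_importances_):
    print(f"feature {i}: importance {importance:.3f}")
```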
Main disadvantages of random forests:
1) On data sets with a lot of noise, the random forest model is prone to overfitting.
2) Features with many distinct values tend to have a larger influence on the decisions of the random forest, which can affect the fit of the model.
Finally, a word on the claim that bagging focuses on reducing variance while boosting focuses on reducing bias. For bagging, the samples are resampled and a similar model is trained on each sub-sample set as a base learner. Because the sub-sample sets are similar and the same kind of model is used, the resulting base learners have approximately equal bias and variance; averaging similar models cannot reduce the bias they share, but averaging the base learners does reduce the variance of the model, while reinforcing what the base learners agree on. For boosting, the core of the algorithm is to pay attention to bias: the optimization process minimizes the loss of the weak learners step by step, which is by its nature an optimization aimed at reducing bias, so it lowers the bias of the model. Boosting can also reduce the variance of the model somewhat, but the effect is not as obvious.