Summary of ensemble learning algorithms -- boosting and bagging

1. Ensemble Learning Overview

1.1 Overview

Ensemble learning generally achieves higher accuracy than single machine-learning algorithms; the drawback is that training the model can be more complicated and less efficient. Ensemble methods currently fall into two main families: boosting-based and bagging-based. Representative boosting algorithms include AdaBoost, GBDT, and XGBoost; the main bagging representative is the random forest.

1.2 Main ideas of ensemble learning
The main idea of ensemble learning is to train multiple classifiers by some means, where each of these classifiers is required to be a weak classifier, and then to combine the classifiers for a joint prediction. The core questions are how to train the individual weak classifiers and how to combine them.

1.3. Weak classifier selection in ensemble learning
Weak classifiers are generally used in order to balance out errors: if one classifier is too strong, it influences the subsequent results too much, and in serious cases prevents the subsequent classifiers from contributing to the classification. A common criterion for a weak classifier is simply an error rate below 0.5; models such as logistic regression, SVMs, or neural networks can be used in this role.

1.4. Generation of multiple classifiers
Multiple classifiers can be generated by training each one on randomly selected data, or by continually adjusting the weights of misclassified training samples so that each adjustment yields a new classifier.

1.5. How to combine multiple weak classifiers
Common ways to combine base classifiers include simple majority voting, weighted voting, Bayesian voting, combination based on D-S evidence theory, and combination based on different feature subsets.
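
To make the first two combination schemes concrete, here is a minimal sketch of simple majority voting and weighted voting over the label predictions of several base classifiers; the predictions and weights are only illustrative.

```python
import numpy as np

def majority_vote(predictions):
    """Simple majority vote; predictions is an (n_classifiers, n_samples) array of labels."""
    predictions = np.asarray(predictions)
    # For each sample, pick the label that appears most often across classifiers.
    return np.array([np.bincount(col).argmax() for col in predictions.T])

def weighted_vote(predictions, weights):
    """Weighted vote: each classifier's vote counts with its weight."""
    predictions = np.asarray(predictions)
    weights = np.asarray(weights, dtype=float)
    n_samples = predictions.shape[1]
    scores = np.zeros((n_samples, predictions.max() + 1))
    for preds, w in zip(predictions, weights):
        scores[np.arange(n_samples), preds] += w
    return scores.argmax(axis=1)

# Illustrative predictions from three weak classifiers on four samples.
preds = [[0, 1, 1, 0],
         [0, 0, 1, 1],
         [1, 1, 1, 0]]
print(majority_vote(preds))                    # -> [0 1 1 0]
print(weighted_vote(preds, [0.2, 0.6, 0.2]))   # -> [0 0 1 1], the heavily weighted classifier wins
```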

2. Boosting algorithm

2.1 Basic Concepts

The boosting method improves the accuracy of a weak classification algorithm by constructing a series of prediction functions and then combining them, in a certain way, into a single prediction function. It is a framework algorithm: it operates on the sample set to obtain sample subsets, then trains the weak classification algorithm on each subset to generate a series of base classifiers. It can therefore be used to raise the recognition rate of other weak classification algorithms by plugging the weak algorithm into the boosting framework as the base classification algorithm. The framework's operations on the training set produce different training subsets, and a base classifier is generated on each subset with the base classification algorithm, so that after N training rounds N base classifiers have been produced. The boosting framework then fuses the N base classifiers with weights to produce the final classifier. The recognition rate of any individual base classifier is not necessarily high, but the combined result has a high recognition rate, which is how the weak classification algorithm is improved. The base classifiers may all come from the same classification algorithm or from different ones; generally they are unstable weak classification algorithms such as neural networks (BP) or decision trees (C4.5).
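
As a quick illustration of plugging a weak learner into a boosting framework, here is a minimal sketch using scikit-learn's AdaBoostClassifier with a depth-1 decision tree (a decision stump) as the base classifier; the synthetic dataset and parameter values are only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A decision stump (depth-1 tree) is a classic weak classifier.
stump = DecisionTreeClassifier(max_depth=1)

# The boosting framework trains n_estimators stumps, each on a reweighted view
# of the training data, and fuses them with per-classifier weights.
# (In scikit-learn versions before 1.2 the argument is named base_estimator.)
boosted = AdaBoostClassifier(estimator=stump, n_estimators=50, random_state=0)
boosted.fit(X_train, y_train)

print("single stump accuracy:", stump.fit(X_train, y_train).score(X_test, y_test))
print("boosted accuracy:     ", boosted.score(X_test, y_test))
```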

2.2 AdaBoost

AdaBoost is one of the most representative boosting algorithms. The basic idea is to train a classifier on the current weight distribution of the training data, compute that classifier's weight from its error rate, then update the distribution of the training data, and iterate until the number of iterations is reached or the loss function falls below a certain threshold.

AdaBoost Algorithm Flow:
Suppose the training data set is T = {(x1, y1), (x2, y2), (x3, y3), (x4, y4), (x5, y5)}, where yi ∈ {-1, +1}.

1. Initialize the weight distribution of the training data
The initial weight distribution of the training data is D1 = {w11, w12, w13, w14, w15}, where w1i = 1/N; that is, a uniform distribution.

2. Select the basic classifier
Here the simplest linear classifier y = ax + b is chosen; once the form of the classifier is fixed, its parameters are obtained by minimizing the classification error.

3. Calculate the classifier coefficient and update the data weights
The weighted error rate of the classifier on the current distribution, e1, is the sum of the weights of the misclassified samples. From it, the coefficient of this classifier is obtained; basic AdaBoost computes it as α1 = (1/2) ln((1 - e1) / e1). The weight distribution of the training data is then updated as w2,i = w1,i · exp(-α1 · yi · G1(xi)) / Z1, where G1 is the classifier just trained and Z1 is a normalization factor that makes the new weights sum to 1. (Formulas as in Li Hang's Statistical Learning Methods.)

4. Combination of classifiers

The final classifier is a weighted combination of the base classifiers: G(x) = sign(Σm αm · Gm(x)).

Of course, this combination uses the coefficient of each classifier, and each coefficient is based on that classifier's error rate, so what AdaBoost ultimately comes down to is how the error rate is used to compute the coefficients and how the training-data weights are updated.
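
Below is a compact from-scratch sketch of the AdaBoost loop just described, for binary labels in {-1, +1}. It uses scikit-learn decision stumps as the weak classifiers purely for convenience, and the data is synthetic and illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=20):
    """Train AdaBoost; y must take values in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # step 1: uniform initial weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        # Step 2: fit a weak classifier to the weighted training data.
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        # Step 3: weighted error rate, classifier coefficient, weight update.
        err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)    # alpha_m = 1/2 ln((1 - e_m) / e_m)
        w = w * np.exp(-alpha * y * pred)        # misclassified samples gain weight
        w /= w.sum()                             # normalization (the Z_m factor)
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # Step 4: weighted combination, G(x) = sign(sum_m alpha_m * G_m(x)).
    agg = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(agg)

# Illustrative usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
stumps, alphas = adaboost_fit(X, y)
print("training accuracy:", np.mean(adaboost_predict(stumps, alphas, X) == y))
```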

5. Some adjustable aspects of AdaBoost

The parameters and formula choices that can be adjusted in AdaBoost include:

* How the weak classifiers are selected
* How to compute the classifier coefficients from the error rate
* How to compute the weight distribution of the training data
* How the weak classifiers are combined
* The number of iterations
* The threshold chosen for the loss function

3. Bagging algorithm

Bagging is short for bootstrap aggregating: training data is selected by random sampling with replacement, a classifier is built on each sample, and the classifiers are combined at the end. The random forest is taken as the example here.
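
A small sketch of the bootstrap sampling that bagging relies on: drawing n samples with replacement leaves, on average, roughly 37% of the data out of each bag, and this out-of-bag data can later be used for error estimation. The sample size used here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
indices = np.arange(n)

# One bootstrap sample: n draws *with* replacement from the training indices.
bag = rng.choice(indices, size=n, replace=True)

# Indices never drawn form the out-of-bag (OOB) set for this bag.
oob = np.setdiff1d(indices, bag)
print("fraction of data left out of the bag:", len(oob) / n)  # about 0.37 on average
```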
Overview of random forest algorithms

The random forest algorithm was proposed by Breiman (2001). The basic idea is to construct many decision trees to form a forest and then let these decision trees jointly decide the output category. The random forest algorithm is built on the construction of single decision trees and is an extension and improvement of the single-decision-tree algorithm. There are two random processes in the random forest algorithm: first, the input data for each decision tree is randomly selected, with replacement, from the whole training set; second, the features considered when building each decision tree are a randomly selected subset of the overall feature set. These two random processes allow random forests to largely avoid overfitting.

The specific process of the random forest algorithm (a code sketch of these steps follows the list):

1. Select n samples from the training data, with replacement, as the input for building one tree; n is generally much smaller than the total amount of training data N. Because of this sampling, part of the data is never drawn; this part is called the out-of-bag data, and it can be used for error estimation.

2. After selecting the input training data, build a decision tree. Specifically, at each split node, m features are randomly selected from the overall feature set of M features and the split is constructed from them; normally m is much smaller than M.

3. While constructing each decision tree, the split node is chosen as the candidate with the smallest Gini index. The other nodes of the decision tree are built with the same splitting rule, until all training samples at a node belong to the same class or the maximum depth of the tree is reached.

4. Repeat steps 2 and 3 many times; each repetition takes its own input data and produces one decision tree, and together the trees form the random forest that is used to make predictions.

5. Once the training data has been sampled and the decision trees have been built, prediction works as follows: an input sample is fed to all the decision trees at the same time, and the final category is decided by a majority vote over the trees' outputs.
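
Here is a minimal sketch of these steps, showing the two random processes explicitly: a bootstrap sample of the rows for each tree, and max_features="sqrt" so that each split node considers a random subset of the features, with Gini splits and no pruning. scikit-learn trees are used for convenience, and the data is synthetic and illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, n_trees=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(n_trees):
        # First random process: a bootstrap sample of the rows (with replacement).
        idx = rng.choice(n, size=n, replace=True)
        # Second random process: max_features="sqrt" makes the tree consider a
        # random subset of about sqrt(M) features at every split node; splits use
        # the Gini index and the tree is grown fully (no pruning).
        tree = DecisionTreeClassifier(criterion="gini", max_features="sqrt",
                                      random_state=int(rng.integers(1 << 31)))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def random_forest_predict(trees, X):
    # Majority vote over the individual tree predictions.
    votes = np.stack([t.predict(X) for t in trees])
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])

# Illustrative usage on synthetic data.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
y = (X[:, 0] * X[:, 1] > 0).astype(int)
trees = random_forest_fit(X, y)
print("training accuracy:", np.mean(random_forest_predict(trees, X) == y))
```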

(Random forest algorithm diagram)

Points to note for the random forest algorithm:

1. The decision trees do not need to be pruned during construction.
2. The number of trees in the whole forest and the number of features considered by each tree need to be set manually.
3. When constructing a decision tree, split nodes are chosen by the minimum Gini index.

Random forests have a number of advantages:

A. Good performance on many data sets; the two sources of randomness make random forests unlikely to fall into overfitting.

B. On many current data sets they have a large advantage over other algorithms; the two sources of randomness also give random forests good resistance to noise.

C. They can handle very high-dimensional data (many features) without feature selection, and adapt well to different data sets: they handle both discrete and continuous data, and the data does not need to be normalized.

D. When building a random forest, an unbiased estimate of the generalization error is obtained from the out-of-bag data (see the sketch after this list).

E. Training is fast, and a ranking of variable importance can be obtained.

F. Interactions between features can be detected during training.

G. The algorithm is easy to parallelize.

H. The implementation is relatively simple.
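
To illustrate points D and E, here is a short sketch with scikit-learn's RandomForestClassifier showing the out-of-bag estimate of generalization error and the variable-importance ranking; the dataset and parameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative dataset with a few informative features.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, random_state=0)

# oob_score=True scores each sample only with the trees that did not see it,
# giving an internal, roughly unbiased estimate of generalization performance.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, n_jobs=-1, random_state=0)
forest.fit(X, y)

print("OOB accuracy estimate:", forest.oob_score_)

# Variable importance: larger values mean the feature contributed more to
# impurity reduction across the forest.
ranking = sorted(enumerate(forest.feature_importances_), key=lambda t: t[1], reverse=True)
print("feature importance ranking (index, importance):", ranking[:5])
```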
