http://blog.csdn.net/abcjennifer/article/details/8164315
This article explains, from a statistical perspective, the characteristics and classification behavior of four classifiers: CART (Classification and Regression Tree), Bagging (bootstrap aggregating), Random Forest, and Boosting. The reference material is a PDF by Wang Bo of Ji Zhu's group at the University of Michigan.
- CART (Classification and Regression Tree)
Breiman, Friedman, Olshen & Stone (1984); Quinlan (1993).
Idea: recursively partition the input space into rectangles.
Advantages: performs variable selection, copes with missing data, handles mixed (discrete and continuous) predictors.
Disadvantage: unstable.
Example: for the following data, we want to split it into a red class and a green class. The original data were generated like this: red class: x1^2 + x2^2 >= 4.6; green class: otherwise. The final classification tree is obtained by repeated splitting:
- So how should we split? What is the best strategy for dividing the input space into rectangles? Three evaluation criteria (node impurity measures) are generally used here: misclassification error, the Gini index, and cross-entropy.
When splitting, find the splitting variable and the splitting point that make the impurity decrease the fastest.
- From the results we can see that CART builds the classification tree by iteratively selecting variables, so that each split best separates the remaining data into two classes.
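To make the impurity-based splitting concrete, here is a minimal sketch (not code from the original post) that scores every candidate (variable, threshold) pair by the decrease in Gini impurity; the toy data follow the red/green rule x1^2 + x2^2 >= 4.6 from the example above, and all names are illustrative.

```python
import numpy as np

def gini(y):
    """Gini impurity of an integer label vector."""
    if len(y) == 0:
        return 0.0
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustively search (variable, threshold) for the largest impurity decrease."""
    n, d = X.shape
    parent = gini(y)
    best = (None, None, 0.0)          # (variable index, threshold, impurity decrease)
    for j in range(d):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            # weighted impurity of the two child nodes
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            gain = parent - child
            if gain > best[2]:
                best = (j, t, gain)
    return best

# toy data following the example above: red if x1^2 + x2^2 >= 4.6, else green
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 >= 4.6).astype(int)
print(best_split(X, y))
```

CART applies this search recursively to each resulting rectangle until a stopping rule is met.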
- A single classification tree is simple, but it is often a noisy (unstable) classifier. This motivates ensemble classifiers: bagging, random forest, and boosting.
In general, boosting > bagging > a single classification tree (single tree).
- Bagging (Breiman, 1996): also known as bootstrap aggregating
Bagging's strategy:
-Draw n samples from the training set with bootstrap sampling (i.e. with replacement)
-Using all attributes, build a classifier (CART or SVM or ...) on these n samples
-Repeat the above two steps m times, i.e. build m classifiers (CART or SVM or ...)
-Run the data through the m classifiers and take the majority vote
Fit many large trees to bootstrap-resampled versions of the training data, and classify by majority vote.
This is bagging's sampling strategy: each time, n points are drawn with replacement from the n training points to form one bag; repeating this B times gives B bags, i.e. B bootstrap samples.
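A minimal sketch of the bagging procedure just described, assuming CART (scikit-learn's DecisionTreeClassifier) as the base classifier; the function names and parameters are illustrative rather than taken from the post.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, B=50, rng=None):
    """Fit B trees, each on a bootstrap sample (n draws with replacement) of the data."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(X)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # one "bag": n samples drawn with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    """Majority vote over the B trees (labels assumed to be non-negative integers)."""
    votes = np.stack([t.predict(X) for t in trees])   # shape (B, n_samples)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```

scikit-learn's sklearn.ensemble.BaggingClassifier packages the same procedure and accepts any base estimator.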
- Random Forest (Breiman, 1999):
Random forest is a modification of bagging.
-Draw n samples from the training set with bootstrap sampling, in order to build a CART
-At each node of the tree, randomly select k attributes out of all attributes, and choose the best split among these k as the splitting attribute for that node
-Repeat the above two steps m times, i.e. build m CARTs
-These m CARTs form the random forest
Random forest can handle both discrete attributes, as the ID3 algorithm does, and continuous attributes, as the C4.5 algorithm does. The "random" here means two things: 1. the random sub-sampling of the training data in the bootstrap step; 2. the random subspace method: at each node split, k attributes are randomly selected from the attribute set and the best split among them is chosen. Results show that random forest is sometimes better than bagging. Today, Microsoft's Kinect uses random forests; the related paper, "Real-Time Human Pose Recognition in Parts from Single Depth Images," was CVPR 2011's best paper.
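As a sketch of how the random-subspace step is usually exercised in practice, scikit-learn's RandomForestClassifier exposes the number of attributes examined at each node as max_features; the parameter values below are illustrative, not from the post.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# toy red/green data from the CART example above
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 >= 4.6).astype(int)

# m = 100 trees; at each node only about sqrt(#attributes) candidate attributes are examined,
# which is the random-subspace step described above (values here are illustrative)
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", bootstrap=True)
rf.fit(X, y)
print(rf.predict(X[:5]))
```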
- Boosting (Freund & Schapire 1996):
Fit many large or small trees to reweighted versions of the training data. Classify by weighted majority vote.
First, the general idea: when choosing the decision boundary, boosting attaches a weight to each sample so that the loss function concentrates as much as possible on the misclassified samples (i.e. misclassified samples receive large weights).
How to do it?
-Boosting does not resample the data points themselves but the distribution over them: samples that are classified correctly get low weight, while misclassified samples (usually those near the decision boundary) get high weight. The final classifier is a linear superposition (weighted combination) of many weak classifiers, each of which is quite simple.
AdaBoost and RealBoost are two ways of implementing boosting. Generally speaking, AdaBoost is easier to use, while RealBoost is more accurate.
Here is the procedure AdaBoost uses to set up and update the weights:
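The figure with the weight-update procedure is not reproduced here. The sketch below follows the standard discrete AdaBoost update for labels in {-1, +1}, with decision stumps as the weak classifiers (an assumption; the post does not fix the weak learner).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    """Standard AdaBoost for labels y in {-1, +1}, using decision stumps as weak learners."""
    n = len(X)
    w = np.full(n, 1.0 / n)                    # initial weights: uniform
    stumps, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)          # weighted training error
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # weight of this weak classifier
        w *= np.exp(-alpha * y * pred)                     # up-weight misclassified samples
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Weighted majority vote: sign of the weighted sum of weak-classifier outputs."""
    F = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(F)
```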
The following is a performance comparison of these algorithms:
For multi-class classification, the generalization is a similar process:
For example, to classify the data into K classes, rather than reducing the problem to two classes (one class vs. the remaining K-1) each time, we only require each weak classifier to be better than random guessing (i.e. accuracy > 1/K).
Multi-class classification algorithm flow:
Loss function design for multi-class classifiers:
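The algorithm-flow and loss-function figures are not reproduced here. The condition "accuracy > 1/K" matches the SAMME multi-class AdaBoost of Zhu, Zou, Rosset & Hastie (Ji Zhu being the author referenced at the top); assuming that is the algorithm intended, its classifier weight and multi-class exponential loss are:

```latex
% SAMME classifier weight: positive exactly when the weak learner beats random guessing (accuracy > 1/K)
\alpha_m = \log\frac{1-\mathrm{err}_m}{\mathrm{err}_m} + \log(K-1)

% multi-class exponential loss, with the true class coded as y_k = 1 and the others as -1/(K-1)
L(y, f) = \exp\!\left(-\tfrac{1}{K}\, y^{\top} f(x)\right), \qquad \sum_{k=1}^{K} f_k(x) = 0
```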
=============== Supplement ===============
The top ten data mining algorithms, which can be studied slowly later:
C4.5
K-means
SVM
Apriori
EM
PageRank
AdaBoost
kNN
Naive Bayes
CART
=============== Summary ===============
Boosting can be used for variable selection, so the earliest components can be simple, single-variable weak learners.
Boosting may overfit, so stopping at a relatively early round is one way of regularizing boosting.
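A minimal sketch of that early-stopping idea, using scikit-learn's AdaBoostClassifier and its staged predictions on a held-out set (the split and parameter values are illustrative):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# toy red/green data from the CART example above
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 >= 4.6).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
boost = AdaBoostClassifier(n_estimators=200).fit(X_tr, y_tr)

# validation error after each boosting round; truncate where it bottoms out
val_err = [np.mean(p != y_val) for p in boost.staged_predict(X_val)]
best_m = int(np.argmin(val_err)) + 1
print("stop after", best_m, "rounds; validation error", val_err[best_m - 1])
```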
Looking forward to more friends adding to this...
Reference:
1. http://cos.name/2011/12/stories-about-statistical-learning/
2. Wikipedia: Boosting
3. Wikipedia: Bootstrap aggregating (bagging)
4. Wikipedia: Classification and regression tree (CART)
Statistical Learning Methods: CART, Bagging, Random Forest, Boosting