http://blog.csdn.net/abcjennifer/article/details/8164315
This article explains, from a statistical perspective, the characteristics and classification behavior of four classifiers: CART (Classification and Regression Tree), Bagging (bootstrap aggregating), Random Forest, and Boosting. The reference material is a PDF by Wang Bo of Ji Zhu's group at the University of Michigan.
- CART (Classification and Regression Tree)
Breiman, Friedman, Olshen & Stone (1984); Quinlan (1993).
Idea: recursively partition the input space into rectangles.
Advantages: performs variable selection, copes with missing data, handles mixed (discrete and continuous) predictors.
Disadvantage: unstable.
Example: for the following data, we want to split it into a red class and a green class. The original data were generated like this: red class: x1^2 + x2^2 >= 4.6; green class: otherwise. The final classification tree is obtained by repeated splitting:
- So how should we split? What is the best strategy for dividing the input space into rectangles? Three evaluation criteria (node impurity measures) are generally used here: misclassification error, the Gini index, and cross-entropy.
When splitting, find the splitting variable and the splitting point that make the impurity decrease the fastest.
- From the results we can see that CART builds the classification tree by iteratively selecting variables, so that each split best separates the remaining data into two classes.
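To make the impurity-based splitting concrete, here is a minimal sketch (not code from the original post) that scores every candidate (variable, threshold) pair by the decrease in Gini impurity; the toy data follow the red/green rule x1^2 + x2^2 >= 4.6 from the example above, and all names are illustrative.

```python
import numpy as np

def gini(y):
    """Gini impurity of an integer label vector."""
    if len(y) == 0:
        return 0.0
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustively search (variable, threshold) for the largest impurity decrease."""
    n, d = X.shape
    parent = gini(y)
    best = (None, None, 0.0)          # (variable index, threshold, impurity decrease)
    for j in range(d):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            # weighted impurity of the two child nodes
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            gain = parent - child
            if gain > best[2]:
                best = (j, t, gain)
    return best

# toy data following the example above: red if x1^2 + x2^2 >= 4.6, else green
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 >= 4.6).astype(int)
print(best_split(X, y))
```

CART applies this search recursively to each resulting rectangle until a stopping rule is met.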
- A single classification tree is simple, but it is often a noisy (unstable) classifier. This motivates ensemble classifiers: bagging, random forest, and boosting.
In general, boosting > bagging > a single classification tree (single tree).
- Bagging (Breiman, 1996): also known as bootstrap aggregating
Bagging's strategy:
-Draw n samples from the training set with bootstrap sampling (i.e. with replacement)
-Using all attributes, build a classifier (CART or SVM or ...) on these n samples
-Repeat the above two steps m times, i.e. build m classifiers (CART or SVM or ...)
-Run the data through the m classifiers and take the majority vote
Fit many large trees to bootstrap-resampled versions of the training data, and classify by majority vote.
This is bagging's sampling strategy: each time, n points are drawn with replacement from the n training points to form one bag; repeating this B times gives B bags, i.e. B bootstrap samples.
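A minimal sketch of the bagging procedure just described, assuming CART (scikit-learn's DecisionTreeClassifier) as the base classifier; the function names and parameters are illustrative rather than taken from the post.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, B=50, rng=None):
    """Fit B trees, each on a bootstrap sample (n draws with replacement) of the data."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(X)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # one "bag": n samples drawn with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    """Majority vote over the B trees (labels assumed to be non-negative integers)."""
    votes = np.stack([t.predict(X) for t in trees])   # shape (B, n_samples)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```

scikit-learn's sklearn.ensemble.BaggingClassifier packages the same procedure and accepts any base estimator.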
- Random Forest (Breiman, 1999):
Random forest is a modification of bagging.
-Draw n samples from the training set with bootstrap sampling, in order to build a CART
-At each node of the tree, randomly select k attributes out of all attributes, and choose the best split among these k as the splitting attribute for that node
-Repeat the above two steps m times, i.e. build m CARTs
-These m CARTs form the random forest
Random forest can handle both discrete attributes, as the ID3 algorithm does, and continuous attributes, as the C4.5 algorithm does. The "random" here means two things: 1. the random sub-sampling of the training data in the bootstrap step; 2. the random subspace method: at each node split, k attributes are randomly selected from the attribute set and the best split among them is chosen. Results show that random forest is sometimes better than bagging. Today, Microsoft's Kinect uses random forests; the related paper, "Real-Time Human Pose Recognition in Parts from Single Depth Images," was CVPR 2011's best paper.
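As a sketch of how the random-subspace step is usually exercised in practice, scikit-learn's RandomForestClassifier exposes the number of attributes examined at each node as max_features; the parameter values below are illustrative, not from the post.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# toy red/green data from the CART example above
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 >= 4.6).astype(int)

# m = 100 trees; at each node only about sqrt(#attributes) candidate attributes are examined,
# which is the random-subspace step described above (values here are illustrative)
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", bootstrap=True)
rf.fit(X, y)
print(rf.predict(X[:5]))
```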
- Boosting (Freund & Schapire 1996):
Fit many large or small trees to reweighted versions of the training data. Classify by weighted majority vote.
First, the general idea: when choosing the decision boundary, boosting attaches a weight to each sample so that the loss function concentrates as much as possible on the misclassified samples (i.e. misclassified samples receive large weights).
How to do it?
-Boosting does not resample the data points themselves but the distribution over them: samples that are classified correctly get low weight, while misclassified samples (usually those near the decision boundary) get high weight. The final classifier is a linear superposition (weighted combination) of many weak classifiers, each of which is quite simple.
AdaBoost and RealBoost are two ways of implementing boosting. Generally speaking, AdaBoost is easier to use, while RealBoost is more accurate.
Here is the procedure AdaBoost uses to set up and update the weights:
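The figure with the weight-update procedure is not reproduced here. The sketch below follows the standard discrete AdaBoost update for labels in {-1, +1}, with decision stumps as the weak classifiers (an assumption; the post does not fix the weak learner).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    """Standard AdaBoost for labels y in {-1, +1}, using decision stumps as weak learners."""
    n = len(X)
    w = np.full(n, 1.0 / n)                    # initial weights: uniform
    stumps, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)          # weighted training error
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # weight of this weak classifier
        w *= np.exp(-alpha * y * pred)                     # up-weight misclassified samples
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Weighted majority vote: sign of the weighted sum of weak-classifier outputs."""
    F = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(F)
```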
The following is a performance comparison of these algorithms:
For multi-class classification, the generalization is a similar process:
For example, to classify the data into K classes, rather than reducing the problem to two classes (one class vs. the remaining K-1) each time, we only require each weak classifier to be better than random guessing (i.e. accuracy > 1/K).
Multi-class classification algorithm flow:
Loss function design for multi-class classifiers:
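The algorithm-flow and loss-function figures are not reproduced here. The condition "accuracy > 1/K" matches the SAMME multi-class AdaBoost of Zhu, Zou, Rosset & Hastie (Ji Zhu being the author referenced at the top); assuming that is the algorithm intended, its classifier weight and multi-class exponential loss are:

```latex
% SAMME classifier weight: positive exactly when the weak learner beats random guessing (accuracy > 1/K)
\alpha_m = \log\frac{1-\mathrm{err}_m}{\mathrm{err}_m} + \log(K-1)

% multi-class exponential loss, with the true class coded as y_k = 1 and the others as -1/(K-1)
L(y, f) = \exp\!\left(-\tfrac{1}{K}\, y^{\top} f(x)\right), \qquad \sum_{k=1}^{K} f_k(x) = 0
```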
=============== Supplement ===============
The top ten data mining algorithms, which can be studied slowly later:
C4.5
K-means
SVM
Apriori
EM
PageRank
AdaBoost
kNN
Naive Bayes
CART
=============== Summary ===============
Boosting can be used for variable selection, so the earliest components can be simple, single-variable weak learners.
Boosting may overfit, so stopping at a relatively early round is one way of regularizing boosting.
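A minimal sketch of that early-stopping idea, using scikit-learn's AdaBoostClassifier and its staged predictions on a held-out set (the split and parameter values are illustrative):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# toy red/green data from the CART example above
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 >= 4.6).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
boost = AdaBoostClassifier(n_estimators=200).fit(X_tr, y_tr)

# validation error after each boosting round; truncate where it bottoms out
val_err = [np.mean(p != y_val) for p in boost.staged_predict(X_val)]
best_m = int(np.argmin(val_err)) + 1
print("stop after", best_m, "rounds; validation error", val_err[best_m - 1])
```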
Looking forward to more friends adding to this...
Reference:
1. http://cos.name/2011/12/stories-about-statistical-learning/
2. Wikipedia: Boosting
3. Wikipedia: Bootstrap aggregating (bagging)
4. Wikipedia: Classification and regression tree (CART)
Statistical Learning Methods: CART, Bagging, Random Forest, Boosting