Course Address: https://class.coursera.org/ntumltwo-002/lecture
I. Random Forest (RF)
1. RF Introduction
- RF combines many CART trees by bagging; ignoring computational cost, more trees are usually better.
- The CART trees in RF are grown without pruning, so each tree tends to have high variance; the averaging effect of bagging then reduces that variance.
- When training each CART, the randomness and diversity of g_t can be increased by bootstrapping (randomly resampling the data), randomly sampling features, and even projecting the features into a random subspace with a projection matrix P.
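The sources of randomness above can be sketched in a small, self-contained way. This is a hypothetical toy implementation, not the course's code: decision stumps stand in for full unpruned CARTs, and only bootstrapping plus random feature subsets are shown (the random-subspace projection P is omitted):

```python
import numpy as np

def fit_stump(X, y):
    """Best single-feature threshold split under 0/1 error (a stand-in for CART)."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            pred = np.where(X[:, j] <= t, 0, 1)
            for p, flipped in ((pred, False), (1 - pred, True)):
                err = np.mean(p != y)
                if best is None or err < best[0]:
                    best = (err, j, t, flipped)
    return best[1:]  # (feature index, threshold, flipped)

def stump_predict(stump, X):
    j, t, flipped = stump
    pred = np.where(X[:, j] <= t, 0, 1)
    return 1 - pred if flipped else pred

def random_forest(X, y, n_trees=15, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = max(1, int(np.sqrt(d)))                       # random feature subset size
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)              # bootstrap: sample N rows with replacement
        feats = rng.choice(d, size=k, replace=False)  # random feature subset
        forest.append((feats, fit_stump(X[np.ix_(idx, feats)], y[idx])))
    return forest

def forest_predict(forest, X):
    # Uniform (bagging) vote over all trees; odd n_trees avoids ties.
    votes = np.array([stump_predict(s, X[:, f]) for f, s in forest])
    return (votes.mean(axis=0) > 0.5).astype(int)
```

A real RF grows each tree fully; the stump here only keeps the sketch short.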
2. RF Algorithm Structure and Benefits
II. OOB (Out-of-Bag) and Self-Validation (Automatic Validation)
1. The bootstrapping used in RF means that some samples are never drawn in a given round of training; the unused samples are called OOB (out-of-bag) samples.
When the sample set is large and each bootstrap sample has the same size N as the data set, the probability that a particular sample is never drawn is (1 - 1/N)^N ≈ 1/e ≈ 1/3, so the OOB set is roughly one third of the data.
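The "about one third" figure is easy to check numerically: as N grows, (1 - 1/N)^N approaches 1/e ≈ 0.368 from below:

```python
import math

# P(a fixed sample is missed in all N draws) = (1 - 1/N)^N -> 1/e as N -> infinity
for n in (10, 100, 10_000):
    print(n, (1 - 1 / n) ** n)
print("1/e =", 1 / math.e)  # ≈ 0.368, i.e. roughly one third of the data is OOB
```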
2. RF Validation
RF does not track the classification accuracy of each individual tree, nor does it validate each g_t with OOB data; instead, the OOB data are used to validate the ensemble G.
To guarantee that the validation data were never "peeked at" during training, each OOB sample is evaluated with G⁻, the sub-ensemble built only from the trees whose bootstrap samples did not contain that sample.
Finally, all the OOB test results are averaged to obtain E_oob. As Lin notes, in practice E_oob is usually very accurate.
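The E_oob computation can be sketched as follows. The interface is hypothetical (not from the lecture): `forest` is a list of trees, `inbag_sets[t]` is the set of row indices tree t was trained on, and `predict_one(tree, x)` returns a 0/1 label:

```python
import numpy as np

def oob_error(X, y, forest, inbag_sets, predict_one):
    # For each sample n, vote only the trees that never saw n in their
    # bootstrap sample (the "G minus" sub-ensemble), then average 0/1 errors.
    errors = []
    for n in range(len(X)):
        votes = [predict_one(tree, X[n])
                 for tree, inbag in zip(forest, inbag_sets)
                 if n not in inbag]
        if votes:  # a few samples may be in-bag for every tree; skip them
            pred = int(np.mean(votes) > 0.5)
            errors.append(int(pred != y[n]))
    return float(np.mean(errors))
```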
III. Feature Selection and the Permutation Test
- In practice, when samples have very many features, it is often desirable to remove redundant or irrelevant features and keep only the relatively important ones.
- In a linear model, a feature's importance can be read off from |w_i|; in nonlinear models, feature importance is generally hard to measure.
- RF borrows the permutation test, a tool from statistics, to measure feature importance.
- Given N samples with d features each, to measure the importance of feature i, the permutation test shuffles the values of feature i across the N samples; the error after shuffling minus the error before is that feature's importance.
- In practice, RF does not rerun training for the permutation test; instead, it shuffles the feature's OOB values at validation time and measures the resulting change in validation error to obtain the feature's importance.
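The shuffle-and-compare idea above can be sketched without retraining. This is a minimal illustration, assuming a trained `predict` function and a held-out/OOB set `X_val, y_val` (the names are illustrative, not from the lecture):

```python
import numpy as np

def permutation_importance(predict, X_val, y_val, seed=0):
    # importance(i) = error with feature i shuffled - error with data intact;
    # shuffling destroys feature i's information without retraining the model.
    rng = np.random.default_rng(seed)
    base_err = np.mean(predict(X_val) != y_val)
    importances = []
    for i in range(X_val.shape[1]):
        Xp = X_val.copy()
        rng.shuffle(Xp[:, i])        # permute only column i
        importances.append(np.mean(predict(Xp) != y_val) - base_err)
    return np.array(importances)
```

A feature the model ignores yields importance 0; a feature the model relies on yields a large positive value.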
IV. Application of RF
- On a simple data set, the RF decision boundary is smoother than a single CART tree's, and the margin is larger.
- On complex, noisy data sets, a single decision tree often performs poorly; RF's averaging suppresses the noise well, so RF performs comparatively well.
- How many trees should RF use? In general, the more the better. In practice, enough trees are needed for G to be stable, so the stability of G can be used to judge whether enough trees have been grown.
Machine Learning Techniques: Random Forest