Course Address: https://class.coursera.org/ntumltwo-002/lecture
I. Random Forest (RF)
1. RF Introduction
- RF combines many CART trees by bagging; ignoring computational cost, more trees are usually better.
- The CART trees in RF are grown without pruning, so each tree tends to have high variance; the averaging effect of bagging then reduces that variance.
- When training each CART, the randomness and diversity of g_t can be increased by bootstrapping (randomly resampling the data), randomly sampling features, and even projecting the features into a random subspace with a projection matrix P.
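The sources of randomness above can be sketched in a small, self-contained way. This is a hypothetical toy implementation, not the course's code: decision stumps stand in for full unpruned CARTs, and only bootstrapping plus random feature subsets are shown (the random-subspace projection P is omitted):

```python
import numpy as np

def fit_stump(X, y):
    """Best single-feature threshold split under 0/1 error (a stand-in for CART)."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            pred = np.where(X[:, j] <= t, 0, 1)
            for p, flipped in ((pred, False), (1 - pred, True)):
                err = np.mean(p != y)
                if best is None or err < best[0]:
                    best = (err, j, t, flipped)
    return best[1:]  # (feature index, threshold, flipped)

def stump_predict(stump, X):
    j, t, flipped = stump
    pred = np.where(X[:, j] <= t, 0, 1)
    return 1 - pred if flipped else pred

def random_forest(X, y, n_trees=15, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = max(1, int(np.sqrt(d)))                       # random feature subset size
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)              # bootstrap: sample N rows with replacement
        feats = rng.choice(d, size=k, replace=False)  # random feature subset
        forest.append((feats, fit_stump(X[np.ix_(idx, feats)], y[idx])))
    return forest

def forest_predict(forest, X):
    # Uniform (bagging) vote over all trees; odd n_trees avoids ties.
    votes = np.array([stump_predict(s, X[:, f]) for f, s in forest])
    return (votes.mean(axis=0) > 0.5).astype(int)
```

A real RF grows each tree fully; the stump here only keeps the sketch short.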
2. RF Algorithm Structure and Benefits
II. OOB (Out-of-Bag) and Self-Validation (Automatic Validation)
1. The bootstrapping used in RF means that some samples are never drawn in a given round of training; the unused samples are called OOB (out-of-bag) samples.
When the sample set is large and each bootstrap sample has the same size N as the data set, the probability that a particular sample is never drawn is (1 - 1/N)^N ≈ 1/e ≈ 1/3, so the OOB set is roughly one third of the data.
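The "about one third" figure is easy to check numerically: as N grows, (1 - 1/N)^N approaches 1/e ≈ 0.368 from below:

```python
import math

# P(a fixed sample is missed in all N draws) = (1 - 1/N)^N -> 1/e as N -> infinity
for n in (10, 100, 10_000):
    print(n, (1 - 1 / n) ** n)
print("1/e =", 1 / math.e)  # ≈ 0.368, i.e. roughly one third of the data is OOB
```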
2. RF Validation
RF does not track the classification accuracy of each individual tree, nor does it validate each g_t with OOB data; instead, the OOB data are used to validate the ensemble G.
To guarantee that the validation data were never "peeked at" during training, each OOB sample is evaluated with G⁻, the sub-ensemble built only from the trees whose bootstrap samples did not contain that sample.
Finally, all the OOB test results are averaged to obtain E_oob. As Lin notes, in practice E_oob is usually very accurate.
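The E_oob computation can be sketched as follows. The interface is hypothetical (not from the lecture): `forest` is a list of trees, `inbag_sets[t]` is the set of row indices tree t was trained on, and `predict_one(tree, x)` returns a 0/1 label:

```python
import numpy as np

def oob_error(X, y, forest, inbag_sets, predict_one):
    # For each sample n, vote only the trees that never saw n in their
    # bootstrap sample (the "G minus" sub-ensemble), then average 0/1 errors.
    errors = []
    for n in range(len(X)):
        votes = [predict_one(tree, X[n])
                 for tree, inbag in zip(forest, inbag_sets)
                 if n not in inbag]
        if votes:  # a few samples may be in-bag for every tree; skip them
            pred = int(np.mean(votes) > 0.5)
            errors.append(int(pred != y[n]))
    return float(np.mean(errors))
```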
III. Feature Selection and the Permutation Test
- In practice, when samples have very many features, it is often desirable to remove redundant or irrelevant features and keep only the relatively important ones.
- In a linear model, a feature's importance can be read off from |w_i|; in nonlinear models, feature importance is generally hard to measure.
- RF borrows the permutation test, a tool from statistics, to measure feature importance.
- Given N samples with d features each, to measure the importance of feature i, the permutation test shuffles the values of feature i across the N samples; the error after shuffling minus the error before is that feature's importance.
- In practice, RF does not rerun training for the permutation test; instead, it shuffles the feature's OOB values at validation time and measures the resulting change in validation error to obtain the feature's importance.
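The shuffle-and-compare idea above can be sketched without retraining. This is a minimal illustration, assuming a trained `predict` function and a held-out/OOB set `X_val, y_val` (the names are illustrative, not from the lecture):

```python
import numpy as np

def permutation_importance(predict, X_val, y_val, seed=0):
    # importance(i) = error with feature i shuffled - error with data intact;
    # shuffling destroys feature i's information without retraining the model.
    rng = np.random.default_rng(seed)
    base_err = np.mean(predict(X_val) != y_val)
    importances = []
    for i in range(X_val.shape[1]):
        Xp = X_val.copy()
        rng.shuffle(Xp[:, i])        # permute only column i
        importances.append(np.mean(predict(Xp) != y_val) - base_err)
    return np.array(importances)
```

A feature the model ignores yields importance 0; a feature the model relies on yields a large positive value.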
IV. Application of RF
- On a simple data set, the RF decision boundary is smoother than a single CART tree's, and the margin is larger.
- On complex, noisy data sets, a single decision tree often performs poorly; RF's averaging suppresses the noise well, so RF performs comparatively well.
- How many trees should RF use? In general, the more the better. In practice, enough trees are needed for G to be stable, so the stability of G can be used to judge whether enough trees have been grown.
Machine Learning Techniques: Random Forest