"Random Forest" heights Field machine learning techniques

Source: Internet
Author: User
Tags: random seed, shuffle

Generally speaking, Lin's treatment of the random forest stays at the level of the overall algorithm and, to a large extent, focuses on insights rather than implementation details.

Lin first reviews the respective characteristics of bagging and of the decision tree; the random forest, as the combination of the two, has the following properties:

1) Easy to parallelize: each tree can be trained independently on its own bootstrap sample.

2) Retains the advantages of C&RT.

3) Weakens the shortcoming of fully-grown trees (overfitting) by means of bagging.

An insight is mentioned here: the more diverse the individual classifiers are, the better the aggregated result tends to be.

Therefore, the random forest not only bootstraps the samples, but also applies a similar random-sampling idea to the features.

The advantage of using a random subspace is that the feature dimension is reduced, which also improves computational efficiency.

Going further, the author of RF proposes an extension of this idea:

Any low-dimensional feature space can be regarded as the result of applying a projection matrix P to the original features; in other words, each new feature is a linear combination of the original ones.

A special case: if the projection leaves the selected features unchanged, then P consists of natural basis vectors, and random combination degenerates into the plain random subspace.

To inject even more randomness, the RF author suggests re-drawing the projection matrix every time a branching function b(x) is learned, i.e., at every split of every tree. Randomness everywhere indeed.
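A minimal NumPy sketch of this "random combination" idea (the function names and dimensions are only illustrative, not Lin's notation): each new feature is a random linear combination of the originals, and using natural basis vectors instead recovers the plain random subspace.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_projection(X, d_new, rng):
    """Project the d original features onto d_new random linear combinations."""
    d = X.shape[1]
    P = rng.standard_normal((d, d_new))   # random combination matrix P
    return X @ P                          # shape (n_samples, d_new)

def random_subspace(X, d_new, rng):
    """Special case: P made of natural basis vectors = pick d_new raw features."""
    idx = rng.choice(X.shape[1], size=d_new, replace=False)
    return X[:, idx]

X = rng.standard_normal((5, 10))          # toy data: 5 samples, 10 features
print(random_projection(X, 3, rng).shape) # (5, 3)
print(random_subspace(X, 3, rng).shape)   # (5, 3)
```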

Next, Lin describes how to do model validation in a way that exploits the characteristics of RF.

First, Lin derives the approximate fraction of samples that are not used (out-of-bag) when bootstrapping each tree in the RF.

Assuming each tree bootstraps N samples out of N, the probability that a given sample is never drawn is (1 - 1/N)^N ≈ 1/e, so roughly 1/3 of the samples are never seen by that tree.

For each tree, the samples that are not drawn during bootstrapping are called its out-of-bag (OOB) samples.
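A quick sanity check of that fraction, assuming a standard bootstrap of N draws out of N:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
drawn = rng.integers(0, N, size=N)            # one bootstrap round: N draws with replacement
oob_fraction = 1 - len(np.unique(drawn)) / N  # samples never drawn are out-of-bag
print(f"OOB fraction: {oob_fraction:.3f}  (theory: {(1 - 1/N)**N:.3f})")  # both ≈ 0.368
```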

Using this fact, validation of the RF model can be done quite elegantly.

1) An intuitive approach would be to validate each tree g_t with its own OOB data; but what we care about in an RF model is the performance of the whole ensemble G, not the classification of each individual tree.

2) The second idea is a bit roundabout but still clear (it can be viewed as an analogue of leave-one-out validation).

For example, suppose the sample (x1, y1) is out-of-bag for trees g2 and g3; then the error on (x1, y1) can be estimated with the sub-ensemble formed by averaging g2 and g3. (Validating with only the single point (x1, y1) would be exactly the leave-one-out idea.)

Doing this for every sample (x1, y1), ..., (xN, yN): for each one, evaluate it with the sub-ensemble of trees for which it is OOB, then average these per-sample errors. That average is the OOB error of G.

This second validation method has two properties:

A. The data used for validation is guaranteed not to have been peeked at during training.

B. It does not validate single trees g_t in isolation, but focuses on the performance of ensembles of trees.

This way of validating is very useful in practice: no re-training is needed, which saves time and effort.
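For reference, scikit-learn exposes exactly this kind of self-validation through the oob_score option; a small sketch on a synthetic dataset (the dataset and hyperparameters are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# toy dataset just for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,      # score each sample only with trees that did not see it
    bootstrap=True,
    random_state=0,
)
rf.fit(X, y)
print("OOB accuracy estimate:", rf.oob_score_)  # no separate validation set, no retraining
```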

Then comes the topic of feature selection.

This topic arises quite naturally: since the random forest randomly selects features at every step, one naturally asks which features are more important.

The linear model is reviewed first:

The weight vector w learned by a linear model is itself a measure of variable importance: the larger |w_i| is (whether positive or negative), the greater the impact of feature i on the output, and hence the more important it is.
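A small sketch of this ranking with scikit-learn's LogisticRegression (the dataset is synthetic; features should be standardized first, otherwise different scales distort the comparison of |w_i|):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_std = StandardScaler().fit_transform(X)   # put features on a common scale

clf = LogisticRegression().fit(X_std, y)
importance = np.abs(clf.coef_[0])           # |w_i|: larger -> more influence on the output
print(np.argsort(importance)[::-1])         # features ranked from most to least important
```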

There is also a statistical method, based on the idea of the permutation test.

For example, with N samples each having d features: to measure the importance of feature i, shuffle the values of feature i across all N samples, then compare the model's performance before and after the shuffle; the drop in performance indicates how important feature i is.

However, there is a problem: one must repeatedly shuffle and re-train, which makes the process very cumbersome.

So the RF author came up with a somewhat lazy trick:

Do no permutation during training; instead, permute at validation time: shuffle the values x_{n,i} among the OOB samples and then evaluate the OOB error. The resulting increase in error measures the importance of feature i.

This trick counts as a very practical idea, worth learning.
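scikit-learn's permutation_importance implements this shuffle-at-validation-time idea; note that it permutes features on a held-out set rather than on each tree's OOB samples, so the sketch below is a close approximation of the OOB trick, not the exact procedure described above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# shuffle each feature on the validation set and record the drop in score
result = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```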

Finally, Lin gives several examples of how RF models behave in practice:

1) On a simple data set, the RF decision boundary tends to be smooth, behaving somewhat like a large-margin classifier.

2) On complex, noisy data (where a single decision tree performs poorly), the RF model does a very good job of reducing the effect of noise.

3) How many trees should the forest have?

In short, the more trees the better; but because the forest is built randomly, the random seed also matters a lot (with too few trees, whether the result is stable is partly a matter of luck).
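One hedged way to check whether the forest is large enough: refit with a few different seeds and see whether the OOB score stops fluctuating (the settings below are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

for n_trees in (20, 100, 500):
    scores = [
        RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                               random_state=seed).fit(X, y).oob_score_
        for seed in range(3)
    ]
    # stable scores across seeds -> the forest is probably large enough
    print(n_trees, [round(s, 3) for s in scores])
```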

