Introduction
Earlier, while looking at how a decision tree chooses the best feature for splitting a data set, I read that the same idea can be used for feature selection. I then went through the introduction on Breiman's homepage, and it lives up to the reputation of the person who proposed the random forest algorithm: it explains the method very clearly. The URL is as follows:
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
Feature importance
The importance of a feature X in a random forest is calculated as follows:
First, for each decision tree in the random forest, use its corresponding OOB (out-of-bag) data to compute the tree's out-of-bag error, denoted errOOB1. Each decision tree thus yields one errOOB1 value, so the K trees give K values of errOOB1.
Then, traverse all the features and examine the importance of each feature X in turn: randomly add noise to feature X in all of the OOB samples (this can be understood as randomly shuffling the value of feature X across those samples), and compute the out-of-bag error again, denoted errOOB2. Each decision tree again yields one errOOB2, so the K trees give K values of errOOB2.
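To make the two error terms concrete, here is a minimal single-tree sketch using NumPy and scikit-learn's DecisionTreeClassifier. The dataset, the bootstrap sampling, and the permuted feature index j are illustrative assumptions, not part of the original post:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
n_samples = X.shape[0]

# Fit one decision tree on a bootstrap sample, as a random forest would.
boot_idx = rng.randint(0, n_samples, n_samples)        # sampled with replacement
oob_mask = ~np.isin(np.arange(n_samples), boot_idx)    # rows never drawn = OOB data
tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
tree.fit(X[boot_idx], y[boot_idx])

# errOOB1: error of this tree on its own out-of-bag data.
X_oob, y_oob = X[oob_mask], y[oob_mask]
errOOB1 = np.mean(tree.predict(X_oob) != y_oob)

# errOOB2: error after adding noise to feature j by shuffling its OOB values.
j = 0
X_perm = X_oob.copy()
X_perm[:, j] = rng.permutation(X_perm[:, j])
errOOB2 = np.mean(tree.predict(X_perm) != y_oob)

print(f"errOOB1={errOOB1:.3f}  errOOB2={errOOB2:.3f}  diff={errOOB2 - errOOB1:.3f}")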
The reason this expression can be used as a measure of the importance of the corresponding feature is that if randomly adding noise to a feature causes the out-of-bag accuracy to drop significantly, then that feature has a large influence on the classification of the samples; in other words, it is of high importance.
So the importance of feature X = ∑(errOOB2 − errOOB1) / K, where the sum runs over the K trees in the forest.
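Extending the single-tree sketch above to all K trees gives the full importance score. The function name, the number of trees, and the toy dataset below are my own assumptions for illustration; this is a sketch of the procedure described in this post, not a reference implementation:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def oob_permutation_importance(X, y, n_trees=100, seed=0):
    """Importance of each feature = mean over the K trees of (errOOB2 - errOOB1)."""
    rng = np.random.RandomState(seed)
    n_samples, n_features = X.shape
    diffs = np.zeros((n_trees, n_features))

    for k in range(n_trees):
        # Bootstrap sample for tree k; the rows never drawn are its OOB data.
        boot_idx = rng.randint(0, n_samples, n_samples)
        oob_mask = ~np.isin(np.arange(n_samples), boot_idx)
        X_oob, y_oob = X[oob_mask], y[oob_mask]

        tree = DecisionTreeClassifier(max_features="sqrt", random_state=k)
        tree.fit(X[boot_idx], y[boot_idx])

        # errOOB1 for this tree.
        errOOB1 = np.mean(tree.predict(X_oob) != y_oob)

        # errOOB2 for each feature: shuffle that feature's OOB column and re-score.
        for j in range(n_features):
            X_perm = X_oob.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            errOOB2 = np.mean(tree.predict(X_perm) != y_oob)
            diffs[k, j] = errOOB2 - errOOB1

    # Sum of (errOOB2 - errOOB1) over the K trees, divided by K.
    return diffs.mean(axis=0)

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
importances = oob_permutation_importance(X, y)
for j in np.argsort(importances)[::-1]:
    print(f"feature {j}: {importances[j]:.4f}")

In practice, scikit-learn's RandomForestClassifier also exposes an impurity-based feature_importances_ attribute, and sklearn.inspection.permutation_importance computes a closely related permutation measure, though on a held-out set rather than on each tree's OOB data.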