Calculating Feature Importance in Tree Ensemble Algorithms
Ensemble learning has attracted wide attention because of its high predictive accuracy, especially ensemble algorithms that use decision trees as the base learner. Well-known tree-based ensemble algorithms include random forests and GBDT. Random forests resist overfitting well; their main parameter (the number of decision trees) has relatively little effect on prediction performance and is easy to set, with a fairly large number usually chosen. GBDT has a solid theoretical foundation and generally performs better. For the principles of the GBDT algorithm, please refer to my earlier post, "GBDT Algorithm Theory In-Depth Analysis."
Tree-based ensemble algorithms also have a nice property: once training is finished, the model can output the relative importance of each feature. This makes it easy to perform feature selection and to understand which factors have a critical impact on the prediction, which is particularly important in some fields (such as bioinformatics and neuroscience). This article mainly introduces how tree-based ensemble algorithms compute the relative importance of each feature.
Advantages of using boosted trees as the learning algorithm:
- Different types of data can be used directly, with no need for feature normalization/standardization;
- It is easy to trade off runtime efficiency against accuracy; for example, when a boosted tree model is used for online prediction, the number of trees participating in the prediction can be reduced when machine resources are tight, sacrificing a little accuracy for higher prediction efficiency;
- The learned model can output the relative importance of the features and can be used for feature selection (a quick sketch is given right after this list);
- The model is easy to interpret;
- The model is insensitive to missing data fields;
- The model automatically captures interactions among multiple features and handles non-linear relationships well.
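As a quick illustration of the feature-importance output, here is a minimal sketch (not from the original post) that trains scikit-learn's GradientBoostingClassifier on a synthetic dataset, prints the relative feature importances, and keeps the most important features; the dataset, parameter values, and number of selected features are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical toy data: 1000 samples, 10 features, only 4 of them informative.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4, random_state=0)

clf = GradientBoostingClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Relative importance of each feature (normalized by scikit-learn to sum to 1).
print(np.round(clf.feature_importances_, 3))

# Simple feature selection: keep the 4 features with the largest importance.
top_k = np.argsort(clf.feature_importances_)[::-1][:4]
print(top_k)

The informative features should receive clearly larger importance scores than the noise features, which is exactly what makes this output useful for feature selection.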
Friedman gives the following approach in the GBM paper.
The global importance of feature j is measured by the average of its importance across the individual trees:

\hat{J_{j}^{2}} = \frac{1}{M}\sum_{m=1}^{M}\hat{J_{j}^{2}}(T_{m})

where M is the number of trees. The importance of feature j in a single tree T is:

\hat{J_{j}^{2}}(T) = \sum_{t=1}^{L-1}\hat{i_{t}^{2}}\,\mathbf{1}(v_{t}=j)

where L is the number of leaf nodes of the tree, L-1 is the number of non-leaf nodes (the constructed tree is a binary tree whose internal nodes each have a left and a right child), v_{t} is the feature used to split node t, and \hat{i_{t}^{2}} is the reduction in squared loss obtained by splitting node t.
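To make the formulas concrete, here is a small sketch (not from the original post) that computes Friedman's per-tree and global importances for a toy ensemble; the Node structure holding the splitting feature and the squared-loss reduction at each internal node is a hypothetical stand-in for a real tree representation.

from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Node:
    # Hypothetical node: internal nodes carry the splitting feature index and
    # the squared-loss reduction i_t^2 of the split; leaves carry neither.
    feature: Optional[int] = None
    loss_reduction: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def tree_importance(root: Node, n_features: int) -> np.ndarray:
    """J_j^2(T): sum the squared-loss reductions over internal nodes that split on feature j."""
    imp = np.zeros(n_features)
    stack = [root]
    while stack:
        node = stack.pop()
        if node.left is not None:            # internal node
            imp[node.feature] += node.loss_reduction
            stack.extend([node.left, node.right])
    return imp

def global_importance(trees, n_features: int) -> np.ndarray:
    """J_j^2: average the per-tree importances over the M trees of the ensemble."""
    return np.mean([tree_importance(t, n_features) for t in trees], axis=0)

# Toy example with two features and two tiny trees.
t1 = Node(feature=0, loss_reduction=4.0, left=Node(),
          right=Node(feature=1, loss_reduction=1.0, left=Node(), right=Node()))
t2 = Node(feature=1, loss_reduction=2.0, left=Node(), right=Node())
print(global_importance([t1, t2], n_features=2))   # -> [2.  1.5]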
Implementation Code Snippets
To better understand how the feature importance is computed, the implementation from the scikit-learn toolkit is given below, with some unrelated parts removed.
The following code comes from the computation of the feature_importances_ property of the GradientBoostingClassifier object:
def feature_importances_(self):
    # Average the feature-importance vectors of all the individual trees.
    total_sum = np.zeros((self.n_features,), dtype=np.float64)
    for tree in self.estimators_:
        total_sum += tree.feature_importances_
    importances = total_sum / len(self.estimators_)
    return importances
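The same averaging can be reproduced from outside the class on a fitted model. A sketch under the assumptions that clf is the GradientBoostingClassifier fitted earlier and that clf.estimators_ is a 2-D array of regression trees (one row per boosting stage); depending on the scikit-learn version, the built-in property may apply an extra normalization, so the two printed vectors can differ slightly.

import numpy as np

# Average the per-tree importance vectors, mirroring the property shown above.
per_tree = [t.feature_importances_ for t in clf.estimators_.ravel()]
manual = np.mean(per_tree, axis=0)

print(np.round(manual, 3))
print(np.round(clf.feature_importances_, 3))   # built-in property, for comparison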
Here, self.estimators_ is the array of decision trees built by the algorithm, and tree.feature_importances_ is the feature-importance vector of a single tree, which is computed as follows:
cpdef compute_feature_importances(self, normalize=True):
    """Computes the importance of each feature (aka variable)."""
    while node != end_node:
        if node.left_child != _TREE_LEAF:
            # ... and node.right_child != _TREE_LEAF:
            left = &nodes[node.left_child]
            right = &nodes[node.right_child]

            # Accumulate the weighted impurity decrease of this split onto its feature.
            importance_data[node.feature] += (
                node.weighted_n_node_samples * node.impurity -
                left.weighted_n_node_samples * left.impurity -
                right.weighted_n_node_samples * right.impurity)
        node += 1

    importances /= nodes[0].weighted_n_node_samples

    return importances
The code above has been simplified to keep only the core idea: for every non-leaf node, the reduction in weighted impurity produced by its split is accumulated onto the splitting feature, and the larger the accumulated reduction, the more important that feature is.
The reduction in impurity is in fact the gain obtained from splitting the node, so we can also read it this way: the larger the gain when a node splits, the more important the feature used at that node. For the definition of the gain, please refer to equation (9) in my earlier post, "GBDT Algorithm Theory In-Depth Analysis."
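To connect the snippet above with a fitted model, here is a pure-Python sketch (not part of the original post) that recomputes the weighted impurity decrease per feature from the public tree_ arrays of a single scikit-learn tree and compares it with the library's own output; it assumes the clf fitted earlier in this post.

import numpy as np

def impurity_decrease_importances(tree):
    """Recompute per-feature weighted impurity decrease for one fitted sklearn tree."""
    t = tree.tree_
    importances = np.zeros_like(tree.feature_importances_)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:                        # -1 marks a leaf node (TREE_LEAF)
            continue
        importances[t.feature[node]] += (
            t.weighted_n_node_samples[node] * t.impurity[node]
            - t.weighted_n_node_samples[left] * t.impurity[left]
            - t.weighted_n_node_samples[right] * t.impurity[right])
    importances /= t.weighted_n_node_samples[0]   # divide by the root's weighted sample count
    return importances / importances.sum()        # normalize, as scikit-learn does by default

first_tree = clf.estimators_.ravel()[0]           # first regression tree of the ensemble
print(np.round(impurity_decrease_importances(first_tree), 3))
print(np.round(first_tree.feature_importances_, 3))

If the two vectors agree, the simplified snippets above indeed capture how scikit-learn turns impurity decreases into feature importances.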