The calculation of feature importance in the GBDT algorithm

The calculation of feature importance in tree ensemble algorithms

Ensemble learning has attracted wide attention because of its high predictive accuracy, especially ensemble algorithms that use decision trees as base learners. Well-known tree-based ensemble algorithms include random forests and GBDT. Random forests resist overfitting well, their main parameter (the number of decision trees) has relatively little effect on prediction performance, and it is easy to set (a larger number generally works). GBDT has a very solid theoretical foundation and generally has an advantage in predictive performance. For the principles of the GBDT algorithm, please refer to my previous post, "GBDT algorithm theory in-depth analysis."

Tree-based ensemble algorithms have another useful property: after training, the model can output the relative importance of each feature. This makes it easy to perform feature selection and to understand which factors have a critical impact on the prediction, which is particularly important in some fields (such as bioinformatics and neuroscience). This article mainly introduces how tree-based ensemble algorithms compute the relative importance of each feature; a minimal usage sketch follows the list of advantages below.

Advantages of using boosted trees as the learning algorithm:

  • They handle different types of data easily, without requiring feature normalization/standardization.
  • It is easy to trade off runtime efficiency against accuracy; for example, when boosted trees serve as an online prediction model and machine resources are tight, the number of trees participating in the prediction can be reduced to speed it up.
  • The learned model can output the relative importance of each feature, so it can be used for feature selection.
  • The model is interpretable and insensitive to the deletion of data fields.
  • It can automatically handle interactions between multiple features and fits non-linear relationships well.
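A minimal usage sketch with scikit-learn (the dataset and parameter values are illustrative only): after fitting, the feature_importances_ attribute holds the relative importance of each feature, normalized to sum to 1.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data for illustration: 10 features, 4 of them informative
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=4, random_state=0)
clf = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rank features by their relative importance
for idx in clf.feature_importances_.argsort()[::-1]:
    print("feature %d: %.3f" % (idx, clf.feature_importances_[idx]))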

Friedman's approach in the GBM paper:

The global importance of feature j is measured by the average of its importance over the individual trees:

\hat{J}_j^2 = \frac{1}{M} \sum_{m=1}^{M} \hat{J}_j^2(T_m)

where M is the number of trees. The importance of feature j in a single tree T is:

\hat{J}_j^2(T) = \sum_{t=1}^{L-1} \hat{i}_t^2 \, \mathbb{1}(v_t = j)

where L is the number of leaf nodes of the tree, L-1 is the number of non-leaf nodes (the constructed tree is a binary tree, each internal node having a left and a right child), v_t is the feature used to split node t, and \hat{i}_t^2 is the reduction in squared loss achieved by the split at node t.
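To make the formulas concrete, here is a small illustrative sketch (the trees and numbers are made up, not produced by any library): each internal node is represented by a pair (v_t, i2_t), i.e., the feature it splits on and the squared-loss reduction of that split; the per-tree importance sums these reductions per feature, and the ensemble importance averages the per-tree values.

def tree_importance(tree_nodes, n_features):
    """J_j^2(T): sum the loss reductions of the nodes that split on feature j."""
    imp = [0.0] * n_features
    for v_t, i2_t in tree_nodes:
        imp[v_t] += i2_t
    return imp

def ensemble_importance(trees, n_features):
    """J_j^2: average the per-tree importances over the M trees."""
    total = [0.0] * n_features
    for tree_nodes in trees:
        for j, val in enumerate(tree_importance(tree_nodes, n_features)):
            total[j] += val
    return [v / len(trees) for v in total]

# Toy example: two trees over three features
trees = [
    [(0, 5.0), (2, 1.5)],           # tree 1 splits on features 0 and 2
    [(0, 3.0), (1, 2.0), (1, 0.5)]  # tree 2 splits on features 0 and 1
]
print(ensemble_importance(trees, n_features=3))  # [4.0, 1.25, 0.75]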

Implementation Code Snippets

To better understand how feature importance is computed, the implementation from the scikit-learn toolkit is given below, with some unrelated parts removed.

The following code comes from the computation of the feature_importances_ property of the GradientBoostingClassifier object:

def feature_importances_(self):
    total_sum = np.zeros((self.n_features,), dtype=np.float64)
    for tree in self.estimators_:
        total_sum += tree.feature_importances_
    importances = total_sum / len(self.estimators_)
    return importances
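As a quick illustration of this averaging idea (a rough sketch only; newer scikit-learn versions handle the normalization somewhat differently, so the result is not guaranteed to match feature_importances_ exactly), the per-tree importances can also be averaged by hand:

import numpy as np

# clf is a fitted GradientBoostingClassifier, as in the earlier sketch.
# For GradientBoostingClassifier, estimators_ is a 2D array of
# DecisionTreeRegressor objects, so it is flattened before averaging.
per_tree = [t.feature_importances_ for t in clf.estimators_.ravel()]
manual_avg = np.mean(per_tree, axis=0)
print(manual_avg)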

Here, self.estimators_ is the array of decision trees built by the algorithm, and tree.feature_importances_ is the feature importance vector of a single tree, which is computed as follows:

cpdef compute_feature_importances(self, normalize=True):
    """Computes the importance of each feature (aka variable)."""
    # ... (variable declarations and setup omitted)

    while node != end_node:
        if node.left_child != _TREE_LEAF:
            # ... and node.right_child != _TREE_LEAF:
            left = &nodes[node.left_child]
            right = &nodes[node.right_child]

            # weighted impurity reduction produced by the split at this node
            importance_data[node.feature] += (
                node.weighted_n_node_samples * node.impurity -
                left.weighted_n_node_samples * left.impurity -
                right.weighted_n_node_samples * right.impurity)
        node += 1

    importances /= nodes[0].weighted_n_node_samples

    return importances
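For comparison, the same accumulation can be reproduced in plain Python on a fitted decision tree, using the public arrays of the low-level Tree object (children_left, children_right, feature, impurity, weighted_n_node_samples). This is a sketch for illustration: it follows the logic above and additionally applies the normalize=True step (dividing by the total so the importances sum to 1), which the simplified excerpt omits.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
est = DecisionTreeClassifier(random_state=0).fit(X, y)
t = est.tree_  # the low-level Tree object used in the code above

importances = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # TREE_LEAF is -1: skip leaf nodes
        continue
    # weighted impurity reduction produced by the split at this node
    importances[t.feature[node]] += (
        t.weighted_n_node_samples[node] * t.impurity[node]
        - t.weighted_n_node_samples[left] * t.impurity[left]
        - t.weighted_n_node_samples[right] * t.impurity[right])

importances /= t.weighted_n_node_samples[0]  # divide by the root's sample weight
importances /= importances.sum()             # the normalize=True step

print(importances)
print(est.feature_importances_)  # should be very close to the values above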

The code above has been simplified to keep only the core idea: for each feature, it accumulates the reduction in weighted impurity over all the non-leaf nodes that split on that feature; the larger the accumulated reduction, the more important the feature.

The reduction in impurity is exactly the gain of splitting the node, so we can also say that the larger the gain produced when a node splits, the more important the feature used at that node. For the definition of the gain, refer to equation (9) in my previous post, "GBDT algorithm theory in-depth analysis."

References

[1] Feature Selection for Ranking using Boosted Trees
[2] Gradient Boosted Feature Selection
[3] Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination
[4] GBDT algorithm in layman's principles
