Summary of scikit-learn Decision Tree Algorithm Class Library Usage

Source: Internet
Author: User

Reference: http://www.cnblogs.com/pinard/p/6056319.html

Previously, we summarized the principles of the decision tree algorithm in two posts: Principles of the Decision Tree Algorithm (Part 1) and Principles of the Decision Tree Algorithm (Part 2). Today we introduce decision trees from a practical point of view, mainly explaining how to run the decision tree algorithm with scikit-learn, how to visualize the results, and the key points of parameter tuning.

1. Introduction to the scikit-learn decision tree algorithm class library

Internally, the scikit-learn decision tree class library implements an optimized version of the CART algorithm, which can be used for both classification and regression. The classification decision tree corresponds to the class DecisionTreeClassifier, and the regression decision tree corresponds to the class DecisionTreeRegressor. The parameters of the two are almost identical, but their meanings are not all the same. Below is a summary of the important parameters of DecisionTreeClassifier and DecisionTreeRegressor, focusing on the differences between the two and the points that need attention.

2. Important parameter notes for DecisionTreeClassifier and DecisionTreeRegressor

To make comparison easier, the key parameters of DecisionTreeClassifier and DecisionTreeRegressor are summarized below, parameter by parameter. Unless noted otherwise, a description applies to both classes; where the two classes differ (for example criterion), the descriptions are given separately.

Feature selection criterion (criterion)

"Gini" or "entropy" can be used to represent the Gini coefficient, which represents the information gain. It is generally said that using the default Gini coefficient "Gini" is OK, that is, the cart algorithm. Unless you prefer the best Feature selection method like ID3, C4.5.

Either "MSE" or "Mae" can be used, which is the mean variance, the sum of the absolute values of the difference between the and the mean. It is recommended to use the default "MSE". In general the "MSE" is more accurate than "Mae". Unless you want to compare the effects of the two parameters to the difference.

Feature split point selection strategy (splitter)

You can use "best" or "random". The former finds the optimal dividing point in all the dividing points of the feature. The latter is random to find the local optimal dividing point in the partial dividing point.

The default "best" is suitable when the sample size is small, and if the sample data is very large, the decision tree construction recommends "random"

Maximum number of features to consider when splitting (max_features)

Many types of values can be used. The default "None" means that all features are considered at each split; "log2" means that at most log2(N) features are considered; "sqrt" or "auto" means that at most sqrt(N) features are considered. An integer gives the absolute number of features to consider; a float gives a fraction, i.e. the number of features considered is the rounded value of (fraction × N). Here N is the total number of features of the samples.

In general, if the number of features is not large, say fewer than 50, the default "None" is fine. If the number of features is very large, you can flexibly use the other values just described to limit the number of features considered at each split and thus control the time needed to build the decision tree.
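A sketch of the different ways max_features can be given (the concrete numbers 20 and 0.3 are arbitrary values for illustration):

from sklearn.tree import DecisionTreeClassifier

clf_log2 = DecisionTreeClassifier(max_features="log2")  # at most log2(N) features per split
clf_sqrt = DecisionTreeClassifier(max_features="sqrt")  # at most sqrt(N) features per split
clf_int = DecisionTreeClassifier(max_features=20)       # at most 20 features per split (absolute number)
clf_frac = DecisionTreeClassifier(max_features=0.3)     # at most 30% of the N features per split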

Maximum depth of the decision tree (max_depth)

The maximum depth of the decision tree is not set by default; if it is not set, the depth of the subtrees is not limited when the tree is built. In general, you can ignore this value when there are few samples or few features. If the model has many samples and many features, it is recommended to limit the maximum depth; the best value depends on the distribution of the data. Commonly used values lie between 10 and 100.

Minimum number of samples required to split an internal node (min_samples_split)

This value restricts the condition under which a node may be split further: if a node has fewer than min_samples_split samples, no further attempt is made to select an optimal feature to split on. The default is 2. If the sample size is small, you do not need to worry about this value. If the sample size is very large, it is recommended to increase it. In one of my previous projects with about 100,000 samples, I chose min_samples_split=10 when building the decision tree; you can use it as a reference.

Minimum number of samples at a leaf node (min_samples_leaf)

This value limits the minimum number of samples on a leaf node; if a leaf ends up with fewer samples than this, it is pruned together with its sibling. The default is 1. You can pass an integer for the minimum number of samples, or a float for the minimum number of samples as a fraction of the total sample count. If the sample size is small, you do not need to worry about this value; if the sample size is very large, it is recommended to increase it. The 100,000-sample project mentioned above used min_samples_leaf=5, for reference only.
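A combined sketch of min_samples_split and min_samples_leaf, using the reference values from the 100,000-sample project mentioned above (these are only starting points and should be tuned, for example with cross-validation):

from sklearn.tree import DecisionTreeClassifier

# Do not split nodes with fewer than 10 samples, and require at least 5 samples per leaf
clf = DecisionTreeClassifier(min_samples_split=10, min_samples_leaf=5)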

Minimum weighted fraction of samples at a leaf node (min_weight_fraction_leaf)

This value limits the minimum total sample weight of a leaf node; if a leaf falls below this value, it is pruned together with its sibling. The default is 0, i.e. weights are not considered. In general, if many samples have missing values, or if the class distribution of the classification tree is strongly skewed, sample weights will be introduced, and then this value should be paid attention to.

Maximum number of leaf nodes (max_leaf_nodes)

Limiting the maximum number of leaf nodes can prevent overfitting. The default is "None", i.e. the number of leaf nodes is not limited. If a limit is set, the algorithm builds the optimal decision tree within that maximum number of leaves. If there are not many features, you can ignore this value; if there are many features, it is worth restricting it, and a concrete value can be found via cross-validation.

Category weights (class_weight)

Specifies the weights of the sample classes, to prevent classes with too many training samples from biasing the decision tree toward those classes. You can specify the weight of each class yourself, or use "balanced", in which case the algorithm computes the weights itself; classes with few samples receive high weights. If the class distribution has no obvious skew, you can ignore this parameter and keep the default "None". Not applicable to the regression tree (DecisionTreeRegressor).
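A sketch of class_weight for an imbalanced classification problem (the explicit weights below are made-up values for illustration):

from sklearn.tree import DecisionTreeClassifier

# Let scikit-learn compute weights inversely proportional to the class frequencies
clf_balanced = DecisionTreeClassifier(class_weight="balanced")

# Or set the weights per class yourself, e.g. up-weight a rare positive class
clf_manual = DecisionTreeClassifier(class_weight={0: 1, 1: 10})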

Minimum impurity for splitting a node (min_impurity_split)

This value limits the growth of the decision tree: if a node's impurity (Gini impurity, information gain, mean squared error, or mean absolute error) is below this threshold, the node is not split further and becomes a leaf. (Note: newer scikit-learn releases have deprecated min_impurity_split in favor of min_impurity_decrease.)

Whether to pre-sort the data (presort)

This is a Boolean value; the default is False, i.e. no pre-sorting. In general, if the sample size is small, or if the tree depth is limited to a very small value, setting it to True can make the selection of split points, and hence the construction of the tree, faster. With a large sample size there is no benefit, and with a small sample size training is not slow anyway, so this parameter can usually be ignored. (Note: newer scikit-learn releases have removed the presort parameter.)

Besides these parameters, other points to pay attention to during tuning are:

1) When the number of samples is small but the number of features is very large, the decision tree overfits easily. In general, it is easier to build a robust model when there are more samples than features.

2) If the number of samples is small but the number of features is very large, it is recommended to reduce the dimensionality before fitting the decision tree model, for example with principal component analysis (PCA), feature selection (such as Lasso), or independent component analysis (ICA). This greatly reduces the feature dimensionality, and fitting the decision tree model afterwards works much better.
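A minimal sketch of point 2), reducing the dimensionality with PCA before fitting the tree; the random data and the choice of 10 components are hypothetical and only for illustration:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: few samples, many features
rng = np.random.RandomState(0)
X = rng.randn(100, 500)
y = rng.randint(0, 2, 100)

# Reduce to 10 dimensions, then fit the decision tree on the reduced data
X_reduced = PCA(n_components=10).fit_transform(X)
clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X_reduced, y)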

3) It is recommended to make heavy use of decision tree visualization (introduced in the next section), and at the same time to limit the depth of the tree first (for example, to at most 3 levels), so that you can observe in the resulting tree how it initially fits your data before deciding whether to increase the depth.

4) When training the model, pay attention to the class distribution of the samples (this mainly applies to classification trees). If the class distribution is very uneven, consider using class_weight to keep the model from being overly biased toward the classes with many samples.

5) The decision tree works internally on NumPy float32 arrays; if the training data is not in this format, the algorithm makes a copy before running.

6) If the input sample matrix is sparse, it is recommended to convert it to a sparse csc_matrix before fitting and to a sparse csr_matrix before predicting.
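A sketch of point 6), assuming the data is held in SciPy sparse matrices:

import numpy as np
from scipy.sparse import csc_matrix, csr_matrix
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X_dense = rng.randn(100, 20)
y = rng.randint(0, 2, 100)

# Fit on a column-compressed (CSC) sparse matrix ...
clf = DecisionTreeClassifier()
clf.fit(csc_matrix(X_dense), y)

# ... and predict on a row-compressed (CSR) sparse matrix
pred = clf.predict(csr_matrix(X_dense))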

3. Visualization of scikit-learn decision tree results

Visualizing a decision tree makes it easy to inspect the model intuitively and to discover problems in it. Here we introduce the decision tree visualization method in scikit-learn.

3.1 Setting up the decision tree visualization environment

Visualizing decision trees in scikit-learn generally requires installing Graphviz. This mainly involves installing Graphviz itself and the Python graphviz plugin.

The first step is to install Graphviz, available at http://www.graphviz.org/. On Linux you can install it with apt-get or yum; on Windows, download the MSI installer from the official website. In either case, set the environment variable afterwards by adding the Graphviz bin directory to PATH. For example, on Windows I added C:/Program Files (x86)/graphviz2.38/bin/ to PATH.

The second step is to install the Python plugin graphviz: pip install graphviz

The third step is to install the Python plugin pydotplus: pip install pydotplus

With that, the environment is ready. Sometimes Python still cannot find Graphviz; in that case, you can add this line to your code:

os.environ["PATH"] + = Os.pathsep + ' C:/Program Files (x86)/graphviz2.38/bin/'

Note that the path at the end should be your own Graphviz bin directory.

3.2 Three ways to visualize a decision tree

Here is an example of decision tree visualization.

First load the class library:

from sklearn.datasets import load_iris
from sklearn import tree
import sys
import os

os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/graphviz2.38/bin/'

Then load the scikit-learn iris data and fit a decision tree to get the model:

iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)

You can now write the model to the dot file iris.dot:

With open ("Iris.dot", ' W ') as F:    f = Tree.export_graphviz (CLF, out_file=f)

At this point we have three visualization methods. The first is to use the graphviz dot command to generate the decision tree visualization file. After this command completes, the visualization file iris.pdf appears in the current directory; open it and you can see the model diagram of the decision tree.

# Note: run this command on the command line: dot -Tpdf iris.dot -o iris.pdf

The second method is to use pydotplus to generate iris.pdf directly, so there is no need to go to the command line to create the PDF yourself.
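The code for this second method is not shown in the original text; a rough sketch, assuming clf has been fitted as above, pydotplus is installed, and Graphviz is on the PATH, could look like this:

import pydotplus
from sklearn import tree

dot_data = tree.export_graphviz(clf, out_file=None)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("iris.pdf")  # writes the decision tree diagram directly to a PDF file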

The third approach is the one I personally recommend, because the graph can be generated directly inside an IPython (Jupyter) notebook. The code is as follows:

from IPython.display import Image
import pydotplus

dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())

The diagram generated in the IPython notebook is as follows:

4. DecisionTreeClassifier example

Here is a DecisionTreeClassifier example that limits the maximum depth of the decision tree to 4.

from itertools import product

import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

# Still use the built-in iris data
iris = datasets.load_iris()
X = iris.data[:, [0, 2]]
y = iris.target

# Train the model, limiting the maximum tree depth to 4
clf = DecisionTreeClassifier(max_depth=4)
# Fit the model
clf.fit(X, y)

# Plot the decision regions
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.8)
plt.show()

The resulting figure is as follows:


We then visualize the fitted decision tree itself, using the third (recommended) method. The code is as follows:

from IPython.display import Image
from sklearn import tree
import pydotplus

# The model was trained on only two of the four iris features (columns 0 and 2),
# so pass the two matching feature names
feature_names = [iris.feature_names[0], iris.feature_names[2]]

dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())

The resulting decision tree graph is as follows:

The above is a summary of how to use the scikit-learn decision tree algorithm class library. I hope it helps everyone.
