(Data Science Learning Codex 23) Decision Tree Classification: Detailed Principles and Python & R Implementations


A decision tree is a decision-analysis method that, based on the known probabilities of various outcomes, builds a tree of decisions to evaluate project risk and judge feasibility (for example, by finding the branches whose expected net present value is greater than or equal to zero); it is a graphical method that applies probability analysis intuitively. Because the branches of the diagram are drawn like the branches of a tree, it is called a decision tree. In machine learning, a decision tree is a predictive model that represents a mapping between object attributes and object values.

First, a first look at decision trees

A decision tree is a tree structure. In general, it contains one root node, several internal nodes, and several leaf nodes:

Leaf node: the end of one path through the tree, representing an output (the predicted result);

Root node: holds the full initial sample set;

Internal node: each internal node corresponds to one attribute test (i.e., one decision).

The path from the root node to each leaf node forms a sequence of judgments. Training a decision tree classifier aims to produce a tree with strong generalization ability, that is, the ability to handle samples not seen during training; the basic training process follows a "divide and conquer" strategy:

Algorithm process:

STEP 1: Input the sample set D = {(x1, y1), (x2, y2), ..., (xn, yn)} and the attribute set A = {a1, a2, ..., ad}; the entire sample set is stored in the root node;

STEP 2: Using some rule (the specific rule depends on the algorithm), pick an optimal attribute a1 from the attribute set A; all samples flow from the root node to the decision node and are then routed in different directions according to their value on attribute a1.

After the sample set is split on an attribute, several situations can arise for each branch direction:

1. All samples flowing in a certain direction belong to a single category y0. The branch is marked as a leaf node, and any new sample that ends up in this direction is directly classified as category y0;

2. After splitting on the current attribute, no training sample flows in a certain direction. This usually reflects insufficient sample diversity. The branch is still marked as a leaf node, the class proportions of the training samples at the parent node are used as the prior probabilities, and any new sample routed in this direction is assigned to the class with the highest prior probability;

3. All training samples take the same value on the attribute being tested. This is similar to case 2: no training sample flows in the other possible directions, and new samples reaching those directions are handled as in case 2;

STEP 3: Repeat STEP 2 until all attributes have been used and a complete tree is formed, so that every judgment path has passed through the attributes it needs; each leaf node's output category is defined as the class with the largest proportion among the training samples reaching that leaf (i.e., using the prior distribution). At this point, the decision tree is fully trained.
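To make the divide-and-conquer recursion concrete, here is a minimal Python sketch. It is only an illustration, not the full algorithm: samples are assumed to be dictionaries mapping attribute names to values, choose_attribute is a placeholder for whatever selection rule the algorithm uses (discussed in the next section), and directions with no training samples fall back to a stored majority class.

from collections import Counter

def majority_class(labels):
    # class with the largest proportion among the samples reaching this node
    return Counter(labels).most_common(1)[0][0]

def build_tree(samples, labels, attributes, choose_attribute):
    # case 1: every sample at this node has the same category -> leaf node
    if len(set(labels)) == 1:
        return labels[0]
    # no attributes left to test -> leaf labelled by the majority class
    if not attributes:
        return majority_class(labels)
    # pick the "optimal" attribute by whatever rule the algorithm uses
    a = choose_attribute(samples, labels, attributes)
    # 'default' plays the role of the prior distribution for directions
    # that receive no training samples (case 2 above)
    node = {'attribute': a, 'default': majority_class(labels), 'branches': {}}
    remaining = [x for x in attributes if x != a]
    # route every sample in the direction given by its value on attribute a
    for v in set(s[a] for s in samples):
        idx = [i for i, s in enumerate(samples) if s[a] == v]
        node['branches'][v] = build_tree([samples[i] for i in idx],
                                         [labels[i] for i in idx],
                                         remaining, choose_attribute)
    return node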

Second, attribute selection during training

We now know the training process of a decision tree, but not yet how to decide which attribute should be tested first, which second, and so on. This is the most important and most ingenious part of decision trees: the choice of split;

Split selection: the key to decision tree learning is how to select the optimal partitioning attribute. As the partitioning process continues, we want the branch nodes of the decision tree to contain samples of the same category as far as possible, that is, the purity of the nodes should become higher and higher. Below are several different rules for measuring sample purity, each corresponding to a different decision tree algorithm:

1. Information Gain

Before defining information gain, we first introduce the following concept:

Information entropy:

One of the most commonly used measures of the purity of a sample set. Suppose the proportion of class-k samples in the current sample set D is p_k (k = 1, 2, ..., |Y|), where |Y| is the number of classes. The information entropy of D is then defined as

$$\mathrm{Ent}(D) = -\sum_{k=1}^{|Y|} p_k \log_2 p_k$$

The smaller Ent(D) is, the higher the purity of D.

Suppose the discrete attribute a has V possible values {a^1, a^2, ..., a^V}. Using a to split the sample set D produces V branch nodes, where the v-th branch node receives all samples in D whose value on attribute a is a^v; this subset is denoted D^v. The information gain obtained by splitting D on attribute a is

$$\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} \mathrm{Ent}(D^v)$$

where |D^v| is the number of samples in D that take value a^v on attribute a, so |D^v|/|D| can be regarded as the weight of the branch corresponding to a^v.

* Principle: the larger the information gain, the larger the "purity improvement" obtained by splitting on attribute a, so the current optimal splitting attribute is

$$a_* = \underset{a \in A}{\arg\max}\ \mathrm{Gain}(D, a)$$
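To make these definitions concrete, here is a small, self-contained Python sketch (using only numpy) that computes the information entropy of a label list and the information gain of one candidate attribute; the toy data at the bottom is purely illustrative.

import numpy as np

def entropy(labels):
    # Ent(D) = -sum_k p_k * log2(p_k), summed over the classes present in D
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(attribute_values, labels):
    # Gain(D, a) = Ent(D) - sum_v (|D^v|/|D|) * Ent(D^v)
    attribute_values = np.asarray(attribute_values)
    labels = np.asarray(labels)
    total = entropy(labels)
    weighted = 0.0
    for v in np.unique(attribute_values):
        mask = attribute_values == v
        weighted += mask.mean() * entropy(labels[mask])
    return total - weighted

# toy example: a binary attribute against binary labels
a = ['yes', 'yes', 'no', 'no', 'yes', 'no']
y = [1, 1, 0, 0, 1, 1]
print(information_gain(a, y))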

2. Gain ratio

Sometimes the sample set contains an attribute such as an ID number. Although it is not a truly informative attribute, it makes the branch nodes purer than any genuinely useful attribute (because it separates every sample into its own branch), so each branch immediately becomes a leaf node (the special case 1 above). Such a decision tree obviously has no generalization ability and cannot predict new samples. In other words, the information gain criterion has a preference for attributes with a large number of possible values. To reduce the possible adverse effects of this preference, the following algorithm was introduced:

C4.5 algorithm:

Instead of using information gain directly, the "gain ratio" is used to select the current optimal partitioning attribute.

The gain ratio is defined as

$$\mathrm{Gain\_ratio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}$$

where

$$\mathrm{IV}(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}$$

is called the intrinsic value of attribute a. The more possible values attribute a has (i.e., the larger V is), the larger IV(a) tends to be. Note that, in contrast to information gain, the gain ratio has a preference for attributes with fewer possible values, so the C4.5 algorithm does not simply choose the candidate attribute with the highest gain ratio; instead it uses a heuristic: first select, from the candidate attributes, those whose information gain is above average, and then pick the one with the highest gain ratio among them.
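The intrinsic value and gain ratio can be sketched the same way; this assumes an information_gain helper like the one in the previous snippet, and the C4.5 heuristic of pre-filtering by above-average information gain is not shown.

import numpy as np

def intrinsic_value(attribute_values):
    # IV(a) = -sum over values v of (|D^v|/|D|) * log2(|D^v|/|D|)
    _, counts = np.unique(attribute_values, return_counts=True)
    w = counts / counts.sum()
    return -np.sum(w * np.log2(w))

def gain_ratio(gain, attribute_values):
    # Gain_ratio(D, a) = Gain(D, a) / IV(a); 'gain' is the information gain
    # of this attribute, computed as in the previous snippet
    return gain / intrinsic_value(attribute_values)

# an attribute with many distinct values has a large IV(a), which shrinks
# its gain ratio relative to its raw information gain
print(intrinsic_value(['a', 'b', 'c', 'd']))   # 2.0
print(intrinsic_value(['a', 'a', 'b', 'b']))   # 1.0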

3. Gini index

The CART decision tree (Classification and Regression Tree) uses the Gini index to select the partitioning attribute. The purity of data set D can be measured by the Gini value

$$\mathrm{Gini}(D) = \sum_{k=1}^{|Y|} \sum_{k' \neq k} p_k p_{k'} = 1 - \sum_{k=1}^{|Y|} p_k^2$$

Gini(D) reflects the probability that two samples drawn at random from data set D carry inconsistent class labels, so the smaller Gini(D) is, the higher the purity of data set D. The Gini index of an attribute a is then

$$\mathrm{Gini\_index}(D, a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|} \mathrm{Gini}(D^v)$$

Therefore, among the remaining candidate attributes in set A, we select the one that minimizes the Gini index as the current optimal partitioning attribute, namely

$$a_* = \underset{a \in A}{\arg\min}\ \mathrm{Gini\_index}(D, a)$$
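A matching sketch for the Gini value and Gini index, again only a minimal numpy illustration rather than the CART implementation itself:

import numpy as np

def gini(labels):
    # Gini(D) = 1 - sum_k p_k^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_index(attribute_values, labels):
    # Gini_index(D, a) = sum_v (|D^v|/|D|) * Gini(D^v)
    attribute_values = np.asarray(attribute_values)
    labels = np.asarray(labels)
    total = 0.0
    for v in np.unique(attribute_values):
        mask = attribute_values == v
        total += mask.mean() * gini(labels[mask])
    return total

# the attribute with the smallest Gini index is chosen as the split
a = ['yes', 'yes', 'no', 'no', 'yes', 'no']
y = [1, 1, 0, 0, 1, 1]
print(gini_index(a, y))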

Third, pruning

In decision tree learning, in order to classify the training samples as correctly as possible, the node-splitting process is repeated over and over, sometimes producing too many branches. The tree may then learn the training set too well, treating peculiarities of the training set itself as general properties of all data, which results in overfitting.

The process of reducing the risk of overfitting by proactively removing some branches is called pruning.

The basic strategy of pruning the decision tree:

1. Pre-pruning (prepruning)

During decision tree generation, generalization performance is estimated before each node is split; if splitting the current node does not improve the generalization performance of the decision tree, the split is stopped and the current node is marked as a leaf node.

2. Post-pruning (post-pruning)

First, a complete decision tree is generated from the training set; then the non-leaf nodes are examined from the bottom up, and if replacing the subtree rooted at a node with a leaf node improves the generalization performance of the decision tree, that subtree is replaced with a leaf node.

Pre-pruning:

Steps:

STEP 1: To measure generalization ability, split the sample set into a training set and a validation set using the hold-out method;

STEP 2: According to the information gain criterion, select the optimal attribute a* for the first split below the root; compare, on the validation set, the accuracy of the tree that performs this split with that of the tree that leaves the node as a leaf, and keep the better of the two;

STEP 3: Repeat STEP 2 for each candidate split until the final decision tree is complete;

* A decision tree with only one layer of splits is called a decision stump.

Principle: prune (forbid) any split whose validation accuracy is less than or equal to the current accuracy (i.e., the best accuracy achieved so far);

Advantages: pre-pruning keeps many branches of the decision tree from being expanded, which reduces the risk of overfitting and significantly reduces the training-time and test-time overhead of the tree.

Disadvantage: although the current split of some branch may not improve generalization ability, or may even cause a temporary decline, subsequent splits based on it might lead to a significant performance improvement;

Pre-pruning greedily prohibits such branches from being expanded, caring only about the current performance, which brings a risk of underfitting to the pre-pruned decision tree.
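scikit-learn does not expose the validation-set pre-pruning loop described above directly, but a similar effect can be approximated by restricting tree growth (for example with max_depth) and comparing accuracy on a held-out validation set; a rough sketch, with an arbitrary depth grid, follows.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# grow less and less restricted trees and keep the depth that does best
# on the held-out validation set (an approximation of pre-pruning)
best_depth, best_acc = None, -1.0
for depth in [1, 2, 3, 4, 5, None]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    acc = clf.score(X_val, y_val)
    if acc > best_acc:
        best_depth, best_acc = depth, acc
print(best_depth, best_acc)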

Post-pruning:

Steps:

STEP 1: First, without any pruning, grow a complete decision tree using all the attributes according to some purity criterion. Then, starting from the bottom-most non-leaf nodes, compare the model's generalization ability with and without cutting each node's subtree;

STEP 2: If generalization ability is improved by the cut, keep the change; otherwise keep the original subtree;

STEP 3: Repeat the process until the pruning assessment has been completed for all non-leaf nodes.

Principle: if the accuracy improves after pruning, the pruning operation is applied; otherwise the tree is left unchanged;

Advantages: the risk of underfitting is very small, and the generalization ability is usually better than that of a pre-pruned decision tree;

Disadvantage: post-pruning is carried out only after a full decision tree has been generated, and inspecting all non-leaf nodes one by one from the bottom up makes the training time costly.
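For post-pruning in scikit-learn, the built-in mechanism is minimal cost-complexity pruning (available in newer versions via cost_complexity_pruning_path and the ccp_alpha parameter). It differs in detail from the validation-set comparison described above but follows the same "grow a full tree, then cut back" idea; a rough sketch:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# grow the full tree first, then compute the candidate pruning strengths
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# refit with each ccp_alpha and keep the pruned tree that validates best
best_alpha, best_acc = 0.0, -1.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    acc = pruned.score(X_val, y_val)
    if acc > best_acc:
        best_alpha, best_acc = alpha, acc
print(best_alpha, best_acc)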

This covers the basic knowledge of the decision tree algorithm; next, we implement decision trees in Python and R respectively:

Fourth, Python

We use DecisionTreeClassifier() from the tree module of sklearn to build classification decision trees. Details can be found on sklearn's official website: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier. Here we introduce the main parameters:

criterion: string; determines the splitting criterion used by the algorithm: "gini" corresponds to the CART tree algorithm and "entropy" to the ID3 algorithm; the default is "gini";

splitter: string; determines how the split is chosen at each node, based on the criterion above: "best" chooses the best split, while "random" chooses the best random split; the default is "best";

max_depth: integer; the maximum depth of the decision tree (the maximum number of split levels along any path from the root); the default is None, i.e., the depth is not limited;

min_samples_split: there are two cases:

1. Integer: the minimum number of samples required to split an internal node; a node with fewer samples than this is turned into a leaf whose output follows the prior (majority-class) distribution. The default is 2;

2. Float: the meaning is unchanged, but the effective threshold becomes min_samples_split * n_samples, i.e., the value is interpreted as a fraction of the total number of samples.

min_samples_leaf: there are two cases:

1. Integer: the minimum number of samples required to form a leaf node; a leaf is not created if it would contain fewer samples than this value. The default is 1;

2. Float: same interpretation as for min_samples_split, i.e., a fraction of the total number of samples.

min_weight_fraction_leaf: float; the minimum weighted fraction of the total sample weight required at a leaf node. It is mainly useful when sample weights are used to rebalance imbalanced classes; by default every sample has equal weight;

max_features: determines how many attributes are considered when searching for the split at each internal node (i.e., when computing the information gain or Gini index). By default all attributes are used. The possible cases are:

1. Integer: the value is the maximum number of attributes considered at each split;

2. Float: the maximum number of attributes is that fraction of the total number of attributes;

3. String: "auto", the maximum number of attributes is the square root of the total number of attributes; "sqrt" behaves the same as "auto"; "log2", the maximum number of attributes is the base-2 logarithm of the total number of attributes;

4. None: the maximum number of attributes equals the total number of attributes;

max_leaf_nodes: the maximum number of leaf nodes in the final decision tree model; the default is None, i.e., no limit;

class_weight: class weights for handling class-imbalance problems. "balanced" is recommended, which sets the weights automatically according to the class frequencies in the data; the default is None, i.e., all classes are weighted equally.
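As a quick illustration of several of these parameters used together (the specific values are arbitrary, chosen only to show the call):

from sklearn.tree import DecisionTreeClassifier

# an entropy-based tree, pre-pruned by depth and leaf size,
# with class weights rebalanced automatically
clf = DecisionTreeClassifier(
    criterion='entropy',
    splitter='best',
    max_depth=3,
    min_samples_split=10,
    min_samples_leaf=5,
    max_features='sqrt',
    class_weight='balanced',
)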

These are the main parameters of sklearn.tree.DecisionTreeClassifier. Below we use the Titanic data from the Kaggle playground as demonstration data for the binary classification of survival:

Data Description:

Code:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

"""read in the data"""
raw_train_data = pd.read_csv('train.csv')
train = raw_train_data.dropna()
target_train = train['Survived'].tolist()

# ticket class
pclass = train['Pclass'].tolist()

# encode sex as 1 for male, 0 for female
sex_raw = train['Sex'].tolist()
sex = []
for i in range(len(sex_raw)):
    if sex_raw[i] == 'male':
        sex.append(1)
    else:
        sex.append(0)

age = train['Age'].tolist()
# number of brothers and sisters on board
sibsp = train['SibSp'].tolist()
# number of parents or children on board
parch = train['Parch'].tolist()
fare = train['Fare'].tolist()
# port of embarkation
embarked = train['Embarked'].tolist()

# dummy variables for the embarkation port
sabor_c = []
sabor_q = []
for i in range(len(embarked)):
    if embarked[i] == 'C':
        sabor_c.append(1)
        sabor_q.append(0)
    elif embarked[i] == 'Q':
        sabor_q.append(1)
        sabor_c.append(0)
    else:
        sabor_q.append(0)
        sabor_c.append(0)

"""define the features and the target"""
train_ = np.array([sex, age, sabor_c, sabor_q]).T
target_ = np.array(target_train)

"""average accuracy over repeated random splits of the sample set"""
s = []
for i in range(1000):
    X_train, X_test, y_train, y_test = train_test_split(train_, target_, test_size=0.3)
    clf = DecisionTreeClassifier(class_weight='balanced', max_depth=2)
    clf = clf.fit(X_train, y_train)
    s.append(clf.score(X_test, y_test))

"""print the results"""
print('average correct rate: ' + str(np.mean(s)))

Training effect:

Fifth, R

Using decision tree algorithms in R has one great convenience: visualization. The decision tree is a highly interpretable machine learning algorithm, which is one of the reasons it is so widely used, and it is very easy to plot a decision tree in R. The initial growing of a decision tree and its pruning are performed with two different functions. Here we use the rpart package to create a classification tree: the rpart() function grows the decision tree, and the prune() function prunes it. The main parameters are as follows:

For rpart():

formula: the model formula used by many algorithms in R, with the target column name on the left-hand side and the predictor column names on the right-hand side;

data: the input data frame;

weights: optional case weights, mainly used for class imbalance, similar to the rescaling used in logistic regression;

na.action: how missing values are handled; by default, samples missing the target value are deleted, while samples missing only predictor values are kept (decision trees are fairly tolerant of missing values and have corresponding handling methods);

parms: the splitting criterion; the default is the Gini index, i.e., the CART method for partitioning nodes;

> rm(list = ls())
> library(rpart.plot)
> library(rpart)
> data(iris)
> data <- iris
> sam <- sample(1:150, 120)
> train_data <- data[sam, ]
> test_data <- data[-sam, ]
> dtree <- rpart(Species ~ ., data = train_data)
> plotcp(dtree)
> dtree.pruned <- prune(dtree, cp = 0.01)
> prp(dtree.pruned)
> dtree.pred <- predict(dtree.pruned, test_data[, 1:4], type = 'class')
> dtree.perf <- table(test_data[, 5], dtree.pred)
> dtree.perf
             dtree.pred
              setosa versicolor virginica
  setosa          10          0         0
  versicolor       0         10         0
  virginica        0          3         7
