The basic decision tree algorithm
The basic decision tree algorithm can be written as a recursive procedure: the recursion returns whenever the current node does not need to be, or cannot be, split further. There are three cases in which the recursive function returns. In the first case, all samples in the training set share the same label, and the node simply becomes a leaf with that label. In the second case, the attribute set is empty, or all samples take the same values on the remaining attributes; the node becomes a leaf labeled with the majority class of its samples. In the third case, the training set for the branch is empty, i.e. no training sample reaches it; the node becomes a leaf labeled with the majority class of its parent's samples. Let me use an example to illustrate:
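A minimal sketch of this recursion in plain Python (my own illustrative code, not from any particular library; for brevity the split criterion here just takes the first attribute, where a real implementation would use information gain, gain ratio, or the Gini index):

```python
from collections import Counter

def majority_label(labels):
    # Most common label among the current samples
    return Counter(labels).most_common(1)[0][0]

def tree_generate(rows, labels, attributes, value_domains):
    # Case 1: all samples carry the same label -> leaf with that label
    if len(set(labels)) == 1:
        return labels[0]
    # Case 2: attribute set empty, or all samples identical on the
    # remaining attributes -> leaf with the majority label
    if not attributes or all(len({r[a] for r in rows}) == 1 for a in attributes):
        return majority_label(labels)
    # Choose an attribute to split on (a real criterion would go here;
    # we simply take the first attribute for brevity)
    attr = attributes[0]
    node = {attr: {}}
    for v in value_domains[attr]:
        sub = [(r, l) for r, l in zip(rows, labels) if r[attr] == v]
        if not sub:
            # Case 3: no training sample has this value -> leaf with the
            # majority label of the parent node's samples
            node[attr][v] = majority_label(labels)
        else:
            sub_rows, sub_labels = zip(*sub)
            rest = [a for a in attributes if a != attr]
            node[attr][v] = tree_generate(list(sub_rows), list(sub_labels),
                                          rest, value_domains)
    return node
```

Calling `tree_generate` on the simplified Titanic rows introduced below (attributes sex, pclass, embarked) yields nested dicts whose leaves are class labels.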
Simplified Titanic data set and the three return cases of the decision tree
We assume the following data, simplified from the Titanic data set. The attributes are:
- sex: the passenger's gender, with values f (female) and m (male)
- pclass: the passenger class, with values 1, 2, and 3
- embarked: the port of embarkation, with values S and C
- survived is the classification result: 1 means survival, 0 means death.
| sex | pclass | embarked | survived |
| --- | --- | --- | --- |
| f | 2 | s | 1 |
| f | 3 | c | 1 |
| m | 1 | s | 0 |
| m | 1 | s | 1 |
| m | 2 | s | 0 |
| m | 2 | c | 1 |
| m | 1 | s | 0 |
The following illustration shows the three cases described above; the values inside the angle brackets indicate the different situations.
When sex is female, all samples survived (label 1), so this branch is the first case: the node becomes a leaf labeled 1.
When sex is male and pclass is 1, the remaining samples all take the same value on the remaining attribute (embarked = s) but their labels differ, so this is the second case: the leaf is labeled with the majority class among these samples. When sex is male and pclass is 2, after embarked (s or c) has also been used the attribute set is empty, so these nodes also fall under the second case.
There is no sample with sex = male and pclass = 3 in the data, so this branch's leaf is labeled with the most common class among the parent node's samples; this is the third case.
Common strategies for attribute selection
1: Using information gain
Information gain uses concepts from information entropy. Within a sample set, the more "chaotic" the class labels, the higher the entropy and the less pure the set; the "neater" the labels, the lower the entropy and the purer the set. For example, the label list [1,1,1,0] is better than [1,1,0,0]: the former is "neater", i.e. purer. So the purer the subsets an attribute's split produces, the more we prefer that attribute.
Let D be a set of samples and let $p_k$ $(k = 1, 2, \dots, \vert y \vert)$ be the proportion of samples of class k. The information entropy of D is
$Ent(D) = -\sum\limits_{k=1}^{\vert y \vert} p_k \log_2 p_k$
Suppose we split on attribute a, whose possible values are $\{a^1, a^2, \dots, a^V\}$, and let $D^v$ denote the subset of samples in D whose value on attribute a is $a^v$.
Then the information gain of splitting D on attribute a is:
$Gain(D, a) = Ent(D) - \sum\limits_{v=1}^{V} \frac{\vert D^v \vert}{\vert D \vert} Ent(D^v)$
The greater the information gain, the better the split on attribute a. The ID3 algorithm uses information gain as its criterion for attribute selection.
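As a rough, self-contained illustration (plain Python, my own helper names; the rows are the simplified Titanic data from the table above):

```python
import math
from collections import Counter

def entropy(labels):
    # Ent(D) = -sum_k p_k * log2(p_k)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    # Gain(D, a) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v)
    n = len(labels)
    total = entropy(labels)
    for v in {r[attr] for r in rows}:
        sub = [l for r, l in zip(rows, labels) if r[attr] == v]
        total -= len(sub) / n * entropy(sub)
    return total

# Simplified Titanic rows and labels from the table above
rows = [{'sex': 'f', 'pclass': 2, 'embarked': 's'},
        {'sex': 'f', 'pclass': 3, 'embarked': 'c'},
        {'sex': 'm', 'pclass': 1, 'embarked': 's'},
        {'sex': 'm', 'pclass': 1, 'embarked': 's'},
        {'sex': 'm', 'pclass': 2, 'embarked': 's'},
        {'sex': 'm', 'pclass': 2, 'embarked': 'c'},
        {'sex': 'm', 'pclass': 1, 'embarked': 's'}]
labels = [1, 1, 0, 1, 0, 1, 0]

# Gain of splitting on 'sex' (about 0.29 for this toy data)
print(information_gain(rows, labels, 'sex'))
```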
2: Using the gain ratio
Information gain has a known bias: attributes with more possible values tend to receive a higher gain. The gain ratio is introduced to correct for this when choosing the splitting attribute:
$Gain\_ratio(D, a) = \frac{Gain(D, a)}{IV(a)}$, where $IV(a) = -\sum\limits_{v=1}^{V} \frac{\vert D^v \vert}{\vert D \vert} \log_2 \frac{\vert D^v \vert}{\vert D \vert}$
The C4.5 algorithm improves on ID3 by using the gain ratio.
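A small sketch of the gain ratio, reusing the `entropy` and `information_gain` helpers from the previous snippet (again my own illustrative code):

```python
import math

def intrinsic_value(rows, attr):
    # IV(a) = -sum_v |D^v|/|D| * log2(|D^v|/|D|)
    n = len(rows)
    iv = 0.0
    for v in {r[attr] for r in rows}:
        p = sum(1 for r in rows if r[attr] == v) / n
        iv -= p * math.log2(p)
    return iv

def gain_ratio(rows, labels, attr):
    # Gain_ratio(D, a) = Gain(D, a) / IV(a)
    return information_gain(rows, labels, attr) / intrinsic_value(rows, attr)
```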
3: Using the Gini index
The purity of the data set D can also be measured by its Gini value:
$Gini(D) = \sum\limits_{k=1}^{\vert y \vert} \sum\limits_{k' \neq k} p_k p_{k'} = 1 - \sum\limits_{k=1}^{\vert y \vert} p_k^2$
The Gini index of attribute a, defined in terms of the Gini value, is:
$Gini\_index(D, a) = \sum\limits_{v=1}^{V} \frac{\vert D^v \vert}{\vert D \vert} Gini(D^v)$
We choose the attribute a with the lowest Gini index to split on; CART (Classification and Regression Tree) selects attributes by the Gini index.
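A corresponding sketch for the Gini value and Gini index (my own illustrative code; `rows` and `labels` as in the earlier snippets):

```python
from collections import Counter

def gini(labels):
    # Gini(D) = 1 - sum_k p_k^2
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(rows, labels, attr):
    # Gini_index(D, a) = sum_v |D^v|/|D| * Gini(D^v)
    n = len(labels)
    total = 0.0
    for v in {r[attr] for r in rows}:
        sub = [l for r, l in zip(rows, labels) if r[attr] == v]
        total += len(sub) / n * gini(sub)
    return total
```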
Pruning
Pruning is done to obtain better generalization performance; it is divided into pre-pruning and post-pruning.
Pre-pruning happens while the decision tree is being built: before a node is split, its performance as a leaf is compared with its performance after the split. The leaf's class label is chosen as in the second case of the basic algorithm, i.e. the most common class among the node's samples, and the two alternatives are then evaluated in some way (for example, on a validation set).
Post-pruning starts from a tree that has already been fully built: some internal nodes are replaced by leaves, and the resulting performance is observed.
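In scikit-learn, for example, pre-pruning is expressed through constructor parameters and post-pruning through cost-complexity pruning; a rough sketch (the parameter values below are arbitrary examples, not recommendations):

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: stop splitting early via constructor parameters
pre_pruned = DecisionTreeClassifier(max_depth=3,
                                    min_samples_leaf=5,
                                    min_impurity_decrease=0.01)

# Post-pruning: grow a full tree, then prune it back with
# cost-complexity pruning (ccp_alpha, available in scikit-learn >= 0.22)
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02)

# Both are fitted the usual way, e.g. pre_pruned.fit(x, y)
```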
Continuous values and multi-variable splits
When dealing with a continuous-valued attribute, we need to choose an appropriate split point to divide it; among the candidate values, we pick the split point that maximizes the information gain.
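A minimal sketch of this split-point search (my own illustrative code, reusing the `entropy` helper from the information-gain snippet; candidates are the midpoints between adjacent sorted values):

```python
def best_threshold(values, labels):
    # Return the threshold with the largest information gain.
    n = len(labels)
    pairs = sorted(zip(values, labels))
    base = entropy(labels)  # entropy() as sketched earlier
    best_gain, best_t = -1.0, None
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # no split point between equal values
        t = (pairs[i][0] + pairs[i + 1][0]) / 2.0
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain
```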
Multi-variable splitting means that a node may be divided using a combination of several attributes rather than a single one.
Decision Trees in Sklearn
This blog post explains the various parameters very clearly: Scikit-learn Decision Tree Algorithm class library use summary.
Let's try applying the approach from that post to the data above and draw the tree with Graphviz:
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import pandas as pd

path = r"C:\Users\xiaotiange\Desktop\tic.csv"
dt = DecisionTreeClassifier()
titanic = pd.read_csv(path)

# One-hot encode the categorical attributes with pd.get_dummies
# (LabelEncoder from sklearn.preprocessing is not needed here)
z1 = pd.get_dummies(titanic['Sex'])
z2 = pd.get_dummies(titanic['Pclass'])
z3 = pd.get_dummies(titanic['embarked'])
x = pd.concat([z1, z2, z3], axis=1)
x.columns = ['female', 'male', 'pclass_1', 'pclass_2', 'pclass_3',
             'embarked_c', 'embarked_s']
y = titanic['survived']
dt.fit(x, y)

f_names = x.columns
# class_names follow the sorted class labels (0, 1), so 'dead' (0) comes first
t_names = ['dead', 'survived']
dot_data = tree.export_graphviz(dt, out_file=None, feature_names=f_names,
                                class_names=t_names, filled=True, rounded=True,
                                special_characters=True)

import pydotplus
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("titanic.pdf")
```
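Once fitted, the tree can be used for prediction as usual; a small usage sketch, assuming the `dt` and `x` built above:

```python
# Predicted classes and class probabilities for the first few rows,
# using the one-hot encoded feature frame x built above
print(dt.predict(x.iloc[:3]))
print(dt.predict_proba(x.iloc[:3]))
```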