Huadian North Wind Blows
Key Laboratory of Cognitive Computing and Application, Tianjin University
Date Modified: 2015/8/11
The decision tree is a very simple machine learning classification algorithm, and the idea behind it comes from the human decision-making process. In the simplest case, people notice that before it rains, an east wind tends to blow and the sky darkens. Mapped onto the decision tree model, observations such as the wind and the darkening sky are the features we collect, and whether it rains is the class label. The decision tree is built as shown in the figure below.
To build the decision tree, we recursively draw features from the feature set without replacement and select, at each step, the feature whose information gain (or gain ratio) is largest as the current node; each value of that feature becomes an outgoing edge (branch) of the node. In fact it is mainly these edges that are being chosen, as can be seen from the information gain formula. The intuitive explanation is as follows.
In the extreme case, if there is a feature such that, for each of its values, the corresponding class labels are pure, the decision maker will certainly choose this feature as the criterion for classifying unknown data. From the information gain formula below, the information gain of such a feature is the largest.
g(D,A) = H(D) - H(D|A)
g(D,A): the information gain of feature A on the training data set D
H(D): the empirical entropy of the data set D
H(D|A): the empirical conditional entropy of the data set D given feature A
Conversely, when the class labels are evenly distributed under each value of a feature, H(D|A) is at its largest; and since H(D) is the same no matter which feature we consider, g(D,A) is then at its smallest.
In a word, the feature we want to pick is the one that carries the clearest classification information about the current data.
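To make the formula concrete, here is a minimal MATLAB sketch (my own illustration, not part of the original post) that computes H(D), H(D|A), and g(D,A) for a toy label vector and one candidate feature whose values split the labels perfectly; in that extreme case the gain equals H(D).
label = [0 0 1 1 1 0]';      % class labels of the data set D
feature = [1 1 2 2 2 1]';    % values of one candidate feature A (perfectly aligned with the labels)
% empirical entropy H(D)
classes = unique(label);
HD = 0;
for i = 1:length(classes)
    p = sum(label == classes(i)) / length(label);
    HD = HD - p * log2(p);
end
% conditional entropy H(D|A): entropy of the labels under each feature value,
% weighted by how often that value occurs
values = unique(feature);
HDA = 0;
for i = 1:length(values)
    idx = find(feature == values(i));
    sub = label(idx);
    subClasses = unique(sub);
    h = 0;
    for j = 1:length(subClasses)
        p = sum(sub == subClasses(j)) / length(sub);
        h = h - p * log2(p);
    end
    HDA = HDA + length(idx) / length(label) * h;
end
gain = HD - HDA              % here HD = 1 and HDA = 0, so the gain is 1, the largest possible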
Let's look at a decision tree algorithm written in MATLAB to help understand this.
The tree stops growing when:
1. the feature set is empty;
2. the node is pure (all samples at the node have the same label);
3. the information gain (or gain ratio) is below the threshold.
Main function:
clear; clc;
% outlookType = struct('Sunny',1,'Rainy',2,'Overcast',3);
% temperatureType = struct('hot',1,'warm',2,'cool',3);
% humidityType = struct('high',1,'norm',2);
% windyType = {'True',1,'False',0};
% playGolf = {'Yes',1,'No',0};
% data = struct('Outlook',[],'Temperature',[],'Humidity',[],'Windy',[],'PlayGolf',[]);
outlook = [1,1,3,2,2,2,3,1,1,2,1,3,3,2]';
temperature = [1,1,1,2,3,3,3,2,3,3,2,2,1,2]';
humidity = [1,1,1,1,2,2,2,1,2,2,2,1,2,1]';
windy = [0,1,0,0,0,1,1,0,0,0,1,1,0,1]';
data = [outlook temperature humidity windy];
playGolf = [0,0,1,1,1,0,1,0,1,1,1,1,1,0]';
propertyName = {'Outlook','Temperature','Humidity','Windy'};
delta = 0.1;   % minimum information gain needed to keep splitting
decisionTreeModel = DecisionTree(data, playGolf, propertyName, delta);
The main model-building function:
function decisionTreeModel = DecisionTree(data, label, propertyName, delta)
global Node;
Node = struct('fatherNodeName',[], 'EdgeProperty',[], 'NodeName',[]);
BuildTree('root', 'Stem', data, label, propertyName, delta);
Node(1) = [];   % drop the empty placeholder node
model.Node = Node;
decisionTreeModel = model;
Recursively building the tree:
function BuildTree(fatherNodeName, edge, data, label, propertyName, delta)
global Node;
sonNode = struct('fatherNodeName',[], 'EdgeProperty',[], 'NodeName',[]);
sonNode.fatherNodeName = fatherNodeName;
sonNode.EdgeProperty = edge;
% Termination condition 2: the node is pure, so it becomes a leaf
if length(unique(label)) == 1
    sonNode.NodeName = label(1);
    Node = [Node sonNode];
    return;
end
% Termination condition 1: no features left, so label the leaf by majority vote
if length(propertyName) < 1
    labelSet = unique(label);
    k = length(labelSet);
    labelNum = zeros(k,1);
    for i = 1:k
        labelNum(i) = length(find(label == labelSet(i)));
    end
    [~, labelIndex] = max(labelNum);
    sonNode.NodeName = labelSet(labelIndex);
    Node = [Node sonNode];
    return;
end
% Otherwise pick the feature with the largest information gain and branch on its values
[sonIndex, buildNode] = CalcuteNode(data, label, delta);
if buildNode
    dataRowIndex = setdiff(1:length(propertyName), sonIndex);
    sonNode.NodeName = propertyName{sonIndex};
    Node = [Node sonNode];
    propertyName(sonIndex) = [];
    sonData = data(:, sonIndex);
    sonEdge = unique(sonData);
    for i = 1:length(sonEdge)
        edgeDataIndex = find(sonData == sonEdge(i));
        BuildTree(sonNode.NodeName, sonEdge(i), data(edgeDataIndex, dataRowIndex), ...
            label(edgeDataIndex, :), propertyName, delta);
    end
else
    % Termination condition 3: the best gain is below delta, so use a majority-vote leaf
    labelSet = unique(label);
    k = length(labelSet);
    labelNum = zeros(k,1);
    for i = 1:k
        labelNum(i) = length(find(label == labelSet(i)));
    end
    [~, labelIndex] = max(labelNum);
    sonNode.NodeName = labelSet(labelIndex);
    Node = [Node sonNode];
    return;
end
Selecting the feature for the next tree node:
function [nodeIndex, buildNode] = CalcuteNode(data, label, delta)
largeEntropy = CEntropy(label);
[m, n] = size(data);
EntropyGain = largeEntropy * ones(1, n);
buildNode = true;
for i = 1:n
    pData = data(:, i);
    itemList = unique(pData);
    for j = 1:length(itemList)
        itemIndex = find(pData == itemList(j));
        EntropyGain(i) = EntropyGain(i) - length(itemIndex)/m * CEntropy(label(itemIndex));
    end
    % Uncomment the next line to use the gain ratio; leave it commented for the plain gain
    % EntropyGain(i) = EntropyGain(i) / CEntropy(pData);
end
[maxGainEntropy, nodeIndex] = max(EntropyGain);
if maxGainEntropy < delta
    buildNode = false;
end
Calculating the entropy:
function result = CEntropy(propertyList)
result = 0;
totalLength = length(propertyList);
itemList = unique(propertyList);
pNum = length(itemList);
for i = 1:pNum
    itemLength = length(find(propertyList == itemList(i)));
    pItem = itemLength / totalLength;
    result = result - pItem * log2(pItem);
end
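As a quick sanity check (my addition), a balanced binary label vector has entropy of exactly 1 bit and a pure vector has entropy 0:
CEntropy([0;0;1;1])   % returns 1
CEntropy([1;1;1;1])   % returns 0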
The output data structure is an array of nodes, each of the form
struct('fatherNodeName',[], 'EdgeProperty',[], 'NodeName',[])
Because MATLAB does not offer convenient pointer/object-oriented facilities for this, the model is stored as a flat array of these node structs; in Python, Java, or C#, writing an explicit tree structure would be more convenient.
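For reference, here is a small sketch (my addition, not from the original code) that prints the flat Node array as parent / edge / child triples, which is enough to read the learned tree off the structure above:
for i = 1:length(decisionTreeModel.Node)
    n = decisionTreeModel.Node(i);
    fprintf('%s --[%s]--> %s\n', num2str(n.fatherNodeName), ...
        num2str(n.EdgeProperty), num2str(n.NodeName));
end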