For the general decision tree induction framework, see the previous post: http://blog.csdn.net/zhyoulun/article/details/41978381
The principle of the ID3 attribute selection measure
ID3 uses information gain as its attribute selection measure. This measure is based on Shannon's pioneering work in information theory, which studied the value or "information content" of messages. Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N. This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or "impurity" in those partitions. Such an approach minimizes the expected number of tests needed to classify a given tuple and guarantees that a simple (but not necessarily the simplest) tree is found.
The expected information needed to classify a tuple in D is given by:

Info(D) = -sum_{i=1..m} p_i * log2(p_i)
Here p_i is the nonzero probability that an arbitrary tuple in D belongs to class C_i. A base-2 logarithmic function is used because the information is encoded in bits. Info(D) is the average amount of information needed to identify the class label of a tuple in D. Note that at this point the only information we have is the proportion of tuples in each class.
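As a quick check against the data set used later in this post (9 of the 14 tuples have buys_computer = yes, 5 have no), Info(D) can be computed directly. The InfoDemo class below is a standalone illustration, not part of the project code:

```java
public class InfoDemo {
	// Entropy (in bits) of a class distribution given as per-class counts
	static double info(int... counts) {
		int total = 0;
		for (int c : counts)
			total += c;
		double result = 0;
		for (int c : counts) {
			if (c == 0)
				continue; // 0 * log2(0) is taken to be 0
			double p = 1.0 * c / total;
			result += -p * Math.log(p) / Math.log(2);
		}
		return result;
	}

	public static void main(String[] args) {
		// 9 "yes" tuples and 5 "no" tuples, as in the data set below
		System.out.printf("Info(D) = %.3f%n", info(9, 5)); // prints Info(D) = 0.940
	}
}
```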
Now suppose we are to partition the tuples in D on some attribute A having v distinct values {a1, a2, ..., av}, as observed from the training data. Attribute A can be used to split D into v partitions or subsets {D1, D2, ..., Dv}, where Dj contains those tuples in D whose value of A is aj. These partitions correspond to the branches grown from node N. Ideally, we would like this partitioning to produce an exact classification of the tuples; that is, we would like each partition to be pure (in practice a partition is usually impure, containing tuples from more than one class). How much more information do we still need after this partitioning in order to arrive at an exact classification? This amount is measured by:

Info_A(D) = sum_{j=1..v} (|Dj| / |D|) * Info(Dj)
The term |Dj|/|D| acts as the weight of the j-th partition. Info_A(D) is the expected information required to classify a tuple of D based on the partitioning by A. The smaller this expected information, the greater the purity of the partitions.
Information gain is defined as the difference between the original information requirement (based only on the class proportions) and the new requirement (obtained after partitioning on A). That is,

Gain(A) = Info(D) - Info_A(D)
In other words, Gain(A) tells us how much would be gained by branching on A: it is the expected reduction in the information requirement caused by knowing the value of A. The attribute A with the highest information gain Gain(A) is chosen as the splitting attribute of node N.
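To make the selection concrete, here is a standalone sketch (again not part of the project code) that reproduces the classic computation for the data set below: age partitions the 14 tuples into youth (2 yes, 3 no), middle_aged (4 yes, 0 no), and senior (3 yes, 2 no), and its gain turns out to be the highest of the four attributes, which is why age becomes the root of the tree:

```java
public class GainDemo {
	// Entropy (in bits) of a class distribution given as per-class counts
	static double info(int... counts) {
		int total = 0;
		for (int c : counts)
			total += c;
		double result = 0;
		for (int c : counts) {
			if (c == 0)
				continue; // 0 * log2(0) is taken to be 0
			double p = 1.0 * c / total;
			result += -p * Math.log(p) / Math.log(2);
		}
		return result;
	}

	static double gainAge() {
		double infoD = info(9, 5); // 9 yes / 5 no over all 14 tuples, ≈ 0.940
		// age splits D into youth (2 yes, 3 no), middle_aged (4 yes, 0 no), senior (3 yes, 2 no)
		double infoAge = 5.0 / 14 * info(2, 3)
				+ 4.0 / 14 * info(4, 0)
				+ 5.0 / 14 * info(3, 2); // ≈ 0.694
		return infoD - infoAge;
	}

	public static void main(String[] args) {
		System.out.printf("Gain(age) = %.3f%n", gainAge()); // ≈ 0.247
	}
}
```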
Here is an example.
Data
data.txt
youth,high,no,fair,no
youth,high,no,excellent,no
middle_aged,high,no,fair,yes
senior,medium,no,fair,yes
senior,low,yes,fair,yes
senior,low,yes,excellent,no
middle_aged,low,yes,excellent,yes
youth,medium,no,fair,no
youth,low,yes,fair,yes
senior,medium,yes,fair,yes
youth,medium,yes,excellent,yes
middle_aged,medium,no,excellent,yes
middle_aged,high,yes,fair,yes
senior,medium,no,excellent,no
attr.txt
age,income,student,credit_rating,buys_computer
Program output
age(1:youth; 2:middle_aged; 3:senior; )
	credit_rating(1:fair; 2:excellent; )
		Leaf:no()
		Leaf:yes()
	Leaf:yes()
	student(1:no; 2:yes; )
		Leaf:no()
		Leaf:yes()
Finally, the Java code:
DecisionTree.java
package com.zhyoulun.decision;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Map;

/**
 * Reads the training data and recursively generates the decision tree.
 *
 * @author zhyoulun
 */
public class DecisionTree {
	private ArrayList<ArrayList<String>> allDatas;
	private ArrayList<String> allAttributes;

	/**
	 * Read all relevant data from the files.
	 *
	 * @param dataFilePath path to the training tuples, one comma-separated tuple per line
	 * @param attrFilePath path to the attribute names, one comma-separated line
	 */
	public DecisionTree(String dataFilePath, String attrFilePath) {
		super();
		try {
			this.allDatas = new ArrayList<>();
			this.allAttributes = new ArrayList<>();

			InputStreamReader inputStreamReader = new InputStreamReader(new FileInputStream(new File(dataFilePath)));
			BufferedReader bufferedReader = new BufferedReader(inputStreamReader);
			String line = null;
			while ((line = bufferedReader.readLine()) != null) {
				String[] strings = line.split(",");
				ArrayList<String> data = new ArrayList<>();
				for (int i = 0; i < strings.length; i++)
					data.add(strings[i]);
				this.allDatas.add(data);
			}
			bufferedReader.close();

			inputStreamReader = new InputStreamReader(new FileInputStream(new File(attrFilePath)));
			bufferedReader = new BufferedReader(inputStreamReader);
			while ((line = bufferedReader.readLine()) != null) {
				String[] strings = line.split(",");
				for (int i = 0; i < strings.length; i++)
					this.allAttributes.add(strings[i]);
			}
			bufferedReader.close();
		} catch (FileNotFoundException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}

	public DecisionTree(ArrayList<ArrayList<String>> allDatas, ArrayList<String> allAttributes) {
		super();
		this.allDatas = allDatas;
		this.allAttributes = allAttributes;
	}

	public ArrayList<ArrayList<String>> getAllDatas() { return allDatas; }

	public void setAllDatas(ArrayList<ArrayList<String>> allDatas) { this.allDatas = allDatas; }

	public ArrayList<String> getAllAttributes() { return allAttributes; }

	public void setAllAttributes(ArrayList<String> allAttributes) { this.allAttributes = allAttributes; }

	/**
	 * Recursively generate the decision tree.
	 */
	public static TreeNode generateDecisionTree(ArrayList<ArrayList<String>> datas, ArrayList<String> attrs) {
		TreeNode treeNode = new TreeNode();
		// If the tuples in D all belong to the same class C, return a leaf labeled C
		if (isInTheSameClass(datas)) {
			treeNode.setName("Leaf:" + datas.get(0).get(datas.get(0).size() - 1));
			return treeNode;
		}
		// If attrs is empty (this should not normally happen, since we build the
		// tree starting from the full candidate attribute set)
		if (attrs.size() == 0)
			return treeNode;
		CriterionID3 criterionID3 = new CriterionID3(datas, attrs);
		int splittingCriterionIndex = criterionID3.attributeSelectionMethod();
		treeNode.setName(attrs.get(splittingCriterionIndex));
		treeNode.setRules(getValueSet(datas, splittingCriterionIndex));
		attrs.remove(splittingCriterionIndex);
		Map<String, ArrayList<ArrayList<String>>> subDatasMap = criterionID3.getSubDatasMap(splittingCriterionIndex);
		for (String key : subDatasMap.keySet()) {
			ArrayList<TreeNode> treeNodes = treeNode.getChildren();
			treeNodes.add(generateDecisionTree(subDatasMap.get(key), attrs));
			treeNode.setChildren(treeNodes);
		}
		return treeNode;
	}

	/**
	 * Gets the set of distinct values in column index of datas.
	 */
	public static ArrayList<String> getValueSet(ArrayList<ArrayList<String>> datas, int index) {
		ArrayList<String> values = new ArrayList<String>();
		String r = "";
		for (int i = 0; i < datas.size(); i++) {
			r = datas.get(i).get(index);
			if (!values.contains(r))
				values.add(r);
		}
		return values;
	}

	/**
	 * The last column is the class label; check whether it is the same for all tuples.
	 */
	private static boolean isInTheSameClass(ArrayList<ArrayList<String>> datas) {
		String flag = datas.get(0).get(datas.get(0).size() - 1); // row 0, last column, as the initial value
		for (int i = 0; i < datas.size(); i++) {
			if (!datas.get(i).get(datas.get(i).size() - 1).equals(flag))
				return false;
		}
		return true;
	}

	public static void main(String[] args) {
		String dataPath = "files/data.txt";
		String attrPath = "files/attr.txt";
		// Initialize the raw data
		DecisionTree decisionTree = new DecisionTree(dataPath, attrPath);
		// Generate the decision tree
		TreeNode treeNode = generateDecisionTree(decisionTree.getAllDatas(), decisionTree.getAllAttributes());
		print(treeNode, 0);
	}

	private static void print(TreeNode treeNode, int level) {
		for (int i = 0; i < level; i++)
			System.out.print("\t");
		System.out.print(treeNode.getName());
		System.out.print("(");
		for (int i = 0; i < treeNode.getRules().size(); i++)
			System.out.print((i + 1) + ":" + treeNode.getRules().get(i) + "; ");
		System.out.println(")");
		ArrayList<TreeNode> treeNodes = treeNode.getChildren();
		for (int i = 0; i < treeNodes.size(); i++)
			print(treeNodes.get(i), level + 1);
	}
}
CriterionID3.java
package com.zhyoulun.decision;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

/**
 * ID3: selects the splitting criterion by information gain.
 *
 * @author zhyoulun
 */
public class CriterionID3 {
	private ArrayList<ArrayList<String>> datas;
	private ArrayList<String> attributes;
	private Map<String, ArrayList<ArrayList<String>>> subDatasMap;

	public CriterionID3(ArrayList<ArrayList<String>> datas, ArrayList<String> attributes) {
		super();
		this.datas = datas;
		this.attributes = attributes;
	}

	/**
	 * Calculates the information gain of every candidate attribute and returns
	 * the index of the one with the largest gain as the splitting attribute.
	 */
	public int attributeSelectionMethod() {
		double gain = -1.0;
		int maxIndex = 0;
		for (int i = 0; i < this.attributes.size() - 1; i++) { // the last column is the class label
			double tempGain = this.calcGain(i);
			if (tempGain > gain) {
				gain = tempGain;
				maxIndex = i;
			}
		}
		return maxIndex;
	}

	/**
	 * Computes Gain(A) = Info(D) - Info_A(D), e.g. Gain(age) = Info(D) - Info_age(D).
	 *
	 * @param index the column index of attribute A
	 */
	private double calcGain(int index) {
		double result = 0;
		// Compute Info(D)
		int lastIndex = datas.get(0).size() - 1;
		ArrayList<String> valueSet = DecisionTree.getValueSet(this.datas, lastIndex);
		for (String value : valueSet) {
			int count = 0;
			for (int i = 0; i < datas.size(); i++) {
				if (datas.get(i).get(lastIndex).equals(value))
					count++;
			}
			result += -(1.0 * count / datas.size()) * Math.log(1.0 * count / datas.size()) / Math.log(2);
		}
		// Compute Info_A(D) and subtract it
		valueSet = DecisionTree.getValueSet(this.datas, index);
		for (String value : valueSet) {
			ArrayList<ArrayList<String>> subDatas = new ArrayList<>();
			for (int i = 0; i < datas.size(); i++) {
				if (datas.get(i).get(index).equals(value))
					subDatas.add(datas.get(i));
			}
			ArrayList<String> subValueSet = DecisionTree.getValueSet(subDatas, lastIndex);
			for (String subValue : subValueSet) {
				int count = 0;
				for (int i = 0; i < subDatas.size(); i++) {
					if (subDatas.get(i).get(lastIndex).equals(subValue))
						count++;
				}
				result += -1.0 * subDatas.size() / datas.size()
						* (-(1.0 * count / subDatas.size()) * Math.log(1.0 * count / subDatas.size()) / Math.log(2));
			}
		}
		return result;
	}

	public ArrayList<ArrayList<String>> getDatas() { return datas; }

	public void setDatas(ArrayList<ArrayList<String>> datas) { this.datas = datas; }

	public ArrayList<String> getAttributes() { return attributes; }

	public void setAttributes(ArrayList<String> attributes) { this.attributes = attributes; }

	/**
	 * Partitions datas on the values of column index; the splitting column
	 * itself is removed from every tuple of each sub-partition.
	 */
	public Map<String, ArrayList<ArrayList<String>>> getSubDatasMap(int index) {
		ArrayList<String> valueSet = DecisionTree.getValueSet(this.datas, index);
		this.subDatasMap = new HashMap<String, ArrayList<ArrayList<String>>>();
		for (String value : valueSet) {
			ArrayList<ArrayList<String>> subDatas = new ArrayList<>();
			for (int i = 0; i < this.datas.size(); i++) {
				if (this.datas.get(i).get(index).equals(value))
					subDatas.add(this.datas.get(i));
			}
			for (int i = 0; i < subDatas.size(); i++)
				subDatas.get(i).remove(index);
			this.subDatasMap.put(value, subDatas);
		}
		return subDatasMap;
	}

	public void setSubDatasMap(Map<String, ArrayList<ArrayList<String>>> subDatasMap) { this.subDatasMap = subDatasMap; }
}
TreeNode.java
package com.zhyoulun.decision;

import java.util.ArrayList;

public class TreeNode {
	private String name; // name of the node (the splitting attribute, or the class label for a leaf)
	private ArrayList<String> rules; // splitting rules of the node (all attribute values are assumed discrete)
	private ArrayList<TreeNode> children; // child nodes

	public TreeNode() {
		this.name = "";
		this.rules = new ArrayList<String>();
		this.children = new ArrayList<TreeNode>();
	}

	public String getName() { return name; }

	public void setName(String name) { this.name = name; }

	public ArrayList<String> getRules() { return rules; }

	public void setRules(ArrayList<String> rules) { this.rules = rules; }

	public ArrayList<TreeNode> getChildren() { return children; }

	public void setChildren(ArrayList<TreeNode> children) { this.children = children; }
}
Reference: Data Mining: Concepts and Techniques (3rd edition)
Please credit the source when reposting.