Decision Tree Induction (ID3 Attribute Selection Measure): Java Implementation


For the general decision tree induction framework, see the previous blog post: http://blog.csdn.net/zhyoulun/article/details/41978381


Principle of the ID3 attribute selection measure

ID3 uses information gain as its attribute selection measure. This measure is based on Shannon's pioneering work in information theory, which studied the value or "information content" of messages. Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N. This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or "impurity" in those partitions. Such an approach minimizes the expected number of tests needed to classify a given tuple and guarantees that a simple (but not necessarily the simplest) tree is found.
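Stated compactly (the symbol A* is introduced here only as shorthand for the chosen attribute): if Gain(A) denotes the information gain of a candidate attribute A, defined below, then the attribute selected at node N is

A^{*} = \arg\max_{A} \, Gain(A)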

The expected information needed to classify a tuple in D is given by the following formula:

Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
Here p_i is the nonzero probability that an arbitrary tuple in D belongs to class C_i, and m is the number of classes. A log function to base 2 is used because the information is encoded in bits. Info(D) is the average amount of information needed to identify the class label of a tuple in D. Note that, at this point, the only information available is the proportion of tuples belonging to each class.
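As a quick hand check against the 14-tuple example dataset listed below (9 tuples with buys_computer = yes and 5 with buys_computer = no):

Info(D) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.940\ \text{bits}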

Now suppose we are to partition the tuples in D on some attribute A having v distinct values {a_1, a_2, ..., a_v}, as observed from the training data. Attribute A can be used to split D into v partitions or subsets {D_1, D_2, ..., D_v}, where D_j contains those tuples in D that have value a_j for A. These partitions correspond to the branches grown from node N. Ideally, we would like this partitioning to produce an exact classification of the tuples; that is, we would like each partition to be pure (in practice, however, partitions are mostly impure, e.g. a partition may contain tuples from several different classes). How much more information do we still need, after this partitioning, in order to arrive at an exact classification? This amount is measured by:

Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
The term |D_j|/|D| acts as the weight of the j-th partition. Info_A(D) is the expected information required to classify a tuple of D based on the partitioning by A. The smaller the expected information still required, the greater the purity of the partitions.
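Continuing the hand check with the attribute age on the same example data (youth: 2 yes / 3 no; middle_aged: 4 yes / 0 no; senior: 3 yes / 2 no):

Info_{age}(D) = \frac{5}{14}\Big(-\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5}\Big) + \frac{4}{14}\cdot 0 + \frac{5}{14}\Big(-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}\Big) \approx 0.694\ \text{bits}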

Information gain is defined as the difference between the original information requirement (based only on the proportion of classes) and the new requirement (obtained after partitioning on A). That is,

Gain(A) = Info(D) - Info_A(D)
In other words, Gain(A) tells us how much would be gained by branching on A: it is the expected reduction in the information requirement caused by knowing the value of A. The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute of node N.
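On the example data this gives

Gain(age) = Info(D) - Info_{age}(D) = 0.940 - 0.694 = 0.246\ \text{bits}

Working through the remaining attributes the same way yields Gain(income) ≈ 0.029 bits, Gain(student) ≈ 0.151 bits, and Gain(credit_rating) ≈ 0.048 bits. Age therefore has the highest information gain and is chosen as the splitting attribute at the root, which agrees with the program output shown below.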


Here is an example.


Data

data.txt

youth,high,no,fair,no
youth,high,no,excellent,no
middle_aged,high,no,fair,yes
senior,medium,no,fair,yes
senior,low,yes,fair,yes
senior,low,yes,excellent,no
middle_aged,low,yes,excellent,yes
youth,medium,no,fair,no
youth,low,yes,fair,yes
senior,medium,yes,fair,yes
youth,medium,yes,excellent,yes
middle_aged,medium,no,excellent,yes
middle_aged,high,yes,fair,yes
senior,medium,no,excellent,no


attr.txt

age,income,student,credit_rating,buys_computer


Program output

age (1:youth; 2:middle_aged; 3:senior;)
	credit_rating (1:fair; 2:excellent;)
		Leaf:no ()
		Leaf:yes ()
	Leaf:yes ()
	student (1:no; 2:yes;)
		Leaf:no ()
		Leaf:yes ()



Finally, the full Java code is attached below.

DecisionTree.java

package com.zhyoulun.decision;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Map;

/**
 * Responsible for reading the data and attribute files and generating the decision tree.
 *
 * @author zhyoulun
 */
public class DecisionTree
{
    private ArrayList<ArrayList<String>> allDatas;
    private ArrayList<String> allAttributes;

    /**
     * Read all relevant data from the files.
     * @param dataFilePath
     * @param attrFilePath
     */
    public DecisionTree(String dataFilePath, String attrFilePath)
    {
        super();
        try
        {
            this.allDatas = new ArrayList<>();
            this.allAttributes = new ArrayList<>();

            // Read the training tuples, one comma-separated tuple per line
            InputStreamReader inputStreamReader = new InputStreamReader(new FileInputStream(new File(dataFilePath)));
            BufferedReader bufferedReader = new BufferedReader(inputStreamReader);
            String line = null;
            while ((line = bufferedReader.readLine()) != null)
            {
                String[] strings = line.split(",");
                ArrayList<String> data = new ArrayList<>();
                for (int i = 0; i < strings.length; i++)
                    data.add(strings[i]);
                this.allDatas.add(data);
            }

            // Read the attribute names (the last one is the class label)
            inputStreamReader = new InputStreamReader(new FileInputStream(new File(attrFilePath)));
            bufferedReader = new BufferedReader(inputStreamReader);
            while ((line = bufferedReader.readLine()) != null)
            {
                String[] strings = line.split(",");
                for (int i = 0; i < strings.length; i++)
                    this.allAttributes.add(strings[i]);
            }

            inputStreamReader.close();
            bufferedReader.close();
        }
        catch (FileNotFoundException e)
        {
            e.printStackTrace();
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }

    /**
     * @param allDatas
     * @param allAttributes
     */
    public DecisionTree(ArrayList<ArrayList<String>> allDatas, ArrayList<String> allAttributes)
    {
        super();
        this.allDatas = allDatas;
        this.allAttributes = allAttributes;
    }

    public ArrayList<ArrayList<String>> getAllDatas()
    {
        return allDatas;
    }

    public void setAllDatas(ArrayList<ArrayList<String>> allDatas)
    {
        this.allDatas = allDatas;
    }

    public ArrayList<String> getAllAttributes()
    {
        return allAttributes;
    }

    public void setAllAttributes(ArrayList<String> allAttributes)
    {
        this.allAttributes = allAttributes;
    }

    /**
     * Recursively generate the decision tree.
     * @return
     */
    public static TreeNode generateDecisionTree(ArrayList<ArrayList<String>> datas, ArrayList<String> attrs)
    {
        TreeNode treeNode = new TreeNode();

        // If all the tuples in D belong to the same class C, return a leaf labeled with C
        if (isInTheSameClass(datas))
        {
            treeNode.setName(datas.get(0).get(datas.get(0).size() - 1));
            return treeNode;
        }

        // If the attribute list is empty, return the node as-is
        // (this generally does not occur, since the tree is built over the full candidate attribute set)
        if (attrs.size() == 0)
            return treeNode;

        // Choose the splitting attribute with the ID3 criterion (highest information gain)
        CriterionID3 criterionID3 = new CriterionID3(datas, attrs);
        int splitingCriterionIndex = criterionID3.attributeSelectionMethod();
        treeNode.setName(attrs.get(splitingCriterionIndex));
        treeNode.setRules(getValueSet(datas, splitingCriterionIndex));
        attrs.remove(splitingCriterionIndex);

        // Partition the data on the chosen attribute and recurse on each subset
        Map<String, ArrayList<ArrayList<String>>> subDatasMap = criterionID3.getSubDatasMap(splitingCriterionIndex);
        for (String key : subDatasMap.keySet())
        {
            ArrayList<TreeNode> treeNodes = treeNode.getChildren();
            treeNodes.add(generateDecisionTree(subDatasMap.get(key), attrs));
            treeNode.setChildren(treeNodes);
        }

        return treeNode;
    }

    /**
     * Get the set of distinct values of the index-th column in datas.
     * @param datas
     * @param index
     * @return
     */
    public static ArrayList<String> getValueSet(ArrayList<ArrayList<String>> datas, int index)
    {
        ArrayList<String> values = new ArrayList<String>();
        String r = "";
        for (int i = 0; i < datas.size(); i++)
        {
            r = datas.get(i).get(index);
            if (!values.contains(r))
            {
                values.add(r);
            }
        }
        return values;
    }

    /**
     * The last column is the class label; check whether it is the same for all tuples.
     * @param datas
     * @return
     */
    private static boolean isInTheSameClass(ArrayList<ArrayList<String>> datas)
    {
        String flag = datas.get(0).get(datas.get(0).size() - 1); // initial value: last column of row 0
        for (int i = 0; i < datas.size(); i++)
        {
            if (!datas.get(i).get(datas.get(i).size() - 1).equals(flag))
                return false;
        }
        return true;
    }

    public static void main(String[] args)
    {
        String dataPath = "files/data.txt";
        String attrPath = "files/attr.txt";
        // Load the raw data
        DecisionTree decisionTree = new DecisionTree(dataPath, attrPath);
        // Build and print the decision tree
        TreeNode treeNode = generateDecisionTree(decisionTree.getAllDatas(), decisionTree.getAllAttributes());
        print(treeNode, 0);
    }

    private static void print(TreeNode treeNode, int level)
    {
        for (int i = 0; i < level; i++)
            System.out.print("\t");
        System.out.print(treeNode.getName());
        System.out.print("(");
        for (int i = 0; i < treeNode.getRules().size(); i++)
            System.out.print((i + 1) + ":" + treeNode.getRules().get(i) + "; ");
        System.out.println(")");
        ArrayList<TreeNode> treeNodes = treeNode.getChildren();
        for (int i = 0; i < treeNodes.size(); i++)
        {
            print(treeNodes.get(i), level + 1);
        }
    }
}



CriterionID3.java

package com.zhyoulun.decision;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

/**
 * ID3: selects the splitting criterion (the attribute with the highest information gain).
 *
 * @author zhyoulun
 */
public class CriterionID3
{
    private ArrayList<ArrayList<String>> datas;
    private ArrayList<String> attributes;
    private Map<String, ArrayList<ArrayList<String>>> subDatasMap;

    /**
     * Compute the information gain of every candidate attribute and
     * return the index of the one with the largest gain, to be used as the splitting attribute.
     * @return
     */
    public int attributeSelectionMethod()
    {
        double gain = -1.0;
        int maxIndex = 0;
        for (int i = 0; i < this.attributes.size() - 1; i++)
        {
            double tempGain = this.calcGain(i);
            if (tempGain > gain)
            {
                gain = tempGain;
                maxIndex = i;
            }
        }
        return maxIndex;
    }

    /**
     * Compute Gain(A) = Info(D) - Info_A(D), e.g. Gain(age).
     * @param index
     * @return
     */
    private double calcGain(int index)
    {
        double result = 0;

        // Compute Info(D) over the class column (the last column)
        int lastIndex = datas.get(0).size() - 1;
        ArrayList<String> valueSet = DecisionTree.getValueSet(this.datas, lastIndex);
        for (String value : valueSet)
        {
            int count = 0;
            for (int i = 0; i < datas.size(); i++)
            {
                if (datas.get(i).get(lastIndex).equals(value))
                    count++;
            }
            result += -(1.0 * count / datas.size()) * Math.log(1.0 * count / datas.size()) / Math.log(2);
        }

        // Compute Info_A(D): the expected information after partitioning on attribute index
        valueSet = DecisionTree.getValueSet(this.datas, index);
        for (String value : valueSet)
        {
            ArrayList<ArrayList<String>> subDatas = new ArrayList<>();
            for (int i = 0; i < datas.size(); i++)
            {
                if (datas.get(i).get(index).equals(value))
                    subDatas.add(datas.get(i));
            }
            ArrayList<String> subValueSet = DecisionTree.getValueSet(subDatas, lastIndex);
            for (String subValue : subValueSet)
            {
                int count = 0;
                for (int i = 0; i < subDatas.size(); i++)
                {
                    if (subDatas.get(i).get(lastIndex).equals(subValue))
                        count++;
                }
                result += -1.0 * subDatas.size() / datas.size()
                        * (-(1.0 * count / subDatas.size()) * Math.log(1.0 * count / subDatas.size()) / Math.log(2));
            }
        }

        return result;
    }

    public CriterionID3(ArrayList<ArrayList<String>> datas, ArrayList<String> attributes)
    {
        super();
        this.datas = datas;
        this.attributes = attributes;
    }

    public ArrayList<ArrayList<String>> getDatas()
    {
        return datas;
    }

    public void setDatas(ArrayList<ArrayList<String>> datas)
    {
        this.datas = datas;
    }

    public ArrayList<String> getAttributes()
    {
        return attributes;
    }

    public void setAttributes(ArrayList<String> attributes)
    {
        this.attributes = attributes;
    }

    /**
     * Partition the data into subsets keyed by the values of the index-th attribute,
     * removing that attribute column from each tuple of the subsets.
     */
    public Map<String, ArrayList<ArrayList<String>>> getSubDatasMap(int index)
    {
        ArrayList<String> valueSet = DecisionTree.getValueSet(this.datas, index);
        this.subDatasMap = new HashMap<String, ArrayList<ArrayList<String>>>();
        for (String value : valueSet)
        {
            ArrayList<ArrayList<String>> subDatas = new ArrayList<>();
            for (int i = 0; i < this.datas.size(); i++)
            {
                if (this.datas.get(i).get(index).equals(value))
                    subDatas.add(this.datas.get(i));
            }
            for (int i = 0; i < subDatas.size(); i++)
            {
                subDatas.get(i).remove(index);
            }
            this.subDatasMap.put(value, subDatas);
        }
        return subDatasMap;
    }

    public void setSubDatasMap(Map<String, ArrayList<ArrayList<String>>> subDatasMap)
    {
        this.subDatasMap = subDatasMap;
    }
}



TreeNode.java

package com.zhyoulun.decision;

import java.util.ArrayList;

public class TreeNode
{
    private String name;                  // name of the node (the splitting attribute, or the class label for a leaf)
    private ArrayList<String> rules;      // splitting rules of the node (all attribute values are assumed discrete)
    private ArrayList<TreeNode> children; // child nodes

    public TreeNode()
    {
        this.name = "";
        this.rules = new ArrayList<String>();
        this.children = new ArrayList<TreeNode>();
    }

    public String getName()
    {
        return name;
    }

    public void setName(String name)
    {
        this.name = name;
    }

    public ArrayList<String> getRules()
    {
        return rules;
    }

    public void setRules(ArrayList<String> rules)
    {
        this.rules = rules;
    }

    public ArrayList<TreeNode> getChildren()
    {
        return children;
    }

    public void setChildren(ArrayList<TreeNode> children)
    {
        this.children = children;
    }
}



Reference: Han, Kamber, and Pei, "Data Mining: Concepts and Techniques" (3rd edition).

Please credit the source when reprinting: "Decision Tree Induction (ID3 Attribute Selection Measure): Java Implementation".
