Decision tree classification algorithms: the ID3 and C4.5 series


I. Introduction

I originally set out to learn the C4.5 algorithm, but soon found that its core is the ID3 algorithm, so I went back and studied ID3 first; C4.5 is an improved version of it. What exactly is improved is described later in this article.

II. The ID3 algorithm

ID3 is a decision tree classification algorithm: by applying a series of rules, it organizes the data into a decision tree. The basis for the classification is entropy. In physics, entropy describes how disordered a system is; here it measures how impure a set of samples is with respect to their class labels. For a sample set $S$ in which a fraction $p_i$ of the samples belongs to class $i$, the formula is:

$$Entropy(S) = -\sum_{i} p_i \log_2 p_i$$

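For the two-class case used in this article, the entropy can be computed directly from the class counts. The following is a minimal sketch, not the article's own code; the class name `EntropySketch` and its method names are made up for illustration.

```java
class EntropySketch {
    /** Base-2 logarithm. */
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Entropy, in bits, of a two-class sample given its class counts. */
    static double entropy(int posNum, int negNum) {
        int total = posNum + negNum;
        // An empty or pure sample has zero entropy; this also avoids log(0).
        if (total == 0 || posNum == 0 || negNum == 0) {
            return 0.0;
        }
        double p = (double) posNum / total;
        double q = (double) negNum / total;
        return -p * log2(p) - q * log2(q);
    }

    public static void main(String[] args) {
        // The PlayTennis data in this article has 9 "Yes" rows and 5 "No" rows.
        System.out.println(entropy(9, 5));
        // An evenly mixed sample is maximally impure, a pure sample is not impure at all.
        System.out.println(entropy(7, 7));
        System.out.println(entropy(14, 0));
    }
}
```

With 9 positive and 5 negative rows the entropy is about 0.940 bits; an even 7/7 mix gives exactly 1 bit and a pure sample gives 0.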
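In the ID3 algorithm, information gain is used as the splitting criterion. For an attribute $A$, where $S_v$ is the subset of $S$ in which $A$ takes the value $v$, it is defined as:

$$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$$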

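As a worked example, the information gain of the OutLook and Wind attributes can be computed on the 14-row PlayTennis data used in this article. This is a small sketch under assumptions: the class `GainSketch`, its methods, and the array constants are illustrative names, and the data is stored as two parallel arrays rather than in the table layout of the main program.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GainSketch {
    // Attribute values and class labels of rows 1..14 of the PlayTennis data
    static final String[] OUTLOOK = {"Sunny", "Sunny", "Overcast", "Rainy", "Rainy",
        "Rainy", "Overcast", "Sunny", "Sunny", "Rainy", "Sunny", "Overcast",
        "Overcast", "Rainy"};
    static final String[] WIND = {"Weak", "Strong", "Weak", "Weak", "Weak", "Strong",
        "Strong", "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"};
    static final String[] PLAY = {"No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No",
        "Yes", "Yes", "Yes", "Yes", "Yes", "No"};

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Entropy of a list of class labels (any number of classes). */
    static double entropy(List<String> labels) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / labels.size();
            h -= p * log2(p);
        }
        return h;
    }

    /** Information gain obtained by grouping the labels by the parallel attribute values. */
    static double gain(String[] values, String[] labels) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        List<String> all = new ArrayList<>();
        for (int i = 0; i < values.length; i++) {
            groups.computeIfAbsent(values[i], k -> new ArrayList<>()).add(labels[i]);
            all.add(labels[i]);
        }
        // Weighted sum of the entropies of the subsets
        double remainder = 0.0;
        for (List<String> group : groups.values()) {
            remainder += (double) group.size() / all.size() * entropy(group);
        }
        return entropy(all) - remainder;
    }

    public static void main(String[] args) {
        System.out.println("Gain(OutLook) = " + gain(OUTLOOK, PLAY));
        System.out.println("Gain(Wind)    = " + gain(WIND, PLAY));
    }
}
```

The gain for OutLook comes out at roughly 0.247 and for Wind at roughly 0.048, which is why OutLook ends up at the root of the tree.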
At each step, the attribute with the maximum information gain is selected as the splitting attribute. I implemented a Java version of the ID3 algorithm. To make the data easy to change and experiment with, it is stored in a file, input.txt, which serves as the data source. The format is as follows:

Day OutLook Temperature Humidity Wind PlayTennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rainy Mild High Weak Yes
5 Rainy Cool Normal Weak Yes
6 Rainy Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rainy Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rainy Mild High Strong No

The PlayTennis attribute is the class label. The attributes in between, OutLook, Temperature, Humidity, and Wind, are the splitting attributes, and the first column, Day, serves as the row index. The program reads and classifies this source data, which stands in for a much larger data set. The following is the main class of the ID3 program. I encapsulated the algorithm so that only one method, which builds the decision tree, is exposed; the constructor simply takes the path of the data file:

package DataMining_ID3;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

/**
 * ID3 algorithm implementation class
 *
 * @author lyq
 */
public class ID3Tool {
    // Class label values
    private final String YES = "Yes";
    private final String NO = "No";

    // Total number of attributes, i.e. the number of columns in the data source
    private int attrNum;
    private String filePath;
    // Initial source data, held as a two-dimensional string array that mimics a table
    private String[][] data;
    // Attribute names, taken from the header row of the data
    private String[] attrNames;
    // All distinct values of each attribute
    private HashMap<String, ArrayList<String>> attrValue;

    public ID3Tool(String filePath) {
        this.filePath = filePath;
        attrValue = new HashMap<>();
    }

    /**
     * Read the data from the file
     */
    private void readDataFile() {
        File file = new File(filePath);
        ArrayList<String[]> dataArray = new ArrayList<String[]>();

        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            String str;
            while ((str = in.readLine()) != null) {
                str = str.trim();
                if (str.isEmpty()) {
                    continue;
                }
                // Split on any run of whitespace so that doubled spaces are tolerated
                dataArray.add(str.split("\\s+"));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

        data = new String[dataArray.size()][];
        dataArray.toArray(data);
        attrNum = data[0].length;
        attrNames = data[0];

        /*
         * for (int i = 0; i < data.length; i++) {
         *     for (int j = 0; j < data[0].length; j++) {
         *         System.out.print(" " + data[i][j]);
         *     }
         *     System.out.print("\n");
         * }
         */
    }

    /**
     * Collect all distinct values of each attribute first; they are needed
     * later when the entropy of the subsets is computed
     */
    private void initAttrValue() {
        ArrayList<String> tempValues;

        // Scan column by column, from left to right
        for (int j = 1; j < attrNum; j++) {
            // Scan one column from top to bottom
            tempValues = new ArrayList<>();
            for (int i = 1; i < data.length; i++) {
                if (!tempValues.contains(data[i][j])) {
                    // Add the value if it has not been recorded for this attribute yet
                    tempValues.add(data[i][j]);
                }
            }

            // One column has been traversed; store its values in the attribute map
            attrValue.put(data[0][j], tempValues);
        }

        /*
         * for (Map.Entry entry : attrValue.entrySet()) {
         *     System.out.println("key:value " + entry.getKey() + ":" + entry.getValue());
         * }
         */
    }

    /**
     * Compute the entropy of the data, either for the whole node or for one subset
     *
     * @param remainData
     *            remaining data
     * @param attrName
     *            attribute to split on; used when computing the information gain
     * @param value
     *            attribute value that selects the subset
     * @param isParent
     *            true to compute the entropy of the whole node, false to compute
     *            the entropy of the subset selected by value
     */
    private double computeEntropy(String[][] remainData, String attrName,
            String value, boolean isParent) {
        // Total number of instances
        int total = 0;
        // Number of positive instances
        int posNum = 0;
        // Number of negative instances
        int negNum = 0;

        // Traverse the attributes column by column, from left to right
        for (int j = 1; j < attrNames.length; j++) {
            // Found the specified attribute
            if (attrName.equals(attrNames[j])) {
                for (int i = 1; i < remainData.length; i++) {
                    // For the parent node every row counts; for a subset only
                    // the rows with the given attribute value count
                    if (isParent
                            || (!isParent && remainData[i][j].equals(value))) {
                        if (remainData[i][attrNames.length - 1].equals(YES)) {
                            // This row is a positive instance
                            posNum++;
                        } else {
                            negNum++;
                        }
                    }
                }
            }
        }

        total = posNum + negNum;
        if (total == 0) {
            // No row has this value in the remaining data; without this check
            // the division below would produce NaN
            return 0;
        }

        double posProbobly = (double) posNum / total;
        double negProbobly = (double) negNum / total;

        if (posProbobly == 1 || posProbobly == 0) {
            // All rows belong to the same class, so the entropy is 0;
            // the formula below would otherwise evaluate log(0)
            return 0;
        }

        double entropyValue = -posProbobly * Math.log(posProbobly)
                / Math.log(2.0) - negProbobly * Math.log(negProbobly)
                / Math.log(2.0);

        // Return the computed entropy
        return entropyValue;
    }

    /**
     * Compute the information gain of an attribute
     *
     * @param remainData
     *            remaining data
     * @param value
     *            name of the attribute to split on
     * @return
     */
    private double computeGain(String[][] remainData, String value) {
        double gainValue = 0;
        // Entropy before the split, to be compared with the entropy after it
        double entropyOri = 0;
        // Weighted sum of the subset entropies
        double childEntropySum = 0;
        // Number of rows with one attribute value
        int childValueNum = 0;
        // Distinct values of the attribute
        ArrayList<String> attrTypes = attrValue.get(value);
        // Row count of each attribute value, used as its weight
        HashMap<String, Integer> ratioValues = new HashMap<>();

        for (int i = 0; i < attrTypes.size(); i++) {
            // Start every count at 0
            ratioValues.put(attrTypes.get(i), 0);
        }

        // Again traverse column by column, from left to right
        for (int j = 1; j < attrNames.length; j++) {
            // Check whether this is the column being split on
            if (value.equals(attrNames[j])) {
                for (int i = 1; i <= remainData.length - 1; i++) {
                    childValueNum = ratioValues.get(remainData[i][j]);
                    // Increase the count and store it again
                    childValueNum++;
                    ratioValues.put(remainData[i][j], childValueNum);
                }
            }
        }

        // Compute the entropy before the split
        entropyOri = computeEntropy(remainData, value, null, true);
        for (int i = 0; i < attrTypes.size(); i++) {
            double ratio = (double) ratioValues.get(attrTypes.get(i))
                    / (remainData.length - 1);
            childEntropySum += ratio
                    * computeEntropy(remainData, value, attrTypes.get(i), false);

            // System.out.println("ratio:value: " + ratio + " " +
            // computeEntropy(remainData, value, attrTypes.get(i), false));
        }

        // The difference between the two entropies is the information gain
        gainValue = entropyOri - childEntropySum;
        return gainValue;
    }

    /**
     * Compute the information gain ratio
     *
     * @param remainData
     *            remaining data
     * @param value
     *            name of the attribute to split on
     * @return
     */
    private double computeGainRatio(String[][] remainData, String value) {
        double gain = 0;
        double spiltInfo = 0;
        int childValueNum = 0;
        // Distinct values of the attribute
        ArrayList<String> attrTypes = attrValue.get(value);
        // Row count of each attribute value, used as its weight
        HashMap<String, Integer> ratioValues = new HashMap<>();

        for (int i = 0; i < attrTypes.size(); i++) {
            // Start every count at 0
            ratioValues.put(attrTypes.get(i), 0);
        }

        // Again traverse column by column, from left to right
        for (int j = 1; j < attrNames.length; j++) {
            // Check whether this is the column being split on
            if (value.equals(attrNames[j])) {
                for (int i = 1; i <= remainData.length - 1; i++) {
                    childValueNum = ratioValues.get(remainData[i][j]);
                    // Increase the count and store it again
                    childValueNum++;
                    ratioValues.put(remainData[i][j], childValueNum);
                }
            }
        }

        // Compute the information gain
        gain = computeGain(remainData, value);
        // Compute the split information, which measures how broadly and
        // evenly the attribute splits the data
        for (int i = 0; i < attrTypes.size(); i++) {
            double ratio = (double) ratioValues.get(attrTypes.get(i))
                    / (remainData.length - 1);
            if (ratio == 0) {
                // A value that does not occur contributes nothing; skipping it avoids log(0)
                continue;
            }
            spiltInfo += -ratio * Math.log(ratio) / Math.log(2.0);
        }

        if (spiltInfo == 0) {
            // All rows share one value, so the attribute cannot split the data
            return 0;
        }

        // Compute the information gain ratio
        return gain / spiltInfo;
    }

    /**
     * Build the decision tree from the source data
     */
    private void buildDecisionTree(AttrNode node, String parentAttrValue,
            String[][] remainData, ArrayList<String> remainAttr, boolean isID3) {
        node.setParentAttrValue(parentAttrValue);

        String attrName = "";
        double gainValue = 0;
        double tempValue = 0;

        // If only 1 attribute remains, return directly
        if (remainAttr.size() == 1) {
            System.out.println("attr null");
            return;
        }

        // Select, among the remaining attributes, the one with the maximum
        // information gain as the next splitting attribute
        for (int i = 0; i < remainAttr.size(); i++) {
            // Decide between the ID3 algorithm and the C4.5 algorithm
            if (isID3) {
                // ID3 uses the information gain and takes the largest value
                tempValue = computeGain(remainData, remainAttr.get(i));
            } else {
                // C4.5 improves on this by using the information gain ratio, which
                // overcomes the bias of information gain toward attributes with many values
                tempValue = computeGainRatio(remainData, remainAttr.get(i));
            }

            if (tempValue > gainValue) {
                gainValue = tempValue;
                attrName = remainAttr.get(i);
            }
        }

        node.setAttrName(attrName);
        ArrayList<String> valueTypes = attrValue.get(attrName);
        remainAttr.remove(attrName);

        AttrNode[] childNode = new AttrNode[valueTypes.size()];
        String[][] rData;
        for (int i = 0; i < valueTypes.size(); i++) {
            // Keep only the rows that have this attribute value
            rData = removeData(remainData, attrName, valueTypes.get(i));

            childNode[i] = new AttrNode();
            boolean sameClass = true;
            ArrayList<String> indexArray = new ArrayList<>();
            for (int k = 1; k < rData.length; k++) {
                indexArray.add(rData[k][0]);
                // Check whether all rows belong to the same class
                if (!rData[k][attrNames.length - 1]
                        .equals(rData[1][attrNames.length - 1])) {
                    // A single mismatch means the rows are not all of one class
                    sameClass = false;
                    break;
                }
            }

            if (!sameClass) {
                // Create a new attribute list; sharing the same object
                // reference would cause errors
                ArrayList<String> rAttr = new ArrayList<>();
                for (String str : remainAttr) {
                    rAttr.add(str);
                }

                buildDecisionTree(childNode[i], valueTypes.get(i), rData,
                        rAttr, isID3);
            } else {
                // All rows are of the same class, so this is a data (leaf) node
                childNode[i].setParentAttrValue(valueTypes.get(i));
                childNode[i].setChildDataIndex(indexArray);
            }
        }
        node.setChildAttrNode(childNode);
    }

    /**
     * After splitting on an attribute, keep only the matching rows
     *
     * @param srcData
     *            source data
     * @param attrName
     *            name of the attribute that was split on
     * @param valueType
     *            attribute value to keep
     */
    private String[][] removeData(String[][] srcData, String attrName,
            String valueType) {
        String[][] desDataArray;
        ArrayList<String[]> desData = new ArrayList<>();
        // Rows that are kept; the header row always comes first
        ArrayList<String[]> selectData = new ArrayList<>();
        selectData.add(attrNames);

        // Copy the array into a list, which is easier to work with
        for (int i = 0; i < srcData.length; i++) {
            desData.add(srcData[i]);
        }

        // Again search column by column, from left to right
        for (int j = 1; j < attrNames.length; j++) {
            if (attrNames[j].equals(attrName)) {
                for (int i = 1; i < desData.size(); i++) {
                    if (desData.get(i)[j].equals(valueType)) {
                        // This row matches, so it is kept; all other rows are dropped
                        selectData.add(desData.get(i));
                    }
                }
            }
        }

        desDataArray = new String[selectData.size()][];
        selectData.toArray(desDataArray);

        return desDataArray;
    }

    /**
     * Start building the decision tree
     *
     * @param isID3
     *            whether to build the decision tree with the ID3 algorithm
     */
    public void startBuildingTree(boolean isID3) {
        readDataFile();
        initAttrValue();

        ArrayList<String> remainAttr = new ArrayList<>();
        // Add every attribute except the first (index) column and the last (class label) column
        for (int i = 1; i < attrNames.length - 1; i++) {
            remainAttr.add(attrNames[i]);
        }

        AttrNode rootNode = new AttrNode();
        buildDecisionTree(rootNode, "", data, remainAttr, isID3);
        showDecisionTree(rootNode, 1);
    }

    /**
     * Display the decision tree
     *
     * @param node
     *            node to display
     * @param blankNum
     *            number of leading tabs, used to show the tree structure
     */
    private void showDecisionTree(AttrNode node, int blankNum) {
        System.out.println();
        for (int i = 0; i < blankNum; i++) {
            System.out.print("\t");
        }
        System.out.print("--");
        // Display the attribute value that leads to this node
        if (node.getParentAttrValue() != null
                && node.getParentAttrValue().length() > 0) {
            System.out.print(node.getParentAttrValue());
        } else {
            System.out.print("--");
        }
        System.out.print("--");

        if (node.getChildDataIndex() != null
                && node.getChildDataIndex().size() > 0) {
            String i = node.getChildDataIndex().get(0);
            System.out.print("Category: "
                    + data[Integer.parseInt(i)][attrNames.length - 1]);
            System.out.print("[");
            for (String index : node.getChildDataIndex()) {
                System.out.print(index + ", ");
            }
            System.out.print("]");
        } else {
            // Recursively display the child nodes
            System.out.print("\"" + node.getAttrName() + "\"");
            for (AttrNode childNode : node.getChildAttrNode()) {
                showDecisionTree(childNode, 2 * blankNum);
            }
        }
    }
}
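The listing above relies on an AttrNode class that is not shown in this article. The following is a hypothetical sketch of such a node, written only from the accessor calls that appear in the listing; the field layout is an assumption, and the author's actual class may differ. To be used with the listing it would have to be placed in the same package.

```java
import java.util.ArrayList;

/** Hypothetical tree node with only the accessors that the ID3 listing calls. */
class AttrNode {
    // Name of the attribute this node splits on (stays null for a leaf)
    private String attrName;
    // Value of the parent's splitting attribute that leads to this node
    private String parentAttrValue;
    // Child nodes, one per value of the splitting attribute
    private AttrNode[] childAttrNode;
    // Indices (first column of the data) of the rows held by a leaf
    private ArrayList<String> childDataIndex;

    public String getAttrName() {
        return attrName;
    }

    public void setAttrName(String attrName) {
        this.attrName = attrName;
    }

    public String getParentAttrValue() {
        return parentAttrValue;
    }

    public void setParentAttrValue(String parentAttrValue) {
        this.parentAttrValue = parentAttrValue;
    }

    public AttrNode[] getChildAttrNode() {
        return childAttrNode;
    }

    public void setChildAttrNode(AttrNode[] childAttrNode) {
        this.childAttrNode = childAttrNode;
    }

    public ArrayList<String> getChildDataIndex() {
        return childDataIndex;
    }

    public void setChildDataIndex(ArrayList<String> childDataIndex) {
        this.childDataIndex = childDataIndex;
    }
}
```

In this design an inner node carries an attribute name and children, while a leaf carries only the row indices, which is how the display method in the listing tells the two apart.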
The tool is invoked from a client class as follows:

/**
 * ID3 decision tree classification algorithm test scenario class
 *
 * @author lyq
 */
public class Client {
    public static void main(String[] args) {
        String filePath = "C:\\Users\\lyq\\Desktop\\icon\\input.txt";

        ID3Tool tool = new ID3Tool(filePath);
        tool.startBuildingTree(true);
    }
}

The final output is:

------"OutLook"
    --Sunny--"Humidity"
        --High--Category: No[1, 2, 8, ]
        --Normal--Category: Yes[9, 11, ]
    --Overcast--Category: Yes[3, 7, 12, 13, ]
    --Rainy--"Wind"
        --Weak--Category: Yes[4, 5, 10, ]
        --Strong--Category: No[6, 14, ]

Read the decision tree from left to right. A name in quotation marks is a splitting attribute, the xxx in --xxx-- is a value of that attribute, and each leaf node shows the class label followed by the indices of the rows it contains.

The corresponding classification result graph:

[Figure: diagram of the decision tree printed above]

Both building and displaying the decision tree use depth-first search (DFS), which may make the code hard to follow. I hope readers will take the time to work through it; stepping through the code in a debugger makes it much easier to understand.

III. The C4.5 algorithm

If you have understood the ID3 implementation above, C4.5 is easy to understand as well. The core of the two algorithms is the same, with one difference: C4.5 uses the information gain ratio as the splitting criterion, which overcomes the bias of ID3, caused by splitting on information gain, toward attributes with many distinct values. The formula for the information gain ratio is:

$$GainRatio(S, A) = \frac{Gain(S, A)}{SplitInfo(S, A)}$$

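The denominator is the split information. Where $S_v$ is the subset of $S$ in which attribute $A$ takes the value $v$, it is calculated as:

$$SplitInfo(S, A) = -\sum_{v \in Values(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$$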

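As a small worked example, the split information and gain ratio of OutLook can be computed from the subset sizes alone. This sketch is illustrative rather than the article's implementation: `GainRatioSketch` and its methods are assumed names, and the gain value of about 0.2467 for OutLook on the 14-row PlayTennis data is passed in as a precomputed number.

```java
class GainRatioSketch {
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Split information of a partition, given the number of rows in each subset. */
    static double splitInfo(int[] subsetSizes) {
        int total = 0;
        for (int n : subsetSizes) {
            total += n;
        }
        double info = 0.0;
        for (int n : subsetSizes) {
            if (n == 0) {
                continue; // an empty subset contributes nothing and would hit log(0)
            }
            double ratio = (double) n / total;
            info -= ratio * log2(ratio);
        }
        return info;
    }

    /** Gain ratio; defined here as 0 when the attribute does not split the data at all. */
    static double gainRatio(double gain, int[] subsetSizes) {
        double info = splitInfo(subsetSizes);
        return info == 0.0 ? 0.0 : gain / info;
    }

    public static void main(String[] args) {
        // OutLook splits the 14 rows into Sunny (5), Overcast (4) and Rainy (5)
        int[] outlookSizes = {5, 4, 5};
        System.out.println("SplitInfo(OutLook) = " + splitInfo(outlookSizes));
        System.out.println("GainRatio(OutLook) = " + gainRatio(0.2467, outlookSizes));
    }
}
```

The split information is about 1.577, so the gain ratio of OutLook is about 0.156. An attribute that scatters the rows over many small subsets has a larger split information, which pulls its gain ratio down.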
Compare this with the entropy formula: it has the same form, but it is computed over the proportions of the attribute's values rather than over the class labels. The gain ratio calculation is also in the code above; note the two methods called in this loop:

// Select, among the remaining attributes, the one with the maximum
// information gain as the next splitting attribute
for (int i = 0; i < remainAttr.size(); i++) {
    // Decide between the ID3 algorithm and the C4.5 algorithm
    if (isID3) {
        // ID3 uses the information gain and takes the largest value
        tempValue = computeGain(remainData, remainAttr.get(i));
    } else {
        // C4.5 improves on this by using the information gain ratio, which
        // overcomes the bias of information gain toward attributes with many values
        tempValue = computeGainRatio(remainData, remainAttr.get(i));
    }

    if (tempValue > gainValue) {
        gainValue = tempValue;
        attrName = remainAttr.get(i);
    }
}

C4.5 also supplements and improves on ID3 in other respects:

1. The tree can be pruned while the decision tree is being constructed.

2. Continuous attribute values can be discretized.
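The second point can be sketched for the simplest case, a binary split of one numeric attribute. The code below is an assumption-laden illustration and is not part of the program in this article: it tries the midpoint between each pair of neighboring distinct values as a threshold and keeps the one with the highest information gain, which is one common way to discretize and may differ in detail from what C4.5 itself does. The class and method names are invented for the example.

```java
import java.util.Arrays;

class ThresholdSketch {
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Two-class entropy from class counts; 0 for an empty or pure sample. */
    static double entropy(int pos, int neg) {
        int total = pos + neg;
        if (total == 0 || pos == 0 || neg == 0) {
            return 0.0;
        }
        double p = (double) pos / total;
        double q = (double) neg / total;
        return -p * log2(p) - q * log2(q);
    }

    /**
     * Returns the threshold t for which splitting into "value <= t" and
     * "value > t" gives the highest information gain, or NaN if all values are equal.
     */
    static double bestThreshold(double[] values, boolean[] positive) {
        int n = values.length;
        // Sort the row indices by attribute value
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Double.compare(values[a], values[b]));

        int totalPos = 0;
        for (boolean p : positive) {
            if (p) {
                totalPos++;
            }
        }
        double baseEntropy = entropy(totalPos, n - totalPos);

        double bestGain = -1.0;
        double best = Double.NaN;
        int leftPos = 0;
        for (int k = 0; k < n - 1; k++) {
            if (positive[order[k]]) {
                leftPos++;
            }
            double lo = values[order[k]];
            double hi = values[order[k + 1]];
            if (lo == hi) {
                continue; // no boundary between two equal values
            }
            int left = k + 1;
            int right = n - left;
            int rightPos = totalPos - leftPos;
            double remainder = (double) left / n * entropy(leftPos, left - leftPos)
                    + (double) right / n * entropy(rightPos, right - rightPos);
            double gain = baseEntropy - remainder;
            if (gain > bestGain) {
                bestGain = gain;
                best = (lo + hi) / 2.0;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up values: the three small ones are negative, the three large ones positive
        double[] values = {1, 2, 3, 10, 11, 12};
        boolean[] positive = {false, false, false, true, true, true};
        System.out.println(bestThreshold(values, positive)); // 6.5
    }
}
```

Once a threshold is chosen, the numeric attribute can be treated like a two-valued discrete attribute in the rest of the algorithm.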

IV. Some problems encountered in coding

Implementing the ID3 algorithm took me quite a while, starting with reading up on and understanding its principles. I then spent several more days reading a C++ version written by someone else before I finally implemented the two algorithms. The biggest trouble came at the end, while constructing the tree: because the tree is built recursively, the design of the node is very important, and my current design may not be the best one. Here are some of the problems, actual and potential, that my program ran into:

1. While building the decision tree, values went missing from remainAttr: an attribute removed from remainAttr during one recursive call was also missing in the recursive calls that followed, which affected their attribute selection. I later found the cause: remainAttr is an ArrayList, which is a reference type, so passing it along means that every call works on the same object. I therefore created a new ArrayList object for each recursive call, and the problem went away.

// Create a new attribute list; sharing the same object reference would cause errors
ArrayList<String> rAttr = new ArrayList<>();
for (String str : remainAttr) {
    rAttr.add(str);
}

buildDecisionTree(childNode[i], valueTypes.get(i), rData, rAttr, isID3);

2. The second problem arises when the program has split down to the last attribute. If the remaining rows still do not all belong to the same class, my code does no further processing and returns directly, which leaves the node with neither a splitting attribute nor any data indices.

private void buildDecisionTree(AttrNode node, String parentAttrValue,
        String[][] remainData, ArrayList<String> remainAttr, boolean isID3) {
    node.setParentAttrValue(parentAttrValue);

    String attrName = "";
    double gainValue = 0;
    double tempValue = 0;

    // If only 1 attribute remains, return directly
    if (remainAttr.size() == 1) {
        System.out.println("attr null");
        return;
    }
    .....

Personally, I do not think this handling is quite appropriate.
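Both problems can be illustrated in isolation. The sketch below is not taken from the program above: `TreeFixSketch` and its methods are hypothetical names, and the sample rows are made up. The first method copies the attribute list with the ArrayList copy constructor, which has the same effect as the loop shown for problem 1. The second shows one common way to handle problem 2, which the program above does not do: turning the node into a leaf labeled with the majority class of the remaining rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TreeFixSketch {
    /** Problem 1: an independent copy, so that removals in one branch stay local to it. */
    static ArrayList<String> copyAttrs(List<String> remainAttr) {
        return new ArrayList<>(remainAttr);
    }

    /**
     * Problem 2: the most frequent class label among the data rows. Row 0 is the
     * header and the label is the last column; ties go to the label seen first.
     * Returns null when there are no data rows.
     */
    static String majorityClass(String[][] remainData) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 1; i < remainData.length; i++) {
            String label = remainData[i][remainData[i].length - 1];
            counts.merge(label, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                bestCount = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        ArrayList<String> remainAttr =
                new ArrayList<>(Arrays.asList("Temperature", "Humidity", "Wind"));
        ArrayList<String> branchAttrs = copyAttrs(remainAttr);
        branchAttrs.remove("Humidity");
        System.out.println(remainAttr.size()); // 3: the original list is untouched

        // Made-up rows that cannot be split any further but still mix two classes
        String[][] rows = {
            {"Day", "Wind", "PlayTennis"},
            {"1", "Weak", "Yes"},
            {"2", "Weak", "No"},
            {"3", "Weak", "Yes"}
        };
        System.out.println(majorityClass(rows)); // Yes
    }
}
```

With a majority label available, the early return in buildDecisionTree could instead record that label on the node, so that every leaf carries a class.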

