Decision Tree Induction (ID3 Attribute Selection Measure): Java Implementation


For the general decision tree induction framework, see the previous post: http://blog.csdn.net/zhyoulun/article/details/41978381


Principle of the ID3 Attribute Selection Measure

ID3 uses information gain as its attribute selection measure. The measure is based on Shannon's pioneering work in information theory, which studied the value or "information content" of messages. Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N. This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or "impurity" in those partitions. This approach minimizes the expected number of tests needed to classify a given tuple and guarantees that a simple (but not necessarily the simplest) tree is found.

The expected information needed to classify a tuple in D is given by the following formula:

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

Here m is the number of classes, and p_i is the nonzero probability that an arbitrary tuple in D belongs to class C_i. A base-2 logarithm is used because the information is encoded in bits. Info(D) is the average amount of information needed to identify the class label of a tuple in D. Note that at this point the only information available is the proportion of tuples in each class.
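As a small illustration of the formula, the sketch below computes Info(D) from per-class tuple counts. The class name InfoSketch and the method info are hypothetical names chosen for this example and are not part of the code later in this post. The counts 9 and 5 are the numbers of "yes" and "no" tuples in the example data used further down.

```java
import java.util.Arrays;

// Hypothetical helper class for illustration; not part of the article's code.
class InfoSketch {

    /** Info(D) in bits, given the number of tuples in each class. */
    static double info(int... classCounts) {
        int total = Arrays.stream(classCounts).sum();
        double bits = 0.0;
        for (int count : classCounts) {
            if (count == 0) continue; // an empty class contributes nothing
            double p = (double) count / total;
            bits -= p * Math.log(p) / Math.log(2); // log base 2 via natural logs
        }
        return bits;
    }

    public static void main(String[] args) {
        // 9 tuples of class "yes" and 5 of class "no"
        System.out.printf("Info(D) = %.3f bits%n", info(9, 5));
    }
}
```

For the 9/5 split this comes to roughly 0.940 bits, while an even split between two classes gives exactly 1 bit.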

Now suppose we partition the tuples in D on some attribute A that has v distinct values {a1, a2, ..., av}, as observed in the training data. Attribute A can be used to split D into v partitions or subsets {D1, D2, ..., Dv}, where Dj contains the tuples in D whose value of A is aj. These partitions correspond to the branches grown from node N. Ideally, we would like this partitioning to produce an exact classification of the tuples; that is, we would like each partition to be pure. In practice, the partitions are usually impure (a partition may contain tuples from different classes). How much more information do we still need, after the partitioning, to arrive at an exact classification? This amount is measured by the following:

$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j)$$

The term |Dj|/|D| acts as the weight of the j-th partition. Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A. The smaller the expected information still required, the greater the purity of the partitions.
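The weighted sum can be sketched in a few lines of Java. The class InfoAfterSplitSketch and its methods are hypothetical names used only for this illustration. The counts are taken from the example data further down, partitioned on age: youth has 2 "yes" and 3 "no" tuples, middle_aged has 4 and 0, and senior has 3 and 2.

```java
// Hypothetical helper class for illustration; not part of the article's code.
class InfoAfterSplitSketch {

    /** Info of one partition, in bits, from its per-class tuple counts. */
    static double info(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        double bits = 0.0;
        for (int c : classCounts) {
            if (c == 0) continue; // a class with no tuples contributes nothing
            double p = (double) c / total;
            bits -= p * Math.log(p) / Math.log(2);
        }
        return bits;
    }

    /** Info_A(D): the partition infos weighted by |Dj| / |D|. */
    static double infoA(int[][] partitions) {
        int total = 0;
        for (int[] part : partitions)
            for (int c : part) total += c;
        double expected = 0.0;
        for (int[] part : partitions) {
            int size = 0;
            for (int c : part) size += c;
            expected += (double) size / total * info(part);
        }
        return expected;
    }

    public static void main(String[] args) {
        // {yes, no} counts of buys_computer for age = youth, middle_aged, senior
        int[][] byAge = { {2, 3}, {4, 0}, {3, 2} };
        System.out.printf("Info_age(D) = %.3f bits%n", infoA(byAge));
    }
}
```

The pure middle_aged partition contributes nothing to the sum, so the result (about 0.694 bits) comes entirely from the two mixed partitions.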

Information gain is defined as the difference between the original information requirement (based only on the proportion of classes) and the new requirement (obtained after partitioning on A). That is,

$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$

In other words, Gain(A) tells us how much is gained by branching on A. It is the expected reduction in the information requirement that results from knowing the value of A. The attribute A with the highest information gain Gain(A) is chosen as the splitting attribute at node N.
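To make the selection step concrete, the following sketch computes Gain(A) for every attribute of the example data set used below. It is independent of the implementation at the end of this post: the class GainSketch, its methods, and the in-memory DATA array are assumptions made for this illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration; not part of the article's code.
class GainSketch {

    // The 14 training tuples of the example; the last column is the class label.
    static final String[][] DATA = {
        {"youth", "high", "no", "fair", "no"},
        {"youth", "high", "no", "excellent", "no"},
        {"middle_aged", "high", "no", "fair", "yes"},
        {"senior", "medium", "no", "fair", "yes"},
        {"senior", "low", "yes", "fair", "yes"},
        {"senior", "low", "yes", "excellent", "no"},
        {"middle_aged", "low", "yes", "excellent", "yes"},
        {"youth", "medium", "no", "fair", "no"},
        {"youth", "low", "yes", "fair", "yes"},
        {"senior", "medium", "yes", "fair", "yes"},
        {"youth", "medium", "yes", "excellent", "yes"},
        {"middle_aged", "medium", "no", "excellent", "yes"},
        {"middle_aged", "high", "yes", "fair", "yes"},
        {"senior", "medium", "no", "excellent", "no"},
    };

    /** Info over the class label (last column) of the given rows, in bits. */
    static double info(List<String[]> rows) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] row : rows) counts.merge(row[row.length - 1], 1, Integer::sum);
        double bits = 0.0;
        for (int count : counts.values()) {
            double p = (double) count / rows.size();
            bits -= p * Math.log(p) / Math.log(2);
        }
        return bits;
    }

    /** Gain(A) = Info(D) - Info_A(D) for the attribute in column attr. */
    static double gain(List<String[]> rows, int attr) {
        Map<String, List<String[]>> parts = new HashMap<>();
        for (String[] row : rows)
            parts.computeIfAbsent(row[attr], k -> new ArrayList<>()).add(row);
        double infoA = 0.0;
        for (List<String[]> part : parts.values())
            infoA += (double) part.size() / rows.size() * info(part);
        return info(rows) - infoA;
    }

    public static void main(String[] args) {
        String[] names = {"age", "income", "student", "credit_rating"};
        List<String[]> rows = Arrays.asList(DATA);
        for (int a = 0; a < names.length; a++)
            System.out.printf("Gain(%s) = %.3f%n", names[a], gain(rows, a));
    }
}
```

On this data the gains are roughly 0.247 for age, 0.029 for income, 0.152 for student, and 0.048 for credit_rating, so age is selected as the splitting attribute at the root.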


Here is an example.


Data

data.txt

youth,high,no,fair,no
youth,high,no,excellent,no
middle_aged,high,no,fair,yes
senior,medium,no,fair,yes
senior,low,yes,fair,yes
senior,low,yes,excellent,no
middle_aged,low,yes,excellent,yes
youth,medium,no,fair,no
youth,low,yes,fair,yes
senior,medium,yes,fair,yes
youth,medium,yes,excellent,yes
middle_aged,medium,no,excellent,yes
middle_aged,high,yes,fair,yes
senior,medium,no,excellent,no


attr.txt

age,income,student,credit_rating,buys_computer


Run result

age(1:youth;2:middle_aged;3:senior;)
	credit_rating(1:fair;2:excellent;)
		leaf:no()
		leaf:yes()
	leaf:yes()
	student(1:no;2:yes;)
		leaf:no()
		leaf:yes()



Finally, here is the Java code.

DecisionTree.java

package com.zhyoulun.decision;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Map;

/**
 * Responsible for reading the data and generating the decision tree.
 *
 * @author zhyoulun
 */
public class DecisionTree {

	private ArrayList<ArrayList<String>> allDatas;
	private ArrayList<String> allAttributes;

	/**
	 * Reads all relevant data from the files.
	 *
	 * @param dataFilePath path of the training data file
	 * @param attrFilePath path of the attribute name file
	 */
	public DecisionTree(String dataFilePath, String attrFilePath) {
		super();
		this.allDatas = new ArrayList<>();
		this.allAttributes = new ArrayList<>();

		// try-with-resources closes both readers, even if reading fails
		try (BufferedReader dataReader = new BufferedReader(
					new InputStreamReader(new FileInputStream(new File(dataFilePath))));
				BufferedReader attrReader = new BufferedReader(
					new InputStreamReader(new FileInputStream(new File(attrFilePath))))) {
			String line = null;
			// One comma-separated training tuple per line
			while ((line = dataReader.readLine()) != null) {
				String[] strings = line.split(",");
				ArrayList<String> data = new ArrayList<>();
				for (int i = 0; i < strings.length; i++)
					data.add(strings[i]);
				this.allDatas.add(data);
			}
			// Comma-separated attribute names; the last one is the class label
			while ((line = attrReader.readLine()) != null) {
				String[] strings = line.split(",");
				for (int i = 0; i < strings.length; i++)
					this.allAttributes.add(strings[i]);
			}
		} catch (FileNotFoundException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}

	/**
	 * @param allDatas
	 * @param allAttributes
	 */
	public DecisionTree(ArrayList<ArrayList<String>> allDatas, ArrayList<String> allAttributes) {
		super();
		this.allDatas = allDatas;
		this.allAttributes = allAttributes;
	}

	public ArrayList<ArrayList<String>> getAllDatas() {
		return allDatas;
	}

	public void setAllDatas(ArrayList<ArrayList<String>> allDatas) {
		this.allDatas = allDatas;
	}

	public ArrayList<String> getAllAttributes() {
		return allAttributes;
	}

	public void setAllAttributes(ArrayList<String> allAttributes) {
		this.allAttributes = allAttributes;
	}

	/**
	 * Recursively generates the decision tree.
	 *
	 * @return the root node of the (sub)tree built from datas
	 */
	public static TreeNode generateDecisionTree(ArrayList<ArrayList<String>> datas, ArrayList<String> attrs) {
		TreeNode treeNode = new TreeNode();

		// If all tuples in D belong to the same class C, return a leaf named after C
		if (isInTheSameClass(datas)) {
			treeNode.setName(datas.get(0).get(datas.get(0).size() - 1));
			return treeNode;
		}

		// If attrs is empty (this does not normally happen, because the tree
		// is built over the full set of candidate attributes)
		if (attrs.size() == 0)
			return treeNode;

		CriterionID3 criterionID3 = new CriterionID3(datas, attrs);
		int splitingCriterionIndex = criterionID3.attributeSelectionMethod();

		treeNode.setName(attrs.get(splitingCriterionIndex));
		treeNode.setRules(getValueSet(datas, splitingCriterionIndex));

		attrs.remove(splitingCriterionIndex);

		Map<String, ArrayList<ArrayList<String>>> subDatasMap = criterionID3.getSubDatasMap(splitingCriterionIndex);

		// Note: the key order of a HashMap is unspecified, so the children are
		// not guaranteed to follow the order of the rules list
		for (String key : subDatasMap.keySet()) {
			ArrayList<TreeNode> treeNodes = treeNode.getChildren();
			// Each branch gets its own copy of the remaining attributes, so a removal
			// made while building one subtree cannot affect its siblings
			treeNodes.add(generateDecisionTree(subDatasMap.get(key), new ArrayList<String>(attrs)));
			treeNode.setChildren(treeNodes);
		}

		return treeNode;
	}

	/**
	 * Gets the distinct values of column index in datas, in first-seen order.
	 *
	 * @param datas
	 * @param index
	 * @return
	 */
	public static ArrayList<String> getValueSet(ArrayList<ArrayList<String>> datas, int index) {
		ArrayList<String> values = new ArrayList<String>();
		String r = "";
		for (int i = 0; i < datas.size(); i++) {
			r = datas.get(i).get(index);
			if (!values.contains(r)) {
				values.add(r);
			}
		}
		return values;
	}

	/**
	 * The last column is the class label; checks whether it is the same in every row.
	 *
	 * @param datas
	 * @return
	 */
	private static boolean isInTheSameClass(ArrayList<ArrayList<String>> datas) {
		// Row 0, last column, is used as the reference value
		String flag = datas.get(0).get(datas.get(0).size() - 1);
		for (int i = 0; i < datas.size(); i++) {
			if (!datas.get(i).get(datas.get(i).size() - 1).equals(flag))
				return false;
		}
		return true;
	}

	public static void main(String[] args) {
		String dataPath = "files/data.txt";
		String attrPath = "files/attr.txt";

		// Initialize the original data
		DecisionTree decisionTree = new DecisionTree(dataPath, attrPath);

		// Generate the decision tree
		TreeNode treeNode = generateDecisionTree(decisionTree.getAllDatas(), decisionTree.getAllAttributes());

		print(treeNode, 0);
	}

	private static void print(TreeNode treeNode, int level) {
		for (int i = 0; i < level; i++)
			System.out.print("\t");
		System.out.print(treeNode.getName());
		System.out.print("(");
		for (int i = 0; i < treeNode.getRules().size(); i++)
			System.out.print((i + 1) + ":" + treeNode.getRules().get(i) + ";");
		System.out.println(")");

		ArrayList<TreeNode> treeNodes = treeNode.getChildren();
		for (int i = 0; i < treeNodes.size(); i++) {
			print(treeNodes.get(i), level + 1);
		}
	}
}



CriterionID3.java

package com.zhyoulun.decision;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

/**
 * ID3: selects the splitting criterion.
 *
 * @author zhyoulun
 */
public class CriterionID3 {

	private ArrayList<ArrayList<String>> datas;
	private ArrayList<String> attributes;
	private Map<String, ArrayList<ArrayList<String>>> subDatasMap;

	/**
	 * Computes the information gain of every attribute and returns the index
	 * of the attribute with the largest gain as the splitting attribute.
	 *
	 * @return
	 */
	public int attributeSelectionMethod() {
		double gain = -1.0;
		int maxIndex = 0;
		// The last entry of attributes is the class label, so it is skipped
		for (int i = 0; i < this.attributes.size() - 1; i++) {
			double tempGain = this.calcGain(i);
			if (tempGain > gain) {
				gain = tempGain;
				maxIndex = i;
			}
		}
		return maxIndex;
	}

	/**
	 * Computes, for example, Gain(age) = Info(D) - Info_age(D).
	 *
	 * @param index column of the attribute
	 * @return
	 */
	private double calcGain(int index) {
		double result = 0;

		// Compute Info(D) and add it to the result
		int lastIndex = datas.get(0).size() - 1;
		ArrayList<String> valueSet = DecisionTree.getValueSet(this.datas, lastIndex);
		for (String value : valueSet) {
			int count = 0;
			for (int i = 0; i < datas.size(); i++) {
				if (datas.get(i).get(lastIndex).equals(value))
					count++;
			}
			result += -(1.0 * count / datas.size()) * Math.log(1.0 * count / datas.size()) / Math.log(2);
		}

		// Compute Info_A(D) and subtract it from the result
		valueSet = DecisionTree.getValueSet(this.datas, index);
		for (String value : valueSet) {
			// Partition Dj: the tuples whose attribute value equals value
			ArrayList<ArrayList<String>> subDatas = new ArrayList<>();
			for (int i = 0; i < datas.size(); i++) {
				if (datas.get(i).get(index).equals(value))
					subDatas.add(datas.get(i));
			}

			ArrayList<String> subValueSet = DecisionTree.getValueSet(subDatas, lastIndex);
			for (String subValue : subValueSet) {
				int count = 0;
				for (int i = 0; i < subDatas.size(); i++) {
					if (subDatas.get(i).get(lastIndex).equals(subValue))
						count++;
				}
				result += -1.0 * subDatas.size() / datas.size()
						* (-(1.0 * count / subDatas.size()) * Math.log(1.0 * count / subDatas.size()) / Math.log(2));
			}
		}

		return result;
	}

	public CriterionID3(ArrayList<ArrayList<String>> datas, ArrayList<String> attributes) {
		super();
		this.datas = datas;
		this.attributes = attributes;
	}

	public ArrayList<ArrayList<String>> getDatas() {
		return datas;
	}

	public void setDatas(ArrayList<ArrayList<String>> datas) {
		this.datas = datas;
	}

	public ArrayList<String> getAttributes() {
		return attributes;
	}

	public void setAttributes(ArrayList<String> attributes) {
		this.attributes = attributes;
	}

	/**
	 * Partitions datas by the values of column index and removes that column
	 * from every partition.
	 */
	public Map<String, ArrayList<ArrayList<String>>> getSubDatasMap(int index) {
		ArrayList<String> valueSet = DecisionTree.getValueSet(this.datas, index);
		this.subDatasMap = new HashMap<String, ArrayList<ArrayList<String>>>();

		for (String value : valueSet) {
			ArrayList<ArrayList<String>> subDatas = new ArrayList<>();
			for (int i = 0; i < this.datas.size(); i++) {
				if (this.datas.get(i).get(index).equals(value)) {
					// Work on a copy of the row: removing the column from the shared
					// row would shift the columns seen by the comparisons that follow
					ArrayList<String> row = new ArrayList<String>(this.datas.get(i));
					row.remove(index);
					subDatas.add(row);
				}
			}
			this.subDatasMap.put(value, subDatas);
		}

		return subDatasMap;
	}

	public void setSubDatasMap(Map<String, ArrayList<ArrayList<String>>> subDatasMap) {
		this.subDatasMap = subDatasMap;
	}
}
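One detail of getSubDatasMap is worth a note: java.util.HashMap makes no guarantee about iteration order, so the children that generateDecisionTree appends while looping over keySet() may not follow the first-seen order that getValueSet uses for a node's rules. If matching orders are wanted, one possible adjustment is to partition into a java.util.LinkedHashMap, which iterates in insertion order. The sketch below is only an illustration under that assumption: the class PartitionSketch and its partition method are hypothetical names, and unlike getSubDatasMap the sketch leaves the split column in each row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration; not part of the article's code.
class PartitionSketch {

    /** Groups rows by the value in column index, keeping first-seen key order. */
    static Map<String, List<List<String>>> partition(List<List<String>> rows, int index) {
        Map<String, List<List<String>>> parts = new LinkedHashMap<>();
        for (List<String> row : rows) {
            parts.computeIfAbsent(row.get(index), k -> new ArrayList<>()).add(row);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("youth", "no"),
            Arrays.asList("middle_aged", "yes"),
            Arrays.asList("senior", "yes"),
            Arrays.asList("youth", "yes"));
        // Keys come back in the order in which the values first appeared
        System.out.println(partition(rows, 0).keySet());
    }
}
```

With a map of this kind, the i-th child of a node would correspond to the i-th rule, which makes the printed tree easier to read.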



TreeNode.java

package com.zhyoulun.decision;

import java.util.ArrayList;

public class TreeNode {

	private String name;                  // name of the node (the splitting attribute)
	private ArrayList<String> rules;      // splitting rules of the node (all values are assumed to be discrete)
	private ArrayList<TreeNode> children; // child nodes

	public TreeNode() {
		this.name = "";
		this.rules = new ArrayList<String>();
		this.children = new ArrayList<TreeNode>();
	}

	public String getName() {
		return name;
	}

	public void setName(String name) {
		this.name = name;
	}

	public ArrayList<String> getRules() {
		return rules;
	}

	public void setRules(ArrayList<String> rules) {
		this.rules = rules;
	}

	public ArrayList<TreeNode> getChildren() {
		return children;
	}

	public void setChildren(ArrayList<TreeNode> children) {
		this.children = children;
	}
}



Reference: "Data Mining: Concepts and Techniques (3rd Edition)"

