Re-learning Bayesian Networks: The TAN Tree-Augmented Naive Bayes Algorithm

Preface

Some time ago I studied the NB (naive Bayes) algorithm, and recently made a preliminary study of the basic concepts and common computational methods of Bayesian networks, which led to my earlier article on first acquaintance with Bayesian networks. I have been reading "Introduction to Bayesian Networks" and have been exposed to a lot of related material, so it is fair to say that the naive Bayes algorithm is only a small part of Bayesian knowledge. What I want to summarize today builds on the NB algorithm. It is called TAN, tree-augmented naive Bayes, which can be simply understood as a tree-enhanced NB algorithm. So the question is: how does it enhance NB? Please read on.

Naive Bayes Algorithm

We have to start from the naive Bayes algorithm, since the preface already said that TAN enhances NB. Anyone who knows the NB algorithm knows that it assumes the condition attributes are mutually independent, that the class attribute depends on each condition attribute, and that the class value with the maximum posterior probability is selected as the decision result. For example, the following simple model:


In the model above, whether an account is genuine depends on three condition attributes: friend density, whether a real avatar is used, and log density. NB assumes these three attributes are independent of each other. In reality, however, whether the avatar is real is correlated with the friend density, so a more realistic model looks like this:
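Written as a formula, the NB decision rule that underlies both models above is the standard one (textbook notation, not reproduced from the article's figures):

```latex
% Naive Bayes: choose the class value c with the largest posterior;
% under the attribute-independence assumption the likelihood factorizes.
\hat{c} \;=\; \arg\max_{c}\; P(c \mid x_1,\dots,x_n)
       \;=\; \arg\max_{c}\; P(c)\prod_{i=1}^{n} P(x_i \mid c)
```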


OK, the emergence of TAN solves the problem of partial dependencies between condition attributes. In the example above we judged the relationship between the avatar and the friend density by subjective intuition, but in a real algorithm we certainly hope the machine can derive such relationships from the data set, and happily, TAN does exactly that for us.

Mutual Information in the TAN Algorithm

The mutual information value is explained in the Baidu Encyclopedia as follows:

Mutual information is a useful information measure in information theory. It can be seen as the amount of information that one random variable contains about another random variable.

It is written I(X;Y), where X and Y are the two attributes. The property below is then easy to understand: the larger the mutual information value, the stronger the correlation between the two attributes. The standard formula for the mutual information value (the original figure is reproduced here as text) is:

I(X;Y) = Σx Σy P(x,y) log( P(x,y) / ( P(x)P(y) ) )
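To make the standard formula concrete, here is a minimal, self-contained sketch (the class `MutualInfoDemo` and its method name are illustrative, not part of the article's code) that computes I(X;Y) in bits from a joint count table:

```java
public class MutualInfoDemo {
	/**
	 * Compute I(X;Y) in bits from a joint count table, where
	 * counts[x][y] is the number of samples with X=x and Y=y.
	 */
	static double mutualInformation(int[][] counts) {
		int rows = counts.length;
		int cols = counts[0].length;
		int total = 0;
		for (int x = 0; x < rows; x++)
			for (int y = 0; y < cols; y++)
				total += counts[x][y];

		// marginal distributions P(x) and P(y)
		double[] px = new double[rows];
		double[] py = new double[cols];
		for (int x = 0; x < rows; x++)
			for (int y = 0; y < cols; y++) {
				px[x] += counts[x][y] / (double) total;
				py[y] += counts[x][y] / (double) total;
			}

		double mi = 0;
		for (int x = 0; x < rows; x++)
			for (int y = 0; y < cols; y++) {
				double pxy = counts[x][y] / (double) total;
				if (pxy > 0) {
					// p(x,y) * log2( p(x,y) / (p(x)p(y)) )
					mi += pxy * Math.log(pxy / (px[x] * py[y])) / Math.log(2);
				}
			}
		return mi;
	}

	public static void main(String[] args) {
		// diagonal-heavy table: the variables are correlated, MI > 0
		System.out.println(mutualInformation(new int[][] { { 2, 1 }, { 1, 2 } }));
		// uniform table: the variables are independent, MI = 0
		System.out.println(mutualInformation(new int[][] { { 1, 1 }, { 1, 1 } }));
	}
}
```

Independent variables give a mutual information of exactly 0, which matches the intuition stated above that larger values mean stronger correlation.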
In TAN, however, things are slightly different: the class attribute is added as a conditioning variable, because the correlation between attributes has to be recomputed under each class attribute value, and different class values yield different attribute associations. The I(X;Y) used in TAN is therefore the conditional mutual information (again reproduced as text):

I(Xi;Xj | C) = Σ P(xi,xj | c) log( P(xi,xj | c) / ( P(xi|c) P(xj|c) ) )
It doesn't matter if you can't fully understand it now; you can step through it in the program code given below.

Algorithm Implementation Process

TAN's algorithm process is not simple. After computing the mutual information value of each attribute pair, the Bayesian network has to be constructed, which is the most difficult part of TAN. It consists of the following stages.

1. Sort the attribute pairs by mutual information value in descending order and take out the node pairs one by one. Following the principle of not producing loops, build the maximum weight spanning tree until n-1 edges have been selected (with n attribute nodes in total, n-1 edges determine the tree). Selecting mutual information values from high to low preserves the edges with the strongest dependencies.

2. The above process yields an undirected graph; next, determine the direction of every edge. Select any attribute node as the root, and orient all edges outward from the root toward the other attribute nodes.

3. Add a parent node to every attribute node, namely the class attribute node. The Bayesian network structure is now complete.
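Step 1 is essentially Kruskal's greedy algorithm applied to the largest weights first. The article's code below avoids loops with a simple node-containment check; a cycle-safe variant can be sketched with union-find (all names here are illustrative, not taken from the article's code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MaxSpanningTreeDemo {
	// union-find parent array; find() uses path compression
	static int[] parent;

	static int find(int x) {
		return parent[x] == x ? x : (parent[x] = find(parent[x]));
	}

	// returns false when a and b are already connected, i.e. the edge would close a loop
	static boolean union(int a, int b) {
		int ra = find(a);
		int rb = find(b);
		if (ra == rb) {
			return false;
		}
		parent[ra] = rb;
		return true;
	}

	/**
	 * Greedy maximum weight spanning tree over n nodes. Each edge is
	 * {nodeA, nodeB, weight}; edges are tried in descending weight order
	 * and skipped when they would create a cycle, until n-1 are chosen.
	 */
	static List<double[]> maxSpanningTree(int n, double[][] edges) {
		parent = new int[n];
		for (int i = 0; i < n; i++) {
			parent[i] = i;
		}

		double[][] sorted = edges.clone();
		Arrays.sort(sorted, (x, y) -> Double.compare(y[2], x[2]));

		List<double[]> chosen = new ArrayList<>();
		for (double[] e : sorted) {
			if (chosen.size() == n - 1) {
				break;
			}
			if (union((int) e[0], (int) e[1])) {
				chosen.add(e);
			}
		}
		return chosen;
	}

	public static void main(String[] args) {
		// toy weights for illustration; in TAN these would be mutual information values
		double[][] edges = { { 0, 1, 0.9 }, { 1, 2, 0.8 }, { 0, 2, 0.7 },
				{ 2, 3, 0.5 }, { 1, 3, 0.4 } };
		for (double[] e : maxSpanningTree(4, edges)) {
			System.out.println((int) e[0] + " - " + (int) e[1] + " (weight " + e[2] + ")");
		}
	}
}
```

On the toy input, the 0.7 edge is skipped because nodes 0 and 2 are already connected through node 1, which is exactly the no-loop principle of step 1.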

To make this easier to understand, I took a few pictures from the Internet. The following shows 5 attribute nodes with the 4 edges of maximum mutual information selected, as an undirected graph:


The arrows appear because I chose A as the root of the tree, after which all the directions are determined, since A connects directly to the other 4 attribute nodes. Then the class parent node is added on this basis, which looks like the following.


OK, this should be easier to understand now. If it still isn't, please analyze the program I wrote carefully; understanding the process through the code works just as well.

Calculating the Probability of the Classification Result

Computing the probability of the classification result is very simple: pass the condition attributes of the query into the classification model, compute the probability under each class attribute value, and the class value with the maximum probability is the final classification result. The joint probability distribution (the original figure is reproduced as text) is:

P(C, X1, ..., Xn) = P(C) * Π P(Xi | Parents(Xi))

where the parents of each attribute node Xi are the class node C plus at most one other attribute node.
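As a sanity check of the formula, here is a tiny sketch (the class name and the factor values are hypothetical, chosen only for illustration) that multiplies the factors the same way the article's `calHappenedPro` method does, including its 0.001 floor for zero-count factors:

```java
public class JointProbDemo {
	/**
	 * Multiply the factors of the TAN joint probability
	 * P(c, x1..xn) = P(c) * prod_i P(xi | parents(xi)),
	 * applying a 0.001 floor so one empty count cannot zero the whole product.
	 */
	static double jointProbability(double classPrior, double[] conditionals) {
		double result = classPrior;
		for (double p : conditionals) {
			// smoothing correction for zero conditional probabilities
			result *= (p == 0 ? 0.001 : p);
		}
		return result;
	}

	public static void main(String[] args) {
		// hypothetical factors for illustration only
		System.out.println(jointProbability(9.0 / 14, new double[] { 2.0 / 3, 2.0 / 3, 1.0 / 3, 1.0 }));
	}
}
```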
Code Implementation

Test Data Set Input.txt:

Outlook Temperature Humidity Wind PlayTennis
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rainy Mild High Weak Yes
Rainy Cool Normal Weak Yes
Rainy Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rainy Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rainy Mild High Strong No
Node class Node.java:

package DataMining_TAN;

import java.util.ArrayList;

/**
 * Bayesian network node class
 * 
 * @author lyq
 */
public class Node {
	// unique node ID, used later when determining edge directions
	int id;
	// attribute name of the node
	String name;
	// nodes connected to this node
	ArrayList<Node> connectedNodes;

	public Node(int id, String name) {
		this.id = id;
		this.name = name;

		// initialize variables
		this.connectedNodes = new ArrayList<>();
	}

	/**
	 * Connect this node to the given target node
	 * 
	 * @param node
	 *            downstream node
	 */
	public void connectNode(Node node) {
		// avoid connecting the node to itself
		if (this.id == node.id) {
			return;
		}

		// add the target node to this node's connection list
		this.connectedNodes.add(node);
		// add this node to the target node's connection list
		node.connectedNodes.add(this);
	}

	/**
	 * Determine whether this node equals the target node, comparing node IDs
	 * 
	 * @param node
	 *            target node
	 * @return
	 */
	public boolean isEqual(Node node) {
		boolean isEqual;

		isEqual = false;
		// nodes with the same ID are considered equal
		if (this.id == node.id) {
			isEqual = true;
		}

		return isEqual;
	}
}

Mutual information value class AttrMutualInfo.java:

package DataMining_TAN;

/**
 * Mutual information value between a pair of attributes, indicating how
 * strongly the attributes are associated
 * 
 * @author lyq
 */
public class AttrMutualInfo implements Comparable<AttrMutualInfo> {
	// mutual information value
	Double value;
	// the associated attribute node pair
	Node[] nodeArray;

	public AttrMutualInfo(double value, Node node1, Node node2) {
		this.value = value;

		this.nodeArray = new Node[2];
		this.nodeArray[0] = node1;
		this.nodeArray[1] = node2;
	}

	@Override
	public int compareTo(AttrMutualInfo o) {
		// descending order: larger mutual information values come first
		return o.value.compareTo(this.value);
	}
}


Algorithm main program class TanTool.java:

package DataMining_TAN;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;

/**
 * TAN tree-augmented naive Bayes algorithm tool class
 * 
 * @author lyq
 */
public class TanTool {
	// test data set file path
	private String filePath;
	// total number of attributes in the data set, including the class attribute
	private int attrNum;
	// class attribute name
	private String classAttrName;
	// attribute name header row
	private String[] attrNames;
	// edge directions of the Bayesian network; the array indices are node IDs,
	// edges[i][j] == 1 means the edge i -> j
	private int[][] edges;
	// mapping from attribute name to column index
	private HashMap<String, Integer> attr2Column;
	// mapping from attribute name to its set of values
	private HashMap<String, ArrayList<String>> attr2Values;
	// list of all nodes in the Bayesian network
	private ArrayList<Node> totalNodes;
	// all test data rows
	private ArrayList<String[]> totalDatas;

	public TanTool(String filePath) {
		this.filePath = filePath;
		readDataFile();
	}

	/**
	 * Read data from file
	 */
	private void readDataFile() {
		File file = new File(filePath);
		ArrayList<String[]> dataArray = new ArrayList<String[]>();

		try {
			BufferedReader in = new BufferedReader(new FileReader(file));
			String str;
			String[] array;

			while ((str = in.readLine()) != null) {
				array = str.split(" ");
				dataArray.add(array);
			}
			in.close();
		} catch (IOException e) {
			e.getStackTrace();
		}

		this.totalDatas = dataArray;
		this.attrNames = this.totalDatas.get(0);
		this.attrNum = this.attrNames.length;
		this.classAttrName = this.attrNames[attrNum - 1];

		Node node;
		this.edges = new int[attrNum][attrNum];
		this.totalNodes = new ArrayList<>();
		this.attr2Column = new HashMap<>();
		this.attr2Values = new HashMap<>();

		// the class attribute node gets the smallest ID, 0
		node = new Node(0, attrNames[attrNum - 1]);
		this.totalNodes.add(node);
		for (int i = 0; i < attrNames.length; i++) {
			if (i < attrNum - 1) {
				// create one Bayesian network node per condition attribute
				node = new Node(i + 1, attrNames[i]);
				this.totalNodes.add(node);
			}

			// add the attribute-to-column mapping
			this.attr2Column.put(attrNames[i], i);
		}

		String[] temp;
		ArrayList<String> values;
		// build the mapping from attribute name to its attribute values
		for (int i = 1; i < this.totalDatas.size(); i++) {
			temp = this.totalDatas.get(i);

			for (int j = 0; j < temp.length; j++) {
				// check whether the map already contains this attribute name
				if (this.attr2Values.containsKey(attrNames[j])) {
					values = this.attr2Values.get(attrNames[j]);
				} else {
					values = new ArrayList<>();
				}

				if (!values.contains(temp[j])) {
					// add the new attribute value
					values.add(temp[j]);
				}

				this.attr2Values.put(attrNames[j], values);
			}
		}
	}

	/**
	 * Build the maximum weight spanning tree from the conditional mutual
	 * information values; the first node is returned as the root
	 * 
	 * @param iArray
	 */
	private Node constructWeightTree(ArrayList<Node[]> iArray) {
		Node node1;
		Node node2;
		Node root;
		ArrayList<Node> existNodes;

		existNodes = new ArrayList<>();
		for (Node[] i : iArray) {
			node1 = i[0];
			node2 = i[1];

			// connect the 2 nodes
			node1.connectNode(node2);
			// avoid the loop phenomenon
			addIfNotExist(node1, existNodes);
			addIfNotExist(node2, existNodes);

			if (existNodes.size() == attrNum - 1) {
				break;
			}
		}

		// return the first node as the root
		root = existNodes.get(0);
		return root;
	}

	/**
	 * Determine the direction of each edge in the tree: edges point away from
	 * the root node toward the other attribute nodes
	 * 
	 * @param currentNode
	 *            the node currently being traversed
	 */
	private void confirmGraphDirection(Node currentNode) {
		int i;
		int j;
		ArrayList<Node> connectedNodes;

		connectedNodes = currentNode.connectedNodes;
		i = currentNode.id;

		for (Node n : connectedNodes) {
			j = n.id;

			// check whether the direction between these 2 nodes is already set
			if (edges[i][j] == 0 && edges[j][i] == 0) {
				// if not, set the direction to i -> j
				edges[i][j] = 1;
				// continue the search recursively
				confirmGraphDirection(n);
			}
		}
	}

	/**
	 * Add the class attribute node as the parent of every attribute node
	 */
	private void addParentNode() {
		// class attribute node
		Node parentNode;

		parentNode = null;
		for (Node n : this.totalNodes) {
			if (n.id == 0) {
				parentNode = n;
				break;
			}
		}

		for (Node child : this.totalNodes) {
			parentNode.connectNode(child);

			if (child.id != 0) {
				// set the connection direction
				this.edges[0][child.id] = 1;
			}
		}
	}

	/**
	 * Add a node to the node collection if it is not already present
	 * 
	 * @param node
	 *            node to add
	 * @param existNodes
	 *            list of existing nodes
	 * @return
	 */
	public boolean addIfNotExist(Node node, ArrayList<Node> existNodes) {
		boolean canAdd;

		canAdd = true;
		for (Node n : existNodes) {
			// if the node list already contains this node, the add fails
			if (n.isEqual(node)) {
				canAdd = false;
				break;
			}
		}

		if (canAdd) {
			existNodes.add(node);
		}

		return canAdd;
	}

	/**
	 * Compute the conditional probability of a node
	 * 
	 * @param node
	 *            the node whose conditional probability is computed
	 * @param queryParam
	 *            query attribute parameters
	 * @return
	 */
	private double calConditionPro(Node node, HashMap<String, String> queryParam) {
		int id;
		double pro;
		String value;
		String[] attrValue;

		ArrayList<String[]> priorAttrInfos;
		ArrayList<String[]> backAttrInfos;
		ArrayList<Node> parentNodes;

		pro = 1;
		id = node.id;
		parentNodes = new ArrayList<>();
		priorAttrInfos = new ArrayList<>();
		backAttrInfos = new ArrayList<>();

		for (int i = 0; i < this.edges.length; i++) {
			// look for the parent node IDs
			if (this.edges[i][id] == 1) {
				for (Node temp : this.totalNodes) {
					// search for the node with the target ID
					if (temp.id == i) {
						parentNodes.add(temp);
						break;
					}
				}
			}
		}

		// get the attribute value of the queried attribute and add it first
		value = queryParam.get(node.name);
		attrValue = new String[2];
		attrValue[0] = node.name;
		attrValue[1] = value;
		priorAttrInfos.add(attrValue);

		// add the conditioning (parent) attributes
		for (Node p : parentNodes) {
			value = queryParam.get(p.name);
			attrValue = new String[2];
			attrValue[0] = p.name;
			attrValue[1] = value;

			backAttrInfos.add(attrValue);
		}

		pro = queryConditionPro(priorAttrInfos, backAttrInfos);

		return pro;
	}

	/**
	 * Query a conditional probability
	 * 
	 * @param priorValues
	 *            queried attribute-value pairs
	 * @param backValues
	 *            conditioning attribute-value pairs
	 * @return
	 */
	private double queryConditionPro(ArrayList<String[]> priorValues,
			ArrayList<String[]> backValues) {
		// whether the row satisfies the queried attribute values
		boolean hasPrior;
		// whether the row satisfies the conditioning attribute values
		boolean hasBack;
		int attrIndex;
		double backPro;
		double totalPro;
		double pro;
		String[] tempData;

		pro = 0;
		totalPro = 0;
		backPro = 0;

		// skip the attribute name row on the first line
		for (int i = 1; i < this.totalDatas.size(); i++) {
			tempData = this.totalDatas.get(i);

			hasPrior = true;
			hasBack = true;

			// check whether the queried attribute values are satisfied
			for (String[] array : priorValues) {
				attrIndex = this.attr2Column.get(array[0]);

				// check whether the value matches
				if (!tempData[attrIndex].equals(array[1])) {
					hasPrior = false;
					break;
				}
			}

			// check whether the conditioning values are satisfied
			for (String[] array : backValues) {
				attrIndex = this.attr2Column.get(array[0]);

				// check whether the value matches
				if (!tempData[attrIndex].equals(array[1])) {
					hasBack = false;
					break;
				}
			}

			// count the rows satisfying the condition and the rows satisfying
			// both the condition and the queried values
			if (hasBack) {
				backPro++;
				if (hasPrior) {
					totalPro++;
				}
			} else if (hasPrior && backValues.size() == 0) {
				// with no conditioning attributes this is a plain probability
				totalPro++;
				backPro = 1.0;
			}
		}

		if (backPro == 0) {
			pro = 0;
		} else {
			// conditional probability = count(both) / count(condition only)
			pro = totalPro / backPro;
		}

		return pro;
	}

	/**
	 * Given the query condition parameters, compute the joint probability
	 * 
	 * @param queryParam
	 *            condition parameter string
	 * @return
	 */
	public double calHappenedPro(String queryParam) {
		double result;
		double temp;
		// class attribute value
		String classAttrValue;
		String[] array;
		String[] array2;
		HashMap<String, String> params;

		result = 1;
		params = new HashMap<>();

		// decompose the query string into attribute-value parameters
		array = queryParam.split(",");
		for (String s : array) {
			array2 = s.split("=");
			params.put(array2[0], array2[1]);
		}

		classAttrValue = params.get(classAttrName);

		// build the Bayesian network structure for this class value
		constructBayesNetwork(classAttrValue);

		for (Node n : this.totalNodes) {
			temp = calConditionPro(n, params);

			// apply a small correction to avoid zero conditional probabilities
			if (temp == 0) {
				temp = 0.001;
			}

			// multiply according to the joint probability formula
			result *= temp;
		}

		return result;
	}

	/**
	 * Build the tree-shaped Bayesian network structure
	 * 
	 * @param value
	 *            class attribute value
	 */
	private void constructBayesNetwork(String value) {
		Node rootNode;
		ArrayList<AttrMutualInfo> mInfoArray;
		// mutual information node pairs
		ArrayList<Node[]> iArray;

		iArray = null;
		rootNode = null;

		// every time the network is rebuilt, clear the old connections
		for (Node n : this.totalNodes) {
			n.connectedNodes.clear();
		}
		this.edges = new int[attrNum][attrNum];

		// extract the node pairs from the mutual information objects
		iArray = new ArrayList<>();
		mInfoArray = calAttrMutualInfoArray(value);
		for (AttrMutualInfo v : mInfoArray) {
			iArray.add(v.nodeArray);
		}

		// build the maximum weight spanning tree
		rootNode = constructWeightTree(iArray);
		// determine the edge directions of the undirected graph
		confirmGraphDirection(rootNode);
		// add the class attribute node as the parent of every attribute node
		addParentNode();
	}

	/**
	 * Given a class attribute value, compute the mutual information value
	 * between every pair of attributes
	 * 
	 * @param value
	 *            class attribute value
	 * @return
	 */
	private ArrayList<AttrMutualInfo> calAttrMutualInfoArray(String value) {
		double iValue;
		Node node1;
		Node node2;
		AttrMutualInfo mInfo;
		ArrayList<AttrMutualInfo> mInfoArray;

		mInfoArray = new ArrayList<>();
		for (int i = 0; i < this.totalNodes.size() - 1; i++) {
			node1 = this.totalNodes.get(i);

			// skip the class attribute node
			if (node1.id == 0) {
				continue;
			}

			for (int j = i + 1; j < this.totalNodes.size(); j++) {
				node2 = this.totalNodes.get(j);

				// skip the class attribute node
				if (node2.id == 0) {
					continue;
				}

				// compute the mutual information between the 2 attribute nodes
				iValue = calMutualInfoValue(node1, node2, value);
				mInfo = new AttrMutualInfo(iValue, node1, node2);
				mInfoArray.add(mInfo);
			}
		}

		// sort in descending order so that pairs with high mutual information
		// are used first when building the tree
		Collections.sort(mInfoArray);

		return mInfoArray;
	}

	/**
	 * Compute the mutual information value of 2 attribute nodes
	 * 
	 * @param node1
	 *            node 1
	 * @param node2
	 *            node 2
	 * @param value
	 *            class attribute value
	 */
	private double calMutualInfoValue(Node node1, Node node2, String value) {
		double iValue;
		double temp;
		// the three conditional probabilities
		double pXiXj;
		double pXi;
		double pXj;
		String[] array1;
		String[] array2;
		ArrayList<String> attrValues1;
		ArrayList<String> attrValues2;
		ArrayList<String[]> priorValues;
		// the conditioning attribute, here always the class attribute value
		ArrayList<String[]> backValues;

		array1 = new String[2];
		array2 = new String[2];
		priorValues = new ArrayList<>();
		backValues = new ArrayList<>();
		iValue = 0;

		array1[0] = classAttrName;
		array1[1] = value;
		// the conditioning attribute is the class attribute
		backValues.add(array1);

		// get the value sets of the 2 node attributes
		attrValues1 = this.attr2Values.get(node1.name);
		attrValues2 = this.attr2Values.get(node2.name);

		for (String v1 : attrValues1) {
			for (String v2 : attrValues2) {
				priorValues.clear();

				array1 = new String[2];
				array1[0] = node1.name;
				array1[1] = v1;
				priorValues.add(array1);

				array2 = new String[2];
				array2[0] = node2.name;
				array2[1] = v2;
				priorValues.add(array2);

				// compute the 3 conditional probabilities
				pXiXj = queryConditionPro(priorValues, backValues);

				priorValues.clear();
				priorValues.add(array1);
				pXi = queryConditionPro(priorValues, backValues);

				priorValues.clear();
				priorValues.add(array2);
				pXj = queryConditionPro(priorValues, backValues);

				// if one of the counted probabilities is 0, set the term to 0
				if (pXiXj == 0 || pXi == 0 || pXj == 0) {
					temp = 0;
				} else {
					// apply the formula to this attribute value pair
					temp = pXiXj * Math.log(pXiXj / (pXi * pXj)) / Math.log(2);
				}

				// the sum over all value pairs is the mutual information value
				iValue += temp;
			}
		}

		return iValue;
	}
}
Scenario Test Class Client.java:

package DataMining_TAN;

/**
 * TAN tree-augmented naive Bayes algorithm test class
 * 
 * @author lyq
 */
public class Client {
	public static void main(String[] args) {
		String filePath = "C:\\Users\\lyq\\Desktop\\icon\\input.txt";
		// conditional query statement
		String queryStr;
		// probability of classification result 1
		double classResult1;
		// probability of classification result 2
		double classResult2;

		TanTool tool = new TanTool(filePath);
		queryStr = "Outlook=Sunny,Temperature=Hot,Humidity=High,Wind=Weak,PlayTennis=No";
		classResult1 = tool.calHappenedPro(queryStr);

		queryStr = "Outlook=Sunny,Temperature=Hot,Humidity=High,Wind=Weak,PlayTennis=Yes";
		classResult2 = tool.calHappenedPro(queryStr);

		System.out.println(String.format("The probability of class %s is %s",
				"PlayTennis=No", classResult1));
		System.out.println(String.format("The probability of class %s is %s",
				"PlayTennis=Yes", classResult2));
		if (classResult1 > classResult2) {
			System.out.println("The classification result is PlayTennis=No");
		} else {
			System.out.println("The classification result is PlayTennis=Yes");
		}
	}
}
Result output:

The probability of class PlayTennis=No is 3.571428571428571E-5
The probability of class PlayTennis=Yes is 0.09523809523809525
The classification result is PlayTennis=Yes

Reference Documents

Baidu Encyclopedia

Bayesian Network Classifiers and Applications, author: Yu Minjie

Research and Application of the TAN Classifier in Data Mining, authors: Sun Lihei et al.


More data mining algorithms

Https://github.com/linyiqun/DataMiningAlgorithm

