The core idea of the naive Bayes algorithm is: calculate the probability that a given sample belongs to each class, then pick the class with the highest probability as the prediction.
Suppose a sample has two features, x and y. The probability that it belongs to class c1 is written P(c1|x,y). This value cannot be read directly off the training samples; it has to be computed indirectly using Bayes' formula:
P(ci|x,y) = P(x,y|ci) * P(ci) / P(x,y)
Here P(ci) is the probability of class ci in the training set, which equals the number of ci samples divided by the total number of samples.
P(x,y) is the probability that a sample has both feature values, which equals the number of samples whose first feature equals x and whose second feature equals y, divided by the total number of samples. Notice that P(x,y) does not depend on the class, so it can be dropped from the actual calculation without changing which class wins.
P(x,y|ci) is the probability that a sample of class ci has both feature values. The naive Bayes algorithm assumes that x and y are mutually independent, so P(x,y|ci) = P(x|ci) * P(y|ci), where P(x|ci) is the probability that the first feature equals x among the samples of class ci.
The example above is only two-dimensional, but it extends directly to n dimensions: since every feature is assumed independent, P(w|ci) can always be decomposed into per-feature factors and evaluated.
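As a quick sanity check of the decomposition, the computation can be worked through on a tiny made-up example (the counts below are invented purely for illustration; they are not from the mushroom data used later):

```csharp
using System;

class NaiveBayesToy
{
    static void Main()
    {
        // Invented counts: 8 training samples in total.
        // Class c1: 4 samples; the query's x value occurs in 3 of them, its y value in 2.
        // Class c2: 4 samples; the query's x value occurs in 1 of them, its y value in 1.
        int total = 8;
        int c1Count = 4, c1X = 3, c1Y = 2;
        int c2Count = 4, c2X = 1, c2Y = 1;

        // P(ci|x,y) is proportional to P(x|ci) * P(y|ci) * P(ci);
        // the shared P(x,y) denominator is dropped.
        double scoreC1 = (c1X / (double)c1Count) * (c1Y / (double)c1Count) * (c1Count / (double)total);
        double scoreC2 = (c2X / (double)c2Count) * (c2Y / (double)c2Count) * (c2Count / (double)total);

        // scoreC1 = 0.75 * 0.5 * 0.5 = 0.1875; scoreC2 = 0.25 * 0.25 * 0.5 = 0.03125
        Console.WriteLine(scoreC1 > scoreC2 ? "c1" : "c2");  // prints "c1"
    }
}
```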
Here is the C# code:
using System;
using System.Collections.Generic;
using System.Linq;

namespace MachineLearning
{
    /// <summary>
    /// Naive Bayes
    /// </summary>
    public class NaiveBayes
    {
        private List<DataVector<string>> m_TrainingSet;

        /// <summary>
        /// Training
        /// </summary>
        /// <param name="trainingSet"></param>
        public void Train(List<DataVector<string>> trainingSet)
        {
            m_TrainingSet = trainingSet;
        }

        /// <summary>
        /// Classification
        /// </summary>
        /// <param name="vector"></param>
        /// <returns></returns>
        public string Classify(DataVector<string> vector)
        {
            var classProbDict = new Dictionary<string, double>();

            // Get all categories
            var typeDict = new Dictionary<string, int>();
            foreach (var item in m_TrainingSet)
                typeDict[item.Label] = 0;

            // Calculate the probability of each category
            foreach (string type in typeDict.Keys)
                classProbDict[type] = GetClassProb(vector, type);

            // Find the maximum
            double max = double.NegativeInfinity;
            string label = string.Empty;
            foreach (var type in classProbDict.Keys)
            {
                if (classProbDict[type] > max)
                {
                    max = classProbDict[type];
                    label = type;
                }
            }
            return label;
        }

        /// <summary>
        /// Classification (fast)
        /// </summary>
        /// <param name="vector"></param>
        /// <returns></returns>
        public string ClassifyFast(DataVector<string> vector)
        {
            var typeCount = new Dictionary<string, int>();
            var featureCount = new Dictionary<string, Dictionary<int, int>>();

            // A single pass over the training set collects all the counts we need
            foreach (var item in m_TrainingSet)
            {
                if (!typeCount.ContainsKey(item.Label))
                {
                    typeCount[item.Label] = 0;
                    featureCount[item.Label] = new Dictionary<int, int>();
                    for (int i = 0; i < vector.Dimension; ++i)
                        featureCount[item.Label][i] = 0;
                }

                // Number of samples in this category
                typeCount[item.Label]++;

                // For each dimension (feature), count matches within this category
                for (int i = 0; i < vector.Dimension; ++i)
                {
                    if (string.Equals(vector.Data[i], item.Data[i]))
                        featureCount[item.Label][i]++;
                }
            }

            // Then compute the probabilities
            double maxProb = double.NegativeInfinity;
            string bestLabel = string.Empty;
            foreach (string type in typeCount.Keys)
            {
                // Calculate P(Ci)
                double classProb = typeCount[type] * 1.0 / m_TrainingSet.Count;

                // Calculate P(f1|Ci) * P(f2|Ci) * ...
                double featureProb = 1.0;
                for (int i = 0; i < vector.Dimension; ++i)
                    featureProb = featureProb * (featureCount[type][i] * 1.0 / typeCount[type]);

                // Calculate P(Ci|w), ignoring the P(w) part
                double typeProb = featureProb * classProb;

                // Keep the maximum probability
                if (typeProb > maxProb)
                {
                    maxProb = typeProb;
                    bestLabel = type;
                }
            }
            return bestLabel;
        }

        /// <summary>
        /// Get the probability of the specified category
        /// </summary>
        /// <param name="vector"></param>
        /// <param name="type"></param>
        /// <returns></returns>
        private double GetClassProb(DataVector<string> vector, string type)
        {
            double classProb = 0.0;
            double featureProb = 1.0;

            // Count the samples of this category in the training set, for P(Ci)
            int typeCount = m_TrainingSet.Count(p => string.Equals(p.Label, type));

            // Iterate over each dimension (feature)
            for (int i = 0; i < vector.Dimension; ++i)
            {
                // Count the samples in this category that match this feature, for P(fn|Ci)
                int featureCount = m_TrainingSet.Count(p => string.Equals(p.Data[i], vector.Data[i]) && string.Equals(p.Label, type));

                // Calculate P(fn|Ci)
                featureProb = featureProb * (featureCount * 1.0 / typeCount);
            }

            // Calculate P(Ci|w), ignoring the P(w) part
            classProb = featureProb * (typeCount * 1.0 / m_TrainingSet.Count);
            return classProb;
        }
    }
}
The Classify method implements the algorithm in the most direct and intuitive way, which makes it easy to understand, but because it traverses the training set repeatedly, efficiency suffers when there are many training samples. The ClassifyFast method collects all the needed counts in a single pass, which improves efficiency; the underlying algorithm is the same, and so are the results.
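One practical caveat that neither method handles: with many features, the product of per-feature probabilities can underflow a double to exactly 0. A common variant (a sketch of the standard trick, not part of the code above, and assuming all probabilities are strictly positive) compares sums of logarithms instead, which preserves the argmax:

```csharp
using System;
using System.Linq;

class LogSpaceSketch
{
    // Compare class scores in log space: log P(Ci) + sum_i log P(fi|Ci).
    // The logarithm is monotonic, so the winning class is the same as with
    // the raw product, but the sum cannot underflow to zero.
    static double LogScore(double classProb, double[] featureProbs)
    {
        return Math.Log(classProb) + featureProbs.Sum(p => Math.Log(p));
    }

    static void Main()
    {
        // 400 features with probability 0.1 each: the raw product
        // 0.5 * 0.1^400 underflows to exactly 0, but the log score stays finite.
        var probs = Enumerable.Repeat(0.1, 400).ToArray();
        double raw = probs.Aggregate(0.5, (acc, p) => acc * p);
        Console.WriteLine(raw == 0.0);                              // prints "True"
        Console.WriteLine(double.IsInfinity(LogScore(0.5, probs))); // prints "False"
    }
}
```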
One thing to watch for in practice: several classes may end up with the same probability, or every class may get probability 0. In those cases a class is usually picked at random as the result. Sometimes, though, ties deserve more careful handling. For example, when using Bayes to identify spam, if the probabilities are equal, or even if the difference between them is small, the message should be treated as non-spam, because failing to catch a spam message does far less harm than misclassifying a legitimate message as spam.
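For the all-zero-probability case mentioned above, a standard remedy (not used in the code above) is Laplace smoothing: add 1 to every feature count and add the number of distinct values the feature can take to the denominator, so that no conditional probability is ever exactly zero. A minimal sketch, using the same counts that GetClassProb computes:

```csharp
using System;

class LaplaceSketch
{
    // Smoothed estimate of P(fn|Ci): (count + 1) / (classCount + valueCount),
    // where valueCount is the number of distinct values feature n can take.
    // A single unseen feature value no longer zeroes out the whole product.
    static double SmoothedFeatureProb(int featureCount, int classCount, int valueCount)
    {
        return (featureCount + 1.0) / (classCount + valueCount);
    }

    static void Main()
    {
        // A feature value never seen in this class (0 of 50 samples, 3 possible values):
        // the unsmoothed estimate would be 0; the smoothed one is 1/53.
        Console.WriteLine(SmoothedFeatureProb(0, 50, 3) > 0.0);  // prints "True"
    }
}
```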
For a test, we again use the poisonous-mushroom data from last time, but with less of it: 2000 samples for training, then 500 samples to measure the error rate.
public void TestBayes()
{
    var trainingSet = new List<DataVector<string>>();
    var testSet = new List<DataVector<string>>();

    // Read data
    var file = new StreamReader("Agaricus-lepiota.txt", Encoding.Default);
    string line = string.Empty;
    for (int i = 0; i < 2500; ++i)
    {
        line = file.ReadLine();
        if (line == null)
            break;

        var parts = line.Split(',');
        var p = new DataVector<string>(22);   // 22 features in the mushroom data
        p.Label = parts[0];
        for (int j = 0; j < p.Dimension; ++j)
            p.Data[j] = parts[j + 1];

        if (i < 2000)
            trainingSet.Add(p);
        else
            testSet.Add(p);
    }
    file.Close();

    // Test
    var bayes = new NaiveBayes();
    bayes.Train(trainingSet);
    int error = 0;
    foreach (var p in testSet)
    {
        var label = bayes.ClassifyFast(p);
        if (label != p.Label)
            ++error;
    }
    Console.WriteLine("error = {0}/{1}, {2}%", error, testSet.Count, (error * 100.0 / testSet.Count));
}
The test result is a 0% error rate, which is a bit unexpected. Changing which samples are used for training and testing changes the error rate; for example, under the same conditions as last time (7000 training samples + 1124 test samples), the error rate is 4.18%. Compared with the 50% error rate of random guessing, that is quite accurate.
This article is from the "Rabbit Nest" blog; please retain this source when reposting: http://boytnt.blog.51cto.com/966121/1571102