Machine learning algorithms: Naive Bayes


The core idea of the naive Bayes algorithm is: calculate the probability that a given sample belongs to each classification, then select the classification with the highest probability as the result.


Assume a sample has 2 features, x and y. The probability that it belongs to classification c1 is written P(c1|x,y). This value cannot be read directly from the training samples; it must be calculated indirectly using Bayes' theorem:

P(ci|x,y) = P(x,y|ci) * P(ci) / P(x,y)

where P(ci) is the probability of classification ci in the training samples, which equals the number of ci samples divided by the total number of samples.

P(x,y) is the probability of a sample having both feature values, which equals the number of samples whose 1st feature equals x and whose 2nd feature equals y, divided by the total number of samples. Note that P(x,y) does not depend on the classification, so it can be ignored in the actual calculation without affecting which classification wins.

P(x,y|ci) is the probability of a sample having both feature values within classification ci. The naive Bayes algorithm assumes that x and y are independent of each other, so P(x,y|ci) = P(x|ci) * P(y|ci), where P(x|ci) is the probability that the 1st feature equals x among the samples of classification ci.


The example above uses only 2 dimensions, but it extends directly to n dimensions: since every feature is assumed independent, P(w|ci) can always be decomposed into a product of per-feature probabilities and evaluated.
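To make the decomposition concrete, here is a minimal sketch with a hypothetical toy training set (three samples, two features, two classes; all names and numbers are invented for illustration, worked out by hand in the comments):

using System;

class NaiveBayesToy
{
    // Score for class ci: P(x|ci) * P(y|ci) * P(ci), with P(x,y) ignored
    // as described above.
    public static double Score(double pxGivenC, double pyGivenC, double pC)
    {
        return pxGivenC * pyGivenC * pC;
    }

    public static void Main()
    {
        // Hypothetical training set:
        //   class c1: (sunny, hot), (sunny, mild)   =>  P(c1) = 2/3
        //   class c2: (rainy, mild)                 =>  P(c2) = 1/3
        // Classify the sample (x = sunny, y = mild):
        //   P(x|c1) = 2/2, P(y|c1) = 1/2
        //   P(x|c2) = 0/1, P(y|c2) = 1/1
        double scoreC1 = Score(2.0 / 2.0, 1.0 / 2.0, 2.0 / 3.0);  // = 1/3
        double scoreC2 = Score(0.0 / 1.0, 1.0 / 1.0, 1.0 / 3.0);  // = 0

        Console.WriteLine(scoreC1 > scoreC2 ? "c1" : "c2");  // prints "c1"
    }
}

Since scoreC1 (1/3) beats scoreC2 (0), the sample is classified as c1.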


The C# code:

using System;
using System.Collections.Generic;
using System.Linq;

namespace MachineLearning
{
    /// <summary>
    /// Naive Bayes
    /// </summary>
    public class NaiveBayes
    {
        private List<DataVector<string>> m_TrainingSet;

        /// <summary>
        /// Training
        /// </summary>
        /// <param name="trainingSet"></param>
        public void Train(List<DataVector<string>> trainingSet)
        {
            m_TrainingSet = trainingSet;
        }

        /// <summary>
        /// Classification
        /// </summary>
        /// <param name="vector"></param>
        /// <returns></returns>
        public string Classify(DataVector<string> vector)
        {
            var classProbDict = new Dictionary<string, double>();

            // Get all categories
            var typeDict = new Dictionary<string, int>();
            foreach (var item in m_TrainingSet)
                typeDict[item.Label] = 0;

            // Calculate the probability of each classification
            foreach (string type in typeDict.Keys)
                classProbDict[type] = GetClassProb(vector, type);

            // Find the maximum value
            double max = double.NegativeInfinity;
            string label = string.Empty;
            foreach (var type in classProbDict.Keys)
            {
                if (classProbDict[type] > max)
                {
                    max = classProbDict[type];
                    label = type;
                }
            }
            return label;
        }

        /// <summary>
        /// Classification (fast)
        /// </summary>
        /// <param name="vector"></param>
        /// <returns></returns>
        public string ClassifyFast(DataVector<string> vector)
        {
            var typeCount = new Dictionary<string, int>();
            var featureCount = new Dictionary<string, Dictionary<int, int>>();

            // First, a single pass over the training set collects all the counts needed
            foreach (var item in m_TrainingSet)
            {
                if (!typeCount.ContainsKey(item.Label))
                {
                    typeCount[item.Label] = 0;
                    featureCount[item.Label] = new Dictionary<int, int>();
                    for (int i = 0; i < vector.Dimension; ++i)
                        featureCount[item.Label][i] = 0;
                }

                // Number of samples in the corresponding classification
                typeCount[item.Label]++;

                // Traverse each dimension (feature), adding to the count of
                // matching features for the corresponding classification
                for (int i = 0; i < vector.Dimension; ++i)
                {
                    if (string.Equals(vector.Data[i], item.Data[i]))
                        featureCount[item.Label][i]++;
                }
            }

            // Then calculate the probabilities
            double maxProb = double.NegativeInfinity;
            string bestLabel = string.Empty;
            foreach (string type in typeCount.Keys)
            {
                // Calculate P(Ci)
                double classProb = typeCount[type] * 1.0 / m_TrainingSet.Count;

                // Calculate P(f1|Ci) * P(f2|Ci) * ...
                double featureProb = 1.0;
                for (int i = 0; i < vector.Dimension; ++i)
                    featureProb = featureProb * (featureCount[type][i] * 1.0 / typeCount[type]);

                // Calculate P(Ci|w), ignoring the P(w) part
                double typeProb = featureProb * classProb;

                // Keep the maximum probability
                if (typeProb > maxProb)
                {
                    maxProb = typeProb;
                    bestLabel = type;
                }
            }
            return bestLabel;
        }

        /// <summary>
        /// Get the probability of a specified classification
        /// </summary>
        /// <param name="vector"></param>
        /// <param name="type"></param>
        /// <returns></returns>
        private double GetClassProb(DataVector<string> vector, string type)
        {
            double classProb = 0.0;
            double featureProb = 1.0;

            // Count the samples of this classification in the training set,
            // used to calculate P(Ci)
            int typeCount = m_TrainingSet.Count(p => string.Equals(p.Label, type));

            // Traverse each dimension (feature)
            for (int i = 0; i < vector.Dimension; ++i)
            {
                // Count the samples in this classification that match this
                // feature, for the calculation of P(fn|Ci)
                int featureCount = m_TrainingSet.Count(p => string.Equals(p.Data[i], vector.Data[i])
                    && string.Equals(p.Label, type));

                // Calculate P(fn|Ci)
                featureProb = featureProb * (featureCount * 1.0 / typeCount);
            }

            // Calculate P(Ci|w), ignoring the P(w) part
            classProb = featureProb * (typeCount * 1.0 / m_TrainingSet.Count);

            return classProb;
        }
    }
}


The Classify method in the code implements the algorithm in the most direct and intuitive way, which makes it easy to understand, but because it traverses the training set repeatedly, it is slow when there are many training samples. The ClassifyFast method collects all counts in a single pass, reducing the number of traversals and improving efficiency. The two methods implement the same algorithm and produce the same results.


Note that in practice several classifications may tie for the highest probability, or every classification may have probability 0; in that case a classification is usually selected at random as the result. Sometimes, though, ties deserve more care. For example, when using Bayes to identify spam, if the probabilities are equal, or even if they differ only slightly, the message should be treated as non-spam, because failing to catch a spam message does far less harm than marking a normal message as spam.
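One way to encode that bias is to require the spam score to beat the ham score by a margin before reporting spam. A minimal sketch, with hypothetical names (DecideSpam, margin) and a margin value chosen arbitrarily for illustration:

using System;

class CautiousClassifier
{
    // Hypothetical helper: only report "spam" when its score beats "ham"
    // by a multiplicative margin, since a false positive (good mail marked
    // as spam) is far more costly than a false negative.
    public static string DecideSpam(double spamScore, double hamScore, double margin = 1.2)
    {
        return spamScore > hamScore * margin ? "spam" : "ham";
    }

    public static void Main()
    {
        Console.WriteLine(DecideSpam(0.50, 0.50));  // tie        -> "ham"
        Console.WriteLine(DecideSpam(0.55, 0.50));  // small lead -> "ham"
        Console.WriteLine(DecideSpam(0.90, 0.10));  // clear lead -> "spam"
    }
}

Ties and near-ties go to "ham"; only a clear lead produces "spam".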



Let's again test with the poisonous-mushroom data from the previous article, this time with a smaller data set: 2000 samples for training, then 500 samples for measuring the error rate.

public void TestBayes()
{
    var trainingSet = new List<DataVector<string>>();
    var testSet = new List<DataVector<string>>();

    var file = new StreamReader("agaricus-lepiota.txt", Encoding.Default);

    // Read data
    string line = string.Empty;
    for (int i = 0; i < 2500; ++i)
    {
        line = file.ReadLine();
        if (line == null)
            break;

        var parts = line.Split(',');
        var p = new DataVector<string>(22);
        p.Label = parts[0];
        for (int j = 0; j < p.Dimension; ++j)
            p.Data[j] = parts[j + 1];

        if (i < 2000)
            trainingSet.Add(p);
        else
            testSet.Add(p);
    }
    file.Close();

    // Test
    var bayes = new NaiveBayes();
    bayes.Train(trainingSet);
    int error = 0;
    foreach (var p in testSet)
    {
        var label = bayes.ClassifyFast(p);
        if (label != p.Label)
            ++error;
    }
    Console.WriteLine("error = {0}/{1}, {2}%", error, testSet.Count, (error * 100.0 / testSet.Count));
}


The test result is a 0% error rate, which is a bit unexpected. Changing the training and test samples changes the error rate: for example, with the same split as before (7000 training samples + 1124 test samples), the error rate is 4.18%. Compared with the 50% error rate of random guessing, that is quite accurate.



This article is from the "Rabbit Nest" blog, please be sure to keep this source http://boytnt.blog.51cto.com/966121/1571102

