Naive Bayesian classification algorithm

Source: Internet
Author: User

Reference Address: http://www.cnblogs.com/leoo2sk/archive/2010/09/17/naive-bayesian-classifier.html

Source code of my data mining algorithm implementations: https://github.com/linyiqun/DataMiningAlgorithm

Introduction

To introduce the naive Bayes algorithm (Naive Bayes), we must first introduce Bayesian classification. Bayesian classifiers are a family of statistical classification algorithms that use knowledge of probability and statistics to classify data. The naive Bayes algorithm is the simplest of the Bayesian algorithms. It is called "naive" because it assumes that all of its conditions (the feature attributes) are mutually independent, an assumption that greatly simplifies the calculations that follow.

The principle of naive Bayesian algorithm

First, a probability formula is used here:

P(B|A) = P(A|B) * P(B) / P(A)

P(B|A) denotes the probability that event B occurs given that event A has occurred, which in probability theory is a conditional probability. The great power of Bayes' formula is that it exchanges the direction of the cause-and-effect relation: using the formula above, P(A|B) can be computed from P(B|A) simply by performing the conversion.
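As a quick worked example with invented numbers: if P(A|B) = 0.5, P(B) = 0.2 and P(A) = 0.25, then P(B|A) = 0.5 * 0.2 / 0.25 = 0.4.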

The resource at the address above already explains the principle of the naive Bayes algorithm very clearly; I have added a few comments on top of it to make the code below easier to understand:

The formal definition of naive Bayes classification is as follows:

1. Let x = {a1, a2, ..., am} be an item to be classified, where each ai is a feature attribute of x. (In the example below, x = {"Youth", "Medium", "Yes", "Fair"}, and these 4 values form its feature vector.)

2. Let C = {y1, y2, ..., yn} be the set of categories. (In the example below the only category attribute is BuysComputer, with the classes Yes and No, so C = {Yes, No}.)

3. Compute P(y1|x), P(y2|x), ..., P(yn|x). (The task here is to compute the probability of the Yes and No events occurring given the event x, that is, P(Yes|x) and P(No|x).)

4. If P(yk|x) = max{P(y1|x), P(y2|x), ..., P(yn|x)}, then x belongs to yk. (After the values above are computed, the yi with the maximum probability is the classification of x. This is easy to understand: under the condition x, the category with the higher probability wins, which here means comparing P(Yes|x) and P(No|x).)

So the key now is how to compute the conditional probabilities in step 3. We can proceed as follows:

1. Find a set of items whose classifications are already known; this set is called the training sample set.

2. From the statistics of that set, obtain the conditional probability estimate of each feature attribute under each category, that is:

P(a1|y1), P(a2|y1), ..., P(am|y1); P(a1|y2), P(a2|y2), ..., P(am|y2); ...; P(a1|yn), P(a2|yn), ..., P(am|yn)

3. If the feature attributes are conditionally independent, then by Bayes' theorem we have the following derivation:

P(yi|x) = P(x|yi) * P(yi) / P(x)

Because the denominator P(x) is a constant for all categories, we only need to maximize the numerator. And because the feature attributes are conditionally independent, we have:

P(x|yi) * P(yi) = P(a1|yi) * P(a2|yi) * ... * P(am|yi) * P(yi)

Each P(ai|yi) can be calculated from simple counts, which is reflected in the program below.
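For a concrete instance of such a count-based estimate, look ahead to the training set input.txt below: P(Age=Youth|Yes) is estimated as the number of BuysComputer=Yes records whose Age is Youth, divided by the total number of Yes records, which is 2/9.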

Code implementation of the naive Bayes algorithm:

Input training set data, input.txt:

Rid Age Income Student CreditRating BuysComputer
1 Youth High No Fair No
2 Youth High No Excellent No
3 MiddleAged High No Fair Yes
4 Senior Medium No Fair Yes
5 Senior Low Yes Fair Yes
6 Senior Low Yes Excellent No
7 MiddleAged Low Yes Excellent Yes
8 Youth Medium No Fair No
9 Youth Low Yes Fair Yes
10 Senior Medium Yes Fair Yes
11 Youth Medium Yes Excellent Yes
12 MiddleAged Medium No Excellent Yes
13 MiddleAged High Yes Fair Yes
14 Senior Medium No Excellent No
The naive Bayes tool class:

package DataMining_NaiveBayes;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

/**
 * Naive Bayes algorithm tool class
 *
 * @author lyq
 */
public class NaiveBayesTool {
	// class labels; the data is divided into 2 classes, Yes and No
	private String YES = "Yes";
	private String NO = "No";

	// file path of the labeled training data set
	private String filePath;
	// array of attribute names
	private String[] attrNames;
	// training data set
	private String[][] data;

	// all value types of each attribute
	private HashMap<String, ArrayList<String>> attrValue;

	public NaiveBayesTool(String filePath) {
		this.filePath = filePath;

		readDataFile();
		initAttrValue();
	}

	/**
	 * Read the data from the file
	 */
	private void readDataFile() {
		File file = new File(filePath);
		ArrayList<String[]> dataArray = new ArrayList<String[]>();

		try {
			BufferedReader in = new BufferedReader(new FileReader(file));
			String str;
			String[] tempArray;
			while ((str = in.readLine()) != null) {
				tempArray = str.split(" ");
				dataArray.add(tempArray);
			}
			in.close();
		} catch (IOException e) {
			e.printStackTrace();
		}

		data = new String[dataArray.size()][];
		dataArray.toArray(data);
		attrNames = data[0];
	}

	/**
	 * Initialize all the value types of each attribute,
	 * used later when computing the conditional probabilities
	 */
	private void initAttrValue() {
		attrValue = new HashMap<>();
		ArrayList<String> tempValues;

		// go column by column, from left to right
		for (int j = 1; j < attrNames.length; j++) {
			// collect the values of one column from top to bottom
			tempValues = new ArrayList<>();
			for (int i = 1; i < data.length; i++) {
				if (!tempValues.contains(data[i][j])) {
					// if this attribute value has not been added yet, add it
					tempValues.add(data[i][j]);
				}
			}

			// all values of this column have been traversed;
			// copy them into the attribute map
			attrValue.put(data[0][j], tempValues);
		}
	}

	/**
	 * Compute the probability that condition occurs given class classType
	 *
	 * @param condition
	 *            the attribute condition
	 * @param classType
	 *            the class label
	 * @return
	 */
	private double computeConditionProbably(String condition, String classType) {
		// counter for the condition
		int count = 0;
		// column index of the condition's attribute
		int attrIndex = 1;
		// records labeled Yes
		ArrayList<String[]> yClassData = new ArrayList<>();
		// records labeled No
		ArrayList<String[]> nClassData = new ArrayList<>();
		ArrayList<String[]> classData;

		for (int i = 1; i < data.length; i++) {
			// split the data into the Yes and No groups
			if (data[i][attrNames.length - 1].equals(YES)) {
				yClassData.add(data[i]);
			} else {
				nClassData.add(data[i]);
			}
		}

		if (classType.equals(YES)) {
			classData = yClassData;
		} else {
			classData = nClassData;
		}

		// if no condition is given, compute the plain class prior probability
		if (condition == null) {
			return 1.0 * classData.size() / (data.length - 1);
		}

		// find the attribute column this condition belongs to
		attrIndex = getConditionAttrName(condition);
		for (String[] s : classData) {
			if (s[attrIndex].equals(condition)) {
				count++;
			}
		}

		return 1.0 * count / classData.size();
	}

	/**
	 * Return the column index of the attribute a condition value belongs to
	 *
	 * @param condition
	 *            the condition value
	 * @return
	 */
	private int getConditionAttrName(String condition) {
		// name of the attribute the condition belongs to
		String attrName = "";
		// column index of the condition's attribute
		int attrIndex = 1;

		// temporary list of attribute value types
		ArrayList<String> valueTypes;
		for (Map.Entry<String, ArrayList<String>> entry : attrValue.entrySet()) {
			valueTypes = entry.getValue();
			if (valueTypes.contains(condition)
					&& !entry.getKey().equals("BuysComputer")) {
				attrName = entry.getKey();
			}
		}

		for (int i = 0; i < attrNames.length - 1; i++) {
			if (attrNames[i].equals(attrName)) {
				attrIndex = i;
				break;
			}
		}

		return attrIndex;
	}

	/**
	 * Do the naive Bayes classification
	 *
	 * @param data
	 *            the record to be classified
	 */
	public String naiveBayesClassificate(String data) {
		// feature values of the test record
		String[] dataFeatures;
		// probability of event x given Yes
		double xWhenYes = 1.0;
		// probability of event x given No
		double xWhenNo = 1.0;
		// the final overall probabilities of the Yes and No classifications,
		// computed with the formula P(x|Ci) * P(Ci)
		double pYes = 1;
		double pNo = 1;

		dataFeatures = data.split(" ");
		for (int i = 0; i < dataFeatures.length; i++) {
			// because naive Bayes assumes class-conditional independence,
			// the conditional probabilities can simply be multiplied
			xWhenYes *= computeConditionProbably(dataFeatures[i], YES);
			xWhenNo *= computeConditionProbably(dataFeatures[i], NO);
		}

		pYes = xWhenYes * computeConditionProbably(null, YES);
		pNo = xWhenNo * computeConditionProbably(null, NO);

		return (pYes > pNo ? YES : NO);
	}
}
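The original post does not show the test client, so here is a minimal driver sketch; it assumes the training data above is saved as input.txt in the working directory (the file name and location are my assumption):

package DataMining_NaiveBayes;

/**
 * Minimal test client for NaiveBayesTool (a sketch; the original post does
 * not include its driver, and the file path below is an assumed location).
 */
public class Client {
	public static void main(String[] args) {
		// path to the training data file; adjust to your environment
		String filePath = "input.txt";
		NaiveBayesTool tool = new NaiveBayesTool(filePath);

		// classify a new record given as space-separated attribute values
		String testData = "Youth Medium Yes Fair";
		System.out.println(testData + " data is classified as: "
				+ tool.naiveBayesClassificate(testData));
	}
}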
The final test results are:

Youth Medium Yes Fair data is classified as: Yes
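We can check this result by hand from the counts in input.txt (9 Yes records, 5 No records):

P(Yes) = 9/14, P(No) = 5/14
P(Youth|Yes) = 2/9, P(Medium|Yes) = 4/9, P(Student=Yes|Yes) = 6/9, P(Fair|Yes) = 6/9
P(Youth|No) = 3/5, P(Medium|No) = 2/5, P(Student=Yes|No) = 1/5, P(Fair|No) = 2/5

P(x|Yes) * P(Yes) = (2/9) * (4/9) * (6/9) * (6/9) * (9/14) ≈ 0.0282
P(x|No) * P(No) = (3/5) * (2/5) * (1/5) * (2/5) * (5/14) ≈ 0.0069

Since 0.0282 > 0.0069, the record is classified as Yes, which matches the program's output.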

Notes on the naive Bayes algorithm:

1. When the values of a feature attribute are continuous rather than discrete, the probability needs to be computed through the Gaussian (normal) distribution:

P(ak|yi) = g(ak, μ(yi), σ(yi)), where g(x, μ, σ) = e^(-(x - μ)² / (2σ²)) / (σ * sqrt(2π))

As a result, as long as the mean μ and standard deviation σ of the attribute are calculated for each category in the training sample, the desired estimate can be obtained by substituting them into the formula above. The calculation of the mean and standard deviation is not covered here.
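As an illustration, here is a small self-contained Java sketch of such a Gaussian estimate (the class and method names are mine, not part of the original tool class, and the sample values are invented):

/**
 * A sketch of handling a continuous attribute with the Gaussian
 * distribution (helper names are illustrative, not from the original code).
 */
public class GaussianEstimate {
	// Gaussian (normal) probability density g(x, mean, stdDev)
	public static double gaussian(double x, double mean, double stdDev) {
		double exponent = -(x - mean) * (x - mean) / (2 * stdDev * stdDev);
		return Math.exp(exponent) / (Math.sqrt(2 * Math.PI) * stdDev);
	}

	public static void main(String[] args) {
		// example: ages of the Yes records in some training set (invented values)
		double[] agesWhenYes = { 25, 32, 40, 38, 29 };

		// compute the mean of the attribute within the category
		double mean = 0;
		for (double a : agesWhenYes) {
			mean += a;
		}
		mean /= agesWhenYes.length;

		// compute the standard deviation within the category
		double variance = 0;
		for (double a : agesWhenYes) {
			variance += (a - mean) * (a - mean);
		}
		variance /= agesWhenYes.length;
		double stdDev = Math.sqrt(variance);

		// density used in place of the count-based estimate P(Age=35|Yes)
		System.out.println("g(35) = " + gaussian(35, mean, stdDev));
	}
}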

2. To avoid conditional probability estimates that are statistically 0, Laplace calibration (smoothing) is introduced. Its idea is very simple: add 1 to the count of every attribute value under each category, so that no estimate is ever exactly 0.
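As a sketch of how this could look inside computeConditionProbably above (the +1 adjustment and the variable k are my illustration of the calibration, not part of the original code):

// Laplace calibration sketch: replace the return statement
//   return 1.0 * count / classData.size();
// with an add-one estimate, where k is the number of distinct
// values the condition's attribute can take:
int k = attrValue.get(attrNames[attrIndex]).size();
return 1.0 * (count + 1) / (classData.size() + k);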

