A simple example
The naive Bayes algorithm is a classic statistical learning method whose main theoretical basis is Bayes' formula. The basic form of Bayes' formula is as follows:
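The formula itself did not survive the conversion; the standard statement of Bayes' rule, in the notation used below, is:

P(y_k \mid x) = \frac{P(x \mid y_k)\, P(y_k)}{\sum_{j} P(x \mid y_j)\, P(y_j)}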
Although this formula looks simple, it can summarize the past and predict the future: the right side of the formula is a summary of historical data, and the left side predicts a future outcome. If we view y as the category and x as the features, then P(y_k | x) is the probability of category y_k given that feature x has been observed, and the formula converts it into P(x | y_k), the distribution of the features within category y_k.
For example, back in university, a boy often went to the library in the evening for self-study and noticed that the girl he liked often went to the same study room. Delighted, he started bringing snacks there every day to keep her company. But the girl did not come every day, and as the weather got hotter and the library had no air conditioning, there was no point in going if she did not show up. Each time the boy worked up the courage to ask, "Hey, are you coming tomorrow?", the answer was, "I don't know, it depends." So he began recording, day by day, whether she went to the study room together with some other circumstances. Let y indicate whether she goes, y = {go, not go}, and let x be the conditions associated with going, such as which course was taught that day. After collecting statistics for a while, the boy decides not to bring snacks today, but first wants to predict whether she will go. Knowing that today's course is ordinary differential equations, he computes P(y = go | ordinary differential equations) and P(y = not go | ordinary differential equations) and picks whichever is larger: if P(y = go | ordinary differential equations) > P(y = not go | ordinary differential equations), he heads to the study room no matter how hot it is; otherwise he stays put and spares himself the trip. Computing P(y = go | ordinary differential equations) can be converted, via Bayes' formula, into P(ordinary differential equations | y = go), the probability that the day's course was ordinary differential equations on the days she went. Note that the denominator on the right side of the formula is the same for every category (go / not go), so the calculation can ignore it; the resulting values are no longer probabilities between 0 and 1, but their relative size is enough to pick a category.
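Restated in the notation above, with the common denominator dropped as just described, the comparison the boy makes is:

P(y=\text{go} \mid \text{ODE}) \propto P(\text{ODE} \mid y=\text{go})\, P(y=\text{go})
\quad\text{vs.}\quad
P(y=\text{not go} \mid \text{ODE}) \propto P(\text{ODE} \mid y=\text{not go})\, P(y=\text{not go})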
Later he realized there were other conditions worth mining, such as the day of the week, the weather, and the atmosphere the last time he sat with her in the study room. After another period of record keeping, he did the math and found a problem, because the "summary of history" side of the formula now looks like this:
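The formula did not survive the conversion; the quantity he would now have to estimate is the joint conditional distribution of all the features given y:

P\big(x^{(1)}, x^{(2)}, \dots, x^{(n)} \mid y\big)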
Here n = 4: x^(1) is the course, x^(2) the weather, x^(3) the day of the week, and x^(4) the atmosphere, and y is still {go, not go}. There are 8 courses, 3 kinds of weather (sunny, rainy, overcast), 7 days of the week, and 5 atmosphere levels (A+, A, B+, B, C), so the total number of parameters to estimate is 8 * 3 * 7 * 5 * 2 = 1680. Only one data point can be collected per day, so he would have graduated long before gathering 1680 observations. That would not do, so he made an independence assumption: assume that these factors influence whether she goes to the study room independently of one another, so that
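The factorized form that followed here (also lost in conversion) is the standard one:

P\big(x^{(1)}, x^{(2)}, x^{(3)}, x^{(4)} \mid y\big) = \prod_{i=1}^{4} P\big(x^{(i)} \mid y\big)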
With this independence assumption, the number of parameters to estimate drops to (8 + 3 + 7 + 5) * 2 = 46, and each day's record now contributes evidence to 4 of them, so the boy's predictions became more and more accurate.
Naive Bayesian classifier
With the little story above in mind, here is the formal representation of the naive Bayes classifier:
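The formula was lost in conversion; the standard decision rule it refers to is:

\hat{y} = \arg\max_{y_k} P(y_k \mid x)
        = \arg\max_{y_k} \frac{P(y_k) \prod_{i=1}^{n} P\big(x^{(i)} \mid y_k\big)}{\sum_{j} P(y_j) \prod_{i=1}^{n} P\big(x^{(i)} \mid y_j\big)}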
Given feature x, the conditional probability of every category is computed, and the category with the largest conditional probability is chosen as the predicted class. Since the denominator of the formula above is the same for every category, it can be left out of the computation.
The "naivety" of naive Bayes lies in the assumption that the conditions are mutually independent; with this independence assumption, the parameter space shrinks dramatically.
Application to text classification
Text classification has many applications: spam filtering is a binary classification problem, and news categorization and text sentiment analysis can also be treated as text classification. A classification task consists of two steps, training and prediction; to build a classification model, at least a training data set is needed. The Bayesian model applies very naturally to text categorization: given a document d, to decide which category c_k it belongs to, we only need to find the category with the largest posterior probability:
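Spelled out (the original formula was lost in conversion), the decision is:

c = \arg\max_{c_k} P(c_k \mid d) = \arg\max_{c_k} \frac{P(d \mid c_k)\, P(c_k)}{P(d)} = \arg\max_{c_k} P(d \mid c_k)\, P(c_k)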
In practice we do not use every word as a feature. For a document d we keep only some of its terms <t_1, t_2, ..., t_{n_d}> (n_d is the number of terms kept for d), because many terms carry no value for classification, for example stop words such as "the", "is", "in", which appear in every category and only blur the decision boundary. Feature word selection is covered in another article of mine. Once a document is represented by its feature terms, computing the category of document d becomes:
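The scoring formula referred to here, in its standard form, is:

P(c_k \mid d) \propto P(c_k) \prod_{j=1}^{n_d} P(t_j \mid c_k)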
Note that P(c_k | d) is only proportional to the expression on the right; the complete formula has a denominator, but as discussed earlier the denominator is the same for every category, so computing the numerator is enough for classification. In the actual computation, multiplying many probability values P(t_j | c_k) easily underflows to 0, so we switch to logarithms and the product becomes a sum:
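In log form the score becomes:

\text{score}(c_k, d) = \log P(c_k) + \sum_{j=1}^{n_d} \log P(t_j \mid c_k)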
All we need to compute from the training set are the prior probability P(c_k) of each category and the probability P(t_j | c_k) of each feature term within each category. These probabilities are obtained by maximum likelihood estimation, which boils down to counting how many times each word appears in each category and how many documents each category contains:
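The estimates referred to here (lost in conversion) are, in the notation defined next:

P(c_k) = \frac{N_{c_k}}{N}, \qquad P(t_j \mid c_k) = \frac{T_{jk}}{\sum_{t_i \in V} T_{ik}}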
Here N_{c_k} is the number of documents of class c_k in the training set, N is the total number of training documents, T_{jk} is the number of occurrences of term t_j in class c_k, and V is the vocabulary of terms across all categories. Note the positional independence assumption hidden here: as long as two occurrences are of the same word, their probability P(t_j | c_k) is the same regardless of where they appear in the document. This assumption does not match reality, since the same words arranged in a different order can express quite different content, but in practice it does not hurt accuracy much, because most text classification decisions are driven by which words differ between documents rather than where the words sit. Taking word positions into account would make the problem far too complex to handle.
One problem to watch out for is that a term t_j may never appear under class c_k in the training set yet appear in a test document of class c_k. Because T_{jk} is then 0, the product of probabilities becomes 0 and the document cannot be assigned to class c_k no matter how strongly the other feature words point there; with log accumulation, a zero value causes a computation error. The remedy is add-one (Laplace) smoothing, which amounts to pretending that every term appears at least once in every category, i.e.
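The smoothed estimate referred to here is:

P(t_j \mid c_k) = \frac{T_{jk} + 1}{\sum_{t_i \in V} (T_{ik} + 1)} = \frac{T_{jk} + 1}{\sum_{t_i \in V} T_{ik} + |V|}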
The following example comes from reference 1. Assume the training set below:
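The table itself did not survive conversion; the figures used in the calculations below match the classic worked example from Manning et al.'s Introduction to Information Retrieval, which reference 1 appears to follow:

docID     words in document                        in c = China?
1         Chinese Beijing Chinese                  yes
2         Chinese Chinese Shanghai                 yes
3         Chinese Macao                            yes
4         Tokyo Japan Chinese                      no
5 (test)  Chinese Chinese Chinese Tokyo Japan      ?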
Now we want to decide whether the test document with docID 5 belongs to the China category. First compute the class priors, P(c = China) = 3/4 and P(c != China) = 1/4, then the conditional probability of each term in each class:
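With the training set assumed above, the smoothed term probabilities work out to:

P(\text{Chinese} \mid c) = \frac{5+1}{8+6} = \frac{3}{7}, \qquad P(\text{Tokyo} \mid c) = P(\text{Japan} \mid c) = \frac{0+1}{8+6} = \frac{1}{14}
P(\text{Chinese} \mid \bar{c}) = \frac{1+1}{3+6} = \frac{2}{9}, \qquad P(\text{Tokyo} \mid \bar{c}) = P(\text{Japan} \mid \bar{c}) = \frac{1+1}{3+6} = \frac{2}{9}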
Note that the 8 in the denominator (8 + 6) is the total number of term occurrences in the China class, and the +6 is the smoothing term, 6 being the size of the vocabulary. Then the probability of the test document under each category is computed:
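Still under the assumed training set, the (unnormalized) posteriors are:

P(c \mid d_5) \propto \frac{3}{4} \cdot \left(\frac{3}{7}\right)^{3} \cdot \frac{1}{14} \cdot \frac{1}{14} \approx 0.0003
P(\bar{c} \mid d_5) \propto \frac{1}{4} \cdot \left(\frac{2}{9}\right)^{3} \cdot \frac{2}{9} \cdot \frac{2}{9} \approx 0.0001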
You can see that the test document should belong to the China category.
Text Classification Practice
I found the Sogou Labs Sohu news dataset (the concise historical edition), which contains 16,289 news articles in 9 categories such as automotive, finance, IT, and health. Sogou ships each news article as a separate TXT file, so I did some preprocessing and merged all articles into one text file, one article per line, keeping each article's ID; the first letter of the ID serves as the class label. An example line after preprocessing and word segmentation looks like this:
I used 6,289 articles as the training set and the remaining 10,000 for testing, and used mutual information to select the text features, ending up with roughly 700 feature words.
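For reference, the mutual-information criterion on which the countForMI function in the code below is based is usually written, for a term t and class c with contingency counts N_{e_t e_c}, as:

I(t, c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{N_{e_t e_c}}{N} \log_2 \frac{N \cdot N_{e_t e_c}}{N_{e_t \cdot}\, N_{\cdot e_c}}

The code approximates these counts with word-occurrence counts per class and adds 1 inside the logarithm to avoid taking the log of zero.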
The results of the classification are as follows:
8343 10000 0.8343
Out of 10,000 news articles, 8,343 were classified correctly, an accuracy of 0.8343. The point here is mainly to demonstrate the Bayesian classification process, so I only looked at accuracy, did not consider other evaluation metrics, and did not tune anything. Bayesian classification is very efficient: training only requires one scan of the training set to record how many times each word appears in each class and how many documents each class has, and testing only requires one scan of the test set. In terms of runtime efficiency, naive Bayes is about as fast as it gets, while accuracy still reaches a respectable level.
My implementation code is as follows:
#!encoding=utf-8
import random
import sys
import math
import collections

def shuffle():
    '''shuffle the original corpus so it can be split into training and test sets'''
    datas = [line.strip() for line in sys.stdin]
    random.shuffle(datas)
    for line in datas:
        print line

lables = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']

def lable2id(lable):
    for i in xrange(len(lables)):
        if lable == lables[i]:
            return i
    raise Exception('Error lable %s' % (lable))

def docdict():
    return [0] * len(lables)

def mutalInfo(N, Nij, Ni_, N_j):
    # mutual information contribution of one cell of the 2x2 contingency table
    return Nij * 1.0 / N * math.log(N * (Nij + 1) * 1.0 / (Ni_ * N_j)) / math.log(2)

def countForMI():
    '''select feature words by mutual information, based on how often each word
       appears in each class and how many documents each class has'''
    docCount = [0] * len(lables)  # number of documents in each class
    wordCount = collections.defaultdict(docdict)
    for line in sys.stdin:
        lable, text = line.strip().split(' ', 1)
        index = lable2id(lable[0])
        words = text.split(' ')
        for word in words:
            wordCount[word][index] += 1
        docCount[index] += 1

    miDict = collections.defaultdict(docdict)  # mutual information values
    N = sum(docCount)
    for k, vs in wordCount.items():
        for i in xrange(len(vs)):
            N11 = vs[i]
            N10 = sum(vs) - N11
            N01 = docCount[i] - N11
            N00 = N - N11 - N10 - N01
            mi = mutalInfo(N, N11, N10 + N11, N01 + N11) + mutalInfo(N, N10, N10 + N11, N00 + N10) \
               + mutalInfo(N, N01, N01 + N11, N01 + N00) + mutalInfo(N, N00, N00 + N10, N00 + N01)
            miDict[k][i] = mi
    fWords = set()
    for i in xrange(len(docCount)):
        keyf = lambda x: x[1][i]
        sortedDict = sorted(miDict.items(), key=keyf, reverse=True)
        # the per-class count was garbled in the source; 100 is an assumption
        # consistent with the ~700 total feature words mentioned above
        for j in xrange(100):
            fWords.add(sortedDict[j][0])
    print docCount  # print the number of documents in each class
    for fword in fWords:
        print fword

def loadFeatureWord():
    '''load the feature words'''
    f = open('feature.txt')
    docCounts = eval(f.readline())
    features = set()
    for line in f:
        features.add(line.strip())
    f.close()
    return docCounts, features

def trainBayes():
    '''train the Bayes model; in effect, count occurrences of each feature word in each class'''
    docCounts, features = loadFeatureWord()
    wordCount = collections.defaultdict(docdict)
    tCount = [0] * len(docCounts)  # total occurrences of feature words in each class
    for line in sys.stdin:
        lable, text = line.strip().split(' ', 1)
        index = lable2id(lable[0])
        words = text.split(' ')
        for word in words:
            if word in features:
                tCount[index] += 1
                wordCount[word][index] += 1
    for k, v in wordCount.items():
        scores = [(v[i] + 1) * 1.0 / (tCount[i] + len(wordCount)) for i in xrange(len(v))]  # add-1 smoothing
        print '%s\t%s' % (k, scores)

def loadModel():
    '''load the Bayes model'''
    f = open('model.txt')
    scores = {}
    for line in f:
        word, counts = line.strip().rsplit('\t', 1)
        scores[word] = eval(counts)
    f.close()
    return scores

def predict():
    '''predict the class label of each document, one document per line on standard input'''
    docCounts, features = loadFeatureWord()
    docScores = [math.log(count * 1.0 / sum(docCounts)) for count in docCounts]  # log priors
    scores = loadModel()
    rCount = 0
    docCount = 0
    for line in sys.stdin:
        lable, text = line.strip().split(' ', 1)
        index = lable2id(lable[0])
        words = text.split(' ')
        preValues = list(docScores)
        for word in words:
            if word in features:
                for i in xrange(len(preValues)):
                    preValues[i] += math.log(scores[word][i])
        m = max(preValues)
        pIndex = preValues.index(m)
        if pIndex == index:
            rCount += 1
        print lable, lables[pIndex], text
        docCount += 1
    print rCount, docCount, rCount * 1.0 / docCount

if __name__ == "__main__":
    #shuffle()
    #countForMI()
    #trainBayes()
    predict()
In the code, feature-word computation, model training, and testing are separate steps; the main block needs to be edited (uncomment the corresponding function) to switch between them. For example, to compute the feature words:
$ cat train.txt | python bayes.py > feature.txt
To train the model:
$ cat train.txt | python bayes.py > model.txt
To predict:
$ cat test.txt | python bayes.py > predict.out
Summary
This article introduced the naive Bayes classification method and gave a concrete application example using text classification. The "naive" part is the conditional independence assumption; applied to text classification, two assumptions are made: first, that each feature word influences the classification independently, and second, that the order of the terms within a document does not matter. The independence assumption rarely holds in practice, yet classification results are still good. One side effect of the assumption is that for a document d that truly belongs to class c_k, P(c_k | d) tends to be overestimated: where the true value might be P(c_k | d) = 0.55, naive Bayes may compute P(c_k | d) = 0.99. But since this does not change which class has the largest score, it does not affect the classification result, which is one reason the naive Bayes classifier works better on text classification than one might expect.
Reference:
1. Naive Bayes classification algorithm: http://www.cnblogs.com/fengfenggirl/p/bayes_classify.html