Translated from: http://blog.163.com/[email protected]/blog/static/1712321772010102802635243/
I pondered this for two days. The principle of naive Bayes itself was already clear to me: from reading many articles I knew that classification is done by comparing posterior probabilities computed from Bayes' formula, that each posterior is the product of a prior probability and a class-conditional probability, and that both of these are estimated from the training set. Those estimates are the naive Bayes model, saved as an intermediate result and looked up when a test document is classified. The big picture was clear, but a very important detail was not: how does the trained model relate to a new document to be classified? How are the conditional and prior probabilities derived from the training set applied to the test document? After carefully re-reading a few articles and sorting out in my head what I had seen before, I finally figured out what is going on, so I am hurrying to record it here for later reference. The example below is copied from an article I found especially detailed; it should be clear at a glance.
1. Basic definitions
Classification assigns a thing to a category. A thing has many attributes; treating those attributes as a vector x = (x1, x2, x3, ..., xn), the vector x represents the thing, and the set of all such vectors, written X, is called the attribute set. The categories likewise form a set, C = {c1, c2, ..., cm}. In general the relationship between x and c is not deterministic, so x and c can be treated as random variables. P(c|x) is called the posterior probability of c, and, relative to it, P(c) is called the prior probability of c.
By Bayes' formula, the posterior probability P(c|x) = P(x|c)P(c)/P(x). When comparing the posterior probabilities of different values of c, the denominator P(x) is the same constant and can be ignored, so it suffices to compare P(x|c)P(c). The prior P(c) is easily estimated by computing the proportion of training samples belonging to each class. For the class-conditional probability P(x|c), I describe only the naive Bayes estimate here: naive Bayes assumes the attributes of a thing are mutually independent given the class, so P(x|c) = ∏ P(xi|c).
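Under these definitions, the decision rule argmax_c P(c) ∏ P(xi|c) can be sketched in a few lines of Python. This is a minimal sketch: the function name and the probability tables in the usage example are my own made-up illustrations, not from the original article.

```python
def classify(x, priors, cond):
    """Naive Bayes decision rule: return the class c maximizing
    P(c) * product of P(x_i | c) over the attributes of x.

    priors: {class: P(c)}
    cond:   {class: {attribute_value: P(value | c)}}
    """
    best_class, best_score = None, -1.0
    for c, p_c in priors.items():
        score = p_c
        for xi in x:
            score *= cond[c][xi]   # independence: multiply per-attribute terms
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

For example, with hypothetical tables priors = {"yes": 0.6, "no": 0.4} and cond = {"yes": {"a": 0.7, "b": 0.2}, "no": {"a": 0.1, "b": 0.8}}, the sample ["a", "a"] scores 0.294 for "yes" versus 0.004 for "no" and is classified "yes".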
2. Text categorization process
For example, the document "good good study day day up" can be represented by the text feature vector x = (good, good, study, day, day, up). In text classification we have a document d ∈ X, and a category c, also called a label. We put a pile of labeled documents together, e.g. <d, c> = <Beijing joins the World Trade Organization, China>: this one-sentence document is classified into the class China, i.e. its label is China.
A naive Bayes classifier is a form of supervised learning, and comes in two models: the multinomial model (word-frequency based) and the Bernoulli model (document based). Their computational granularity differs: the multinomial model counts at the word level, the Bernoulli model at the document level, so both the prior and the class-conditional probabilities are computed differently. When computing the posterior probability for a document d, in the multinomial model only the words that appear in d participate in the calculation; in the Bernoulli model, words that do not appear in d but do appear in the global vocabulary also participate, as "counter-evidence" through factors of the form (1 - P(t|c)). This article does not consider feature extraction, and add-one (Laplace) smoothing is used so that no class-conditional probability is 0, which would otherwise wipe out a test document's entire score.
2.1 Multinomial model
1) Fundamentals
In the multinomial model, a document is d = (t1, t2, ..., tk), where tk is a word appearing in the document; repetitions are allowed.
Prior probability P(c) = total number of words in class c / total number of words in the entire training sample
Class-conditional probability P(tk|c) = (number of occurrences of word tk across all documents of class c + 1) / (total number of words in class c + |V|)
V is the vocabulary of the training sample (each distinct word is counted once, no matter how many times it appears), and |V| is the number of words it contains. P(tk|c) can be read as how much evidence the word tk provides that d belongs to class c, while P(c) can be read as the overall proportion (how large a share) of class c.
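As a sketch, the two estimates above can be written in Python. The helper name train_multinomial and the data layout are my own assumptions, not from the original article.

```python
from collections import Counter

def train_multinomial(docs):
    """docs: list of (word_list, label) pairs.
    Returns the prior table and a class-conditional probability function."""
    word_counts = {}   # per-class word-frequency tables
    total_words = {}   # total word count per class
    vocab = set()
    for words, label in docs:
        word_counts.setdefault(label, Counter()).update(words)
        total_words[label] = total_words.get(label, 0) + len(words)
        vocab.update(words)
    n = sum(total_words.values())
    # P(c) = words in class c / words in the whole training sample
    priors = {c: total_words[c] / n for c in total_words}
    # P(t|c) = (count of t in class c + 1) / (words in class c + |V|)
    def cond(t, c):
        return (word_counts[c][t] + 1) / (total_words[c] + len(vocab))
    return priors, cond
```

Running this on the training data of the example below reproduces its numbers, e.g. P(yes) = 8/11 and P(Chinese|yes) = 3/7.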
2) Example
Given the following labeled text training data:
DocId | Doc                      | In c=China?
1     | Chinese Beijing Chinese  | yes
2     | Chinese Chinese Shanghai | yes
3     | Chinese Macao            | yes
4     | Tokyo Japan Chinese      | no
Given a new sample "Chinese Chinese Chinese Tokyo Japan", classify it. This text is represented by the attribute vector d = (Chinese, Chinese, Chinese, Tokyo, Japan), and the category set is {yes, no}.
Class yes contains 8 words in total and class no contains 3, so the training sample contains 11 words in total; therefore P(yes) = 8/11 and P(no) = 3/11. The class-conditional probabilities are computed as follows:
P(Chinese|yes) = (5+1)/(8+6) = 6/14 = 3/7
P(Japan|yes) = P(Tokyo|yes) = (0+1)/(8+6) = 1/14
P(Chinese|no) = (1+1)/(3+6) = 2/9
P(Japan|no) = P(Tokyo|no) = (1+1)/(3+6) = 2/9
In the denominators, 8 is the total number of words in the documents of class yes; 6 is |V|, since the training sample contains six distinct words (Chinese, Beijing, Shanghai, Macao, Tokyo, Japan); and 3 is the total number of words under class no.
With the above conditional probabilities, we can compute the posterior probabilities:
P(yes|d) = 8/11 × (3/7)^3 × 1/14 × 1/14 = 54/184877 ≈ 0.000292
P(no|d) = 3/11 × (2/9)^3 × 2/9 × 2/9 = 32/216513 ≈ 0.000148
Comparing the two, P(yes|d) is larger, so this document belongs to the class China.
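The posterior arithmetic can be reproduced exactly with Python's fractions module; each factor below comes straight from the conditional probabilities derived above.

```python
from fractions import Fraction as F

# Multinomial posteriors for d = (Chinese, Chinese, Chinese, Tokyo, Japan):
# prior times one class-conditional factor per word occurrence in d.
p_yes = F(8, 11) * F(3, 7) ** 3 * F(1, 14) * F(1, 14)   # = 54/184877
p_no = F(3, 11) * F(2, 9) ** 3 * F(2, 9) * F(2, 9)      # = 32/216513
decision = "yes" if p_yes > p_no else "no"               # "yes" (class China)
```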
2.2 Bernoulli model
1) Fundamentals
P(c) = number of documents in class c / total number of documents in the training sample
P(tk|c) = (number of documents of class c containing word tk + 1) / (number of documents in class c + 2)
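These document-level estimates can be sketched in the same style as before. The helper name train_bernoulli is my own assumption; note that each word is counted at most once per document.

```python
def train_bernoulli(docs):
    """docs: list of (word_list, label) pairs; granularity is the document."""
    doc_freq = {}   # per class: in how many documents each word appears
    n_docs = {}     # number of documents per class
    for words, label in docs:
        n_docs[label] = n_docs.get(label, 0) + 1
        table = doc_freq.setdefault(label, {})
        for w in set(words):               # presence/absence, not frequency
            table[w] = table.get(w, 0) + 1
    n = sum(n_docs.values())
    # P(c) = documents in class c / documents in the training sample
    priors = {c: n_docs[c] / n for c in n_docs}
    # P(t|c) = (docs of class c containing t + 1) / (docs in class c + 2)
    def cond(t, c):
        return (doc_freq[c].get(t, 0) + 1) / (n_docs[c] + 2)
    return priors, cond
```

On the example data this yields P(yes) = 3/4 and P(Chinese|yes) = 4/5, matching the calculation below.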
2) Example
Using the data from the previous example, the model is replaced by the Bernoulli model.
There are 3 documents in class yes and 1 document in class no, and the training sample contains 4 documents in total, so P(yes) = 3/4 and P(no) = 1/4. P(Chinese|yes) = (3+1)/(3+2) = 4/5; the remaining conditional probabilities are as follows:
P(Japan|yes) = P(Tokyo|yes) = (0+1)/(3+2) = 1/5
P(Beijing|yes) = P(Macao|yes) = P(Shanghai|yes) = (1+1)/(3+2) = 2/5
P(Chinese|no) = (1+1)/(1+2) = 2/3
P(Japan|no) = P(Tokyo|no) = (1+1)/(1+2) = 2/3
P(Beijing|no) = P(Macao|no) = P(Shanghai|no) = (0+1)/(1+2) = 1/3
With the above conditional probabilities, we compute the posterior probabilities:
P(yes|d) = P(yes) × P(Chinese|yes) × P(Japan|yes) × P(Tokyo|yes) × (1-P(Beijing|yes)) × (1-P(Shanghai|yes)) × (1-P(Macao|yes)) = 3/4 × 4/5 × 1/5 × 1/5 × (1-2/5) × (1-2/5) × (1-2/5) = 81/15625 ≈ 0.005
P(no|d) = 1/4 × 2/3 × 2/3 × 2/3 × (1-1/3) × (1-1/3) × (1-1/3) = 16/729 ≈ 0.022
Since P(no|d) > P(yes|d), this document does not belong to the class China.
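As with the multinomial case, the Bernoulli arithmetic can be checked exactly with fractions; words present in d contribute P(t|c), words in the vocabulary but absent from d contribute (1 - P(t|c)).

```python
from fractions import Fraction as F

# Bernoulli posteriors for the same test document.
# Present in d: Chinese, Japan, Tokyo; absent: Beijing, Shanghai, Macao.
p_yes = F(3, 4) * F(4, 5) * F(1, 5) * F(1, 5) * (1 - F(2, 5)) ** 3  # = 81/15625
p_no = F(1, 4) * F(2, 3) * F(2, 3) * F(2, 3) * (1 - F(1, 3)) ** 3   # = 16/729
decision = "yes" if p_yes > p_no else "no"                          # "no"
```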
Postscript: text classification treats the data as discrete; my earlier confusion came from mixing up the continuous and discrete cases. Naive Bayes is used in many settings where the data has both continuous and discrete attributes. A continuous attribute can be modeled with a normal distribution, or with intervals: divide each attribute's range into several intervals, compute a probability per interval, and at test time use the conditional probability of whichever interval the attribute value falls into. Then there are TF and TF-IDF, which are simply different ways of computing the attribute values of a thing. In text classification, for example, a document can be described by how many times each word occurs in it, or by 0 and 1 for whether the word occurs at all, or by a word's count in this document combined with its count in the other classes (reducing the importance of attributes that matter less for a class).
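The interval idea in the postscript amounts to mapping each continuous value to a discrete bin before counting, as in this small sketch (the boundaries in the usage example are arbitrary illustrations):

```python
def to_interval(value, boundaries):
    """Map a continuous value to a discrete interval index, so the usual
    discrete conditional-probability counts can be applied to it.
    boundaries must be sorted ascending; index i means
    boundaries[i-1] <= value < boundaries[i]."""
    for i, b in enumerate(boundaries):
        if value < b:
            return i
    return len(boundaries)   # value beyond the last boundary
```

For example, with boundaries [1.0, 2.0, 3.0], the value 1.2 falls into interval 1, and each training sample's attribute would be replaced by such an index before computing P(interval|c).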
Naive Bayesian Classification algorithm (2)