1 The text categorization process
For example, the document "good good study, day day up" can be represented by a text feature vector x = (good, good, study, day, day, up). In text categorization we have a document d ∈ X, and a category c ∈ C, also called a label. A collection of labeled documents <d, c> forms the training sample, with <d, c> ∈ X × C. For example, <d, c> = <Beijing joins the World Trade Organization, China>: this one-sentence document is assigned to the class China, i.e. it receives the China label.
The naive Bayesian classifier is a supervised learner with two common variants: the multinomial model, which works with word frequencies, and the Bernoulli model, which works at the document level. The two differ in computational granularity: the multinomial model counts words, while the Bernoulli model counts documents, so both the prior probability and the class-conditional probabilities are computed differently. When computing the posterior probability of a document d, only the words that appear in d participate under the multinomial model; under the Bernoulli model, words that do not appear in d but do appear in the global vocabulary also participate, as "opposing" evidence. Feature extraction is not considered in this article; add-one smoothing is used to avoid class-conditional probabilities of 0 for words in the test document (which would also make taking logarithms impossible).
1.1 Multinomial model
1) Fundamentals
In the multinomial model, a document is d = (t1, t2, ..., tk), where tk is a word appearing in the document; repetitions are allowed.
Prior probability P(c) = total number of words in class c / total number of words in the entire training sample
Class-conditional probability P(tk|c) = (number of occurrences of word tk across the documents of class c + 1) / (total number of words in class c + |V|)
V is the vocabulary of the training sample (each distinct word is counted once, no matter how often it appears), and |V| is the number of distinct words it contains. P(tk|c) can be read as how much evidence the word tk provides for class c, and P(c) as the overall proportion (prior plausibility) of class c.
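The two estimates above can be sketched in Python. This is a minimal sketch, not a reference implementation; the function name `train_multinomial` and the (tokens, label) input format are assumptions for illustration.

```python
from collections import Counter

def train_multinomial(docs):
    """Multinomial Naive Bayes estimates with add-one smoothing.

    docs: list of (tokens, label) pairs, where tokens is a list of words.
    Returns (priors, cond) with
      priors[c]  = total words in class c / total words overall
      cond[c][t] = (count of t in class c + 1) / (words in class c + |V|)
    """
    vocab = {t for tokens, _ in docs for t in tokens}
    counts = {}                      # label -> Counter of word frequencies
    total_words = 0
    for tokens, label in docs:
        counts.setdefault(label, Counter()).update(tokens)
        total_words += len(tokens)
    priors, cond = {}, {}
    for label, c in counts.items():
        n_c = sum(c.values())        # total number of words under this class
        priors[label] = n_c / total_words
        cond[label] = {t: (c[t] + 1) / (n_c + len(vocab)) for t in vocab}
    return priors, cond
```

Note that the smoothing denominator adds |V|, the number of distinct words, while n_c counts word tokens with repetitions.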
2) Example
Given a set of pre-classified text training data, as follows:
DocId | Doc                      | In c = China?
------|--------------------------|--------------
1     | Chinese Beijing Chinese  | Yes
2     | Chinese Chinese Shanghai | Yes
3     | Chinese Macao            | Yes
4     | Tokyo Japan Chinese      | No
Given a new sample "Chinese Chinese Chinese Tokyo Japan", classify it. The text is represented by the feature vector d = (Chinese, Chinese, Chinese, Tokyo, Japan); the category set is y = {yes, no}.
There are 8 words in total under class yes and 3 words under class no, so the training sample contains 11 words in all; thus P(yes) = 8/11 and P(no) = 3/11. The class-conditional probabilities are computed as follows:
P(Chinese|yes) = (5+1)/(8+6) = 6/14 = 3/7
P(Japan|yes) = P(Tokyo|yes) = (0+1)/(8+6) = 1/14
P(Chinese|no) = (1+1)/(3+6) = 2/9
P(Japan|no) = P(Tokyo|no) = (1+1)/(3+6) = 2/9
In the denominators, 8 is the total number of words in the class-yes training documents, 6 is |V| (the training sample contains the 6 distinct words Chinese, Beijing, Shanghai, Macao, Tokyo, Japan), and 3 is the total number of words under class no.
With these conditional probabilities, we can compute the posterior probabilities:
P(yes|d) = (3/7)^3 × 1/14 × 1/14 × 8/11 = 54/184877 ≈ 0.00029209
P(no|d) = (2/9)^3 × 2/9 × 2/9 × 3/11 = 32/216513 ≈ 0.00014780
Since P(yes|d) > P(no|d), this document belongs to the class China.
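The posterior computation above can be reproduced exactly with Python's `fractions` module; a minimal sketch that hard-codes the smoothed estimates derived above (variable names are illustrative):

```python
from fractions import Fraction as F
from math import prod

doc = ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]
priors = {"yes": F(8, 11), "no": F(3, 11)}
# Smoothed class-conditional probabilities for the words that occur in doc.
cond = {
    "yes": {"Chinese": F(5 + 1, 8 + 6), "Tokyo": F(0 + 1, 8 + 6),
            "Japan": F(0 + 1, 8 + 6)},
    "no":  {"Chinese": F(1 + 1, 3 + 6), "Tokyo": F(1 + 1, 3 + 6),
            "Japan": F(1 + 1, 3 + 6)},
}
# Only words appearing in doc participate; a repeated word multiplies once
# per occurrence, which is what makes the model multinomial.
score = {c: priors[c] * prod(cond[c][t] for t in doc) for c in priors}
label = max(score, key=score.get)   # the larger posterior wins
```

Using exact fractions avoids any floating-point rounding in the comparison.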
1.2 Bernoulli model
1) Fundamentals
P(c) = number of documents in class c / total number of documents in the training sample
P(tk|c) = (number of class-c documents containing word tk + 1) / (total number of documents in class c + 2)
2) example
Using the data from the previous example, we now switch to the Bernoulli model.
There are 3 documents under class yes and 1 document under class no, 4 documents in total, so P(yes) = 3/4 and P(no) = 1/4. P(Chinese|yes) = (3+1)/(3+2) = 4/5. The remaining conditional probabilities are:
P(Japan|yes) = P(Tokyo|yes) = (0+1)/(3+2) = 1/5
P(Beijing|yes) = P(Macao|yes) = P(Shanghai|yes) = (1+1)/(3+2) = 2/5
P(Chinese|no) = (1+1)/(1+2) = 2/3
P(Japan|no) = P(Tokyo|no) = (1+1)/(1+2) = 2/3
P(Beijing|no) = P(Macao|no) = P(Shanghai|no) = (0+1)/(1+2) = 1/3
With these conditional probabilities, we compute the posterior probabilities:
P(yes|d) = P(yes) × P(Chinese|yes) × P(Japan|yes) × P(Tokyo|yes) × (1−P(Beijing|yes)) × (1−P(Shanghai|yes)) × (1−P(Macao|yes)) = 3/4 × 4/5 × 1/5 × 1/5 × (1−2/5) × (1−2/5) × (1−2/5) = 81/15625 ≈ 0.005
P(no|d) = 1/4 × 2/3 × 2/3 × 2/3 × (1−1/3) × (1−1/3) × (1−1/3) = 16/729 ≈ 0.022
Since P(no|d) > P(yes|d), this document does not belong to the class China.
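The Bernoulli-model computation can be sketched the same way. Note how every vocabulary word participates: absent words contribute the factor 1 − P(t|c) as "opposing" evidence, and repeated occurrences of a word are ignored. Function and variable names here are illustrative.

```python
from fractions import Fraction as F

vocab = ["Chinese", "Beijing", "Shanghai", "Macao", "Tokyo", "Japan"]
train = [
    (["Chinese", "Beijing", "Chinese"], "yes"),
    (["Chinese", "Chinese", "Shanghai"], "yes"),
    (["Chinese", "Macao"], "yes"),
    (["Tokyo", "Japan", "Chinese"], "no"),
]
n_c = {c: sum(1 for _, l in train if l == c) for c in ("yes", "no")}
priors = {c: F(n, len(train)) for c, n in n_c.items()}      # document fractions
# P(t|c) = (class-c documents containing t + 1) / (class-c documents + 2)
cond = {c: {t: F(sum(1 for d, l in train if l == c and t in d) + 1, n_c[c] + 2)
            for t in vocab} for c in n_c}

def bernoulli_score(tokens, c):
    present = set(tokens)            # repetitions within the document are ignored
    s = priors[c]
    for t in vocab:                  # absent vocabulary words vote against
        s *= cond[c][t] if t in present else 1 - cond[c][t]
    return s

doc = ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]
label = max(("yes", "no"), key=lambda c: bernoulli_score(doc, c))
```

Here the larger score determines the label, just as in the worked example.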
PostScript: text classification treats the data as discrete. What confused me earlier was mixing continuous and discrete attributes: naive Bayes is used in many settings, and the data may contain both kinds. A continuous attribute can be modeled with a normal distribution, or discretized into intervals, computing a probability for each interval; at test time, the conditional probability of whichever interval the attribute value falls into is used. As for TF and TF-IDF, these are simply different ways of describing an attribute. In text classification, a document can be described by whether each word appears or not (0/1), by the number of times a word appears in the document, or by that count weighted against the word's occurrences in the other classes (reducing the importance of an attribute that is common everywhere).
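For the continuous case mentioned above, one common choice is to model a continuous attribute within each class as a normal distribution; its density then takes the place of the discrete conditional probability in the product. A minimal sketch, with illustrative function names:

```python
from math import exp, pi, sqrt

def estimate(values):
    """Sample mean and (population) variance of a class's attribute values."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v

def gaussian_likelihood(x, mean, var):
    """Class-conditional density of a continuous attribute value x,
    assuming the attribute is normally distributed within the class."""
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)
```

A density is not a probability, but since every class's likelihood is scaled consistently, the comparison between posteriors still works.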
The multinomial and Bernoulli models in naive Bayes