The multinomial and Bernoulli models in Naive Bayes


1 The text categorization process

For example, the document "good good study, day day up" can be represented by the text feature vector x = (good, good, study, day, day, up). In text categorization, suppose we have a document d ∈ X, where X is the document space; a category c is also called a label. We collect a set of labeled documents <d, c> as training samples, with <d, c> ∈ X × C. For example, <d, c> = <Beijing joins the World Trade Organization, China>: this one-sentence document is assigned to the category China, i.e. it receives the label China.

The naive Bayes classifier is a supervised learner, and it comes in two variants: the multinomial model, which is word-frequency based, and the Bernoulli model, which is document based. The two differ in computational granularity: the multinomial model works at the level of words, the Bernoulli model at the level of documents, so both the prior probability and the class-conditional probabilities are estimated differently. When computing the posterior probability of a document d, the multinomial model lets only the words that appear in d participate, while in the Bernoulli model the words that do not appear in d but do appear in the global vocabulary also participate, as "negative evidence" through factors of the form 1 − P(t|c). This article does not consider feature selection, and add-one (Laplace) smoothing is used so that a class-conditional probability of 0 cannot wipe out a test document's score (and so that its logarithm stays defined).

1.1 Multinomial model

1) Fundamentals

In the multinomial model, a document d = (t1, t2, ..., tk) is the sequence of words that appear in it, where the tk may repeat.

Prior probability: P(c) = (total number of words under class c) / (total number of words in the entire training sample)

Class-conditional probability: P(tk|c) = (number of occurrences of the word tk in the documents of class c + 1) / (total number of words under class c + |V|)

Here V is the vocabulary of the training sample (each distinct word is counted only once, however many times it occurs), and |V| is the number of distinct words the training sample contains. P(tk|c) can be read as the amount of evidence the word tk provides for class c, and P(c) as the overall proportion of class c (how likely the class is a priori).
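To make the bookkeeping concrete, here is a minimal Python sketch of the multinomial training step. The function name train_multinomial_nb and the (tokens, label) input format are illustrative choices, not from the original article; the prior is word-count based, exactly as defined above.

    from collections import Counter, defaultdict

    def train_multinomial_nb(docs):
        # docs: list of (tokens, label) pairs, e.g. (["Chinese", "Beijing"], "yes")
        vocab = set()
        word_counts = defaultdict(Counter)  # per-class word frequency counts
        class_words = Counter()             # total number of words per class
        for tokens, label in docs:
            vocab.update(tokens)
            word_counts[label].update(tokens)
            class_words[label] += len(tokens)
        total_words = sum(class_words.values())
        # Word-count-based prior, as defined above: P(c) = words in c / all words
        priors = {c: n / total_words for c, n in class_words.items()}
        def cond_prob(word, c):
            # Laplace smoothing: (count of word in class c + 1) / (words in c + |V|)
            return (word_counts[c][word] + 1) / (class_words[c] + len(vocab))
        return priors, cond_prob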

2) Example

Given a set of labeled text training data, as follows:

DocId | Doc                      | Category (in c = China?)
------|--------------------------|-------------------------
1     | Chinese Beijing Chinese  | Yes
2     | Chinese Chinese Shanghai | Yes
3     | Chinese Macao            | Yes
4     | Tokyo Japan Chinese      | No

Given the new sample "Chinese Chinese Chinese Tokyo Japan", classify it. As a feature vector the text is represented as d = (Chinese, Chinese, Chinese, Tokyo, Japan), and the category set is Y = {yes, no}.

There are 8 words in total under class yes and 3 words under class no, so the training sample contains 11 words in all; hence P(yes) = 8/11 and P(no) = 3/11. The class-conditional probabilities are computed as follows:

P(Chinese|yes) = (5+1)/(8+6) = 6/14 = 3/7

P(Japan|yes) = P(Tokyo|yes) = (0+1)/(8+6) = 1/14

P(Chinese|no) = (1+1)/(3+6) = 2/9

P(Japan|no) = P(Tokyo|no) = (1+1)/(3+6) = 2/9

In the denominators, 8 is the total number of words under the yes class, 6 is |V|: the training sample contains the 6 distinct words Chinese, Beijing, Shanghai, Macao, Tokyo and Japan, and 3 is the total number of words under the no class.

With these conditional probabilities, we can compute the posterior probabilities:

P(yes|d) = (3/7)^3 × 1/14 × 1/14 × 8/11 = 54/184877 ≈ 0.00029209

P(no|d) = (2/9)^3 × 2/9 × 2/9 × 3/11 = 32/216513 ≈ 0.00014780

Comparing the two, P(yes|d) > P(no|d), so this document is assigned to class yes, i.e. to the category China.
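Plugging the toy corpus into the sketch above reproduces these numbers (this snippet assumes the train_multinomial_nb function from the earlier sketch):

    docs = [
        ("Chinese Beijing Chinese".split(), "yes"),
        ("Chinese Chinese Shanghai".split(), "yes"),
        ("Chinese Macao".split(), "yes"),
        ("Tokyo Japan Chinese".split(), "no"),
    ]
    priors, cond_prob = train_multinomial_nb(docs)
    test = "Chinese Chinese Chinese Tokyo Japan".split()
    for c in ("yes", "no"):
        score = priors[c]
        for w in test:            # only the words present in d participate
            score *= cond_prob(w, c)
        print(c, score)           # yes ~ 0.000292, no ~ 0.000148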

1.2 Bernoulli model

1) Fundamentals

Prior probability: P(c) = (number of documents under class c) / (total number of documents in the training sample)

Class-conditional probability: P(tk|c) = (number of documents of class c that contain the word tk + 1) / (total number of documents under class c + 2)
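A matching Python sketch for the Bernoulli estimates, again with illustrative names. Note that each document's words are deduplicated before counting, since only presence or absence matters here:

    from collections import Counter, defaultdict

    def train_bernoulli_nb(docs):
        # docs: list of (tokens, label) pairs; the Bernoulli model counts documents
        vocab = set()
        doc_counts = defaultdict(Counter)  # per class: docs containing each word
        class_docs = Counter()             # number of documents per class
        for tokens, label in docs:
            vocab.update(tokens)
            doc_counts[label].update(set(tokens))  # presence only: dedupe tokens
            class_docs[label] += 1
        total_docs = sum(class_docs.values())
        priors = {c: n / total_docs for c, n in class_docs.items()}
        def cond_prob(word, c):
            # (docs of class c containing the word + 1) / (docs of class c + 2)
            return (doc_counts[c][word] + 1) / (class_docs[c] + 2)
        return priors, cond_prob, vocab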

2) Example

Using the data from the previous example, the model is replaced by the Bernoulli model.

There are 3 documents under class yes and 1 document under class no, 4 training documents in all, so P(yes) = 3/4 and P(no) = 1/4, and for example P(Chinese|yes) = (3+1)/(3+2) = 4/5. The remaining conditional probabilities are as follows:

P(Japan|yes) = P(Tokyo|yes) = (0+1)/(3+2) = 1/5

P(Beijing|yes) = P(Macao|yes) = P(Shanghai|yes) = (1+1)/(3+2) = 2/5

P(Chinese|no) = (1+1)/(1+2) = 2/3

P(Japan|no) = P(Tokyo|no) = (1+1)/(1+2) = 2/3

P(Beijing|no) = P(Macao|no) = P(Shanghai|no) = (0+1)/(1+2) = 1/3

With these conditional probabilities, we can compute the posterior probabilities:

P(yes|d) = P(yes) × P(Chinese|yes) × P(Japan|yes) × P(Tokyo|yes) × (1 − P(Beijing|yes)) × (1 − P(Shanghai|yes)) × (1 − P(Macao|yes)) = 3/4 × 4/5 × 1/5 × 1/5 × (1 − 2/5) × (1 − 2/5) × (1 − 2/5) = 81/15625 ≈ 0.00518

P(no|d) = 1/4 × 2/3 × 2/3 × 2/3 × (1 − 1/3) × (1 − 1/3) × (1 − 1/3) = 16/729 ≈ 0.02195

Since P(no|d) > P(yes|d), under the Bernoulli model this document is not assigned to the category China.
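The Bernoulli posteriors can be checked the same way. Unlike the multinomial case, the loop runs over the whole vocabulary, with absent words contributing factors of 1 − P(t|c) (this snippet reuses docs and train_bernoulli_nb from the sketches above):

    priors, cond_prob, vocab = train_bernoulli_nb(docs)
    test_words = set("Chinese Chinese Chinese Tokyo Japan".split())
    for c in ("yes", "no"):
        score = priors[c]
        for w in vocab:           # every vocabulary word participates
            p = cond_prob(w, c)
            score *= p if w in test_words else (1 - p)
        print(c, score)           # yes ~ 0.00518, no ~ 0.02195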

PostScript: text classification here treats the data as discrete. What confused me before was mixing up the continuous and discrete cases: naive Bayes is used in many settings where the data contain both continuous and discrete attributes. A continuous attribute can be modeled with a normal distribution, or handled with intervals, splitting the attribute's range into several intervals and estimating a probability per interval; at test time, you look up which interval the attribute's value falls into and use that interval's conditional probability.

Then there are TF and TF-IDF, which are simply different ways of computing an attribute's value when describing an object such as a document: a document can be described by whether each word appears in it (a 0/1 presence feature), by the number of times each word appears in it, or by combining a word's count in this document with how often the word occurs in the remaining documents, which lowers the importance of attributes that are common everywhere (TF-IDF).
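As a small illustration of these three representations, here is a sketch using one standard TF-IDF variant (the article does not pin down an exact formula, so this choice is an assumption; docs is the toy corpus from the examples above):

    import math

    def tf(word, doc):
        # raw term frequency: how often the word occurs in this document
        return doc.count(word)

    def tf_idf(word, doc, corpus):
        # down-weight words that occur in many documents of the corpus
        df = sum(1 for d in corpus if word in d)
        return tf(word, doc) * math.log(len(corpus) / df)

    corpus = [tokens for tokens, _ in docs]       # docs from the examples above
    print(int("Chinese" in corpus[0]))            # 1   : binary (Bernoulli-style)
    print(tf("Chinese", corpus[0]))               # 2   : count (multinomial-style)
    print(tf_idf("Chinese", corpus[0], corpus))   # 0.0 : occurs in every document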

