1 The text categorization process
For example, the document "good good study, day day up" can be represented by a text feature vector x = (good, good, study, day, day, up). In text categorization we have a document d ∈ X, and a category c ∈ C, also called a label. A collection of labeled documents <d, c> forms the training sample, with <d, c> ∈ X × C. For example, <d, c> = <Beijing joins the World Trade Organization, China>: this one-sentence document is assigned to the class China, i.e. it receives the China label.
The naive Bayesian classifier is a supervised learner with two common variants: the multinomial model, which works with word frequencies, and the Bernoulli model, which works at the document level. The two differ in computational granularity: the multinomial model counts words, while the Bernoulli model counts documents, so both the prior probability and the class-conditional probabilities are computed differently. When computing the posterior probability of a document d, only the words that appear in d participate under the multinomial model; under the Bernoulli model, words that do not appear in d but do appear in the global vocabulary also participate, as "opposing" evidence. Feature extraction is not considered in this article; add-one smoothing is used to avoid class-conditional probabilities of 0 for words in the test document (which would also make taking logarithms impossible).
1.1 Multinomial model
1) Fundamentals
In the multinomial model, a document is d = (t1, t2, ..., tk), where tk is a word appearing in the document; repetitions are allowed.
Prior probability P(c) = total number of words in class c / total number of words in the entire training sample
Class-conditional probability P(tk|c) = (number of occurrences of word tk across the documents of class c + 1) / (total number of words in class c + |V|)
V is the vocabulary of the training sample (each distinct word is counted once, no matter how often it appears), and |V| is the number of distinct words it contains. P(tk|c) can be read as how much evidence the word tk provides for class c, and P(c) as the overall proportion (prior plausibility) of class c.
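The two estimates above can be sketched in Python. This is a minimal sketch, not a reference implementation; the function name `train_multinomial` and the (tokens, label) input format are assumptions for illustration.

```python
from collections import Counter

def train_multinomial(docs):
    """Multinomial Naive Bayes estimates with add-one smoothing.

    docs: list of (tokens, label) pairs, where tokens is a list of words.
    Returns (priors, cond) with
      priors[c]  = total words in class c / total words overall
      cond[c][t] = (count of t in class c + 1) / (words in class c + |V|)
    """
    vocab = {t for tokens, _ in docs for t in tokens}
    counts = {}                      # label -> Counter of word frequencies
    total_words = 0
    for tokens, label in docs:
        counts.setdefault(label, Counter()).update(tokens)
        total_words += len(tokens)
    priors, cond = {}, {}
    for label, c in counts.items():
        n_c = sum(c.values())        # total number of words under this class
        priors[label] = n_c / total_words
        cond[label] = {t: (c[t] + 1) / (n_c + len(vocab)) for t in vocab}
    return priors, cond
```

Note that the smoothing denominator adds |V|, the number of distinct words, while n_c counts word tokens with repetitions.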
2) Example
Given a set of pre-classified text training data, as follows:
DocId | Doc                      | In c = China?
------|--------------------------|--------------
1     | Chinese Beijing Chinese  | Yes
2     | Chinese Chinese Shanghai | Yes
3     | Chinese Macao            | Yes
4     | Tokyo Japan Chinese      | No
Given a new sample "Chinese Chinese Chinese Tokyo Japan", classify it. The text is represented by the feature vector d = (Chinese, Chinese, Chinese, Tokyo, Japan); the category set is y = {yes, no}.
There are 8 words in total under class yes and 3 words under class no, so the training sample contains 11 words in all; thus P(yes) = 8/11 and P(no) = 3/11. The class-conditional probabilities are computed as follows:
P(Chinese|yes) = (5+1)/(8+6) = 6/14 = 3/7
P(Japan|yes) = P(Tokyo|yes) = (0+1)/(8+6) = 1/14
P(Chinese|no) = (1+1)/(3+6) = 2/9
P(Japan|no) = P(Tokyo|no) = (1+1)/(3+6) = 2/9
In the denominators, 8 is the total number of words in the class-yes training documents, 6 is |V| (the training sample contains the 6 distinct words Chinese, Beijing, Shanghai, Macao, Tokyo, Japan), and 3 is the total number of words under class no.
With these conditional probabilities, we can compute the posterior probabilities:
P(yes|d) = (3/7)^3 × 1/14 × 1/14 × 8/11 = 54/184877 ≈ 0.00029209
P(no|d) = (2/9)^3 × 2/9 × 2/9 × 3/11 = 32/216513 ≈ 0.00014780
Since P(yes|d) > P(no|d), this document belongs to the class China.
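The posterior computation above can be reproduced exactly with Python's `fractions` module; a minimal sketch that hard-codes the smoothed estimates derived above (variable names are illustrative):

```python
from fractions import Fraction as F
from math import prod

doc = ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]
priors = {"yes": F(8, 11), "no": F(3, 11)}
# Smoothed class-conditional probabilities for the words that occur in doc.
cond = {
    "yes": {"Chinese": F(5 + 1, 8 + 6), "Tokyo": F(0 + 1, 8 + 6),
            "Japan": F(0 + 1, 8 + 6)},
    "no":  {"Chinese": F(1 + 1, 3 + 6), "Tokyo": F(1 + 1, 3 + 6),
            "Japan": F(1 + 1, 3 + 6)},
}
# Only words appearing in doc participate; a repeated word multiplies once
# per occurrence, which is what makes the model multinomial.
score = {c: priors[c] * prod(cond[c][t] for t in doc) for c in priors}
label = max(score, key=score.get)   # the larger posterior wins
```

Using exact fractions avoids any floating-point rounding in the comparison.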
1.2 Bernoulli model
1) Fundamentals
P(c) = number of documents in class c / total number of documents in the training sample
P(tk|c) = (number of class-c documents containing word tk + 1) / (total number of documents in class c + 2)
2) example
Using the data from the previous example, we now switch to the Bernoulli model.
There are 3 documents under class yes and 1 document under class no, 4 documents in total, so P(yes) = 3/4 and P(no) = 1/4. P(Chinese|yes) = (3+1)/(3+2) = 4/5. The remaining conditional probabilities are:
P(Japan|yes) = P(Tokyo|yes) = (0+1)/(3+2) = 1/5
P(Beijing|yes) = P(Macao|yes) = P(Shanghai|yes) = (1+1)/(3+2) = 2/5
P(Chinese|no) = (1+1)/(1+2) = 2/3
P(Japan|no) = P(Tokyo|no) = (1+1)/(1+2) = 2/3
P(Beijing|no) = P(Macao|no) = P(Shanghai|no) = (0+1)/(1+2) = 1/3
With these conditional probabilities, we compute the posterior probabilities:
P(yes|d) = P(yes) × P(Chinese|yes) × P(Japan|yes) × P(Tokyo|yes) × (1−P(Beijing|yes)) × (1−P(Shanghai|yes)) × (1−P(Macao|yes)) = 3/4 × 4/5 × 1/5 × 1/5 × (1−2/5) × (1−2/5) × (1−2/5) = 81/15625 ≈ 0.005
P(no|d) = 1/4 × 2/3 × 2/3 × 2/3 × (1−1/3) × (1−1/3) × (1−1/3) = 16/729 ≈ 0.022
Since P(no|d) > P(yes|d), this document does not belong to the class China.
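The Bernoulli-model computation can be sketched the same way. Note how every vocabulary word participates: absent words contribute the factor 1 − P(t|c) as "opposing" evidence, and repeated occurrences of a word are ignored. Function and variable names here are illustrative.

```python
from fractions import Fraction as F

vocab = ["Chinese", "Beijing", "Shanghai", "Macao", "Tokyo", "Japan"]
train = [
    (["Chinese", "Beijing", "Chinese"], "yes"),
    (["Chinese", "Chinese", "Shanghai"], "yes"),
    (["Chinese", "Macao"], "yes"),
    (["Tokyo", "Japan", "Chinese"], "no"),
]
n_c = {c: sum(1 for _, l in train if l == c) for c in ("yes", "no")}
priors = {c: F(n, len(train)) for c, n in n_c.items()}      # document fractions
# P(t|c) = (class-c documents containing t + 1) / (class-c documents + 2)
cond = {c: {t: F(sum(1 for d, l in train if l == c and t in d) + 1, n_c[c] + 2)
            for t in vocab} for c in n_c}

def bernoulli_score(tokens, c):
    present = set(tokens)            # repetitions within the document are ignored
    s = priors[c]
    for t in vocab:                  # absent vocabulary words vote against
        s *= cond[c][t] if t in present else 1 - cond[c][t]
    return s

doc = ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]
label = max(("yes", "no"), key=lambda c: bernoulli_score(doc, c))
```

Here the larger score determines the label, just as in the worked example.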
PostScript: text classification treats the data as discrete. What confused me earlier was mixing continuous and discrete attributes: naive Bayes is used in many settings, and the data may contain both kinds. A continuous attribute can be modeled with a normal distribution, or discretized into intervals, computing a probability for each interval; at test time, the conditional probability of whichever interval the attribute value falls into is used. As for TF and TF-IDF, these are simply different ways of describing an attribute. In text classification, a document can be described by whether each word appears or not (0/1), by the number of times a word appears in the document, or by that count weighted against the word's occurrences in the other classes (reducing the importance of an attribute that is common everywhere).
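For the continuous case mentioned above, one common choice is to model a continuous attribute within each class as a normal distribution; its density then takes the place of the discrete conditional probability in the product. A minimal sketch, with illustrative function names:

```python
from math import exp, pi, sqrt

def estimate(values):
    """Sample mean and (population) variance of a class's attribute values."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v

def gaussian_likelihood(x, mean, var):
    """Class-conditional density of a continuous attribute value x,
    assuming the attribute is normally distributed within the class."""
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)
```

A density is not a probability, but since every class's likelihood is scaled consistently, the comparison between posteriors still works.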
The multinomial and Bernoulli models in naive Bayes