Naive Bayesian Classification algorithm (2)

Source: Internet
Author: User

Reposted from: http://blog.163.com/[email protected]/blog/static/1712321772010102802635243/

I pondered this for two days. The principle of naive Bayes itself was very clear to me, but for text classification most articles only say: apply the naive Bayes formula and assign the document to the class with the largest posterior probability, where the posterior probability is the product of the prior probability and the class-conditional probabilities, both estimated from the training set. Those estimates are the naive Bayes classification model, which is saved as an intermediate result and looked up when a test document is classified. The big idea was clear, but an important detail in the middle was not: how does the model obtained from training relate to a new document that is to be classified? How are the conditional and prior probabilities derived from the training set applied to the test document? After carefully reading a few more articles and sorting out in my head what I had seen before, I finally figured out what is going on, and I am writing it down here for later reference. The example below is copied from an article I found particularly detailed; one look at it and everything becomes clear.

1. Basic definitions


Classification assigns a thing to a category. A thing has many attributes; treating those attributes as a vector x = (x1, x2, x3, ..., xn), the vector x represents the thing, and the set of all such vectors is written X and called the attribute set. The categories also form a set C = {c1, c2, ..., cm}. In general the relationship between x and C is not deterministic, so x and C can be regarded as random variables. P(c|x) is called the posterior probability of c, and relative to it P(c) is called the prior probability of c.

By Bayes' formula, the posterior probability P(c|x) = P(x|c)P(c)/P(x). When comparing the posterior probabilities of different values of c, the denominator P(x) is always the same constant and can be ignored, so the comparison reduces to P(c|x) ∝ P(x|c)P(c). The prior probability P(c) is easy to estimate: it is the proportion of training samples that belong to each class. For the class-conditional probability P(x|c) I only describe the naive Bayes estimate here: naive Bayes assumes that the attributes of a thing are independent of one another given the class, so P(x|c) = ∏i P(xi|c).
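To make the decision rule concrete, here is a minimal sketch of "pick the class with the largest P(c) · ∏ P(xi|c)"; the probability tables are assumed to have been estimated from a training set beforehand, and all names are illustrative rather than taken from the source:

```python
from math import prod  # Python 3.8+

def classify(x, priors, cond_probs):
    """x: list of attribute values; priors: {c: P(c)};
    cond_probs: {c: {value: P(value | c)}}."""
    scores = {c: priors[c] * prod(cond_probs[c].get(xi, 0.0) for xi in x)
              for c in priors}
    return max(scores, key=scores.get)  # class with the largest posterior score
```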

2. Text categorization process

For example, the document "good good study day day up" can be represented by the text feature vector x = (good, good, study, day, day, up). In text classification we have a document d ∈ X, and the class c is also called the label. We work with labeled documents, i.e. pairs <d, c> ∈ X × C; for instance, the one-sentence document "Beijing joins the World Trade Organization" is assigned to the class China, so its label is China.

The naive Bayes classifier is a form of supervised learning, and there are two common models: the multinomial model, which is based on word frequencies, and the Bernoulli model, which is based on documents. Their computational granularity differs: the multinomial model counts words, while the Bernoulli model counts documents, so both the prior probability and the class-conditional probabilities are computed differently. When computing the posterior probability for a document d, in the multinomial model only the words that appear in d take part in the calculation; in the Bernoulli model, words that do not appear in d but do appear in the global vocabulary also take part, only they take part as the "opposing party" (through the factor 1 − P(t|c)). This article does not consider feature extraction; add-one smoothing is used so that a zero class-conditional probability does not wipe out the test document, and in practice the probabilities are combined via their logarithms.
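On the logarithms: because the class-conditional probabilities are all smaller than 1, multiplying many of them underflows for long documents, so one adds log-probabilities instead of multiplying probabilities. A minimal sketch, assuming the conditional probabilities are already smoothed so that log(0) never occurs (names are illustrative):

```python
import math

def log_score(words, prior, cond_prob):
    """log(P(c)) + sum of log(P(w | c)) over the words of one document."""
    return math.log(prior) + sum(math.log(cond_prob[w]) for w in words)
```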

2.1 Multinomial model

1) Fundamentals

In the multinomial model, a document is d = (t1, t2, ..., tk), where tk is a word that appears in the document and repetitions are allowed.

Prior probability P(c) = total number of words under class c / total number of words in the entire training sample

Class-conditional probability P(tk|c) = (number of occurrences of the word tk summed over all documents of class c + 1) / (total number of words under class c + |V|)

V is the vocabulary of the training sample (the set of distinct words: a word that occurs several times is counted only once), and |V| is the number of distinct words the training sample contains. P(tk|c) can be read as how much evidence the word tk provides that d belongs to class c, while P(c) can be read as how large a fraction of the whole the class c accounts for.
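The two estimates above translate almost directly into code. A sketch of multinomial training with add-one (Laplace) smoothing, assuming the training set is a list of (word list, label) pairs; the function and variable names are illustrative:

```python
from collections import Counter, defaultdict

def train_multinomial(train):
    """train: list of (word_list, label) pairs."""
    word_count = defaultdict(int)      # total number of words per class
    term_count = defaultdict(Counter)  # per-class word-frequency table
    vocab = set()
    for words, c in train:
        word_count[c] += len(words)
        term_count[c].update(words)
        vocab.update(words)
    total_words = sum(word_count.values())
    priors = {c: word_count[c] / total_words for c in word_count}

    def cond(t, c):
        # P(t|c) = (occurrences of t in class c + 1) / (words in class c + |V|)
        return (term_count[c][t] + 1) / (word_count[c] + len(vocab))

    return priors, cond
```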

2) Example

Given a set of labeled text training data, as follows:

DocId   Doc                        Category (in c = China?)
1       Chinese Beijing Chinese    Yes
2       Chinese Chinese Shanghai   Yes
3       Chinese Macao              Yes
4       Tokyo Japan Chinese        No

Given a new sample "Chinese Chinese Chinese Tokyo Japan", classify it. As an attribute vector this text is d = (Chinese, Chinese, Chinese, Tokyo, Japan), and the class set is Y = {yes, no}.

Class yes contains 8 words in total and class no contains 3, so the whole training sample contains 11 words; therefore P(yes) = 8/11 and P(no) = 3/11. The class-conditional probabilities are computed as follows:

P(Chinese|yes) = (5+1)/(8+6) = 6/14 = 3/7

P(Japan|yes) = P(Tokyo|yes) = (0+1)/(8+6) = 1/14

P(Chinese|no) = (1+1)/(3+6) = 2/9

P(Japan|no) = P(Tokyo|no) = (1+1)/(3+6) = 2/9

The 8 in the denominators is the length of the concatenated text of class yes, that is, the total number of words in its training samples; the 6 is |V|: the training sample contains 6 distinct words in all, Chinese, Beijing, Shanghai, Macao, Tokyo and Japan; the 3 is the total number of words under class no.

With the above conditional probabilities we can compute the posterior probabilities:

P(yes|d) = (3/7)^3 × 1/14 × 1/14 × 8/11 = 54/184877 ≈ 0.000292

P(no|d) = (2/9)^3 × 2/9 × 2/9 × 3/11 = 32/216513 ≈ 0.000148

Comparing the two, P(yes|d) > P(no|d), so this document belongs to the class China (yes).
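For reference, the whole multinomial example can be reproduced with the training sketch above (train_multinomial is the illustrative helper from the earlier listing, not something from the source):

```python
train = [
    ("Chinese Beijing Chinese".split(),  "yes"),
    ("Chinese Chinese Shanghai".split(), "yes"),
    ("Chinese Macao".split(),            "yes"),
    ("Tokyo Japan Chinese".split(),      "no"),
]
priors, cond = train_multinomial(train)

d = "Chinese Chinese Chinese Tokyo Japan".split()
for c in ("yes", "no"):
    score = priors[c]
    for t in d:          # only the words occurring in d participate
        score *= cond(t, c)
    print(c, score)      # yes ≈ 0.00029, no ≈ 0.00015
```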

2.2 Bernoulli model

1) Fundamentals

Prior probability P(c) = number of documents under class c / total number of documents in the training sample

Class-conditional probability P(tk|c) = (number of class c documents containing the word tk + 1) / (total number of documents under class c + 2)
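A matching sketch of the Bernoulli estimates: counts are over documents rather than word occurrences, and the +2 in the denominator corresponds to the two possible outcomes, word present or word absent (names are illustrative):

```python
from collections import defaultdict

def train_bernoulli(train):
    """train: list of (word_list, label) pairs."""
    doc_count = defaultdict(int)                      # documents per class
    doc_freq = defaultdict(lambda: defaultdict(int))  # per-class document frequency
    vocab = set()
    for words, c in train:
        doc_count[c] += 1
        for t in set(words):          # a word counts once per document
            doc_freq[c][t] += 1
        vocab.update(words)
    total_docs = sum(doc_count.values())
    priors = {c: doc_count[c] / total_docs for c in doc_count}

    def cond(t, c):
        # P(t|c) = (class c documents containing t + 1) / (class c documents + 2)
        return (doc_freq[c][t] + 1) / (doc_count[c] + 2)

    return priors, cond, vocab
```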

2) Example

Using the data from the previous example, we now switch to the Bernoulli model.

There are 3 documents under class yes and 1 document under class no, so the training sample contains 4 documents in total; therefore P(yes) = 3/4 and P(no) = 1/4, and P(Chinese|yes) = (3+1)/(3+2) = 4/5. The remaining conditional probabilities are as follows:

P(Japan|yes) = P(Tokyo|yes) = (0+1)/(3+2) = 1/5

P(Beijing|yes) = P(Macao|yes) = P(Shanghai|yes) = (1+1)/(3+2) = 2/5

P(Chinese|no) = (1+1)/(1+2) = 2/3

P(Japan|no) = P(Tokyo|no) = (1+1)/(1+2) = 2/3

P(Beijing|no) = P(Macao|no) = P(Shanghai|no) = (0+1)/(1+2) = 1/3

With the above conditional probabilities we can compute the posterior probabilities:

P(yes|d) = P(yes) × P(Chinese|yes) × P(Japan|yes) × P(Tokyo|yes) × (1 − P(Beijing|yes)) × (1 − P(Shanghai|yes)) × (1 − P(Macao|yes)) = 3/4 × 4/5 × 1/5 × 1/5 × (1 − 2/5) × (1 − 2/5) × (1 − 2/5) = 81/15625 ≈ 0.005

P(no|d) = 1/4 × 2/3 × 2/3 × 2/3 × (1 − 1/3) × (1 − 1/3) × (1 − 1/3) = 16/729 ≈ 0.022

Since P(no|d) > P(yes|d), this document does not belong to the class China.
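The Bernoulli result can likewise be reproduced with the sketch above; note how every word in the vocabulary participates, either as P(t|c) when it occurs in d or as 1 − P(t|c) when it does not:

```python
priors, cond, vocab = train_bernoulli(train)   # same training data as before
d_words = set("Chinese Chinese Chinese Tokyo Japan".split())

for c in ("yes", "no"):
    score = priors[c]
    for t in vocab:
        # present words contribute P(t|c), absent words contribute 1 - P(t|c)
        score *= cond(t, c) if t in d_words else 1 - cond(t, c)
    print(c, score)    # yes ≈ 0.005, no ≈ 0.022
```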

PostScript: text classification works with discrete data; what confused me before was mixing the continuous and discrete cases together. Naive Bayes is used in many settings where the data contain both continuous and discrete attributes. A continuous attribute can be modelled with a normal distribution, or it can be discretized: divide its values into several intervals and compute a probability for each, then at test time see which interval the attribute value falls into and use that interval's conditional probability. As for TF and TF-IDF, these are just different ways of describing an attribute of a thing. In text classification, for example, a document can be described by whether each word appears or not (0/1), by the number of times a word appears in the document, or by the number of times a word appears in this document combined with how often the word appears in the other classes (which lowers the importance of an attribute that is common across classes).
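For the continuous case mentioned in the postscript, one common choice is to model each continuous attribute per class with a normal distribution whose mean and standard deviation are estimated from the training data; a minimal sketch of that likelihood (illustrative, not from the source):

```python
import math

def gaussian_likelihood(x, mean, std):
    """P(x|c) for a continuous attribute, with mean and std estimated per class."""
    return math.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))
```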
