Common Machine Learning Algorithms: Principles + Practice, Series 6 (Naive Bayes Classification)


Naive Bayes (NB)

Naive Bayes is a simple and effective classification algorithm. Bayes' theorem is expressed by the following conditional probability formula:

P(A|B) = P(B|A) * P(A) / P(B)

Here P(A|B) is the probability that A occurs given that B has occurred, and P(A) and P(B) are the probabilities that A and B occur at all; in practice these are estimated from the input samples. Bayesian classification is widely used in scenarios such as e-mail classification and text categorization. Take e-mail classification: it can be understood very simply. If an e-mail is represented by a combination of words W = (w1, w2, ..., wn), what is the probability that this message is spam? We are really asking for P(spam|W), and by the formula, P(spam|W) = P(W|spam) * P(spam) / P(W). The three probabilities on the right-hand side can all be obtained from the input samples during training. (Often we only need to compare the probabilities of the different classes, so P(W) need not be computed: for a given input it is constant and the same for every class, so it does not affect the comparison.)
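As a quick illustration of the formula (not from the source; all numbers below are made up), here is a minimal Python sketch of Bayes' theorem applied to a single word:

# Hypothetical numbers for illustration only.
p_spam = 0.3                 # P(spam): 30% of training messages are spam
p_w_spam = 0.5               # P("offer" | spam)
p_w_normal = 0.05            # P("offer" | normal mail)

# P("offer") by the law of total probability.
p_w = p_w_spam * p_spam + p_w_normal * (1 - p_spam)

# Bayes' theorem: P(spam | "offer") = P("offer" | spam) * P(spam) / P("offer")
p_spam_w = p_w_spam * p_spam / p_w
print(p_spam_w)              # ~0.81: seeing the word makes spam much more likely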

The following Python examples illustrate the entire Bayesian classification process:

1. Use vectors to represent a message or a text

Assuming there are n words in the vocabulary, we can represent a message as a 1 x n vector, where each entry indicates whether the corresponding word appears in the message: 1 means it appears, 0 means it does not.

First create the vocabulary, represented as a set, as sketched below:
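The source omits the code here; the following is a minimal sketch (the function name createVocabList and the input format are assumptions):

def createVocabList(dataSet):
    """Build the vocabulary: the set of unique words across all documents.

    dataSet -- a list of tokenized messages, e.g. [['buy', 'now'], ...]
    """
    vocabSet = set()
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # set union collects unique words
    return list(vocabSet)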

Then represent a message with a vector, such as [0, 1, 0, 0, 1, 1, ...]:
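Again the source code is missing; a minimal sketch, assuming the hypothetical name setOfWords2Vec and the vocabulary list from above:

def setOfWords2Vec(vocabList, inputSet):
    """Convert a tokenized message into a 1 x n presence/absence vector."""
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            # 1 means the word appears; 0 means it does not.
            # (Using += 1 here instead would give the word-count variant
            # mentioned under the improvements at the end.)
            returnVec[vocabList.index(word)] = 1
    return returnVec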

2. Find the three probability values during the training phase
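The training code is not shown in the source; below is a minimal sketch in the spirit of the text (names such as trainNB0 are assumptions). It estimates P(wi|spam), P(wi|normal mail) and the prior P(spam), with Laplace smoothing added so a word unseen in one class does not force the whole product to zero:

import numpy as np

def trainNB0(trainMatrix, trainCategory):
    """Training phase: estimate the three probability values.

    trainMatrix   -- list of presence/absence vectors, one per message
    trainCategory -- list of labels: 1 = spam, 0 = normal mail
    """
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pSpam = sum(trainCategory) / float(numTrainDocs)  # prior P(spam)
    # Laplace smoothing: start counts at 1 and denominators at 2.
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # Return log-probabilities; the classification step below sums them.
    p1Vect = np.log(p1Num / p1Denom)  # ln P(wi | spam)
    p0Vect = np.log(p0Num / p0Denom)  # ln P(wi | normal mail)
    return p0Vect, p1Vect, pSpam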

3. Use the probability values returned by the training phase to classify

Naive Bayes assumes that the features (i.e., the individual words w1, w2, ...) are independent of one another, so P(W|spam) is equivalent to:

P(w1|spam) * P(w2|spam) * ... Suppose a two-class problem: given W, compute P(spam|W) and P(normal mail|W); whichever value is larger determines the classification. In practical applications, to guarantee precision, the rule is applied more flexibly; for example, a message might be judged spam only when p > 0.99.

Our decision rule is then:

P(spam|W) > P(normal mail|W) ? spam : normal mail

That is, we compare the following two values:

P(W|spam) * P(spam) / P(W)

P(W|normal mail) * P(normal mail) / P(W)

Since the denominator P(W) is the same in both expressions, we can drop it. Taking the natural logarithm of what remains (which turns the long product into a sum and avoids numerical underflow), we compare the following two values:

ln(P(W|spam) * P(spam)) = ln(P(W|spam)) + ln(P(spam)) = ln(P(w1|spam)) + ... + ln(P(wn|spam)) + ln(P(spam))

ln(P(W|normal mail) * P(normal mail)) = ln(P(W|normal mail)) + ln(P(normal mail)) = ln(P(w1|normal mail)) + ... + ln(P(wn|normal mail)) + ln(P(normal mail))
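Putting step 3 together, a minimal classification sketch (using the output of the hypothetical trainNB0 above):

import numpy as np

def classifyNB(vec2Classify, p0Vect, p1Vect, pSpam):
    """Compare ln(P(W|spam)) + ln(P(spam)) with the normal-mail value."""
    vec = np.asarray(vec2Classify)
    # Multiplying by the 0/1 vector keeps only the log-probabilities of
    # the words that actually appear; the sum is the ln-product above.
    p1 = np.sum(vec * p1Vect) + np.log(pSpam)
    p0 = np.sum(vec * p0Vect) + np.log(1.0 - pSpam)
    return 1 if p1 > p0 else 0  # 1 = spam, 0 = normal mail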

Some possible improvements:

    1. The example above represents a message only by the presence (1) or absence (0) of each word, which in fact limits accuracy. The number of occurrences of each word can be used instead, or even the word's TF-IDF value.
    2. Above we solve for P(c|W), the probability that a message belongs to class c given the word combination W. When a message contains many words, accuracy becomes a problem. We can decompose this into a combined probability: compute P(c|wi) for each word individually, then take the top-n of those probabilities (for example n=10 or n=15) and combine them with the formula below (see the sketch after the formula):

p = p1*p2*p3*...*pn / (p1*p2*p3*...*pn + (1-p1)*(1-p2)*(1-p3)*...*(1-pn)), where p1 ... pn are the chosen top-n probabilities.
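A minimal sketch of this combination (the function name is hypothetical; the source's "top-n" is read here as the n largest values, though a common variant takes the n values farthest from the uninformative 0.5):

def combineTopN(wordProbs, n=10):
    """Combine the top-n per-word probabilities P(c|wi) into one score.

    wordProbs -- per-word probabilities P(spam|wi), one value per word
    n         -- how many values to keep, e.g. n=10 or n=15
    """
    top = sorted(wordProbs, reverse=True)[:n]  # the n largest probabilities
    prod = 1.0
    prodComplement = 1.0
    for p in top:
        prod *= p
        prodComplement *= 1.0 - p
    # p = p1*...*pn / (p1*...*pn + (1-p1)*...*(1-pn))
    return prod / (prod + prodComplement)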

