[Machine Learning] The Naive Bayes Algorithm (Naive Bayes)

Many situations in everyday life call for classification, such as classifying news articles or diagnosing patients.

This article introduces the naive Bayes classifier (Naive Bayes classifier), a simple and effective classification algorithm in common use.

I. An example of patient classification

Let me start with an example; you will see that the Bayes classifier is not hard to understand at all.

A hospital saw six outpatients in the morning, as shown in the following table.

Symptom      Occupation             Disease

Sneezing     Nurse                  Cold
Sneezing     Farmer                 Allergy
Headache     Construction worker    Concussion
Headache     Construction worker    Cold
Sneezing     Teacher                Cold
Headache     Teacher                Concussion

Now a seventh patient arrives: a construction worker who is sneezing. What is the probability that he has a cold?

According to Bayes' theorem:

P(A|B) = P(B|A) P(A) / P(B)

we can derive:

P(Cold | Sneezing × Construction worker)
= P(Sneezing × Construction worker | Cold) × P(Cold) / P(Sneezing × Construction worker)

Assuming that the two features "sneezing" and "construction worker" are independent, the equation above becomes

P(Cold | Sneezing × Construction worker)
= P(Sneezing | Cold) × P(Construction worker | Cold) × P(Cold) / ( P(Sneezing) × P(Construction worker) )

These terms can be read off the table, giving

P(Cold | Sneezing × Construction worker)
= (0.66 × 0.33 × 0.5) / (0.5 × 0.33)
= 0.66

So the sneezing construction worker has about a 66% chance of having a cold. In the same way, you can calculate the probability that he has an allergy or a concussion. By comparing these probabilities, you can tell which disease he most likely has.

This is the basic method of the Bayes classifier: based on statistical data and a set of observed features, compute the probability of each category and assign the most probable one.
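
To make the calculation concrete, here is a minimal Python sketch that reproduces the 0.66 figure from the six records above (the record list and helper function are illustrative, not part of the original example):

# Minimal sketch of the patient example: estimate the conditional and prior
# probabilities from the six records, then apply Bayes' theorem.
records = [
    ("sneezing", "nurse", "cold"),
    ("sneezing", "farmer", "allergy"),
    ("headache", "construction worker", "concussion"),
    ("headache", "construction worker", "cold"),
    ("sneezing", "teacher", "cold"),
    ("headache", "teacher", "concussion"),
]

def p(predicate):
    """Fraction of records satisfying a predicate (an empirical probability)."""
    return sum(1 for r in records if predicate(r)) / len(records)

p_cold = p(lambda r: r[2] == "cold")                                                  # 0.5
p_sneeze_given_cold = p(lambda r: r[0] == "sneezing" and r[2] == "cold") / p_cold     # ~0.66
p_worker_given_cold = p(lambda r: r[1] == "construction worker" and r[2] == "cold") / p_cold  # ~0.33
p_sneeze = p(lambda r: r[0] == "sneezing")                                            # 0.5
p_worker = p(lambda r: r[1] == "construction worker")                                 # ~0.33

posterior = (p_sneeze_given_cold * p_worker_given_cold * p_cold) / (p_sneeze * p_worker)
print(posterior)  # ~0.666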

II. The formula of the naive Bayes classifier

Suppose an individual has n features F1, F2, ..., Fn, and there are m categories C1, C2, ..., Cm. The Bayes classifier predicts the category with the largest probability, that is, the category C that maximizes the following expression:

P(C | F1 F2 ... Fn)
= P(F1 F2 ... Fn | C) P(C) / P(F1 F2 ... Fn)

Since P(F1 F2 ... Fn) is the same for every category, it can be omitted, and the problem becomes finding the category that maximizes

P(F1 F2 ... Fn | C) P(C)

The naive Bayes classifier goes one step further and assumes that all features are independent of each other, so

P(F1 F2 ... Fn | C) P(C)
= P(F1 | C) P(F2 | C) ... P(Fn | C) P(C)

Every term on the right-hand side can be estimated from the statistics, so the probability of each category can be computed and the most probable category chosen.

Although the assumption that "all features are independent of each other" rarely holds in reality, it greatly simplifies the calculation, and studies have shown that it has little impact on the accuracy of the classification results.
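
The decision rule above fits in a few lines of code. The following is a minimal sketch, assuming the priors P(C) and the conditional probabilities P(Fi | C) have already been estimated from the data; all names and the toy numbers in the demo are illustrative:

# Naive Bayes decision rule: pick the category C that maximizes
# P(F1|C) * P(F2|C) * ... * P(Fn|C) * P(C).

def classify(features, priors, conditionals):
    """
    features:     list of observed feature values [f1, ..., fn]
    priors:       dict category -> P(C)
    conditionals: dict category -> list of dicts, one per feature,
                  each mapping a feature value -> P(Fi = value | C)
    """
    best_category, best_score = None, 0.0
    for category, prior in priors.items():
        score = prior
        for i, value in enumerate(features):
            score *= conditionals[category][i].get(value, 0.0)
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Tiny demo with two categories and one feature (toy numbers):
priors = {"cold": 0.5, "other": 0.5}
conditionals = {
    "cold":  [{"sneezing": 0.66, "headache": 0.33}],
    "other": [{"sneezing": 0.33, "headache": 0.66}],
}
print(classify(["sneezing"], priors, conditionals))  # "cold"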

Below are two examples of how the naive Bayes classifier is used.

III. An example of account classification

This example is excerpted from Zhang Yang's article on naive Bayes classification in his "Algorithm Grocery Store" series.

According to sampling statistics from a community website, 89% of its 10,000 accounts are real accounts (category C0) and 11% are fake accounts (category C1).

P(C0) = 0.89
P(C1) = 0.11

Next, we want to use the statistics to judge whether a particular account is real. The judgment is based on the following three features:

F1: number of posts / number of days since registration
F2: number of friends / number of days since registration
F3: whether the account uses a real profile picture (1 if real, 0 if not)

The account in question has:

F1 = 0.1
F2 = 0.2
F3 = 0

Is this account a real account or a fake account?

The method is to use the naive Bayes classifier to compute, for each category, the value of the following expression:

P(F1 | C) P(F2 | C) P(F3 | C) P(C)

Although these values can all be obtained from the statistics, there is a problem: F1 and F2 are continuous variables, so it is not appropriate to estimate probabilities at a single exact value.

One technique is to turn a continuous value into a discrete one and compute the probability of an interval instead. For example, F1 can be split into the three intervals [0, 0.05], (0.05, 0.2), and [0.2, +∞), and the probability of each interval estimated. In our example, F1 equals 0.1 and falls in the second interval, so the probability of that interval is used in the calculation.
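
A minimal sketch of that discretization step, assuming the cut points 0.05 and 0.2 above (boundary handling exactly at the cut points is glossed over here):

import bisect

# Hypothetical bin edges for F1, matching the intervals in the text:
# [0, 0.05], (0.05, 0.2), [0.2, +inf)
edges = [0.05, 0.2]

def bin_index(value, edges):
    """Return the index of the interval that `value` falls into."""
    return bisect.bisect_left(edges, value)

print(bin_index(0.1, edges))  # 1 -> the second interval, as in the example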

From the statistics we obtain:

P(F1 | C0) = 0.5,  P(F1 | C1) = 0.1
P(F2 | C0) = 0.7,  P(F2 | C1) = 0.2
P(F3 | C0) = 0.2,  P(F3 | C1) = 0.9

So

P(F1 | C0) P(F2 | C0) P(F3 | C0) P(C0)
= 0.5 × 0.7 × 0.2 × 0.89
= 0.0623

P(F1 | C1) P(F2 | C1) P(F3 | C1) P(C1)
= 0.1 × 0.2 × 0.9 × 0.11
= 0.00198

Although the user does not use a real profile picture, the probability that the account is real is more than 30 times the probability that it is fake, so we judge this account to be real.
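
A quick check of this comparison, plugging in the statistics quoted above (a sketch; the variable names are illustrative):

# Posterior-proportional scores for the account example.
p_c0, p_c1 = 0.89, 0.11   # priors: real account, fake account

# P(Fi | C) values quoted in the text
p_f_given_c0 = [0.5, 0.7, 0.2]
p_f_given_c1 = [0.1, 0.2, 0.9]

score_real = p_c0
for p in p_f_given_c0:
    score_real *= p          # 0.0623

score_fake = p_c1
for p in p_f_given_c1:
    score_fake *= p          # 0.00198

print(score_real, score_fake, score_real / score_fake)  # ratio is roughly 31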

IV. An example of gender classification

This example is excerpted from Wikipedia and shows another way to handle continuous variables.

The following is a set of statistical data on human body characteristics.

Gender   Height (feet)   Weight (lb)   Foot size (inches)

Male     6               180           12
Male     5.92            190           11
Male     5.58            170           12
Male     5.92            165           10
Female   5               100           6
Female   5.5             150           8
Female   5.42            130           7
Female   5.75            150           9

Given a person who is 6 feet tall, weighs 130 pounds, and has a foot size of 8 inches, is this person male or female?

Based on the naive Bayes classifier, we compute the value of the following expression for each gender:

P(Height | Gender) × P(Weight | Gender) × P(Foot size | Gender) × P(Gender)

The difficulty here is that height, weight, and foot size are continuous variables, so we cannot estimate probabilities at discrete values. And because the sample is so small, we cannot split the values into intervals either. What can be done?

At this point, we can assume that within each gender, height, weight, and foot size are normally distributed, and estimate the mean and variance of each from the sample, which gives the density function of the normal distribution. With the density function, we can plug in a value and obtain the density at that point.

For example, male height follows a normal distribution with mean 5.855 and variance 0.035. So the relative likelihood of a male being 6 feet tall is 1.5789 (a value greater than 1 is not a problem, since this is a value of the density function and is used only to reflect the relative likelihood of each value).

With this data, the gender classification can be calculated.

P(Height = 6 | Male) × P(Weight = 130 | Male) × P(Foot size = 8 | Male) × P(Male)
= 6.1984 × 10^-9

P(Height = 6 | Female) × P(Weight = 130 | Female) × P(Foot size = 8 | Female) × P(Female)
= 5.3778 × 10^-4

The female value is far larger than the male value (by a factor of roughly 87,000), so we judge this person to be female.
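
As a sketch of this Gaussian variant, the figures above can be reproduced from the table using the sample mean and unbiased sample variance of each feature (equal priors of 0.5 are assumed, since the table has four males and four females; the function and variable names are illustrative):

import math

# Training data from the table: (height in feet, weight in lb, foot size in inches)
males   = [(6.0, 180, 12), (5.92, 190, 11), (5.58, 170, 12), (5.92, 165, 10)]
females = [(5.0, 100, 6), (5.5, 150, 8), (5.42, 130, 7), (5.75, 150, 9)]

def mean_var(values):
    """Sample mean and unbiased sample variance."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / (len(values) - 1)
    return m, v

def normal_pdf(x, m, v):
    """Density of a normal distribution with mean m and variance v."""
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

def score(sample, data, prior):
    """prior * product over features of the Gaussian density at the sample value."""
    s = prior
    for i, x in enumerate(sample):
        m, v = mean_var([row[i] for row in data])
        s *= normal_pdf(x, m, v)
    return s

sample = (6.0, 130, 8)
print(normal_pdf(6.0, *mean_var([r[0] for r in males])))  # ~1.5789, the height density quoted above
print(score(sample, males, 0.5))    # ~6.2e-09
print(score(sample, females, 0.5))  # ~5.4e-04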
