ML | Naive Bayes

What's Naive Bayes

In machine learning, Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features.

Naive Bayes is a popular (baseline) method for text categorization, the problem of judging documents as belonging to one category or another (such as spam or legitimate, sports or politics, etc.) with word frequencies as the features. With appropriate preprocessing, it is competitive in this domain with more advanced methods, including support vector machines.
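
To make the setup concrete, here is a minimal sketch of the spam-versus-legitimate case using scikit-learn's multinomial Naive Bayes over word counts; the documents and labels are invented purely for illustration and are not from the original text.

```python
# Text categorization with word frequencies as features (invented toy data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["win money now", "meeting at noon", "cheap money offer", "lunch meeting today"]
labels = ["spam", "legitimate", "spam", "legitimate"]

vectorizer = CountVectorizer()               # word frequencies as features
X = vectorizer.fit_transform(docs)

clf = MultinomialNB()                        # Naive Bayes over word counts
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["free money offer"])))  # ['spam']
```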

In simple terms, a Naive Bayes classifier assumes that the value of a particular feature is unrelated to the presence or absence of any other feature, given the class variable.

An advantage of Naive Bayes is that it requires only a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because the variables are assumed independent, only the variances of the variables for each class need to be determined, not the entire covariance matrix.

Abstractly, the probability model for a classifier is a conditional model

$P(C \vert F_1, \dots, F_n)\,$
over a dependent class variable C with a small number of outcomes or classes, conditional on several feature variables $F_1$ through $F_n$. The problem is that if the number of features n is large, or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable.

Using Bayes' theorem, this can be written as

$P(C \vert F_1, \dots, F_n) = \frac{P(C)\, P(F_1, \dots, F_n \vert C)}{P(F_1, \dots, F_n)}.\,$
In plain English, using Bayesian probability terminology, the above equation can be written

$\mbox{posterior} = \frac{\mbox{prior} \times \mbox{likelihood}}{\mbox{evidence}}.\,$

In practice, we are interested only in the numerator of that fraction, because the denominator does not depend on C and the values of the features $F_i$ are given, so the denominator is effectively constant. The numerator is equivalent to the joint probability model $P(C, F_1, \dots, F_n)$, which can be rewritten as follows, using repeated applications of the definition of conditional probability:

$\begin{align}
P(C, F_1, \dots, F_n) &= P(C)\, P(F_1, \dots, F_n \vert C)\\
&= P(C)\, P(F_1 \vert C)\, P(F_2, \dots, F_n \vert C, F_1)\\
&= P(C)\, P(F_1 \vert C)\, P(F_2 \vert C, F_1)\, P(F_3, \dots, F_n \vert C, F_1, F_2)\\
&= P(C)\, P(F_1 \vert C)\, P(F_2 \vert C, F_1)\, P(F_3 \vert C, F_1, F_2)\, P(F_4, \dots, F_n \vert C, F_1, F_2, F_3)\\
&= P(C)\, P(F_1 \vert C)\, P(F_2 \vert C, F_1) \cdots P(F_n \vert C, F_1, F_2, \dots, F_{n-1})
\end{align}$

Now the "Naive" conditional independence assumptions come into play: assume that each feature $ f_ I $ is conditionally independent of every other feature $ f_j $ for $ J \ neq I $ given the category C. this means that

$P(F_i \vert C, F_j) = P(F_i \vert C)\,,$
$P(F_i \vert C, F_j, F_k) = P(F_i \vert C)\,,$
$P(F_i \vert C, F_j, F_k, F_l) = P(F_i \vert C)\,$
and so on, for $i \neq j, k, l$. Thus, the joint model can be expressed as

$\begin{align}
P(C \vert F_1, \dots, F_n) &\varpropto P(C, F_1, \dots, F_n)\\
&\varpropto P(C)\, P(F_1 \vert C)\, P(F_2 \vert C)\, P(F_3 \vert C) \cdots\\
&\varpropto P(C) \prod_{i=1}^n P(F_i \vert C)\,.
\end{align}$
This means that under the above independence assumptions, the conditional distribution over the class variable C is:

$P(C \vert F_1, \dots, F_n) = \frac{1}{Z}\, P(C) \prod_{i=1}^n P(F_i \vert C)$
where the evidence $Z = P(F_1, \dots, F_n)$ is a scaling factor dependent only on $F_1, \dots, F_n$, that is, a constant if the values of the feature variables are known.

The Naive Bayes classifier combines this probability model with a decision rule. One common rule is to pick the hypothesis that is most probable; this is known as the maximum a posteriori or MAP decision rule. The corresponding classifier, a Bayes classifier, is the function $\mathrm{classify}$ defined as follows:

$\mathrm{classify}(f_1, \dots, f_n) = \underset{c}{\operatorname{argmax}}\ P(C = c) \displaystyle\prod_{i=1}^n P(F_i = f_i \vert C = c).$
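
As a toy illustration of this decision rule (the numbers below are made up, not taken from the text), the unnormalized scores $P(c)\prod_i P(f_i \vert c)$ can be computed directly, normalized by $Z$ if posterior probabilities are wanted, and the argmax taken without $Z$:

```python
# MAP decision rule on invented numbers: two classes, two observed feature values.
priors = {"spam": 0.4, "legitimate": 0.6}            # P(C = c)
likelihoods = {                                      # P(F_i = f_i | C = c) for the observed f_i
    "spam":       [0.8, 0.3],
    "legitimate": [0.1, 0.7],
}

scores = {}                                          # unnormalized: P(c) * prod_i P(f_i | c)
for c, prior in priors.items():
    score = prior
    for p in likelihoods[c]:
        score *= p
    scores[c] = score

Z = sum(scores.values())                             # evidence, constant across classes
posteriors = {c: s / Z for c, s in scores.items()}
prediction = max(scores, key=scores.get)             # the argmax does not need Z

print(posteriors)   # {'spam': 0.695..., 'legitimate': 0.304...}
print(prediction)   # spam
```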

All model parameters (i.e., class priors and feature probability distributions) can be approximated with relative frequencies from the training set. These are maximum likelihood estimates of the probabilities. A class's prior may be calculated either by assuming equiprobable classes (i.e., prior = 1/(number of classes)) or by estimating the class probability from the training set (i.e., (prior for a given class) = (number of samples in the class)/(total number of samples)). To estimate the parameters of a feature's distribution, one must assume a distribution or generate nonparametric models for the features from the training set.
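
A short sketch of the two ways of setting the class priors described above; the label set is invented for illustration.

```python
# Class priors: equiprobable vs. relative frequencies, from an invented label set.
from collections import Counter

y = ["spam", "legitimate", "spam", "spam"]
classes = sorted(set(y))
counts = Counter(y)

equiprobable_priors = {c: 1.0 / len(classes) for c in classes}
empirical_priors = {c: counts[c] / len(y) for c in classes}  # (samples in class) / (total samples)

print(equiprobable_priors)  # {'legitimate': 0.5, 'spam': 0.5}
print(empirical_priors)     # {'legitimate': 0.25, 'spam': 0.75}
```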

Algorithm

1. Calculate the class priors $P(C)$ from the training data. The evidence $Z = P(F_1, \dots, F_n)$ is constant across classes; if normalized posteriors are needed, it can be obtained by summing $P(C) \prod_{i=1}^{n} P(F_i \vert C)$ over all classes.

2. Assume a probability distribution for each feature, $P(F_i \vert C)$, and estimate its parameters from the training data.

When dealing with continuous data, a typical assumption is that the continuous values associated with each class are distributed according to a Gaussian distribution.

Another common technique for handling continuous values is to use binning to discretize the feature values, to obtain a new set of Bernoulli-distributed features.

In general, the distribution method is a better choice if there is a small amount of training data, or if the precise distribution of the data is known. The discretization method tends to do better with a large amount of training data, because it learns to fit the distribution of the data. Since Naive Bayes is typically used when a large amount of data is available (as more computationally expensive models can generally achieve better accuracy), the discretization method is generally preferred over the distribution method.

3. Calculate the probability (or the unnormalized score) of each class and take the class with the highest probability, as sketched below.
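
Below is a minimal end-to-end sketch of these three steps under the Gaussian assumption from step 2. The data, class labels, and function names are invented for illustration, and the small variance floor is added only for numerical stability; this is a sketch, not a definitive implementation.

```python
# Gaussian Naive Bayes, end to end, on an invented two-feature, two-class data set.
import numpy as np

def fit(X, y):
    """Steps 1-2: class priors plus a per-class mean and variance for each feature."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        prior = len(Xc) / len(X)                        # relative class frequency
        mean = Xc.mean(axis=0)
        var = Xc.var(axis=0) + 1e-9                     # small floor avoids division by zero
        params[c] = (prior, mean, var)
    return params

def predict(params, x):
    """Step 3: score each class with log P(c) + sum_i log N(x_i; mean_i, var_i), take the argmax."""
    best_class, best_score = None, -np.inf
    for c, (prior, mean, var) in params.items():
        log_likelihood = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
        score = np.log(prior) + log_likelihood          # log-space avoids underflow
        if score > best_score:
            best_class, best_score = c, score
    return best_class

X = np.array([[180.0, 80.0], [175.0, 75.0], [160.0, 55.0], [165.0, 60.0]])
y = np.array(["a", "a", "b", "b"])
print(predict(fit(X, y), np.array([170.0, 65.0])))      # 'b': closest to the second class's means
```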
