Machine Learning-Stanford: Learning note 6-Naive Bayes


Naive Bayes

Outline of this lecture:

1. Naive Bayes

- Naive Bayes event models

2. Neural networks (brief)

3. Lead-in to Support Vector Machines (SVM) – the maximum margin classifier

Review:

1. Naive Bayes

A generative learning algorithm that models P(x|y) (together with P(y)), rather than modeling P(y|x) directly.

Example: spam e-mail classification

The input is the text of an e-mail; the output is y ∈ {0,1}, where 1 means spam and 0 means non-spam.

Represent the message text as an input vector x:

1) x_i ∈ {0,1} indicates whether the i-th word of the dictionary appears in the message

2) the length of x is n, the number of words in the dictionary

3) this model is called the multivariate Bernoulli event model
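As a concrete sketch (the dictionary and message below are made-up illustrations, not from the lecture), the vector x can be built like this:

```python
# Build the multivariate Bernoulli feature vector for one message.
dictionary = ["a", "buy", "cheap", "hello", "meeting", "now"]  # n = 6 words
message = "buy cheap meds now buy"

words_in_message = set(message.split())
x = [1 if word in words_in_message else 0 for word in dictionary]
print(x)  # [0, 1, 1, 0, 0, 1] -- word order follows the dictionary
```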

Assuming that the x_i are conditionally independent given y (the Naive Bayes assumption), the probability of x given y factorizes as:

P(x|y) = ∏_{i=1..n} p(x_i|y)

By Bayes' rule, prediction picks the y that maximizes the posterior P(y|x):

arg max_y P(y|x) = arg max_y P(x|y)P(y) / P(x) = arg max_y P(x|y)P(y)
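A minimal sketch of this prediction step, assuming the parameters p(x_i|y) and p(y) have already been estimated (the numbers below are placeholders):

```python
import numpy as np

# Hypothetical fitted parameters for a 6-word dictionary.
phi_y = 0.3                                          # p(y=1)
phi_spam = np.array([0.1, 0.8, 0.7, 0.2, 0.1, 0.6])  # p(x_i=1 | y=1)
phi_ham  = np.array([0.3, 0.1, 0.1, 0.6, 0.5, 0.2])  # p(x_i=1 | y=0)

def predict(x):
    x = np.asarray(x)
    # Conditional independence: P(x|y) = prod_i p(x_i|y).
    px_spam = np.prod(phi_spam**x * (1 - phi_spam)**(1 - x))
    px_ham  = np.prod(phi_ham**x  * (1 - phi_ham)**(1 - x))
    # Bayes' rule: pick the y maximizing P(x|y)P(y); P(x) cancels.
    return 1 if px_spam * phi_y > px_ham * (1 - phi_y) else 0

print(predict([0, 1, 1, 0, 0, 1]))  # -> 1 (classified as spam)
```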

Variations of the algorithm:

1) Let x_i take multiple values, x_i ∈ {1,2,...,k}. As above, P(x|y) = ∏ p(x_i|y), but each p(x_i|y) is now a multinomial distribution rather than a Bernoulli distribution.

Example: using living area to predict whether a house will sell. Divide the area into discrete buckets: below 1000 gives x_i = 1, 1000–1500 gives x_i = 2, 1500–2000 gives x_i = 3, above 2000 gives x_i = 4 (see the sketch after this list).

2) As in the message (text) example above, let the vector x record the number of occurrences of each word, rather than just whether it appears.
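A small sketch of the housing-area discretization, with bin edges taken from the example above:

```python
import numpy as np

# Map continuous living area to the discrete values x_i in {1, 2, 3, 4}.
bins = [1000, 1500, 2000]                 # bucket edges from the example
areas = np.array([800, 1200, 1750, 2600])
x = np.digitize(areas, bins) + 1          # digitize returns 0..3; shift to 1..4
print(x)                                  # [1 2 3 4]
```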

Multinomial event model

Take the message example above and represent a message as a feature vector:

x^(i) = (x_1, x_2, ..., x_{n_i}), where n_i is the number of words in message i, and each x_j is an index into the dictionary giving the identity of the j-th word of the message.

If a message has 300 words, its feature vector x^(i) has length 300; if the dictionary has 50,000 words, each element x_j takes values in {1, ..., 50000}.

The joint probability P(x, y) under this model is:

P(x, y) = P(y) ∏_{j=1..n} p(x_j|y)

where n is the message length.

The understanding is that the message content follows some probability distribution; there is a random process generating these messages. The process: first y is determined (whether someone decides to send you spam), and then each of the message's 300 words is generated in turn from a probability distribution over the dictionary that depends on whether the message is spam.

Model parameters:

When someone decides to send you spam (y = 1), the probability of choosing word k at any position is:

φ_{k|y=1} = p(x_j = k | y = 1)

and similarly φ_{k|y=0} = p(x_j = k | y = 0) and φ_y = p(y = 1).

Given the training set, the maximum likelihood estimates are:

φ_{k|y=1} = ( Σ_{i=1..m} 1{y^(i) = 1} Σ_{j=1..n_i} 1{x_j^(i) = k} ) / ( Σ_{i=1..m} 1{y^(i) = 1} n_i )

with the analogous formula for φ_{k|y=0}.

In the first formula, the numerator sums over all messages labeled 1 and, within them, counts occurrences of word k; so the numerator is the total number of times word k appears across all spam messages in the training set. The denominator is the total length of all spam messages in the training set. The ratio is the fraction of spam text made up of word k, i.e. the probability of choosing word k when generating a spam message.

Applying Laplace smoothing, add 1 to the numerator and add the dictionary size |V| to the denominator (the number of values x_j can take):

φ_{k|y=1} = ( Σ_{i} 1{y^(i) = 1} Σ_{j} 1{x_j^(i) = k} + 1 ) / ( Σ_{i} 1{y^(i) = 1} n_i + |V| )
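A minimal sketch of this smoothed estimate, using made-up messages of word indices over a 5-word dictionary:

```python
import numpy as np

V = 5  # dictionary size |V| (50,000 in the lecture's example)

# Each message is a list of word indices in 1..V; label 1 = spam.
# These tiny messages are made up for illustration.
messages = [[2, 3, 2], [1, 4], [2, 2, 5], [1, 4, 4]]
labels   = [1,         0,      1,         0]

def estimate_phi(y_value):
    counts = np.ones(V)      # Laplace smoothing: every word starts with count 1
    total = V                # denominator starts at |V|
    for msg, y in zip(messages, labels):
        if y == y_value:
            for k in msg:
                counts[k - 1] += 1   # occurrences of word k in class y_value
            total += len(msg)        # add the message length
    return counts / total           # phi_{k|y}

print(estimate_phi(1))  # [1/11, 5/11, 2/11, 1/11, 2/11]; word 2 dominates spam
```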

In practice the multinomial event model usually performs better than the previous model, possibly because it takes the number of occurrences of each word into account. But there is still some debate about why.

Nonlinear classification algorithm

Example: in logistic regression, we predict 0 when the hypothesis outputs a value below 0.5 and predict 1 when it is above 0.5. Given a training set, logistic regression finds a straight line (via Newton's method or gradient descent) that separates the positive and negative samples reasonably well. But sometimes the data cannot be separated by a straight line, and we need an algorithm that can learn a nonlinear boundary.

A corollary from the previous lecture:

If x|y=1 ~ ExpFamily(η1) and x|y=0 ~ ExpFamily(η0), then P(y=1|x) is a logistic function.

That is, if the distribution of x|y belongs to the exponential family, the posterior distribution is a logistic function.
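As an illustration with one member of the exponential family (an assumed example, not from the lecture): with Gaussian class-conditionals of equal variance, the posterior computed by Bayes' rule is exactly a logistic function of x:

```python
import numpy as np
from scipy.stats import norm

# Assume x|y=1 ~ N(1, 1), x|y=0 ~ N(-1, 1), and P(y=1) = 0.5.
xs = np.linspace(-4, 4, 9)
p1 = norm.pdf(xs, loc=1) * 0.5
p0 = norm.pdf(xs, loc=-1) * 0.5
posterior = p1 / (p1 + p0)               # Bayes' rule: P(y=1|x)

# For these parameters the posterior works out to sigmoid(2x).
logistic = 1 / (1 + np.exp(-2 * xs))
print(np.allclose(posterior, logistic))  # True
```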

The Naive Bayes model also belongs to the exponential family, so it too yields a logistic, i.e. linear, classifier. A nonlinear classifier is described below.

2. Neural networks

Suppose the features are x0, x1, x2, x3, with x0 set to 1. A logistic regression unit is drawn as a circle (the compute node) with lines feeding into it: the node takes x0 and the other features as input and outputs hθ(x), a sigmoid function of the inputs. To find a nonlinear boundary, we need a way of expressing hypotheses that can output nonlinear dividing lines.

Putting several of the units drawn earlier together gives a neural network. The features feed into several sigmoid units, whose outputs in turn feed into another sigmoid unit that produces the final output. Denote the output values of the intermediate nodes by a1, a2, a3. These intermediate nodes are called a hidden layer, and a neural network can have multiple hidden layers.

Each intermediate node has its own vector of parameters:

a1 = g(θ1^T x), a2 = g(θ2^T x), a3 = g(θ3^T x), where g is the sigmoid function. The final output value is:

hθ(x) = g(θ^T a), where the vector a consists of a1, a2, a3.

One way to learn the parameters of the model is to define a cost function J(θ) and minimize J(θ) with gradient descent, which pushes the network's predictions as close as possible to the sample labels observed in the training set. In neural networks this method is called backpropagation.
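A minimal numpy sketch of the forward pass just described: three hidden sigmoid units feed one output sigmoid unit (the weights are random placeholders; learning them would use gradient descent with backpropagation):

```python
import numpy as np

def g(z):                                  # sigmoid activation
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([1.0, 0.5, -1.2, 0.7])        # x0 = 1 (intercept), then x1..x3

Theta = rng.normal(size=(3, 4))            # one parameter row per hidden unit
theta_out = rng.normal(size=3)             # parameters of the output unit

a = g(Theta @ x)                           # hidden-layer outputs a1, a2, a3
h = g(theta_out @ a)                       # final hypothesis h(x) in (0, 1)
print(a, h)
```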

3. Lead-in to Support Vector Machines – the maximum margin classifier

Another learning algorithm that can produce nonlinear classifiers. This lecture first introduces another class of linear classifiers; in the next lecture or the one after, the support vector machine idea will be extended with some ingenious changes so that it can produce nonlinear decision boundaries.

Two intuitions about classification:

1) Consider logistic regression and the quantity θ^T x:

Output 1 ⟺ θ^T x ≥ 0; output 0 ⟺ θ^T x < 0.

If θ^T x >> 0, the prediction y = 1 is made with high confidence; if θ^T x << 0, the prediction y = 0 is made with high confidence.

So we consider a classifier good if, for all i, θ^T x^(i) >> 0 whenever y^(i) = 1 and θ^T x^(i) << 0 whenever y^(i) = 0. That is, when fitting the parameters to the training set, the learning algorithm should not only classify correctly but also do so with high confidence.

2) Assume the training set is linearly separable, i.e. there is some straight line that separates it. Intuitively, we would choose a line that keeps a certain distance from both the positive and negative samples. The classifier's geometric margin, discussed formally below, captures this.

Changes of notation for the support vector machine:

Labels y ∈ {-1, +1}

The hypothesis h also outputs values in {-1, +1}

g(z) = 1 if z ≥ 0; -1 if z < 0

Previously we used the formula hθ(x) = g(θ^T x), with x0 = 1 and x an (n+1)-dimensional vector. Now drop both conventions and write h_{w,b}(x) = g(w^T x + b), where b plays the role of the old θ0 and w is the old θ with θ0 removed, so w is n-dimensional. Making the intercept b explicit simplifies the derivation of the support vector machine.

Functional margin:

The functional margin of a hyperplane (w, b) with respect to a specific training sample (x^(i), y^(i)) is defined as:

γ̂^(i) = y^(i) (w^T x^(i) + b)

The parameters (w, b) define a classifier; for example, they define a linear dividing line.

If y^(i) = 1, then to obtain a large functional margin we need w^T x^(i) + b >> 0;

if y^(i) = -1, then to obtain a large functional margin we need w^T x^(i) + b << 0.

If y^(i) (w^T x^(i) + b) > 0, the sample is classified correctly.

The functional margin of a hyperplane (w, b) with respect to the entire training set is defined as:

γ̂ = min_i γ̂^(i)

That is, the functional margin relative to the entire training set is the worst case over the per-sample functional margins (as noted above, the farther the boundary is from the samples, the better).

Geometric margin:

The geometric margin is defined as the distance from a training sample point to the dividing line determined by the hyperplane. For instance, the length of segment AB, from point A to the dividing line, is a geometric distance.

The unit vector perpendicular to the dividing line is w/||w||. Write the distance AB as γ^(i); gamma with a hat (γ̂) denotes the functional margin, while plain γ denotes the geometric margin. If point A is x^(i), then point B is:

x^(i) − γ^(i) · w/||w||

Since point B lies on the dividing line, it satisfies:

w^T (x^(i) − γ^(i) w/||w||) + b = 0

Solving for γ^(i) gives:

γ^(i) = (w^T x^(i) + b) / ||w|| = (w/||w||)^T x^(i) + b/||w||

This shows that for a training sample x^(i), its distance to the separating plane determined by the parameters w and b can be obtained from the formula above.

The derivation above assumed the sample is classified correctly. More generally, the geometric margin is defined as:

γ^(i) = y^(i) ( (w/||w||)^T x^(i) + b/||w|| )

This definition is similar to the functional margin; the difference is that the vector w is normalized. Again, we want the geometric margin to be as large as possible.

Conclusion: if ||w|| = 1, the functional margin equals the geometric margin. More generally, the geometric margin equals the functional margin divided by ||w||.

The geometric margin of a hyperplane (w, b) with respect to the entire training set is defined as:

γ = min_i γ^(i)

As with the functional margin, this takes the smallest geometric margin over the samples.
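A small sketch computing both margins for a toy linearly separable set (w, b, and the points are made up):

```python
import numpy as np

w = np.array([1.0, 1.0])
b = -1.0

X = np.array([[2.0, 2.0], [0.0, 3.0], [-1.0, 0.0], [0.5, -1.0]])
y = np.array([1, 1, -1, -1])

functional = y * (X @ w + b)                 # per-sample y^(i)(w^T x^(i) + b)
geometric = functional / np.linalg.norm(w)   # divide by ||w||

# Margin with respect to the whole training set: the worst (smallest) case.
print(functional.min(), geometric.min())     # 1.5  1.0606...
```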

The maximum margin classifier can be regarded as the predecessor of the support vector machine. It is a learning algorithm that chooses the specific w and b that maximize the geometric margin. Maximizing the margin is an optimization problem of the following form:

max_{γ, w, b} γ
subject to y^(i) (w^T x^(i) + b) ≥ γ for all i, with ||w|| = 1

That is, choose γ, w, b to make γ as large as possible, subject to the constraint that every sample has functional margin at least γ (the constraint ||w|| = 1 makes the functional margin equal the geometric margin).

The maximum margin classifier performs almost as well as logistic regression. Studying this classifier in depth will let us apply a clever trick that supports infinite-dimensional feature spaces and yields an effective nonlinear classifier.

