(Notes) Stanford Machine Learning: Generative Learning Algorithms


Contents of this lecture

1. Generative learning algorithms

2. Gaussian discriminant analysis (GDA)

3. Naive Bayes

4. Laplace smoothing

1. Generative learning algorithms vs. discriminative learning algorithms

Discriminative learning algorithms: learn $p(y \mid x)$ directly, or learn a hypothesis that directly outputs 0 or 1. Logistic regression is an example of a discriminative learning algorithm.

Generative learning algorithms: model $p(x \mid y)$, the probability of the features given the class. The class prior $p(y)$ is modeled as well.

The posterior probability is then obtained from Bayes' rule:

$$p(y \mid x) = \frac{p(x \mid y)\,p(y)}{p(x)}$$

where

$$p(x) = p(x \mid y=1)\,p(y=1) + p(x \mid y=0)\,p(y=0)$$
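As a quick numeric illustration (the numbers here are invented for this sketch):

```python
# Bayes' rule with made-up numbers: class-conditional likelihoods and a prior.
p_x_given_y1 = 0.8   # p(x | y = 1), illustrative value
p_x_given_y0 = 0.1   # p(x | y = 0), illustrative value
p_y1 = 0.3           # class prior p(y = 1)

# Total probability of x, then the posterior p(y = 1 | x)
p_x = p_x_given_y1 * p_y1 + p_x_given_y0 * (1 - p_y1)
posterior = p_x_given_y1 * p_y1 / p_x
print(posterior)  # ~0.774: x is much more probable under class 1
```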

2. Gaussian discriminant analysis

Gaussian discriminant analysis is a generative learning algorithm.

First, the two assumptions of Gaussian discriminant analysis:

(1) The input features $x$ are continuous-valued, $x \in \mathbb{R}^n$.

(2) $p(x \mid y)$ follows a multivariate Gaussian distribution.

Multivariate Gaussian distribution

When a random variable $z$ follows a multivariate Gaussian distribution, we write $z \sim \mathcal{N}(\mu, \Sigma)$, and the probability density function of $z$ is

$$p(z) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(z-\mu)^{\top}\Sigma^{-1}(z-\mu)\right)$$

The vector $\mu \in \mathbb{R}^n$ is the mean of the Gaussian distribution, and the matrix $\Sigma \in \mathbb{R}^{n \times n}$ is its covariance matrix, $\Sigma = \mathrm{E}\!\left[(z-\mu)(z-\mu)^{\top}\right]$.

The diagonal entries of the covariance matrix control how spread out or peaked the density is along each axis, and the off-diagonal entries control the direction in which the density is tilted.

The mean $\mu$ controls the location of the center of the density.
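As an illustrative sketch (not part of the original notes), the density above can be evaluated directly with NumPy:

```python
import numpy as np

def gaussian_pdf(z, mu, sigma):
    """Density of a multivariate Gaussian N(mu, sigma) at the point z."""
    n = mu.shape[0]
    diff = z - mu
    norm = (2 * np.pi) ** (n / 2) * np.linalg.det(sigma) ** 0.5
    return np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm

# A 2-D Gaussian with correlated components (illustrative values):
mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])  # the off-diagonal 0.8 tilts the density
print(gaussian_pdf(np.array([0.0, 0.0]), mu, sigma))  # peak value ~0.265
```

Shrinking the diagonal entries makes the density more sharply peaked; setting the off-diagonal entries to 0 aligns its contours with the axes.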

Gaussian discriminant analysis model

Suppose $y$ follows a Bernoulli distribution, $y \sim \mathrm{Bernoulli}(\phi)$, and model $p(x \mid y)$ with a pair of Gaussians:

$$x \mid y=0 \sim \mathcal{N}(\mu_0, \Sigma), \qquad x \mid y=1 \sim \mathcal{N}(\mu_1, \Sigma)$$

The parameters of this model are $\phi$, $\mu_0$, $\mu_1$, and $\Sigma$ (the two Gaussians share a single covariance matrix).

We estimate these parameters by maximum likelihood. For a generative algorithm, the likelihood of the parameters is

$$L(\phi, \mu_0, \mu_1, \Sigma) = \prod_{i=1}^{m} p\!\left(x^{(i)}, y^{(i)}\right) = \prod_{i=1}^{m} p\!\left(x^{(i)} \mid y^{(i)}\right) p\!\left(y^{(i)}\right)$$

This formula is called the joint likelihood.

For discriminative learning algorithms, the likelihood of the parameters is instead defined as

$$L(\theta) = \prod_{i=1}^{m} p\!\left(y^{(i)} \mid x^{(i)}; \theta\right)$$

This formula is called the conditional likelihood.

Therefore, for a generative learning algorithm we maximize the joint likelihood of the parameters, while for a discriminative learning algorithm we maximize the conditional likelihood.

Maximizing the joint likelihood gives the parameter estimates

$$\phi = \frac{1}{m}\sum_{i=1}^{m} 1\{y^{(i)}=1\}$$

$$\mu_0 = \frac{\sum_{i=1}^{m} 1\{y^{(i)}=0\}\,x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)}=0\}}, \qquad \mu_1 = \frac{\sum_{i=1}^{m} 1\{y^{(i)}=1\}\,x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)}=1\}}$$

$$\Sigma = \frac{1}{m}\sum_{i=1}^{m} \left(x^{(i)} - \mu_{y^{(i)}}\right)\left(x^{(i)} - \mu_{y^{(i)}}\right)^{\top}$$

After the parameters are estimated, the prediction for a new sample $x$ is

$$\arg\max_{y}\, p(y \mid x) = \arg\max_{y} \frac{p(x \mid y)\,p(y)}{p(x)} = \arg\max_{y}\, p(x \mid y)\,p(y)$$
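A minimal NumPy sketch of GDA training and prediction under the formulas above, assuming a data matrix X of shape (m, n) and binary labels y (illustrative code, not from the original notes):

```python
import numpy as np

def gda_fit(X, y):
    """Maximum-likelihood GDA estimates; X is (m, n), y is (m,) in {0, 1}."""
    m = len(y)
    phi = y.mean()                      # p(y = 1)
    mu0 = X[y == 0].mean(axis=0)        # mean of class-0 samples
    mu1 = X[y == 1].mean(axis=0)        # mean of class-1 samples
    # Shared covariance: average outer product of samples centered at their class mean
    centered = X - np.where(y[:, None] == 1, mu1, mu0)
    sigma = centered.T @ centered / m
    return phi, mu0, mu1, sigma

def gda_predict(x, phi, mu0, mu1, sigma):
    """arg max_y p(x|y) p(y); the shared Gaussian normalizer cancels."""
    def log_joint(mu, prior):
        d = x - mu
        return -0.5 * d @ np.linalg.solve(sigma, d) + np.log(prior)
    return int(log_joint(mu1, phi) > log_joint(mu0, 1 - phi))
```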

The relationship between Gaussian discriminant analysis and logistic regression

If $p(x \mid y)$ is Gaussian as assumed above, then $p(y=1 \mid x)$ is necessarily a logistic function of $x$; the converse, however, is not true.

This shows that the assumption that $x \mid y$ is Gaussian is stronger than the assumption that $p(y=1 \mid x)$ is logistic.
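To see why, expand $p(y=1 \mid x)$ with Bayes' rule; because the two Gaussians share $\Sigma$, the quadratic terms in $x$ cancel and the exponent is linear in $x$ (a sketch, with $\theta_0$ collecting the constant terms):

```latex
\begin{align*}
p(y=1 \mid x)
  &= \frac{p(x \mid y=1)\,\phi}{p(x \mid y=1)\,\phi + p(x \mid y=0)\,(1-\phi)} \\
  &= \frac{1}{1 + \frac{1-\phi}{\phi}\exp\!\left(\frac{1}{2}(x-\mu_1)^{\top}\Sigma^{-1}(x-\mu_1)
     - \frac{1}{2}(x-\mu_0)^{\top}\Sigma^{-1}(x-\mu_0)\right)} \\
  &= \frac{1}{1 + \exp\!\left(-\theta^{\top}x - \theta_0\right)},
  \qquad \theta = \Sigma^{-1}(\mu_1 - \mu_0)
\end{align*}
```

The $x^{\top}\Sigma^{-1}x$ terms cancel precisely because both classes share one covariance matrix, which is what leaves the logistic form.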

So how should we choose between the GDA model and the logistic regression model?

GDA makes the stronger assumption that the data follows a Gaussian distribution. When that assumption is correct or approximately correct, GDA performs better than logistic regression, because the algorithm exploits more information about the data: it is told that the data is Gaussian.

Conversely, if we are uncertain about the distribution of the data, logistic regression is the better choice: its weaker assumptions give it better robustness, and it can still obtain good results when the data distribution is uncertain.

Going further, if $x \mid y=0$ and $x \mid y=1$ both belong to the same exponential family (with different natural parameters), then $p(y=1 \mid x)$ is again necessarily a logistic function, and again the converse is not true.

It turns out that an advantage of generative learning algorithms is that they need less data than discriminative learning algorithms: even with a particularly small amount of data, a generative learning algorithm can still fit a good model.

3. Naive Bayes

Naive Bayes is also a generative learning algorithm.

Consider a spam classification problem; this is a binary classification problem.

So how do we construct a feature vector $x$ for an email?

The common practice is to build a dictionary. Suppose the dictionary contains 50,000 words; then for each word that appears in the email we set the corresponding position of the vector to 1, and for each dictionary word that does not appear in the email we set the corresponding position to 0. For each email we thus obtain a feature vector $x \in \{0,1\}^{50000}$.
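A minimal sketch of this featurization (the five-word dictionary is invented for illustration; the notes assume 50,000 words):

```python
# Binary bag-of-words featurization over a fixed dictionary.
dictionary = ["buy", "cheap", "hello", "meeting", "now"]  # toy stand-in for 50,000 words
word_index = {w: i for i, w in enumerate(dictionary)}

def featurize(email_text):
    """Return x with x[i] = 1 iff dictionary word i appears in the email."""
    x = [0] * len(dictionary)
    for word in email_text.lower().split():
        if word in word_index:
            x[word_index[word]] = 1
    return x

print(featurize("Buy cheap meds now"))  # [1, 1, 0, 0, 1]
```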

Having settled the representation of the feature vector, how do we model $p(x \mid y)$?

If we modeled $x$ directly with a multinomial distribution over all of its possible values, $x$ could take $2^{50000}$ values and we would need $2^{50000}-1$ parameters; the number of parameters is far too large.

So naive Bayes makes a very strong assumption: given $y$, the $x_i$ are conditionally independent of each other ($x_i$ indicates whether the $i$-th dictionary word appears in the email).

$$p(x_1, \ldots, x_{50000} \mid y) = p(x_1 \mid y)\,p(x_2 \mid y, x_1)\cdots p(x_{50000} \mid y, x_1, \ldots, x_{49999}) \quad \text{(chain rule)}$$

$$= \prod_{i=1}^{50000} p(x_i \mid y) \quad \text{(the strong naive Bayes assumption)}$$

Obviously the naive Bayes assumption can hardly ever hold exactly, but it turns out that naive Bayes is nonetheless a very effective algorithm.

The parameters of the model:

$$\phi_{i|y=1} = p(x_i = 1 \mid y = 1), \qquad \phi_{i|y=0} = p(x_i = 1 \mid y = 0), \qquad \phi_y = p(y = 1)$$

The joint likelihood function is

$$L(\phi_y, \phi_{i|y=0}, \phi_{i|y=1}) = \prod_{i=1}^{m} p\!\left(x^{(i)}, y^{(i)}\right)$$

Maximizing the likelihood (taking partial derivatives and setting them equal to 0) gives the parameter estimates

$$\phi_{j|y=1} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}, \qquad \phi_{j|y=0} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}}, \qquad \phi_y = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}{m}$$

So for a new email $x$, the predicted value of $y$ follows from

$$p(y=1 \mid x) = \frac{\left(\prod_{i} p(x_i \mid y=1)\right) p(y=1)}{\left(\prod_{i} p(x_i \mid y=1)\right) p(y=1) + \left(\prod_{i} p(x_i \mid y=0)\right) p(y=0)}$$
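A minimal NumPy sketch of these estimates and the prediction rule, assuming a binary feature matrix X of shape (m, n) and labels y (illustrative code; the Laplace smoothing of section 4 is deliberately omitted here):

```python
import numpy as np

def nb_fit(X, y):
    """Maximum-likelihood naive Bayes estimates (no smoothing yet)."""
    phi_y = y.mean()                 # p(y = 1)
    phi_1 = X[y == 1].mean(axis=0)   # p(x_j = 1 | y = 1) for each word j
    phi_0 = X[y == 0].mean(axis=0)   # p(x_j = 1 | y = 0)
    return phi_y, phi_1, phi_0

def nb_predict(x, phi_y, phi_1, phi_0):
    """p(y = 1 | x) for a binary vector x via the product rule."""
    # With 50,000 features one would sum log-probabilities instead to avoid underflow.
    lik1 = np.prod(np.where(x == 1, phi_1, 1 - phi_1)) * phi_y
    lik0 = np.prod(np.where(x == 1, phi_0, 1 - phi_0)) * (1 - phi_y)
    return lik1 / (lik1 + lik0)
```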

What is the difference between the Gaussian discriminant analysis model and the naive Bayes model?

When the random variable $x$ takes continuous values, the GDA model can be used.

When the random variable $x$ takes discrete values, the naive Bayes model can be used.

In the email example above, whether a message is spam is a binary classification problem, so $p(y)$ is modeled as a Bernoulli distribution.

Given $y$, naive Bayes assumes the words appear independently of one another, and whether each word appears is itself a binary event, so each $p(x_i \mid y)$ is also modeled as a Bernoulli distribution.

In the GDA model we likewise assume a binary classification problem, and $p(y)$ is likewise modeled as a Bernoulli distribution.

But given $y$, the value of $x$ is continuous, so we model $p(x \mid y)$ as a Gaussian distribution.

In the email example above there is a problem: if a new email contains a word (say the $j$-th dictionary word) that never appeared in any training email, then when we predict whether that email is spam,

the estimates give $\phi_{j|y=1} = 0$ and $\phi_{j|y=0} = 0$, so both products $\prod_i p(x_i \mid y=1)$ and $\prod_i p(x_i \mid y=0)$ are 0 and the posterior evaluates to $\frac{0}{0}$.

The model then fails.

In other words, just because a feature never appeared in a finite training set, it is not sound to conclude that the probability of that feature appearing later is 0.

The fix is to use Laplace smoothing.

4. Laplace Smoothing

Suppose a random variable $z$ can take $k$ different values; the smoothed estimate is

$$\phi_j = \frac{\sum_{i=1}^{m} 1\{z^{(i)} = j\} + 1}{m + k}$$

That is, we add 1 to the numerator and $k$ to the denominator, which avoids zero (and hence $0/0$) estimates when predicting values that have never been seen. For example, if $z \in \{1, 2, 3\}$ and the value 3 never occurs among $m = 3$ samples, the smoothed estimate is $\phi_3 = (0+1)/(3+3) = 1/6$ rather than 0.

In the email problem above, the smoothed estimates become

$$\phi_{j|y=1} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\} + 1}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\} + 2}$$

and similarly

$$\phi_{j|y=0} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\} + 1}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\} + 2}$$
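Applied to the naive Bayes sketch from section 3, the change is one line per class-conditional estimate (again illustrative; the features are binary, so $k = 2$):

```python
import numpy as np

def nb_fit_smoothed(X, y):
    """Naive Bayes estimates with Laplace smoothing (binary features, k = 2)."""
    m1 = int((y == 1).sum())
    m0 = int((y == 0).sum())
    phi_y = y.mean()
    phi_1 = (X[y == 1].sum(axis=0) + 1) / (m1 + 2)  # never exactly 0 or 1
    phi_0 = (X[y == 0].sum(axis=0) + 1) / (m0 + 2)
    return phi_y, phi_1, phi_0
```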

End of lecture.
