Machine Learning: Discriminative Models and Generative Models


I. Introduction

This document is based on Andrew Ng's machine learning course (http://cs229.stanford.edu).

In the supervised learning models seen previously, we used the training set to directly model the conditional probability p(y | x; θ). For example, logistic regression models p(y | x; θ) with h_θ(x) = g(θ^T x), where g(z) is the sigmoid function. Suppose we have a classification problem: distinguish elephants (y = 1) from dogs (y = 0) based on some features of the animals. Given such a dataset, a model like logistic regression tries to find a straight line, the decision boundary, that separates the elephants from the dogs. For a new sample, the model then checks which side of the decision boundary the sample's features fall on to obtain the classification result.
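To make this concrete, here is a minimal Python sketch of the logistic regression hypothesis; the function names are illustrative and NumPy is assumed:

```python
import numpy as np

def sigmoid(z):
    # The logistic function g(z) = 1 / (1 + e^(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x):
    # h_theta(x) = g(theta^T x) is interpreted as p(y = 1 | x; theta);
    # the decision boundary is the set of x where theta^T x = 0.
    p = sigmoid(theta @ x)
    return 1 if p >= 0.5 else 0
```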

Now consider another modeling approach. First, based on the elephant samples in the training set, we build a model of what elephants look like; based on the dog samples, we build a model of what dogs look like. Then, for a new animal sample, we match it against the elephant model to see how probable it is, match it against the dog model to see how probable it is, and choose whichever class gives the larger probability.

A discriminative model directly models the conditional probability p(y | x; θ). Common discriminative models include linear regression, linear discriminant analysis, SVMs, and neural networks.

A generative model instead models the joint distribution p(x, y) of x and y, obtains p(y_i | x) through Bayes' formula, and then selects the y_i with the largest p(y_i | x), that is:

\arg\max_{y} p(y \mid x) = \arg\max_{y} \frac{p(x \mid y) \, p(y)}{p(x)} = \arg\max_{y} p(x \mid y) \, p(y)
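A minimal sketch of this decision rule, assuming the class log-densities log p(x | y) and log-priors log p(y) have already been estimated somehow (all names here are illustrative):

```python
import numpy as np

def classify(log_px_given_y, log_py, x):
    # Generative decision rule: argmax_y p(x | y) p(y). Working in
    # log space avoids underflow; p(x) is omitted because it is the
    # same for every class and cancels in the argmax.
    scores = [log_px_given_y(x, y) + log_py[y] for y in (0, 1)]
    return int(np.argmax(scores))
```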

Common generative models include the hidden Markov model (HMM), the naive Bayes model, the Gaussian mixture model (GMM), and LDA.

 

II. Gaussian Discriminant Analysis

Gaussian discriminant analysis (GDA) is a generative model. In GDA, we assume that p(x | y) follows a multivariate normal distribution, which is described first.

  2.1 The Multivariate Normal Distribution

An n-dimensional multivariate normal (multivariate Gaussian) distribution is parameterized by a mean vector \mu \in \mathbb{R}^n and a covariance matrix \Sigma \in \mathbb{R}^{n \times n}. Its probability density is:

p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

An intuitive picture of the probability density when the mean vector is 2-dimensional:

[Figure: surfaces of a 2-dimensional Gaussian density with mean 0 and covariance Σ = I (left), Σ = 0.6I (middle), and Σ = 2I (right).]

The left graph shows mean 0 and covariance matrix Σ = I; the middle graph shows mean 0 and Σ = 0.6I; the right graph shows mean 0 and Σ = 2I. Observe that the larger the covariance, the more spread out and flatter the density; the smaller the covariance, the more concentrated and peaked the density.
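Assuming SciPy is available, this flattening effect can be checked numerically by evaluating the density at its peak (the mean) for the three covariance matrices from the figure:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.zeros(2)
for scale in (1.0, 0.6, 2.0):  # Sigma = I, 0.6I, 2I, as in the figure
    rv = multivariate_normal(mean=mu, cov=scale * np.eye(2))
    # The density at the mean is the peak height of the bell:
    # larger covariance -> lower, flatter peak.
    print(f"Sigma = {scale}I: peak density = {rv.pdf(mu):.4f}")
```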

 

  2.2 Gaussian Discriminant Analysis Model

If the input features x of the training set are continuous real values, Gaussian discriminant analysis can be used. Assume that y and p(x | y) are distributed as follows:

y \sim \mathrm{Bernoulli}(\phi)
x \mid y = 0 \sim \mathcal{N}(\mu_0, \Sigma)
x \mid y = 1 \sim \mathcal{N}(\mu_1, \Sigma)

Writing out the distributions of this model:

p(y) = \phi^{y} (1 - \phi)^{1 - y}
p(x \mid y = 0) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) \right)
p(x \mid y = 1) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) \right)

The parameters of the model are \phi, \Sigma, \mu_0, and \mu_1. The log-likelihood function (built from the joint distribution of x and y) is:

\ell(\phi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{m} p(x^{(i)}, y^{(i)}; \phi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{m} p(x^{(i)} \mid y^{(i)}; \mu_0, \mu_1, \Sigma) \, p(y^{(i)}; \phi)

Here, \phi is the probability that y = 1, \Sigma is the shared covariance matrix, \mu_0 is the mean of the feature vectors x with y = 0, and \mu_1 is the mean of the feature vectors x with y = 1. Maximizing the log-likelihood gives:

\phi = \frac{1}{m} \sum_{i=1}^{m} 1\{y^{(i)} = 1\}
\mu_0 = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 0\} \, x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}}
\mu_1 = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\} \, x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}
\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T

In this way, we can model p(x, y), compute the probabilities p(y = 0 | x) and p(y = 1 | x), and take the larger one as the classification label.

[Figure: contours of the two fitted Gaussians and the resulting decision boundary.]
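A compact NumPy sketch of these estimates and the resulting classifier; it follows the formulas above directly and is meant as an illustration, not production code:

```python
import numpy as np

def fit_gda(X, y):
    # Maximum-likelihood estimates for GDA with a shared covariance.
    # X: (m, n) array of features; y: (m,) array of 0/1 labels.
    m = X.shape[0]
    phi = np.mean(y == 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    # Subtract each sample's class mean, then average the outer products.
    centered = X - np.where((y == 1)[:, None], mu1, mu0)
    Sigma = centered.T @ centered / m
    return phi, mu0, mu1, Sigma

def predict_gda(x, phi, mu0, mu1, Sigma):
    # Compare log p(x | y) + log p(y) for y = 0 and y = 1; the Gaussian
    # normalization constant is shared and cancels in the comparison.
    inv = np.linalg.inv(Sigma)
    def score(mu, prior):
        d = x - mu
        return -0.5 * d @ inv @ d + np.log(prior)
    return int(score(mu1, phi) > score(mu0, 1 - phi))
```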

 

III. The Naive Bayes Model

In Gaussian discriminant analysis (GDA), the feature vector x is continuous and real-valued. If the feature vector x takes discrete values, the naive Bayes model can be used instead.

  3.1 Spam Classification

Suppose we have a dataset of emails labeled as spam or non-spam, and we want to build a spam classifier. A simple way to describe an email's features uses a dictionary: if the email contains the i-th word of the dictionary, set x_i = 1; if not, set x_i = 0. The features then form a vector x, for example:

x = (1, 0, 0, \ldots, 1, \ldots, 0)^T

This feature vector indicates that the email contains the words "a" and "buy" but not "aardvark", "aardwolf", or "zygmurgy". The dimension of the feature vector x equals the dictionary size; if the dictionary contains 5000 words, x is a 5000-dimensional vector of 0/1 values. If we modeled x directly with a multinomial distribution over its 2^5000 possible outcomes, we would need 2^5000 - 1 parameters. Modeling that many parameters is infeasible.
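A toy sketch of this feature construction, using a hypothetical five-word dictionary in place of the 5000-word one:

```python
import numpy as np

# Hypothetical five-word dictionary standing in for the 5000-word one.
dictionary = ["a", "aardvark", "aardwolf", "buy", "zygmurgy"]
index = {word: j for j, word in enumerate(dictionary)}

def featurize(email_text):
    # x_j = 1 if the j-th dictionary word appears in the email, else 0.
    x = np.zeros(len(dictionary), dtype=int)
    for word in email_text.lower().split():
        if word in index:
            x[index[word]] = 1
    return x

print(featurize("please buy a ticket"))  # -> [1 0 0 1 0]
```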

Therefore, to model p(x | y), we must make a strong assumption: the features x are conditionally independent given y. This assumption is called the naive Bayes assumption, and the resulting model is called the naive Bayes model. For example, suppose y = 1 indicates spam, word 200 of the dictionary is "buy", and word 300 is "price". The assumption says that x_200 and x_300 are independent given y, which can be written as p(x_200 | y) = p(x_200 | y, x_300). Note that this is different from assuming x_200 and x_300 are independent outright, which would be written p(x_200) = p(x_200 | x_300); the assumption only says that, given y, x_200 and x_300 are conditionally independent.

Therefore, using the chain rule and then the above assumption, we obtain:

p(x_1, \ldots, x_{5000} \mid y) = p(x_1 \mid y) \, p(x_2 \mid y, x_1) \cdots p(x_{5000} \mid y, x_1, \ldots, x_{4999}) = \prod_{j=1}^{5000} p(x_j \mid y)

The model has three (sets of) parameters:

\phi_{j|y=1} = p(x_j = 1 \mid y = 1), \quad \phi_{j|y=0} = p(x_j = 1 \mid y = 0), \quad \phi_y = p(y = 1)

Following the generative-model recipe, we maximize the joint likelihood:

L(\phi_y, \phi_{j|y=0}, \phi_{j|y=1}) = \prod_{i=1}^{m} p(x^{(i)}, y^{(i)})

Maximizing the likelihood with respect to each parameter gives estimates that match the parameters' intuitive meanings:

\phi_{j|y=1} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}
\phi_{j|y=0} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}}
\phi_y = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}{m}

This gives the complete naive Bayes model. For a new email with feature vector x, we can calculate:

p(y = 1 \mid x) = \frac{p(x \mid y = 1) \, p(y = 1)}{p(x)} = \frac{\left( \prod_{j} p(x_j \mid y = 1) \right) p(y = 1)}{\left( \prod_{j} p(x_j \mid y = 1) \right) p(y = 1) + \left( \prod_{j} p(x_j \mid y = 0) \right) p(y = 0)}

In fact, we only need to compare the numerators, since the denominator is the same for y = 0 and y = 1. Comparing p(y = 0 | x) and p(y = 1 | x) then determines whether the email is spam.
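Putting the estimation and prediction formulas together, a minimal NumPy sketch (without the smoothing introduced in section 3.2) might look like this:

```python
import numpy as np

def fit_nb(X, y):
    # X: (m, n) binary feature matrix; y: (m,) array of 0/1 labels.
    # Maximum-likelihood estimates (no smoothing yet; see section 3.2).
    phi_y = np.mean(y == 1)
    phi_j1 = X[y == 1].mean(axis=0)  # estimates p(x_j = 1 | y = 1)
    phi_j0 = X[y == 0].mean(axis=0)  # estimates p(x_j = 1 | y = 0)
    return phi_y, phi_j0, phi_j1

def predict_nb(x, phi_y, phi_j0, phi_j1):
    # Compare the two numerators p(x | y) p(y); the denominator p(x)
    # is the same for both classes, so it can be ignored.
    p1 = np.prod(np.where(x == 1, phi_j1, 1 - phi_j1)) * phi_y
    p0 = np.prod(np.where(x == 1, phi_j0, 1 - phi_j0)) * (1 - phi_y)
    return int(p1 > p0)
```

In practice the product of many small probabilities underflows, so sums of logarithms are used instead; the sketch keeps the products to mirror the formulas above.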

 

  3.2 Laplace Smoothing

The naive Bayes model works well in most cases. However, it has one drawback: it is sensitive to sparse data.

For example, in email classification, the word "nips" may never appear in any training email (to a new student, NIPS may seem too highbrow to show up in their mail). Now a new email "nips call for papers" arrives, and suppose "nips" is word 35000 in the dictionary. Since "nips" has never occurred in the training data, the estimates computed above are:

\phi_{35000|y=1} = \frac{\sum_{i=1}^{m} 1\{x_{35000}^{(i)} = 1 \wedge y^{(i)} = 1\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}} = 0
\phi_{35000|y=0} = \frac{\sum_{i=1}^{m} 1\{x_{35000}^{(i)} = 1 \wedge y^{(i)} = 0\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}} = 0

Because "nips" appears in neither spam nor normal emails, both estimates can only be 0, and the posterior probability becomes:

p(y = 1 \mid x) = \frac{\left( \prod_{j} p(x_j \mid y = 1) \right) p(y = 1)}{\left( \prod_{j} p(x_j \mid y = 1) \right) p(y = 1) + \left( \prod_{j} p(x_j \mid y = 0) \right) p(y = 0)} = \frac{0}{0}

which is undefined.

In this case, we can use Laplace smoothing: for features never seen in training, we assign a small positive probability instead of 0. The smoothing works as follows.

Assume a discrete random variable z takes values in {1, 2, ..., k}, and we have m observations z^{(1)}, ..., z^{(m)}. The original maximum-likelihood estimate is:

\phi_j = \frac{\sum_{i=1}^{m} 1\{z^{(i)} = j\}}{m}

After Laplace smoothing, the new estimate is:

\phi_j = \frac{\sum_{i=1}^{m} 1\{z^{(i)} = j\} + 1}{m + k}

That is, the count of each of the k values is increased by 1 and the denominator is increased by k, so the estimates still sum to 1 and none of them is 0. This is similar to smoothing techniques in NLP; for details, refer to Zong Chengqing's book "Statistical Natural Language Processing".

For the preceding naive Bayes model, the parameter estimates become (each binary feature takes k = 2 values, so 2 is added to each denominator):

\phi_{j|y=1} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\} + 1}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\} + 2}
\phi_{j|y=0} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\} + 1}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\} + 2}
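A NumPy sketch of the smoothed estimates, mirroring the unsmoothed version above:

```python
import numpy as np

def fit_nb_smoothed(X, y):
    # Laplace-smoothed estimates: +1 in each numerator, +2 in each
    # denominator, since each binary feature x_j takes k = 2 values.
    m1 = np.sum(y == 1)
    m0 = np.sum(y == 0)
    phi_y = m1 / len(y)
    phi_j1 = (X[y == 1].sum(axis=0) + 1) / (m1 + 2)
    phi_j0 = (X[y == 0].sum(axis=0) + 1) / (m0 + 2)
    return phi_y, phi_j0, phi_j1
```

The prediction step is unchanged; only the parameter estimates differ, and no feature probability is ever exactly 0.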
