Generative Learning Algorithms

1. Introduction

The algorithms we discussed previously model \(p(y|x;\theta)\) directly for a given x. For example, logistic regression models \(p(y|x;\theta)\) with \(h_\theta(x) = g(\theta^T x)\).

Now consider a classification problem in which we want to decide, from some features of an animal, whether it is an elephant (y=1) or a dog (y=0). Given such a training set, logistic regression or the perceptron algorithm tries to find a decision boundary that separates the elephant examples from the dog examples. But we can change our point of view: first, learn a model of elephants from the features of the elephant examples; then learn a model of dogs from the features of the dog examples. For a new animal, we extract its features, evaluate them under the elephant model to get the probability that it is an elephant, evaluate them under the dog model to get the probability that it is a dog, and finally compare the two probabilities to decide which kind of animal it is. In other words, we model p(x|y) (together with p(y)), where y is the label and x is the feature vector.

Based on this description, let us define these two ways of solving the problem:

Discriminative learning algorithms (discriminative learning algorithm): learn p(y|x) directly, or learn a direct mapping from inputs x to outputs y.

Generative learning algorithms (generative learning algorithm): model p(x|y) (and also p(y)).

To deepen our understanding of generative learning algorithms, let us make the example concrete:

y: the output variable, taking two values; 1 for elephant, 0 for dog

p(x|y = 0): models the features of dogs

p(x|y = 1): models the features of elephants

Once we have modeled p(x|y) and p(y), Bayes' rule lets us compute the posterior probability of y given x:

\[ p(y\mid x) = \frac{p(x\mid y)\,p(y)}{p(x)} \]

where the denominator is

\[ p(x) = p(x\mid y=1)\,p(y=1) + p(x\mid y=0)\,p(y=0). \]

Since we only care about which of the discrete values of y is more likely, not the exact value of the posterior probability, the prediction can be written as:

\[ \arg\max_y p(y\mid x) = \arg\max_y \frac{p(x\mid y)\,p(y)}{p(x)} = \arg\max_y p(x\mid y)\,p(y). \]
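
As a concrete illustration (a minimal sketch, not code from the original notes), the arg-max rule above can be written as follows; the function names and the assumption that we already have callables for the two class-conditional densities are hypothetical:

```python
def predict(x, density_y0, density_y1, prior_y1):
    """Generative prediction: choose the y that maximizes p(x|y) * p(y).

    density_y0 / density_y1 evaluate p(x|y=0) and p(x|y=1) (assumed given);
    prior_y1 is the estimated p(y = 1).
    """
    score_0 = density_y0(x) * (1.0 - prior_y1)  # p(x|y=0) p(y=0)
    score_1 = density_y1(x) * prior_y1          # p(x|y=1) p(y=1)
    return 1 if score_1 > score_0 else 0
```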

Common generative models include: HMMs, the naive Bayes model, the Gaussian mixture model (GMM), LDA, etc.

2. Gaussian Discriminant Analysis (GDA)

The first generative learning algorithm we describe is GDA. In GDA, we assume that p(x|y) follows a multivariate normal distribution.

2.1 The multivariate normal distribution

The multivariate normal distribution in n dimensions, also called the multivariate Gaussian distribution, is parameterized by a mean vector \(\mu \in \mathbb{R}^n\) and a covariance matrix \(\Sigma \in \mathbb{R}^{n \times n}\). Its probability density is written as:

\[ p(x;\mu,\Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right) \]

where \(|\Sigma|\) denotes the determinant of the matrix \(\Sigma\) (determinant).

The mean is:

\[ \mathrm{E}[Z] = \mu \]

The covariance is: \(\mathrm{Cov}(Z) = \mathrm{E}\big[(Z - \mathrm{E}[Z])(Z - \mathrm{E}[Z])^T\big] = \mathrm{E}[ZZ^T] - (\mathrm{E}[Z])(\mathrm{E}[Z])^T = \Sigma\). If \(X \sim \mathcal{N}(\mu, \Sigma)\), then \(\mathrm{Cov}(X) = \Sigma\).
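
For reference (a minimal sketch, not part of the original notes; the function name gaussian_density is hypothetical), the density above can be evaluated directly with numpy:

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Density of the multivariate normal N(mu, sigma) evaluated at x.

    x, mu: arrays of shape (n,); sigma: array of shape (n, n), assumed
    positive definite.
    """
    n = x.shape[0]
    diff = x - mu
    norm_const = 1.0 / (np.power(2 * np.pi, n / 2) * np.sqrt(np.linalg.det(sigma)))
    exponent = -0.5 * diff @ np.linalg.solve(sigma, diff)  # (x-mu)^T Sigma^{-1} (x-mu)
    return norm_const * np.exp(exponent)
```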

Let's look at a few examples of Gaussian distributions with the help of some plots.

Description

In the first plot on the left, μ is a 2x1 vector of zeros and the covariance matrix is Σ = I (the 2x2 identity matrix); this Gaussian is called the standard normal distribution.

In the second plot, μ is unchanged and Σ = 0.6I.

In the third plot, μ is unchanged and Σ = 2I.

So μ determines the center (position) of the distribution, while Σ determines the direction and size of the projected ellipse.

2.2 The Gaussian discriminant analysis model

Now suppose we have a classification problem in which the features x of the training examples are continuous random variables. We can then use the Gaussian discriminant analysis (GDA) model, which assumes that p(x|y) follows a multivariate normal distribution, namely:

\[ y \sim \mathrm{Bernoulli}(\phi), \qquad x\mid y=0 \sim \mathcal{N}(\mu_0,\Sigma), \qquad x\mid y=1 \sim \mathcal{N}(\mu_1,\Sigma) \]

Writing out the distributions:

\[ p(y) = \phi^{y}(1-\phi)^{1-y} \]

\[ p(x\mid y=0) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu_0)^T\Sigma^{-1}(x-\mu_0)\right) \]

\[ p(x\mid y=1) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu_1)^T\Sigma^{-1}(x-\mu_1)\right) \]

The model parameters are \(\phi\), \(\Sigma\), \(\mu_0\) and \(\mu_1\), and the log-likelihood is:

\[ \ell(\phi,\mu_0,\mu_1,\Sigma) = \log \prod_{i=1}^{m} p\big(x^{(i)}, y^{(i)};\phi,\mu_0,\mu_1,\Sigma\big) = \log \prod_{i=1}^{m} p\big(x^{(i)}\mid y^{(i)};\mu_0,\mu_1,\Sigma\big)\, p\big(y^{(i)};\phi\big) \]

Note that there are two mean parameters \(\mu_0\) and \(\mu_1\): the features have different means under the two classes, but we assume the covariance matrix \(\Sigma\) is the same for both. In the figure this shows up as two Gaussians with different centers but the same shape, so a straight line can be used to separate them.

Maximizing the log-likelihood gives all the parameters:

\[ \phi = \frac{1}{m}\sum_{i=1}^{m} 1\{y^{(i)}=1\} \]

which is the proportion of training examples with y = 1;

\[ \mu_0 = \frac{\sum_{i=1}^{m} 1\{y^{(i)}=0\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)}=0\}} \]

the mean of the features over the examples with y = 0;

\[ \mu_1 = \frac{\sum_{i=1}^{m} 1\{y^{(i)}=1\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)}=1\}} \]

the mean of the features over the examples with y = 1; and

\[ \Sigma = \frac{1}{m}\sum_{i=1}^{m} \big(x^{(i)}-\mu_{y^{(i)}}\big)\big(x^{(i)}-\mu_{y^{(i)}}\big)^T \]

the average of the sample feature covariances.
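
As an illustration (a minimal sketch under the assumption that X is an m x n numpy array of features and y a 0/1 label vector; the function name fit_gda is hypothetical), these closed-form estimates can be computed directly:

```python
import numpy as np

def fit_gda(X, y):
    """Maximum-likelihood estimates of the GDA parameters (shared covariance)."""
    m, n = X.shape
    phi = np.mean(y == 1)                          # fraction of examples with y = 1
    mu0 = X[y == 0].mean(axis=0)                   # feature mean for class 0
    mu1 = X[y == 1].mean(axis=0)                   # feature mean for class 1
    mu_y = np.where((y == 1)[:, None], mu1, mu0)   # per-example class mean
    diff = X - mu_y
    sigma = diff.T @ diff / m                      # pooled covariance estimate
    return phi, mu0, mu1, sigma
```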

With the parameters fitted as above, the picture looks like this:

The values of y differ on the two sides of the decision line; the two Gaussians share the same covariance matrix, so their contours have the same shape, but their means differ, so they sit at different locations.

2.3 Discussion: GDA and logistic regression

If we now view \(p(y=1\mid x;\phi,\mu_0,\mu_1,\Sigma)\) as a function of x, it can be expressed in the form:

\[ p(y=1\mid x;\phi,\mu_0,\mu_1,\Sigma) = \frac{1}{1+\exp(-\theta^T x)} \]

where \(\theta\) is an appropriate function of \(\phi,\Sigma,\mu_0,\mu_1\) (with an intercept term absorbed into x). This is exactly the form of logistic regression.
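
The exact form of \(\theta\) is not written out above; as a sketch (obtained by expanding the two Gaussian densities and using the shared \(\Sigma\); the symbol \(\theta_0\) for the intercept is introduced here for illustration), one finds

\[ \theta = \Sigma^{-1}(\mu_1 - \mu_0), \qquad \theta_0 = \frac{1}{2}\mu_0^T\Sigma^{-1}\mu_0 - \frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 + \log\frac{\phi}{1-\phi}, \]

so that \(p(y=1\mid x) = 1\big/\big(1+\exp(-(\theta^T x + \theta_0))\big)\).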

Logistic regression and GDA generally give two different decision boundaries when trained on the same data set, so how should we choose between the two models?

As noted above, if p(x|y) is a multivariate Gaussian (with shared \(\Sigma\)), then p(y|x) necessarily takes the logistic form. The converse is not true: p(y|x) being a logistic function does not imply that p(x|y) is Gaussian. This means GDA makes a stronger modeling assumption than logistic regression.

If p(x|y) really is (or is close to) Gaussian, then GDA is asymptotically more efficient than logistic regression.

When the training set is large and the modeling assumptions hold, no algorithm is strictly better than GDA (in terms of how accurately it estimates p(y|x)).

In fact, even when the sample size is small, GDA is generally expected to do better than logistic regression when its assumptions hold.

However, logistic regression makes weaker assumptions and is therefore more robust to incorrect modeling assumptions (robust). Many different sets of distributional assumptions lead to p(y|x) having a logistic form. For example, if \(x\mid y=1 \sim \mathrm{Poisson}(\lambda_1)\) and \(x\mid y=0 \sim \mathrm{Poisson}(\lambda_0)\), then p(y|x) is logistic, and logistic regression performs well on such Poisson data. But if we forced the GDA model's Gaussian assumptions onto this non-Gaussian data, the results would be unpredictable and GDA could do poorly.

3. Naive Bayes

In GDA, the feature vector x is continuous and real-valued. Let us now discuss what to do when x is discrete.

We use spam classification as an example: we want to decide whether an email is spam or not. Classifying emails is an application of text classification.

We represent an email as an input feature vector defined relative to an existing dictionary: if the i-th word of the dictionary appears in the email, then \(x_i = 1\); otherwise \(x_i = 0\). The input feature vector then looks as follows:

Having chosen the feature representation, we now model p(x|y):

Suppose the dictionary contains 50,000 words, so \(x \in \{0,1\}^{50000}\). If we modeled x explicitly with a multinomial distribution over all possible outcomes, there would be \(2^{50000}\) possible vectors and a \((2^{50000}-1)\)-dimensional parameter vector, which is clearly far too many parameters. So, in order to model p(x|y), we must make a strong assumption: that the features \(x_i\) are conditionally independent given y. This assumption is called the Naive Bayes (NB) assumption, and the resulting algorithm is called the Naive Bayes classifier.

Explanation:

If an email is spam (y = 1), then knowing whether the word "buy" (say, word 2087 of the dictionary) appears in it has no effect on our beliefs about whether the word "price" (word 39831) appears. Formally, \(p(x_{2087}\mid y) = p(x_{2087}\mid y, x_{39831})\). This is conditional independence of \(x_{2087}\) and \(x_{39831}\) given y; it is not the same as saying they are (unconditionally) independent, which would be written \(p(x_{2087}) = p(x_{2087}\mid x_{39831})\). We only assume that \(x_{2087}\) and \(x_{39831}\) are independent given y.

Now we return to the problem. Having made the assumption, we get:

\[ p(x_1,\dots,x_{50000}\mid y) = p(x_1\mid y)\,p(x_2\mid y,x_1)\cdots p(x_{50000}\mid y,x_1,\dots,x_{49999}) = \prod_{i=1}^{50000} p(x_i\mid y) \]

Explanation:

The first equality uses the chain rule of probability.

The second equality uses the Naive Bayes assumption.

The Naive Bayes assumption is a very restrictive assumption: in general, "buy" and "price" are correlated. Here we only assume that they are conditionally independent given y; conditional independence and (unconditional) independence are not the same thing.

Model parameters:

\( \phi_{i|y=1} = p(x_i = 1 \mid y = 1) \)

\( \phi_{i|y=0} = p(x_i = 1 \mid y = 0) \)

\( \phi_y = p(y = 1) \)

For a training set \(\{(x^{(i)}, y^{(i)});\, i = 1,\dots,m\}\), following the generative-learning recipe, the joint likelihood (joint likelihood) is:

\[ \mathcal{L}(\phi_y, \phi_{i|y=0}, \phi_{i|y=1}) = \prod_{i=1}^{m} p\big(x^{(i)}, y^{(i)}\big) \]

Maximizing it gives the maximum likelihood estimates:

\[ \phi_{j|y=1} = \frac{\sum_{i=1}^m 1\{x_j^{(i)}=1 \wedge y^{(i)}=1\}}{\sum_{i=1}^m 1\{y^{(i)}=1\}}, \qquad \phi_{j|y=0} = \frac{\sum_{i=1}^m 1\{x_j^{(i)}=1 \wedge y^{(i)}=0\}}{\sum_{i=1}^m 1\{y^{(i)}=0\}}, \qquad \phi_y = \frac{\sum_{i=1}^m 1\{y^{(i)}=1\}}{m} \]

The last expression is the ratio of the number of examples with y = 1 to the total number of examples; the first two are the fractions of the y = 1 (respectively y = 0) examples in which the feature \(x_j = 1\).
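
As an illustration (a minimal sketch under the assumption that X is an m x n binary numpy array and y a 0/1 label vector; the function name fit_naive_bayes is hypothetical), these estimates are simple column averages:

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Maximum-likelihood estimates for the Bernoulli naive Bayes parameters.

    X[i, j] = 1 if word j of the dictionary appears in email i; y[i] = 1 for spam.
    """
    phi_y = np.mean(y == 1)              # p(y = 1)
    phi_j_y1 = X[y == 1].mean(axis=0)    # p(x_j = 1 | y = 1), one entry per word
    phi_j_y0 = X[y == 0].mean(axis=0)    # p(x_j = 1 | y = 0)
    return phi_y, phi_j_y0, phi_j_y1
```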

Once all the parameters have been fitted, to predict a new example with features x we compute:

\[ p(y=1\mid x) = \frac{\big(\prod_{i} p(x_i\mid y=1)\big)\, p(y=1)}{\big(\prod_{i} p(x_i\mid y=1)\big)\, p(y=1) + \big(\prod_{i} p(x_i\mid y=0)\big)\, p(y=0)} \]

In fact, we only need to compare the numerators: the denominator is the same for y = 0 and y = 1, so comparing \(p(y=0\mid x)\) and \(p(y=1\mid x)\) to decide whether the email is spam reduces to comparing the two numerators.

3.1 Laplace smoothing

The naive Bayes model works well in most cases. However, it has one drawback: it is sensitive to data sparsity.

For example, in email classification, a junior graduate student may never have received an email mentioning NIPS (perhaps it sounds too high-brow), so the word never appears in the training data. Now a new email arrives: "NIPS Call for Papers". Suppose the word NIPS occupies position 35000 in the dictionary. Since NIPS never occurred in any training example, the estimated probabilities are:

\[ \phi_{35000|y=1} = \frac{\sum_{i=1}^m 1\{x_{35000}^{(i)}=1 \wedge y^{(i)}=1\}}{\sum_{i=1}^m 1\{y^{(i)}=1\}} = 0, \qquad \phi_{35000|y=0} = \frac{\sum_{i=1}^m 1\{x_{35000}^{(i)}=1 \wedge y^{(i)}=0\}}{\sum_{i=1}^m 1\{y^{(i)}=0\}} = 0 \]

Since NIPS appeared neither in spam nor in normal mail, both estimates are 0. The final posterior probability is then:

\[ p(y=1\mid x) = \frac{\prod_i p(x_i\mid y=1)\, p(y=1)}{\prod_i p(x_i\mid y=1)\, p(y=1) + \prod_i p(x_i\mid y=0)\, p(y=0)} = \frac{0}{0} \]

which is undefined, because both products contain the factor \(p(x_{35000}\mid y) = 0\).

To handle such situations we can use Laplace smoothing: for features that never appear, we use a small probability instead of 0. The specific smoothing method is:

Suppose a discrete random variable z takes values in \(\{1, 2, \dots, k\}\). The original estimate from m observations is:

\[ \phi_j = \frac{\sum_{i=1}^m 1\{z^{(i)} = j\}}{m} \]

With Laplace smoothing, the new estimate is:

\[ \phi_j = \frac{\sum_{i=1}^m 1\{z^{(i)} = j\} + 1}{m + k} \]

That is, the count of each of the k values is increased by 1 and the denominator is increased by k. This is similar to the smoothing used in NLP; see Zong Chengqing's book "Statistical Natural Language Processing" for details.

For the naive Bayes model above, the parameter estimates with Laplace smoothing become:

\[ \phi_{j|y=1} = \frac{\sum_{i=1}^m 1\{x_j^{(i)}=1 \wedge y^{(i)}=1\} + 1}{\sum_{i=1}^m 1\{y^{(i)}=1\} + 2}, \qquad \phi_{j|y=0} = \frac{\sum_{i=1}^m 1\{x_j^{(i)}=1 \wedge y^{(i)}=0\} + 1}{\sum_{i=1}^m 1\{y^{(i)}=0\} + 2} \]
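
As a closing illustration (a minimal sketch under the same assumptions as before; the function names fit_naive_bayes_smoothed and predict_spam are hypothetical), here is the smoothed estimator together with a prediction routine that compares the two numerators in log space:

```python
import numpy as np

def fit_naive_bayes_smoothed(X, y):
    """Bernoulli naive Bayes with Laplace smoothing (each x_j takes 2 values)."""
    phi_y = np.mean(y == 1)
    phi_j_y1 = (X[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 2)
    phi_j_y0 = (X[y == 0].sum(axis=0) + 1) / ((y == 0).sum() + 2)
    return phi_y, phi_j_y0, phi_j_y1

def predict_spam(x, phi_y, phi_j_y0, phi_j_y1):
    """Compare the numerators p(x|y) p(y) for y = 1 and y = 0 in log space."""
    log_p1 = np.sum(np.log(np.where(x == 1, phi_j_y1, 1 - phi_j_y1))) + np.log(phi_y)
    log_p0 = np.sum(np.log(np.where(x == 1, phi_j_y0, 1 - phi_j_y0))) + np.log(1 - phi_y)
    return 1 if log_p1 > log_p0 else 0
```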
