Stanford CS229 Machine Learning Course Notes 4: GDA, Naive Bayes, and the Multinomial Event Model


Generative Learning and Discriminative Learning

Algorithms that model p(y|x; θ) directly, such as logistic regression with hθ(x) = g(θᵀx), or that learn a direct mapping from the input space to the output labels {0, 1}, such as the perceptron, are called discriminative learning algorithms.
Generative learning algorithms, in contrast, model p(x|y) and p(y), and then derive the posterior conditional distribution p(y|x) by Bayes' rule.

The denominator is computed with the law of total probability: p(x) = p(x|y=1)p(y=1) + p(x|y=0)p(y=0).
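
Putting these together (this is just Bayes' rule, with the total-probability expansion above in the denominator):

\[
p(y \mid x) \;=\; \frac{p(x \mid y)\,p(y)}{p(x)},
\qquad
\arg\max_{y}\, p(y \mid x) \;=\; \arg\max_{y}\, p(x \mid y)\,p(y),
\]

since the denominator does not depend on y when we only need the most likely class.
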
The three algorithms introduced in these notes, GDA, Naive Bayes, and the multinomial event model, all belong to the family of generative learning algorithms.

GDA (Gaussian Discriminant Analysis)

1. The multivariate normal distribution

Because GDA models p(x|y) with a multivariate normal distribution, we first introduce the multivariate normal distribution:
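
In standard form (as in the CS229 notes), the density of x ~ N(μ, Σ) for x ∈ ℝⁿ is

\[
p(x;\mu,\Sigma) \;=\; \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}
\exp\!\Big(-\tfrac{1}{2}\,(x-\mu)^{T}\Sigma^{-1}(x-\mu)\Big).
\]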

Its two parameters are the mean vector μ and the covariance matrix Σ (symmetric and positive semi-definite).
Next, to build intuition, we look at the easier-to-visualize bivariate normal distribution and how changing the parameters affects the probability density:

(Figures: surface and contour plots of the bivariate normal density for several choices of μ and Σ.)

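Such plots can be reproduced with a short script. Below is a minimal sketch (not part of the original notes) using NumPy, SciPy, and Matplotlib; the grid range and the covariance matrices are arbitrary illustrative choices:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

# Evaluate the bivariate normal density on a grid for several covariance matrices.
xs, ys = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
grid = np.dstack([xs, ys])
covariances = [np.eye(2),                           # identity: circular contours
               0.5 * np.eye(2),                     # smaller variance: tighter contours
               np.array([[1.0, 0.8], [0.8, 1.0]])]  # positive correlation: tilted ellipses

fig, axes = plt.subplots(1, len(covariances), figsize=(12, 4))
for ax, cov in zip(axes, covariances):
    density = multivariate_normal(mean=[0.0, 0.0], cov=cov).pdf(grid)
    ax.contour(xs, ys, density)
    ax.set_title("Sigma = " + str(cov.tolist()))
plt.tight_layout()
plt.show()

Shrinking Σ concentrates the density, and off-diagonal entries tilt and stretch the contours along the correlated direction; changing μ only shifts the distribution.
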
2. Model
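
Concretely, the GDA model assumes (in the standard form from the CS229 notes):

\[
y \sim \mathrm{Bernoulli}(\phi), \qquad
x \mid y=0 \sim \mathcal{N}(\mu_0, \Sigma), \qquad
x \mid y=1 \sim \mathcal{N}(\mu_1, \Sigma),
\]

so that p(y) = φ^y (1 − φ)^{1−y} and p(x|y) is the multivariate normal density above with mean μ0 or μ1.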



Note: although the two classes y = 0 and y = 1 have different mean vectors μ0 and μ1, the model conventionally uses a single covariance matrix Σ shared by both classes.

3. Strategy

Maximize the joint likelihood of the training data:
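
Writing the parameters explicitly (m is the number of training examples), the log-likelihood in standard form is

\[
\ell(\phi,\mu_0,\mu_1,\Sigma)
= \log \prod_{i=1}^{m} p\big(x^{(i)}, y^{(i)};\,\phi,\mu_0,\mu_1,\Sigma\big)
= \log \prod_{i=1}^{m} p\big(x^{(i)} \mid y^{(i)};\,\mu_0,\mu_1,\Sigma\big)\;p\big(y^{(i)};\,\phi\big).
\]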

4. Algorithms

Maximizing the likelihood yields a closed-form (analytic) solution:
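
In the standard form from the CS229 notes, the estimates are

\[
\phi = \frac{1}{m}\sum_{i=1}^{m} 1\{y^{(i)}=1\}, \qquad
\mu_k = \frac{\sum_{i=1}^{m} 1\{y^{(i)}=k\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)}=k\}} \;\;(k=0,1), \qquad
\Sigma = \frac{1}{m}\sum_{i=1}^{m} \big(x^{(i)}-\mu_{y^{(i)}}\big)\big(x^{(i)}-\mu_{y^{(i)}}\big)^{T}.
\]

That is, φ is the fraction of positive examples, each μ_k is the mean of the examples in class k, and Σ is the pooled covariance of the examples around their class means.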

5. The relationship and comparison between GDA and logistic regression

If we view p(y=1 | x; φ, μ0, μ1, Σ) as a function of x, then the GDA posterior can be written in the form of logistic regression:
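
\[
p\big(y=1 \mid x;\ \phi,\mu_0,\mu_1,\Sigma\big) \;=\; \frac{1}{1+\exp(-\theta^{T}x)}
\]

(using the convention that x includes an intercept term x₀ = 1).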

Here θ is an appropriate function of φ, μ0, μ1, and Σ.
Further, if p(x|y) is multivariate normal with a shared covariance matrix Σ, then p(y|x) necessarily follows a logistic function; the converse does not hold: p(y|x) being logistic does not imply that p(x|y) is multivariate normal. In other words, GDA makes stronger modeling assumptions than logistic regression. When those assumptions are correct, GDA fits the data better; in particular, if p(x|y) really is multivariate normal with a shared covariance matrix, then GDA is asymptotically efficient (in the limit of very large training sets, no algorithm is strictly better than GDA). In that setting, even with little data, we can expect GDA to outperform logistic regression.
Logistic regression, however, is more robust, since it is not as sensitive to modeling assumptions as GDA. For example, if x|y=0 ~ Poisson(λ0) and x|y=1 ~ Poisson(λ1), then p(y|x) still follows a logistic model, but modeling such data with GDA gives unsatisfactory results. When the data are not Gaussian and the training set is large, logistic regression almost always outperforms GDA, which is why logistic regression is used more often in practice.
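
As a concrete illustration of sections 2-5 (not part of the original notes), here is a minimal NumPy sketch that fits the GDA parameters with the closed-form maximum-likelihood estimates and then evaluates p(y=1|x); the function names and the synthetic data are illustrative choices:

import numpy as np

def fit_gda(X, y):
    # Closed-form maximum-likelihood estimates for GDA with a shared covariance matrix.
    m, _ = X.shape
    phi = np.mean(y == 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    centered = X - np.where((y == 1)[:, None], mu1, mu0)
    Sigma = centered.T @ centered / m
    return phi, mu0, mu1, Sigma

def predict_proba(X, phi, mu0, mu1, Sigma):
    # p(y=1|x) via Bayes' rule; because Sigma is shared, the Gaussian normalizers cancel
    # and the posterior is a logistic function of the resulting log-odds.
    Sinv = np.linalg.inv(Sigma)
    quad = lambda D: -0.5 * np.einsum('ij,jk,ik->i', D, Sinv, D)
    log_odds = quad(X - mu1) - quad(X - mu0) + np.log(phi / (1 - phi))
    return 1.0 / (1.0 + np.exp(-log_odds))

# Synthetic two-class Gaussian data, just to exercise the functions.
rng = np.random.default_rng(0)
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), 100),
               rng.multivariate_normal([2, 2], np.eye(2), 100)])
y = np.r_[np.zeros(100, dtype=int), np.ones(100, dtype=int)]
phi, mu0, mu1, Sigma = fit_gda(X, y)
print(predict_proba(X[:3], phi, mu0, mu1, Sigma))
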
Next, let's look at an even more commonly used generative learning algorithm: Naive Bayes.

Naive Bayes

In GDA, the feature vector x is continuous (a random vector following a multivariate normal distribution); in Naive Bayes, the feature vector x is discrete.

1. The Naive Bayes (NB) assumption

Naive Bayes models are often used for text classification, for example spam filtering. We build a vocabulary from all the words that occur in the training set, and represent each message by a feature vector whose length equals the size of the vocabulary: x_i = 1 if the i-th word of the vocabulary appears in the message and 0 otherwise. (Each p(x_i|y) can be modeled as a Bernoulli distribution, as done here, or the words can instead be modeled with a multinomial distribution, as in the multinomial event model below.) A message is then classified as spam or not based on its feature vector.
We now model p(x|y). Suppose the vocabulary contains 50,000 words. Modeling x directly with a multinomial distribution over all possible binary vectors would give 2^50000 possible outcomes, hence 2^50000 − 1 parameters, which is clearly infeasible. So we make the NB assumption: the x_i are conditionally independent given y, so that, for example, p(x_1 | y, x_2, ..., x_50000) = p(x_1 | y).
A classic example of conditional independence: a person's arm length and reading ability are conditionally independent given age. (Adults tend to have longer arms and better reading ability than children, but once age is fixed, arm length and reading ability tell us nothing about each other.)
With this assumption, the number of parameters drops dramatically:
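
Under the NB assumption the conditional joint distribution factorizes (the first equality is the chain rule, the second uses the assumption):

\[
p(x_1,\dots,x_{50000} \mid y)
= p(x_1 \mid y)\,p(x_2 \mid y, x_1)\cdots p(x_{50000} \mid y, x_1,\dots,x_{49999})
= \prod_{i=1}^{n} p(x_i \mid y),
\]

leaving on the order of n parameters (with n = 50,000 here) instead of 2ⁿ − 1.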

Although this is a strong assumption, the resulting algorithm works well enough on many problems.

2. Model
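
In the standard parameterization from the CS229 notes (writing x_j for the j-th coordinate and superscript (i) for the i-th training example), the model has parameters φ_{j|y=1} = p(x_j = 1 | y = 1), φ_{j|y=0} = p(x_j = 1 | y = 0), and φ_y = p(y = 1), so that

\[
p(x \mid y) = \prod_{j=1}^{n} p(x_j \mid y), \qquad p(y) = \phi_y^{\,y}\,(1-\phi_y)^{1-y}.
\]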

3. Strategy

Maximize the joint likelihood of the training data:
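
\[
\mathcal{L}\big(\phi_y,\ \phi_{j|y=0},\ \phi_{j|y=1}\big)
= \prod_{i=1}^{m} p\big(x^{(i)}, y^{(i)}\big),
\]

where the product runs over the m training examples.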

4. Algorithms

Maximizing the likelihood yields the closed-form (analytic) solution:
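
In the standard form, the estimates are just empirical frequencies:

\[
\phi_{j|y=1} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)}=1 \wedge y^{(i)}=1\}}{\sum_{i=1}^{m} 1\{y^{(i)}=1\}}, \qquad
\phi_{j|y=0} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)}=1 \wedge y^{(i)}=0\}}{\sum_{i=1}^{m} 1\{y^{(i)}=0\}}, \qquad
\phi_y = \frac{1}{m}\sum_{i=1}^{m} 1\{y^{(i)}=1\}.
\]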

Having fit these parameters, a new example x is classified by computing the posterior with Bayes' rule:
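
\[
p(y=1 \mid x)
= \frac{\Big(\prod_{j=1}^{n} p(x_j \mid y=1)\Big)\,p(y=1)}
       {\Big(\prod_{j=1}^{n} p(x_j \mid y=1)\Big)\,p(y=1) \;+\; \Big(\prod_{j=1}^{n} p(x_j \mid y=0)\Big)\,p(y=0)},
\]

and the predicted class is whichever of y = 1 or y = 0 has the higher posterior.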

5. Laplace smoothing

Estimating a probability as 0 simply because the event never occurred in the training set is statistically a bad idea. In the spam example, if a new message contains a word that never appeared in any training message, then the maximum-likelihood estimates give p(x_j = 1 | y = 1) = p(x_j = 1 | y = 0) = 0 for that word, the posterior evaluates to 0/0, and the message cannot be classified.
Laplace smoothing fixes this. As a general example, consider estimating the parameters of a multinomial random variable z taking values in {1, ..., k} from m training samples:
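
The Laplace-smoothed estimate adds 1 to each count in the numerator and k to the denominator:

\[
\phi_j \;=\; \frac{\sum_{i=1}^{m} 1\{z^{(i)}=j\} + 1}{m + k}, \qquad j = 1,\dots,k.
\]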

This way we guarantee both that the estimates still sum to one (φ_1 + ... + φ_k = 1) and that no φ_j is estimated to be 0 merely because the value j never appears in the training set.
Applying Laplace smoothing to the Naive Bayes parameter estimates gives:
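
\[
\phi_{j|y=1} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)}=1 \wedge y^{(i)}=1\} + 1}{\sum_{i=1}^{m} 1\{y^{(i)}=1\} + 2}, \qquad
\phi_{j|y=0} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)}=1 \wedge y^{(i)}=0\} + 1}{\sum_{i=1}^{m} 1\{y^{(i)}=0\} + 2},
\]

where 2 appears in the denominators because each x_j takes k = 2 possible values. (φ_y is usually left unsmoothed, since both classes appear in any realistic training set.)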

The Multinomial Event Model

A generative model designed specifically for text classification.

1. Model

Here x_j denotes the index in the vocabulary of the j-th word of the message; that is, x_j takes values in {1, ..., |V|}, where |V| is the size of the vocabulary. A message with n words is represented by a vector (x_1, ..., x_n) of length n, so the vector length generally differs from message to message.
In the multinomial event model, we assume a message is generated as follows: first, whether the message is spam is determined according to p(y); then each word of the message is drawn independently from the same multinomial distribution p(x_j|y). The probability of generating the entire message is therefore:
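
\[
p(x, y) \;=\; p(y)\prod_{j=1}^{n} p(x_j \mid y).
\]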

Note: this assumes that the probability of a word appearing does not depend on its position in the message (the same distribution p(x_j|y) is used for every position j).

2. Strategy

Maximize the joint likelihood of the training data:
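
With parameters φ_y = p(y = 1) and φ_{k|y} = p(x_j = k | y), the likelihood of the training set is (in standard form)

\[
\mathcal{L}\big(\phi_y,\ \phi_{k|y=0},\ \phi_{k|y=1}\big)
= \prod_{i=1}^{m} p\big(x^{(i)}, y^{(i)}\big)
= \prod_{i=1}^{m}\Big(\prod_{j=1}^{n_i} p\big(x_j^{(i)} \mid y^{(i)}\big)\Big)\, p\big(y^{(i)}\big),
\]

where n_i is the number of words in the i-th message.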

3. Algorithms

Maximizing the likelihood yields the closed-form (analytic) solution:
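
\[
\phi_{k|y=1} = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n_i} 1\{x_j^{(i)}=k \wedge y^{(i)}=1\}}{\sum_{i=1}^{m} 1\{y^{(i)}=1\}\, n_i}, \qquad
\phi_{k|y=0} = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n_i} 1\{x_j^{(i)}=k \wedge y^{(i)}=0\}}{\sum_{i=1}^{m} 1\{y^{(i)}=0\}\, n_i}, \qquad
\phi_y = \frac{1}{m}\sum_{i=1}^{m} 1\{y^{(i)}=1\}.
\]

Intuitively, φ_{k|y=1} is the fraction of all word occurrences in spam messages that are word k.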

Applying Laplace smoothing (adding 1 to each numerator and |V| to each denominator) gives:
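
\[
\phi_{k|y=1} = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n_i} 1\{x_j^{(i)}=k \wedge y^{(i)}=1\} + 1}{\sum_{i=1}^{m} 1\{y^{(i)}=1\}\, n_i + |V|}, \qquad
\phi_{k|y=0} = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n_i} 1\{x_j^{(i)}=k \wedge y^{(i)}=0\} + 1}{\sum_{i=1}^{m} 1\{y^{(i)}=0\}\, n_i + |V|}.
\]

To make these estimates concrete, here is a minimal NumPy sketch of the multinomial event model with Laplace smoothing (not part of the original notes; the function names and toy data are illustrative, and documents are assumed to be already tokenized into vocabulary indices):

import numpy as np

def fit_multinomial_nb(docs, labels, vocab_size):
    # Laplace-smoothed word probabilities per class: numerators start at 1,
    # denominators start at |V|, matching the formulas above.
    counts = np.ones((2, vocab_size))
    totals = np.full(2, float(vocab_size))
    for doc, y in zip(docs, labels):
        for word in doc:
            counts[y, word] += 1
        totals[y] += len(doc)
    log_phi = np.log(counts / totals[:, None])                           # log p(word = k | y)
    log_prior = np.log(np.bincount(labels, minlength=2) / len(labels))   # log p(y)
    return log_phi, log_prior

def predict(doc, log_phi, log_prior):
    # Compare log p(y) + sum_j log p(x_j | y) for y = 0 and y = 1.
    scores = log_prior + np.array([log_phi[y, doc].sum() for y in (0, 1)])
    return int(np.argmax(scores))

# Tiny toy corpus: each document is a list of word indices into a 4-word vocabulary.
docs = [[0, 1, 2], [2, 3, 3], [0, 0, 1]]
labels = np.array([0, 1, 0])
log_phi, log_prior = fit_multinomial_nb(docs, labels, vocab_size=4)
print(predict([2, 3], log_phi, log_prior))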
