Stanford University Machine Learning Notes 2


Part IV: Generative Learning Algorithms


So far, we have mostly discussed learning algorithms that model p(y|x; θ), the conditional distribution of y given x. For example, logistic regression models p(y|x; θ) as

h_θ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx)),

where the function g is the sigmoid function. In this article, we will discuss another type of learning algorithm.
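As a quick illustration (a minimal sketch; the parameter values below are arbitrary and only serve to show the computation), the hypothesis can be evaluated as:

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    # Logistic regression hypothesis: p(y = 1 | x; theta) = g(theta^T x)
    return sigmoid(theta @ x)

theta = np.array([0.5, -1.0, 2.0])  # arbitrary parameters, for illustration only
x = np.array([1.0, 0.3, 0.8])       # x[0] = 1 acts as the intercept term
print(h(theta, x))                  # prints a probability in (0, 1)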

Consider a classification problem: we need to distinguish elephants (y = 1) from dogs (y = 0) based on an animal's features. Given a training set, logistic regression or the perceptron algorithm tries to find a straight line, the decision boundary, that separates the elephants from the dogs. Then, to classify a new animal as either an elephant or a dog, the classifier checks which side of the decision boundary its features fall on and makes the corresponding judgment.

A different approach is presented here. First, we build two separate models from the features of elephants and of dogs. Then, to classify a new animal, we match it against both models and, using the prior knowledge obtained from the training set, judge whether the animal looks more like an elephant or more like a dog.

An algorithm that learns p(y|x) directly, or that directly maps the input x to the output label in {0, 1}, is called a discriminative learning algorithm.

Here we will discuss a different kind of algorithm, one that models p(x|y) and p(y); this is called a generative learning algorithm. For example, let y indicate the category of an animal: dog (y = 0) or elephant (y = 1). Then the model p(x|y = 0) describes the feature distribution of dogs, and p(x|y = 1) describes the feature distribution of elephants.

Once p(y) (the class prior) and p(x|y) (the class-conditional distribution) have been modeled, the posterior distribution of y given x follows from Bayes' rule:

p(y|x) = p(x|y) p(y) / p(x),

where the denominator is p(x) = p(x|y = 1) p(y = 1) + p(x|y = 0) p(y = 0); both p(x|y) and p(y) are learned from the training set. In fact, when using p(y|x) to make a prediction, we do not need to compute the denominator at all, because:

arg max_y p(y|x) = arg max_y p(x|y) p(y) / p(x) = arg max_y p(x|y) p(y).
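A minimal sketch of this prediction rule (the likelihood and prior functions here are placeholders to be supplied by whatever generative model has been fitted):

def predict(x, class_likelihoods, class_priors):
    # class_likelihoods: dict mapping y -> a function that computes p(x | y)
    # class_priors:      dict mapping y -> p(y)
    # Pick arg max_y p(x | y) p(y); the denominator p(x) is the same for
    # every class, so it can be ignored.
    scores = {y: class_likelihoods[y](x) * class_priors[y] for y in class_priors}
    return max(scores, key=scores.get)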


1. Gaussian discriminant analysis

As a first generative learning algorithm, let us look at Gaussian discriminant analysis (GDA). In this model, we assume that p(x|y) follows a multivariate normal distribution.

Before introducing the GDA model itself, let us briefly review the properties of the multivariate normal distribution.

1.1 Multivariate normal distribution

The multivariate normal distribution, also called the multivariate Gaussian distribution, in n dimensions is parameterized by a mean vector μ and a covariance matrix Σ (a positive definite matrix). It is usually written N(μ, Σ), and its density is:

p(x; μ, Σ) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) )

For a random variable X distributed as N(μ, Σ), the mean is E[X] = ∫ x p(x; μ, Σ) dx = μ, and the covariance is given by the covariance function: Cov(X) = Σ. The following examples show the probability density function of a two-dimensional Gaussian distribution:


In the three figures above, the mean is the zero vector in every case. The leftmost figure shows the standard normal distribution, whose covariance is the identity matrix I (the 2×2 identity matrix); the middle figure has covariance 0.6·I; and the right figure has covariance 2·I. The "larger" the covariance matrix, the more "spread out" the distribution is; the "smaller" the covariance matrix, the more "compressed" it is (this can also be checked numerically, as in the sketch below).
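The following sketch (using scipy's multivariate_normal; the scale factors mirror the three figures) evaluates the two-dimensional density at the mean for Σ = I, 0.6·I, and 2·I, showing that a smaller covariance produces a taller, more concentrated peak:

import numpy as np
from scipy.stats import multivariate_normal

mu = np.zeros(2)
for scale in (1.0, 0.6, 2.0):
    cov = scale * np.eye(2)
    # Peak height at the mean is 1 / (2*pi*sqrt(|cov|)) for a 2-D Gaussian.
    print(scale, multivariate_normal(mean=mu, cov=cov).pdf(mu))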

More examples follow. The Gaussian distributions in the next set of figures again have zero mean, and their covariance matrices are as follows:

The leftmost figure is the familiar standard Gaussian distribution. We can see that as the off-diagonal entries of the covariance matrix grow, the probability density becomes more "compressed" toward the 45-degree line (given by x1 = x2). This is even clearer from the contours of the probability density:


Changing the covariance matrix further gives the following probability density contours:

The covariance matrices for the figures above are:



In the last set of examples, we keep the covariance matrix fixed to the identity matrix and vary the mean μ, obtaining the following plots of the probability density function:


The corresponding mean vectors for the figures above are:



1.2 Gaussian discriminant analysis model

When we have a classification problem whose input features x are continuous random variables, we can apply Gaussian discriminant analysis (GDA), which models p(x|y) with a multivariate normal distribution:

y ~ Bernoulli(φ)
x | y = 0 ~ N(μ0, Σ)
x | y = 1 ~ N(μ1, Σ)


Written out explicitly, the distributions are:

p(y) = φ^y (1 − φ)^(1−y)
p(x | y = 0) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp( −(1/2) (x − μ0)ᵀ Σ⁻¹ (x − μ0) )
p(x | y = 1) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp( −(1/2) (x − μ1)ᵀ Σ⁻¹ (x − μ1) )
Here, the parameters of our model are φ, Σ, μ0, and μ1 (note that there are two different mean vectors but only one shared covariance matrix). The log-likelihood of the data is:

ℓ(φ, μ0, μ1, Σ) = log ∏ᵢ p(x^(i), y^(i); φ, μ0, μ1, Σ) = log ∏ᵢ p(x^(i) | y^(i); μ0, μ1, Σ) p(y^(i); φ)
Maximizing with respect to the parameters gives the maximum likelihood estimates:

φ  = (1/m) · Σᵢ 1{y^(i) = 1}
μ0 = Σᵢ 1{y^(i) = 0} x^(i)  /  Σᵢ 1{y^(i) = 0}
μ1 = Σᵢ 1{y^(i) = 1} x^(i)  /  Σᵢ 1{y^(i) = 1}
Σ  = (1/m) · Σᵢ (x^(i) − μ_{y^(i)}) (x^(i) − μ_{y^(i)})ᵀ

(where Σᵢ denotes the sum over i = 1, ..., m).

Having fit the parameters, our model can be visualized as follows:


The figure above shows the training samples together with the contours of the two fitted multivariate Gaussian distributions. The decision boundary, the set of points where p(y = 1|x) = 0.5, can be derived from the two Gaussians.
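A minimal numpy sketch of these maximum likelihood estimates and of prediction via Bayes' rule (the function and variable names are mine; X is an m×n matrix of training features and y is a 0/1 label vector):

import numpy as np
from scipy.stats import multivariate_normal

def gda_fit(X, y):
    # Maximum likelihood estimates for GDA with a shared covariance matrix.
    m = X.shape[0]
    phi = np.mean(y == 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    centered = X - np.where(y[:, None] == 1, mu1, mu0)
    sigma = centered.T @ centered / m
    return phi, mu0, mu1, sigma

def gda_predict(x, phi, mu0, mu1, sigma):
    # Compare p(x|y=1) p(y=1) with p(x|y=0) p(y=0); the denominator p(x) cancels.
    p1 = multivariate_normal.pdf(x, mean=mu1, cov=sigma) * phi
    p0 = multivariate_normal.pdf(x, mean=mu0, cov=sigma) * (1 - phi)
    return int(p1 > p0)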


1.3 Discussion: GDA and logistic regression

(Omitted.)


2. Naive Bayes

In the GDA model, the feature vector x is continuous and real-valued. We now discuss a different algorithm in which the feature vector takes discrete values.

Consider using machine learning to build a spam classifier: a system that automatically filters spam, or sorts it into a separate mail folder. Email classification is one example of text classification.

Suppose we have a training set in which each message is labeled as spam or not spam.

First, we represent each message by a feature vector whose length equals the number of words in the dictionary: if the message contains the i-th word of the dictionary, then x_i = 1; otherwise x_i = 0.

For example, a message that contains the words "a" and "buy" but not "aardvark", "aardwolf", or "zygmurgy" is represented by a vector with a 1 in the positions for "a" and "buy" and a 0 in the positions for "aardvark", "aardwolf", and "zygmurgy".
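A small sketch of this encoding (the tiny vocabulary and message below are made up for illustration; a real dictionary would contain tens of thousands of words):

vocabulary = ["a", "aardvark", "aardwolf", "buy", "zygmurgy"]  # toy dictionary

def message_to_vector(message, vocabulary):
    # x_i = 1 if the i-th dictionary word appears in the message, else 0.
    words = set(message.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

print(message_to_vector("Buy a cheap watch now", vocabulary))  # -> [1, 0, 0, 1, 0]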


A message is thus encoded as a 0/1 vector whose length equals the number of words in the dictionary, and we want to model p(x|y). If the dictionary has 50,000 words, then x is a 50,000-dimensional vector, and modeling it directly with a multinomial distribution over all 2^50,000 possible outcomes would require 2^50,000 − 1 parameters, which is clearly infeasible. Therefore, to model p(x|y), we make a strong assumption: given y, the features x_i are conditionally independent.

This assumption is called the Naive Bayes assumption, and the resulting algorithm is called the Naive Bayes classifier.

Under this assumption, we have:

p(x_1, ..., x_50000 | y) = p(x_1|y) p(x_2|y, x_1) ··· p(x_50000|y, x_1, ..., x_49999)
                         = p(x_1|y) p(x_2|y) ··· p(x_50000|y)
                         = ∏_i p(x_i|y)

The first equality is just a basic property of probability (the chain rule), and the second uses the Naive Bayes assumption. Note that even though the Naive Bayes assumption is a very strong one, the resulting algorithm works well on many problems.

The parameters of our model are φ_{i|y=1} = p(x_i = 1 | y = 1), φ_{i|y=0} = p(x_i = 1 | y = 0), and φ_y = p(y = 1). As usual, we are given a training set {(x^(i), y^(i)); i = 1, ..., m}.

We can write down the joint likelihood of the data:

L(φ_y, φ_{i|y=0}, φ_{i|y=1}) = ∏ᵢ p(x^(i), y^(i))


Maximizing the likelihood gives the parameter estimates:

φ_{j|y=1} = Σᵢ 1{x_j^(i) = 1 ∧ y^(i) = 1}  /  Σᵢ 1{y^(i) = 1}
φ_{j|y=0} = Σᵢ 1{x_j^(i) = 1 ∧ y^(i) = 0}  /  Σᵢ 1{y^(i) = 0}
φ_y = Σᵢ 1{y^(i) = 1} / m

(where Σᵢ denotes the sum over i = 1, ..., m).

In these formulas, the "∧" symbol denotes logical "and". The estimates have a very natural interpretation: for example, φ_{j|y=1} is simply the fraction of spam messages (y = 1) in which word j appears.

With these parameters fitted, to predict the label of a new sample with features x we simply compute:

p(y = 1|x) = p(x|y = 1) p(y = 1) / p(x)
           = (∏_i p(x_i|y = 1)) p(y = 1)  /  [ (∏_i p(x_i|y = 1)) p(y = 1) + (∏_i p(x_i|y = 0)) p(y = 0) ]
and then assign the new sample to whichever class has the higher posterior probability.
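A minimal numpy sketch of these estimates and of the prediction rule (without the Laplace smoothing discussed in the next section; X is an m×n binary matrix, y is a 0/1 label vector, and the names are mine):

import numpy as np

def nb_fit(X, y):
    # phi_j_y1[j]: fraction of spam messages (y = 1) containing word j
    # phi_j_y0[j]: fraction of non-spam messages (y = 0) containing word j
    phi_y = np.mean(y == 1)
    phi_j_y1 = X[y == 1].mean(axis=0)
    phi_j_y0 = X[y == 0].mean(axis=0)
    return phi_y, phi_j_y0, phi_j_y1

def nb_predict(x, phi_y, phi_j_y0, phi_j_y1):
    # p(x|y) = prod_j p(x_j|y); compare the two unnormalized posteriors.
    like1 = np.prod(np.where(x == 1, phi_j_y1, 1 - phi_j_y1))
    like0 = np.prod(np.where(x == 1, phi_j_y0, 1 - phi_j_y0))
    return int(like1 * phi_y > like0 * (1 - phi_y))

In practice one sums log-probabilities instead of multiplying raw probabilities, to avoid numerical underflow with large vocabularies.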

Finally, although the Naive Bayes algorithm discussed above assumes that each feature x_i is binary (Bernoulli-distributed), it generalizes directly to features taking values in more than two categories. Moreover, even when an original input feature is continuous (for example, the living area of a house in our earlier example), Naive Bayes can still be applied after discretizing that feature into buckets.

When the original continuous features are not well modeled by a multivariate normal distribution, discretizing them and using Naive Bayes (instead of GDA) often yields a better classifier.


2.1 Laplace Smoothing

(Omitted.)


