Please indicate the source when reprinting: http://www.cnblogs.com/jerrylead

1. Discriminative models and generative models

The regression model discussed in the previous post is a discriminative model: it predicts the probability of the result directly from the feature values. Formally, it solves for the conditional probability p(y|x; θ), the probability of the outcome given the features once the parameters θ are determined.

For example, suppose you want to determine whether an animal is a goat or a sheep. With a discriminative model, you learn a decision model from historical data, then extract the animal's features and directly predict the probability that it is a goat or a sheep. With a generative model, you instead first learn a goat model from goats' features and a sheep model from sheep's features; then you extract the animal's features, evaluate them under the goat model and under the sheep model, and pick whichever gives the larger probability. Formally, the generative approach evaluates p(x|y) and p(y), where y is the class label and x is the feature vector.

Bayes' rule links the two kinds of model:

p(y|x) = p(x|y) p(y) / p(x)

Since we only care about which discrete value of y gives the higher probability (e.g., goat versus sheep) rather than the exact probability, and p(x) does not depend on y, the formula can be rewritten as:

arg max_y p(x|y) p(y)

Here p(y|x) is called the posterior probability and p(y) the prior probability.

Therefore, a discriminative model computes the conditional probability p(y|x), while a generative model computes the joint probability p(x, y).

Common discriminative models include linear regression, logistic regression, linear discriminant analysis, SVMs, boosting, conditional random fields, and neural networks.

Common generative models include hidden Markov models, naive Bayes, Gaussian mixture models, LDA, and restricted Boltzmann machines.

This blog introduces two models in detail:

http://blog.sciencenet.cn/home.php?mod=space&uid=248173&do=blog&id=227964

2. Gaussian discriminant analysis (GDA)

1) The multivariate normal distribution

The multivariate normal distribution describes the distribution of an n-dimensional random variable. The scalar mean and variance of the one-dimensional normal become a mean vector μ and a covariance matrix Σ, and we write x ∼ N(μ, Σ). Suppose there are n random variables X1, X2, ..., Xn. The i-th component of μ is E(Xi), and Σij = Cov(Xi, Xj).

The probability density function is:

p(x; μ, Σ) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp(−(1/2)(x − μ)^T Σ^(−1) (x − μ))

where |Σ| is the determinant of the covariance matrix Σ, which is symmetric and positive semi-definite.
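As a sketch, the density above can be evaluated numerically; the function name and the illustrative mean and covariance below are my own, not from the post:

```python
import numpy as np

def mvn_pdf(x, mu, sigma):
    """Density of the multivariate normal N(mu, sigma) at point x."""
    n = len(mu)
    diff = x - mu
    # Normalizing constant 1 / ((2*pi)^(n/2) |Sigma|^(1/2))
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(sigma))
    # Quadratic form -(1/2) (x - mu)^T Sigma^{-1} (x - mu)
    exponent = -0.5 * diff @ np.linalg.solve(sigma, diff)
    return norm_const * np.exp(exponent)

# Illustrative 2-D case: the standard normal evaluated at the origin
mu = np.zeros(2)
sigma = np.eye(2)
p = mvn_pdf(np.zeros(2), mu, sigma)  # 1 / (2*pi), about 0.1592
```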

In the two-dimensional case, μ determines the center of the distribution and Σ determines the orientation and size of its elliptical contours. (The original post illustrates this with contour plots for several different values of μ and Σ.)

2) Model Analysis and Application

If the input feature x is a continuous random variable, the Gaussian discriminant analysis (GDA) model can be used to model p(x|y).

The model is:

y ∼ Bernoulli(φ)
x | y = 0 ∼ N(μ0, Σ)
x | y = 1 ∼ N(μ1, Σ)

The label follows a Bernoulli distribution, and the features given the label follow a multivariate Gaussian. Intuitively, in a goat model, continuous features such as beard length, horn size, and hair length each follow a Gaussian distribution, and together they form a feature vector that follows a multivariate Gaussian.

Writing out the probability density functions:

p(y) = φ^y (1 − φ)^(1−y)
p(x | y = 0) = the N(μ0, Σ) density at x
p(x | y = 1) = the N(μ1, Σ) density at x

The log-likelihood to be maximized is:

ℓ(φ, μ0, μ1, Σ) = log ∏_{i=1}^{m} p(x^(i), y^(i)) = log ∏_{i=1}^{m} p(x^(i) | y^(i)) p(y^(i))

Note that there are two mean parameters, μ0 and μ1: the feature means differ between the two classes, but we assume the covariance Σ is shared. Graphically, the two class densities have different centers but the same shape, so a straight line can separate and discriminate between them.

Setting the derivatives to zero yields the parameter estimates:

φ = (1/m) Σ_{i=1}^{m} 1{y^(i) = 1}, the fraction of training samples with y = 1;

μ0 = Σ_i 1{y^(i) = 0} x^(i) / Σ_i 1{y^(i) = 0}, the feature mean of the samples with y = 0;

μ1 = Σ_i 1{y^(i) = 1} x^(i) / Σ_i 1{y^(i) = 1}, the feature mean of the samples with y = 1;

Σ = (1/m) Σ_{i=1}^{m} (x^(i) − μ_{y^(i)})(x^(i) − μ_{y^(i)})^T, the average covariance of the features around their class means.

Pictured (in the original post's figure), the Gaussian contours on the two sides of the separating line have the same shape, because the covariance matrix Σ is shared, but different locations, because the means μ0 and μ1 differ.
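The parameter estimates translate almost directly into code. A minimal sketch on synthetic data (all names and the toy data are my own, not from the post):

```python
import numpy as np

def fit_gda(X, y):
    """Estimate GDA parameters (phi, mu0, mu1, shared Sigma) by maximum likelihood."""
    m = len(y)
    phi = np.mean(y == 1)            # fraction of samples with y = 1
    mu0 = X[y == 0].mean(axis=0)     # feature mean of the y = 0 class
    mu1 = X[y == 1].mean(axis=0)     # feature mean of the y = 1 class
    # Shared covariance: average outer product of residuals from each class mean
    diffs = X - np.where(y[:, None] == 1, mu1, mu0)
    sigma = diffs.T @ diffs / m
    return phi, mu0, mu1, sigma

# Synthetic two-class data sharing a common covariance
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)),
               rng.normal([3, 3], 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
phi, mu0, mu1, sigma = fit_gda(X, y)
```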

3) Relationship between Gaussian Discriminant Analysis (GDA) and logistic Regression

If we instead describe GDA through the conditional probability, we ask for

p(y = 1 | x; φ, μ0, μ1, Σ)

that is, y as a function of x, with φ, μ0, μ1, Σ as parameters. Expanding via Bayes' rule and simplifying, this reduces to

p(y = 1 | x) = 1 / (1 + exp(−θ^T x))

where θ is an appropriate function of φ, μ0, μ1, Σ, and 1/(1 + e^(−z)) is the sigmoid function. This is exactly the form of logistic regression.
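A sketch of the algebra behind this reduction (the standard derivation; the post itself omits it):

```latex
p(y=1\mid x)
  = \frac{p(x\mid y=1)\,\phi}{p(x\mid y=1)\,\phi + p(x\mid y=0)\,(1-\phi)}
  = \frac{1}{1 + \frac{1-\phi}{\phi}\,
      \exp\!\left(\tfrac12(x-\mu_1)^T\Sigma^{-1}(x-\mu_1)
                - \tfrac12(x-\mu_0)^T\Sigma^{-1}(x-\mu_0)\right)}
```

Because Σ is shared by the two classes, the quadratic terms x^T Σ^(−1) x cancel inside the exponent, leaving an expression linear in x; absorbing the constant into an intercept gives p(y = 1 | x) = 1/(1 + exp(−θ^T x)).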

In other words, if p(x|y) is multivariate Gaussian (with shared Σ), then p(y|x) follows a logistic regression model. The converse does not hold. Why not? Because GDA makes the stronger assumptions and constraints.

If the training data really follow a multivariate Gaussian distribution, GDA can be the best model on that training set. But we usually do not know in advance what distribution the data follow, and should not make such a strong assumption. Logistic regression's conditional assumption is weaker than GDA's, which is why logistic regression is used more often in practice.

For example, if the class-conditional feature distributions are Poisson, x | y = 1 ∼ Poisson(λ1) and x | y = 0 ∼ Poisson(λ0), then p(y|x) is again of logistic regression form. Using GDA on such data would work relatively poorly, because the features follow a Poisson distribution, not a multivariate Gaussian.

This is why logistic regression is used.

3. Naive Bayes model

In GDA, we require feature vector x to be a continuous real number vector. If x is a discrete value, the naive Bayes classification method can be considered.

Consider classifying email as spam versus normal mail. Spam classification is one application of text classification.

Start with the simplest feature representation. Take an English dictionary and list all of its words. Each email is then represented as a vector with one 0/1 dimension per dictionary word: 1 means the word appears in the email, 0 means it does not.

For example, if the words "a" and "buy" appear in an email while "aardvark", "aardwolf", and "zymurgy" do not, the email is represented as a vector with a 1 in the positions of "a" and "buy" and 0 everywhere else.
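As a small illustration of this representation (the toy five-word dictionary is invented here; a real one would have thousands of entries):

```python
# Build a 0/1 feature vector over a toy dictionary
dictionary = ["a", "aardvark", "aardwolf", "buy", "zymurgy"]
index = {word: i for i, word in enumerate(dictionary)}

def email_to_vector(words):
    """1 in position i if dictionary word i appears in the email, else 0."""
    x = [0] * len(dictionary)
    for w in words:
        if w in index:          # words outside the dictionary are ignored
            x[index[w]] = 1
    return x

x = email_to_vector(["a", "friend", "wants", "to", "buy", "a", "car"])
# x == [1, 0, 0, 1, 0]: "a" and "buy" present, the other dictionary words absent
```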

Suppose the dictionary contains 5000 words in total; then x is 5000-dimensional. To model p(x) directly we would have to set up a multinomial distribution (the generalization of the binomial distribution) over all possible vectors.

Multinomial distribution: in a random experiment with k possible outcomes A1, A2, ..., Ak, occurring with probabilities p1, p2, ..., pk, the probability that among N trials A1 occurs n1 times, A2 occurs n2 times, ..., Ak occurs nk times is

P = N! / (n1! n2! ··· nk!) · p1^n1 p2^n2 ··· pk^nk
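A quick numeric check of this formula (the dice example and function name are my own):

```python
from math import factorial

def multinomial_pmf(ns, ps):
    """P(n1, ..., nk) = N!/(n1! ... nk!) * p1^n1 * ... * pk^nk."""
    N = sum(ns)
    coeff = factorial(N)
    for n in ns:
        coeff //= factorial(n)   # multinomial coefficient
    prob = float(coeff)
    for n, p in zip(ns, ps):
        prob *= p ** n
    return prob

# A fair six-sided die rolled 3 times: probability of seeing
# faces 1, 2, 3 exactly once each is 3! * (1/6)^3 = 1/36
p = multinomial_pmf([1, 1, 1, 0, 0, 0], [1 / 6] * 6)
```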

In our problem, each email is one such trial, and there are 2^5000 possible outcome vectors, so the model would need on the order of 2^5000 − 1 parameters pi. That is far too many; direct modeling is infeasible.

Take a different route. What we want is p(y|x), and by the generative model definition we can get it from p(x|y) and p(y). Assume the features in x are conditionally independent given y. This is called the naive Bayes assumption. For example, if an email is spam (y = 1), then whether the word "buy" appears in it is assumed to be unrelated to whether "price" appears; "buy" and "price" are conditionally independent given the class.

Formally, X and Y are conditionally independent given Z when

p(x | z) = p(x | y, z)

which can equivalently be expressed as

p(x, y | z) = p(x | z) p(y | z)

Back to our problem, the assumption gives

p(x1, ..., x5000 | y) = p(x1|y) p(x2|y) ··· p(x5000|y) = ∏_{i=1}^{5000} p(xi | y)

This is somewhat like the n-gram models in NLP; it corresponds to a unigram model.

Note that the naive Bayes assumption is a very strong one: "buy" and "price" are in general correlated, yet here we assume them conditionally independent. (Conditional independence is not the same as independence.)

The formal model has parameters

φ_{j|y=1} = p(xj = 1 | y = 1), φ_{j|y=0} = p(xj = 1 | y = 0), φ_y = p(y = 1)

We want the model to assign the training data the largest probability, i.e., the maximum likelihood estimate of

L(φ_y, φ_{j|y=0}, φ_{j|y=1}) = ∏_{i=1}^{m} p(x^(i), y^(i))

Note that we maximize the product of joint probabilities p(x, y), which is what makes naive Bayes a generative model.

Solving gives:

φ_{j|y=1} = Σ_i 1{xj^(i) = 1 ∧ y^(i) = 1} / Σ_i 1{y^(i) = 1}
φ_{j|y=0} = Σ_i 1{xj^(i) = 1 ∧ y^(i) = 0} / Σ_i 1{y^(i) = 0}
φ_y = Σ_i 1{y^(i) = 1} / m

The last formula is the fraction of samples with y = 1 among all samples; the first two are the fraction of samples of class y = 1 (or y = 0) in which feature xj = 1.

What we actually want, however, is

p(y = 1 | x) = p(x | y = 1) p(y = 1) / p(x) = (∏_i p(xi | y = 1)) p(y = 1) / p(x)

In practice we only compute the numerator for each class, since the denominator p(x) is the same for y = 1 and y = 0.
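Putting the estimates and the prediction rule together, here is a minimal Bernoulli naive Bayes on toy 0/1 data; the data and names are my own sketch, with no smoothing applied:

```python
import numpy as np

def fit_nb(X, y):
    """MLE for Bernoulli naive Bayes: phi_y and phi_{j|y} for y in {0, 1}."""
    phi_y = np.mean(y == 1)
    phi_j1 = X[y == 1].mean(axis=0)   # p(x_j = 1 | y = 1)
    phi_j0 = X[y == 0].mean(axis=0)   # p(x_j = 1 | y = 0)
    return phi_y, phi_j0, phi_j1

def posterior_spam(x, phi_y, phi_j0, phi_j1):
    """p(y=1|x): compare the numerators p(x|y) p(y); the denominator cancels."""
    lik1 = np.prod(np.where(x == 1, phi_j1, 1 - phi_j1)) * phi_y
    lik0 = np.prod(np.where(x == 1, phi_j0, 1 - phi_j0)) * (1 - phi_y)
    return lik1 / (lik1 + lik0)

X = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 0],
              [0, 0, 1], [1, 1, 1], [0, 0, 0]])
y = np.array([1, 1, 1, 0, 0, 0])
params = fit_nb(X, y)
p = posterior_spam(np.array([1, 1, 0]), *params)
# p is 8/9: under the fitted model this email is eight times
# as likely to be spam as non-spam
```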

Naturally, naive Bayes extends to the case where x and y take multiple discrete values. When a feature is continuous, we can discretize it into intervals, and information gain can be used to decide which discretization is best (see the decision-tree chapter of Mitchell).

For example, house living area can be bucketed into a few ranges, with each range mapped to a discrete value.

4. Laplace smoothing

The naive Bayes method has a fatal drawback: it is too sensitive to sparse data.

For example, in the email classification above, suppose a new email arrives with the title "NIPS call for papers". Suppose we use a larger online dictionary (35000 words instead of 5000) and the word "nips" sits at position 35000, but "nips" never occurred in the training data; this email is the first time it appears. The maximum likelihood estimates are then

φ_{35000|y=1} = 0 and φ_{35000|y=0} = 0

since "nips" appeared in no spam and no normal training email, the estimates can only be 0. Both class-conditional products ∏_i p(xi|y) therefore contain a zero factor, and the resulting posterior, 0 / (0 + 0), is meaningless.

The root cause is the conditional-independence assumption: the estimate is a product over features, so a single zero factor wipes out the entire result.

To fix this, we assign unseen feature values a small positive value instead of 0.

The specific smoothing method is as follows:

Suppose a discrete random variable z takes values in {1, 2, ..., k}, and let φ_j = p(z = j) denote the probability of each value. Given m training samples with observed values z^(1), ..., z^(m), each one of the k values, the original estimate is

φ_j = Σ_{i=1}^{m} 1{z^(i) = j} / m

In plain words: simply the fraction of observations in which z = j appears.

Laplace smoothing adds 1 in advance to the count of each of the k values; informally, we pretend every value has already been seen once.

The corrected expression is:

φ_j = (Σ_{i=1}^{m} 1{z^(i) = j} + 1) / (m + k)

Each numerator gains 1 and the denominator gains k, so it is easy to see that the φ_j still sum to 1.

This resembles add-one smoothing in NLP; of course, there are many other smoothing methods, which I will not detail here.
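The smoothed estimator is essentially one line; a sketch with a hypothetical count vector of my own:

```python
def laplace_smooth(counts, m):
    """phi_j = (count_j + 1) / (m + k) for a k-valued discrete variable."""
    k = len(counts)
    return [(c + 1) / (m + k) for c in counts]

# 10 observations of a 3-valued variable; value 3 was never observed
counts = [6, 4, 0]
phis = laplace_smooth(counts, m=10)
# [7/13, 5/13, 1/13]: no zero estimates, and the probabilities still sum to 1
```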


Returning to the email classification problem, the smoothed formulas become:

φ_{j|y=1} = (Σ_i 1{xj^(i) = 1 ∧ y^(i) = 1} + 1) / (Σ_i 1{y^(i) = 1} + 2)
φ_{j|y=0} = (Σ_i 1{xj^(i) = 1 ∧ y^(i) = 0} + 1) / (Σ_i 1{y^(i) = 0} + 2)

(the denominators add 2 because each binary feature takes k = 2 values).

5. An event model for text classification

Recall the naive Bayes model we just used for text classification, which is called the multivariate Bernoulli event model. In this model, we first randomly choose the email's type (spam or normal, i.e., sample y from p(y)); then we go through the dictionary word by word, from the first entry to the last, independently deciding whether each word appears in the email: 1 if it appears, 0 otherwise, according to the probability p(xi|y). The words marked 1 together make up the email, whose probability is p(y) ∏_{i=1}^{5000} p(xi|y).

Now change perspective and start from the email rather than the dictionary. Let i index the i-th word of the email, and let xi be that word's position in the dictionary, so xi takes values in {1, 2, ..., |V|}, where |V| is the dictionary size. An email of n words is then represented as the vector (x1, x2, ..., xn); note that n differs from email to email. Generating an email amounts to rolling a |V|-sided die n times and recording the outcomes, where each roll follows p(xi|y) and the rolls are conditionally independent. The email's probability is therefore p(y) ∏_{i=1}^{n} p(xi|y). What is the difference from before? Previously n was the dictionary size and xi indicated whether word i appears, taking only the values 0 and 1, with p(xi = 1|y) + p(xi = 0|y) = 1; now n is the number of words in the email, xi takes one of |V| values, and Σ_{k=1}^{|V|} p(xi = k|y) = 1. The former is a multivariate Bernoulli model with a 0/1-valued x vector; the latter is a multinomial model whose x vector holds dictionary positions.

Formally, the m training samples are

(x^(i), y^(i)), i = 1, ..., m, with x^(i) = (x1^(i), x2^(i), ..., x_{ni}^(i))

meaning the i-th sample contains ni words, each recorded by its dictionary index.

Following the same maximum likelihood procedure as naive Bayes, we maximize

L(φ_y, φ_{k|y=0}, φ_{k|y=1}) = ∏_{i=1}^{m} p(x^(i), y^(i))

Solving gives:

φ_{k|y=1} = Σ_i Σ_{j=1}^{ni} 1{xj^(i) = k ∧ y^(i) = 1} / Σ_i 1{y^(i) = 1} ni
φ_{k|y=0} = Σ_i Σ_{j=1}^{ni} 1{xj^(i) = k ∧ y^(i) = 0} / Σ_i 1{y^(i) = 0} ni
φ_y = Σ_i 1{y^(i) = 1} / m

Compared with the earlier formulas, the denominator counts the total number of words ni rather than the number of documents, and the numerator's indicator now tests xj^(i) = k instead of a 0/1 value.

For example:

x1 | x2 | x3 | y
1  | 2  | -  | 1
2  | 1  | -  | 0
1  | 3  | 2  | 0
3  | 3  | 3  | 1

Assume emails contain only the three words a, b, and c, whose dictionary positions are 1, 2, and 3. The first two emails contain two words each; the last two contain three words each.

y = 1 indicates spam.

The estimates are then:

φ_y = 2/4 = 1/2
φ_{1|y=1} = 1/5, φ_{2|y=1} = 1/5, φ_{3|y=1} = 3/5 (5 words total in the two spam emails)
φ_{1|y=0} = 2/5, φ_{2|y=0} = 2/5, φ_{3|y=0} = 1/5 (5 words total in the two normal emails)

If a new email contains the words b and c, its feature vector is {2, 3}, so

p(x|y=1) p(y=1) = (1/5)(3/5)(1/2) = 3/50
p(x|y=0) p(y=0) = (2/5)(1/5)(1/2) = 2/50
p(y=1|x) = (3/50) / (3/50 + 2/50) = 0.6

The probability that the email is spam is 0.6.
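The arithmetic above can be checked directly. A sketch of the multinomial event model on the four training emails (the code and names are my own):

```python
from collections import Counter

# Training emails as lists of dictionary indices (a=1, b=2, c=3), with labels
emails = [([1, 2], 1), ([2, 1], 0), ([1, 3, 2], 0), ([3, 3, 3], 1)]

def fit(emails):
    word_counts = {0: Counter(), 1: Counter()}
    total_words = {0: 0, 1: 0}
    n_docs = {0: 0, 1: 0}
    for words, label in emails:
        word_counts[label].update(words)
        total_words[label] += len(words)
        n_docs[label] += 1
    phi_y = n_docs[1] / (n_docs[0] + n_docs[1])
    # phi_{k|y} = (occurrences of word k in class y) / (total words in class y)
    phi_k = {y: {k: word_counts[y][k] / total_words[y] for k in (1, 2, 3)}
             for y in (0, 1)}
    return phi_y, phi_k

phi_y, phi_k = fit(emails)

def posterior_spam(words):
    lik1, lik0 = phi_y, 1 - phi_y
    for k in words:
        lik1 *= phi_k[1][k]
        lik0 *= phi_k[0][k]
    return lik1 / (lik1 + lik0)

p = posterior_spam([2, 3])   # the new email "b c"
# p is 0.6, matching the hand calculation in the text
```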

Note the difference from the earlier naive Bayes formulas: here a single multinomial is estimated from all word positions of all samples together, whereas the Bernoulli model estimated a separate distribution for each feature; also, the number of features (words) now varies from sample to sample.

With Laplace smoothing, the formulas become:

φ_{k|y=1} = (Σ_i Σ_j 1{xj^(i) = k ∧ y^(i) = 1} + 1) / (Σ_i 1{y^(i) = 1} ni + |V|)
φ_{k|y=0} = (Σ_i Σ_j 1{xj^(i) = k ∧ y^(i) = 0} + 1) / (Σ_i 1{y^(i) = 0} ni + |V|)

that is, we pretend each of the |V| dictionary words has occurred at least once in each class.

In addition, although Naive Bayes is sometimes not the best classification method, it is simple, effective, and fast.

From: http://www.cnblogs.com/jerrylead/archive/2011/03/05/1971903.html
