Discriminative model, generative model, and the naive Bayes method
Please indicate the source when reproducing: http://www.cnblogs.com/jerrylead

1 Discriminative model and generative model
The regression models discussed in the previous notes are discriminative models: they compute the probability of the result directly from the feature values. Formally, with the parameters θ determined, they solve for the conditional probability p(y|x; θ). In plain terms, a discriminative model predicts the probability of the result given the features.
For example, suppose we want to determine whether an animal is a goat or a sheep. The discriminative approach is to learn a model from historical data first, then extract the features of this animal and use the model to predict the probability that it is a goat and the probability that it is a sheep. The generative approach takes another route: it learns a goat model from the features of goats and a sheep model from the features of sheep; then it extracts the features of this animal, evaluates them under the goat model and under the sheep model, and picks whichever gives the larger probability. Formally, this amounts to finding p(x|y) (together with p(y)), where y is the class (which model) and x is the feature vector.
The two models are connected by Bayes' rule:

p(y|x) = p(x|y) p(y) / p(x)
Since we only care about which value of the discrete variable y gives the larger probability (for example, the probability of goat versus the probability of sheep), and not about the exact probability, the above can be rewritten as:

argmax_y p(y|x) = argmax_y p(x|y) p(y) / p(x) = argmax_y p(x|y) p(y)
Here p(y|x) is called the posterior probability and p(y) the prior probability.
Since p(x|y) p(y) = p(x, y), the discriminative model is sometimes described as finding the conditional probability p(y|x), and the generative model as finding the joint probability p(x, y).
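To make the two routes concrete, here is a minimal numerical sketch (not from the original post; the probability tables below are made up) of how a generative model turns p(x|y) and p(y) into a prediction via Bayes' rule:

```python
# A discriminative model estimates p(y|x) directly; a generative model estimates
# p(x|y) and p(y) and combines them with Bayes' rule. Numbers below are made up.
import numpy as np

p_y = np.array([0.7, 0.3])                     # prior: p(y=0), p(y=1)
p_x_given_y = np.array([[0.5, 0.3, 0.2],       # p(x|y=0) for a discrete feature x in {0,1,2}
                        [0.1, 0.2, 0.7]])      # p(x|y=1)

x = 2
joint = p_x_given_y[:, x] * p_y                # p(x, y) = p(x|y) p(y)
posterior = joint / joint.sum()                # Bayes' rule: p(y|x) = p(x, y) / p(x)
print("p(y|x=2) =", posterior, "-> predict y =", posterior.argmax())
```

For classification, dividing by p(x) is optional: the argmax over y is already decided by p(x|y) p(y).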
Common discriminative models include linear regression, logistic regression, linear discriminant analysis, support vector machines, boosting, conditional random fields, neural networks, and so on.
Common generative models include hidden Markov models, the naive Bayes model, Gaussian mixture models, LDA, restricted Boltzmann machines, and so on.
This blog post gives a more detailed introduction to the two kinds of models:
http://blog.sciencenet.cn/home.php?mod=space&uid=248173&do=blog&id=227964
2 Gaussian discriminant analysis (GDA)
1) Multivariate normal distribution
The multivariate normal distribution describes the joint distribution of an n-dimensional random variable: the mean μ becomes an n-dimensional vector and the variance becomes an n×n covariance matrix Σ, written x ~ N(μ, Σ). Suppose the n random variables are x1, x2, ..., xn. The i-th component of μ is E[xi], and Σ_ij = Cov(xi, xj).
The probability density function is:

p(x; μ, Σ) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp( -(1/2) (x - μ)^T Σ^(-1) (x - μ) )
Here |Σ| is the determinant of Σ, Σ is the covariance matrix, and it is symmetric and positive semi-definite.
When n = 2, the density can be drawn as a bell-shaped surface over the plane. The mean μ determines the center position, and the covariance Σ determines the orientation and size of the projected ellipse. (The original post shows several plots of two-dimensional Gaussians with different μ and Σ.)
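As an illustration, the density formula above can be evaluated directly with NumPy; this is a sketch with made-up values of μ and Σ, just to show where the determinant and the quadratic form enter:

```python
# Sketch: the multivariate normal density written out from the formula above.
import numpy as np

def mvn_pdf(x, mu, Sigma):
    n = mu.shape[0]
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
    exponent = -0.5 * diff @ np.linalg.solve(Sigma, diff)   # (x-mu)^T Sigma^{-1} (x-mu)
    return norm_const * np.exp(exponent)

mu = np.array([0.0, 0.0])            # center of the ellipse (illustrative value)
Sigma = np.array([[1.0, 0.8],        # covariance controls orientation and size (illustrative)
                  [0.8, 1.0]])
print(mvn_pdf(np.array([0.5, 0.5]), mu, Sigma))
```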
2) Model analysis and application
If the input feature x is a continuous random variable, the Gaussian discriminant analysis model can be used to model p(x|y).
The model is as follows:

y ~ Bernoulli(φ)
x | y = 0 ~ N(μ0, Σ)
x | y = 1 ~ N(μ1, Σ)
That is, the output y follows a Bernoulli distribution, and given the class, the features follow a multivariate Gaussian distribution. Informally, under the goat model, continuous variables such as beard length, horn size, and hair length each follow a Gaussian distribution, and the feature vector they form follows a multivariate Gaussian distribution.
The probability density functions can then be written out:

p(y) = φ^y (1 - φ)^(1-y)
p(x | y = 0) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp( -(1/2) (x - μ0)^T Σ^(-1) (x - μ0) )
p(x | y = 1) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp( -(1/2) (x - μ1)^T Σ^(-1) (x - μ1) )
The log-likelihood to be maximized for maximum likelihood estimation is:

l(φ, μ0, μ1, Σ) = log ∏_{i=1}^m p(x^(i), y^(i); φ, μ0, μ1, Σ) = log ∏_{i=1}^m p(x^(i) | y^(i)) p(y^(i))
Note that there are two mean parameters, μ0 and μ1, meaning that the feature means differ between the two classes, but we assume the covariance matrix Σ is the same. On a plot this means the centers of the two models differ while their shapes are identical, so a straight line can separate the two classes.
Taking derivatives and setting them to zero yields the parameter estimates:

φ = (1/m) Σ_{i=1}^m 1{y^(i) = 1}
μ0 = Σ_{i=1}^m 1{y^(i) = 0} x^(i) / Σ_{i=1}^m 1{y^(i) = 0}
μ1 = Σ_{i=1}^m 1{y^(i) = 1} x^(i) / Σ_{i=1}^m 1{y^(i) = 1}
Σ = (1/m) Σ_{i=1}^m (x^(i) - μ_{y^(i)}) (x^(i) - μ_{y^(i)})^T
φ is the proportion of training samples with y = 1.
μ0 is the mean of the features over the samples with y = 0.
μ1 is the mean of the features over the samples with y = 1.
Σ is the average of the sample feature covariances.
As mentioned earlier, this appears on the plot as two Gaussian contours separated by a straight line: the y values differ on the two sides of the line, but the covariance matrices are the same, so the shapes are the same; the means μ0 and μ1 differ, so the positions are different.
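The four estimates above translate almost line-for-line into code. The following sketch fits them on synthetic two-dimensional data (the data and the names phi, mu0, mu1, Sigma are illustrative, not from the original post):

```python
# Sketch of the GDA maximum-likelihood estimates on made-up synthetic data.
import numpy as np

rng = np.random.default_rng(0)
m = 200
y = (rng.random(m) < 0.4).astype(int)                  # Bernoulli labels
X = np.where(y[:, None] == 1,
             rng.normal([2.0, 2.0], 1.0, (m, 2)),      # class 1 cloud
             rng.normal([-1.0, 0.0], 1.0, (m, 2)))     # class 0 cloud

phi = y.mean()                                         # fraction of samples with y = 1
mu0 = X[y == 0].mean(axis=0)                           # feature mean for y = 0
mu1 = X[y == 1].mean(axis=0)                           # feature mean for y = 1
centered = X - np.where(y[:, None] == 1, mu1, mu0)     # subtract each sample's class mean
Sigma = centered.T @ centered / m                      # shared covariance estimate

print(phi, mu0, mu1, Sigma, sep="\n")
```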
3) The relationship between Gaussian discriminant analysis (GDA) and logistic regression
GDA expressed as a conditional probability has the form:

p(y = 1 | x; φ, μ0, μ1, Σ)

That is, y is a function of x, with φ, μ0, μ1, Σ as parameters.
Further derivation gives

p(y = 1 | x; φ, μ0, μ1, Σ) = 1 / (1 + exp(-θ^T x)),

where θ is a function of φ, μ0, μ1, Σ. This is exactly the form of logistic regression.
That is, if p(x|y) follows a multivariate Gaussian distribution, then p(y|x) follows a logistic regression model. The converse does not hold. Why not? Because GDA makes stronger modeling assumptions, with more constraints.
If the training data really do follow a multivariate Gaussian distribution, GDA can be the best model on the training set. However, we often do not know in advance what distribution the training data follow, so we cannot make such a strong assumption. The assumptions behind logistic regression are weaker than those of GDA, which is why logistic regression is used more often in practice.
For example, if the training data follow Poisson distributions,

x | y = 0 ~ Poisson(λ0)
x | y = 1 ~ Poisson(λ1),

then p(y|x) is still of the logistic regression form. If GDA were used in this case, the results would be poor, because the feature distribution of the training data is not multivariate Gaussian but Poisson. This is another reason to favor logistic regression.
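For completeness, here is a sketch of the GDA-to-logistic correspondence: given fitted GDA parameters, p(y = 1 | x) can be evaluated in sigmoid form with weight vector θ = Σ^(-1)(μ1 - μ0) and an intercept built from μ0, μ1, Σ, φ. The algebra is the standard derivation; the numerical values below are made up.

```python
# Sketch: evaluating the GDA posterior in its logistic (sigmoid) form.
import numpy as np

def gda_posterior(x, phi, mu0, mu1, Sigma):
    Sinv = np.linalg.inv(Sigma)
    theta = Sinv @ (mu1 - mu0)                          # logistic weight vector
    b = (-0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu0 @ Sinv @ mu0
         + np.log(phi / (1 - phi)))                     # intercept term
    return 1.0 / (1.0 + np.exp(-(theta @ x + b)))       # p(y=1|x) in sigmoid form

# Illustrative parameter values of the kind estimated above:
phi, mu0, mu1 = 0.4, np.array([-1.0, 0.0]), np.array([2.0, 2.0])
Sigma = np.eye(2)
print(gda_posterior(np.array([1.0, 1.0]), phi, mu0, mu1, Sigma))
```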
3 Naive Bayes model
In GDA we required the feature vector x to be continuous and real-valued. If x takes discrete values, we can consider the naive Bayes classification method instead.
Suppose we want to classify spam versus normal e-mail. Classifying e-mail is an application of text classification.
Assume the simplest feature representation: take an English dictionary and list all the words in it. Each message is then represented as a vector, where each dimension is a 0/1 value for one word in the dictionary; 1 indicates that the word appears in the message, 0 that it does not.
For example, "a" and "buy" appear in an e-mail message, there is no "aardvark", "Aardwolf" and "Zygmurgy", then it can be formally expressed as:
Assuming the dictionary contains 5,000 words in total, x is 5,000-dimensional. Suppose we tried to model x directly with a multinomial distribution (an extension of the binomial distribution).
Multinomial distribution: if a random experiment has k possible outcomes A1, A2, ..., Ak with probabilities p1, p2, ..., pk, then in n independent trials the probability that A1 appears n1 times, A2 appears n2 times, ..., Ak appears nk times is

P(X1 = n1, X2 = n2, ..., Xk = nk) = n! / (n1! n2! ... nk!) · p1^n1 p2^n2 ... pk^nk,  where n1 + n2 + ... + nk = n

(here Xi counts the ni occurrences of Ai).
Applied to the problem above, each e-mail is one random trial, and the number of possible outcomes is 2^5000. That would require 2^5000 - 1 parameters pi, far too many to estimate, so this direct approach cannot be used for modeling.
Take another approach. We want p(y|x); by the definition of a generative model we instead model p(x|y) and p(y). Assume that the features in x are conditionally independent given y. This is called the naive Bayes assumption. For example, it says that if a message is spam (y = 1), then whether the word "buy" appears in the message has no influence on whether "price" appears: "buy" and "price" are conditionally independent given y.
Formally, x and y are conditionally independent given z when:

p(x | z) = p(x | y, z)
It can also be expressed as:

p(x, y | z) = p(x | z) p(y | z)
Back to the problem:

p(x1, ..., x5000 | y)
  = p(x1 | y) p(x2 | y, x1) p(x3 | y, x1, x2) ... p(x5000 | y, x1, ..., x4999)
  = p(x1 | y) p(x2 | y) ... p(x5000 | y)        (by the naive Bayes assumption)
  = ∏_{j=1}^{5000} p(xj | y)
This is somewhat similar to the n-gram language model in NLP; it corresponds to a unigram model.
Notice that the naive Bayes assumption is a very strong one: "buy" is usually correlated with "price", yet we assume they are conditionally independent. (Note that conditional independence and independence are not the same thing.)
Set up the formal model with parameters:

φ_{j|y=1} = p(xj = 1 | y = 1)
φ_{j|y=0} = p(xj = 1 | y = 0)
φ_y = p(y = 1)
We want the model to maximize the probability of the training data, i.e., the maximum likelihood objective is:

L(φ_y, φ_{j|y=0}, φ_{j|y=1}) = ∏_{i=1}^m p(x^(i), y^(i))
Note that what is maximized is a product of joint probabilities, which shows that naive Bayes is a generative model.
The solution is:

φ_{j|y=1} = Σ_{i=1}^m 1{xj^(i) = 1 ∧ y^(i) = 1} / Σ_{i=1}^m 1{y^(i) = 1}
φ_{j|y=0} = Σ_{i=1}^m 1{xj^(i) = 1 ∧ y^(i) = 0} / Σ_{i=1}^m 1{y^(i) = 0}
φ_y = Σ_{i=1}^m 1{y^(i) = 1} / m
The last expression is the ratio of the number of samples with y = 1 to the total number of samples; the first two are the fraction of the y = 1 (respectively y = 0) samples in which feature xj = 1.
But what we actually want is

p(y = 1 | x) = p(x | y = 1) p(y = 1) / p(x)
             = ∏_j p(xj | y = 1) · φ_y / ( ∏_j p(xj | y = 1) · φ_y + ∏_j p(xj | y = 0) · (1 - φ_y) )

In practice we only need to compare the numerators, since the denominator is the same for y = 1 and y = 0.
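Putting the estimates and the prediction rule together, here is a sketch of the multi-variate Bernoulli naive Bayes classifier on made-up 0/1 data (no smoothing yet, so zero counts can still cause trouble, as discussed in the next section):

```python
# Sketch: naive Bayes with 0/1 features, fitted by the formulas above (made-up data).
import numpy as np

X = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]])
y = np.array([1, 1, 0, 0])

phi_y = y.mean()                                  # p(y=1)
phi_j_y1 = X[y == 1].mean(axis=0)                 # p(x_j=1 | y=1)
phi_j_y0 = X[y == 0].mean(axis=0)                 # p(x_j=1 | y=0)

def posterior_spam(x):
    # p(x|y) = prod_j p(x_j|y) under the naive Bayes assumption
    lik1 = np.prod(np.where(x == 1, phi_j_y1, 1 - phi_j_y1)) * phi_y
    lik0 = np.prod(np.where(x == 1, phi_j_y0, 1 - phi_j_y0)) * (1 - phi_y)
    return lik1 / (lik1 + lik0)                   # normalize by p(x) = sum over both classes

print(posterior_spam(np.array([1, 0, 0])))
```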
Of course, the naive Bayes approach extends to the case where both x and y take multiple discrete values. For continuous features, we can discretize them into segments to convert continuous values into discrete ones. How best to choose the split points can be decided with an information-gain criterion (see the decision-tree chapter of Mitchell's Machine Learning).
For example, house size can be divided into discrete buckets, as in the sketch below.
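A possible discretization sketch (the cut points are made up; in practice they would be chosen with a criterion such as information gain):

```python
# Sketch: turning a continuous feature into discrete bucket labels.
import numpy as np

bins = [400, 800, 1200, 1600]                 # hypothetical square-footage cut points
sizes = np.array([350, 950, 1710, 1250])
discrete = np.digitize(sizes, bins) + 1       # map each size to a bucket label 1..5
print(discrete)                               # [1 3 5 4]
```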
4 Laplace Smoothing
One fatal drawback of the naive Bayes method is that it is very sensitive to sparse data.
Take the e-mail classification example again. Suppose a new e-mail arrives with the title "NIPS call for papers". We use a larger dictionary taken from the web (its size grows from 5,000 to 35,000 words), and the word "nips" sits at position 35,000 in the dictionary. However, "nips" never appeared in the training data; this e-mail is the first time it shows up. When we estimate its probabilities we get:

φ_{35000|y=1} = Σ_{i=1}^m 1{x_35000^(i) = 1 ∧ y^(i) = 1} / Σ_{i=1}^m 1{y^(i) = 1} = 0
φ_{35000|y=0} = Σ_{i=1}^m 1{x_35000^(i) = 1 ∧ y^(i) = 0} / Σ_{i=1}^m 1{y^(i) = 0} = 0
Since "nips" never appeared before, in either spam or normal mail, both estimates can only be 0.
As a result the final conditional probability

p(y = 1 | x) = ∏_j p(xj | y = 1) · φ_y / ( ∏_j p(xj | y = 1) · φ_y + ∏_j p(xj | y = 0) · (1 - φ_y) ) = 0 / 0

is meaningless: both the numerator and the denominator contain a zero factor.
The reason is that we assumed the features to be conditionally independent and obtained the result by multiplying their probabilities together.
To solve this problem, we assign a small value, rather than 0, to feature values that never appear in the training data.
The specific smoothing method is as follows:
Suppose the discrete random variable z takes values in {1, 2, ..., k}, and let φ_j denote the probability of each value. Suppose there are m training samples, whose observed values of z are z^(1), ..., z^(m), each being one of the k values. The original estimate is then:

φ_j = Σ_{i=1}^m 1{z^(i) = j} / m
In plain terms, this is the proportion of observations in which z = j appears.
Laplace smoothing increases the count of each of the k values by 1 in advance; in layman's terms, it pretends each value has already been seen once.
The modified estimate is:

φ_j = ( Σ_{i=1}^m 1{z^(i) = j} + 1 ) / (m + k)

Each numerator for z = j gains 1 and the denominator gains k, so every φ_j is strictly positive, and it is easy to check that the φ_j still sum to 1.
This is a bit like add-one smoothing in NLP; of course there are many other smoothing methods, which will not be covered in detail here.
Returning to the e-mail classification problem, the modified formulas are:

φ_{j|y=1} = ( Σ_{i=1}^m 1{xj^(i) = 1 ∧ y^(i) = 1} + 1 ) / ( Σ_{i=1}^m 1{y^(i) = 1} + 2 )
φ_{j|y=0} = ( Σ_{i=1}^m 1{xj^(i) = 1 ∧ y^(i) = 0} + 1 ) / ( Σ_{i=1}^m 1{y^(i) = 0} + 2 )
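In code, the smoothed estimates only change the counting step: add 1 to each numerator and 2 to each denominator (two possible values per 0/1 feature). A sketch on made-up data:

```python
# Sketch: Laplace-smoothed parameter estimates for the 0/1 feature model.
import numpy as np

def smoothed_params(X, y):
    n1 = (y == 1).sum()
    n0 = (y == 0).sum()
    phi_j_y1 = (X[y == 1].sum(axis=0) + 1) / (n1 + 2)   # +1 numerator, +2 denominator
    phi_j_y0 = (X[y == 0].sum(axis=0) + 1) / (n0 + 2)
    phi_y = n1 / len(y)
    return phi_y, phi_j_y0, phi_j_y1

# A word that never occurs in training now gets a small nonzero probability instead of 0.
X = np.array([[1, 0], [1, 0], [0, 0]])    # second word never appears (made-up data)
y = np.array([1, 1, 0])
print(smoothed_params(X, y))
```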
5 Event models for text categorization
Recall the naive Bayes model we just used for text classification; it is called the multi-variate Bernoulli event model. In this model we first randomly choose the type of the message (spam or ordinary mail, i.e., draw from p(y)); then someone walks through the dictionary from the first word to the last and randomly decides, with probability p(xi|y), whether each word appears in the message (marking it 1) or not (marking it 0). The words marked 1 then make up the e-mail. The probability of the e-mail is therefore p(y) ∏_{i=1}^n p(xi | y).
Now let us change the point of view. This time we do not start from the dictionary but from the e-mail itself. Let i index the i-th word in the message and let xi denote that word's position in the dictionary, so xi takes values in {1, ..., |V|}, where |V| is the number of words in the dictionary. An e-mail of n words is then represented as (x1, x2, ..., xn); n varies because messages have different lengths. To generate an e-mail we draw each xi from the |V| possible values; this is like repeatedly rolling a die with |V| faces and recording the outcomes, which together form the message. Each face comes up with probability p(xi|y), and the trials are conditionally independent, so the probability of the e-mail is again p(y) ∏_{i=1}^n p(xi | y).

This looks the same as before, so what is the difference? Note that in the previous model n was the number of words in the dictionary, whereas here n is the number of words in the message. Previously xi indicated whether a word appears and took only the two values 0 and 1, whose probabilities sum to 1; here xi takes one of |V| values, and the |V| probabilities p(xi|y) sum to 1. This is the multinomial event model. The earlier x vectors consist of 0/1 values, while the x vectors here consist of positions in the dictionary.
The formal representation of the parameters is:

φ_y = p(y = 1)
φ_{k|y=1} = p(xj = k | y = 1)
φ_{k|y=0} = p(xj = k | y = 0)

(the probability that a word position takes dictionary value k is assumed not to depend on the position j).
The m training samples are written as:

{ (x^(i), y^(i)); i = 1, ..., m },  with  x^(i) = (x1^(i), x2^(i), ..., x_{ni}^(i))

Here x^(i) is the i-th sample, which contains ni words, and each xj^(i) is the dictionary index of its j-th word.
Following the naive Bayes method, the maximum likelihood objective is:

L(φ_y, φ_{k|y=0}, φ_{k|y=1}) = ∏_{i=1}^m p(x^(i), y^(i)) = ∏_{i=1}^m ( ∏_{j=1}^{ni} p(xj^(i) | y^(i)) ) p(y^(i))
Solving gives:

φ_{k|y=1} = ( Σ_{i=1}^m Σ_{j=1}^{ni} 1{xj^(i) = k ∧ y^(i) = 1} ) / ( Σ_{i=1}^m 1{y^(i) = 1} ni )
φ_{k|y=0} = ( Σ_{i=1}^m Σ_{j=1}^{ni} 1{xj^(i) = k ∧ y^(i) = 0} ) / ( Σ_{i=1}^m 1{y^(i) = 0} ni )
φ_y = Σ_{i=1}^m 1{y^(i) = 1} / m

Compared with the earlier formulas, the denominator gains a factor ni (the number of words per message), and the numerator counts occurrences of the value k instead of a 0/1 indicator.
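A sketch of these estimates in code: each e-mail is a variable-length list of dictionary indices, and counts are pooled over all word positions within each class. The toy data mirror the table in the example below, so the output can be checked against the hand computation.

```python
# Sketch: multinomial event model estimates, pooling counts over all word positions per class.
import numpy as np

V = 3                                                   # dictionary size |V| (toy value)
emails = [[1, 2], [2, 1], [1, 3, 2], [3, 3, 3]]         # each e-mail as word indices 1..V
labels = [1, 0, 0, 1]

def fit_multinomial_nb(emails, labels, V):
    counts = {0: np.zeros(V), 1: np.zeros(V)}
    totals = {0: 0, 1: 0}
    for words, y in zip(emails, labels):
        for k in words:
            counts[y][k - 1] += 1                       # numerator: occurrences of word k in class y
        totals[y] += len(words)                         # denominator: total word count n_i over class y
    phi_k_y0 = counts[0] / totals[0]
    phi_k_y1 = counts[1] / totals[1]
    phi_y = sum(labels) / len(labels)
    return phi_y, phi_k_y0, phi_k_y1

print(fit_multinomial_nb(emails, labels, V))
```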
As an example:
x1 | x2 | x3 | y
1  | 2  | -  | 1
2  | 1  | -  | 0
1  | 3  | 2  | 0
3  | 3  | 3  | 1
Suppose e-mails contain only the three words a, b, c, whose positions in the dictionary are 1, 2, and 3 respectively; the first two messages contain only 2 words each, and the last two contain 3 words each.
y = 1 means the message is spam. Then:

φ_{1|y=1} = 1/5,  φ_{2|y=1} = 1/5,  φ_{3|y=1} = 3/5
φ_{1|y=0} = 2/5,  φ_{2|y=0} = 2/5,  φ_{3|y=0} = 1/5
φ_y = 1/2
If a new e-mail contains the words b, c, its feature representation is {2, 3}.
Then

p(y = 1 | x) = φ_{2|y=1} φ_{3|y=1} φ_y / ( φ_{2|y=1} φ_{3|y=1} φ_y + φ_{2|y=0} φ_{3|y=0} (1 - φ_y) )
             = (1/5 · 3/5 · 1/2) / (1/5 · 3/5 · 1/2 + 2/5 · 1/5 · 1/2) = 0.6

So the probability that this e-mail is spam is 0.6.
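A quick numerical check of the 0.6 figure, plugging in the hand-computed estimates:

```python
# Sketch: verify the spam probability for the new e-mail {2, 3} using the estimates above.
phi_y = 0.5
phi_k_y1 = {1: 1/5, 2: 1/5, 3: 3/5}
phi_k_y0 = {1: 2/5, 2: 2/5, 3: 1/5}

new_email = [2, 3]
lik1 = phi_y          # accumulate p(x|y=1) p(y=1)
lik0 = 1 - phi_y      # accumulate p(x|y=0) p(y=0)
for k in new_email:
    lik1 *= phi_k_y1[k]
    lik0 *= phi_k_y0[k]
print(lik1 / (lik1 + lik0))    # 0.6
```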
Note how this formula differs from the earlier naive Bayes estimates: here the counts run over every word position of every sample, whereas the earlier model counted each feature dimension separately, and here the number of feature dimensions (words) varies from sample to sample.
Applying Laplace smoothing here gives:

φ_{k|y=1} = ( Σ_{i=1}^m Σ_{j=1}^{ni} 1{xj^(i) = k ∧ y^(i) = 1} + 1 ) / ( Σ_{i=1}^m 1{y^(i) = 1} ni + |V| )
φ_{k|y=0} = ( Σ_{i=1}^m Σ_{j=1}^{ni} 1{xj^(i) = k ∧ y^(i) = 0} + 1 ) / ( Σ_{i=1}^m 1{y^(i) = 0} ni + |V| )

This amounts to assuming that each of the |V| dictionary values has occurred at least once.
Finally, naive Bayes is sometimes not the best classification method, but it is simple, effective, and fast.
"Reprint" discriminant model, generative model and naive Bayesian method