Contents of this lecture
1. Generative learning algorithms
2. GDA (Gaussian discriminant analysis)
3. Naive Bayes
4. Laplace smoothing
1. Generative learning algorithms and discriminative learning algorithms
Discriminative learning algorithms: learn $p(y|x)$ directly, or learn a hypothesis $h_\theta(x)$ that directly outputs 0 or 1. Logistic regression is an example of a discriminative learning algorithm.
Generative learning algorithms: model $p(x|y)$, the distribution of the features given the class, and also model the class prior $p(y)$.
The posterior probability is then obtained from Bayes' rule:
$p(y|x) = \frac{p(x|y)\,p(y)}{p(x)}$
where $p(x) = p(x|y=1)\,p(y=1) + p(x|y=0)\,p(y=0)$.
2. Gaussian discriminant analysis
Gaussian discriminant analysis is a generative learning algorithm.
First, the two assumptions of Gaussian discriminant analysis:
(1) The features $x \in \mathbb{R}^n$ are continuous values.
(2) $p(x|y)$ follows a (multivariate) Gaussian distribution.
Multivariate Gaussian distribution
When a random variable $z$ follows a multivariate Gaussian distribution, we write $z \sim \mathcal{N}(\mu, \Sigma)$.
The probability density function of $z$ is
$p(z) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(z-\mu)^T \Sigma^{-1} (z-\mu)\right)$
The vector $\mu \in \mathbb{R}^n$ is the mean of the Gaussian, and the matrix $\Sigma \in \mathbb{R}^{n \times n}$ is the covariance matrix.
The diagonal entries of the covariance matrix control how spread out the density is along each axis, while the off-diagonal entries control the direction in which the density is stretched (the correlation between the coordinates).
The mean controls the location of the center of the density.
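As a quick sanity check of the density formula, here is a minimal NumPy sketch (the function name and example values are illustrative, not from the lecture):

```python
import numpy as np

def gaussian_pdf(z, mu, Sigma):
    """Density of a multivariate Gaussian N(mu, Sigma) at the point z."""
    n = mu.shape[0]
    diff = z - mu
    norm_const = 1.0 / ((2 * np.pi) ** (n / 2) * np.linalg.det(Sigma) ** 0.5)
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

# Standard 2-D Gaussian evaluated at its center: 1 / (2*pi) ~ 0.1592
print(gaussian_pdf(np.zeros(2), np.zeros(2), np.eye(2)))
```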
Gaussian discriminant analysis model
Suppose $y$ follows a Bernoulli distribution, $y \sim \mathrm{Bernoulli}(\phi)$,
and model $p(x|y)$ with a pair of Gaussians:
$x \mid y=0 \sim \mathcal{N}(\mu_0, \Sigma)$
$x \mid y=1 \sim \mathcal{N}(\mu_1, \Sigma)$
The parameters of this model are $\phi, \mu_0, \mu_1, \Sigma$ (the two Gaussians share a single covariance matrix).
We estimate these parameters by maximum likelihood:
$L(\phi, \mu_0, \mu_1, \Sigma) = \prod_{i=1}^m p(x^{(i)}, y^{(i)}) = \prod_{i=1}^m p(x^{(i)}|y^{(i)})\,p(y^{(i)})$
This formula is called the joint likelihood.
For discriminative learning algorithms, the likelihood of the parameters is instead defined as
$L(\theta) = \prod_{i=1}^m p(y^{(i)}|x^{(i)}; \theta)$
This formula is called the conditional likelihood.
Therefore, for a generative learning algorithm we perform maximum likelihood estimation on the joint likelihood of the parameters, while for a discriminative learning algorithm we perform maximum likelihood estimation on the conditional likelihood of the parameters.
Maximizing the joint likelihood gives the parameter estimates
$\phi = \frac{1}{m}\sum_{i=1}^m 1\{y^{(i)}=1\}$
$\mu_0 = \frac{\sum_{i=1}^m 1\{y^{(i)}=0\}\,x^{(i)}}{\sum_{i=1}^m 1\{y^{(i)}=0\}}$
$\mu_1 = \frac{\sum_{i=1}^m 1\{y^{(i)}=1\}\,x^{(i)}}{\sum_{i=1}^m 1\{y^{(i)}=1\}}$
$\Sigma = \frac{1}{m}\sum_{i=1}^m (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T$
Once the parameter values are obtained, the prediction for a new sample $x$ is
$\arg\max_y p(y|x) = \arg\max_y \frac{p(x|y)\,p(y)}{p(x)} = \arg\max_y p(x|y)\,p(y)$
(the denominator $p(x)$ does not depend on $y$, so it can be dropped).
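These closed-form estimates translate directly into code. A minimal NumPy sketch (function and variable names are illustrative):

```python
import numpy as np

def fit_gda(X, y):
    """Closed-form MLE for the GDA parameters.

    X is an m x n matrix of continuous features; y is a 0/1 label vector.
    """
    m = X.shape[0]
    phi = np.mean(y == 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    centered = X - np.where(y[:, None] == 1, mu1, mu0)  # x^(i) - mu_{y^(i)}
    Sigma = centered.T @ centered / m
    return phi, mu0, mu1, Sigma

def predict_gda(x, phi, mu0, mu1, Sigma):
    """Return argmax_y p(x|y)p(y); the shared normalizing constant of the
    two Gaussians cancels, so only the quadratic term and the prior matter."""
    def log_joint(mu, prior):
        diff = x - mu
        return -0.5 * diff @ np.linalg.inv(Sigma) @ diff + np.log(prior)
    return int(log_joint(mu1, phi) > log_joint(mu0, 1 - phi))
```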
The relationship between Gaussian discriminant analysis and logistic regression
If $p(x|y)$ is Gaussian, then the posterior $p(y=1|x)$ must take the form of a logistic function, but the converse is not true.
This shows that the assumption that $x|y$ is Gaussian is stronger than the assumption that the posterior is logistic.
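Concretely, under the GDA assumptions one can show (a standard derivation, sketched here) that

$$p(y=1 \mid x) = \frac{p(x \mid y=1)\,\phi}{p(x \mid y=1)\,\phi + p(x \mid y=0)\,(1-\phi)} = \frac{1}{1 + \exp\!\left(-(\theta^T x + \theta_0)\right)}$$

where $\theta = \Sigma^{-1}(\mu_1 - \mu_0)$ and $\theta_0$ is an appropriate function of $\phi, \mu_0, \mu_1, \Sigma$.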
So how should we choose between the Gaussian discriminant analysis model and the logistic regression model?
GDA makes the stronger assumption that $p(x|y)$ is Gaussian; when this assumption is correct or approximately correct, GDA will perform better than logistic regression,
because the algorithm exploits more information about the data: it knows that the data follow a Gaussian distribution.
Conversely, if the distribution is uncertain, logistic regression is the better choice: its weaker assumption gives it better robustness, and it can still obtain a good result when the data distribution is unknown.
More generally, if $p(x|y)$ belongs to the exponential family, then the posterior $p(y=1|x)$ must again take the logistic form, but the converse is still not true.
It turns out that an advantage of generative learning algorithms is that they need less data than discriminative learning algorithms;
even with a particularly small amount of data, a generative learning algorithm can still fit a reasonably good model.
3. Naive Bayes
Naive Bayes is a generative learning algorithm.
Consider a spam classification problem, which is a binary classification problem.
How do we build a feature vector $x$ for an email?
The common practice is to build a dictionary. Suppose the dictionary contains 50,000 words; then for every word that appears in the email we set the corresponding position of the vector to 1, and for every word that does not appear in the email we set the corresponding position to 0.
So for each email we obtain a feature vector $x \in \{0,1\}^{50000}$ of length 50,000.
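A minimal sketch of this feature construction (the dictionary and names are illustrative, using a toy vocabulary instead of 50,000 words):

```python
def email_to_features(email_text, dictionary):
    """Binary bag-of-words vector: x[j] = 1 iff dictionary word j appears."""
    x = [0] * len(dictionary)
    for word in email_text.lower().split():
        if word in dictionary:
            x[dictionary[word]] = 1
    return x

# Toy 5-word dictionary mapping each word to its position
vocab = {"buy": 0, "cheap": 1, "now": 2, "meeting": 3, "report": 4}
print(email_to_features("Buy cheap meds now", vocab))  # [1, 1, 1, 0, 0]
```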
Having settled the representation of the feature vector, how do we model $p(x|y)$?
If we modeled $x$ directly with a multinomial distribution, it would range over $2^{50000}$ possible values, requiring $2^{50000}-1$ parameters (one per outcome, minus one because the probabilities sum to 1). The number of parameters is far too large.
So Naive Bayes makes a very strong assumption: given $y$, the $x_j$ are conditionally independent of each other ($x_j$ indicates whether word $j$ of the dictionary appears in the email).
$p(x_1, \ldots, x_{50000} \mid y) = p(x_1|y)\,p(x_2|y, x_1)\cdots p(x_{50000}|y, x_1, \ldots, x_{49999})$ (chain rule)
$= \prod_{j=1}^{50000} p(x_j|y)$ (the strong assumption of Naive Bayes)
It is obvious that the Naive Bayes assumption is almost never exactly true, but in practice the Naive Bayes algorithm turns out to be very effective.
The parameters of the model are
$\phi_{j|y=1} = p(x_j=1 \mid y=1)$, $\phi_{j|y=0} = p(x_j=1 \mid y=0)$, $\phi_y = p(y=1)$
The joint likelihood function is
$L(\phi_y, \phi_{j|y=0}, \phi_{j|y=1}) = \prod_{i=1}^m p(x^{(i)}, y^{(i)})$
Maximizing it (taking partial derivatives and setting them to zero) gives the parameter estimates
$\phi_{j|y=1} = \frac{\sum_{i=1}^m 1\{x_j^{(i)}=1 \wedge y^{(i)}=1\}}{\sum_{i=1}^m 1\{y^{(i)}=1\}}$, $\phi_{j|y=0} = \frac{\sum_{i=1}^m 1\{x_j^{(i)}=1 \wedge y^{(i)}=0\}}{\sum_{i=1}^m 1\{y^{(i)}=0\}}$, $\phi_y = \frac{1}{m}\sum_{i=1}^m 1\{y^{(i)}=1\}$
So for a new email $x$, the predicted probability of spam is
$p(y=1|x) = \frac{\left(\prod_j p(x_j|y=1)\right) p(y=1)}{\left(\prod_j p(x_j|y=1)\right) p(y=1) + \left(\prod_j p(x_j|y=0)\right) p(y=0)}$
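These estimates are simple counts, so the whole model fits in a few lines. A minimal NumPy sketch (illustrative names; X is an m x n 0/1 matrix, y and x are 0/1 NumPy vectors):

```python
import numpy as np

def fit_naive_bayes(X, y):
    """MLE for the Bernoulli Naive Bayes parameters above."""
    phi_y = np.mean(y == 1)
    phi_j_y1 = X[y == 1].mean(axis=0)  # p(x_j = 1 | y = 1), per word j
    phi_j_y0 = X[y == 0].mean(axis=0)  # p(x_j = 1 | y = 0)
    return phi_y, phi_j_y0, phi_j_y1

def predict_naive_bayes(x, phi_y, phi_j_y0, phi_j_y1):
    """p(y=1|x) via Bayes' rule, computed in log space to avoid underflow.
    Note: zero-valued parameters give log(0) = -inf here, which is exactly
    the failure mode that Laplace smoothing (next section) fixes."""
    log_p1 = np.sum(np.log(np.where(x == 1, phi_j_y1, 1 - phi_j_y1))) + np.log(phi_y)
    log_p0 = np.sum(np.log(np.where(x == 1, phi_j_y0, 1 - phi_j_y0))) + np.log(1 - phi_y)
    return 1 / (1 + np.exp(log_p0 - log_p1))
```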
What is the difference between the Gaussian discriminant analysis model and the Naive Bayes model?
When the features $x$ are continuous values, the GDA model can be used;
when the features $x$ are discrete values, the Naive Bayes model can be used.
In the email example above, whether an email is spam is a binary classification problem, so $y$ is modeled as a Bernoulli distribution.
Given $y$, Naive Bayes assumes that the word occurrences are independent of each other, and whether each word appears is itself binary, so each $p(x_j|y)$ is also modeled as a Bernoulli distribution.
In the GDA model we are likewise dealing with a binary classification problem, and $y$ is likewise modeled as a Bernoulli distribution;
but given $y$, the value of $x$ is continuous, so $p(x|y)$ is modeled as a Gaussian distribution.
The email example above has a problem: if a new email contains a word that never appeared in any training email, then when we predict whether that email is spam,
that word $j$ gives $\phi_{j|y=1} = 0$ and $\phi_{j|y=0} = 0$,
so $p(y=1|x) = \frac{0}{0+0}$ and the model breaks down.
In other words, just because a feature never appeared in the finite training set, it is not reasonable to conclude that the probability of that feature appearing later is 0.
The fix is to use Laplace smoothing.
4. Laplace smoothing
Suppose $y$ can take $k$ different values; then we estimate
$\phi_j = p(y=j) = \frac{\sum_{i=1}^m 1\{y^{(i)}=j\} + 1}{m + k}$
That is, add 1 to the numerator and $k$ to the denominator, which avoids the numerator and denominator being 0 when predicting on values that were never seen in training.
In the email problem above, the smoothed estimates become
$\phi_{j|y=1} = \frac{\sum_{i=1}^m 1\{x_j^{(i)}=1 \wedge y^{(i)}=1\} + 1}{\sum_{i=1}^m 1\{y^{(i)}=1\} + 2}$
and similarly
$\phi_{j|y=0} = \frac{\sum_{i=1}^m 1\{x_j^{(i)}=1 \wedge y^{(i)}=0\} + 1}{\sum_{i=1}^m 1\{y^{(i)}=0\} + 2}$
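In code, the smoothing only changes the counting step of the earlier Naive Bayes sketch (each $x_j$ takes 2 values, hence the +2 in the denominator):

```python
import numpy as np

def fit_naive_bayes_smoothed(X, y):
    """Laplace-smoothed estimates; same interface as fit_naive_bayes above."""
    m1, m0 = np.sum(y == 1), np.sum(y == 0)
    phi_y = m1 / len(y)
    # add 1 to each count and 2 to each total, since x_j is binary
    phi_j_y1 = (X[y == 1].sum(axis=0) + 1) / (m1 + 2)
    phi_j_y0 = (X[y == 0].sum(axis=0) + 1) / (m0 + 2)
    return phi_y, phi_j_y0, phi_j_y1
```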
This concludes the lecture.
(Notes on Stanford Machine Learning: Generative Learning Algorithms)