Generative Learning Algorithms
This lecture's outline:
1. Generative Learning Algorithms
2. Gaussian Discriminant Analysis (GDA)
- Gaussian distribution (brief)
- Contrast between generative and discriminative learning algorithms (brief)
3. Naive Bayes
4. Laplace Smoothing
Review:
Classification: given a training set, the logistic regression algorithm works by looking at the data and trying to find a line that separates the different classes, as shown in the figure below.
This lecture introduces a different family of algorithms: generative learning algorithms, in contrast to discriminative learning algorithms such as logistic regression.
1. Generative Learning Algorithms
Example: classifying malignant vs. benign tumors
Instead of looking for a line that separates the two classes of data, you can also proceed as follows:
1) Traverse the training set, find all malignant tumor samples, and build a model of the features of malignant tumors directly; similarly, build a model of benign tumors.
2) To classify a new sample, i.e. when a new patient arrives and we must determine whether the tumor is malignant or benign, match the sample against the malignant-tumor model and the benign-tumor model and see which one fits better; the prediction is whichever class matches better.
This approach is what generative learning algorithms do.
Definitions of the two kinds of learning algorithms:
1) Discriminative learning algorithms:
- Learn P(y|x) directly, i.e. given the input features, output the class the sample belongs to;
- or learn a hypothesis hθ(x) that directly outputs 0 or 1.
2) Generative learning algorithms:
- Model P(x|y), the probability of observing the features given the class the sample belongs to. For technical reasons, P(y) is modeled as well.
- In P(x|y), x stands for the sample features that the generative model describes with a probability distribution, and y is the class being conditioned on.
Example: in the tumor example above, if the condition y is whether a tumor is malignant or benign, the resulting model describes the probability distribution of the tumor features x under each of those conditions.
After modeling P(x|y) and P(y), Bayes' rule P(y|x) = P(x, y)/P(x) = P(x|y)P(y)/P(x) lets us compute, for example, P(y=1|x) = P(x|y=1)P(y=1)/P(x), where P(x) = P(x|y=0)P(y=0) + P(x|y=1)P(y=1).
2. Gaussian Discriminant Analysis (GDA)
GDA is a generative learning algorithm.
GDA assumptions:
1) Assume the input features x ∈ R^n are continuous-valued.
2) Assume P(x|y) follows a (multivariate) Gaussian distribution.
* Basic knowledge of Gaussian distribution:
A random variable z follows the multivariate Gaussian distribution, z ~ N(μ, Σ), with mean vector μ and covariance matrix Σ.
The probability density function is:
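In the standard form, for z ∈ R^n:

```latex
p(z;\mu,\Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}
\exp\!\Big(-\tfrac{1}{2}(z-\mu)^{T}\Sigma^{-1}(z-\mu)\Big)
```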
The multivariate Gaussian distribution is the generalization of the univariate Gaussian; its density is also bell-shaped, and z is a high-dimensional vector.
A multivariate Gaussian distribution has two parameters:
- Mean vector μ
- Covariance matrix Σ = E[(z − E[z])(z − E[z])^T] = E[(z − μ)(z − μ)^T]
Plots of the multivariate Gaussian distribution:
Left: μ = 0, Σ = I (the identity matrix)
Middle: μ = 0, Σ = 0.6I; the surface becomes taller and narrower
Right: μ = 0, Σ = 2I; the surface becomes flatter and more spread out
The next three figures again have μ = 0; their Σ matrices (shown in the figure) differ in the size of the off-diagonal entries.
As the off-diagonal elements of Σ increase, the correlation between the variables increases, and the Gaussian surface compresses toward the z1 = z2 line (the diagonal between the two horizontal axes). The projection onto the horizontal plane is shown below:
As the off-diagonal elements grow, the contours become ellipses stretched along the 45° direction.
If the off-diagonal elements of Σ are negative, the contours stretch along the opposite (−45°) direction instead, as in the following figure.
(The corresponding Σ matrices are shown in the figure.)
The figures below show the density for different values of μ, with Σ fixed (the μ vectors are shown in the figure).
μ determines the position of the center of the distribution curve.
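To make the effect of Σ concrete, here is a small NumPy/SciPy sketch (assuming scipy is available; names are illustrative) that evaluates the 2-D density for covariance matrices like the ones above; plotting `density` with matplotlib's contour or surface functions reproduces figures of this kind.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Evaluate the 2-D Gaussian density on a grid for several covariance matrices:
# identity, scaled identity, and a Sigma with nonzero off-diagonal entries.
z1, z2 = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
grid = np.dstack([z1, z2])
mu = np.zeros(2)
for Sigma in (np.eye(2), 0.6 * np.eye(2), 2 * np.eye(2),
              np.array([[1.0, 0.8], [0.8, 1.0]])):
    density = multivariate_normal(mean=mu, cov=Sigma).pdf(grid)
    # Smaller variances give a taller, narrower peak; larger variances a flatter one.
    print(np.round(Sigma.ravel(), 2), "peak density:", float(density.max()))
```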
GDA fitting:
Given the training samples shown in the figure below:
- Look at the positive samples (the crosses in the figure) and fit a Gaussian to them, the lower-left set of contours in the figure; this represents P(x|y=1).
- Look at the negative samples (the circles in the figure) and fit a Gaussian to them, the upper-right set of contours; this represents P(x|y=0).
- The density functions of the two Gaussians define a separator between the two classes, the line in the figure.
- This separating boundary is generally different from, and can be more complex than, the straight line that logistic regression would fit.
GDA Model:
Write out its probability distribution:
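In the standard GDA formulation, with a single covariance matrix Σ shared by both classes:

```latex
y \sim \mathrm{Bernoulli}(\phi), \qquad
x \mid y=0 \sim \mathcal{N}(\mu_0,\Sigma), \qquad
x \mid y=1 \sim \mathcal{N}(\mu_1,\Sigma)
```

with densities

```latex
p(y) = \phi^{\,y}(1-\phi)^{1-y}, \qquad
p(x \mid y) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}
\exp\!\Big(-\tfrac{1}{2}(x-\mu_y)^{T}\Sigma^{-1}(x-\mu_y)\Big)
```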
The parameters are φ, μ0, μ1, and Σ. The log-likelihood is:
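Written out in standard notation over the m training examples:

```latex
\ell(\phi,\mu_0,\mu_1,\Sigma)
= \log \prod_{i=1}^{m} p\big(x^{(i)}, y^{(i)};\phi,\mu_0,\mu_1,\Sigma\big)
= \log \prod_{i=1}^{m} p\big(x^{(i)} \mid y^{(i)}\big)\, p\big(y^{(i)}\big)
```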
Because this expression is the joint probability of x and y, it is called the joint likelihood.
* Compare with the log-likelihood used in logistic regression:
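That objective, in the usual notation, is:

```latex
\ell(\theta) = \log \prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)};\theta\big)
```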
It is called the conditional likelihood, because it models the probability of y conditioned on x.
Maximizing the log-likelihood gives the following parameter estimates:
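These are the standard maximum-likelihood results, reconstructed from the verbal descriptions below (the estimate for Σ is included for completeness):

```latex
\phi = \frac{1}{m}\sum_{i=1}^{m} 1\{y^{(i)}=1\}, \qquad
\mu_0 = \frac{\sum_{i=1}^{m} 1\{y^{(i)}=0\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)}=0\}}, \qquad
\mu_1 = \frac{\sum_{i=1}^{m} 1\{y^{(i)}=1\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)}=1\}}
```

```latex
\Sigma = \frac{1}{m}\sum_{i=1}^{m}
\big(x^{(i)}-\mu_{y^{(i)}}\big)\big(x^{(i)}-\mu_{y^{(i)}}\big)^{T}
```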
φ: the fraction of training samples labeled 1.
μ0: the denominator is the number of samples labeled 0 and the numerator is the sum of the x^(i) whose label is 0, so μ0 is the mean of the x^(i) labeled 0, matching the meaning of the Gaussian mean parameter μ.
μ1: same as μ0, with the label changed to 1.
GDA prediction:
The prediction should be the most probable y given x; the argmax operator on the left of the equation denotes the value of y at which P(y|x) is maximized. The prediction formula is as follows:
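In symbols:

```latex
\hat{y} = \arg\max_{y} p(y \mid x) = \arg\max_{y} \frac{p(x \mid y)\,p(y)}{p(x)}
```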
P(x) can be omitted because it does not depend on y.
* If P(y) is uniform, i.e. every class is equally likely, then P(y) can also be omitted, and we simply look for the y that makes P(x|y) largest. However, this is not the common case.
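Putting the parameter estimates and the prediction rule together, here is a minimal NumPy sketch of GDA (illustrative function names, shared covariance as assumed above; a sketch rather than a reference implementation):

```python
import numpy as np

def fit_gda(X, y):
    """Fit GDA parameters (phi, mu0, mu1, shared Sigma) by maximum likelihood.
    X: (m, n) array of continuous features; y: (m,) array of 0/1 labels."""
    m, _ = X.shape
    phi = np.mean(y == 1)                     # fraction of positive samples
    mu0 = X[y == 0].mean(axis=0)              # mean of class-0 samples
    mu1 = X[y == 1].mean(axis=0)              # mean of class-1 samples
    centered = X - np.where((y == 1)[:, None], mu1, mu0)
    Sigma = centered.T @ centered / m         # shared covariance matrix
    return phi, mu0, mu1, Sigma

def predict_gda(X, phi, mu0, mu1, Sigma):
    """Predict argmax_y p(y|x) by comparing log p(x|y) + log p(y) for y = 0, 1."""
    inv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)

    def log_gaussian(mu):
        d = X - mu
        # log N(x; mu, Sigma), up to a constant that is the same for both classes
        return -0.5 * (np.einsum('ij,jk,ik->i', d, inv, d) + logdet)

    score0 = log_gaussian(mu0) + np.log(1 - phi)
    score1 = log_gaussian(mu1) + np.log(phi)
    return (score1 > score0).astype(int)
```

Calling `fit_gda(X_train, y_train)` and then `predict_gda(X_test, ...)` applies the argmax rule above, with constants that do not depend on y dropped.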
The link between GDA and logistic regression:
Example: suppose we have a one-dimensional training set containing some positive and negative samples, shown as crosses and circles on the x-axis (crosses are 0, circles are 1). Using GDA we fit a Gaussian density to each class, p(x|y=0) and p(x|y=1), shown as the two bell-shaped curves in the figure. Now traverse the x-axis and, for each point, plot the corresponding p(y=1|x) above the axis.
For a point far to the left on the x-axis, the probability of belonging to class 1 is almost 0, so p(y=1|x) ≈ 0; at the intersection of the two bell curves, 0 and 1 are equally likely, so p(y=1|x) = 0.5; for a point far to the right, the probability of class 1 is almost 1, so p(y=1|x) ≈ 1. The resulting curve turns out to be very similar to the sigmoid function.
In short, if p(x|y) is Gaussian as in the GDA model, then the resulting p(y|x) has almost exactly the same functional form as the sigmoid used in logistic regression. In practice, however, there is an essential difference between the two.
Advantages and disadvantages of generative learning algorithms:
First, consider the following corollaries:
Corollary 1:
If x|y follows a Gaussian distribution, then P(y=1|x) is a logistic function.
The implication does not hold in the opposite direction.
Corollary 2:
If x|y=1 ~ Poisson(λ1) and x|y=0 ~ Poisson(λ0), then P(y=1|x) is a logistic function.
Here x|y=1 ~ Poisson(λ1) means that x given y=1 follows a Poisson distribution with parameter λ1.
Corollary 3:
If x|y=1 ~ ExpFamily(η1) and x|y=0 ~ ExpFamily(η0), then P(y=1|x) is a logistic function.
This is the generalization of Corollary 2: whenever the distribution of x|y belongs to the exponential family, the conclusion follows. It shows how robust logistic regression is to the choice of modeling assumptions.
Advantages:
The opposite direction of Corollary 1 does not hold because the assumption that x|y is Gaussian is the stronger one; the GDA model makes a stronger assumption. Therefore, if x|y is Gaussian or approximately Gaussian, GDA will do better than logistic regression, because it uses more information about the data: the algorithm knows that the data follow a Gaussian distribution.
Disadvantages:
If the distribution of x|y is uncertain, the discriminative algorithm, logistic regression, performs more robustly. For example, suppose we assume in advance that the data follow a Gaussian distribution but they actually follow a Poisson distribution; by Corollary 2, logistic regression can still achieve good results.
Generative learning algorithms typically need less data than discriminative learning algorithms. For example, GDA makes strong assumptions, so it can fit a good model with relatively little data. Logistic regression makes weaker assumptions and is more robust to modeling assumptions, but it needs more samples to fit the data well.
3. Naive Bayes
Naive Bayes is another generative learning algorithm.
Example: Junk e-mail classification
Implement a spam classifier that takes an email as input and determines whether it is spam. The output y is in {0, 1}: 1 means spam, 0 means not spam.
First, represent the email text as an input vector x. Given a dictionary with n words, each element of x is in {0, 1} and indicates whether the corresponding dictionary word appears in the email (for example, the element for the word "I" is 1 if "I" appears in the message). An example of x:
To model P(x|y): x is an n-dimensional {0, 1} vector. Suppose n = 50000; then x has 2^50000 possible values. One approach would be to model x directly with a multinomial distribution (the Bernoulli distribution models a 0/1 outcome, the multinomial distribution models k outcomes), but that requires 2^50000 − 1 parameters, clearly far too many. This motivates the Naive Bayes approach below.
Assume that the features x_j are conditionally independent given y. The probability of x given y can then be simplified to:
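Written out, the chain rule of probability plus the conditional-independence assumption give:

```latex
p(x_1,\dots,x_n \mid y)
= p(x_1 \mid y)\,p(x_2 \mid y, x_1)\cdots p(x_n \mid y, x_1,\dots,x_{n-1})
= \prod_{j=1}^{n} p(x_j \mid y)
```

The first equality always holds; the second uses the Naive Bayes assumption.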
Intuitively, the assumption says: if we already know whether an email is spam (y), then knowing whether some words appear in it does not help predict whether other words appear. Although the assumption is not exactly true, Naive Bayes is still widely used to classify emails and other text.
* My understanding of Naive Bayes: it computes the probability that an email is spam from keyword probabilities. Specifically, given a dictionary, estimate P(x_j|y=1) for each word, i.e. the probability that word j appears in a spam email; then, for a given email, multiply the P(x_j|y) over all words to get the probability that the email is spam. To simplify even further, one could set P(x_j=1|y=1) to {0, 1}, i.e. pick out a few keywords and let the presence of those keywords determine the probability that the email is spam.
The model parameters are:
φ_{j|y=1} = P(x_j = 1 | y = 1)
φ_{j|y=0} = P(x_j = 1 | y = 0)
φ_y = P(y = 1)
The joint likelihood is:
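In standard notation over the m training emails:

```latex
\mathcal{L}\big(\phi_y,\phi_{j|y=0},\phi_{j|y=1}\big)
= \prod_{i=1}^{m} p\big(x^{(i)}, y^{(i)}\big)
```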
Maximizing it gives the following parameter estimates:
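These are the standard maximum-likelihood results, matching the verbal descriptions below:

```latex
\phi_{j|y=1} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)}=1 \wedge y^{(i)}=1\}}{\sum_{i=1}^{m} 1\{y^{(i)}=1\}}, \qquad
\phi_{j|y=0} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)}=1 \wedge y^{(i)}=0\}}{\sum_{i=1}^{m} 1\{y^{(i)}=0\}}, \qquad
\phi_{y} = \frac{\sum_{i=1}^{m} 1\{y^{(i)}=1\}}{m}
```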
For φ_{j|y=1}, the numerator is the number of emails labeled 1 that contain word j and the denominator is the number of spam emails; overall, it is the fraction of the spam emails in the training set in which word j appears.
φ_{j|y=0} is the fraction of non-spam emails in which word j appears.
φ_y is the fraction of spam among all emails.
Having estimated these parameters, we know P(x|y) and P(y): P(y) is modeled with a Bernoulli distribution, P(x|y) with the product of the P(x_j|y), and P(y|x) is then obtained from Bayes' rule.
* In practice: for example, take the last two months of email, each labeled "spam" or "non-spam", to get (x^(1), y^(1)), ..., (x^(m), y^(m)), where x^(i) is the word vector marking which dictionary words appear in the i-th email and y^(i) indicates whether that email is spam. The dictionary can be built from all words that appear in the emails, or only from words that appear more than k times.
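As a concrete illustration, here is a minimal NumPy sketch of fitting these parameters from a binary word-occurrence matrix and computing P(y=1|x) by Bayes' rule (illustrative names; no Laplace smoothing yet, which is the subject of the next section):

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Maximum-likelihood Naive Bayes parameters (no smoothing; see section 4).
    X: (m, n) binary matrix, X[i, j] = 1 if dictionary word j appears in email i.
    y: (m,) array of 0/1 labels, 1 = spam."""
    phi_y = np.mean(y == 1)
    phi_j_y1 = X[y == 1].mean(axis=0)   # estimate of p(x_j = 1 | y = 1)
    phi_j_y0 = X[y == 0].mean(axis=0)   # estimate of p(x_j = 1 | y = 0)
    return phi_y, phi_j_y1, phi_j_y0

def predict_spam_prob(X, phi_y, phi_j_y1, phi_j_y0, eps=1e-12):
    """Posterior p(y=1|x) via Bayes' rule; eps only keeps the logs finite here,
    Laplace smoothing (section 4) is the principled fix for zero counts."""
    def log_p_x_given_y(phi_j):
        p = np.clip(phi_j, eps, 1 - eps)
        return X @ np.log(p) + (1 - X) @ np.log(1 - p)
    log_p1 = log_p_x_given_y(phi_j_y1) + np.log(phi_y)
    log_p0 = log_p_x_given_y(phi_j_y0) + np.log(1 - phi_y)
    return 1.0 / (1.0 + np.exp(log_p0 - log_p1))
```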
A problem with Naive Bayes:
Suppose a new email contains a word, say the 30000th word of the dictionary, that appeared in neither the spam nor the non-spam training emails. Then p(x_30000|y=1) = 0 and p(x_30000|y=0) = 0, and the computation of p(y=1|x) goes as follows:
P(y=1|x) = P(x|y=1)P(y=1) / (P(x|y=1)P(y=1) + P(x|y=0)P(y=0))
Because P(x|y=1) = P(x|y=0) = 0 (since p(x_30000|y=1) = p(x_30000|y=0) = 0, the products are 0), p(y=1|x) = 0/0 and the result is undefined.
The underlying problem is that, statistically, p(x_30000|y) = 0 is unreasonable: just because the word did not appear in the last two months of email, we should not conclude that its probability is 0.
More generally, it is unreasonable to assume an event has probability 0 simply because it has never been observed. Laplace smoothing solves this problem.
4. Laplace Smoothing
By maximum likelihood, p(y=1) = #"1"s / (#"0"s + #"1"s), i.e. the probability that y is 1 is the fraction of 1s among all samples. Laplace smoothing adds 1 to each count, giving:
p(y=1) = (#"1"s + 1) / ((#"0"s + 1) + (#"1"s + 1))
Example: take a team's 5 game results as the sample, with all 5 games lost (recorded as 0), and predict the probability of winning the sixth game. By maximum likelihood, p(y=1) = 0/(5+0) = 0: there are no wins in the sample, so the estimated win probability is 0, which is clearly unreasonable. With Laplace smoothing, p(y=1) = (0+1)/(5+1+0+1) = 1/7, which is not 0; as the number of losses grows, p(y=1) shrinks but never reaches 0.
More generally, if y takes one of k possible values, for example when estimating the parameters of a multinomial distribution, the maximum-likelihood estimate of each parameter is the fraction of samples taking value j; Laplace smoothing adds 1 to every count, as shown below:
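In symbols (writing z^(i) for the value of the i-th sample, a notation not used elsewhere in these notes):

```latex
\phi_j = \frac{\sum_{i=1}^{m} 1\{z^{(i)}=j\}}{m}
\quad\longrightarrow\quad
\phi_j = \frac{\sum_{i=1}^{m} 1\{z^{(i)}=j\} + 1}{m + k}
```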
For Naive Bayes, the smoothed parameter estimates become:
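Applying the same idea to the estimates from section 3:

```latex
\phi_{j|y=1} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)}=1 \wedge y^{(i)}=1\} + 1}{\sum_{i=1}^{m} 1\{y^{(i)}=1\} + 2}, \qquad
\phi_{j|y=0} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)}=1 \wedge y^{(i)}=0\} + 1}{\sum_{i=1}^{m} 1\{y^{(i)}=0\} + 2}
```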
Adding 1 to the numerator and 2 to the denominator (since each x_j takes 2 possible values) solves the zero-probability problem.