1. Problem
We have discussed PCA and ICA before; for those methods the sample data need not carry a class label y. Recall that when we do regression, having too many features can introduce irrelevant features, cause over-fitting, and so on. We can use PCA for dimensionality reduction, but PCA does not take the class label into account: it is unsupervised.
For example, go back to the earlier problem of documents containing "learn" and "study". After applying PCA we may be able to merge these two features into one, reducing the dimension. But suppose our class label y indicates whether the topic of the article is related to learning. Then these two features have almost no effect on y and could be removed entirely.
For another example, suppose we recognize faces in 100*100-pixel images, with each pixel as a feature; that gives 10,000 features, while the corresponding class label y is just 0/1, with 1 meaning the image is a face. With so many features, training is not only complex, but unnecessary features can have unpredictable effects on the result. What we want after dimensionality reduction is a small number of the best features (those most closely related to y). What should we do?
2. Linear Discriminant Analysis (Two-Class Case)
Review our earlier logistic regression method. Given m training samples $x^{(i)}$ with n-dimensional features ($i$ from 1 to m), each with a corresponding class label $y^{(i)}$, we just need to learn the parameters $\theta$ so that $y = g(\theta^T x)$ (g is the sigmoid function).
Now we consider only binary classification, that is, y = 1 or y = 0.
To simplify notation, we first change symbols and restate the problem. We are given N samples $x^{(1)}, \dots, x^{(N)}$ whose features are d-dimensional; $N_1$ of them belong to class $\omega_1$ and the other $N_2$ belong to class $\omega_2$.
Now we think the original features are too many, and we want to reduce the d-dimensional features to only one dimension, while ensuring that the classes are still "clearly" reflected in the low-dimensional data, that is, that this single dimension determines the class of each sample.
We call this best direction vector w (d-dimensional). The projection of x (d-dimensional) onto w can be computed with the following formula: $y = w^T x$.
The y obtained here is not a 0/1 value but the distance from the projection of x onto the line to the origin.
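As a small worked example of the projection formula (the numbers below are chosen purely for illustration and are not from the original article): take a unit direction $w = \frac{1}{\sqrt{2}}(1, 1)^T$ and a sample $x = (2, 0)^T$. Then
$$ y = w^T x = \frac{1}{\sqrt{2}}(1 \cdot 2 + 1 \cdot 0) = \sqrt{2}, $$
so the sample lands at distance $\sqrt{2}$ from the origin along the direction $w$.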
When x is two-dimensional, we are looking for a straight line (in the direction w) onto which to project, and then among such lines we look for the one that best separates the sample points. For example:
Intuitively, the figure on the right is better: the sample points of different classes are well separated after projection.
Next we will find the best w from a quantitative perspective.
First, we find the mean (center point) of each class of samples; here i takes only two values (i = 1, 2):
$$ \mu_i = \frac{1}{N_i} \sum_{x \in \omega_i} x $$
The mean of the sample points after projecting x onto w is
$$ \tilde{\mu}_i = \frac{1}{N_i} \sum_{x \in \omega_i} w^T x = w^T \mu_i $$
We can see that the mean after projection is just the projection of the class center.
What is the best straight line (w)? First, we observe that a line which separates the two projected class centers well is a good line. Quantitatively:
$$ J(w) = |\tilde{\mu}_1 - \tilde{\mu}_2| = |w^T(\mu_1 - \mu_2)| $$
The larger J(w), the better.
But can we consider only J(w)? No. Look at the figure below:
The sample points are roughly uniformly distributed inside the ellipses. When projected onto the horizontal axis x1, we get a larger distance between the centers, i.e., a larger J(w); but because of the overlap, x1 cannot separate the sample points. When projected onto the vertical axis x2, J(w) is smaller, yet the sample points can be separated. Therefore, we also need to consider the variance of the sample points within each class: the larger this variance, the harder it is to separate the classes.
We use another quantity, called the scatter, computed for each class after projection as follows:
$$ \tilde{s}_i^2 = \sum_{y \in \omega_i} (y - \tilde{\mu}_i)^2 $$
As can be seen from the formula, this is just the variance without the division by the number of samples. The geometric meaning of the scatter is how densely the sample points are distributed: the larger the value, the more spread out the points, and the smaller, the more concentrated.
The projected sample points we want should look like this: samples of different classes are separated as well as possible, while samples of the same class cluster as tightly as possible; that is, the larger the difference of the means the better, and the smaller the scatter the better. Conveniently, we can measure both with $J(w)$ and the scatters, and the final criterion is
$$ J(w) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2} $$
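A quick numeric check of this criterion, with toy projected values invented here purely for illustration: suppose the class-1 projections are {1, 2, 3} and the class-2 projections are {7, 8, 9}. Then
$$ \tilde{\mu}_1 = 2, \quad \tilde{\mu}_2 = 8, \quad \tilde{s}_1^2 = \tilde{s}_2^2 = 2, \quad J(w) = \frac{(8-2)^2}{2+2} = 9. $$
A direction that pushed the projected means further apart, or shrank the within-class spread, would raise this value.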
The next step is clear: we only need to find the w that maximizes J(w).
First, expand the scatter formula:
$$ \tilde{s}_i^2 = \sum_{x \in \omega_i} (w^T x - w^T \mu_i)^2 = \sum_{x \in \omega_i} w^T (x - \mu_i)(x - \mu_i)^T w $$
We define the middle part of the above formula as
$$ S_i = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T $$
Isn't this just the covariance matrix of class i without the division by the number of samples? It is called a scatter matrix.
We continue and define
$$ S_W = S_1 + S_2 $$
which is called the within-class scatter matrix.
Returning to the earlier formula and substituting this middle part, we get
$$ \tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_W w $$
Then we expand the numerator:
$$ (\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (w^T \mu_1 - w^T \mu_2)^2 = w^T (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w = w^T S_B w $$
Here $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$ is called the between-class scatter matrix; it is the outer product of two vectors, so although it is a matrix, its rank is 1.
Then J(w) can be expressed as
$$ J(w) = \frac{w^T S_B w}{w^T S_W w} $$
Before taking the derivative we need to normalize the denominator, because otherwise w could be scaled by an arbitrary factor and we could not pin w down. Therefore we require $w^T S_W w = 1$, add a Lagrange multiplier, and differentiate:
$$ c(w) = w^T S_B w - \lambda (w^T S_W w - 1), \qquad \frac{\partial c}{\partial w} = 2 S_B w - 2 \lambda S_W w = 0 \;\Rightarrow\; S_B w = \lambda S_W w $$
Matrix calculus is used here; when differentiating, one can simply apply the rule $\frac{\partial}{\partial w}(w^T A w) = (A + A^T) w = 2 A w$ for a symmetric matrix A.
If $S_W$ is invertible, multiply both sides of the result of the differentiation by $S_W^{-1}$:
$$ S_W^{-1} S_B w = \lambda w $$
The nice result is that w is an eigenvector of the matrix $S_W^{-1} S_B$.
This formulation is called the Fisher linear discriminant.
Wait, let's take another look and recall the earlier formula $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$.
So
$$ S_B w = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w = (\mu_1 - \mu_2)\,\lambda_w, $$
where $\lambda_w = (\mu_1 - \mu_2)^T w$ is just a scalar.
Substituting this into the eigenvalue equation gives the final form
$$ S_W^{-1} S_B w = S_W^{-1} (\mu_1 - \mu_2)\,\lambda_w = \lambda w $$
Since w can be scaled by any factor without affecting the result, the unknown scalar constants $\lambda_w$ and $\lambda$ on the two sides can be dropped, giving
$$ w \propto S_W^{-1} (\mu_1 - \mu_2) $$
So far, we only need the means and scatters of the original samples to find the best direction w. This is the linear discriminant analysis proposed by Fisher in 1936.
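Here is a minimal NumPy sketch of this two-class recipe (the function and variable names are my own, not from the original article): it computes the class means, the within-class scatter matrix $S_W$, and the direction $w \propto S_W^{-1}(\mu_1 - \mu_2)$.

```python
import numpy as np

def fisher_lda_direction(X1, X2):
    """Two-class Fisher LDA: return the projection direction w.

    X1, X2: arrays of shape (N1, d) and (N2, d) holding the samples
    of class omega_1 and omega_2 as rows.
    """
    mu1 = X1.mean(axis=0)                     # class means
    mu2 = X2.mean(axis=0)
    # within-class scatter matrix S_W = S_1 + S_2
    S1 = (X1 - mu1).T @ (X1 - mu1)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    Sw = S1 + S2
    # best direction: w proportional to S_W^{-1} (mu1 - mu2)
    w = np.linalg.solve(Sw, mu1 - mu2)
    return w / np.linalg.norm(w)              # the scale of w does not matter

# toy usage: two Gaussian blobs in 2-D
rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], 0.5, size=(100, 2))
X2 = rng.normal([2, 1], 0.5, size=(100, 2))
w = fisher_lda_direction(X1, X2)
print("projection direction w:", w)
print("projected class means:", (X1 @ w).mean(), (X2 @ w).mean())  # y = w^T x
```

Solving the linear system instead of forming $S_W^{-1}$ explicitly is only a numerical convenience; any rescaling of w gives the same one-dimensional projection up to scale.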
The figure below shows the projection result for the preceding two-dimensional example:
3. Linear Discriminant Analysis (Multi-Class Case)
The preceding discussion involved only two classes. If there are multiple classes, how should the method change so that the classes can still be separated after projection?
Previously we discussed how to reduce d dimensions to one dimension. Now that there are more classes, one dimension may no longer be enough. Suppose we have C classes and need K projection vectors (basis vectors).
These K basis vectors are written as $W = [w_1 \,|\, w_2 \,|\, \dots \,|\, w_K]$.
After projection onto the K basis vectors, a sample point is expressed as $y = W^T x$, a K-dimensional vector.
To define J(W) as in the previous section, we again consider the between-class scatter and the within-class scatter.
When the samples are two-dimensional, we can view this geometrically:
As in the previous section, $S_{W1}$ measures how the sample points of class 1 scatter around the class center $\mu_1$; the between-class part becomes the covariance-like matrix of the class-1 center relative to the overall sample center $\mu$, that is, the scatter of class 1 relative to $\mu$.
$S_W$ is
$$ S_W = \sum_{i=1}^{C} S_{Wi}, \qquad S_{Wi} = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T $$
The formula for $S_W$ is unchanged; it is still similar to the covariance matrix of the sample points within each class.
$S_B$ does need to change. Originally it measured the scatter of the two mean points; now it measures the scatter of each class mean relative to the overall sample center. It is like treating the class means $\mu_i$ as sample points and taking the covariance matrix of these means. If a class contains more sample points, it gets a slightly larger weight, expressed as $N_i/N$; but since J(W) is insensitive to constant factors, $N_i$ is used instead:
$$ S_B = \sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^T $$
where
$$ \mu = \frac{1}{N} \sum_{x} x $$
is the mean of all samples.
All of the above are the formulas before projection, but the numerator and denominator of J(W) are actually computed after projection. Below are the corresponding quantities after the sample points are projected:
These two are the formulas for the mean of the class-i sample points, and of all sample points, after projection onto a basis vector:
$$ \tilde{\mu}_i = \frac{1}{N_i} \sum_{y \in \omega_i} y, \qquad \tilde{\mu} = \frac{1}{N} \sum_{y} y $$
The following two are $\tilde{S}_W$ and $\tilde{S}_B$ after projection:
$$ \tilde{S}_W = \sum_{i=1}^{C} \sum_{y \in \omega_i} (y - \tilde{\mu}_i)(y - \tilde{\mu}_i)^T, \qquad \tilde{S}_B = \sum_{i=1}^{C} N_i (\tilde{\mu}_i - \tilde{\mu})(\tilde{\mu}_i - \tilde{\mu})^T $$
In fact, these are just the previous formulas with x replaced by the projected y.
Combining the projections onto all the basis vectors (the columns of W), the two quantities become
$$ \tilde{S}_W = W^T S_W W, \qquad \tilde{S}_B = W^T S_B W $$
Here W is the matrix of basis vectors, $\tilde{S}_W$ is the sum of the within-class scatter matrices after projection, and $\tilde{S}_B$ is the sum of the scatter matrices of the projected class centers relative to the projected overall sample center.
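The substitution in the last step can be written out explicitly (this short derivation is my own filling-in of the step, not taken from the original text):
$$
\tilde{S}_W = \sum_{i=1}^{C} \sum_{x \in \omega_i} \bigl(W^T x - W^T \mu_i\bigr)\bigl(W^T x - W^T \mu_i\bigr)^T
= W^T \Bigl( \sum_{i=1}^{C} \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T \Bigr) W
= W^T S_W W,
$$
and the same manipulation applied to the class means gives $\tilde{S}_B = W^T S_B W$.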
Recall the formula for J(w) in the previous section: the numerator was the distance between the two class centers and the denominator was the scatter within each class. Now the projection direction is multi-dimensional (several lines), so the numerator needs to change: instead of summing the pairwise distances between class centers (which is not useful for describing the between-class dispersion), we sum the scatter of each class center relative to the overall sample center.
The final form of J(W) is then
$$ J(W) = \frac{|\tilde{S}_B|}{|\tilde{S}_W|} = \frac{|W^T S_B W|}{|W^T S_W W|} $$
Since the numerator and denominator we obtain are both scatter matrices, we need to turn each matrix into a real number, so we take the determinant. Because the value of the determinant equals the product of the matrix's eigenvalues, and each eigenvalue indicates the degree of dispersion along its eigenvector, we use the determinant here (I find this a little far-fetched; the justification is not that convincing).
The whole problem again comes down to maximizing J(W). We fix the denominator to 1 and obtain the final result (I have looked through many lecture notes and articles and did not find the full derivation):
$$ S_B w_i = \lambda S_W w_i $$
Just as in the conclusion of the previous section,
$$ S_W^{-1} S_B w_i = \lambda w_i $$
Finally, the problem again comes down to the eigenvalues of the matrix $S_W^{-1} S_B$: first compute the eigenvalues, then take the eigenvectors corresponding to the largest K eigenvalues to form the matrix W.
Note: since each term $(\mu_i - \mu)(\mu_i - \mu)^T$ in $S_B$ has rank 1, the rank of $S_B$ is at most C (the rank of a sum of matrices is at most the sum of their ranks). Moreover, once the first C-1 class means are known, the last one can be expressed as a linear combination of them, so the rank of $S_B$ is at most C-1. Therefore K is at most C-1, that is, there are at most C-1 useful eigenvectors. The eigenvectors with the largest eigenvalues separate the classes best.
Because $S_W^{-1} S_B$ is not necessarily symmetric, the K eigenvectors obtained are not necessarily orthogonal, which is a difference from PCA.
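To tie the multi-class recipe together, here is a minimal NumPy sketch (the names and the use of scipy.linalg.eig for the generalized eigenproblem are my own choices, not from the original article): it builds $S_W$ and $S_B$, solves $S_B w = \lambda S_W w$, and keeps the eigenvectors belonging to the largest eigenvalues, at most C-1 of them.

```python
import numpy as np
from scipy.linalg import eig  # generalized eigenvalue solver

def multiclass_lda(X, labels, k=None):
    """Multi-class LDA: return a (d, k) projection matrix W.

    X: array of shape (N, d); labels: length-N array of class ids.
    k defaults to C-1, the maximum useful number of directions.
    """
    classes = np.unique(labels)
    C, d = len(classes), X.shape[1]
    k = (C - 1) if k is None else k
    mu = X.mean(axis=0)                       # overall sample mean
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)                  # within-class scatter
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)     # between-class scatter
    # generalized eigenproblem S_B w = lambda S_W w
    eigvals, eigvecs = eig(Sb, Sw)
    order = np.argsort(eigvals.real)[::-1]    # largest eigenvalues first
    return eigvecs[:, order[:k]].real         # columns are w_1 ... w_k

# toy usage: three 2-D classes projected onto C-1 = 2 directions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in ([0, 0], [2, 0], [1, 2])])
labels = np.repeat([0, 1, 2], 50)
W = multiclass_lda(X, labels)
Y = X @ W                                     # projected samples y = W^T x
print(W.shape, Y.shape)
```

As noted above, the columns of W returned this way need not be orthogonal, since $S_W^{-1} S_B$ is generally not symmetric.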