"Reprint" Linear discriminant analysis (Linear discriminant analyses) (i)


1. The problem

In our earlier discussions of PCA and ICA, the sample data could come without any class label y. Recall that in regression, having too many features introduces irrelevant features, overfitting, and so on. We can use PCA to reduce the dimensionality, but PCA does not take the class label into account; it is unsupervised.

For example, return to the earlier document example with the features "learn" and "study". After applying PCA, these two features may be merged into one, and the dimensionality is reduced. But suppose our class label y is whether or not the topic of the document is about learning. Then, as stated there, these two features have little effect on y and could be removed entirely.

As another example, suppose we are doing face recognition on 100*100-pixel images. If each pixel is a feature, there are 10,000 features, while the class label y takes only the values 0/1, where 1 means the image is a face. With so many features, training is not only expensive, but the unnecessary features can also affect the result in unpredictable ways. What we want after dimensionality reduction is a small set of the best features, the ones most closely related to y. How do we get them?

2. Linear Discriminant Analysis (the two-class case)

Recall the earlier logistic regression setup: given a training set of $m$ samples with $n$-dimensional features, $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$, each sample has a corresponding class label $y^{(i)}$. We only need to learn parameters $\theta$ so that $h_\theta(x) = g(\theta^T x)$, where $g$ is the sigmoid function.

For now, consider only binary classification, that is, y = 1 or y = 0.

For convenience, let us change notation and restate the problem: we are given $N$ samples with $d$-dimensional features, $x^{(i)} \in \mathbb{R}^d$, of which $N_1$ samples belong to class $\omega_1$ and the other $N_2$ samples belong to class $\omega_2$.

Now we feel that the original number of features is too large and want to reduce the $d$-dimensional features to a single dimension, while still ensuring that the classes are "clearly" reflected in the low-dimensional data, that is, that this one dimension can still determine the class of each sample.

We call this best vector $w$ (a $d$-dimensional vector). The projection of a sample $x$ (also $d$-dimensional) onto $w$ is then computed as

$$y = w^T x$$

The resulting value $y$ is not a 0/1 label but the (signed) distance from the origin to the projection of $x$ onto the line defined by $w$.
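To make the projection step concrete, here is a minimal sketch (assuming NumPy; the few two-dimensional points and the direction w are made up for illustration). The projection of every sample onto w is just a dot product.

```python
import numpy as np

# Hypothetical 2-D samples, one row per sample.
X = np.array([[1.0, 2.0],
              [2.0, 1.5],
              [3.0, 3.5]])

# A candidate projection direction w (d-dimensional, here d = 2, unit length).
w = np.array([0.6, 0.8])

# y = w^T x for every sample: each value is the signed distance of the
# projected point from the origin along the line defined by w.
y = X @ w
print(y)
```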

When $x$ is two-dimensional, we are looking for a straight line (in the direction $w$) to project onto, and then among the candidate lines we look for the one that best separates the projected sample points. For example:

Intuitively, the figure on the right is better: its projection separates the sample points of the two classes well.

Next we will find the best w from a quantitative point of view.

First we compute the mean (center point) of each class of samples; here there are only two classes, so $i \in \{1, 2\}$:

$$\mu_i = \frac{1}{N_i} \sum_{x \in \omega_i} x$$

Since the value of a sample point after projecting $x$ onto $w$ is $y = w^T x$, the mean after projection is

$$\tilde{\mu}_i = \frac{1}{N_i} \sum_{x \in \omega_i} w^T x = w^T \mu_i,$$

that is, the projected mean is simply the projection of the class center point.

What, then, is the best line (direction $w$)? A first thought is that a line that separates the projected centers of the two classes as far as possible is a good line. Quantitatively:

$$J(w) = |\tilde{\mu}_1 - \tilde{\mu}_2| = |w^T(\mu_1 - \mu_2)|,$$

and the larger $J(w)$ is, the better.

But is a larger $J(w)$ always better? Not necessarily; look at the following figure.

Here the sample points of each class are distributed roughly uniformly inside an ellipse. Projecting onto the horizontal axis $x_1$ gives a larger distance between centers, i.e. a larger $J(w)$, yet along $x_1$ the projected points overlap and cannot be separated. Projecting onto the vertical axis $x_2$ gives a smaller $J(w)$, but the projected points can be separated. Therefore we must also take into account the spread of the sample points within each class: the larger that spread, the harder it is to separate the projected points.

Instead of the variance, we use another measure, called the scatter, to describe how spread out each class is after projection:

$$\tilde{s}_i^2 = \sum_{y \in \omega_i} (y - \tilde{\mu}_i)^2$$

As the formula shows, the scatter is just the variance without the division by the number of samples. Geometrically it describes how densely the projected points of a class are packed: the larger the scatter, the more dispersed the points; the smaller, the more concentrated.

What we want after projection is that sample points of different classes are as far apart as possible while points of the same class cluster as tightly as possible; that is, the larger the difference between the projected means the better, and the smaller the scatters the better. Combining the two, the final criterion is

$$J(w) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$
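As an illustration of why this criterion fixes the problem from the ellipse example, here is a small sketch (assuming NumPy; the synthetic data and the helper name fisher_criterion are my own). The horizontal direction has the larger distance between projected centers, but the vertical direction scores the higher J(w) because its within-class scatter is far smaller.

```python
import numpy as np

def fisher_criterion(X1, X2, w):
    """Projected-space Fisher criterion J(w) for two classes.

    X1, X2: arrays of shape (N_i, d) with the samples of each class.
    w:      a candidate d-dimensional projection direction.
    """
    y1, y2 = X1 @ w, X2 @ w                    # 1-D projections
    mu1_t, mu2_t = y1.mean(), y2.mean()        # projected class means
    s1_t = np.sum((y1 - mu1_t) ** 2)           # scatter of class 1 after projection
    s2_t = np.sum((y2 - mu2_t) ** 2)           # scatter of class 2 after projection
    return (mu1_t - mu2_t) ** 2 / (s1_t + s2_t)

# Made-up 2-D data: wide spread along x1, narrow along x2.
rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], [3.0, 0.3], size=(50, 2))
X2 = rng.normal([4.0, 1.0], [3.0, 0.3], size=(50, 2))

# x1 has the bigger gap between projected centers, but x2 gets the larger J(w)
# because the within-class scatter along x2 is much smaller.
print(fisher_criterion(X1, X2, np.array([1.0, 0.0])))   # horizontal axis
print(fisher_criterion(X1, X2, np.array([0.0, 1.0])))   # vertical axis
```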

The rest is now clear: we just need to find the $w$ that maximizes $J(w)$.

First, expand the scatter formula:

$$\tilde{s}_i^2 = \sum_{x \in \omega_i} (w^T x - w^T \mu_i)^2 = \sum_{x \in \omega_i} w^T (x - \mu_i)(x - \mu_i)^T w$$

We define the middle part of this expression as

$$S_i = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T$$

Doesn't this look just like the covariance matrix of the class, only without dividing by the number of samples? It is called the scatter matrix.

We further define

$$S_W = S_1 + S_2,$$

called the within-class scatter matrix. Going back to the formula above and substituting this for the middle part, the denominator becomes

$$\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_W w$$

Next we expand the numerator:

$$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (w^T \mu_1 - w^T \mu_2)^2 = w^T (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w = w^T S_B w$$

Here $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$ is called the between-class scatter matrix. It is the outer product of a vector with itself, so although it is a matrix, its rank is 1.

Then $J(w)$ can finally be written as

$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$
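As a quick sanity check of this matrix form (a sketch with a tiny hard-coded two-class dataset of my own), $w^T S_B w / w^T S_W w$ gives exactly the same number as computing the means and scatters after projection:

```python
import numpy as np

# Tiny made-up two-class dataset (rows are samples).
X1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])
X2 = np.array([[6.0, 5.0], [7.0, 8.0], [8.0, 6.0]])
w = np.array([0.3, 0.7])

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

# Scatter matrices before projection.
S1 = (X1 - mu1).T @ (X1 - mu1)
S2 = (X2 - mu2).T @ (X2 - mu2)
S_W = S1 + S2                                   # within-class scatter
d = (mu1 - mu2).reshape(-1, 1)
S_B = d @ d.T                                   # between-class scatter, rank 1

# Matrix form of the criterion.
J_matrix = (w @ S_B @ w) / (w @ S_W @ w)

# Projected-space form: project first, then compute means and scatters.
y1, y2 = X1 @ w, X2 @ w
J_projected = (y1.mean() - y2.mean()) ** 2 / (
    np.sum((y1 - y1.mean()) ** 2) + np.sum((y2 - y2.mean()) ** 2))

print(J_matrix, J_projected)                    # identical up to rounding
```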

Before taking the derivative we need to normalize the denominator, because if we do not, $w$ can be scaled by any factor without changing $J(w)$ and cannot be pinned down. So we fix $w^T S_W w = 1$ and differentiate after adding a Lagrange multiplier:

$$c(w) = w^T S_B w - \lambda (w^T S_W w - 1)$$

Using matrix calculus, the derivative of a quadratic form can simply be taken as $\frac{\partial}{\partial w}(w^T A w) = 2 A w$ (for symmetric $A$), so setting the derivative to zero gives

$$\frac{\partial c}{\partial w} = 2 S_B w - 2\lambda S_W w = 0 \;\Longrightarrow\; S_B w = \lambda S_W w$$

If $S_W$ is invertible, multiplying both sides by $S_W^{-1}$ gives

$$S_W^{-1} S_B w = \lambda w$$

The encouraging result is that $w$ is an eigenvector of the matrix $S_W^{-1} S_B$.

This formulation is called Fisher's linear discriminant.

Wait, let us look again. From the earlier definition,

$$S_B w = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w = (\mu_1 - \mu_2)\,\lambda_w,$$

where $\lambda_w = (\mu_1 - \mu_2)^T w$ is just a scalar. Substituting this into the eigenvalue equation above gives

$$S_W^{-1} S_B w = \lambda_w S_W^{-1} (\mu_1 - \mu_2) = \lambda w$$

Since scaling $w$ up or down by any factor does not affect the result, we can drop the unknown scalar constants $\lambda$ and $\lambda_w$ on both sides and obtain

$$w = S_W^{-1}(\mu_1 - \mu_2)$$

So we only need to compute the means and the within-class scatter of the original samples to obtain the best projection direction $w$. This is the linear discriminant analysis proposed by Fisher in 1936.
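Here is a minimal sketch of the whole two-class procedure (assuming NumPy; the synthetic Gaussian data and the helper name fisher_lda_direction are illustrative, not from the original article):

```python
import numpy as np

def fisher_lda_direction(X1, X2):
    """Two-class Fisher LDA: w proportional to S_W^{-1} (mu_1 - mu_2)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
    # Solve S_W w = (mu_1 - mu_2) instead of forming the explicit inverse.
    w = np.linalg.solve(S_W, mu1 - mu2)
    return w / np.linalg.norm(w)            # the scale of w is arbitrary

# Synthetic two-class data.
rng = np.random.default_rng(1)
X1 = rng.multivariate_normal([0, 0], [[3, 1], [1, 1]], size=100)
X2 = rng.multivariate_normal([4, 3], [[3, 1], [1, 1]], size=100)

w = fisher_lda_direction(X1, X2)
print("projection direction:", w)

# A simple classification rule: project onto w and threshold at the midpoint
# of the two projected class means (class 1 lands on the higher side here,
# because w points from mu_2 toward mu_1).
threshold = 0.5 * ((X1 @ w).mean() + (X2 @ w).mean())
print("class-1 samples above threshold:", np.mean(X1 @ w > threshold))
print("class-2 samples below threshold:", np.mean(X2 @ w < threshold))
```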

Here are the projection results for the two-dimensional samples discussed above:

3. Linear Discriminant Analysis (the multi-class case)

So far we have only handled the two-class case. If the number of classes grows, how does the method change so that the classes can still be separated after projection?

What we discussed earlier was how to reduce $d$ dimensions to one. Now that there are more classes, a single dimension may not be enough. Suppose we have $C$ classes; we then need $K$ basis vectors onto which to project.

Denote these basis vectors by $w_1, w_2, \ldots, w_K$, collected as the columns of a matrix $W = [w_1 \; w_2 \; \cdots \; w_K]$. The result of projecting a sample point $x$ onto this set of basis vectors is written $y = (y_1, \ldots, y_K)^T$, and the following holds:

$$y_k = w_k^T x, \qquad \text{i.e. } y = W^T x$$

To measure $J(W)$ as in the previous section, we continue to consider the between-class scatter and the within-class scatter.

When the samples are two-dimensional, consider the geometric picture:

As in the previous section, $S_{W1}$ is the degree of scatter of the class-1 sample points relative to that class's center $\mu_1$, while $S_{B1}$ becomes the covariance-like matrix of the class-1 center relative to the overall sample center $\mu$, that is, the degree to which class 1 is scattered relative to $\mu$.

For $S_W$, the definition is essentially unchanged: it is still the sum of the per-class scatter matrices, each resembling the covariance matrix of the sample points inside that class:

$$S_W = \sum_{i=1}^{C} S_{Wi}, \qquad S_{Wi} = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T$$

$S_B$ does need to change. Originally it measured the scatter between the two class means; now it measures the scatter of each class mean relative to the overall sample mean. This is similar to treating each $\mu_i$ as a sample point and $\mu$ as their mean, and forming a covariance-like matrix. If a class contains more sample points, its weight should be slightly larger; the natural weight is $N_i/N$, but since $J(W)$ is insensitive to constant multiples, we simply use $N_i$:

$$S_B = \sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^T$$

where

$$\mu = \frac{1}{N} \sum_{x} x = \frac{1}{N} \sum_{i=1}^{C} N_i \mu_i$$

is the mean of all the samples.
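As a sketch of these pre-projection quantities (assuming NumPy; the scatter_matrices helper and the tiny three-class dataset are made up for illustration):

```python
import numpy as np

def scatter_matrices(X, labels):
    """Multi-class within-class scatter S_W and between-class scatter S_B.

    X:      (N, d) array of samples.
    labels: length-N array of integer class labels 0..C-1.
    """
    mu = X.mean(axis=0)                             # overall sample mean
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)                      # class mean
        S_W += (Xc - mu_c).T @ (Xc - mu_c)          # within-class scatter
        diff = (mu_c - mu).reshape(-1, 1)
        S_B += len(Xc) * (diff @ diff.T)            # weighted by N_i, not N_i/N
    return S_W, S_B

# Tiny hypothetical 3-class, 2-D example.
X = np.array([[0, 0], [1, 0], [0, 1],
              [5, 5], [6, 5], [5, 6],
              [0, 8], [1, 8], [0, 9]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
S_W, S_B = scatter_matrices(X, labels)
print(S_W)
print(S_B)
```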

The discussion above concerns the formulas before projection, but the numerator and denominator of the actual $J(W)$ are computed after projection. Let us now look at how the formulas change once the sample points are projected.

These two formulas give the mean of the class-$i$ sample points and the overall mean after projection onto a basis vector $w$:

$$\tilde{\mu}_i = \frac{1}{N_i} \sum_{y \in \omega_i} y = w^T \mu_i, \qquad \tilde{\mu} = \frac{1}{N} \sum_{y} y = w^T \mu$$

The next two are $S_W$ and $S_B$ after projection onto a basis vector:

$$\tilde{S}_W = \sum_{i=1}^{C} \sum_{y \in \omega_i} (y - \tilde{\mu}_i)(y - \tilde{\mu}_i)^T, \qquad \tilde{S}_B = \sum_{i=1}^{C} N_i (\tilde{\mu}_i - \tilde{\mu})(\tilde{\mu}_i - \tilde{\mu})^T$$

In fact these are just the earlier formulas with $x$ replaced by $y$ and $\mu$ replaced by $\tilde{\mu}$.

Combining the $\tilde{S}_W$ and $\tilde{S}_B$ over all the projection vectors (the columns of $W$) and updating the two quantities, we get

$$\tilde{S}_W = W^T S_W W, \qquad \tilde{S}_B = W^T S_B W$$

Here $W$ is the matrix of basis vectors, $\tilde{S}_W$ is the sum of the within-class scatter matrices after projection, and $\tilde{S}_B$ is the sum of the scatter matrices of each class center relative to the full-sample center after projection.

Recall the formula for $J(w)$ in the previous section: the numerator was the distance between the two class centers and the denominator was each class's own scatter. Now the projection is multi-dimensional (several lines at once), so the numerator needs some changes: instead of the pairwise distances between class centers (which would not describe the between-class dispersion well), we use the total scatter of each class center relative to the full-sample center.

The final form of $J(W)$ is then

$$J(W) = \frac{\left| W^T S_B W \right|}{\left| W^T S_W W \right|}$$

Since the numerator and denominator are both scatter matrices, we take determinants to turn them into real numbers. And because the determinant of a matrix equals the product of its eigenvalues, and each eigenvalue describes the degree of dispersion along its eigenvector, it makes some sense to use the determinant here (I find this argument a bit far-fetched and not entirely persuasive).

The whole problem again reduces to maximizing $J(W)$. Fixing the denominator to 1 and taking the derivative (as before, with Lagrange multipliers) gives the final result (I searched many lecture notes and articles and did not find the full derivation):

$$S_B w_i = \lambda_i S_W w_i$$

With the same conclusion as in the previous section,

$$S_W^{-1} S_B \, w_i = \lambda_i w_i$$

so in the end it comes down to finding the eigenvalues and eigenvectors of the matrix $S_W^{-1} S_B$: compute the eigenvalues, then take the eigenvectors of the largest $K$ eigenvalues as the columns of the matrix $W$.
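Putting the multi-class case together, here is a minimal sketch (assuming NumPy; the function name lda_projection and the synthetic three-class data are my own) that builds $S_W$ and $S_B$ as above and keeps the eigenvectors of $S_W^{-1} S_B$ with the largest eigenvalues:

```python
import numpy as np

def lda_projection(X, labels, k):
    """Multi-class LDA sketch: top-k eigenvectors of S_W^{-1} S_B.

    Returns a (d, k) projection matrix W; k can be at most C - 1.
    """
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)
        diff = (mu_c - mu).reshape(-1, 1)
        S_B += len(Xc) * (diff @ diff.T)

    # Eigen-decomposition of S_W^{-1} S_B. This matrix is generally not
    # symmetric, so use the general solver and keep the real parts.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1]          # largest eigenvalues first
    return eigvecs[:, order[:k]].real

# Hypothetical 3-class, 4-D data: at most C - 1 = 2 useful directions.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 1.0, size=(40, 4))
               for m in ([0, 0, 0, 0], [4, 0, 2, 0], [0, 4, 0, 2])])
labels = np.repeat([0, 1, 2], 40)

W = lda_projection(X, labels, k=2)
Y = X @ W                                           # projected (N, 2) data
print(W.shape, Y.shape)
```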

Note: because each term $(\mu_i - \mu)(\mu_i - \mu)^T$ in $S_B$ has rank 1, the rank of $S_B$ is at most $C$ (the rank of a sum of matrices is at most the sum of their ranks). Moreover, once the first $C-1$ of the vectors $\mu_i - \mu$ are known, the last one is a linear combination of them (since $\sum_i N_i(\mu_i - \mu) = 0$), so the rank of $S_B$ is at most $C-1$. Hence $K$ is at most $C-1$; that is, there are at most $C-1$ useful eigenvectors with nonzero eigenvalues. The eigenvectors corresponding to the largest eigenvalues give the best separation.

Because $S_W^{-1} S_B$ is not necessarily a symmetric matrix, the resulting $K$ eigenvectors are not necessarily orthogonal, which is a difference from PCA.

"Reprint" Linear discriminant analysis (Linear discriminant analyses) (i)

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.