Linear Discriminant Analysis (II)

4. Example
Spherical sample points in 3-dimensional space are projected onto two dimensions; the projection direction w1 achieves better class separation than w2.
Comparison of dimensionality reduction by PCA and LDA:
PCA projects the sample points onto the direction of maximum variance, whereas LDA chooses the projection direction that gives the best classification performance.
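To make the contrast concrete, here is a small numeric sketch (my own illustration, not part of the original post): for the same two-class data set, the PCA direction is the top eigenvector of the total covariance, while the two-class LDA direction is w proportional to S_W^{-1}(mu1 - mu2).

    import numpy as np

    rng = np.random.default_rng(0)
    # two elongated Gaussian classes whose means differ along the low-variance axis
    X1 = rng.normal([0.0, 0.0], [3.0, 0.5], size=(200, 2))
    X2 = rng.normal([0.0, 2.0], [3.0, 0.5], size=(200, 2))
    X = np.vstack([X1, X2])

    # PCA: direction of maximum total variance
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    pca_dir = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue

    # LDA (two classes): w proportional to Sw^{-1} (mu1 - mu2)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    Sw = np.cov(X1, rowvar=False) * (len(X1) - 1) + np.cov(X2, rowvar=False) * (len(X2) - 1)
    lda_dir = np.linalg.solve(Sw, mu1 - mu2)
    lda_dir /= np.linalg.norm(lda_dir)

    print("PCA direction:", pca_dir)   # roughly the x-axis (largest spread)
    print("LDA direction:", lda_dir)   # roughly the y-axis (best class separation)

On this data PCA picks the long axis of the point clouds, which mixes the two classes, while LDA picks the axis along which the class means actually differ.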
Since LDA is called linear discriminant analysis, it should have some discriminative ability: given a new sample x, how do we determine its class?
In the binary case, we project x onto the line to get y = w^T x and compare y with a threshold y0: if y exceeds y0, the sample belongs to one class, otherwise to the other. But how do we find this y0?
One approach: by the central limit theorem, the projected value y of each class (a sum of independent, identically distributed contributions) is approximately Gaussian, so we model the class-conditional densities p(y|C1) and p(y|C2) as Gaussians, estimate their parameters by maximum likelihood, and then use decision theory to find the optimal y0 (see PRML for details).
This is feasible but cumbersome; Section 7 (Some issues) gives a simpler answer.
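A minimal sketch of that route (my own illustration, not from the original post), assuming the projected values of each class are modeled as one-dimensional Gaussians and the priors are the class frequencies: fit both Gaussians by maximum likelihood and take y0 where p(y|C1)P(C1) = p(y|C2)P(C2).

    import numpy as np

    def lda_threshold(y1, y2):
        """y1, y2: projected values w^T x of the two classes."""
        # maximum-likelihood Gaussian fits to the projected values
        m1, m2 = y1.mean(), y2.mean()
        v1, v2 = y1.var(), y2.var()
        p1 = len(y1) / (len(y1) + len(y2))
        p2 = 1.0 - p1
        # solve p1*N(y|m1,v1) = p2*N(y|m2,v2), a quadratic a*y^2 + b*y + c = 0
        a = 0.5 / v2 - 0.5 / v1
        b = m1 / v1 - m2 / v2
        c = (0.5 * m2 ** 2 / v2 - 0.5 * m1 ** 2 / v1
             + 0.5 * np.log(v2 / v1) + np.log(p1 / p2))
        if abs(a) < 1e-12:                  # equal variances: the boundary is linear
            return -c / b
        roots = [r.real for r in np.roots([a, b, c]) if abs(r.imag) < 1e-9]
        # keep the root lying between the two projected class means
        lo, hi = min(m1, m2), max(m1, m2)
        return next(r for r in roots if lo <= r <= hi)

When the two variances are equal this reduces to the midpoint of the projected class means shifted by the log prior ratio, i.e. the familiar linear decision rule.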
5. Some limitations of using LDA
1. LDA can produce at most a (C-1)-dimensional subspace.
The dimension after LDA reduction lies in [1, C-1], independent of the original feature dimension n; for binary classification, the projection is at most 1-dimensional.
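A one-line reason for the C-1 limit (a sketch, using N_j for the class sizes, mu_j for the class means, and mu for the overall mean, as in the multi-class formula of Section 7):

    S_B = Σ_{j=1}^{C} N_j (mu_j - mu)(mu_j - mu)^T,    with    Σ_{j=1}^{C} N_j (mu_j - mu) = 0.

S_B is a sum of C rank-one matrices whose defining vectors satisfy one linear constraint, so rank(S_B) <= C - 1; the LDA projection directions are the eigenvectors of S_W^{-1} S_B with nonzero eigenvalue, so at most C - 1 useful directions exist.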
2. LDA is not suitable for dimensionality reduction on samples with non-Gaussian distributions.
In the figure, the red region in the middle represents one class of samples and the blue region the other; since there are only two classes, the projection is at most 1-dimensional. No matter how the points are projected onto a line, it is difficult to make the red points and the blue points each cluster together while keeping the two classes separated.
3. LDA is not effective when the class-discriminative information lies in the variance rather than the mean.
As shown in the figure, the sample points are distinguished by variance information rather than mean information (the class means are nearly identical). LDA cannot separate them effectively because it relies heavily on mean information.
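A tiny illustration of this failure mode (my own example, not from the original post): two classes with the same mean but different variances give mu1 - mu2 close to 0, so w = S_W^{-1}(mu1 - mu2) is essentially noise and the projected classes overlap completely.

    import numpy as np

    rng = np.random.default_rng(1)
    X1 = rng.normal(0.0, 0.5, size=(500, 2))   # inner class: small variance
    X2 = rng.normal(0.0, 3.0, size=(500, 2))   # outer class: same mean, large variance

    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    Sw = np.cov(X1, rowvar=False) * 499 + np.cov(X2, rowvar=False) * 499
    w = np.linalg.solve(Sw, mu1 - mu2)

    y1, y2 = X1 @ w, X2 @ w
    print("||mu1 - mu2||  :", np.linalg.norm(mu1 - mu2))    # nearly 0
    print("projected means:", y1.mean(), y2.mean())         # nearly equal, no mean separation
    print("projected stds :", y1.std(), y2.std())           # the variance gap is the only signal left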
4. LDA may over-fit data.
6. Some variants of LDA
1. Non-parametric LDA
Nonparametric LDA uses local information and the K nearest sample points to compute S_B, which makes S_B full-rank, so we can extract more than C-1 eigenvectors. The separation after projection is also better.
2. Orthogonal LDA
First find the best eigenvector, then find a vector orthogonal to it that maximizes the Fisher criterion, and so on. This method can also break the C-1 limit.
3. Generalized LDA
Introduces Bayesian risk theory, and so on.
4. Kernel LDA
Uses a kernel function to compute the features, i.e., performs LDA in the kernel-induced feature space.
7. Some issues
The S_B used above for multi-class classification,

    S_B = Σ_{j=1}^{C} N_j (mu_j - mu)(mu_j - mu)^T,

is the scatter matrix of the class means about the overall sample mean mu, weighted by the class sizes N_j. If we plug C = 2 (binary classification) into this formula, we do not obtain the S_B = (mu1 - mu2)(mu1 - mu2)^T used for binary classification.
Therefore the binary and multi-class definitions of S_B differ, but their meaning is consistent.
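A quick numeric check of this point (my own sketch): plugging C = 2 into the multi-class formula gives Σ_j N_j (mu_j - mu)(mu_j - mu)^T = (N1 N2 / N)(mu1 - mu2)(mu1 - mu2)^T, i.e. the binary S_B up to a scalar factor, so the optimal projection direction is unchanged even though the matrices differ.

    import numpy as np

    rng = np.random.default_rng(2)
    X1 = rng.normal([0.0, 0.0], 1.0, size=(150, 2))
    X2 = rng.normal([2.0, -1.0], 1.0, size=(250, 2))
    N1, N2 = len(X1), len(X2)
    N = N1 + N2

    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    mu = np.vstack([X1, X2]).mean(axis=0)          # overall sample mean

    Sb_multi = N1 * np.outer(mu1 - mu, mu1 - mu) + N2 * np.outer(mu2 - mu, mu2 - mu)
    Sb_binary = np.outer(mu1 - mu2, mu1 - mu2)

    # the two definitions differ only by the scalar N1*N2/N
    print(np.allclose(Sb_multi, (N1 * N2 / N) * Sb_binary))   # True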
For the binary classification problem, it is somewhat surprising that least squares and Fisher linear discriminant analysis are consistent.
Here we prove this conclusion, which also answers the question of how to choose y0 raised in Section 4.
Recall linear regression: we are given N training samples x_i with D-dimensional features (i from 1 to N), each with a class label. Previously we let y = 0 denote one class and y = 1 the other; now we need to change the labels in order to prove the relationship between least squares and LDA.
That is, replace the 0/1 labels with the targets t_i = N/N1 for samples in the first class and t_i = -N/N2 for samples in the second class, where N1 and N2 are the two class sizes and N = N1 + N2.
We write down the least squares objective:

    E = (1/2) Σ_{i=1}^{N} (w^T x_i + w_0 - t_i)^2,

where w and w_0 are the weight parameters to fit.
Taking derivatives with respect to w_0 and w respectively and setting them to zero gives

    Σ_i (w^T x_i + w_0 - t_i) = 0,
    Σ_i (w^T x_i + w_0 - t_i) x_i = 0.

From the first equation, using Σ_i t_i = N1·(N/N1) - N2·(N/N2) = 0, we get w_0 = -w^T mu, where mu = (1/N) Σ_i x_i is the overall sample mean.
After eliminating w_0 (substituting it into the second equation), it can be shown that the expanded second equation is equivalent to

    (S_W + (N1 N2 / N) S_B) w = N (mu1 - mu2),

where S_B = (mu1 - mu2)(mu1 - mu2)^T, the same S_B as in the binary-class formula.
Because S_B w = (mu1 - mu2)((mu1 - mu2)^T w) always points in the direction of mu1 - mu2, and scalar factors only change the length of w,
the final result is still

    w proportional to S_W^{-1} (mu1 - mu2).
Geometrically, this means that after the class labels are redefined and linear regression is performed, the direction of the fitted line is exactly the projection direction that LDA obtains in the binary case.
From the redefined labels we can also see how to choose the threshold: a sample x belongs to the first class if y(x) = w^T(x - mu) > 0 and to the second class if y(x) < 0. So we can simply take y0 = 0: if w^T(x - mu) > 0, assign x to the first class, otherwise to the other class.
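To close the loop, a short numeric check of the whole argument (my own sketch): set the targets to N/N1 and -N/N2, solve least squares for (w_0, w), and compare the fitted w with S_W^{-1}(mu1 - mu2); the two directions should be parallel, and the fitted intercept satisfies w_0 = -w^T mu, so the decision rule is exactly w^T(x - mu) > 0.

    import numpy as np

    rng = np.random.default_rng(3)
    X1 = rng.normal([0.0, 0.0], 1.0, size=(120, 2))
    X2 = rng.normal([3.0, 1.0], 1.0, size=(80, 2))
    N1, N2 = len(X1), len(X2)
    N = N1 + N2

    X = np.vstack([X1, X2])
    t = np.concatenate([np.full(N1, N / N1), np.full(N2, -N / N2)])   # relabeled targets

    # least squares on [1, x] -> (w0, w)
    A = np.hstack([np.ones((N, 1)), X])
    coef, *_ = np.linalg.lstsq(A, t, rcond=None)
    w0, w_ls = coef[0], coef[1:]

    # Fisher / LDA direction
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    Sw = np.cov(X1, rowvar=False) * (N1 - 1) + np.cov(X2, rowvar=False) * (N2 - 1)
    w_lda = np.linalg.solve(Sw, mu1 - mu2)

    cos = w_ls @ w_lda / (np.linalg.norm(w_ls) * np.linalg.norm(w_lda))
    print("cosine(w_ls, w_lda):", cos)                        # ~1.0, same direction
    print("w0 + w^T mu        :", w0 + w_ls @ X.mean(axis=0)) # ~0, so the threshold is y0 = 0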
I have written a lot, and it is somewhat miscellaneous. There is also a topic model called LDA, but that name stands for Latent Dirichlet Allocation; its second author is Andrew Ng and the last author is his advisor Michael Jordan. I will write a summary post once I have read it.
"Reprint" Linear discriminant analysis (Linear discriminant analyses) (ii)