Introduction to the LDA Algorithm
1. LDA Algorithm Overview
Linear Discriminant Analysis (LDA), also called Fisher Linear Discriminant (FLD), is a classical algorithm for pattern recognition; it was introduced into the field of pattern recognition and artificial intelligence by Belhumeur in 1996. The basic idea of linear discriminant analysis is to project high-dimensional pattern samples onto an optimal discriminant vector space, so as to extract the classification information and compress the dimensionality of the feature space. After the projection, the pattern samples are guaranteed to have the largest between-class distance and the smallest within-class distance in the new subspace, that is, the patterns have the best separability in that space. LDA is therefore an effective feature extraction method: it maximizes the between-class scatter matrix of the projected pattern samples while minimizing their within-class scatter matrix.
2. LDA Assumptions and Notation
Suppose a data set contains $m$ samples $x_1, x_2, \ldots, x_m$, where each $x$ is an $n$-dimensional column vector (an $n$-row matrix). Let $n_i$ denote the number of samples belonging to class $i$; assuming there are $c$ classes, we have $n_1 + n_2 + \cdots + n_c = m$.
$S_b$: between-class scatter matrix
$S_w$: within-class scatter matrix
$n_i$: number of samples belonging to class $i$
$x_i$: the $i$-th sample
$\mu$: mean of all samples
$\mu_i$: mean of the samples in class $i$
3. Formula Derivation and Formal Description of the Algorithm
According to the notation above, the sample mean of class $i$ is:
$$\mu_i = \frac{1}{n_i} \sum_{x \in X_i} x \qquad (1)$$
where $X_i$ denotes the set of samples belonging to class $i$.
Similarly, we can obtain the overall sample mean:
$$\mu = \frac{1}{m} \sum_{k=1}^{m} x_k \qquad (2)$$
According to the definitions of the between-class scatter matrix and the within-class scatter matrix, the following equations are obtained:
$$S_b = \sum_{i=1}^{c} n_i (\mu_i - \mu)(\mu_i - \mu)^T \qquad (3)$$
$$S_w = \sum_{i=1}^{c} \sum_{x_k \in X_i} (x_k - \mu_i)(x_k - \mu_i)^T \qquad (4)$$
There is, of course, another representation of the between-class and within-class scatter matrices:
$$S_b = \sum_{i=1}^{c} P_i (\mu_i - \mu)(\mu_i - \mu)^T$$
$$S_w = \sum_{i=1}^{c} \frac{P_i}{n_i} \sum_{x_k \in X_i} (x_k - \mu_i)(x_k - \mu_i)^T$$
Here $P_i$ is the prior probability of class $i$, i.e., the proportion of samples that belong to class $i$ ($P_i = n_i/m$). Comparing the two sets of equations, the second set is simply the first multiplied by the constant $1/m$; as discussed below, this factor has no effect on the algorithm itself. Let us now analyze the idea behind the algorithm.
The matrix $(\mu_i - \mu)(\mu_i - \mu)^T$ is in essence a covariance matrix: it describes the relationship between class $i$ and the sample population. The elements on its diagonal represent the variance (i.e., the dispersion) of the class relative to the population, while the off-diagonal elements represent the covariance between the class and the population (i.e., their degree of correlation or redundancy). Formula (3) therefore sums these covariance matrices of the classes against the population, class by class; it is a macroscopic description of how dispersed, and how redundant, the classes are with respect to the population. Similarly, (4) sums, within each class, the covariance matrices between each sample and its own class (the class being characterized here by the mean of its samples), which describes how dispersed the samples are around their class. In fact, both the class mean and the overall mean merely act as intermediaries: the between-class scatter matrix describes, at a macroscopic level, the dispersion among the classes, while the within-class scatter matrix describes the dispersion of the samples within each class.
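To make formulas (1)-(4) concrete, here is a minimal MATLAB/Octave sketch (written for illustration, not taken from the original text) that computes $S_b$ and $S_w$ for a data matrix X whose rows are samples and a label vector y; the names X, y, Sb_p and Sw_p are ours.

```matlab
% Sketch of formulas (1)-(4): rows of X are samples, y holds the class labels.
function [Sb, Sw, Sb_p, Sw_p] = scatter_matrices(X, y)
    [m, n]  = size(X);
    classes = unique(y);
    c       = numel(classes);
    mu      = mean(X, 1);                 % overall mean, formula (2)
    Sb = zeros(n); Sw = zeros(n);
    for i = 1:c
        Xi   = X(y == classes(i), :);     % samples of class i
        ni   = size(Xi, 1);
        mu_i = mean(Xi, 1);               % class mean, formula (1)
        d    = (mu_i - mu)';
        Sb   = Sb + ni * (d * d');        % between-class term, formula (3)
        Di   = Xi - repmat(mu_i, ni, 1);  % deviations from the class mean
        Sw   = Sw + Di' * Di;             % within-class term, formula (4)
    end
    % Prior-weighted variant: identical up to the constant factor 1/m,
    % so it yields the same projection directions.
    Sb_p = Sb / m;
    Sw_p = Sw / m;
end
```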
As a classification algorithm, LDA naturally aims at low coupling between classes and high cohesion within each class; in other words, we want the entries of the within-class scatter matrix to be small and those of the between-class scatter matrix to be large, which corresponds to a good classification.
Here we introduce the Fisher discriminant criterion:
$$J_{\mathrm{Fisher}}(\phi) = \frac{\phi^T S_b \phi}{\phi^T S_w \phi} \qquad (5)$$
where $\phi$ is an $n$-dimensional column vector. Fisher linear discriminant analysis chooses as the projection direction the vector $\phi$ that maximizes (5); the physical meaning is that the projected samples have the largest between-class dispersion and the smallest within-class dispersion.
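As a small illustration (our own sketch, not from the text), criterion (5) can be evaluated in MATLAB/Octave for any candidate direction once $S_b$ and $S_w$ are available, for example from the sketch above:

```matlab
% Evaluate the Fisher criterion (5) for a candidate direction phi,
% assuming Sb and Sw have already been computed.
phi = [1; 0];                                  % an arbitrary example direction
J   = (phi' * Sb * phi) / (phi' * Sw * phi);   % larger J means better separation along phi
```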
Substituting formulas (3) and (4) into formula (5) gives:
$$J_{\mathrm{Fisher}}(\phi) = \frac{\sum_{i=1}^{c} n_i \left(\phi^T(\mu_i - \mu)\right)^2}{\sum_{i=1}^{c} \sum_{x_k \in X_i} \left(\phi^T(x_k - \mu_i)\right)^2}$$
We can define the matrix $M = \phi \phi^T$, which can be viewed as describing the low-dimensional space (hyperplane) spanned by $\phi$ onto which the samples are projected. A term such as $(\mu_i - \mu)^T M (\mu_i - \mu)$ can also be written as $\left(\phi^T(\mu_i - \mu)\right)^T \left(\phi^T(\mu_i - \mu)\right)$, and since the samples are column vectors this is exactly the squared geometric distance in the projected space. It follows that the numerator of the Fisher criterion is the sum of squared geometric distances between the projected class means and the projected overall mean, and likewise the denominator is the sum of squared geometric distances between the projected samples and their projected class means. The classification problem is thus transformed into finding a low-dimensional space such that, after the samples are projected onto it, the sum of squared between-class distances is as large as possible and the sum of squared within-class distances is as small as possible; that space gives the best classification.
Therefore, following the idea above, we look for the projection matrix composed of a set of optimal discriminant vectors by optimizing the criterion function below (here we can also see that the factor $1/m$ cancels between numerator and denominator, so the first set of formulas above and the second set have exactly the same effect):
$$W_{\mathrm{opt}} = \arg\max_{W} \frac{\left| W^T S_b W \right|}{\left| W^T S_w W \right|} \qquad (6)$$
It can be proved that, when $S_w$ is nonsingular (in practical implementations of LDA the samples are usually first reduced in dimension with PCA, which removes redundancy in the samples and thus guarantees that $S_w$ is nonsingular; the singular case can also be handled, for example by regularizing or diagonalizing $S_w$, but we will not discuss it here and simply assume $S_w$ is nonsingular), the column vectors of the optimal projection matrix $W_{\mathrm{opt}}$ are exactly the eigenvectors corresponding to the $d$ largest eigenvalues of the generalized eigenvalue equation
$$S_b \phi_i = \lambda_i S_w \phi_i \qquad (7)$$
and the number of optimal projection axes satisfies $d \le c - 1$.
From formula (7) we obtain:
$$S_w^{-1} S_b \phi_i = \lambda_i \phi_i \qquad (8)$$
Moreover, because the rank of $S_b$ is at most $c - 1$, the matrix $S_w^{-1} S_b$ has at most $c - 1$ nonzero eigenvalues, which is why $d \le c - 1$.
As a verification, substituting (7) into (6) for a single projection direction $\phi_i$ gives
$$J(\phi_i) = \frac{\phi_i^T S_b \phi_i}{\phi_i^T S_w \phi_i} = \frac{\lambda_i \, \phi_i^T S_w \phi_i}{\phi_i^T S_w \phi_i} = \lambda_i,$$
so the value of the criterion along $\phi_i$ is exactly the eigenvalue $\lambda_i$, and the directions with the largest eigenvalues maximize the criterion.
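The following MATLAB/Octave sketch shows one common way to realize formulas (6)-(8), assuming $S_b$, $S_w$ and the class count c are already available (for instance from the earlier sketch) and that $S_w$ is nonsingular; the variable names are ours.

```matlab
% Solve the generalized eigenvalue problem Sb*phi = lambda*Sw*phi, formula (7),
% assuming Sb, Sw and the class count c are available and Sw is nonsingular.
[V, D]        = eig(Sb, Sw);                 % same eigenpairs as inv(Sw)*Sb, formula (8)
[lambda, idx] = sort(diag(D), 'descend');    % order eigenvalues from largest to smallest
d             = c - 1;                       % at most c-1 nonzero eigenvalues (rank of Sb)
W_opt         = V(:, idx(1:d));              % columns = optimal projection axes, formula (6)

% Check of the verification step: the criterion value along each axis equals its eigenvalue.
for i = 1:d
    phi = W_opt(:, i);
    J   = (phi' * Sb * phi) / (phi' * Sw * phi);   % J should equal lambda(i)
end
```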
4. Physical Meaning and Further Discussion of the Algorithm
4.1 An example illustrating the spatial meaning of the LDA algorithm
Here we use LDA on a classification problem. Suppose a product has two parameters that measure whether it is qualified; the two parameters and the measured samples are listed below:
Parameter A | Parameter B | Qualified?
2.95 | 6.63 | Qualified
2.53 | 7.79 | Qualified
3.57 | 5.65 | Qualified
3.16 | 5.47 | Qualified
2.58 | 4.46 | Not qualified
2.16 | 6.22 | Not qualified
3.27 | 3.52 | Not qualified
Experimental data source: http://people.revoledu.com/kardi/tutorial/LDA/Numerical%20Example.html
According to the table we can divide the samples into two classes, one qualified and the other unqualified, so we create two data sets:
Cls1_data =
2.9500 6.6300
2.5300 7.7900
3.5700 5.6500
3.1600 5.4700
Cls2_data =
2.5800 4.4600
2.1600 6.2200
3.2700 3.5200
Here Cls1_data contains the qualified samples and Cls2_data the unqualified samples. Using formulas (1) and (2) we can compute the mean of the qualified class, the mean of the unqualified class, and the overall sample mean:
E_CLS1 =
3.0525 6.3850
E_CLS2 =
2.6700 4.7333
E_all =
2.8886 5.6771
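These means can be reproduced with a short MATLAB/Octave snippet (a sketch written here for illustration, using the sample matrices listed above):

```matlab
% Qualified (class 1) and unqualified (class 2) samples, one sample per row.
Cls1_data = [2.95 6.63; 2.53 7.79; 3.57 5.65; 3.16 5.47];
Cls2_data = [2.58 4.46; 2.16 6.22; 3.27 3.52];

E_CLS1 = mean(Cls1_data, 1)               % formula (1): mean of the qualified class
E_CLS2 = mean(Cls2_data, 1)               % formula (1): mean of the unqualified class
E_all  = mean([Cls1_data; Cls2_data], 1)  % formula (2): overall sample mean
```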
We can now plot the position of each sample point:
Figure 1
Here the blue '*' dots represent the unqualified samples and the red solid dots the qualified samples; the blue inverted triangle marks the overall mean, the blue triangle the mean of the unqualified samples, and the red triangle the mean of the qualified samples. Looking along either the x-axis or the y-axis alone, the qualified and unqualified samples are poorly separated.
Based on formulas (3) and (4), we can compute the between-class scatter matrix and the within-class scatter matrix:
Sb =
0.0358 0.1547
0.1547 0.6681
Sw =
0.5909 -1.3338
-1.3338 3.5596
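Continuing the snippet above, a hedged sketch of how $S_b$ and $S_w$ can be computed for this example is given below (the intermediate variable names are ours). With the prior-weighted form, Sb reproduces the matrix printed above; the Sw printed above appears to follow a different normalization convention, but such constant factors only rescale the eigenvalues and leave the projection direction essentially unchanged.

```matlab
% Continue from the previous snippet.
n1 = size(Cls1_data, 1);  n2 = size(Cls2_data, 1);  m = n1 + n2;

% Between-class scatter, prior-weighted form of formula (3).
d1 = (E_CLS1 - E_all)';  d2 = (E_CLS2 - E_all)';
Sb = (n1/m) * (d1 * d1') + (n2/m) * (d2 * d2');

% Within-class scatter, prior-weighted form of formula (4).
D1 = Cls1_data - repmat(E_CLS1, n1, 1);   % deviations from the class means
D2 = Cls2_data - repmat(E_CLS2, n2, 1);
Sw = (D1' * D1 + D2' * D2) / m;
```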
According to formulas (7) and (8), we can compute the eigenvalues and the corresponding eigenvectors:
L =
0.0000 0
0 2.8837
The eigenvalues lie on the diagonal; the first eigenvalue is so small that it is rounded to approximately 0 by the computer.
The corresponding eigenvectors are:
V =
-0.9742 -0.9230
0.2256 -0.3848
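Continuing the sketch, these eigenpairs and the projected values below can be reproduced by solving the generalized eigenproblem of formula (7) and multiplying the samples by the dominant eigenvector; the eigenvector's sign, and the eigenvalue's scale under the Sw normalization noted above, may differ from the printed output.

```matlab
% Continue: solve Sb*phi = lambda*Sw*phi (formula (7)) and project onto the best direction.
[V, L] = eig(Sb, Sw);            % columns of V are eigenvectors, diag(L) the eigenvalues
[~, k] = max(diag(L));           % index of the largest eigenvalue
w = V(:, k) / norm(V(:, k));     % unit-length direction; with c = 2 classes, d = c - 1 = 1

New_cls1_data = Cls1_data * w    % projected qualified samples (2-D -> 1-D)
New_cls2_data = Cls2_data * w    % projected unqualified samples
```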
Taking the eigenvector corresponding to the largest eigenvalue, (-0.9230, -0.3848), this vector spans the subspace we require. We can project the original samples onto this vector to obtain the new space (a projection from 2 dimensions to 1 dimension, so each sample becomes a single number):
New_cls1_data =
-5.2741
-5.3328
-5.4693
-5.0216
These are the projected values of the qualified samples.
New_cls2_data =
-4.0976
-4.3872
-4.3727
These are the projected values of the unqualified samples. We can see that after projection the classification effect is obvious: the samples within each class cluster tightly while the two classes are clearly separated. Let us plot again to view the classification effect more intuitively.
Figure 2
The blue line is the eigenvector with the smaller eigenvalue and the light-blue line is the eigenvector with the larger eigenvalue; the dots show where the unqualified samples fall after projection onto the eigenvector, and the red '*' marks show the qualified samples after projection. From this we can see that the classification effect is good (although, because of the units of the x and y axes, the projection in the plot is not entirely intuitive).
We then use the resulting eigenvector to judge another sample and see which class it belongs to. Take the sample point (2.81, 5.46): projecting it onto the eigenvector gives result = -4.6947, so it should belong to the unqualified class.
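This decision can be written as a nearest projected-class-mean rule; the rule is our reading of the example rather than something stated explicitly in the text (a continuation of the earlier sketch):

```matlab
% Continue: project the new point and compare it with the projected class means.
x_new  = [2.81 5.46];
result = x_new * w                               % about -4.6947 (sign depends on w)
if abs(result - E_CLS1 * w) < abs(result - E_CLS2 * w)
    disp('qualified')
else
    disp('not qualified')                        % the conclusion reached in the text
end
```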
4.2 LDA algorithm and PCA algorithm
Building on the traditional eigenface method, researchers noticed that the eigenvectors (i.e., eigenfaces) that are best for reconstruction are not necessarily the best for classification: the K-L transform cannot distinguish variation caused by external factors from variation of the face itself, and the extracted features largely reflect differences in illumination. Research shows that with the eigenface method the recognition rate drops sharply when illumination, viewing angle and face size change, so applying the eigenface method to face recognition has a theoretical flaw. The feature vector set extracted by linear discriminant analysis emphasizes the differences between different faces rather than changes in facial expression, illumination conditions and so on, and thus helps to improve the recognition performance.