Machine Learning - Feature Selection (Dimension Reduction): Linear Discriminant Analysis (LDA)


Feature selection (dimension reduction) is an important step in data preprocessing. For classification, it picks out the features most relevant to the class labels from a large set of features, removing noise from the original data. Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are two of the most common such algorithms. For more information about PCA, see my other blog post. Here we mainly introduce Linear Discriminant Analysis (LDA), based on two papers: Fisher Discriminant Analysis with Kernels [1] and Fisher Linear Discriminant Analysis [2].

One major difference between LDA and PCA is that LDA is a supervised algorithm while PCA is unsupervised: PCA does not consider the data labels (categories) and simply maps the original data onto the directions (bases) of largest variance, whereas LDA takes the labels into account. Document [2] gives a very vivid example of a case where PCA performs poorly, such as the following:

Imagine data from two classes, C1 and C2, marked with different colors. According to PCA, the data should be mapped onto the direction of largest variance, that is, the y-axis. However, if the data are projected onto the y-axis, the two classes become completely mixed and hard to separate, so reducing dimensionality with PCA before classification works very poorly here. The LDA algorithm, in contrast, maps the data onto the x-axis direction, which keeps the classes apart.
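
To make the example concrete, below is a small NumPy sketch (my own illustration, not code from [2]; the toy data, variances, and random seed are assumptions). It creates two classes whose largest-variance direction is the y-axis while their means differ along the x-axis; PCA picks a direction close to the y-axis, while the Fisher direction derived later in this post points along the x-axis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy classes: large spread along the y-axis, class means separated along x.
C1 = rng.normal(loc=[-2.0, 0.0], scale=[0.3, 5.0], size=(200, 2))
C2 = rng.normal(loc=[+2.0, 0.0], scale=[0.3, 5.0], size=(200, 2))
X = np.vstack([C1, C2])

# PCA direction: leading eigenvector of the overall covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
pca_dir = eigvecs[:, np.argmax(eigvals)]

# LDA (Fisher) direction: S_W^{-1} (M1 - M2), derived later in this post.
M1, M2 = C1.mean(axis=0), C2.mean(axis=0)
S_W = (C1 - M1).T @ (C1 - M1) + (C2 - M2).T @ (C2 - M2)
lda_dir = np.linalg.solve(S_W, M1 - M2)
lda_dir /= np.linalg.norm(lda_dir)

print("PCA direction:", pca_dir)   # close to (0, ±1): the y-axis, classes get mixed
print("LDA direction:", lda_dir)   # close to (±1, 0): the x-axis, classes stay apart
```

Projecting onto the PCA direction mixes the two classes, while projecting onto the LDA direction keeps them well separated.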

The LDA algorithm takes the class labels of the data into account. Given two classes C1 and C2, we want to find a vector ω such that, when the data are projected onto the ω direction, samples from the two classes are separated as much as possible while samples within the same class stay as compact as possible. The mapping formula is z = ω^T x, where z is the projection of x onto ω; this is also a dimension reduction from d dimensions to 1.
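
As a minimal sketch of the projection step (the matrix X and the direction w below are arbitrary values chosen only for illustration):

```python
import numpy as np

# Projecting d-dimensional points onto a direction w reduces them to 1 dimension:
# z = w^T x for each sample x (the rows of X).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))        # 5 samples, d = 3
w = np.array([1.0, -0.5, 2.0])     # an arbitrary projection direction

z = X @ w                          # shape (5,): one scalar per sample
print(z)
```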

Let M1 and m1 denote the mean of the C1 data before and after projection, respectively. Clearly m1 = ω^T M1, and similarly m2 = ω^T M2.

Let s1^2 and s2^2 denote the scatter of the C1 and C2 data after projection, respectively; that is, s1^2 = Σ_t (ω^T x^t - m1)^2 r^t and s2^2 = Σ_t (ω^T x^t - m2)^2 (1 - r^t), where r^t = 1 if x^t ∈ C1 and r^t = 0 otherwise.
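
These quantities are easy to compute directly. Here is a small sketch on toy data (the samples, the indicator vector r, and the candidate direction w are my own assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 2-class data: r[t] = 1 if sample x^t belongs to C1, else 0.
X = np.vstack([rng.normal([-2, 0], 1.0, (50, 2)),
               rng.normal([2, 0], 1.0, (50, 2))])
r = np.array([1] * 50 + [0] * 50)
w = np.array([1.0, 0.3])                 # a candidate projection direction

z = X @ w                                # projections w^T x^t
m1 = z[r == 1].mean()                    # m1 = w^T M1
m2 = z[r == 0].mean()                    # m2 = w^T M2
s1_sq = np.sum((z[r == 1] - m1) ** 2)    # scatter of C1 after projection
s2_sq = np.sum((z[r == 0] - m2) ** 2)    # scatter of C2 after projection
print(m1, m2, s1_sq, s2_sq)
```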

We want |m1 - m2| to be as large as possible while s1^2 + s2^2 is as small as possible. The Fisher Linear Discriminant maximizes the following objective over ω:

J(ω) = (m1 - m2)^2 / (s1^2 + s2^2)    (Equation 1)
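
Written as code, Equation 1 becomes a function we can evaluate for any candidate direction. In this sketch (the toy data and the two test directions are illustrative assumptions), the x-axis direction scores much higher than the y-axis direction:

```python
import numpy as np

def fisher_J(w, X, r):
    """Fisher criterion J(w) = (m1 - m2)^2 / (s1^2 + s2^2) for two classes."""
    z = X @ w
    m1, m2 = z[r == 1].mean(), z[r == 0].mean()
    s1_sq = np.sum((z[r == 1] - m1) ** 2)
    s2_sq = np.sum((z[r == 0] - m2) ** 2)
    return (m1 - m2) ** 2 / (s1_sq + s2_sq)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2, 0], [0.3, 5.0], (100, 2)),
               rng.normal([2, 0], [0.3, 5.0], (100, 2))])
r = np.array([1] * 100 + [0] * 100)

print(fisher_J(np.array([1.0, 0.0]), X, r))  # x-axis: classes separated, large J
print(fisher_J(np.array([0.0, 1.0]), X, r))  # y-axis: classes mixed, J near 0
```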

Rewrite the numerator of Equation 1: (m1 - m2)^2 = (ω^T M1 - ω^T M2)^2 = ω^T (M1 - M2)(M1 - M2)^T ω = ω^T S_B ω,

where S_B = (M1 - M2)(M1 - M2)^T    (Equation 2)

is the between-class scatter matrix.
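
A quick numerical check of this identity (the toy classes and the random direction w are assumptions made only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
C1 = rng.normal([-2, 0], 1.0, (50, 2))
C2 = rng.normal([2, 0], 1.0, (50, 2))

M1, M2 = C1.mean(axis=0), C2.mean(axis=0)
d = M1 - M2
S_B = np.outer(d, d)               # between-class scatter matrix, Equation 2

# For any w, w^T S_B w equals (m1 - m2)^2 with m_i = w^T M_i.
w = rng.normal(size=2)
print(w @ S_B @ w, (w @ M1 - w @ M2) ** 2)   # the two numbers match
```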

Similarly, rewrite the denominator of Equation 1:

Σ_t (ω^T x^t - m1)^2 r^t = Σ_t ω^T (x^t - M1)(x^t - M1)^T ω r^t = ω^T S1 ω, where S1 = Σ_t r^t (x^t - M1)(x^t - M1)^T is the within-class scatter matrix of C1.

Let S_W = S1 + S2, the total within-class scatter matrix. Then s1^2 + s2^2 = ω^T S_W ω.
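
Again, a short numerical check (the toy classes and the random direction w are assumed only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
C1 = rng.normal([-2, 0], 1.0, (50, 2))
C2 = rng.normal([2, 0], 1.0, (50, 2))

def scatter(C):
    """Within-class scatter S_i = sum_t (x^t - M_i)(x^t - M_i)^T."""
    D = C - C.mean(axis=0)
    return D.T @ D

S_W = scatter(C1) + scatter(C2)    # total within-class scatter

# Check: w^T S_W w equals s1^2 + s2^2 for any direction w.
w = rng.normal(size=2)
z1, z2 = C1 @ w, C2 @ w
s_sq = np.sum((z1 - z1.mean()) ** 2) + np.sum((z2 - z2.mean()) ** 2)
print(w @ S_W @ w, s_sq)           # the two numbers match
```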

Equation 1 can therefore be rewritten as:

J(ω) = (ω^T S_B ω) / (ω^T S_W ω)    (Equation 3)

To maximize Equation 3, we take its derivative with respect to ω and set it to zero, which gives ω = c S_W^{-1} (M1 - M2), where c is a scalar constant. Since we are only interested in the direction of ω, c can be set to 1.
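
A minimal sketch of this closed-form solution, assuming the same kind of elongated toy data as in the PCA example above:

```python
import numpy as np

rng = np.random.default_rng(0)
C1 = rng.normal([-2, 0], [0.3, 5.0], (200, 2))
C2 = rng.normal([2, 0], [0.3, 5.0], (200, 2))

M1, M2 = C1.mean(axis=0), C2.mean(axis=0)
S_W = (C1 - M1).T @ (C1 - M1) + (C2 - M2).T @ (C2 - M2)

# Closed-form Fisher direction (c = 1): w = S_W^{-1} (M1 - M2).
w = np.linalg.solve(S_W, M1 - M2)
w /= np.linalg.norm(w)             # only the direction matters
print(w)                           # close to (±1, 0) for this data
```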

In addition, the maximum value of J(ω) equals λ_max, the largest eigenvalue of S_W^{-1} S_B, and the optimal ω is the eigenvector of S_W^{-1} S_B corresponding to that largest eigenvalue.
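
This eigenvalue view can be checked numerically as well (toy data and variable names are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
C1 = rng.normal([-2, 0], 1.0, (200, 2))
C2 = rng.normal([2, 0], 1.0, (200, 2))

M1, M2 = C1.mean(axis=0), C2.mean(axis=0)
S_B = np.outer(M1 - M2, M1 - M2)
S_W = (C1 - M1).T @ (C1 - M1) + (C2 - M2).T @ (C2 - M2)

# Largest eigenvalue / eigenvector of S_W^{-1} S_B.
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
k = np.argmax(eigvals.real)
lam_max, v = eigvals.real[k], eigvecs[:, k].real

# J at the optimum equals lam_max, and v is parallel to S_W^{-1} (M1 - M2).
w = np.linalg.solve(S_W, M1 - M2)
J = (w @ S_B @ w) / (w @ S_W @ w)
print(lam_max, J)                                    # the two values agree
print(v / np.linalg.norm(v), w / np.linalg.norm(w))  # same direction up to sign
```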

Finally, here are some remarks on the LDA algorithm, drawn from [1]:

1. Fisher LDA makes some strong assumptions about the data distribution, for example that the data of each class follow a Gaussian distribution and that all classes share the same covariance. Although these strong assumptions may not hold for real data, Fisher LDA has proven to be a very effective dimension reduction algorithm, because linear models are robust to noise and not prone to overfitting. The drawback is that such a simple model has limited expressive power. To enhance the expressive power of Fisher LDA, kernel functions can be introduced; see my other blog post on the Kernel Fisher LDA algorithm.

2. Accurate estimation of the scatter matrices is very important, yet the estimates can deviate considerably. Estimating them as in Equation 2 leads to large variability when the number of samples is small relative to the dimension.

 

References:

[1] Fisher Discriminant Analysis with Kernels. Sebastian Mika, Gunnar Rätsch, Jason Weston, Bernhard Schölkopf, Klaus-Robert Müller.

[2] Fisher Linear Discriminant Analysis. Max Welling.

[3] Introduction to Machine Learning. Ethem Alpaydin.
