Mathematics in Machine Learning (4) - Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA)
Copyright Notice:
This article is published by LeftNotEasy at http://leftnoteasy.cnblogs.com. It may be reproduced in whole or in part, but please cite the source. If there is any problem, please contact [email protected]
Objective:
As mentioned in the second article, while out on an outing with Yining from my department, he gave me quite a lot of advice on machine learning, touching on the meaning behind many algorithms, learning methods, and so on. Yining mentioned to me last time that if you want to learn classification algorithms, it is best to start with linear ones. The simplest linear classifier is LDA, which can be seen as a simplified version of SVM; so if you want to understand the SVM classifier, it is worth understanding LDA first.
When it comes to LDA, we also have to talk about PCA, a closely related algorithm: from the derivation and the solution to the final form of the result, the two are quite similar.
The main content this time is mathematical derivation: starting from the physical meaning of each algorithm and working toward the final formula. Both LDA and PCA ultimately reduce to solving a matrix eigenvalue problem, but only by understanding how the result is derived can you appreciate its meaning more deeply. This article assumes some basic linear algebra, such as the concepts of eigenvalues and eigenvectors, projection onto a subspace, and the dot product. Apart from the formulas themselves, I try to keep the explanation as simple and clear as possible.
LDA:
The full name of LDA is Linear Discriminant Analysis, and it is a supervised learning method. Some sources also call it Fisher's Linear Discriminant, because it was invented by Ronald Fisher in 1936. My personal understanding of the word "discriminant" is that the model predicts directly, without going through probabilities the way the various Bayesian methods do, which require estimating prior and posterior probabilities. LDA is a classic and popular algorithm in machine learning and data mining; as far as I know, Baidu's commercial search department uses it heavily.
The principle of LDA is to project labeled data (points) into a lower-dimensional space, so that the projected points cluster by category: points of the same category end up closer together in the projected space. To understand LDA you first have to understand the linear classifier (Linear Classifier), because LDA is itself a linear classifier. For a K-class classification problem, there are K linear functions:

$$y_k(x) = w_k^T x + w_{k0}, \quad k = 1, \dots, K$$

When the condition $y_k > y_j$ holds for all $j \neq k$, we say that x belongs to category k. In other words, each class has a formula that computes a score for x; among all the scores computed by the K formulas, the largest one determines the class x belongs to.
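As a small illustration of this scoring rule, here is a minimal sketch with made-up weights; `predict`, `W`, and `b` are my own placeholder names, not anything from the article:

```python
import numpy as np

# Score an input x against K linear functions y_k(x) = w_k^T x + b_k and
# predict the class whose function gives the largest score.
def predict(W, b, x):
    """W: (K, D) weight matrix, b: (K,) bias vector, x: (D,) input point."""
    scores = W @ x + b              # y_k(x) for every class k
    return int(np.argmax(scores))   # the largest score wins

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))         # 3 classes, 2-dimensional inputs
b = rng.normal(size=3)
x = rng.normal(size=2)
print("predicted class:", predict(W, b, x))
```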
What the formula above describes is in fact a projection: a high-dimensional point is projected onto a line in the high-dimensional space. What LDA aims for, roughly, is this: given a data set with category labels, project it onto a line so that the projected points separate by category as well as possible. When K = 2, this is the binary classification problem shown in the figure below:

The red square points are the original points of class 0 and the blue square points are the original points of class 1; the line through the origin is the projection line. From the figure you can see clearly that the red points and the blue points are cleanly separated once projected onto the line. This data was just sketched casually; in the high-dimensional case the separation would look even better. Let me now derive the formula for the two-class LDA problem:
Suppose the line (projection function) used to distinguish the two classes is:

$$y = w^T x$$
One goal of LDA is that, after projection, the distance between different categories should be as large as possible, and points within the same category should be as close together as possible. To express this we need to define a few key quantities.
The original center point of category i is ($D_i$ denotes the set of points belonging to category i, and $n_i$ its size):

$$m_i = \frac{1}{n_i} \sum_{x \in D_i} x$$

The center point of category i after projection is:

$$\tilde{m}_i = w^T m_i$$

The degree of scatter (variance) of the points of category i after projection is:

$$\tilde{s}_i^2 = \sum_{x \in D_i} (w^T x - \tilde{m}_i)^2$$

Finally, we can write the loss function that LDA optimizes over the projection w:

$$J(w) = \frac{|\tilde{m}_1 - \tilde{m}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$
Our goal is to make points within a category as close together as possible (concentrated), and points of different categories as far apart as possible. The denominator is the sum of the within-class variances: the larger it is, the more scattered the points within each class. The numerator is the squared distance between the two class centers after projection. So we maximize J(w) to find the optimal w. We would like to use the Lagrange multiplier method, but in the current form of J(w), w cannot be isolated, so we first need a way to make w appear explicitly.
We define a matrix describing the scatter of each class before projection:

$$S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^T$$

This matrix looks a little intimidating, but its meaning is simple: the closer the points of the input set $D_i$ are to the class center $m_i$, the smaller the entries of $S_i$; if the points of the class cluster tightly around $m_i$, the entries of $S_i$ are closer to 0.
Substituting $S_i$ into the denominator of J(w):

$$\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_1 w + w^T S_2 w = w^T S_w w, \quad \text{where } S_w = S_1 + S_2$$

Similarly, the numerator of J(w) can be converted into:

$$|\tilde{m}_1 - \tilde{m}_2|^2 = w^T (m_1 - m_2)(m_1 - m_2)^T w = w^T S_B w, \quad \text{where } S_B = (m_1 - m_2)(m_1 - m_2)^T$$

The loss function can then be written in the following form:

$$J(w) = \frac{w^T S_B w}{w^T S_w w}$$
Now we can use our favorite Lagrange multiplier method, but there is a problem: if the numerator and the denominator can both take arbitrary values, there are infinitely many solutions. We therefore constrain the denominator to equal 1 (this is a very important technique when using the Lagrange multiplier method; it will come up again in the PCA derivation below, so please review your calculus if you have forgotten it), take that as the constraint of the Lagrange multiplier method, and obtain:

$$c(w) = w^T S_B w - \lambda \left( w^T S_w w - 1 \right), \qquad \frac{dc}{dw} = 2 S_B w - 2 \lambda S_w w = 0 \;\Rightarrow\; S_B w = \lambda S_w w$$

Such an expression is an eigenvalue problem: multiplying both sides by $S_w^{-1}$ (assuming it is invertible) gives $S_w^{-1} S_B w = \lambda w$, so w is an eigenvector of $S_w^{-1} S_B$.
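As a minimal sketch of the two-class derivation above (the function and variable names are my own, and the data is made up for illustration): since $S_B w = (m_1 - m_2)(m_1 - m_2)^T w$ is always a multiple of $m_1 - m_2$, the two-class solution is simply $w \propto S_w^{-1}(m_1 - m_2)$, which a few lines of numpy can compute:

```python
import numpy as np

def fisher_lda_direction(X1, X2):
    """X1, X2: arrays of shape (n_i, D) holding the points of each class."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter S_w = S_1 + S_2
    S1 = (X1 - m1).T @ (X1 - m1)
    S2 = (X2 - m2).T @ (X2 - m2)
    Sw = S1 + S2
    w = np.linalg.solve(Sw, m1 - m2)    # w proportional to S_w^{-1}(m1 - m2)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
X2 = rng.normal(loc=[2.0, 1.0], scale=0.5, size=(50, 2))
w = fisher_lda_direction(X1, X2)
print("projection direction:", w)
print("projected class centers:", (X1 @ w).mean(), (X2 @ w).mean())
```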
For the N-class (N > 2) problem, I will just write down the conclusions directly. Define the within-class and between-class scatter matrices as

$$S_w = \sum_{i=1}^{N} S_i, \qquad S_B = \sum_{i=1}^{N} n_i (m_i - m)(m_i - m)^T,$$

where m is the center of all the data. The projection directions then satisfy

$$S_B w_i = \lambda_i S_w w_i$$

This is again an eigenvalue problem: the eigenvectors corresponding to the largest eigenvalues are the desired projection directions $w_i$.
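For the N-class case, one reasonable way to solve $S_B w_i = \lambda_i S_w w_i$ is with a generalized symmetric eigensolver; the sketch below uses scipy.linalg.eigh for that (the function name `lda_directions` and the overall structure are my own, not from the article):

```python
import numpy as np
from scipy.linalg import eigh

def lda_directions(X, y, n_components):
    """X: (n, D) data matrix, y: (n,) integer class labels."""
    m = X.mean(axis=0)                      # overall center
    D = X.shape[1]
    Sw = np.zeros((D, D))                   # within-class scatter
    Sb = np.zeros((D, D))                   # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - m).reshape(-1, 1)
        Sb += len(Xc) * (diff @ diff.T)
    # Solve the generalized problem S_B w = lambda S_w w; eigh returns
    # eigenvalues in ascending order, so take the last columns.
    eigvals, eigvecs = eigh(Sb, Sw)
    return eigvecs[:, ::-1][:, :n_components]
```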
Here I want to say a bit more about eigenvalues. Eigenvalues have a wide range of applications in pure mathematics, quantum mechanics, solid mechanics, computer science and other fields. Eigenvalues represent essential properties of a matrix: when we take the n largest eigenvalues of a matrix, we can say we have extracted its main components (this is related to the PCA discussed later, but it is not exactly the same concept). In machine learning, eigenvalue computation shows up in many places, such as image recognition, PageRank, LDA, and the PCA that will be discussed below.
A widely used example in image recognition is eigenfaces (Eigenfaces). Extracting eigenfaces serves two purposes: first, to compress the data, since for an image only its most important parts need to be stored; second, to make the image easier for a program to process, since extracting the main features filters out a lot of noise. This is very much the same role that PCA plays, as discussed below.
Computing eigenvalues can be expensive: for a d × d matrix, computing all of them has time complexity O(d^3). There are methods that compute only the top M eigenvalues, such as the power method, whose time complexity is O(d^2 × M). In general, finding eigenvalues is a very time-consuming operation, and in a single-machine environment the scale one can handle is quite limited.
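The power method mentioned above is easy to sketch: repeatedly multiply a vector by the matrix and renormalize, and the vector converges to the eigenvector of the dominant eigenvalue. This assumes a symmetric matrix with a positive dominant eigenvalue, which is the case for the scatter and covariance matrices used here; the names and tolerance below are my own choices.

```python
import numpy as np

def power_method(A, num_iters=1000, tol=1e-10):
    """Return the dominant eigenvalue and eigenvector of a symmetric matrix A."""
    v = np.random.default_rng(0).normal(size=A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        w = A @ v
        w /= np.linalg.norm(w)              # renormalize each iteration
        if np.linalg.norm(w - v) < tol:     # converged
            break
        v = w
    return v @ A @ v, v                     # Rayleigh quotient, eigenvector

A = np.array([[2.0, 1.0], [1.0, 3.0]])
print(power_method(A))                      # largest eigenvalue is about 3.62
```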
PCA:
Principal Component Analysis (PCA) is very close in spirit to LDA. The difference is that the input data of LDA is labeled while the input data of PCA is unlabeled, so PCA is a kind of unsupervised learning. LDA usually exists as an independent algorithm: given training data, it produces a set of discriminant functions, which can then be used to predict new inputs. PCA is more like a preprocessing method: it reduces the dimensionality of the original data while making the variance of the reduced data as large as possible (equivalently, making the projection error as small as possible, as the derivation below will show).
Variance is an interesting thing. Sometimes we try to reduce variance (for example, the variance-bias trade-off when training a model), and sometimes we try to increase it. Maximizing variance is a bit like a belief (in Brother Qiang's words): it does not necessarily have a rigorous proof, but in practice, the PCA algorithm obtained by maximizing the projection variance does improve the quality of our algorithms.
Having said so much, deriving the formulas will help us understand. I will derive the same expression in two ways: first by maximizing the variance after projection, and then by minimizing the loss after projection (making the projection produce the smallest loss).
Maximum variance method:
Suppose, as before, that we want to project the points of a space onto a vector. First, write down the center point of the original space:

$$\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n$$

Assuming $u_1$ is the projection vector, the variance after projection is:

$$\frac{1}{N} \sum_{n=1}^{N} \left( u_1^T x_n - u_1^T \bar{x} \right)^2 = u_1^T S u_1, \quad \text{where } S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T$$

If you followed the LDA derivation above, this step should be easy to understand; if you have forgotten the linear algebra involved, you can go back and review it. To maximize the right-hand side we again use the Lagrange multiplier method, with the constraint that $u_1$ is a unit vector ($u_1^T u_1 = 1$):

$$u_1^T S u_1 + \lambda \left( 1 - u_1^T u_1 \right)$$

Taking the derivative with respect to $u_1$ and setting it to 0 gives:

$$S u_1 = \lambda u_1$$

This is a standard eigenvalue expression: λ is the eigenvalue and $u_1$ is the eigenvector. Since $u_1^T S u_1 = \lambda$ under the unit-length constraint, the left-hand side attains its maximum when λ is the largest eigenvalue $\lambda_1$, i.e., when $u_1$ is the corresponding eigenvector. If we want to project a D-dimensional data space into an M-dimensional space (M < D), the projection matrix built from the eigenvectors of the M largest eigenvalues is the one that maximizes the variance.
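Here is a minimal numpy sketch of the maximum-variance recipe just described, assuming the data sits in an (N, D) array X (`pca` is my own function name): center the data, build the covariance matrix S, and keep the eigenvectors of its M largest eigenvalues as the projection matrix.

```python
import numpy as np

def pca(X, M):
    """X: (N, D) data matrix; returns the projected data and the top-M basis."""
    Xc = X - X.mean(axis=0)                 # center the data
    S = Xc.T @ Xc / len(X)                  # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)    # eigenvalues in ascending order
    U = eigvecs[:, ::-1][:, :M]             # top-M eigenvectors as columns
    return Xc @ U, U                        # projected data, projection basis

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z, U = pca(X, M=2)
print(Z.shape, U.shape)                     # (100, 2) (5, 2)
```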
Minimum loss method:
Suppose the input data x are points in a D-dimensional space. We can then use D orthonormal D-dimensional vectors to represent this space completely (every vector in the space can be written as a linear combination of these D vectors). There are infinitely many ways to choose D such vectors in a D-dimensional space; which combination is the most suitable?
Suppose we have found these D vectors $u_1, \dots, u_D$; then every point can be written exactly as:

$$x_n = \sum_{i=1}^{D} \alpha_{ni} u_i, \qquad \alpha_{ni} = x_n^T u_i$$

We approximate each point after projection as:

$$\tilde{x}_n = \sum_{i=1}^{M} z_{ni} u_i + \sum_{i=M+1}^{D} b_i u_i$$

This means the approximated $\tilde{x}_n$ is a linear combination of the first M basis vectors plus a linear combination of the remaining D − M basis vectors. Note that the $z_{ni}$ here are different for each x, while the $b_i$ are the same for every x. In this way we can represent a point of the space with only M numbers, which is the dimensionality reduction. But the reduced data will inevitably be somewhat distorted; we use J to describe this distortion, and our goal is to make J as small as possible:

$$J = \frac{1}{N} \sum_{n=1}^{N} \left\| x_n - \tilde{x}_n \right\|^2$$
The meaning of this expression is very intuitive: for every point, take the squared distance between the reduced-dimension point and the original point, sum these over all points, and average; we want this average to be as small as possible. Minimizing J with respect to $z_{ni}$ and $b_i$ gives:

$$z_{nj} = x_n^T u_j \;(j = 1, \dots, M), \qquad b_j = \bar{x}^T u_j \;(j = M+1, \dots, D)$$

Substituting these z and b back into the reduced-dimension expression:

$$x_n - \tilde{x}_n = \sum_{i=M+1}^{D} \left[ (x_n - \bar{x})^T u_i \right] u_i$$

Substituting this into the expression for J, we obtain:

$$J = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=M+1}^{D} \left( x_n^T u_i - \bar{x}^T u_i \right)^2 = \sum_{i=M+1}^{D} u_i^T S u_i$$
Then, using the Lagrange multiplier method (skipping a few steps here), we can obtain the expression for the projection basis we want:

$$S u_i = \lambda_i u_i$$

This is yet another eigenvalue expression, and the first M vectors we want are in fact the eigenvectors corresponding to the M largest eigenvalues. To see this, note that J can be rewritten as:

$$J = \sum_{i=M+1}^{D} \lambda_i$$

That is, J attains its minimum when the error is made up of the D − M smallest eigenvalues, which is exactly the same conclusion as the maximum-variance formulation above.
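Here is a quick numerical check of that equivalence, with made-up data (my own illustration, not from the article): the average squared reconstruction error J should equal the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))   # correlated data
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(X)                      # covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)        # ascending eigenvalues

M = 2
U = eigvecs[:, ::-1][:, :M]                 # keep the top-M eigenvectors
X_rec = Xc @ U @ U.T                        # project down, then reconstruct
J = np.mean(np.sum((Xc - X_rec) ** 2, axis=1))

print(J, eigvals[:-M].sum())                # the two numbers should agree
```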
The figure shows a representation of the PCA projection: the black points are the original points, the dashed lines with arrows are the projection vectors, PC1 is the eigenvector with the largest eigenvalue, and PC2 is the eigenvector with the second largest eigenvalue; the two are orthogonal to each other. Because this is a 2-dimensional space, there are at most two projection vectors; if the dimension of the space were higher, there would be more projection vectors.
Summary:
This article mainly discussed two methods, PCA and LDA, which are very similar both in the underlying idea and in the calculation, but one exists as an independent algorithm while the other is more often used for data preprocessing. In addition, both PCA and LDA have kernelized versions; this article is already fairly long, so I will not cover them here and will talk about them later if I have time.
References:
PRML, Bishop; Introduction to LDA (sorry, I really could not find the source for this one)