Copyright:
This article was written by leftnoteasy and published at http://leftnoteasy.cnblogs.com. It may be reproduced in whole or in part, but please credit the source. If there is any problem, please contact wheeleast@gmail.com.
Preface:
When our department went on an outing, the boss gave me a lot of advice about machine learning on the way, covering many algorithms and ways of studying them. Yi Ning also told me last time that when learning classification algorithms it is best to start with the linear ones. The simplest linear classifier is LDA, which can be seen as a simplified SVM; if we want to understand the SVM classifier, it helps to understand LDA first.
Speaking of LDA, we also have to talk about PCA, an algorithm that is closely related to LDA: the derivation, the solution, and the final result of the two are quite similar.
The main content of this article is the derivation of the mathematical formulas: starting from the physical meaning of each algorithm and deriving its formula step by step, the final expressions of both LDA and PCA turn out to be matrix eigenvalue problems. Only by understanding how they are derived, however, can we appreciate their meaning more deeply. This article assumes some basic linear algebra, such as the concepts of eigenvalues, eigenvectors, projections, and dot products; for the other formulas I will try to keep things as simple and clear as possible.
LDA:
The full name of LDA is Linear Discriminant Analysis; it is a supervised learning method. Some materials also call it Fisher's Linear Discriminant, because it was invented by Ronald Fisher in 1936. In my personal understanding, the word "discriminant" means that the model does not need probabilistic assumptions to train or to predict, unlike, say, Bayesian methods, which require the prior and posterior probabilities of the data. LDA is a classic and popular algorithm in today's machine learning and data mining; as far as I know, Baidu's commercial search department uses quite a few algorithms of this kind.
The principle of LDA is to project the labeled data (points) onto a lower-dimensional space so that the projected points are separated by class: points of the same class end up clustered together and closer to one another in the projected space. To understand LDA we first need to understand linear classifiers, because LDA is a linear classifier. For a K-class classification problem there are K linear discriminant functions, one per class.
When y_k(x) > y_j(x) for every j other than k, we say that x belongs to class k. In other words, each class has its own formula that computes a score for x, and x is assigned to the class whose score is the largest. The functions and the decision rule are written out below.
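In standard notation (the symbols are my reconstruction; each class k has its own weight vector w_k):

$$ y_k(x) = w_k^T x, \qquad k = 1, \dots, K $$

$$ x \text{ belongs to class } k \iff y_k(x) > y_j(x) \ \text{ for all } j \neq k $$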
The formula above is in fact a projection: it projects a high-dimensional point onto a straight line (a one-dimensional space). The goal of LDA is that, given a dataset with class labels, the points can be distinguished by class as well as possible after being projected onto the line. The case K = 2 is the binary classification problem shown in the figure.
In the figure, the red square points are the original points of class 0, the blue square points are the original points of class 1, and the line through the origin is the projection line. The figure shows clearly that the red and blue points are well separated after being projected onto this line. This data was just drawn at random; in higher dimensions the separation would look even better. Next I will derive the formulas for the two-class LDA problem.
Assume that the projection function used to separate the two classes is the linear function written below.
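In the same notation:

$$ y = w^T x $$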
One goal of LDA is that, after projection, the distance between different classes should be as large as possible and the distance within the same class as small as possible. To express this we need to define several key quantities.
First, the center (mean) of the original points of class i, where D_i denotes the set of points belonging to class i. Second, the center of class i after it is projected onto w. Third, a measure of the scatter (variance) of the class-i points after projection. Combining these finally gives the objective function J(w) of the LDA projection w. All four formulas are written out below.
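In the standard Fisher LDA notation (D_i is the set of points of class i, n_i its size, and w the projection direction; the symbols are my reconstruction):

$$ m_i = \frac{1}{n_i} \sum_{x \in D_i} x $$

$$ \tilde{m}_i = w^T m_i $$

$$ \tilde{s}_i^2 = \sum_{x \in D_i} \left( w^T x - \tilde{m}_i \right)^2 $$

$$ J(w) = \frac{\left| \tilde{m}_1 - \tilde{m}_2 \right|^2}{\tilde{s}_1^2 + \tilde{s}_2^2} $$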
Our goal is that points within the same class end up as close together as possible (concentrated) and points from different classes end up as far apart as possible. The denominator of J(w) is the sum of the within-class variances after projection: the larger it is, the more scattered the points within each class. The numerator is the squared distance between the two projected class centers. Maximizing J(w) therefore gives us the optimal w. To find that optimum we would like to use the method of Lagrange multipliers, but in J(w) the vector w cannot yet be factored out, so we first need a way to isolate w.
To do this, we define for each class a scatter matrix S_i that measures how spread out the class is before projection. The matrix looks a little intimidating, but its meaning is simple: the closer the points of D_i are to the class center m_i, the smaller the entries of S_i; if the class points cluster tightly around m_i, the entries of S_i approach 0.
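In the same notation, the scatter matrix of class i before projection is:

$$ S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^T $$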
Substituting S_i, the denominator of J(w) can be split into a quadratic form in w.
Similarly, the numerator of J(w) can be converted into a quadratic form in w.
The objective function can then be written in the compact form shown below.
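Spelled out (S_W and S_B are the usual names for the within-class and between-class scatter matrices):

$$ \tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_W w, \qquad S_W = S_1 + S_2 $$

$$ \left| \tilde{m}_1 - \tilde{m}_2 \right|^2 = w^T S_B w, \qquad S_B = (m_1 - m_2)(m_1 - m_2)^T $$

$$ J(w) = \frac{w^T S_B w}{w^T S_W w} $$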
In this form we can finally use our favorite Lagrange multipliers, but there is one remaining problem: if the numerator and the denominator can each take arbitrary values, there are infinitely many solutions. We therefore fix the denominator to 1 (a very important trick when using Lagrange multipliers; it will be used again in the PCA derivation below, so review your calculus if you have forgotten it) and use it as the constraint of the Lagrange multiplier method.
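With the constraint w^T S_W w = 1, the Lagrangian and its stationary condition are (following the standard derivation):

$$ c(w) = w^T S_B w - \lambda \left( w^T S_W w - 1 \right) $$

$$ \frac{\partial c}{\partial w} = 2 S_B w - 2 \lambda S_W w = 0 \;\Longrightarrow\; S_W^{-1} S_B \, w = \lambda \, w $$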
The result is an eigenvalue problem: the optimal w is the eigenvector of S_W^{-1} S_B associated with its largest eigenvalue.
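For two classes the eigenproblem even collapses to a closed form, because S_B w always points along m_1 - m_2, so w is proportional to S_W^{-1}(m_1 - m_2). Here is a minimal sketch assuming numpy; fisher_lda_direction is just an illustrative helper name, not a library function:

```python
import numpy as np

def fisher_lda_direction(X0, X1):
    """Two-class Fisher LDA: return the projection direction w.

    X0, X1: arrays of shape (n0, d) and (n1, d) with the points of
    class 0 and class 1.  Uses the closed form w ~ S_W^{-1} (m0 - m1).
    """
    m0 = X0.mean(axis=0)                   # class means (the m_i above)
    m1 = X1.mean(axis=0)
    S0 = (X0 - m0).T @ (X0 - m0)           # per-class scatter matrices S_i
    S1 = (X1 - m1).T @ (X1 - m1)
    Sw = S0 + S1                           # within-class scatter S_W
    w = np.linalg.solve(Sw, m0 - m1)       # S_W^{-1} (m0 - m1)
    return w / np.linalg.norm(w)           # scale is irrelevant, so normalize

# toy usage: two Gaussian blobs; their projections onto w separate well
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0, 0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[3, 3], scale=1.0, size=(100, 2))
w = fisher_lda_direction(X0, X1)
print((X0 @ w).mean(), (X1 @ w).mean())   # clearly different projected means
```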
For n-class problems (n > 2), I will only write down the conclusion below without the derivation.
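In the usual multi-class formulation (m is the overall mean of the data; the generalized scatter matrices below are the standard ones, written in my notation), the projection directions w_i satisfy:

$$ S_W = \sum_{i=1}^{n} S_i, \qquad S_B = \sum_{i=1}^{n} n_i \, (m_i - m)(m_i - m)^T $$

$$ S_W^{-1} S_B \, w_i = \lambda_i \, w_i $$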
This is again an eigenvalue problem: the eigenvectors corresponding to the largest eigenvalues are the projection directions w_i we are looking for.
Here I would like to say a little more about eigenvalues. Eigenvalues are used widely in pure mathematics, quantum mechanics, solid mechanics, computer science, and elsewhere. Eigenvalues describe the properties of a matrix: when we take the eigenvectors of the n largest eigenvalues of a matrix, we can say that we have extracted its principal components (this is related to PCA, but it is not the same concept). In machine learning, eigenvalue computations appear in many places, for example in image recognition, PageRank, LDA, and the PCA discussed below.
A well-known example is the eigenface used in face recognition. Extracting eigenfaces serves two purposes: first, it compresses the data, since for an image we only need to keep the most important components, which makes the data easier for a program to process; second, by keeping only the main features we filter out a lot of noise. This is closely related to what PCA does.
There are many methods for computing eigenvalues. Finding all eigenvalues of a d x d matrix takes O(d^3) time. There are also methods that compute only the top m eigenvalues and eigenvectors, such as the power method, whose time complexity is roughly O(d^2 * m). In general, computing eigenvalues is a very expensive operation, and in a single-machine environment the problem sizes it can handle are quite limited.
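As an illustration of the power method mentioned above, here is a minimal numpy sketch for a symmetric positive semi-definite matrix such as a scatter or covariance matrix; power_iteration is my own helper name, and each iteration costs one O(d^2) matrix-vector product:

```python
import numpy as np

def power_iteration(A, num_iters=1000, tol=1e-10):
    """Approximate the dominant eigenvalue/eigenvector of a symmetric matrix A."""
    d = A.shape[0]
    v = np.random.default_rng(0).normal(size=d)   # random starting vector
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        w = A @ v                                 # one O(d^2) matrix-vector product
        w_norm = np.linalg.norm(w)
        if w_norm == 0:
            break
        w /= w_norm
        if np.linalg.norm(w - v) < tol:           # converged
            v = w
            break
        v = w
    eigenvalue = v @ A @ v                        # Rayleigh quotient (v has unit norm)
    return eigenvalue, v

# usage: dominant eigenpair of a small covariance-like matrix
A = np.array([[4.0, 1.0], [1.0, 3.0]])
lam, vec = power_iteration(A)
print(lam, vec)   # close to the largest pair returned by np.linalg.eigh(A)
```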
PCA:
Principal Component Analysis (PCA) is very similar to LDA. The input data of LDA is labeled, while the input data of PCA is unlabeled, so PCA is an unsupervised learning method. LDA is usually used as a standalone algorithm: given training data it produces a set of discriminant functions, which can then be used to predict new inputs. PCA is more like a preprocessing step: it reduces the dimensionality of the data while keeping the variance of the projected data as large as possible (equivalently, making the projection error as small as possible, as the second derivation below shows).
Variance is an interesting thing. Sometimes we try to reduce variance (for example, the bias-variance trade-off when training a model), and sometimes we try to increase it. Variance is a bit like a belief (as a colleague of mine puts it): the choice cannot always be justified rigorously, but in practice, by making the projection variance as large as possible, PCA really does improve the quality of our algorithms.
Enough talk; working through the derivation helps us understand it. I will derive the same expression in two ways: the first maximizes the variance after projection, the second minimizes the loss caused by the projection (the projection error).
Maximum variance method:
As before, suppose we project the points of a space onto a single vector. First we need the center (mean) of the original data.
Assume u_1 is the (unit-length) projection vector; the variance of the data after projection onto u_1 can then be written down. Both formulas are given below.
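In formulas (N points x_n; S is the covariance matrix of the data; standard PCA notation):

$$ \bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n $$

$$ \frac{1}{N} \sum_{n=1}^{N} \left( u_1^T x_n - u_1^T \bar{x} \right)^2 = u_1^T S u_1, \qquad S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T $$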
The formula above should be easy to follow if you understood the LDA derivation (if the linear algebra has faded, review it). We now optimize the right-hand side of the equation, again using a Lagrange multiplier, with the constraint that u_1 has unit length.
Taking the derivative of the Lagrangian and setting it to 0 gives the condition below.
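With the unit-length constraint u_1^T u_1 = 1, the Lagrangian and the resulting condition are:

$$ L(u_1) = u_1^T S u_1 + \lambda_1 \left( 1 - u_1^T u_1 \right) $$

$$ \frac{\partial L}{\partial u_1} = 2 S u_1 - 2 \lambda_1 u_1 = 0 \;\Longrightarrow\; S u_1 = \lambda_1 u_1 $$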
This is a standard eigenvalue expression: λ_1 is an eigenvalue of S and u_1 is the corresponding eigenvector. The projected variance u_1^T S u_1 = λ_1 attains its maximum when λ_1 is the largest eigenvalue. If we want to project a d-dimensional data space down to an m-dimensional space (m < d), the projection matrix formed from the eigenvectors of the m largest eigenvalues maximizes the variance.
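Putting the maximum-variance result into code: a minimal numpy sketch that centers the data, builds the covariance matrix S, and projects onto the eigenvectors of the m largest eigenvalues. The helper name pca_project is mine, not a library function:

```python
import numpy as np

def pca_project(X, m):
    """Project X (N x d) onto its top-m principal axes via the covariance matrix."""
    x_bar = X.mean(axis=0)                 # data mean \bar{x}
    Xc = X - x_bar                         # centered data
    S = (Xc.T @ Xc) / X.shape[0]           # covariance matrix S
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh: ascending eigenvalues, symmetric S
    order = np.argsort(eigvals)[::-1]      # sort descending
    U = eigvecs[:, order[:m]]              # top-m eigenvectors (d x m)
    Z = Xc @ U                             # projected, lower-dimensional data (N x m)
    return Z, U, eigvals[order]

# usage: reduce 5-dimensional toy data to 2 dimensions
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Z, U, lam = pca_project(X, 2)
print(Z.shape, lam[:2])   # (200, 2) and the two largest eigenvalues
```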
Minimize loss method:
Assume x is a point in a d-dimensional space. We can represent this space completely with d orthonormal d-dimensional basis vectors (every vector in the space can be obtained as a linear combination of these d vectors). In a d-dimensional space there are infinitely many ways to choose d orthonormal vectors; which choice is the most suitable?
Suppose we have found these d basis vectors; then every point can be written exactly as a linear combination of them.
After dimensionality reduction we represent each point only approximately, as shown below.
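In formulas (the coefficients α_{ni}, z_{ni}, and b_i follow the standard PRML-style derivation; the exact representation uses the orthonormality of the u_i):

$$ x_n = \sum_{i=1}^{d} \alpha_{ni} u_i, \qquad \alpha_{ni} = x_n^T u_i $$

$$ \tilde{x}_n = \sum_{i=1}^{M} z_{ni} u_i + \sum_{i=M+1}^{d} b_i u_i $$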
The approximation says that the new x is a linear combination of the first M basis vectors plus the remaining d - M basis vectors. Note that the coefficients z are different for each x, while the coefficients b are shared by all x; this is what allows us to describe a point with only M numbers, i.e. to reduce the dimensionality of the data. Of course, the reduced data inevitably suffers some distortion, which we measure with J; our goal is to minimize J.
The formula for J is intuitive: for every point, take the squared distance between the reduced point and the original point, and average over all points; this average is what we minimize. Setting the derivatives with respect to z and b to zero gives the expressions below.
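The distortion J and the coefficients that minimize it are:

$$ J = \frac{1}{N} \sum_{n=1}^{N} \left\| x_n - \tilde{x}_n \right\|^2 $$

$$ z_{nj} = x_n^T u_j \ \ (j = 1, \dots, M), \qquad b_j = \bar{x}^T u_j \ \ (j = M+1, \dots, d) $$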
Substituting the resulting z and b back into the reduced representation gives the difference between each point and its approximation.
Plugging this difference into J yields the expression below.
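Substituting back, the error of each point lies entirely in the discarded directions, and J becomes a sum of quadratic forms in the covariance matrix S:

$$ x_n - \tilde{x}_n = \sum_{i=M+1}^{d} \left( (x_n - \bar{x})^T u_i \right) u_i $$

$$ J = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=M+1}^{d} \left( x_n^T u_i - \bar{x}^T u_i \right)^2 = \sum_{i=M+1}^{d} u_i^T S u_i $$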
Using Lagrange multipliers once more (the steps are omitted here), we obtain the expression for the projection basis that we want.
This is again an eigenvalue expression: the first M vectors we want are exactly the eigenvectors corresponding to the M largest eigenvalues. To see why, note that J itself can be rewritten in terms of the eigenvalues, as below.
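The resulting condition and the value of J at the optimum are:

$$ S u_i = \lambda_i u_i $$

$$ J = \sum_{i=M+1}^{d} \lambda_i $$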
In other words, J is the sum of the discarded eigenvalues, so J attains its minimum when it is made up of the d - M smallest eigenvalues. This agrees with the conclusion above.
The figure illustrates a PCA projection: the black points are the original points, the dashed lines with arrows are the projection vectors, PC1 is the eigenvector with the largest eigenvalue, and PC2 is the eigenvector with the second-largest eigenvalue; the two are orthogonal to each other. Because the original space is two-dimensional, there are at most two projection vectors; if the space had more dimensions, there would be more projection vectors.
Summary:
This article mainly described two methods, PCA and LDA. Their ideas and calculations are very similar, but one is used as a standalone algorithm while the other serves more as a data preprocessing step. Both PCA and LDA also have kernel versions, which I will not cover this time but may write about later.
References:
PRML, C. M. Bishop; Introduction to LDA (sorry, I have not been able to find the original source).