Four machine learning dimensionality reduction algorithms: PCA, LDA, LLE, Laplacian eigenmaps
In machine learning, dimensionality reduction refers to mapping data points from the original high-dimensional space into a low-dimensional space. The essence of dimensionality reduction is to learn a mapping function f: x -> y, where x is the representation of the original data point, most commonly a vector, and y is the low-dimensional representation that the data point is mapped to. The dimension of y is usually smaller than that of x (although increasing the dimension is also possible). f may be explicit or implicit, linear or nonlinear.
At present, most dimensionality reduction algorithms operate on data expressed as vectors, while some work on data expressed as higher-order tensors. The reason for using reduced-dimension data is that the original high-dimensional space contains redundant information and noise, which cause errors in practical applications such as recognition and lower the accuracy. Through dimensionality reduction we hope to reduce the errors caused by redundant information and improve the accuracy of recognition (or of other applications). We may also hope to discover the intrinsic structural features of the data through a dimensionality reduction algorithm.
In many algorithms, dimensionality reduction becomes part of the data preprocessing, PCA being a typical example. In fact, some algorithms find it very difficult to achieve good results without dimensionality reduction as a preprocessing step.
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is the most commonly used linear dimensionality reduction method. Its goal is to map high-dimensional data into a low-dimensional space through some linear projection, expecting the variance of the data along the projected dimensions to be as large as possible, so that fewer data dimensions are used while more of the characteristics of the original data points are retained.
Intuitively, if all points were mapped onto a single point, almost all of the information (such as the distances between points) would be lost; if the differences after mapping are as large as possible, the data points remain spread out and more information is preserved. It can be proved that PCA is the linear dimensionality reduction method that loses the least information from the raw data (it is actually the closest to the original data, but PCA does not attempt to explore the intrinsic structure of the data).
Let the n-dimensional vector w be an axis direction of the target subspace (called a mapping vector). Maximizing the variance of the data after mapping gives:

$$\max_{w} \; \frac{1}{m}\sum_{i=1}^{m} \left( w^{T}(x_i - \bar{x}) \right)^2$$
where m is the number of data instances, x_i is the vector representation of data instance i, and x̄ is the mean vector of all data instances. Defining W as the matrix whose columns are all the mapping vectors, after a linear-algebraic transformation the following optimization objective is obtained:

$$\max_{W} \; \mathrm{tr}(W^{T} A W), \quad \text{s.t. } W^{T} W = I$$

where tr denotes the trace of a matrix and

$$A = \frac{1}{m}\sum_{i=1}^{m} (x_i - \bar{x})(x_i - \bar{x})^{T}$$

is the data covariance matrix.
It is easy to show that the optimal W consists of the eigenvectors corresponding to the k largest eigenvalues of the data covariance matrix, arranged as column vectors. These eigenvectors form a set of orthogonal bases and best preserve the information in the data.
The output of PCA is y = W'x, which reduces x from its original dimension to k dimensions.
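To make this concrete, below is a minimal NumPy sketch of the procedure just described: center the data, form the covariance matrix A, take the eigenvectors of its k largest eigenvalues as the columns of W, and output y = W'x. The function and variable names are my own, chosen for illustration.

```python
import numpy as np

def pca(X, k):
    """Project X (n_samples x n_features) onto its k principal directions."""
    x_bar = X.mean(axis=0)                 # mean vector of all data instances
    Xc = X - x_bar                         # centered data
    A = Xc.T @ Xc / X.shape[0]             # data covariance matrix A
    eigvals, eigvecs = np.linalg.eigh(A)   # eigh: A is symmetric
    order = np.argsort(eigvals)[::-1]      # largest eigenvalues first
    W = eigvecs[:, order[:k]]              # mapping vectors as columns of W
    return Xc @ W                          # y = W'x for every (centered) sample

# Example: reduce 100 five-dimensional points to two dimensions
X = np.random.randn(100, 5)
Y = pca(X, k=2)
print(Y.shape)   # (100, 2)
```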
PCA pursues maximal preservation of the intrinsic information of the data after dimensionality reduction, and measures the importance of a direction by the variance of the data along that projection direction. However, a projection with large variance is not necessarily useful: it may mix the data points together so that they can no longer be differentiated. This is one of the biggest problems with PCA, and it is why PCA often gives poor results when used for classification. As can be seen in the figure, if PCA is used to project the data points onto a one-dimensional space, PCA chooses axis 2, which mixes together the two clusters of points that were easily distinguishable and makes them indistinguishable; choosing axis 1 instead gives a good result.
Discriminant analysis, unlike PCA, does not aim to preserve as much of the data as possible, but rather to make the data easily distinguishable after dimensionality reduction. LDA, another common linear dimensionality reduction method, is introduced next. In addition, some nonlinear dimensionality reduction methods, such as LLE and Laplacian Eigenmaps, use local properties of the data points to obtain distinguishable results; these will be introduced later.
LDA
Linear Discriminant Analysis (LDA, also known as Fisher Linear Discriminant) is a supervised linear dimensionality reduction algorithm. Unlike PCA, which tries to preserve the information in the data, LDA is designed so that the data points after dimensionality reduction are as easy to distinguish as possible.
Assume the original data is represented as X (an m×n matrix, where m is the dimension and n is the number of samples).
Since the method is linear, we wish to find a mapping vector a such that the mapped data points a'x maintain the following two properties:
1. Data points of the same class are as close as possible (within-class).
2. Data points of different classes are as far apart as possible (between-class).
So, looking again at the figure used in the PCA section: if the two clusters of points there are two classes, then we want them to be projected onto axis 1 (the PCA result would be axis 2), so that they are still easy to distinguish even in one-dimensional space.
Next comes the derivation. Since typesetting the formulas here is inconvenient, I quote a small excerpt from a slide deck by Prof. Deng Cai:
The idea is quite clear: the objective function is J(a) on the last line, where μ̃ (the projected class center) is used to evaluate the between-class distance, and s̃ (the sum of the distances between the projected points and their class center) is used to evaluate the within-class scatter. J(a) follows directly from the two properties above.
Therefore, in the two-class case, the objective is the classical Fisher criterion:

$$\max_{a} \; J(a) = \frac{|\tilde{\mu}_1 - \tilde{\mu}_2|^{2}}{\tilde{s}_1^{\,2} + \tilde{s}_2^{\,2}}$$
subject to the added constraint a'a = 1 (similar to PCA).
This can be extended to the multi-class case:
For the detailed derivation of the formulas above, refer to the corresponding section on the Fisher discriminant in the book Pattern Classification.
With this, computing the mapping vector a amounts to finding the eigenvector with the largest eigenvalue, or forming the matrix A = [a1, a2, ..., ak] from the eigenvectors with the k largest eigenvalues. New points can then be reduced in dimension by y = A'x (one advantage of linear methods is that they are computationally convenient!).
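As a concrete illustration, here is a minimal NumPy sketch of the procedure above under the usual scatter-matrix formulation: build the within-class and between-class scatter matrices, solve the resulting eigenvector problem, stack the top k eigenvectors into A = [a1, ..., ak], and map points by y = A'x. The function and variable names are my own; it is a sketch, not a reference implementation.

```python
import numpy as np

def lda(X, labels, k):
    """X: (n_samples, n_features); labels: class label per sample; k: target dimension."""
    classes = np.unique(labels)
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))                  # within-class scatter
    S_b = np.zeros((d, d))                  # between-class scatter
    for c in classes:
        Xc = X[labels == c]
        mean_c = Xc.mean(axis=0)
        S_w += (Xc - mean_c).T @ (Xc - mean_c)
        diff = (mean_c - mean_all).reshape(-1, 1)
        S_b += Xc.shape[0] * (diff @ diff.T)
    # maximize a'S_b a / a'S_w a  ->  eigenvectors of pinv(S_w) @ S_b
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    order = np.argsort(eigvals.real)[::-1]
    A = eigvecs[:, order[:k]].real          # A = [a1, a2, ..., ak]
    return X @ A                            # samples stored as rows, so y = A'x becomes X @ A

# Example: two Gaussian classes projected onto one discriminant direction
X = np.vstack([np.random.randn(50, 3) + 2, np.random.randn(50, 3) - 2])
labels = np.array([0] * 50 + [1] * 50)
print(lda(X, labels, k=1).shape)   # (100, 1)
```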
It can be seen that LDA is ultimately transformed into a matrix eigenvector problem, very much like PCA. In fact, many other algorithms also belong to this class, commonly referred to as spectral methods.
Among linear dimensionality reduction algorithms, I consider PCA and LDA the most important. Next, some nonlinear methods are introduced.
Locally Linear Embedding (LLE)
Locally Linear Embedding (LLE) is a nonlinear dimensionality reduction algorithm that preserves the original manifold structure well in the reduced-dimension data. LLE can be regarded as one of the most classical methods of manifold learning, and many subsequent manifold learning and dimensionality reduction methods are closely related to it.
See Figure 1: using LLE to map the three-dimensional data in (b) to two dimensions in (c), the mapped data still maintains the original manifold structure (the red points stay close to each other, and so do the blue points), showing that LLE effectively preserves the manifold structure of the original data.
However, LLE does not apply in some cases. If the data is distributed on a closed spherical surface, LLE cannot map it to a two-dimensional space while maintaining the original manifold structure. So when applying it we first assume that the data is not distributed on a closed sphere or ellipsoid.
Figure 1: Example of dimensionality reduction with the LLE algorithm
The LLE algorithm assumes that each data point can be constructed as a linear weighted combination of its nearest neighbors. The algorithm has three main steps: (1) find the k nearest neighbors of each sample point; (2) compute the local reconstruction weight matrix of each sample point from its nearest neighbors; (3) compute the output (low-dimensional) coordinates of each sample point from its local reconstruction weight matrix and its nearest neighbors. The specific algorithm flow is shown in Figure 2, with a code sketch following below:
Figure 2: Steps of the LLE algorithm
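The three steps in Figure 2 can be sketched in NumPy roughly as follows. This is a minimal illustration of the usual LLE formulation (with a small regularization term added when solving for the reconstruction weights); the function and parameter names are my own.

```python
import numpy as np

def lle(X, n_neighbors=10, n_components=2, reg=1e-3):
    """X: (n_samples, n_features); returns an (n_samples, n_components) embedding."""
    n = X.shape[0]
    # Step 1: find the k nearest neighbors of each sample point (Euclidean distance)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)
    neighbors = np.argsort(d2, axis=1)[:, :n_neighbors]

    # Step 2: compute the local reconstruction weights of each sample point
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[neighbors[i]] - X[i]                       # neighbors centered on x_i
        C = Z @ Z.T                                      # local Gram matrix (k x k)
        C += reg * np.trace(C) * np.eye(n_neighbors)     # regularize for stability
        w = np.linalg.solve(C, np.ones(n_neighbors))
        W[i, neighbors[i]] = w / w.sum()                 # weights sum to 1

    # Step 3: output coordinates from the bottom eigenvectors of M = (I - W)^T (I - W)
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    eigvals, eigvecs = np.linalg.eigh(M)
    # discard the smallest (constant) eigenvector, keep the next n_components
    return eigvecs[:, 1:n_components + 1]

# Example: embed 200 three-dimensional points into two dimensions
X = np.random.rand(200, 3)
print(lle(X, n_neighbors=10, n_components=2).shape)   # (200, 2)
```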
Laplacian Eigenmaps (Laplacian feature mapping)
Continuing with the classical dimensionality reduction algorithms: having described PCA, LDA, and LLE, here we discuss Laplacian Eigenmaps. It is not that each algorithm is better than the previous one; rather, each one looks at the problem from a different angle, so the approach to solving it differs. The ideas behind these dimensionality reduction algorithms are very simple, yet very effective in certain settings. These methods are in fact the sources of the ideas behind some newer algorithms.
Laplacian Eigenmaps [1] looks at the problem from an angle somewhat similar to LLE: it also builds the relationships between data points from a local perspective.
Its intuitive idea is that points that are related to each other (points connected in the graph) should remain as close as possible in the space after dimensionality reduction. Laplacian Eigenmaps can reflect the intrinsic manifold structure of the data.
The specific steps for using the algorithm are:
Step 1: Build the graph
Build a graph over all the points using some method, for example the KNN algorithm: connect each point to its k nearest neighbors, where k is a preset value.
Step 2: Determine weights
Determine the weights between pairs of points, for example using the heat kernel: if point i and point j are connected, the weight of their edge is set to

$$W_{ij} = e^{-\frac{\|x_i - x_j\|^2}{t}}$$

and otherwise W_ij = 0.
Step 3: Eigenmapping

Compute the eigenvalues and eigenvectors of the generalized eigenvalue problem L f = λ D f, where D is the diagonal degree matrix with D_ii = Σ_j W_ij and L = D − W is the graph Laplacian. The eigenvectors corresponding to the m smallest non-zero eigenvalues are used as the output of the dimensionality reduction.
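Putting the three steps together, here is a minimal NumPy/SciPy sketch of Laplacian Eigenmaps as described above: a kNN graph, heat-kernel weights, and the generalized eigenvalue problem L f = λ D f. It is an illustration of the standard formulation, not an optimized implementation, and the names are my own.

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmaps(X, n_neighbors=10, t=1.0, m=2):
    """X: (n_samples, n_features); returns an (n_samples, m) embedding."""
    n = X.shape[0]
    # Step 1: build the graph by connecting each point to its k nearest neighbors
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)
    neighbors = np.argsort(d2, axis=1)[:, :n_neighbors]

    # Step 2: heat-kernel weights W_ij = exp(-||x_i - x_j||^2 / t) on connected pairs
    W = np.zeros((n, n))
    for i in range(n):
        for j in neighbors[i]:
            w = np.exp(-d2[i, j] / t)
            W[i, j] = W[j, i] = w            # keep the graph symmetric

    # Step 3: eigenmapping -- solve L f = lambda D f with L = D - W
    D = np.diag(W.sum(axis=1))
    L = D - W
    eigvals, eigvecs = eigh(L, D)            # generalized symmetric eigenproblem
    # drop the trivial eigenvector (eigenvalue ~ 0), keep the next m
    return eigvecs[:, 1:m + 1]

# Example: embed 200 three-dimensional points into two dimensions
X = np.random.rand(200, 3)
print(laplacian_eigenmaps(X, n_neighbors=8, t=0.5, m=2).shape)   # (200, 2)
```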
As mentioned earlier, Laplacian Eigenmaps has the property of helping to distinguish data points, which can be seen from the following example:
As shown in Figure 1, the left plot shows two classes of data points (the data are images), the middle plot shows the position of each data point in two-dimensional space after Laplacian Eigenmaps, and the right plot shows the result of projecting onto the first two principal directions with PCA. It is clear that, on this classification problem, the result of Laplacian Eigenmaps is distinctly better than that of PCA.
Figure 2: Dimensionality reduction of the roll data
Figure 2 illustrates that high-dimensional data (3D in the figure) may have low-dimensional intrinsic properties (the roll is actually 2D), but this low-dimensional representation is not given by the original coordinates. For example, if we want to maintain local relationships, the blue and yellow regions are completely unrelated, yet describing this with any 2D or 3D distance alone is inaccurate.
The following three plots show Laplacian Eigenmaps under different parameters (reduced to 2D); it looks as if the whole band has been flattened out, so the blue and yellow regions end up far apart.
Xbinworld