Source: http://blog.csdn.net/zhongkelee/article/details/44064401
Please credit the source when reprinting: http://blog.csdn.net/zhongkelee/article/details/44064401
I. A brief introduction to PCA
1. Related background
After finishing Professor Chenhonghong's course "Machine Learning and Knowledge Discovery" and Professor Tihaibo's course "Matrix Algebra", I gained quite a lot. Recently I have been working on a project involving principal component analysis and singular value decomposition, so I am recording my notes and impressions here.
In many fields of research and application, we often need to observe a large number of variables that reflect the phenomenon of interest and collect a great deal of data for analysis in order to uncover patterns. Large multivariate samples undoubtedly provide rich information for research and application, but they also increase the workload of data collection. More importantly, in most cases many of the variables are correlated with one another, which increases the complexity of the analysis and makes it inconvenient. Analyzing each indicator separately yields isolated rather than integrated conclusions, while blindly reducing the number of indicators loses a great deal of information and easily leads to wrong conclusions.
It is therefore necessary to find a reasonable method that reduces the number of indicators to be analyzed while minimizing the loss of information contained in the original indicators, so that the collected data can be analyzed comprehensively. Because the variables are correlated to some degree, it is possible to summarize the various kinds of information carried by the original variables with a smaller set of composite indicators. Principal component analysis and factor analysis both belong to this class of dimensionality reduction methods.

2. Description of the problem
Table 1 below shows the scores of some students in Chinese, mathematics, physics, and chemistry:
First, assume that the scores in these subjects are uncorrelated, i.e., a student's score in one subject has no relation to his or her scores in the others. Then one can see at a glance that mathematics, physics, and chemistry constitute the principal components of this data set (clearly mathematics is the first principal component, because the mathematics scores are the most spread out). Why can this be seen at a glance? Because the coordinate axes happen to be chosen well. Next, look at the score statistics of a group of students in mathematics, physics, chemistry, Chinese, history, and English, shown in Table 2. Can these also be read off at a glance?
There is so much data that it looks rather messy; that is, you cannot directly see the principal components of this data set, because in this coordinate system the data are scattered. Part of the reason is that the naked eye cannot see through the clutter. If the data are placed in the corresponding space, however, you may be able to find the principal components by observing from a different angle, as shown in Figure 1 below:
However, for higher-dimensional data, can you still imagine its distribution? Even if you can describe the distribution, how do you find the axes of the principal components exactly? And how do you measure how much information the extracted principal components retain from the whole data set? These are the questions that principal component analysis is designed to answer.

3. Data dimensionality reduction
To explain what the principal components of a data set are, let us start with data dimensionality reduction. What is data dimensionality reduction? Suppose there is a set of points in three-dimensional space that all lie on an inclined plane passing through the origin. If you represent these data in the natural coordinate system with the x, y, z axes, you need three dimensions, yet the points are in fact distributed only on a two-dimensional plane. So what is going on? If you think about it, can you rotate the x, y, z coordinate system so that the plane of the data coincides with the x, y plane? Exactly. If the rotated coordinate system is written x', y', z', then this set of data can be represented using only the x' and y' coordinates. Of course, if you want to recover the original representation, you must also store the transformation matrix between the two coordinate systems. In this way the dimensionality of the data has been reduced. But let us look at the essence of the process: if the data points are stacked as the rows or columns of a matrix, the rank of that matrix is 2. The data are linearly dependent, and a maximal linearly independent subset of the vectors formed by the data points contains only 2 vectors. This is why we assumed at the outset that the plane passes through the origin. What if the plane does not pass through the origin? That is precisely the reason for centering the data: translating the coordinate origin to the center of the data makes the originally linearly independent vectors linearly dependent in the new coordinate system. Interestingly, any three points in three-dimensional space are coplanar after centering; more generally, n points in n-dimensional space can always be analyzed in an (n-1)-dimensional subspace.
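To make this concrete, here is a minimal numpy sketch (not from the original article; the plane directions and the offset are made up) showing that points on an inclined plane which does not pass through the origin give a data matrix whose rank only drops to 2 after centering:

```python
# Points on an inclined 2-D plane embedded in 3-D: the matrix becomes rank 2
# only after the coordinate origin is moved to the data center.
import numpy as np

rng = np.random.default_rng(0)

u = np.array([1.0, 0.0, 1.0])            # two directions spanning the plane
v = np.array([0.0, 1.0, 1.0])
offset = np.array([5.0, 5.0, 5.0])       # shifts the plane away from the origin

a, b = rng.normal(size=(2, 100))
points = offset + np.outer(a, u) + np.outer(b, v)    # 100 points, shape (100, 3)

print(np.linalg.matrix_rank(points))                  # 3: the offset hides the planar structure
centered = points - points.mean(axis=0)
print(np.linalg.matrix_rank(centered))                # 2: the centered data span only a plane
```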
In the previous paragraph we took it for granted that the dimensionality reduction discards nothing, because the data have a zero component along the third dimension perpendicular to the plane. Now suppose the data have a small amount of jitter along the z' axis. We may still use the two dimensions x' and y' to represent the data, because we can argue that the information carried by these two axes constitutes the principal components of the data and is sufficient for our analysis, while the jitter along the z' axis is very likely noise. In other words, the data are in fact correlated, and the introduction of noise makes the correlation imperfect; but because the spread of the data along the z' axis is very small, the data remain strongly correlated, and the projections of the data onto the x' and y' axes can be regarded as forming the principal components of the data.
In class the teacher discussed the feature selection problem, which is essentially about discarding the features that are unrelated to the class label. The features considered here, however, are mostly related to the class label but contain noise or redundancy. In that case a feature dimensionality reduction method is needed to reduce the number of features, reduce noise and redundancy, and lower the risk of overfitting.
The idea of PCA is to map n-dimensional features onto k dimensions (k < n), which form a new set of orthogonal features. These k-dimensional features, called the principal components, are newly constructed k-dimensional features, not simply the k features left over after discarding n - k of the original n dimensions.

II. A PCA example
Now suppose that there is a set of data as follows:
Each row represents a sample and each column a feature; there are 10 samples, each with two features. You can think of them as 10 documents, where x is the TF-IDF of the word "learn" in each document and y is the TF-IDF of the word "study".
The first step is to compute the means of x and y and then subtract the corresponding mean from every sample. Here the mean of x is 1.81 and the mean of y is 1.91, so the first sample, for example, becomes (0.69, 0.49) after subtracting the means.
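The data table itself appears as an image in the original post. The sketch below assumes the numbers are the classic two-feature example from Lindsay Smith's PCA tutorial, which is consistent with the quoted means (1.81, 1.91) and the centered first sample (0.69, 0.49); the variable names are arbitrary.

```python
# Step 1 sketch: subtract the per-feature mean (assumed dataset, see note above).
import numpy as np

x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])   # "learn" TF-IDF
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])   # "study" TF-IDF
data = np.column_stack([x, y])             # 10 samples x 2 features

mean = data.mean(axis=0)
data_adjust = data - mean                  # DataAdjust in the text

print(mean)                                # [1.81 1.91]
print(data_adjust[0])                      # [0.69 0.49]
```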
The second step is to compute the covariance matrix of the features. If the data were three-dimensional, the covariance matrix would be

$C = \begin{pmatrix} \mathrm{cov}(x,x) & \mathrm{cov}(x,y) & \mathrm{cov}(x,z) \\ \mathrm{cov}(y,x) & \mathrm{cov}(y,y) & \mathrm{cov}(y,z) \\ \mathrm{cov}(z,x) & \mathrm{cov}(z,y) & \mathrm{cov}(z,z) \end{pmatrix}.$
Here there are only x and y, and the result is

$C = \begin{pmatrix} 0.616555556 & 0.615444444 \\ 0.615444444 & 0.716555556 \end{pmatrix}.$
The diagonal entries are the variances of x and y, and the off-diagonal entries are their covariance. Covariance measures the degree to which two variables change together: a covariance greater than 0 means that when one of x and y increases, the other tends to increase as well; less than 0 means that when one increases, the other tends to decrease. If x and y are statistically independent, their covariance is 0; however, a covariance of 0 does not imply that x and y are independent. The larger the absolute value of the covariance, the stronger the influence of the two variables on each other, and the smaller it is, the weaker the influence. Covariance is not a dimensionless quantity, so if the units of the same two variables are changed, the numerical value of their covariance changes as well.
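A small sketch of the covariance computation, under the same dataset assumption as above; np.cov with its default ddof = 1 reproduces the variances on the diagonal and the covariance off the diagonal.

```python
# Step 2 sketch: sample covariance matrix of the two (assumed) features.
import numpy as np

x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])
data_adjust = np.column_stack([x, y]) - np.array([1.81, 1.91])

cov = np.cov(data_adjust.T)    # np.cov expects variables in rows, hence the transpose
print(cov)                      # diagonal approx 0.6166 and 0.7166, off-diagonal approx 0.6154
```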
The third step is to compute the eigenvalues and eigenvectors of the covariance matrix, which gives the two eigenvalues

$\lambda_1 = 0.0490833989, \qquad \lambda_2 = 1.28402771,$

with corresponding eigenvectors

$\xi_1 = (-0.735178656,\ 0.677873399)^T, \qquad \xi_2 = (-0.677873399,\ -0.735178656)^T,$

where the eigenvectors have been normalized to unit length.
The fourth step is to sort the eigenvalues from largest to smallest, select the largest k of them, and use the corresponding k eigenvectors as column vectors to form an eigenvector matrix.
Here there are only two eigenvalues; we select the larger one, 1.28402771, whose corresponding eigenvector is $(-0.677873399, -0.735178656)^T$.
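A sketch of steps 3 and 4 under the same dataset assumption: eigendecomposition of the covariance matrix, sorting the eigenvalues from largest to smallest, and keeping the top k = 1 eigenvector. An eigenvector is only determined up to sign, so a solver may return the quoted vector multiplied by -1.

```python
# Steps 3-4 sketch: eigen-decomposition, sorting, and selection of the top k eigenvectors.
import numpy as np

cov = np.array([[0.616555556, 0.615444444],
                [0.615444444, 0.716555556]])

eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: symmetric input, ascending eigenvalues
order = np.argsort(eigvals)[::-1]           # indices from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals)                              # approx [1.28402771  0.0490834]
print(eigvecs[:, 0])                        # parallel to (-0.677873399, -0.735178656)^T, up to sign

k = 1
feature_vector = eigvecs[:, :k]             # n x k matrix of the chosen eigenvector(s)
```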
The fifth step is to project the sample points onto the selected eigenvectors. Suppose the number of samples is m and the number of features is n. Let DataAdjust (m×n) be the sample matrix with the means subtracted, the covariance matrix be n×n, and EigenVectors (n×k) be the matrix of the selected k eigenvectors. Then the projected data FinalData are

FinalData (m×k) = DataAdjust (m×n) × EigenVectors (n×k).

In this example:
FinalData (10×1) = DataAdjust (10×2 matrix) × EigenVector $(-0.677873399, -0.735178656)^T$
The result is a 10×1 column of projected values. In this way the original n-dimensional features of the samples are reduced to k dimensions, and these k dimensions are exactly the projections of the original features onto the k chosen directions.
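A sketch of this projection step, again under the dataset assumption above and using the eigenvector quoted in the text:

```python
# Step 5 sketch: FinalData = DataAdjust (10x2) @ EigenVector (2x1) -> 10x1.
import numpy as np

x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])
data_adjust = np.column_stack([x, y]) - np.array([1.81, 1.91])

eigen_vector = np.array([[-0.677873399],
                         [-0.735178656]])    # eigenvector quoted in the text

final_data = data_adjust @ eigen_vector      # 10x1 projection
print(final_data.ravel())                    # first entry approx -0.828 (= 0.69*-0.6779 + 0.49*-0.7352)
```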
The data above can be interpreted as fusing the "learn" and "study" features into a single new feature, which we might call an "LS" feature, that essentially represents both. The process described above is illustrated in Figure 2 below:
The plus signs mark the preprocessed (centered) sample points; the two slanted lines are the orthogonal eigenvectors (because the covariance matrix is symmetric, its eigenvectors are orthogonal); and the matrix multiplication of the last step projects the original sample points onto the axes corresponding to the chosen eigenvectors.
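A quick numerical check of that orthogonality, using the same assumed covariance matrix as above:

```python
# The eigenvectors of a symmetric covariance matrix are orthogonal.
import numpy as np

cov = np.array([[0.616555556, 0.615444444],
                [0.615444444, 0.716555556]])
_, eigvecs = np.linalg.eigh(cov)

print(np.dot(eigvecs[:, 0], eigvecs[:, 1]))   # approx 0.0
```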
The whole PCA process seems simple: compute the eigenvalues and eigenvectors of the covariance matrix, then transform the data. But one cannot help wondering: why are the eigenvectors of the covariance matrix the ideal k-dimensional directions? What is the hidden meaning behind this? What does PCA really mean?

III. Derivation of PCA
Let's look at the following picture:
In the first part we gave the example of student scores, in which the data points are six-dimensional, i.e., each observation is a point in a 6-dimensional space. We want to represent the 6-dimensional data in a lower-dimensional space.
Assume for now that there are only two dimensions, i.e., two variables, represented by the horizontal and vertical axes, so that each observation corresponds to two coordinate values. If the data form an oval-shaped cloud of points, the ellipse has a major (long) axis and a minor (short) axis. Along the short axis the data vary very little; in the extreme case where the short axis degenerates to a point, the variation of the points can be explained along the direction of the long axis alone, and the reduction from two dimensions to one happens naturally.
In the diagram above, u1 is the direction of the first principal component, and u2 is the direction orthogonal to u1 in the two-dimensional space. The n data points are most spread out along the u1 axis (the variance there is largest), so the projection of the data onto u1 represents most of the information in the original data; even if u2 is ignored, little information is lost. Moreover, u1 and u2 are uncorrelated. Keeping only u1 reduces the data from two dimensions to one.
The greater the difference between the long and short axes of the ellipse, the more reasonable the dimensionality reduction.

1. Maximum variance theory
In signal processing it is commonly assumed that the signal has a large variance and the noise has a small variance; the signal-to-noise ratio is the ratio of the signal variance to the noise variance, and the larger it is the better. As in the figure above, the variance of the samples projected onto u1 is large, while the variance of the projections onto u2 is small, so the projections onto u2 can be regarded as being caused by noise.
Therefore we consider the best k-dimensional features to be the ones obtained by transforming the n-dimensional sample points into k dimensions in such a way that the sample variance along each of the k dimensions is as large as possible.
For example, suppose we project the 5 points in the figure below onto a one-dimensional subspace, represented by a straight line through the origin (the data have been centered):
Suppose we pick two different lines to project onto; which of the two is better? According to the variance-maximization criterion above, the one on the left is better, because the variance of the projected sample points is larger there (it can also be said that the sum of the absolute values of the projections is largest).
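The sketch below uses five made-up centered points (the figure's points are not reproduced in the text) to show how two candidate unit directions can be compared by the variance of their projections:

```python
# Compare two candidate projection directions by the variance of the projections.
import numpy as np

points = np.array([[ 2.0,  1.8],
                   [ 1.0,  1.1],
                   [-0.5, -0.6],
                   [-1.2, -1.0],
                   [-1.3, -1.3]])            # made-up, already centered
points = points - points.mean(axis=0)        # center (a no-op here)

u_a = np.array([1.0,  1.0]) / np.sqrt(2)     # candidate direction A (along the point cloud)
u_b = np.array([1.0, -1.0]) / np.sqrt(2)     # candidate direction B (across the point cloud)

proj_a = points @ u_a                         # scalar projections <x, u>
proj_b = points @ u_b

print(proj_a.var(), proj_b.var())             # variance along A is far larger, so A is preferred
```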
The method for calculating projections is shown in Figure 5 below:
In the figure, the red points are the sample points and the blue points are their projections onto u. The vector u is the direction vector of the line (it determines the slope) and is a unit vector. The blue point is the projection of x onto u, and its signed distance from the origin is $\langle x, u \rangle$ (i.e., $x^T u$ or $u^T x$).

2. Least squares
We use a least-squares argument to determine the direction of each principal axis (principal component).
Given a set of data (in what follows, vectors are understood to be column vectors)

$x_1, x_2, \ldots, x_m \in \mathbb{R}^n,$

the center of the data is located at

$\bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i .$

Center the data (i.e., move the coordinate origin to the center of the sample points):

$x_i \leftarrow x_i - \bar{x}, \qquad i = 1, \ldots, m .$
The centered data are most spread out along the direction of the first principal axis u1; that is, the sum of the absolute values of the projections onto u1 is largest along that direction (equivalently, the variance is largest). The calculation of the projection was explained above: it is the inner product of x and u1. Since only the direction of u1 is needed, we also require u1 to be a unit vector.
That is, we want to maximize

$\sum_{i=1}^{m} \left| x_i^T u_1 \right| .$

By matrix algebra it is more convenient to square the terms instead of keeping the absolute values, so we maximize

$\sum_{i=1}^{m} \left( x_i^T u_1 \right)^2 .$
An inner product of two vectors can be written as a matrix product,

$\langle x_i, u_1 \rangle = x_i^T u_1 ,$

so the objective function can be expressed as

$\sum_{i=1}^{m} \left( x_i^T u_1 \right)^2 = \sum_{i=1}^{m} \left( x_i^T u_1 \right)^T \left( x_i^T u_1 \right) .$

Inside the parentheses the inner product is written as a row vector times a column vector, which gives a scalar, and the transpose of a scalar is the scalar itself, so the objective function can be converted to

$\sum_{i=1}^{m} u_1^T x_i x_i^T u_1 .$

Since u1 does not depend on the summation index i, it can be pulled outside the sum, and the expression simplifies to

$u_1^T \left( \sum_{i=1}^{m} x_i x_i^T \right) u_1 .$
Those who have studied matrix algebra may have noticed that the bracketed sum is simply a large matrix multiplied by its own transpose, where the large matrix is formed by placing the sample vectors side by side as columns:

$X = \left[ x_1 \; x_2 \; \cdots \; x_m \right] ,$

that is, the i-th column of X is $x_i$. Then

$\sum_{i=1}^{m} x_i x_i^T = X X^T .$
So the objective function finally becomes

$\max_{\| u_1 \| = 1} \; u_1^T X X^T u_1 ,$

where $u_1^T X X^T u_1$ is a quadratic form in $u_1$ with matrix $A = X X^T$.

Suppose $\lambda$ is an eigenvalue of A and $\xi$ is the corresponding eigenvector. Then

$\xi^T A \xi = \xi^T X X^T \xi = \left( X^T \xi \right)^T \left( X^T \xi \right) = \| X^T \xi \|_2^2 \ge 0 ,$ while at the same time $\xi^T A \xi = \lambda \, \xi^T \xi = \lambda \| \xi \|_2^2$, so $\lambda \ge 0$.

Therefore A is a positive semidefinite symmetric matrix, and the quadratic form above is a positive semidefinite quadratic form; by a result from matrix algebra, the objective function attains a maximum over the unit vectors.
We now answer two questions: what is the maximum value, and in which direction does u1 point when that maximum is attained?
To answer the first question, recall that the squared 2-norm of a vector x is

$\| x \|_2^2 = x^T x .$

Likewise, the objective function can be written as the squared 2-norm of the mapped vector:

$u_1^T X X^T u_1 = \left( X^T u_1 \right)^T \left( X^T u_1 \right) = \| X^T u_1 \|_2^2 .$

Written as a norm, the maximization becomes the question: a matrix maps a vector, and since u1 is a unit vector, how long can the image of that vector be? A theorem from matrix algebra states that the maximum length of the image of a unit vector under a matrix mapping is the largest singular value of the matrix, namely

$\max_{\| u_1 \| = 1} \| X^T u_1 \|_2 = \sigma_{\max}\!\left( X^T \right) .$

Here $\sigma_{\max}(A)$ is the largest singular value of the matrix A (which is also the 2-norm of A), and it equals the square root of the largest eigenvalue of $A^T A$ (or of $A A^T$).
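A small numerical sanity check of this relationship, using a randomly generated X (purely illustrative; as in the derivation, the columns of X are the centered sample vectors):

```python
# The largest singular value of X^T, squared, equals the largest eigenvalue of X X^T.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 10))                          # n = 3 features, m = 10 samples as columns

sigma_max = np.linalg.svd(X.T, compute_uv=False)[0]   # singular values, descending order
lambda_max = np.linalg.eigvalsh(X @ X.T)[-1]          # eigenvalues, ascending order

print(sigma_max**2, lambda_max)                       # the two agree up to rounding
```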
For our problem, $X X^T$ is a positive semidefinite symmetric matrix, which means that its eigenvalues are all greater than or equal to 0, and that eigenvectors corresponding to distinct eigenvalues are orthogonal and form an orthogonal basis of the space.
To answer the second question, consider the general situation. Let the n eigenvalues of the symmetric matrix $A = X X^T$ be

$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0 ,$

with corresponding unit eigenvectors

$\xi_1, \xi_2, \ldots, \xi_n .$

Take any vector x and expand it in this basis of eigenvectors:

$x = \alpha_1 \xi_1 + \alpha_2 \xi_2 + \cdots + \alpha_n \xi_n .$

Then

$A x = \alpha_1 \lambda_1 \xi_1 + \alpha_2 \lambda_2 \xi_2 + \cdots + \alpha_n \lambda_n \xi_n ,$

so that

$x^T A x = \sum_{i=1}^{n} \lambda_i \alpha_i^2 \le \lambda_1 \sum_{i=1}^{n} \alpha_i^2 = \lambda_1 \| x \|_2^2 ,$

with equality exactly when x lies along $\xi_1$.
In other words, the objective function attains its maximum value, the largest eigenvalue $\lambda_1$, precisely when u1 is the corresponding unit eigenvector; the direction of that eigenvector is the direction of the first principal component u1. (The direction of the second principal component is that of the eigenvector corresponding to the second largest eigenvalue, and so on.)
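A sketch that checks this conclusion numerically on a randomly generated positive semidefinite matrix (illustrative only): no random unit vector gives a larger value of the quadratic form than the eigenvector of the largest eigenvalue.

```python
# For a PSD symmetric A, u^T A u over unit vectors is maximized by the top eigenvector.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 10))
A = X @ X.T                                       # positive semidefinite symmetric matrix

eigvals, eigvecs = np.linalg.eigh(A)              # ascending eigenvalues
u1 = eigvecs[:, -1]                               # eigenvector of the largest eigenvalue
best = u1 @ A @ u1                                # equals eigvals[-1]

u = rng.normal(size=(100, 3))
u /= np.linalg.norm(u, axis=1, keepdims=True)     # 100 random unit vectors
others = np.einsum('ij,jk,ik->i', u, A, u)        # u_i^T A u_i for each random direction

print(best, others.max())                         # best is never exceeded
```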
Proof complete.
The percentage of the total information captured by the selected principal components can be computed as

$\frac{\sum_{i=1}^{k} \sigma_i^2}{\sum_{i=1}^{n} \sigma_i^2} ,$

where the denominator is the sum of the squares of all the singular values and the numerator is the sum of the squares of the k largest singular values that were selected.
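Since the ratio of squared singular values of the centered data matrix equals the ratio of the eigenvalues of the covariance matrix, this percentage can be computed directly from the eigenvalues of the worked example above (a sketch, assuming that example's values):

```python
# Fraction of total variance retained by the first k principal components.
import numpy as np

eigvals = np.array([1.28402771, 0.0490833989])   # eigenvalues from the worked example
k = 1
ratio = eigvals[:k].sum() / eigvals.sum()
print(ratio)                                      # approx 0.963: one component keeps about 96%
```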