I. Introduction to PCA
1. Related background
Principal Component Analysis (PCA) is a statistical method. It applies an orthogonal transformation to convert a set of possibly correlated variables into a set of linearly uncorrelated variables; the transformed variables are called the principal components.
After finishing the courses "Machine Learning and Knowledge Discovery" (taught by Chen Honghong) and "Matrix Algebra" (taught by Ji Haibo), I gained a lot from both. Recently I have been working on a project involving principal component analysis and singular value decomposition, so I am writing down some notes here.
In many fields of research and application, it is often necessary to observe a large number of variables that describe the objects of interest and to collect large amounts of data for analysis in order to find patterns. Large multivariate samples undoubtedly provide rich information, but they also increase the workload of data collection. More importantly, in most cases many of the variables are correlated with one another, which increases the complexity of the analysis and makes it inconvenient. If each indicator is analyzed separately, the analysis is isolated rather than integrated, while blindly dropping indicators loses a great deal of information and easily leads to wrong conclusions.
Therefore, we need a reasonable method that reduces the number of indicators to analyze while losing as little as possible of the information contained in the original indicators, so that the collected data can still be analyzed comprehensively. Because the variables are correlated, it is possible to summarize the various kinds of information present in the original variables with a smaller number of composite indicators. Principal component analysis and factor analysis both belong to this class of dimensionality-reduction methods.

2. Description of the problem
Table 1 below shows the scores of a number of students in Chinese, mathematics, physics, and chemistry:
First, assume that these subjects are uncorrelated, that is, a student's score in one subject has no relation to the scores in the others. Then one can see at a glance that the scores in mathematics, physics, and chemistry constitute the principal components of this data set (clearly mathematics is the first principal component, because the mathematics scores are the most spread out). Why can this be seen at a glance? Because the coordinate axes have been chosen well. Now look at the statistics of a group of students' scores in mathematics, physics, chemistry, Chinese, history, and English in Table 2 below. Can you still see the answer at a glance?
There is too much data, and it looks messy. In other words, the principal components of this data set cannot be seen directly, because in this coordinate system the data distribution is too scattered. The reason is that the fog hiding them from the naked eye has not been lifted: if the data are displayed in an appropriate space, that is, if we change the angle of observation, the principal components may become visible. This is shown in Figure 1 below:
However, for higher-dimensional data, can you even imagine its distribution? Even if you can, how exactly do you find the axes of the principal components? And how do you measure how much of the information in the whole data set the extracted principal components actually account for? For these tasks we need the machinery of principal component analysis.

3. Data dimensionality reduction
To explain what the principal components of a data set are, let us start with dimensionality reduction. What does data dimensionality reduction mean? Suppose there is a collection of points in three-dimensional space that all lie on a plane passing through the origin. If we represent the data with the natural coordinate axes x, y, z, three dimensions are needed; yet the points actually lie on a two-dimensional plane. So where is the problem? If you think about it, we can rotate the x, y, z coordinate system so that the plane of the data coincides with the x, y plane. If the rotated coordinate system is denoted x', y', z', then this data set can be represented using only the x' and y' coordinates. Of course, if we want to recover the original representation, we must also store the transformation matrix between the two coordinate systems. In this way the dimensionality of the data has been reduced.

Let us look at the essence of this process. If the data points are arranged as the rows (or columns) of a matrix, the rank of that matrix is 2. The data are linearly dependent, and a maximal linearly independent set of the vectors that make up the data contains only 2 vectors. This is why we assumed at the outset that the plane passes through the origin. What if the plane does not pass through the origin? That is exactly the reason for centering the data: shifting the coordinate origin to the center of the data ensures that points lying on any plane become linearly dependent in the new coordinate system. As an aside, any three points in three-dimensional space are linearly dependent once centered; more generally, n points in n-dimensional space can always be analyzed inside an (n-1)-dimensional subspace after centering.
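A minimal numpy sketch of this idea; the plane z = x + y and the random points are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: 100 points on the plane z = x + y, which passes through the origin.
xy = rng.normal(size=(100, 2))
on_origin_plane = np.column_stack([xy, xy.sum(axis=1)])        # shape (100, 3)
print(np.linalg.matrix_rank(on_origin_plane))                   # 2: two directions suffice

# The same points shifted onto the plane z = x + y + 3, which misses the origin.
off_origin_plane = on_origin_plane + np.array([0.0, 0.0, 3.0])
print(np.linalg.matrix_rank(off_origin_plane))                  # 3

# Centering (moving the origin to the data mean) makes the plane pass through
# the new origin again, so the rank drops back to 2.
centered = off_origin_plane - off_origin_plane.mean(axis=0)
print(np.linalg.matrix_rank(centered))                          # 2
```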
It was argued above that no information is discarded by this reduction, because the component of the data in the third dimension, perpendicular to the plane, is zero. Now suppose the data have a small amount of jitter along the z' axis. We may still represent the data with the two dimensions above, on the grounds that the information carried by the x' and y' axes constitutes the principal components of the data and is sufficient for our analysis, while the jitter along the z' axis is very likely noise. In other words, the data were originally perfectly correlated, and the introduction of noise made the correlation imperfect; but the angle that the data make with the x'-y' plane is very small, that is, the correlation along those axes is still strong. Taking these considerations together, we can regard the projections of the data onto the x' and y' axes as the principal components of the data.
The problem discussed in class was that the features to be removed are those unrelated to the class labels. Here, in contrast, many of the features are related to the class labels, but they contain noise or redundancy. In this case we need a feature-reduction method that lowers the number of features, reduces noise and redundancy, and decreases the likelihood of overfitting.
The idea of PCA is to map the n-dimensional features onto k dimensions (k < n), which are brand-new orthogonal features. These k features, called the principal components, are k newly constructed dimensions, not simply the result of keeping k of the original n features and discarding the remaining n - k.

II. An example of PCA
Now suppose that there is a set of data as follows:
The rows are samples and the columns are features; there are 10 samples, each with two features. You can think of them as 10 documents, where x is the tf-idf of the word "learn" in each document and y is the tf-idf of the word "study".
The first step is to compute the means of x and y separately and subtract the corresponding mean from every example. Here the mean of x is 1.81 and the mean of y is 1.91; subtracting the means from the first sample, for instance, gives (0.69, 0.49). Doing this for all samples yields the adjusted data matrix.
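A minimal numpy sketch of this step. The ten (x, y) pairs below are an assumption: they are the data from Lindsay Smith's well-known PCA tutorial, which reproduce the means 1.81 and 1.91, the first adjusted sample (0.69, 0.49), and the eigenvalues quoted later, so they appear to be the data set used here.

```python
import numpy as np

# Assumed data (consistent with the means 1.81 / 1.91 and the first adjusted
# sample (0.69, 0.49) quoted in the text): x = tf-idf of "learn", y = tf-idf of "study".
data = np.array([
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
    [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
])

mean = data.mean(axis=0)       # array([1.81, 1.91])
data_adjust = data - mean      # first row: array([0.69, 0.49])
print(mean)
print(data_adjust[0])
```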
The second step is to compute the covariance matrix of the features. If the data were 3-dimensional, the covariance matrix would be

$$C = \begin{pmatrix} \operatorname{cov}(x,x) & \operatorname{cov}(x,y) & \operatorname{cov}(x,z) \\ \operatorname{cov}(y,x) & \operatorname{cov}(y,y) & \operatorname{cov}(y,z) \\ \operatorname{cov}(z,x) & \operatorname{cov}(z,y) & \operatorname{cov}(z,z) \end{pmatrix}$$
Here there are only x and y, so the covariance matrix works out to

$$C = \begin{pmatrix} 0.616555556 & 0.615444444 \\ 0.615444444 & 0.716555556 \end{pmatrix}$$
The diagonal entries are the variances of x and y, and the off-diagonal entries are their covariance. Covariance measures the degree to which two variables vary together. A covariance greater than 0 means that when x increases, y tends to increase as well; less than 0 means that when one increases the other decreases. If x and y are statistically independent their covariance is 0, but a covariance of 0 does not imply that x and y are independent. The larger the absolute value of the covariance, the stronger the joint variation of the two variables, and vice versa. Covariance is not a scale-free quantity: if the units in which the two variables are measured change, the numerical value of their covariance changes as well.
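Continuing the sketch above with the same assumed data, the covariance matrix can be computed directly (np.cov treats columns as variables when rowvar=False and uses the unbiased 1/(m-1) normalization):

```python
import numpy as np

data = np.array([
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
    [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
])
data_adjust = data - data.mean(axis=0)

# Covariance matrix: variances of x and y on the diagonal,
# their covariance off the diagonal.
cov = np.cov(data_adjust, rowvar=False)
print(cov)
# [[0.61655556 0.61544444]
#  [0.61544444 0.71655556]]
```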
In the third step, compute the eigenvalues and eigenvectors of the covariance matrix:

$$\text{eigenvalues} = \begin{pmatrix} 0.0490833989 \\ 1.28402771 \end{pmatrix}, \qquad \text{eigenvectors} = \begin{pmatrix} -0.735178656 & -0.677873399 \\ 0.677873399 & -0.735178656 \end{pmatrix}$$
The first array gives the two eigenvalues, and the matrix gives the corresponding eigenvectors as columns: the eigenvector for the eigenvalue 0.0490833989 is the first column, (-0.735178656, 0.677873399)^T. The eigenvectors here have all been normalized to unit length.
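The same numbers can be reproduced with numpy's symmetric eigensolver; a sketch (np.linalg.eigh returns eigenvalues in ascending order, and the sign of each eigenvector is arbitrary, so it may differ from the text by a factor of -1):

```python
import numpy as np

# The covariance matrix computed in the previous step.
cov = np.array([[0.61655556, 0.61544444],
                [0.61544444, 0.71655556]])

# eigh is numpy's eigensolver for symmetric matrices such as a covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

print(eigenvalues)         # approx [0.0490834  1.2840277]
print(eigenvectors[:, 0])  # for 0.0490834: ±(-0.7351787,  0.6778734)
print(eigenvectors[:, 1])  # for 1.2840277: ±(-0.6778734, -0.7351787)
```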
In the fourth step, sort the eigenvalues from largest to smallest, select the largest k of them, and use the corresponding k eigenvectors as column vectors to form the eigenvector matrix.
Here there are only two eigenvalues, and we choose the larger one, 1.28402771; the corresponding eigenvector is (-0.677873399, -0.735178656)^T.
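In code, this selection step is just an argsort over the eigenvalues; a sketch continuing the previous snippet (again, the recovered eigenvector may differ from the text's by sign):

```python
import numpy as np

cov = np.array([[0.61655556, 0.61544444],
                [0.61544444, 0.71655556]])
eigenvalues, eigenvectors = np.linalg.eigh(cov)

k = 1
# Indices of the eigenvalues sorted from largest to smallest.
order = np.argsort(eigenvalues)[::-1]
top_k = eigenvectors[:, order[:k]]       # shape (2, k): columns are the chosen axes

print(eigenvalues[order])                # approx [1.2840277 0.0490834]
print(top_k.ravel())                     # approx ±(0.6778734, 0.7351787)
```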
In the fifth step, project the sample points onto the selected eigenvectors. Suppose the number of samples is m and the number of features is n, the mean-subtracted sample matrix is DataAdjust (m×n), the covariance matrix is n×n, and the matrix of the selected k eigenvectors is EigenVectors (n×k). Then the projected data FinalData are
FinalData (10×1) = DataAdjust (10×2 matrix) × EigenVectors (-0.677873399, -0.735178656)^T
The result is a 10×1 column of projected values.
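A sketch of the whole projection with the same assumed data; the printed values are what the quoted eigenvector produces (with a sign-flipped eigenvector, every projected value simply flips sign):

```python
import numpy as np

data = np.array([
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
    [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
])
data_adjust = data - data.mean(axis=0)          # DataAdjust, shape (10, 2)

# The eigenvector quoted in the text for the largest eigenvalue.
u = np.array([[-0.677873399], [-0.735178656]])  # shape (2, 1)

final_data = data_adjust @ u                    # FinalData, shape (10, 1)
print(final_data.ravel())
# approx [-0.828  1.778 -0.992 -0.274 -1.676 -0.913  0.099  1.145  0.438  1.224]
```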
In this way, the n-dimensional features of the original samples have been converted to k dimensions, which are the projections of the original features onto those k directions.
The result can be interpreted as fusing the learn and study features into a single new feature, which we might call an LS feature, that essentially represents both of the original ones. The process above is illustrated in Figure 2 below:
The plus signs mark the preprocessed (mean-subtracted) sample points, and the two slanted lines are the orthogonal eigenvectors (the covariance matrix is symmetric, so its eigenvectors are orthogonal). The matrix multiplication in the last step is simply the projection of the original sample points onto the axes defined by those eigenvectors.
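As an optional cross-check (not part of the original walk-through), scikit-learn's PCA class performs the same centering, axis-finding, and projection; with the same assumed data its output should match the hand-computed column up to an overall sign, because the sign of a principal axis is arbitrary:

```python
import numpy as np
from sklearn.decomposition import PCA

data = np.array([
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
    [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
])

pca = PCA(n_components=1)
projected = pca.fit_transform(data)   # centers the data internally

print(pca.components_)                # ±(0.6779, 0.7352)
print(projected.ravel())              # same values as the manual projection, up to sign
```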
The whole PCA procedure seems simple: compute the eigenvalues and eigenvectors of the covariance matrix, then transform the data. But there is no magic here. Why are the eigenvectors of the covariance matrix the ideal k-dimensional directions? What is the meaning hidden behind them? What is the significance of the whole PCA procedure? The next section explains.

III. Derivation of PCA
Let's look at the following picture:
In the first part we gave the example of student scores, where each data point is six-dimensional, that is, each observation is a point in a 6-dimensional space. We want to represent this 6-dimensional data in a lower-dimensional space.
Assume first that there are only two dimensions, i.e. only two variables, represented by the horizontal and vertical axes, so each observation corresponds to two coordinate values. If the data form an ellipse-shaped cloud of points, the ellipse has a long axis and a short axis. In the short-axis direction the data vary very little; in the extreme case where the short axis degenerates to a point, the variation of the points can be explained entirely by the long-axis direction, and the reduction from two dimensions to one is accomplished naturally.
In the figure above, u1 is the direction of the first principal component, and u2 is the direction in the two-dimensional plane orthogonal to u1. The n data points are most spread out along the u1 axis (their variance is largest there), so the projections of the data onto u1 represent most of the information in the original data: even if u2 is ignored, little information is lost. Moreover, u1 and u2 are uncorrelated. Keeping only u1 reduces the two dimensions to one.
The greater the difference between the lengths of the ellipse's long and short axes, the more reasonable this dimensionality reduction is.

1. Maximum variance theory
In signal processing it is generally assumed that the signal has large variance and the noise has small variance; the signal-to-noise ratio is the ratio of the signal variance to the noise variance, and the larger it is, the better. In the figure above, the variance of the projections onto u1 is large while the variance of the projections onto u2 is small, so the projection onto u2 can be regarded as being caused by noise.
Therefore we take the view that the best k-dimensional features are obtained by converting the n-dimensional sample points to k dimensions such that the sample variance along each of the k dimensions is as large as possible.
For example, suppose we project the 5 points in the figure below onto some one-dimensional subspace, represented by a line through the origin (the data have already been centered):
Suppose we pick two different lines to project onto, as in the left and right panels: which one is better? By the variance-maximization criterion above, the left one is better, because the variance of the projected sample points is larger there (equivalently, the sum of the absolute values of the projections is larger).
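A tiny numpy illustration of this comparison; the five points and the two candidate directions are hypothetical, chosen only so that one direction is clearly better than the other:

```python
import numpy as np

# Five hypothetical, already-centered 2-D sample points.
points = np.array([[2.0, 1.9], [1.0, 1.1], [-0.5, -0.4],
                   [-1.0, -1.2], [-1.5, -1.4]])

def projection_variance(u):
    """Variance of the projections of the points onto the direction u."""
    u = u / np.linalg.norm(u)        # make u a unit vector
    return np.var(points @ u)

u_good = np.array([1.0, 1.0])        # roughly along the spread of the points
u_bad = np.array([1.0, -1.0])        # roughly perpendicular to the spread

print(projection_variance(u_good))   # large
print(projection_variance(u_bad))    # small
```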
The method for calculating projections is shown in Figure 5 below:
In the figure, the red points are the samples and the blue points are their projections onto u; u is the direction vector of the line and is a unit vector. The blue point is the projection of a sample x onto u, and its distance from the origin is <x, u> (that is, x^T u, or equivalently u^T x).
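In code, that projection length is just an inner product; a minimal sketch with a hypothetical sample x and direction u:

```python
import numpy as np

x = np.array([3.0, 1.0])          # hypothetical sample point
u = np.array([1.0, 1.0])
u = u / np.linalg.norm(u)         # the direction must be a unit vector

length = x @ u                    # <x, u> = x^T u: distance of the projection from the origin
foot = length * u                 # the projected point itself, on the line through the origin
print(length, foot)
```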
2. Least squares

We use the least-squares idea to determine the direction of each principal axis (principal component) in turn.
For a given set of data (in what follows, vectors are by default column vectors):

$$x_1, x_2, \ldots, x_m$$
its center is located at

$$\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i.$$
Center the data (i.e. move the coordinate origin to the center of the sample points):

$$x_i \leftarrow x_i - \bar{x}, \qquad i = 1, \ldots, m.$$
The centered data are most spread out along the direction of the first principal axis u1; that is, the sum of the absolute values of the projections onto u1 is maximal (equivalently, the variance of the projections is maximal). The way to compute a projection was described above: take the inner product of x with u1. Since only the direction of u1 matters, we require u1 to be a unit vector.
That is, we maximize

$$\sum_{i=1}^{m} \left| \langle x_i, u_1 \rangle \right|.$$
From matrix algebra we know that squares are more convenient to handle than absolute-value signs, so we maximize instead

$$\sum_{i=1}^{m} \langle x_i, u_1 \rangle^{2}.$$
The inner product of two vectors can be written as a matrix product:

$$\langle x_i, u_1 \rangle = x_i^{\mathsf T} u_1.$$
So the objective function can be expressed as

$$\max_{u_1}\ \sum_{i=1}^{m} \left( x_i^{\mathsf T} u_1 \right)^{2}, \qquad \|u_1\| = 1.$$
Inside the parentheses is a matrix product representing the inner product of the two vectors: the column vector is transposed into a row vector, and a row vector times a column vector gives a scalar. Since the transpose of a scalar is the scalar itself, the objective can be rewritten as

$$\max_{u_1}\ \sum_{i=1}^{m} \left( x_i^{\mathsf T} u_1 \right)^{\mathsf T} \left( x_i^{\mathsf T} u_1 \right) = \max_{u_1}\ \sum_{i=1}^{m} u_1^{\mathsf T} x_i\, x_i^{\mathsf T} u_1.$$
Removing the parentheses and pulling u1 outside the sum gives

$$\max_{u_1}\ u_1^{\mathsf T} \left( \sum_{i=1}^{m} x_i x_i^{\mathsf T} \right) u_1, \qquad \|u_1\| = 1.$$
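A quick numerical check of this identity; a sketch with hypothetical centered data, where the unit vector u is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
X -= X.mean(axis=0)                  # centered data, rows are the x_i

u = np.array([1.0, 2.0, 2.0]) / 3.0  # a unit vector (1^2 + 2^2 + 2^2 = 9)

lhs = np.sum((X @ u) ** 2)           # sum_i (x_i^T u)^2
S = X.T @ X                          # sum_i x_i x_i^T
rhs = u @ S @ u                      # u^T (sum_i x_i x_i^T) u

print(np.isclose(lhs, rhs))          # True
```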