Introduction
Principal Component Analysis (PCA) is a dimensionality reduction algorithm that can greatly speed up unsupervised feature learning. More importantly, understanding PCA is a great help for implementing the whitening algorithm, which many algorithms use as a preprocessing step.
Suppose you are training your algorithm on images. Because adjacent pixels in an image are highly correlated, the input data is somewhat redundant. Concretely, if the 16x16 grayscale image patches we train on are recorded as 256-dimensional vectors $x \in \Re^{256}$, where each feature $x_j$ corresponds to the intensity of one pixel, then because of the correlation between adjacent pixels, PCA can convert the input to an approximation of much lower dimension while incurring very little error.
Example and Mathematical Background
In our example, the input dataset is $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$ with $n = 2$ dimensional inputs, so that $x^{(i)} \in \Re^2$. Suppose we want to reduce the data from 2 dimensions to 1. (In practice, we might need to reduce data from 256 dimensions to 50, say; using low-dimensional data here simply lets us visualize the behavior of the algorithm better.) Here is our dataset:
This data has already been preprocessed so that the features $x_1$ and $x_2$ have roughly the same mean (zero) and variance.
For the purpose of illustration, we have colored each point one of three colors depending on its $x_1$ value; the colors are not used by the algorithm and are for display only.
PCA will look for a lower-dimensional subspace onto which to project our data. From visually examining the data, $u_1$ appears to be the principal direction of variation, and $u_2$ the secondary direction.
In other words, the data varies much more in the direction $u_1$ than in the direction $u_2$. To find the directions $u_1$ and $u_2$ more formally, we first compute the matrix $\Sigma$ as follows: $\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)})(x^{(i)})^T$.
If $x$ has zero mean, then $\Sigma$ is exactly the covariance matrix of $x$. (The symbol $\Sigma$, read "Sigma", is the standard notation for the covariance matrix. Although it looks like the summation symbol, they are two different concepts.)
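As a minimal NumPy sketch (not part of the original tutorial; the array layout and names are assumed), the covariance matrix of zero-mean data stored as columns of an array can be computed as:

```python
import numpy as np

def covariance_matrix(X):
    """Sigma = (1/m) * sum_i x^(i) (x^(i))^T for zero-mean data.

    X is assumed to be an (n, m) array whose columns are the m training
    examples x^(i); names here are illustrative, not from the tutorial.
    """
    m = X.shape[1]
    return (X @ X.T) / m
```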
It can be shown that $u_1$, the principal direction of variation of the data, is the principal (top) eigenvector of the covariance matrix $\Sigma$, and $u_2$ is the second eigenvector.
Note: If you are interested in a formal mathematical derivation of this result, see the CS229 (Machine Learning) lecture notes on PCA (link at the bottom of this page). You won't need it to follow along with this lesson, however.
You can obtain these eigenvectors using standard numerical linear algebra software (see the implementation notes). Concretely, we compute the eigenvectors of the covariance matrix $\Sigma$ and stack them in columns to form the matrix $U$: $U = [\, u_1 \; u_2 \; \cdots \; u_n \,]$.
Here, $u_1$ is the principal eigenvector (corresponding to the largest eigenvalue), $u_2$ is the second eigenvector, and so on. We also write $\lambda_1, \lambda_2, \ldots, \lambda_n$ for the corresponding eigenvalues.
In this example, the vectors $u_1$ and $u_2$ form a new basis that can be used to represent the data. Concretely, for a training example $x \in \Re^2$, the value $u_1^T x$ is the length (magnitude) of the projection of $x$ onto $u_1$; similarly, $u_2^T x$ is the magnitude of $x$ projected onto $u_2$.
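As an illustrative sketch (NumPy, names assumed), the eigenvectors of $\Sigma$ can be obtained with a standard eigendecomposition; sorting by eigenvalue puts the principal eigenvector in the first column:

```python
import numpy as np

def pca_basis(sigma):
    """Return (U, lam): eigenvectors of sigma stacked in columns and the
    eigenvalues, sorted so that column 0 is the principal direction.

    sigma is the covariance matrix; np.linalg.eigh is used because sigma
    is symmetric. Names here are illustrative, not from the tutorial.
    """
    lam, U = np.linalg.eigh(sigma)      # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1]       # re-sort into descending order
    return U[:, order], lam[order]
```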
Rotate Data
At this point, we can represent $x$ in the $(u_1, u_2)$ basis as: $x_{\rm rot} = U^T x = \begin{bmatrix} u_1^T x \\ u_2^T x \end{bmatrix}$
(The subscript "rot" comes from the word "rotation", meaning that this is the result of the original data being rotated (or can be said to be mapped))
Rotating each example in the training set, i.e. computing $x_{\rm rot}^{(i)} = U^T x^{(i)}$ for every $i$, and plotting the transformed data, we get:
This is the training set rotated into the $u_1, u_2$ basis. In the general case, $U^T x$ is the training data rotated into the basis $u_1, u_2, \ldots, u_n$. The matrix $U$ is orthogonal, i.e. it satisfies $U^T U = U U^T = I$, so if you ever need to recover the original data $x$ from the rotated vector $x_{\rm rot}$, you can left-multiply by $U$: $x = U x_{\rm rot}$. Check it: $U x_{\rm rot} = U U^T x = x$.
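A small sketch of the rotation and its inverse, assuming X is an (n, m) array of examples in columns and U comes from the pca_basis sketch above (both names assumed, not from the tutorial):

```python
import numpy as np

def rotate(U, X):
    """Rotate the data into the u_1, ..., u_n basis: x_rot = U^T x."""
    return U.T @ X

def unrotate(U, X_rot):
    """Recover the original data: x = U x_rot (valid because U is orthogonal)."""
    return U @ X_rot
```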
Data Dimension Reduction
The principal direction of variation of the data is the first dimension $x_{\rm rot,1}$ of the rotated data. Therefore, if we want to reduce the data to one dimension, we can set: $\tilde{x}^{(i)} = x_{\rm rot,1}^{(i)} = u_1^T x^{(i)} \in \Re$.
More generally, if $x \in \Re^n$ and we want to reduce it to a $k$-dimensional representation $\tilde{x} \in \Re^k$ (where $k < n$), we simply take the first $k$ components of $x_{\rm rot}$, which correspond to the top $k$ directions of variation.
Another way of explaining PCA: $x_{\rm rot}$ is an $n$-dimensional vector whose first few components are likely to be large (for example, in the example above, $x_{\rm rot,1}^{(i)} = u_1^T x^{(i)}$ takes fairly large values for most examples $i$), while the later components are likely to be small (for example, $x_{\rm rot,2}^{(i)} = u_2^T x^{(i)}$ is more likely to be small).
What PCA does is drop the later (smaller) components of $x_{\rm rot}$, i.e. approximate their values with zero. Specifically, our definition of $\tilde{x}$ can also be obtained by taking $x_{\rm rot}$ and setting all but its first $k$ components to zero, giving: $\tilde{x} = \begin{bmatrix} x_{\rm rot,1} \\ \vdots \\ x_{\rm rot,k} \\ 0 \\ \vdots \\ 0 \end{bmatrix} \approx x_{\rm rot}$
In our example, this gives the following plot of $\tilde{x}$ (using $n = 2$, $k = 1$):
However, since the last $n - k$ components defined above are always zero, there is no need to keep these zeros around, so we define $\tilde{x}$ as a $k$-dimensional vector containing only the first $k$ (non-zero) components.
This also explains why we wanted to express the data in the $u_1, u_2, \ldots, u_n$ basis: deciding which components to keep becomes simply keeping the first $k$ components. We also say that we "retain the top $k$ PCA (principal) components".
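As a brief sketch (NumPy, layout and names assumed as in the earlier snippets), reducing to $k$ dimensions just keeps the first $k$ rows of the rotated data:

```python
import numpy as np

def reduce_dimension(U, X, k):
    """Keep the top-k PCA components: x_tilde = first k rows of U^T x.

    U has the eigenvectors of Sigma in columns (principal first); X is an
    (n, m) array of examples in columns. Names are illustrative only.
    """
    return U[:, :k].T @ X          # shape (k, m)
```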
Restore Approximate Data
Now $\tilde{x} \in \Re^k$ is a low-dimensional, "compressed" representation of the original $x \in \Re^n$. Conversely, given $\tilde{x}$, how do we recover an approximation $\hat{x}$ to the original data? From the earlier section we know that $x = U x_{\rm rot}$. Further, $\tilde{x}$ can be thought of as an approximation to $x_{\rm rot}$ in which the last $n - k$ elements have been set to zero. So given $\tilde{x} \in \Re^k$, we can pad it with $n - k$ zeros to obtain an approximation to $x_{\rm rot} \in \Re^n$, and finally left-multiply by $U$ to approximately recover the original data. Specifically, the calculation is as follows: $\hat{x} = U \begin{bmatrix} \tilde{x}_1 \\ \vdots \\ \tilde{x}_k \\ 0 \\ \vdots \\ 0 \end{bmatrix} = \sum_{i=1}^{k} u_i \tilde{x}_i$
The final equality above follows from the earlier definition of $U$. In a practical implementation we do not actually zero-pad $\tilde{x}$ and then left-multiply by $U$, since that would mean multiplying many things by zero; instead we simply multiply $\tilde{x} \in \Re^k$ by the first $k$ columns of $U$, as in the right-most expression above. Applying this to the dataset in our example gives the following plot of the reconstructed data:
As the graph shows, we get a one-dimensional approximate reconstruction of the original data set.
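A sketch of the reconstruction under the same assumed array layout and names as above (illustrative only):

```python
import numpy as np

def recover(U, X_tilde):
    """Approximately recover the data: x_hat = U[:, :k] @ x_tilde.

    X_tilde is the (k, m) reduced representation from reduce_dimension;
    only the first k columns of U are used, which avoids zero padding.
    """
    k = X_tilde.shape[0]
    return U[:, :k] @ X_tilde      # shape (n, m)
```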
When training an autoencoder or another unsupervised feature learning algorithm, the running time depends on the dimension of the input data. Feeding $\tilde{x} \in \Re^k$ to the learning algorithm instead of $x$ means training on lower-dimensional input, so the algorithm can run significantly faster. For many datasets, the low-dimensional representation $\tilde{x}$ is a very good approximation of the original data, so using PCA in these situations is appropriate: it introduces a small approximation error while significantly speeding up your algorithm.
Select the number of principal components
How do we choose $k$, i.e. how many principal components to retain? In this simple two-dimensional example, keeping 1 of the 2 components seemed a natural choice, but for high-dimensional data the decision is not as simple: if $k$ is too large, the data is not compressed much; in the limit $k = n$, we are simply using the original data (just rotated into a different basis). Conversely, if $k$ is too small, the approximation error may be very large.
To decide how to set $k$, we usually look at the percentage of variance retained for different values of $k$. Specifically, if $k = n$, we get an exact approximation of the data and say that 100% of the variance is retained, i.e. all of the variation of the original data is preserved; conversely, if $k = 0$, we are approximating the input with the zero vector, and only 0% of the variance is retained.
More generally, let $\lambda_1, \lambda_2, \ldots, \lambda_n$ be the eigenvalues of $\Sigma$ arranged in decreasing order, so that $\lambda_j$ is the eigenvalue corresponding to the eigenvector $u_j$. Then if we retain the first $k$ principal components, the percentage of variance retained is: $\frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{n} \lambda_j}$
In the simple two-dimensional example above, $\lambda_1 = 7.29$ and $\lambda_2 = 0.69$. So if we keep $k = 1$ principal component, we retain $7.29 / (7.29 + 0.69) = 0.913$, i.e. 91.3% of the variance.
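A minimal sketch of this calculation (NumPy; the function name is assumed), using the eigenvalues quoted above:

```python
import numpy as np

def variance_retained(lam, k):
    """Fraction of variance retained by the top-k components.

    lam is the array of eigenvalues of Sigma, sorted in decreasing order.
    """
    return np.sum(lam[:k]) / np.sum(lam)

lam = np.array([7.29, 0.69])        # eigenvalues from the 2D example
print(variance_retained(lam, 1))    # ~0.913, i.e. 91.3% of the variance
```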
A more formal definition of the percentage of variance retained is beyond the scope of this tutorial, but it can be shown that $\lambda_j = \frac{1}{m} \sum_{i=1}^{m} \left(x_{\rm rot,j}^{(i)}\right)^2$. Therefore, if $\lambda_j \approx 0$, then $x_{\rm rot,j}$ is essentially close to 0 anyway, and little is lost by approximating it with a constant 0. This also explains why we retain the leading principal components (those with larger $\lambda_j$) rather than the trailing ones: the leading components are more variable and take on larger values, and setting them to 0 would introduce a large approximation error.
In the case of image data, one common rule of thumb is to choose $k$ so as to retain 99% of the variance; in other words, we select the smallest value of $k$ that satisfies: $\frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{n} \lambda_j} \geq 0.99$
For other applications, if you do not mind introducing a slightly larger error, values in the 90-98% range are also sometimes used. When you describe to others how you applied PCA, saying that you chose $k$ to retain 95% of the variance is easier to understand than saying that you retained the first 120 (or whatever number of) principal components.
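A brief sketch of this selection rule (NumPy; function name assumed), picking the smallest $k$ whose cumulative eigenvalue share reaches the chosen threshold:

```python
import numpy as np

def choose_k(lam, target=0.99):
    """Smallest k whose top-k eigenvalues retain at least `target` variance.

    lam holds the eigenvalues of Sigma in decreasing order; the default
    threshold 0.99 follows the 99% rule of thumb mentioned above.
    """
    cumulative = np.cumsum(lam) / np.sum(lam)
    return int(np.searchsorted(cumulative, target) + 1)
```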
Application of the PCA Algorithm to Image Data
For the PCA algorithm to work well, we usually want all features to have a similar range of values (and a mean close to 0). If you have used PCA in other applications, you may therefore have preprocessed each feature separately, estimating its mean and variance and rescaling it to zero mean and unit variance. However, for most types of images we do not need such preprocessing. Suppose we train the algorithm on natural images, so that each feature $x_j$ is the value of pixel $j$. By "natural images" we informally mean the kind of image that a person or animal might see over their lifetime.
Note: Usually we use outdoor scene images containing vegetation and similar content, and randomly crop small image patches (say 16x16 pixels) from them to train the algorithm. In practice, most feature learning algorithms are not sensitive to the exact type of training images, so most images taken with an ordinary camera can be used, as long as they are not excessively blurry or contain strange artifacts.
When training on natural images, it makes little sense to estimate a separate mean and variance for each pixel, because the statistics of one part of an image should be the same as any other part; this property of images is called stationarity.
Specifically, for PCA to work well we usually require: (1) the features have approximately zero mean, and (2) the different features have variances similar to each other. For natural images, condition (2) is naturally satisfied even without variance normalization, so we do not perform any variance normalization (for audio data such as spectrograms, or text data such as bag-of-words vectors, we usually do not perform variance normalization either). In fact, PCA is invariant to the scaling of the input data: no matter how much the input values are scaled up (or down), the returned eigenvectors do not change. More formally, if every feature vector $x$ is multiplied by the same positive number (i.e. every feature of every training example is scaled by the same factor), PCA's output eigenvectors will not change.
Since we are not performing variance normalization, the only normalization needed is mean normalization, which ensures that the features have a mean around 0. In most cases, depending on the application, we are not interested in the overall brightness of the input image. For example, in object recognition tasks, the overall brightness of the image does not affect which objects are present in it. More formally, we are not interested in the mean intensity value of an image patch, so we can subtract this value as a form of mean normalization.
Concretely, if $x^{(i)} \in \Re^{n}$ are the grayscale intensity values of a 16x16 image patch ($n = 256$), we can zero-mean each image $x^{(i)}$ as follows:

$\mu^{(i)} := \frac{1}{n} \sum_{j=1}^{n} x_j^{(i)}$

$x_j^{(i)} := x_j^{(i)} - \mu^{(i)}$, for all $j$
Please note: (1) the two steps above are performed separately for each image patch $x^{(i)}$, and (2) $\mu^{(i)}$ here is the mean intensity of image patch $x^{(i)}$. In particular, this is not the same thing as estimating a mean value separately for each pixel $x_j$: the two are completely different concepts.
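A minimal sketch of this per-patch mean normalization (NumPy; array layout and names assumed, not from the tutorial):

```python
import numpy as np

def zero_mean_per_patch(X):
    """Subtract each image patch's own mean intensity from that patch.

    X is assumed to be an (n, m) array whose columns are image patches
    (e.g. n = 256 for 16x16 patches); each column is normalized with its
    own mean, not with a per-pixel mean computed across patches.
    """
    mu = X.mean(axis=0, keepdims=True)   # mean intensity of each patch
    return X - mu
```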
If the images you are working with are not natural images (for example, handwritten characters, or a single object centered on a white background), other normalization schemes are worth considering, and which is most appropriate depends on the specific application. For natural images, however, the per-image zero-mean normalization described above is a reasonable default.
Reprint: http://ufldl.stanford.edu/wiki/index.php/%E4%B8%BB%E6%88%90%E5%88%86%E5%88%86%E6%9E%90