**PCA:**

PCA has two uses: one is dimensionality reduction (which can speed up training, reduce memory consumption, and so on), the other is data visualization.

PCA is not linear regression. Linear regression guarantees that the fitted function has the least error in the y value (the vertical distance from each point to the fitted line), while PCA minimizes the projection error onto the lower-dimensional subspace (the orthogonal distance from each point to it). In addition, linear regression predicts the y value from the x values, while PCA has no target variable and treats all feature dimensions of x equally.
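The difference between the two error measures can be checked numerically. The following is a minimal sketch on synthetic 2-D data (the data, the seed, and the helper names `vertical_error` and `orthogonal_error` are illustrative assumptions, not part of the original notes). It fits a line through the origin once by least squares and once along the first principal direction.

```python
import numpy as np

# Synthetic, zero-centered 2-D data (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.5, size=200)
x = x - x.mean()
y = y - y.mean()

# Linear regression: slope that minimizes the vertical (y) error.
slope_reg = (x @ y) / (x @ x)

# PCA: slope of the first principal direction of the (x, y) cloud.
X = np.vstack([x, y])                       # shape (2, m), one column per sample
U, S, _ = np.linalg.svd(X @ X.T / X.shape[1])
slope_pca = U[1, 0] / U[0, 0]

def vertical_error(s):
    """Sum of squared vertical distances to the line y = s * x."""
    return np.sum((y - s * x) ** 2)

def orthogonal_error(s):
    """Sum of squared perpendicular distances to the line y = s * x."""
    return np.sum((y - s * x) ** 2) / (1.0 + s ** 2)

print("regression slope:", slope_reg, "PCA slope:", slope_pca)
```

Each line wins on its own criterion: the regression line has the smaller vertical error, and the PCA line has the smaller orthogonal error.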

Before using PCA, the data must be preprocessed. First, subtract from each feature dimension the mean of that dimension over all samples. Then scale the different dimensions to the same range, generally by dividing by the maximum value. Natural images are handled differently: instead of subtracting the per-dimension mean, one subtracts the mean of each image itself. In other words, the preprocessing for PCA depends on the application.
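The general preprocessing described above might look like the following sketch. It assumes samples are stored as columns of `X`, and it interprets "divide by the maximum value" as dividing each centered feature by its largest absolute value; both are assumptions made for illustration.

```python
import numpy as np

def preprocess(X):
    """Zero-center each feature, then scale it into [-1, 1].

    X has shape (n_features, m_samples), one column per sample (assumed layout).
    Returns the normalized data plus the mean and scale, so that the same
    transform can be applied to new data later.
    """
    mean = X.mean(axis=1, keepdims=True)                     # per-feature mean
    X_centered = X - mean
    scale = np.abs(X_centered).max(axis=1, keepdims=True)    # per-feature max magnitude
    scale[scale == 0] = 1.0                                  # constant features stay at 0
    return X_centered / scale, mean, scale

# Two features with very different ranges.
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [100.0, 300.0, 200.0, 400.0]])
X_norm, mean, scale = preprocess(X)
print(X_norm)
```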

Natural images are the kind of images the human eye usually sees, and they obey certain statistical regularities. In practice, any image taken with an ordinary camera and without heavy artificial editing can be called a natural image, because many algorithms are fairly robust to this kind of input. When learning from natural images, there is no need to pay much attention to variance normalization, because the statistics of every part of a natural image are similar; making the mean zero is enough. When training on other kinds of images, however, such as handwritten character recognition, variance normalization is needed.
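For natural images, the per-image mean removal mentioned above could be sketched as follows. The layout (one flattened image patch per column) and the function name `remove_per_image_mean` are assumptions for illustration.

```python
import numpy as np

def remove_per_image_mean(patches):
    """Subtract from every image patch its own mean intensity.

    patches has shape (n_pixels, m_patches), one flattened patch per column.
    """
    return patches - patches.mean(axis=0, keepdims=True)

# Random stand-in for 8x8 grayscale patches (not real image data).
rng = np.random.default_rng(0)
patches = rng.uniform(0.0, 1.0, size=(64, 10))
centered = remove_per_image_mean(patches)
```

Note that the mean is taken over the pixels of each patch (`axis=0`), not over the samples for each pixel.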

The PCA computation mainly needs two things: the direction vectors of the new, lower-dimensional basis, and the coordinates of the original samples after projection onto those directions.

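The first step is to compute the covariance matrix of the training samples, as shown below (the input data has already been mean-normalized, and m is the number of samples):

$$\Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \left(x^{(i)}\right)^T$$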

After computing the covariance matrix of the training samples, apply SVD to it. Each column of the resulting matrix U is a new direction vector for the data; the columns are ordered so that the earliest ones are the principal directions, and so on. Projecting with U^T x gives the representation z of a sample in the new basis (to reduce to k dimensions, keep only the first k columns of U), namely:

$$z = U^T x$$
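The covariance, SVD, and projection steps can be sketched in NumPy as below. The function names `pca_fit` and `pca_project`, the column-per-sample layout, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def pca_fit(X):
    """Return principal directions U (as columns) and their variances S.

    X has shape (n, m): n features, m zero-mean samples stored as columns.
    """
    m = X.shape[1]
    sigma = X @ X.T / m                 # covariance matrix, shape (n, n)
    U, S, _ = np.linalg.svd(sigma)      # S is sorted in descending order
    return U, S

def pca_project(X, U, k):
    """Project the samples onto the first k principal directions."""
    return U[:, :k].T @ X               # shape (k, m)

# Correlated 2-D data (illustrative only).
rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0],
              [1.5, 0.5]])
X = A @ rng.normal(size=(2, 500))
X = X - X.mean(axis=1, keepdims=True)   # mean-normalize first

U, S = pca_fit(X)
Z = pca_project(X, U, k=1)
```

Because the covariance matrix is symmetric and positive semidefinite, its SVD coincides with its eigendecomposition, so `S` holds the eigenvalues, that is, the variance of the data along each column of `U`.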

(Geometrically, each component of z is the signed length of the projection of the original point onto the corresponding direction, so it can be positive or negative.) This completes the two main computations of PCA. The original sample x can be recovered as U z; the recovery is exact if all directions are kept and only approximate if some were dropped.
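A small sketch of the recovery step, under the same assumptions as before (samples as columns, synthetic data, hypothetical variable names): with all directions kept the reconstruction is exact, and with only the first k kept the mean squared error equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.5, 0.5]])
X = A @ rng.normal(size=(3, 500))
X = X - X.mean(axis=1, keepdims=True)        # zero-mean samples as columns

m = X.shape[1]
U, S, _ = np.linalg.svd(X @ X.T / m)

# Keep every direction: U is orthogonal, so U @ (U.T @ X) gives X back.
X_full = U @ (U.T @ X)

# Keep only the first k directions: the reconstruction is approximate.
k = 2
Z = U[:, :k].T @ X                           # reduced representation, shape (k, m)
X_hat = U[:, :k] @ Z                         # back in the original space, shape (3, m)

mse = np.mean(np.sum((X - X_hat) ** 2, axis=0))
print("mean squared reconstruction error:", mse)
print("sum of discarded eigenvalues:     ", S[k:].sum())
```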

To use PCA for dimensionality reduction in supervised learning, take only the x values of the training samples, compute the principal component matrix U and the reduced values z, and then pair each z with the y value of the original sample to form a new training set for the classifier. At test time, reduce each new test sample with the same U that was computed on the training set, and feed the result into the trained classifier.
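The train/test workflow above could be sketched like this. The synthetic two-class data, the helper names `fit_pca`, `transform`, and `make_data`, and the nearest-centroid classifier are all stand-ins chosen for illustration; any classifier could take its place.

```python
import numpy as np

def fit_pca(X_train, k):
    """Learn the mean and the first k principal directions from training data only."""
    mean = X_train.mean(axis=1, keepdims=True)
    Xc = X_train - mean
    U, _, _ = np.linalg.svd(Xc @ Xc.T / Xc.shape[1])
    return mean, U[:, :k]

def transform(X, mean, U_k):
    """Apply the training-set mean and directions to any data set."""
    return U_k.T @ (X - mean)

def make_data(rng, m_per_class, n=5):
    """Two well-separated Gaussian classes, samples as columns (toy data)."""
    X0 = rng.normal(-5.0, 1.0, size=(n, m_per_class))
    X1 = rng.normal(+5.0, 1.0, size=(n, m_per_class))
    y = np.array([0] * m_per_class + [1] * m_per_class)
    return np.hstack([X0, X1]), y

rng = np.random.default_rng(0)
X_train, y_train = make_data(rng, 50)
X_test, y_test = make_data(rng, 20)

mean, U_k = fit_pca(X_train, k=2)            # fitted on the training x values only
Z_train = transform(X_train, mean, U_k)
Z_test = transform(X_test, mean, U_k)        # the same U is reused at test time

# Stand-in classifier: assign each test sample to the nearest class centroid.
centroids = np.stack([Z_train[:, y_train == c].mean(axis=1) for c in (0, 1)], axis=1)
dists = ((Z_test[:, :, None] - centroids[:, None, :]) ** 2).sum(axis=0)
y_pred = dists.argmin(axis=1)
accuracy = (y_pred == y_test).mean()
```

The key design point is that `fit_pca` never sees the test set: the mean and U are estimated once on the training data and then reused unchanged.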

One point to note is that PCA is not a good way to prevent overfitting. On the surface it seems that it should help: PCA reduces the dimensionality, so for the same number of training samples there are fewer features, and overfitting should be less likely. In practice, however, the effect of this approach on overfitting is small; overfitting is mainly prevented through regularization terms.

Not every ML algorithm needs PCA for dimensionality reduction. PCA is worth using only when the original training samples do not meet some requirement, such as model training speed, memory size, or the wish to visualize the data. If none of these matter, PCA is not necessarily needed.

**Whitening:**

The goal of whitening is to remove the correlation between the features of the data; it is a preprocessing step for many algorithms. For example, when training on image data, adjacent pixel values are correlated, so much of the information is redundant. Whitening can be used to remove this correlation. Whitened data must meet two conditions: first, the correlation between different features is minimal, close to 0; second, all features have equal variance (not necessarily 1). Common whitening operations include PCA whitening and ZCA whitening.

PCA whitening starts from the data x after PCA has transformed it to z. The dimensions of z are uncorrelated with each other, which satisfies the first condition of whitening. It is then only necessary to divide each dimension of z by its standard deviation so that every dimension has variance 1, that is, equal variance. With λ_i denoting the variance of the i-th dimension of z (the i-th eigenvalue of the covariance matrix), the formula is:

$$x_{\mathrm{PCAwhite},i} = \frac{z_i}{\sqrt{\lambda_i}}$$
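PCA whitening can be sketched as follows. The small constant `eps` added under the square root is a common numerical safeguard against division by near-zero eigenvalues; it is an assumption here and not part of the formula in these notes. The function name and data are likewise illustrative.

```python
import numpy as np

def pca_whiten(X, eps=1e-5):
    """Rotate zero-mean data into the PCA basis and rescale to unit variance.

    X has shape (n, m), samples as columns. eps is an assumed regularizer.
    """
    m = X.shape[1]
    U, S, _ = np.linalg.svd(X @ X.T / m)     # S[i] is the variance along U[:, i]
    Z = U.T @ X                              # uncorrelated coordinates
    return Z / np.sqrt(S + eps)[:, None]     # divide each row by its std deviation

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.5, 0.5]])
X = A @ rng.normal(size=(3, 500))
X = X - X.mean(axis=1, keepdims=True)

X_white = pca_whiten(X)
cov_white = X_white @ X_white.T / X.shape[1]  # should be close to the identity
```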

ZCA whitening also first transforms the data x to z with PCA, but without reducing the dimensionality: all components are kept. This again satisfies the first condition of whitening, since the features are uncorrelated with each other. Then the same rescaling to variance 1 is applied, and the result is left-multiplied by the eigenvector matrix U.

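The ZCA whitening formula, where x_PCAwhite is the PCA-whitened data with all dimensions kept, is:

$$x_{\mathrm{ZCAwhite}} = U \, x_{\mathrm{PCAwhite}}$$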
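A corresponding sketch for ZCA whitening, again with an assumed `eps` regularizer, an illustrative function name, and synthetic data. It builds the whole transform as a single matrix `W`, so that the whitened data is `W @ X`.

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten zero-mean data X of shape (n, m), samples as columns.

    Returns the whitened data and the whitening matrix W.
    eps is an assumed regularizer for small eigenvalues.
    """
    m = X.shape[1]
    U, S, _ = np.linalg.svd(X @ X.T / m)
    # Rotate into the PCA basis, rescale to unit variance, rotate back with U.
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T
    return W @ X, W

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.5, 0.5]])
X = A @ rng.normal(size=(3, 500))
X = X - X.mean(axis=1, keepdims=True)

X_zca, W = zca_whiten(X)
cov_zca = X_zca @ X_zca.T / X.shape[1]        # should be close to the identity
```

Rotating back with U keeps the whitened data in the original coordinate system, which is why ZCA-whitened images still look like the input images.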
