Machine Learning Algorithm - PCA Dimensionality Reduction
I. Introduction
The problems we encounter in real-world data analysis are usually high-dimensional. When we actually analyze the data, we do not use all of the features to train the algorithm; instead we pick out the features that we believe may affect the target. For example, in the Titanic survival prediction problem we treat the passenger's name as useless information, which is easy to justify intuitively. But there may also be strong correlations between some features. For instance, when studying the development of a region, we might choose the region's GDP and per-capita consumption as two measures. There is obviously a strong correlation between the two, since both describe the economic situation of the region, so can we combine them into a single feature? This can be achieved by reducing the feature dimension, which also helps avoid the overfitting caused by having too many features.
II. PCA Dimensionality Reduction Technology
There are roughly three common methods for reducing the dimensionality of data: principal component analysis (PCA), factor analysis, and independent component analysis (ICA). Since principal component analysis is the most widely used of the three, we discuss only PCA in depth.
2.1 The Idea of the PCA Algorithm
In principal component analysis, the original coordinate system is transformed into a new coordinate system through a coordinate change. The choice of the new axes depends on the original data: the first axis is chosen along the direction of greatest variance in the original data, the second axis is chosen to be orthogonal to the first while having the largest remaining variance, and this repeats until the number of new axes equals the feature dimension of the data. It turns out that most of the variance is contained in the first few new axes, so the remaining axes can be ignored and the data is thereby reduced to a lower dimension.
2.2 How the PCA Algorithm Is Computed
Let us start by introducing the PCA calculation procedure; afterwards we will analyze the PCA algorithm from a mathematical point of view.
The PCA calculation process is as follows:
Step 1: Remove the mean of each feature dimension so that the center of the data moves to the origin.
Step 2: Compute the covariance matrix.
Step 3: Compute the eigenvalues and eigenvectors of the covariance matrix.
Step 4: Sort the eigenvalues from largest to smallest and take the eigenvectors of the top N eigenvalues.
Step 5: Transform the original data into the new space constructed from those N eigenvectors.
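As a quick preview of the full function given in Section 2.4, the five steps map onto NumPy roughly as follows. This is only a minimal sketch; the names `pca_steps`, `data_mat`, and `top_n` are illustrative assumptions, not part of the original text:

```python
import numpy as np

def pca_steps(data_mat, top_n):
    # data_mat: (m samples) x (d features); top_n: target dimensionality
    mean_vals = data_mat.mean(axis=0)             # Step 1: per-feature mean
    mean_removed = data_mat - mean_vals           # Step 1: center the data at the origin
    cov_mat = np.cov(mean_removed, rowvar=False)  # Step 2: covariance matrix
    eig_vals, eig_vects = np.linalg.eig(cov_mat)  # Step 3: eigenvalues and eigenvectors
    order = np.argsort(-eig_vals)[:top_n]         # Step 4: indices of the top N eigenvalues
    top_vects = eig_vects[:, order]               # Step 4: the corresponding eigenvectors
    return mean_removed @ top_vects               # Step 5: project into the new N-dim space
```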
With these five steps we can reduce the raw data to whatever dimensionality we want. The main idea of principal component analysis is based on maximal variance and minimal dimensionality, so we can determine the value of N by computing the cumulative contribution rate, defined as

η(n) = (λ(1) + λ(2) + ... + λ(n)) / (λ(1) + λ(2) + ... + λ(d)),

where λ(i) denotes the eigenvalue corresponding to the i-th dimension, with the eigenvalues sorted from largest to smallest, and d is the original feature dimension. We can set an appropriate threshold, typically 0.8: if the cumulative contribution of the first n eigenvalues reaches that threshold, we can assume that n dimensions carry the primary information of the original data.
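For concreteness, here is a small sketch of how the threshold test described above could be coded. The 0.8 threshold comes from the text; the function name `choose_n` and the example eigenvalues are illustrative assumptions:

```python
import numpy as np

def choose_n(eig_vals, threshold=0.8):
    """Return the smallest n whose cumulative contribution rate reaches the threshold."""
    sorted_vals = np.sort(eig_vals)[::-1]                 # eigenvalues from largest to smallest
    ratios = np.cumsum(sorted_vals) / sorted_vals.sum()   # cumulative contribution rate
    return int(np.argmax(ratios >= threshold)) + 1        # first position reaching the threshold

# Example: eigenvalues 5, 3, 1, 1 -> the first two already explain 8/10 = 0.8 of the variance
print(choose_n(np.array([5.0, 3.0, 1.0, 1.0])))           # prints 2
```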
2.3 The Mathematical Principles Behind PCA
In the previous section we described how PCA is computed; now we discuss why it is computed this way. In signal processing, the signal is considered to have large variance while the noise has small variance, and the signal-to-noise ratio (SNR) is the ratio of the signal variance to the noise variance. The higher the signal-to-noise ratio, the better the data represent the underlying signal. After a coordinate transformation we can compute the variance of each dimension in the new coordinate system; if the variance along some axis is very small, we can regard that dimension as noise or disturbance. So the best coordinate transformation is the one for which, after the transformation, each retained dimension of the N-dimensional feature has a large variance.
Figure 1: A new coordinate direction u (blue bold line) and the projection of a mean-removed sample onto it.
The blue bold line shown in Figure 1 represents one dimension (one direction) of the new coordinate system. Let x(i) denote the i-th sample after the mean has been removed, and let u be the unit direction vector of that dimension; the projection u^T x(i) is then the distance of the sample from the origin along that direction. What we need to do is find the direction that maximizes the variance of the data projected onto it. Because the data have mean 0 after the mean-removal step, it is easy to show that the mean of their projections in any direction is also 0, so the variance along the direction u is

(1/m) Σ_i (u^T x(i))^2 = u^T [ (1/m) Σ_i x(i) x(i)^T ] u = u^T C u.

The matrix C in the middle of this expression is exactly the covariance matrix of the sample features, so the quantity to maximize can be written as u^T C u. Maximizing it subject to the constraint that u is a unit vector leads to the condition C u = λ u: by the definition of eigenvalues, λ is an eigenvalue of C, u is the corresponding eigenvector, and the variance along u equals λ. The best projection direction is therefore the eigenvector corresponding to the largest eigenvalue, the second best is the eigenvector of the second-largest eigenvalue, and so on; the size of an eigenvalue equals the variance of the mean-removed data along its eigenvector. So we only need to decompose the covariance matrix to obtain the eigenvectors corresponding to the first n eigenvalues, and these n eigenvectors are orthogonal to each other. The original data can then be converted into the new n-dimensional representation by

Y = (X - μ) W,

where X is the original data matrix, μ is the row vector of feature means, and W is the matrix whose columns are the top n eigenvectors.
By selecting the largest n eigenvalues so that their cumulative contribution rate reaches a chosen value, we can discard the low-variance features and thereby achieve the goal of dimensionality reduction.
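The claim that the variance of the projected data along an eigenvector equals its eigenvalue can be checked numerically with a few lines of NumPy. This is a minimal sketch on randomly generated data; nothing in it comes from the original text:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.array([[3.0, 0.5, 0.1],
                                          [0.0, 1.0, 0.2],
                                          [0.0, 0.0, 0.3]])  # correlated 3-D data
Xc = X - X.mean(axis=0)                  # remove the mean
C = np.cov(Xc, rowvar=False)             # covariance matrix
eig_vals, eig_vects = np.linalg.eigh(C)  # eigh: symmetric matrix, eigenvalues in ascending order
u = eig_vects[:, -1]                     # eigenvector of the largest eigenvalue
proj_var = np.var(Xc @ u, ddof=1)        # variance of the data projected onto u
print(proj_var, eig_vals[-1])            # the two numbers agree
```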
2.4 Python Implementation of PCA
The PCA function is defined as follows:
from numpy import mean, cov, argsort, linalg, mat

def pca(dataMat, minRatio):
    meanVals = mean(dataMat, axis=0)
    meanRemoved = dataMat - meanVals              # remove the mean
    covMat = cov(meanRemoved, rowvar=0)           # compute the covariance matrix
    eigVals, eigVects = linalg.eig(mat(covMat))   # eigenvalues and eigenvectors
    eigValInd = argsort(-eigVals)                 # indices sorted from largest to smallest eigenvalue
    totalVar = eigVals.sum()
    ratio = 0.0
    topNfeat = 1
    for i in range(len(eigVals)):
        index = eigValInd[i]
        ratio += eigVals[index] / totalVar        # accumulate the contribution rate
        topNfeat = i + 1
        if ratio > minRatio:
            break
    eigValInd = eigValInd[:topNfeat]              # keep the indices of the top N eigenvalues
    redEigVects = eigVects[:, eigValInd]          # the corresponding eigenvectors
    lowDDataMat = mat(meanRemoved) * redEigVects  # project the data into the new space
    return topNfeat, lowDDataMat
The function takes the original data matrix and a threshold for the cumulative contribution rate as inputs. The mean is removed first, and the covariance matrix of the mean-removed data is then computed with the cov() function provided by the NumPy library. The eigenvalues and eigenvectors of the covariance matrix are computed by the linalg.eig() function. Calling argsort(-eigVals) is equivalent to sorting eigVals in descending order, returning the sorted index values. The loop is used to determine the dimensionality of the new space, and lowDDataMat is the data transformed into that space. Finally, the function returns the dimensionality of the new space, topNfeat, together with the transformed data.
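As a usage sketch of the pca() function defined above: the synthetic data and the 0.8 threshold below are illustrative assumptions, chosen only so that a couple of directions carry most of the variance:

```python
import numpy as np

# 100 samples with 5 correlated features (synthetic data just for illustration)
rng = np.random.default_rng(42)
base = rng.normal(size=(100, 2))
data = np.hstack([base,
                  base @ rng.normal(size=(2, 3)) + 0.01 * rng.normal(size=(100, 3))])

top_n, low_d_data = pca(np.mat(data), 0.8)
print("dimensions kept:", top_n)           # expected to be small: only 2 directions carry real variance
print("reduced shape:", low_d_data.shape)  # (100, top_n)
```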
III. Application of PCA Technology
3.1 Problem Description
We continue with the sonar echo detection problem used earlier: an object is probed from different directions, and based on the returned echo we determine whether the object is a rock or a mine. The dataset contains a total of 208 samples, each with a dimensionality of 60.
3.2 Dimensionality Reduction with PCA
We have already used this example in the AdaBoost article; this time we apply the AdaBoost algorithm again, but to the data after PCA dimensionality reduction, and measure the test accuracy. The test accuracy and running time are as follows.
| Cumulative contribution threshold P | Number of dimensions after reduction N | Average test accuracy R | Average program running time t |
| --- | --- | --- | --- |
| 0.80 | 7 | 0.27 | 1.44 |
| 0.85 | 9 | 0.26 | 1.84 |
| 0.90 | 12 | 0.28 | 2.40 |
| 0.95 | 17 | 0.30 | 3.01 |
| 0.98 | 24 | 0.21 | 5.10 |
| 1.00 | 60 | 0.28 | 5.70 |
The results show that if the cumulative contribution threshold is set to 0.9, the new space has only 12 dimensions, far fewer than the original 60, so the dimensionality of the data is greatly reduced. On the same dataset, the accuracy of the algorithm is almost unchanged after dimensionality reduction. This shows that dimensionality reduction is an effective means of data processing.
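A minimal sketch of how such an experiment might be reproduced is given below. The file names, the use of scikit-learn's AdaBoostClassifier, and the train/test split are all assumptions for illustration, not the author's original setup; pca() is the function defined in Section 2.4:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# X: (208, 60) feature matrix, y: class labels -- assumed to be loaded from local copies
# of the sonar dataset (hypothetical file names)
X = np.loadtxt("sonar_features.csv", delimiter=",")
y = np.loadtxt("sonar_labels.csv", dtype=str)

top_n, X_low = pca(np.mat(X), 0.9)          # reduce with the pca() function defined above
X_low = np.real(np.asarray(X_low))          # drop any negligible imaginary parts from eig

X_train, X_test, y_train, y_test = train_test_split(X_low, y, test_size=0.3, random_state=0)
clf = AdaBoostClassifier(n_estimators=50).fit(X_train, y_train)
print("dimensions kept:", top_n)
print("test accuracy:", clf.score(X_test, y_test))
```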
IV. Summary
In this paper, the principle of PCA dimensionality reduction has been analyzed from the angle of variance theory, and the accuracy of the AdaBoost algorithm before and after dimensionality reduction has been compared. On this dataset the accuracy after dimensionality reduction does not change much, while the running time is greatly shortened. Of course, I have run similar experiments on other data and found that for some datasets the algorithm becomes significantly more efficient, while for others it does not. Therefore dimensionality reduction is not a cure-all; whether it is suitable has to be judged from the actual dataset.