This article is based on two references of the same name, A Tutorial on Principal Component Analysis [1][2].
PCA, or principal component analysis, is mainly used for dimensionality reduction of features. When the number of features in the data is very large, we can assume that only some of the features are truly interesting and meaningful, while the remaining features are either noise or redundant with other features. The process of finding the meaningful features among all features is dimensionality reduction, and PCA is one of the two main dimensionality reduction methods (the other being LDA).
The Jonathon Shlens paper presents an example of measuring spring vibration in physics under ideal conditions; for details, see [1].
First, we look at how, given a matrix of data records, to compute its principal components and use them to obtain the dimensionality-reduced data matrix. Next, we introduce the principles behind this computation. Finally, we give an example of implementing PCA in Python and an example of calling the PCA algorithm in Weka.
1. Calculation Process
Suppose we have n data records, each of which is m-dimensional. We can represent these data records as an n * m matrix A.
Calculate the average value of each column of A. From each element of A, subtract the average of its column to obtain a new matrix B.
Compute the matrix Z = B^T * B / (n - 1). This m * m matrix Z is the covariance matrix of A.
Calculate the eigenvalues D and eigenvectors V of Z, where D is a 1 * m matrix (a vector) and V is an m * m matrix. Each element of D is an eigenvalue of Z, and column i of V is the eigenvector corresponding to the i-th eigenvalue.
Next, we can perform dimensionality reduction. To reduce the data from m dimensions to k dimensions, we select the k largest eigenvalues from D and take the corresponding k eigenvectors from V to form a new m * k matrix N.
Each column of N is a principal component. Compute C = B * N (projecting the mean-centered data, as in the Python implementation in Section 3) to obtain the n * k matrix C, which is the dimensionality-reduced data: the dimension of each data record is reduced from m to k. A minimal numpy sketch of these steps follows.
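This sketch is for illustration only: the small data matrix and the choice k = 2 are made up, and it is separate from the article's own implementation in Section 3.

import numpy as np

# Made-up example: n = 5 records, m = 3 features
A = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 1.1],
              [1.9, 2.2, 3.0],
              [3.1, 3.0, 2.7]])
n, m = A.shape
k = 2                              # target dimension

B = A - A.mean(axis=0)             # step 2: subtract the column means
Z = B.T.dot(B) / (n - 1)           # step 3: m x m covariance matrix
D, V = np.linalg.eigh(Z)           # step 4: eigenvalues D, eigenvectors V (eigh, since Z is symmetric)

idx = np.argsort(D)[::-1][:k]      # step 5: indices of the k largest eigenvalues
N = V[:, idx]                      # m x k matrix of principal components
C = B.dot(N)                       # n x k dimensionality-reduced data
print(C)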
2. Principles
The main reasons for dimensionality reduction are that the data contains noise, that the data axes (basis) need to be rotated, and that the data is redundant.
(1) Noise
Consider a two-dimensional plot of the recorded spring vibration data. We find that the variance of the data along the +45 degree direction is relatively large, while the variance along the -45 degree direction is relatively small. Generally, the direction with the largest variance is taken to carry the information we are interested in, so we keep the data in the +45 degree direction, and the data in the -45 degree direction can be treated as noise.
(2) Rotation
In linear algebra, the coordinates of the same data differ under different bases, and we generally want the basis directions to align with the directions of maximum data variance. In the spring example, the basis should therefore not be the X or Y axis, but those axes rotated clockwise by 45 degrees.
(3) Redundancy
Figures a and c represent data with no redundancy and with high redundancy, respectively. In a, the x- and y-coordinates of a data point are essentially independent: we cannot use one of them to infer the other. In c, the data lies almost on a straight line, so one coordinate can easily be deduced from the other; we do not need to record the x coordinate, only the value on the y axis. Data redundancy is similar to noise: we only need to record the coordinates in the direction with large variance, and the coordinates in the direction with small variance can be treated as redundancy (or noise).
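As a quick, self-contained illustration of the redundancy case (the data below is synthetic and not from the referenced papers), we can generate two almost linearly dependent coordinates and check how the variance splits between the two principal directions:

import numpy as np

np.random.seed(0)
# Synthetic, highly redundant 2-D data: the second coordinate is almost a linear function of the first
x = np.random.uniform(0, 10, 200)
y = 0.5 * x + np.random.normal(scale=0.05, size=200)
data = np.column_stack([x, y])

cov = np.cov(data, rowvar=False)        # 2 x 2 covariance matrix
eigvals = np.linalg.eigvalsh(cov)       # eigenvalues in ascending order
print(eigvals / eigvals.sum())          # nearly all variance lies along one direction

Here one eigenvalue accounts for almost all of the total variance, so recording a single coordinate along that direction loses very little information.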
In all three cases above, what we ultimately want is to find the directions (basis vectors) with relatively large variance, and then obtain the coordinates of the data under that basis. This process can be expressed as:
PX = Y.
The k * m matrix P is an orthogonal matrix, and each row of P is a basis vector along a direction of relatively large variance. Each column of the m * n matrix X, and each column of the k * n matrix Y, is one data record (this differs from the convention in Section 1 because the two papers use different representations; they are equivalent in essence).
X is the original data, P is the new basis, and Y is the coordinates of X under the new basis P. Note that the dimension of each data record in Y is reduced from m to k, which completes the dimensionality reduction.
But what kind of Y do we want? We want the variance of the coordinates along each basis direction in Y to be as large as possible, while the covariance between coordinates along different basis directions should be as small as possible (ideally zero). In other words, we want C_Y = Y * Y^T / (n - 1) to be a diagonal matrix.
C_Y = Y * Y^T / (n - 1) = P * (X * X^T) * P^T / (n - 1)
Let A = X * X^T (note that this A is not the data matrix of Section 1). Since A is symmetric, it can be decomposed as A = E * D * E^T, where the columns of E are the orthonormal eigenvectors of A and D is the diagonal matrix of its eigenvalues.
If we take P = E^T, then C_Y = E^T * A * E / (n - 1) = E^T * E * D * E^T * E / (n - 1). Because E^T = E^(-1) (E is orthogonal), C_Y = D / (n - 1), which is a diagonal matrix.
Therefore, if we take the rows of P to be the eigenvectors of A, the resulting Y has the desired properties.
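A small numerical check of this argument (using synthetic data, not taken from the papers): choosing the rows of P to be the eigenvectors of X * X^T makes the covariance of Y = P * X diagonal.

import numpy as np

np.random.seed(1)
m, n = 3, 100
# Synthetic data: each column of X is one (mean-centered) m-dimensional record
X = np.random.randn(m, n)
X = X - X.mean(axis=1, keepdims=True)

A = X.dot(X.T)                  # A = X * X^T, symmetric
D, E = np.linalg.eigh(A)        # A = E * diag(D) * E^T with orthonormal columns in E
P = E.T                         # rows of P are the eigenvectors of A

Y = P.dot(X)
C_Y = Y.dot(Y.T) / (n - 1)
print(np.round(C_Y, 6))         # off-diagonal entries are numerically zero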
3. Implementation
1. Python Implementation of PCA
Using the Python scientific computing module numpy:
import numpy as np

mat = [(2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 1.5, 1.1, 2.4),
       (0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9)]
# Transpose so that each row is one data record
data = np.matrix(np.transpose(mat))
n = data.shape[0]
# Subtract the column means to center the data
mean = data.mean(axis=0)
data_adjust = data - mean
# Covariance matrix of the centered data
covariance = np.transpose(data_adjust) * data_adjust / (n - 1)
# Eigenvalues and eigenvectors of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eig(covariance)
feature_vectors = np.transpose(eigenvectors)
# Transformed data: each column is one record expressed in the new basis
final_data = feature_vectors * np.transpose(data_adjust)
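Note that np.linalg.eig does not guarantee that the eigenvalues come back sorted, and the snippet above projects onto all of the eigenvectors rather than only the largest k, so it rotates the data without actually reducing its dimension. To reduce to k dimensions, sort the eigenvalues and keep only the eigenvectors of the k largest ones, as in the sketch in Section 1.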
2. Calling PCA in Weka (the script below is Jython):
import java.io.FileReader as FileReader
import java.io.File as File
import weka.filters.unsupervised.attribute.PrincipalComponents as PCA
import weka.core.Instances as Instances
import weka.core.converters.CSVLoader as CSVLoader
import weka.filters.Filter as Filter

def main():
    # Use the cpu.arff dataset that ships with Weka
    reader = FileReader('data/cpu.arff')
    data = Instances(reader)
    pca = PCA()
    pca.setInputFormat(data)
    pca.setMaximumAttributes(5)
    newData = Filter.useFilter(data, pca)
    for n in range(newData.numInstances()):
        print newData.instance(n)

if __name__ == '__main__':
    main()
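Because this example is Jython (Python syntax running on the JVM), it can import Java and Weka classes directly; running it requires a Jython interpreter with weka.jar on the Java classpath, and the path 'data/cpu.arff' assumes Weka's bundled data directory is available relative to where the script is run.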
References:
[1] Jonathon Shlens. A Tutorial on Principal Component Analysis.
[2] Lindsay I Smith. A Tutorial on Principal Component Analysis.
[3] A Learning Summary of PCA Algorithms.
[4] PCA Principle Analysis.
[5] Principal Component Analysis (PCA) Theory Analysis and Application.