Principal Component Analysis (PCA) is a classic and simple machine learning algorithm whose main purpose is to explain most of the variation in the original data using fewer variables. The idea is to convert many highly correlated variables into mutually uncorrelated new variables, and to select a set of new variables, fewer in number than the originals, that explains most of the variation in the data, thereby achieving dimensionality reduction. We first derive the principle and theory behind the PCA algorithm.
A principal component is a new variable obtained as a linear combination of the original variables. Let $x_i\ (i=1,2,\dots,p)$ denote the original variables and $y_i\ (i=1,2,\dots,p)$ the principal components; the relationship between them is as follows:

$$y_i = u_{i1}x_1 + u_{i2}x_2 + \cdots + u_{ip}x_p, \quad i = 1, 2, \dots, p$$
Here $u_{ij}$ is the coefficient linking the $i$-th principal component $y_i$ to the $j$-th original variable $x_j$, and $y_1, y_2, \dots, y_p$ are the first, second, ..., $p$-th principal components. The coefficients $u_{11}, \dots, u_{1p}$ are chosen so that the linear combination $y_1$ has the maximum ability to explain the variation in the data; then $u_{21}, \dots, u_{2p}$ are chosen so that $y_2$ has the maximum explanatory power for the part of the variation in the original variables not already explained by $y_1$, and so on, until the $p$-th principal component is obtained. For the purpose of dimensionality reduction we usually keep only the first $m$ of these $p$ components, with $m$ as small as possible while the total explanatory power still exceeds 80%. It is worth noting that the resulting principal components are mutually uncorrelated.
Let $y = a_1x_1 + a_2x_2 + \cdots + a_px_p = a'x$, where $a = (a_1, a_2, \dots, a_p)'$ and $x = (x_1, x_2, \dots, x_p)'$. Finding a principal component means finding the linear function $a'x$ of $x$ that maximizes the corresponding variance $\mathrm{var}(a'x) = a'\Sigma a$ subject to $a'a = 1$ (so that $a$ is unique), where $\Sigma$ is the covariance matrix of $x$.

Derivation:
Based on the properties of real symmetric matrices (every real symmetric matrix can be decomposed in terms of real eigenvalues and an orthonormal set of real eigenvectors), for any real symmetric matrix $A$ there is

$$A = Q \Lambda Q'$$

where $Q$ is the matrix whose columns are the eigenvectors of $A$ and $\Lambda$ is the diagonal matrix with the eigenvalues of $A$ on its diagonal; note that each eigenvalue here corresponds one-to-one to an eigenvector column in $Q$. Returning to PCA with this property: since the covariance matrix $\Sigma$ of $x$ is a real symmetric matrix, let the eigenvalues of $\Sigma$ be $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$ with corresponding unit eigenvectors $u_1, u_2, \dots, u_p$, and let $U = (u_1, u_2, \dots, u_p)$. Then $U'U = UU' = I$, that is, $U$ is an orthogonal matrix, and for any unit vector $a$:

$$\mathrm{var}(a'x) = a'\Sigma a = a'U\Lambda U'a = \sum_{i=1}^{p} \lambda_i (u_i'a)^2 \le \lambda_1 \sum_{i=1}^{p} (u_i'a)^2 = \lambda_1 a'a = \lambda_1$$
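To make the decomposition concrete, here is a minimal NumPy sketch (the 3×3 matrix below is made up purely for illustration) verifying that the eigendecomposition of a symmetric matrix reconstructs it and that the eigenvector matrix is orthogonal:

import numpy as np

# A small symmetric matrix standing in for a covariance matrix (made-up values)
sigma = np.array([[4.0, 2.0, 0.6],
                  [2.0, 3.0, 0.5],
                  [0.6, 0.5, 1.0]])

# eigh is specialized for symmetric matrices: it returns real eigenvalues in
# ascending order and the corresponding orthonormal eigenvectors as columns
eigvals, U = np.linalg.eigh(sigma)

# U is orthogonal: U'U = I
assert np.allclose(U.T @ U, np.eye(3))

# The spectral decomposition reconstructs sigma: sigma = U diag(lambda) U'
assert np.allclose(U @ np.diag(eigvals) @ U.T, sigma)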
When taking $a = u_1$:

$$\mathrm{var}(u_1'x) = u_1'\Sigma u_1 = \lambda_1 u_1'u_1 = \lambda_1$$

so the upper bound $\lambda_1$ is attained. Thus $y_1 = u_1'x$ is the first principal component, and its variance is:

$$\mathrm{var}(y_1) = \lambda_1$$

Similarly:

$$y_k = u_k'x, \quad \mathrm{var}(y_k) = \lambda_k, \quad \mathrm{cov}(y_k, y_l) = u_k'\Sigma u_l = \lambda_l\, u_k'u_l = 0 \quad (k \ne l)$$
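As a tiny numerical check of this conclusion (again with a made-up covariance matrix), the variance $a'\Sigma a$ achieved by the top eigenvector equals the largest eigenvalue and is not beaten by random unit vectors:

import numpy as np

rng = np.random.default_rng(42)

# Made-up symmetric covariance matrix
sigma = np.array([[5.0, 1.5, 0.4],
                  [1.5, 2.0, 0.3],
                  [0.4, 0.3, 1.0]])

# var(a'x) = a' Sigma a for a unit vector a
def variance(a):
    a = a / np.linalg.norm(a)
    return a @ sigma @ a

# The top eigenvector attains the maximum, equal to the largest eigenvalue
eigvals, U = np.linalg.eigh(sigma)       # eigenvalues in ascending order
top = variance(U[:, -1])

# No random unit vector does better
best_random = max(variance(rng.normal(size=3)) for _ in range(10000))
print(top, eigvals[-1], best_random)     # top == eigvals[-1] >= best_random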
Through the above derivation, we can solve for each principal component using the covariance matrix of the original variables. After all principal components have been computed, we select among them. Since the principal components are directly tied to the covariance matrix of the original variables, we define the variance contribution rate of the $k$-th principal component $y_k$ as:

$$\eta_k = \frac{\lambda_k}{\sum_{i=1}^{p} \lambda_i}$$

The selection process takes principal components starting from the largest contribution rate, until the cumulative contribution rate meets the requirement, as the short sketch below shows.
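A minimal sketch of this selection rule in NumPy (the eigenvalues below are made-up illustrative numbers, and alpha plays the role of the cumulative contribution threshold):

import numpy as np

# Made-up eigenvalues of a correlation matrix, i.e. the lambda_k above
eigvals = np.array([3.2, 1.4, 0.3, 0.1])

# Variance contribution rate of each component: lambda_k / sum(lambda_i)
contrib = eigvals / eigvals.sum()

# Take components from the largest contribution downwards until the
# cumulative contribution rate reaches the threshold alpha
alpha = 0.8
order = np.argsort(contrib)[::-1]           # indices by contribution, descending
cumulative = np.cumsum(contrib[order])
m = np.searchsorted(cumulative, alpha) + 1  # number of components to keep
selected = order[:m]
print(selected, cumulative[m - 1])          # [0 1] 0.92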
Then define the principal component loadings (called factor loadings in factor analysis); for standardized variables:

$$a_{ij} = \rho(y_i, x_j) = \sqrt{\lambda_i}\, u_{ij}$$

That is, $a_{ij}$ is the correlation coefficient between the $i$-th principal component and the $j$-th original variable, and the matrix $A = (a_{ij})$ is called the factor loading matrix. In practice $a_{ij}$ is used instead of $u_{ij}$ as the principal component coefficient, because it is a standardized coefficient and can reflect the magnitude of each variable's influence.
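A minimal sketch of computing the loading matrix from the eigenvalues and eigenvectors of a correlation matrix (the data here are randomly generated purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))           # made-up data: 200 samples, 4 variables

corr = np.corrcoef(X, rowvar=False)     # correlation matrix of the variables
eigvals, U = np.linalg.eigh(corr)       # ascending eigenvalues, eigenvectors as columns
eigvals, U = eigvals[::-1], U[:, ::-1]  # reorder to descending

# Loading a_ij = sqrt(lambda_i) * u_ij: the correlation between
# component i and variable j
loadings = U * np.sqrt(eigvals)         # scale column i by sqrt(lambda_i)
print(loadings.T)                       # row i = loadings of the i-th component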
Having covered the main principles of principal component analysis, we now implement the process in Python with our own function:
Python
Build a custom PCA algorithm using the NumPy and sklearn packages (apart from standardization and solving for eigenvalues and eigenvectors, everything is implemented by the custom function):
import numpy as np
import pandas as pd
from sklearn import preprocessing

'''Read data'''
original_data = pd.read_csv(r'C:\Users\windows\Desktop\kaggle\A\wine_red.csv', encoding='ANSI')

'''Data preprocessing'''
data = np.asarray(original_data.iloc[:, 4:])

class My_PCA():
    def __init__(self):
        print('self-made PCA algorithm')

    '''Extract principal components from the input data set, given a cumulative contribution rate threshold'''
    def PCA(self, data, alpha=0.8):
        '''Data standardization'''
        scaler = preprocessing.StandardScaler().fit(data)
        X = scaler.transform(data).astype(dtype='float32')

        '''Calculate the correlation coefficient matrix (columns are the variables)'''
        cor = np.corrcoef(X, rowvar=False)

        '''Calculate the eigenvalues and corresponding eigenvectors (as columns) of the correlation matrix'''
        eigvalue, eigvector = np.linalg.eig(cor)
        eigvalue = eigvalue.astype(dtype='float32')
        eigvector = eigvector.astype(dtype='float32')

        '''Calculate the variance contribution rate of each principal component'''
        contribute = [eigvalue[i] / np.sum(eigvalue) for i in range(len(eigvalue))]

        '''Save the positions of the eigenvalues after sorting (ascending)'''
        sort = np.argsort(contribute)

        '''Extract the required principal components based on the cumulative contribution rate threshold alpha'''
        pca = []
        token = 0
        i = 1
        while token <= alpha:
            token = token + contribute[sort[len(eigvalue) - i]]
            pca.append(sort[len(eigvalue) - i])
            i += 1

        '''Save the eigenvalue and eigenvector corresponding to each selected principal component as the return value'''
        PCA_eig = {}
        for i in range(len(pca)):
            PCA_eig['principal component {}'.format(i + 1)] = [eigvalue[pca[i]], eigvector[:, pca[i]]]
        return PCA_eig

'''Assign the class of the algorithm to a custom variable'''
test = My_PCA()

'''Call the PCA algorithm in the class to produce the eigenvalues and eigenvectors of the required principal components'''
pca = test.PCA(data)

'''Show the eigenvalue and eigenvector corresponding to the largest principal component'''
pca['principal component 1']
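As a quick sanity check (not part of the original function), the eigenvalues returned above can be compared with the explained variances reported by sklearn's own PCA on the same standardized data; this is only a rough comparison, since sklearn divides by n-1 when estimating variances:

from sklearn.decomposition import PCA as SkPCA
from sklearn import preprocessing

# Standardize the same data used above and fit sklearn's PCA for comparison
X = preprocessing.StandardScaler().fit_transform(data)
sk = SkPCA().fit(X)

# These should be close to the eigenvalues and contribution rates computed
# by My_PCA (up to the n vs. n-1 denominator)
print(sk.explained_variance_[:3])
print(sk.explained_variance_ratio_[:3])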
The results for the first principal component are as follows:

The above covers the principle of the PCA algorithm and a self-coded implementation of it. In the next article we will take a close look at the mature third-party PCA functions available in Python and R respectively; stay tuned.