Principal Component Analysis
A multivariate statistical method first proposed by Pearson in 1901 and later developed by Hotelling (1933).
The principal components reveal the largest individual differences; they can also reduce the number of variables used in regression analysis and clustering analysis.
Either the sample covariance matrix or the correlation coefficient matrix can be used as the starting point of the analysis.
Retention of components: Kaiser's criterion (1960) discards components with eigenvalues less than 1 and retains only those with eigenvalues greater than 1.
If no more than 3-5 components can explain 80% of the variation, the analysis is considered successful.
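As a concrete illustration of these two retention rules (a minimal sketch, not from the original course materials; the function name and the random data are made up for the example), one could compute them in Python as:

```python
import numpy as np

def choose_num_components(X, variance_target=0.80):
    """Count components kept by Kaiser's criterion and by a cumulative-variance rule.
    X is an (n samples x p variables) data matrix."""
    # Work from the correlation matrix so all variables are on a comparable scale.
    R = np.corrcoef(X, rowvar=False)
    eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]

    # Kaiser (1960): keep only components whose eigenvalue exceeds 1.
    kaiser_k = int(np.sum(eigenvalues > 1))

    # Cumulative-variance rule: smallest k explaining at least variance_target of the total.
    explained = np.cumsum(eigenvalues) / eigenvalues.sum()
    variance_k = int(np.searchsorted(explained, variance_target) + 1)

    return kaiser_k, variance_k

rng = np.random.default_rng(0)      # illustrative random data only
X = rng.normal(size=(200, 8))
print(choose_num_components(X))
```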
The optimized indicators are obtained as linear combinations of the original variables.
Computation over the original many indicators is reduced to a few optimized indicators (which account for most of the total variation).
The basic idea: regroup the many originally correlated indicators into a new set of mutually independent composite indicators that replace the original ones.
The intuitive geometric meaning of principal component analysis
The mathematical model of principal component analysis
Using matrix notation, the idea of principal component analysis can ultimately be turned into a linear algebra problem.
It reduces to diagonalizing the covariance matrix (solving for its eigenvalues).
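To make the linear algebra concrete, here is a minimal Python sketch (illustrative only, not the course program) of principal component analysis carried out by diagonalizing the sample covariance matrix:

```python
import numpy as np

def pca(X, k):
    """Project X (n samples x p variables) onto its first k principal components."""
    X_centered = X - X.mean(axis=0)
    S = np.cov(X_centered, rowvar=False)       # sample covariance matrix (p x p)
    # Diagonalize the symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]          # eigh returns ascending order; sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scores: coordinates of the samples on the new principal component axes.
    scores = X_centered @ eigvecs[:, :k]
    return scores, eigvals

rng = np.random.default_rng(1)                 # illustrative random data only
X = rng.normal(size=(100, 5))
scores, eigvals = pca(X, k=2)
print(scores.shape, eigvals)
```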
Factor Analysis
A dimensionality reduction method that generalizes and extends principal component analysis.
It is a statistical model used to analyze the factors behind surface phenomena: it attempts to describe each component of the original observations as a linear combination of the smallest possible number of unobservable common factors plus a specific factor.
Example: Academic achievement (mathematical ability, language ability, transport ability, etc.)
Example: Life satisfaction (job satisfaction, family satisfaction)
Example: Shiry book P522
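In symbols, the orthogonal factor model sketched above is usually written as follows (a standard textbook form; the notation may differ slightly from the Shiry book):

```latex
X_i = \mu_i + a_{i1} F_1 + a_{i2} F_2 + \cdots + a_{im} F_m + \varepsilon_i, \qquad i = 1, \dots, p
```

where F_1, ..., F_m are the unobservable common factors, a_{ij} are the factor loadings, \varepsilon_i is the specific factor of the i-th variable, and m is much smaller than p.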
Main uses of factor analysis
Reduce the number of analysis variables
By examining the correlations between variables, the original variables are grouped: variables with high correlation are placed in the same group, and each group is replaced by a common factor.
Make the meaning of the business factors behind the problem clearer.
The difference from principal component analysis
Principal component analysis focuses on "variability": it converts the original variables into new combined variables that maximize the "variance" of the data, thereby maximizing the differences between individual samples.
However, the resulting principal components are often difficult to interpret from the perspective of business scenarios.
Factor analysis pays more attention to the "common variation" of related variables and groups strongly correlated original variables together.
The goal is to find the few key factors working behind the scenes; the results of factor analysis tend to be easier to interpret with business knowledge.
Factor analysis uses a more complex mathematical approach.
Its mathematical model is more complex than that of principal component analysis.
Methods for solving the model: the principal component method, the principal factor method, and the maximum likelihood method.
The result can also be improved by factor rotation to make the business meaning more obvious.
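As a concrete illustration (an assumption on tooling, not part of the original slides: this uses scikit-learn, whose FactorAnalysis estimator accepts a rotation argument from version 0.24 onward), fitting a small factor model and applying a varimax rotation might look like:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6))       # placeholder data; substitute your own variables

# Fit a 2-factor model, then apply a varimax rotation so the loadings
# are easier to interpret in business terms.
fa = FactorAnalysis(n_components=2, rotation="varimax")
scores = fa.fit_transform(X)        # factor scores for each sample

print(fa.components_)               # rotated loading matrix (factors x variables)
print(fa.noise_variance_)           # specific (unique) variances of the variables
```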
Maximum Likelihood Method
Likelihood function
Maximum likelihood function
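Under the usual assumption that the observations are multivariate normal with covariance AA^T + D, the log-likelihood maximized over A and D takes the standard form below (a sketch; the notation may differ slightly from the Shiry book):

```latex
\ell(A, D) = -\frac{n}{2}\left[\, p \ln(2\pi) + \ln\left|AA^{\top} + D\right| + \operatorname{tr}\!\left((AA^{\top} + D)^{-1} S\right) \right]
```

where S is the sample covariance matrix of the n observations and the mean has been replaced by the sample mean.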
Algorithm description (Shiry book p533)
Principal Component Method
Estimate the expectation and covariance matrix from the samples.
Find the eigenvalues and eigenvectors of the covariance matrix.
Omit the parts with smaller eigenvalues to obtain A and D.
Program
Example
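The original program and example are not reproduced here; a minimal Python sketch of the three steps above (estimate the covariance matrix, eigendecompose it, keep the m largest eigenpairs to form A and D) might look like:

```python
import numpy as np

def pc_factor_loadings(X, m):
    """Principal component method: estimate the loading matrix A and the
    specific variances D from a sample X (n samples x p variables)."""
    S = np.cov(X, rowvar=False)                  # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]            # sort eigenpairs descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Keep the m largest eigenpairs; the j-th column of A is sqrt(lambda_j) * e_j.
    A = eigvecs[:, :m] * np.sqrt(eigvals[:m])

    # Specific variances: the part of each diagonal entry the m factors fail to explain.
    D = np.diag(S) - np.sum(A**2, axis=1)
    return A, D

rng = np.random.default_rng(3)                   # illustrative random data only
X = rng.normal(size=(150, 5))
A, D = pc_factor_loadings(X, m=2)
print(A)
print(D)
```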
Principal Factor Method
First, standardize the variables
Give an estimate (initial value) of m and of the specific variances.
Find the reduced correlation matrix R* (a p×p matrix).
Calculate the eigenvalues and eigenvectors of R*, take the first m, and omit the rest.
Find A* and D*, then iterate the calculation.
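A hedged Python sketch of this iteration (the initial communalities here are squared multiple correlations, which is only one common choice and may differ from the Shiry book's initialization):

```python
import numpy as np

def principal_factor(X, m, n_iter=50, tol=1e-6):
    """Iterated principal factor method on standardized variables.
    Returns the loading matrix A and the specific variances D."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # step 1: standardize
    R = np.corrcoef(Z, rowvar=False)

    # Step 2: initial communalities via squared multiple correlations.
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))

    for _ in range(n_iter):
        # Step 3: reduced correlation matrix R* (communalities on the diagonal).
        R_star = R.copy()
        np.fill_diagonal(R_star, h2)

        # Step 4: keep the m largest eigenpairs of R*.
        eigvals, eigvecs = np.linalg.eigh(R_star)
        order = np.argsort(eigvals)[::-1][:m]
        lam, V = np.clip(eigvals[order], 0.0, None), eigvecs[:, order]
        A = V * np.sqrt(lam)

        # Step 5: update the communalities and iterate until they stabilize.
        h2_new = np.sum(A**2, axis=1)
        if np.max(np.abs(h2_new - h2)) < tol:
            h2 = h2_new
            break
        h2 = h2_new

    D = 1.0 - h2         # specific variances of the standardized variables
    return A, D

rng = np.random.default_rng(4)                   # illustrative random data only
X = rng.normal(size=(200, 6))
A, D = principal_factor(X, m=2)
print(np.round(A, 3))
print(np.round(D, 3))
```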
Machine Learning, Week 4 --- Smelting Numbers into Gold --- Dimensionality Reduction Techniques