R in Action Reading Notes (19)
Chapter 14 Principal Components and Factor Analysis
Content of this Chapter
Principal Component Analysis
Exploratory Factor Analysis
Other latent variable models
Principal components analysis (PCA) is a data-dimension-reduction technique that converts a large number of correlated variables into a much smaller set of uncorrelated variables called principal components. Exploratory factor analysis (EFA) is a family of methods used to uncover the latent structure of a set of variables: it explains the correlations among the observed, explicit variables by searching for a smaller set of latent, hidden constructs.
Differences between PCA and EFA Models
The principal components (PC1 and PC2) are linear combinations of the observed variables (X1 to X5). The weights of the linear combinations are chosen to maximize the variance explained by each principal component, while keeping the principal components uncorrelated with one another. In contrast, the factors (F1 and F2) are treated as the structural basis or "cause" of the observed variables rather than as linear combinations of them. The errors (e1 to e5) represent the variance in the observed variables that the factors cannot explain. The circles in the figure indicate that the factors and errors cannot be observed directly but are inferred from the relationships among the variables.
14.1 Principal Components and Factor Analysis in R
Useful factor-analysis functions in the psych package:

Function        Description
principal()     Principal components analysis with optional rotation methods
fa()            Factor analysis by principal axis, minimum residual, weighted least squares, or maximum likelihood
fa.parallel()   Scree plot with parallel analysis
factor.plot()   Plot the results of a factor or principal components analysis
fa.diagram()    Graph the loading matrix of a factor or principal components analysis
scree()         Scree plot for factor and principal components analysis
The most common steps are as follows (a sketch of the whole workflow appears after this list):
(1) Prepare the data. Both PCA and EFA derive their solutions from the correlations among the observed variables. You can pass either the raw data matrix or a correlation matrix to the principal() and fa() functions; if raw data are supplied, the correlation matrix is computed automatically. Make sure there are no missing values in the data beforehand.
(2) Select a factor model. Decide whether PCA or EFA better fits your research goals. If you choose EFA, you also need to pick a method of estimating the factor model (for example, maximum likelihood).
(3) Decide how many components/factors to extract.
(4) Extract the components/factors.
(5) Rotate the components/factors.
(6) Interpret the results.
(7) Compute the component or factor scores.
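Below is a minimal sketch of this workflow, using the USJudgeRatings data analyzed later in the chapter. It is illustrative only: the functions are the psych functions listed above, and nfactors=1 anticipates the result of step (3).

library(psych)
dat <- USJudgeRatings[, -1]             # (1) raw data, no missing values
fa.parallel(dat, fa="pc", n.iter=100)   # (3) how many components to keep?
pc <- principal(dat, nfactors=1,        # (4)/(5) extract; rotation is moot
                scores=TRUE)            #     with a single component
pc                                      # (6) interpret the loadings
head(pc$scores)                         # (7) the component scores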
14.2 Principal Component Analysis
The goal of PCA is to replace a large number of correlated variables with a smaller set of uncorrelated variables while retaining as much of the original variables' information as possible. These derived variables, called principal components, are linear combinations of the observed variables. For example, the first principal component is PC1 = a1X1 + a2X2 + ... + akXk, the weighted combination of the k observed variables that explains the largest amount of variance in the original variable set. The second principal component is likewise a linear combination of the original variables; it explains the second-largest amount of variance and is orthogonal to (uncorrelated with) the first. Every subsequent component maximizes the variance it explains while remaining orthogonal to all earlier components. As an example, the USJudgeRatings dataset contains lawyers' ratings of judges: 43 observations on 12 variables (a base-R sketch of the eigenvalue machinery behind PCA follows the variable table below).
Variable  Description
CONT      Number of contacts between lawyer and judge
INTG      Judicial integrity
DMNR      Demeanor
DILG      Diligence
CFMG      Case flow management
DECI      Decision efficiency
PREP      Preparation for trial
FAMI      Familiarity with the law
ORAL      Soundness of oral rulings
WRIT      Soundness of written rulings
PHYS      Physical ability
RTEN      Worthiness of retention
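Before turning to psych, here is a minimal base-R sketch of the idea behind PCA: the components are the eigenvectors of the correlation matrix, ordered by the variance (eigenvalue) each explains. Only the built-in USJudgeRatings data are assumed; the numbers in the comments anticipate the output in 14.2.2.

R <- cor(USJudgeRatings[, -1])   # 11 x 11 correlation matrix (CONT dropped)
e <- eigen(R)                    # eigenvalues and eigenvectors of R
e$values[1]                      # variance explained by PC1 (about 10.1)
e$values[1] / sum(e$values)      # proportion of total variance (about 0.92)
e$vectors[, 1]                   # the weights a1..ak of the first combination

The loadings that principal() reports are these eigenvector weights rescaled by the square root of the eigenvalue, which is why they can be read as correlations between the variables and the component.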
14.2.1 Determining the Number of Principal Components
Several criteria can guide how many principal components to retain in PCA:
basing the decision on prior experience and theory;
setting a threshold on the cumulative proportion of variance explained;
examining the eigenvalues of the k x k correlation matrix among the variables.
The fa.parallel() function evaluates all three eigenvalue criteria at once: the eigenvalue-greater-than-1 rule, the scree test, and parallel analysis.
> fa.parallel(USJudgeRatings[,-1], fa="pc", n.iter=100,
+   show.legend=FALSE, main="Scree plot with parallel analysis")
Assessing the number of principal components to retain for the US judges' ratings. The scree plot (the line with x symbols), the eigenvalue-greater-than-1 criterion (the horizontal line), and parallel analysis with 100 simulations (the dashed line) all suggest retaining a single principal component, which keeps most of the information in the dataset.
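For intuition, parallel analysis retains a component only when its eigenvalue exceeds the average eigenvalue obtained from random data of the same size. A hedged re-creation of that comparison (the seed and the plain rnorm() noise are illustrative assumptions; fa.parallel() is the authoritative implementation):

set.seed(1234)                   # assumed seed, for reproducibility only
obs <- eigen(cor(USJudgeRatings[, -1]))$values
sim <- replicate(100, eigen(cor(matrix(rnorm(43 * 11), 43, 11)))$values)
obs > rowMeans(sim)              # TRUE only for the first component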
14.2.2 Extracting Principal Components
The principal() function performs a principal components analysis starting from either a raw data matrix or a correlation matrix. Its format is: principal(r, nfactors=, rotate=, scores=)
r is the correlation matrix or the raw data matrix;
nfactors sets the number of principal components to extract (1 by default);
rotate specifies the rotation method (varimax, maximum-variance rotation, by default);
scores determines whether to compute principal component scores (not computed by default).
> pc <- principal(USJudgeRatings[,-1], nfactors=1)
> pc
Principal Components Analysis
Call: principal(r = USJudgeRatings[, -1], nfactors = 1)
Standardized loadings (pattern matrix) based upon correlation matrix
      PC1   h2     u2
INTG 0.92 0.84 0.1565
DMNR 0.91 0.83 0.1663
DILG 0.97 0.94 0.0613
CFMG 0.96 0.93 0.0720
DECI 0.96 0.92 0.0763
PREP 0.98 0.97 0.0299
FAMI 0.98 0.95 0.0469
ORAL 1.00 0.99 0.0091
WRIT 0.99 0.98 0.0196
PHYS 0.89 0.80 0.2013
RTEN 0.99 0.97 0.0275

                 PC1
SS loadings    10.13
Proportion Var  0.92
Because PCA works only with a correlation matrix, the raw data are automatically converted to a correlation matrix before the components are extracted. The PC1 column contains the component loadings, the correlations of the observed variables with the principal component. (Had more than one component been extracted, there would also be PC2, PC3 columns, and so on.) Component loadings are used to interpret the meaning of the components. Here the first principal component (PC1) is highly correlated with every variable, so it behaves as a single dimension of overall evaluation.
The h2 column gives the component communalities, the proportion of each variable's variance explained by the components. The u2 column gives the component uniquenesses, the proportion of variance the components cannot explain (1 - h2). For example, 80% of the variance in physical ability (PHYS) is accounted for by the first principal component and 20% is not; PHYS is the variable least well represented by this one-component solution. The SS loadings row contains the eigenvalue associated with each component, the standardized variance attributed to that component (here, 10.13 for the first component). Finally, the Proportion Var row shows how much of the whole dataset each principal component explains: the first component accounts for 92% of the variance in the 11 variables.
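A small sketch of how these quantities fit together in the one-component case, using the pc object just created (equalities hold up to rounding):

L <- pc$loadings[, 1]   # the PC1 column of loadings
head(L^2)               # h2: communality = squared loading
head(1 - L^2)           # u2: uniqueness = 1 - h2
sum(L^2)                # SS loadings: the eigenvalue, 10.13
sum(L^2) / 11           # Proportion Var: 0.92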
14.2.3 Principal Component Rotation
Rotation is a set of mathematical techniques for transforming the component loading matrix into one that is easier to interpret; it "purifies" the components as much as possible. Rotation methods come in two flavors: those that keep the selected components uncorrelated (orthogonal rotation) and those that allow them to become correlated (oblique rotation). The methods also differ in how they define purification. The most popular orthogonal rotation is varimax, which tries to purify the columns of the loading matrix so that each component is defined by only a limited set of variables (that is, each column has a few large loadings and many very small ones).
Principal components analysis with varimax (maximum-variance) rotation:
> rc <- principal(Harman23.cor$cov, nfactors=2, rotate="varimax")
> rc
Principal Components Analysis
Call: principal(r = Harman23.cor$cov, nfactors = 2, rotate = "varimax")
Standardized loadings (pattern matrix) based upon correlation matrix
                RC1  RC2   h2    u2
height         0.90 0.25 0.88 0.123
arm.span       0.93 0.19 0.90 0.097
forearm        0.92 0.16 0.87 0.128
lower.leg      0.90 0.22 0.86 0.139
weight         0.26 0.88 0.85 0.150
bitro.diameter 0.19 0.84 0.74 0.261
chest.girth    0.11 0.84 0.72 0.283
chest.width    0.26 0.75 0.62 0.375

                RC1  RC2
SS loadings    3.52 2.92
Proportion Var 0.44 0.37
Cumulative Var 0.44 0.81
Looking at the loadings in the RC1 column, you can see that the first component is defined primarily by the first four variables (the length variables). The loadings in the RC2 column indicate that the second component is defined primarily by variables 5 through 8 (the girth variables). The cumulative variance explained by the two components is unchanged by the rotation (81%); what changes is how that variance is distributed between them (component 1 goes from 58% to 44%, component 2 from 22% to 37%). The components now explain similar amounts of variance, and strictly speaking they should be called components rather than principal components, because the defining property of principal components (each one singly accounting for the maximum remaining variance) no longer holds after rotation.
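A quick check that rotation merely redistributes the explained variance, comparing the varimax solution rc above with an unrotated fit (pc2 is an illustrative name; the approximate unrotated values are expectations, not output copied from the book):

pc2 <- principal(Harman23.cor$cov, nfactors=2, rotate="none")
colSums(unclass(pc2$loadings)^2)   # unrotated SS loadings (about 4.7 and 1.8)
colSums(unclass(rc$loadings)^2)    # rotated SS loadings (3.52 and 2.92)
# both pairs sum to the same total: 81% of the 8 variables' variance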
14.2.4 Obtaining Principal Component Scores
Obtain component scores from raw data
> library(psych)
> pc <- principal(USJudgeRatings[,-1], nfactors=1, score=TRUE)
> head(pc$scores)
                     PC1
AARONSON,L.H.  -0.1857981
ALEXANDER,J.M.  0.7469865
ARMENTANO,A.J.  0.0704772
BERDON,R.I.     1.1358765
BRACKEN,J.J.   -2.1586211
BURNS,E.B.      0.7669406
When scores=TRUE, the principal component scores are stored in the scores element of the object returned by principal().
You can also obtain the correlation between the number of lawyer-judge contacts (CONT) and the judges' scores:
> cor(USJudgeRatings$CONT, pc$score)
              PC1
[1,] -0.008815895
Apparently, lawyers' familiarity with a judge is unrelated to how they rate the judge.
When the input is a correlation matrix, you can obtain the scoring coefficients (weights) used to form the principal component scores:
> library(psych)
> rc <- principal(Harman23.cor$cov, nfactors=2, rotate="varimax")
> round(unclass(rc$weights), 2)
                 RC1   RC2
height          0.28 -0.05
arm.span        0.30 -0.08
forearm         0.30 -0.09
lower.leg       0.28 -0.06
weight         -0.06  0.33
bitro.diameter -0.08  0.32
chest.girth    -0.10  0.34
chest.width    -0.04  0.27
The principal component scores are then:
PC1 = 0.28*height + 0.30*arm.span + 0.30*forearm + 0.28*lower.leg - 0.06*weight - 0.08*bitro.diameter - 0.10*chest.girth - 0.04*chest.width
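Because Harman23.cor supplies only a correlation matrix, no actual scores can be produced from it; with raw data, the scores are just the standardized variables multiplied by these weights. A minimal sketch using the pc object from the USJudgeRatings example above (the agreement, up to rounding, is a property of the regression scoring method):

manual <- scale(USJudgeRatings[, -1]) %*% pc$weights
head(cbind(manual, pc$scores))   # the two score columns should match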
14.3 Exploratory Factor Analysis
The goal of EFA is to explain the correlations within a set of observed variables by uncovering a smaller set of more fundamental, unobserved variables hidden in the data. These hypothetical, unobserved variables are called factors. (Each factor is assumed to account for variance shared by two or more observed variables, so technically they are called common factors.) The model takes the form
Xi = a1F1 + a2F2 + ... + apFp + Ui

where Xi is the i-th observed variable (i = 1...k), the Fj are the common factors (j = 1...p), and p < k. Ui is the portion of Xi's variance that is unique to that variable (not explained by the common factors), and the ai can be viewed as each factor's contribution to the composite observed variable.
> options(digits=2)
> covariances <- ability.cov$cov
> correlations <- cov2cor(covariances)
> correlations
        general picture blocks maze reading vocab
general    1.00    0.47   0.55 0.34    0.58  0.51
picture    0.47    1.00   0.57 0.19    0.26  0.24
blocks     0.55    0.57   1.00 0.45    0.35  0.36
maze       0.34    0.19   0.45 1.00    0.18  0.22
reading    0.58    0.26   0.35 0.18    1.00  0.79
vocab      0.51    0.24   0.36 0.22    0.79  1.00
14.3.1 Determining the Number of Common Factors to Extract
Use the fa.parallel() function to determine the number of factors to extract:
> library(psych)
> covariances <- ability.cov$cov
> correlations <- cov2cor(covariances)
> fa.parallel(correlations, n.obs=112, fa="both", n.iter=100,
+   main="Scree plots with parallel analysis")
Assessing the number of factors to retain for the psychological tests. Results for both PCA and EFA are shown: the PCA criteria suggest one or two components, while the EFA criteria suggest two factors.
14.3.2 Extracting Common Factors
Having decided to extract two common factors, you can obtain the solution with the fa() function. Its format is: fa(r, nfactors=, n.obs=, rotate=, scores=, fm=)
r is the correlation matrix or the raw data matrix;
nfactors sets the number of factors to extract (1 by default);
n.obs is the number of observations (must be supplied when a correlation matrix is input);
rotate sets the rotation method (oblimin by default);
scores determines whether to compute factor scores (not computed by default);
fm sets the factoring method (minres, the minimum residual method, by default).
Unlike PCA, EFA offers many methods of extracting the common factors, including maximum likelihood (ml), iterated principal axis (pa), weighted least squares (wls), generalized weighted least squares (gls), and minimum residual (minres). Here the factors are extracted by unrotated iterated principal axis:
> fa <- fa(correlations, nfactors=2, rotate="none", fm="pa")
> fa
Factor Analysis using method = pa
Call: fa(r = correlations, nfactors = 2, rotate = "none", fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
         PA1   PA2   h2    u2 com
general 0.75  0.07 0.57 0.432 1.0
picture 0.52  0.32 0.38 0.623 1.7
blocks  0.75  0.52 0.83 0.166 1.8
maze    0.39  0.22 0.20 0.798 1.6
reading 0.81 -0.51 0.91 0.089 1.7
vocab   0.73 -0.39 0.69 0.313 1.5

                PA1  PA2
SS loadings    2.75 0.83
Proportion Var 0.46 0.14
Cumulative Var 0.46 0.60
The two factors explain 60% of the variance in the six psychological tests. However, this loading matrix is not very easy to interpret; rotating the factors should help.
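As a sanity check on the factor model, the loadings and uniquenesses should approximately reproduce the observed correlation matrix (R is approximately LL' + diag(u2)). A minimal sketch using the fa object just created:

L <- unclass(fa$loadings)    # 6 x 2 matrix of factor loadings
u2 <- fa$uniquenesses        # the u2 column as a vector
round(L %*% t(L) + diag(u2) - correlations, 2)   # residuals near zero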
14.3.3 Factor Rotation
Rotating the extracted factors with an orthogonal (varimax) rotation:
> fa.varimax <- fa(correlations, nfactors=2, rotate="varimax", fm="pa")
> fa.varimax
Factor Analysis using method = pa
Call: fa(r = correlations, nfactors = 2, rotate = "varimax", fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
         PA1  PA2   h2    u2 com
general 0.49 0.57 0.57 0.432 2.0
picture 0.16 0.59 0.38 0.623 1.1
blocks  0.18 0.89 0.83 0.166 1.1
maze    0.13 0.43 0.20 0.798 1.2
reading 0.93 0.20 0.91 0.089 1.1
vocab   0.80 0.23 0.69 0.313 1.2

                PA1  PA2
SS loadings    1.83 1.75
Proportion Var 0.30 0.29
Cumulative Var 0.30 0.60
The rotated solution is easier to interpret: reading and vocabulary load heavily on the first factor; picture, blocks, and maze load heavily on the second; and the nonverbal measure of general intelligence loads fairly evenly on both. This suggests one verbal intelligence factor and one nonverbal intelligence factor.
Rotating the extracted factors with an oblique (promax) rotation:
> fa.promax <- fa(correlations, nfactors=2, rotate="promax", fm="pa")
> fa.promax
Factor Analysis using method = pa
Call: fa(r = correlations, nfactors = 2, rotate = "promax", fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
          PA1   PA2   h2    u2 com
general  0.36  0.49 0.57 0.432 1.8
picture -0.04  0.64 0.38 0.623 1.0
blocks  -0.12  0.98 0.83 0.166 1.0
maze    -0.01  0.45 0.20 0.798 1.0
reading  1.01 -0.11 0.91 0.089 1.0
vocab    0.84 -0.02 0.69 0.313 1.0

                PA1  PA2
SS loadings    1.82 1.76
Proportion Var 0.30 0.29
Cumulative Var 0.30 0.60

With factor correlations of
     PA1  PA2
PA1 1.00 0.57
PA2 0.57 1.00
The results above show the difference between orthogonal and oblique rotation. With orthogonal rotation, the analysis focuses on the factor structure matrix (the correlations between variables and factors). With oblique rotation, three matrices come into play: the factor structure matrix, the factor pattern matrix, and the factor intercorrelation matrix. The factor pattern matrix is a matrix of standardized regression coefficients, giving the weights with which the factors predict the variables; the factor intercorrelation matrix contains the correlations among the factors. The factor.plot() and fa.diagram() functions can graph orthogonal or oblique solutions.
> fa.diagram(fa.promax,simple=FALSE)
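For an oblique solution, fa() stores the pattern matrix in $loadings and the factor intercorrelations in $Phi; the structure matrix is not printed, but it is simply their product. A minimal sketch:

P <- unclass(fa.promax$loadings)   # factor pattern matrix
Phi <- fa.promax$Phi               # factor intercorrelation matrix
round(P %*% Phi, 2)                # structure matrix: cor(variables, factors)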
The factor scoring weights are stored in the returned object (because only a correlation matrix was supplied, weights rather than actual scores are available):

> fa.promax$weights
          [,1]  [,2]
general  0.080 0.210
picture  0.021 0.090
blocks   0.044 0.695
maze     0.027 0.035
reading  0.739 0.044
vocab    0.176 0.039
14.5 Conclusion