Multivariate Analysis in R series

Source: http://www.cnblogs.com/wentingtu/archive/2012/03/03/2377971.html



Multivariate Analysis in R, Part One: Principal Component Analysis

Principal component analysis (PCA) is a technique for analyzing and simplifying data sets. It transforms the data into a new coordinate system such that the greatest variance of any projection of the data lies along the first coordinate (called the first principal component), the second greatest variance along the second coordinate (the second principal component), and so on. PCA is often used to reduce the number of dimensions in a data set while preserving as much of its variance as possible. This is done by keeping the low-order principal components and ignoring the higher-order ones, since the low-order components tend to retain the most important aspects of the data. However, it does not work well when the number of observations is smaller than the number of variables, as is common with genetic data.

In R, principal component analysis can be carried out with the basic princomp function; passing its result to the summary and plot functions yields the numerical results and the scree plot, respectively. However, the psych package is more flexible.
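As a minimal sketch of that base-R workflow (using the USJudgeRatings data set that the examples below rely on; the first column is dropped there as well):

```r
# PCA with base R on the USJudgeRatings data set
pr <- princomp(USJudgeRatings[,-1], cor=TRUE)  # cor=TRUE: work from the correlation matrix
summary(pr)                  # standard deviations and proportion of variance explained
screeplot(pr, type="lines")  # scree plot of the component variances
```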

1 Selecting the number of principal components

There are usually several criteria for selecting the number of principal components:

Prior experience and theory.

The cumulative proportion of variance explained: for example, choose enough components to account for 80% of the cumulative variance.

The eigenvalues of the correlation matrix: keep the components whose eigenvalue is greater than 1.

A more sophisticated method is parallel analysis. It first generates a number of random data matrices with the same structure as the original data, computes their eigenvalues and averages them, then compares these with the eigenvalues of the real data; the number of principal components is chosen according to where the two curves intersect. Using the USJudgeRatings data set as an example, first load the psych package, then use the fa.parallel function to draw the figure below. The figure shows that the first principal component lies above the red line and the second below it, so one principal component is retained.

library(psych)
fa.parallel(USJudgeRatings[,-1], fa="pc", n.iter=100, show.legend=FALSE)
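The eigenvalue-greater-than-1 rule from the list above can be checked directly on the same data (a quick base-R sketch):

```r
# Kaiser criterion: count eigenvalues of the correlation matrix greater than 1
ev <- eigen(cor(USJudgeRatings[,-1]))$values
sum(ev > 1)   # 1 for this data set, agreeing with the parallel analysis
```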

2 Extracting principal components

pc <- principal(USJudgeRatings[,-1], nfactors=1)

    PC1   h2     u2
1  0.92 0.84 0.1565
2  0.91 0.83 0.1663
3  0.97 0.94 0.0613
4  0.96 0.93 0.0720
5  0.96 0.92 0.0763
6  0.98 0.97 0.0299
7  0.98 0.95 0.0469
8  1.00 0.99 0.0091
9  0.99 0.98 0.0196
10 0.89 0.80 0.2013
11 0.99 0.97 0.0275

                 PC1
SS loadings    10.13
Proportion Var  0.92

In these results, PC1 gives the correlation coefficients between the observed variables and the principal component, h2 is the proportion of each variable's variance explained by the component, and u2 the proportion left unexplained. The principal component explains 92% of the total variance. Note that this differs from the output of the princomp function: princomp returns the linear-combination coefficients of the principal components, while principal returns the correlations between the original variables and the components, which matches the meaning of the results in factor analysis.
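The difference between the two functions can be seen side by side (a sketch; principal comes from the psych package):

```r
library(psych)
pr <- princomp(USJudgeRatings[,-1], cor=TRUE)
pc <- principal(USJudgeRatings[,-1], nfactors=1)
pr$loadings[,1]  # unit-length linear-combination coefficients (eigenvector)
pc$loadings[,1]  # correlations between each variable and the component
```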

3 Rotating principal components

Rotation transforms the component loadings, while keeping the cumulative variance contribution unchanged, so that they become easier to interpret. After rotation, the variance contribution of each component is redistributed, and strictly speaking they can no longer be called "principal components" but merely "components". Rotations divide into orthogonal and oblique rotations. The most popular orthogonal rotation is varimax, invoked by adding the rotate="varimax" argument to principal. There is also a view that principal component analysis generally does not need rotation.
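A sketch of a varimax rotation; note that rotation only makes sense with more than one component, so two are extracted here rather than the single component used above:

```r
library(psych)
# Varimax rotation redistributes the loadings across the two components
pc2 <- principal(USJudgeRatings[,-1], nfactors=2, rotate="varimax")
pc2$loadings
```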

4 Calculating principal component scores

A principal component score is a linear combination of the variables; once the scores are computed, they can be used in further analysis such as regression. Note, however, that scores cannot be computed unless the input is raw data. Add the scores=TRUE argument to principal, and the result is stored in the scores element of the returned object.
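A sketch of computing the scores and feeding them into a regression; using the first column of USJudgeRatings (CONT, lawyers' contacts with the judge) as the response is purely a hypothetical illustration:

```r
library(psych)
pc <- principal(USJudgeRatings[,-1], nfactors=1, scores=TRUE)
head(pc$scores)  # one component score per judge
# Hypothetical follow-up: regress CONT on the component score
fit <- lm(USJudgeRatings[,1] ~ pc$scores)
summary(fit)
```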

Multivariate Analysis in R, Part Two: Exploratory Factor Analysis

Exploratory factor analysis (EFA) is a technique for uncovering the underlying structure of multivariate observed variables and for dimensionality reduction; it condenses variables with intricate relationships into a few core factors. The difference between EFA and PCA is that in PCA a principal component is a linear combination of the original variables, whereas in EFA the original variables are linear combinations of common factors — latent variables that influence the observed ones. The part that cannot be explained by the factors is called error, and neither the factors nor the errors can be observed directly. EFA requires a large sample; the usual rule of thumb is that a model with n variables calls for a sample of 5n to 10n observations.

Although EFA and PCA differ fundamentally, their analysis workflows are similar. Here we use ability.cov, a psychometric example whose variables are six kinds of human abilities such as reading and vocabulary; its data come as a covariance matrix rather than raw data. The factanal function in the stats package can do the job, but here we again use the more flexible psych package.

1 Selecting the number of factors

The number of factors can be chosen from the eigenvalues of the correlation matrix, keeping those factors whose eigenvalue is greater than 0. Here we again use parallel analysis: generate a number of random matrices with the same structure as the original data, compute and average their eigenvalues, compare these with the eigenvalues of the real data, and choose the number of factors according to where the curves intersect. In the figure below, two factors lie above the red line, so two factors should clearly be chosen.

library(psych)
covariances <- ability.cov$cov
correlations <- cov2cor(covariances)
fa.parallel(correlations, n.obs=112, fa="fa", n.iter=100, show.legend=FALSE)

2 Extracting factors

The psych package extracts factors with the fa function: the nfactors argument sets the number of factors to 2, the rotate argument selects varimax rotation, and fm selects the estimation method — because maximum likelihood sometimes fails to converge, the iterated principal-axis method ("pa") is used here. The results below show that the two factors explain 60% of the total variance. The reading and vocab variables load on the first factor; picture, blocks, and maze load on the second; and general relates to both.

fa <- fa(correlations, nfactors=2, rotate="varimax", fm="pa")

         PA1  PA2   h2    u2
general 0.49 0.57 0.57 0.432
picture 0.16 0.59 0.38 0.623
blocks  0.18 0.89 0.83 0.166
maze    0.13 0.43 0.20 0.798
reading 0.93 0.20 0.91 0.089
vocab   0.80 0.23 0.69 0.313

                PA1  PA2
SS loadings    1.83 1.75
Proportion Var 0.30 0.29
Cumulative Var 0.30 0.60

The same analysis with the basic factanal function would be factanal(covmat=correlations, factors=2, rotation="varimax"), which gives a similar result. In addition, we can use a plot to show the relationship between the factors and the variables.

factor.plot(fa, labels=rownames(fa$loadings))
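The factanal alternative mentioned above, written out in full — note that factanal needs n.obs when given a correlation matrix, and it estimates by maximum likelihood rather than principal axes, so the loadings may differ slightly from the fa result:

```r
# Base-R factor analysis on the same correlation matrix
covariances <- ability.cov$cov
correlations <- cov2cor(covariances)
fa2 <- factanal(covmat=correlations, factors=2, rotation="varimax", n.obs=112)
fa2$loadings
```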

3 Factor scores

Once the common factors are obtained, a factor score can be computed for each observation, just as in principal component analysis. If the input is raw data, set scores=TRUE in the fa function to get the factor scores directly. If, as in the example above, the input is a correlation matrix, the scores themselves cannot be computed, but the scoring weights can be read from the weights element of the result.

fa$weights

                 PA1         PA2
general  0.017702900  0.21504415
picture -0.007986044  0.09687725
blocks  -0.198309764  0.79392660
maze     0.019155930  0.03027495
reading  0.841777373 -0.22404221
vocab    0.190592536 -0.02040749
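With raw data in hand, these weights turn standardized observations into factor scores. Since ability.cov ships only the covariance matrix, the sketch below simulates standardized data with the same correlation structure (purely illustrative) using MASS::mvrnorm:

```r
library(psych)
library(MASS)  # ships with R; mvrnorm draws multivariate normal samples
correlations <- cov2cor(ability.cov$cov)
fa <- fa(correlations, nfactors=2, rotate="varimax", fm="pa")
set.seed(1)
# Hypothetical data: 100 standardized cases with the observed correlation structure
X <- mvrnorm(100, mu=rep(0, ncol(correlations)), Sigma=correlations)
scores <- X %*% fa$weights  # one row of factor scores per simulated case
head(scores)
```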

Reference: R in Action
