Principles of Principal Component Analysis (PCA) and Its Implementation in R

Last Update:2016-12-01 Source: Internet

Author: User


Principle:

Principal Component Analysis (Stanford)

The Principal Component Analysis Method (Think Tank)

Principles of PCA (Principal Component Analysis)

Principal Component Analysis with an R Case Study (Library)

Principles, Applications, and Calculation Steps of Principal Component Analysis (Library)

Principal Component Analysis: The R Chapter

Five Questions About Principal Component Analysis

Principal component analysis is a multivariate statistical method. The principal components capture the largest differences among individuals, and they can also be used to reduce the number of variables that go into a regression or cluster analysis. The analysis can start from either the sample covariance matrix or the correlation matrix.
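The choice of starting matrix matters when the variables are on different scales. The sketch below is only an illustration: it uses the built-in USArrests data set (an assumption, since it is not part of the original article) and compares the first component obtained from the covariance matrix with the one obtained from the correlation matrix.

```r
# USArrests ships with base R; it stands in for real data here (an assumption).
pca_cov <- princomp(USArrests, cor = FALSE)  # start from the covariance matrix
pca_cor <- princomp(USArrests, cor = TRUE)   # start from the correlation matrix

# Absolute loadings of the first component under each choice.
load_cov <- abs(unclass(loadings(pca_cov))[, 1])
load_cor <- abs(unclass(loadings(pca_cor))[, 1])

print(round(load_cov, 3))  # dominated by the variable with the largest variance
print(round(load_cor, 3))  # weights are spread more evenly across variables
```

With the covariance matrix, the variable with the largest raw variance tends to dominate the first component, which is why the correlation matrix is usually preferred when units differ.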

Optimized indices are obtained as linear combinations of the original variables: the many original indicators are reduced to a few composite ones that account for most of the total variance.

The basic idea: regroup the many original indicators, which are correlated with one another, into a new set of composite indicators that are mutually uncorrelated, and use these in place of the original indicators.
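A minimal sketch of this idea, again assuming the built-in USArrests data set as a stand-in: the original variables are correlated, while the component scores returned by princomp are uncorrelated with each other.

```r
# USArrests is used purely for illustration (not from the original article).
pca <- princomp(USArrests, cor = TRUE)
scores <- pca$scores                  # the new composite indicators

print(round(cor(USArrests), 2))       # original variables: clearly correlated
print(round(cor(scores), 2))          # components: off-diagonal entries are about 0
```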

A small example: the R implementation is illustrated with four basic physical measurements of primary school students, namely height (X1), weight (X2), chest circumference (X3), and sitting height (X4). The details are as follows:

> # Most of the numeric data values are garbled in the source and are shown as ...
> student <- data.frame(
+   X1 = c(148, 139, ..., 149, 159, 142, 153, ..., 151),
+   X2 = c(...),
+   X3 = c(...),
+   X4 = c(...)
+ )
> student.PR <- princomp(student, cor = TRUE)
> summary(student.PR, loadings = TRUE)
Importance of components:
                         Comp.1     Comp.2     Comp.3      Comp.4
Standard deviation     1.884603 0.57380073 0.30944099 0.152548760
Proportion of Variance 0.887932 0.08231182 0.02393843 0.005817781
Cumulative Proportion  0.887932 0.97024379 0.99418222 1.000000000

Loadings:
   Comp.1 Comp.2 Comp.3 Comp.4
X1 -0.510  0.436 -0.139  0.728
X2 -0.513 -0.172 -0.741 -0.398
X3 -0.473 -0.761  0.396  0.201
X4 -0.504  0.448  0.524 -0.520
> screeplot(student.PR, type = "lines")

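Standard deviation: the standard deviation of each component (the square root of its eigenvalue)

Proportion of Variance: the share of the total variance explained by each component

Cumulative Proportion: the cumulative contribution rate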

The analysis of the four indicators yields four components. Their proportions of variance are 0.887932, 0.08231182, 0.02393843, and 0.005817781, and the cumulative contributions are 0.887932, 0.97024379, 0.99418222, and 1.000000000. The cumulative contribution of components 1 and 2 is about 97%, which exceeds 95%, so these two components are enough to capture most of the information in the student data.
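The three rows of the summary are linked by simple arithmetic. As a quick check, the sketch below starts from the standard deviations printed above and recomputes the other two rows; the variable names are only illustrative.

```r
# Standard deviations copied from the summary output above.
sdev <- c(1.884603, 0.57380073, 0.30944099, 0.152548760)

variance   <- sdev^2                    # eigenvalues of the correlation matrix
proportion <- variance / sum(variance)  # Proportion of Variance
cumulative <- cumsum(proportion)        # Cumulative Proportion

print(round(proportion, 6))
print(round(cumulative, 6))
```

Because the analysis was run on the correlation matrix of four variables, the variances add up to about 4.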

From the loadings, the formulas for the first two principal components, Z1 and Z2, can be written down.
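Reading the Comp.1 and Comp.2 columns of the loadings table gives the following expressions. Here $X_i^{*}$ denotes the standardized value of $X_i$, which is the assumption that matches `cor = TRUE` in the call to `princomp`.

```latex
\begin{aligned}
Z_1 &= -0.510\,X_1^{*} - 0.513\,X_2^{*} - 0.473\,X_3^{*} - 0.504\,X_4^{*} \\
Z_2 &= \phantom{-}0.436\,X_1^{*} - 0.172\,X_2^{*} - 0.761\,X_3^{*} + 0.448\,X_4^{*}
\end{aligned}
```

One possible reading: the coefficients of Z1 are all of similar size and the same sign, so it behaves like an overall body-size component, while Z2 contrasts height and sitting height against weight and chest circumference.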

> temp <- predict(student.PR)
> plot(temp[, 1:2])
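What predict returns here are the component scores, that is, the standardized data multiplied by the loadings. Since the student data cannot be recovered from the source, the sketch below assumes the built-in USArrests data set instead and checks that relationship by hand.

```r
# USArrests stands in for the student data (an assumption for illustration).
pca <- princomp(USArrests, cor = TRUE)

scores_pred <- predict(pca)  # scores as returned by predict()

# Same scores by hand: center and scale with the values stored in the fit,
# then multiply by the loadings matrix.
z <- scale(USArrests, center = pca$center, scale = pca$scale)
scores_manual <- z %*% unclass(loadings(pca))

print(all.equal(scores_pred, scores_manual, check.attributes = FALSE))
print(head(round(scores_pred[, 1:2], 3)))
```

Note that princomp scales with the divisor n rather than n - 1, which is why the stored pca$scale is used instead of a fresh call to sd.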

Reference links: R Language and Data Analysis, Part 5: Principal Component Analysis (part of a long series that is worth reading through slowly)

Applications of principal component analysis in bioinformatics:

Main Steps in Principal Component Analysis

Application of Principal Component Analysis (PCA) in Biochip Sample Screening and Its Implementation in R

Multivariate Analysis in R

Using GCAT for Principal Component Analysis (PCA)

PCA for Principal Component Analysis of Gene Expression Data

Application of Principal Component Analysis (PCA) in Group-Based Data Quality Control

Bioinformatics: Principal Component Analysis (PCA) (original)

Advanced RNA-seq Analysis: Principal Component Analysis

What is the point of principal component analysis in RNA-seq? Put bluntly, it is still about clustering!

The goal of principal component analysis (PCA) is to replace a large number of correlated variables with a much smaller set of uncorrelated variables while preserving as much of the information in the original variables as possible. The derived variables are called principal components, and each is a linear combination of the original variables. In other words, n variables (n dimensions) are reduced by linear combination to k composite variables (k dimensions, k < n) that are used to summarize and interpret a phenomenon.

Let's start with a simple example to help you understand:

A basketball club has 40 male students who differ to a greater or lesser extent on various indicators, including height, weight, vision, 100-meter sprint speed, lung capacity, daily practice time, and sleep time. In the quarterly selection, the 40 students also differ in the number of goals they score (their results). Are these indicators related to the results, and if so, how strongly?

PCA can be used here. The analysis can be summarized as follows:

1 Select the initial variables

For example, take the 7 indicators above as the variables (A1-A7) and the 40 students as the samples.

2 Standardize the raw data matrix and compute the correlation matrix

(1) Raw data matrix: each row holds the indicator values of one of the 40 students, and each column holds the values of one indicator across the 40 students;

(2) Because the indicators have different units of measurement and different value ranges, the analysis should not start directly from the covariance matrix, so the correlation matrix is used.

3 Compute the eigenvalues and the corresponding eigenvectors

4 Determine the number of principal components

The most common approach is based on the eigenvalues: the number of eigenvalues greater than 1 is usually taken as the number of principal components. If two eigenvalues are greater than 1 in this analysis, the final result has 2 principal components (the exact number can be adjusted to suit the study).

5 Obtain the principal component expressions

The analysis yields the first principal component (PC1) and the second principal component (PC2). Each is a linear combination of the standardized indicators a1* to a7*, where a1* denotes the standardized value of A1 and the coefficients come from the corresponding eigenvector.

6 Analyze the practical meaning of the combined data

PC1 explains more of the differences between samples than PC2 (as shown by the percentage of variance reported for each component). A1, A2, A3, and A6 contribute most to the linear combination for PC1 (their coefficients are large), while A5 contributes most to the linear combination for PC2. With PC1 as the horizontal axis, the samples separate clearly according to their results, which indicates that, of the 7 indicators above, height, weight, vision, and daily practice time are the 4 most strongly correlated with the students' results. In keeping with the idea of a linear combination, one should look for a single term that summarizes these 4 indicators and covers what they have in common (the example may not be ideal, and the author could not think of a suitable term for the moment). Once these concepts are clear, the use of principal component analysis for RNA-seq is also easy to understand.
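The steps above can be sketched in a few lines of R. The data below are simulated purely for illustration, and the hidden factor `size`, the means, and the standard deviations are all assumptions rather than values from the article.

```r
set.seed(1)
n <- 40  # 40 students

# Step 1: seven indicators A1..A7. All values are simulated (hypothetical);
# "size" is an invented hidden factor that makes A1, A2 and A6 correlate.
size <- rnorm(n)
X <- data.frame(
  A1 = 175 + 6 * size + rnorm(n, sd = 2),    # height (cm)
  A2 = 70 + 8 * size + rnorm(n, sd = 3),     # weight (kg)
  A3 = rnorm(n, mean = 1.0, sd = 0.2),       # vision
  A4 = rnorm(n, mean = 13, sd = 0.8),        # 100 m time (s)
  A5 = rnorm(n, mean = 4000, sd = 400),      # lung capacity (ml)
  A6 = 2 + 0.5 * size + rnorm(n, sd = 0.3),  # daily practice (h)
  A7 = rnorm(n, mean = 8, sd = 0.7)          # sleep (h)
)

# Step 2: standardize the data and form the correlation matrix.
Z <- scale(X)
R <- cor(X)

# Step 3: eigenvalues and eigenvectors of the correlation matrix.
e <- eigen(R, symmetric = TRUE)

# Step 4: keep the components whose eigenvalue is greater than 1.
k <- sum(e$values > 1)

# Step 5: coefficients of the retained components and the resulting scores.
W <- e$vectors[, seq_len(k), drop = FALSE]
dimnames(W) <- list(colnames(X), paste0("PC", seq_len(k)))
scores <- Z %*% W

print(round(e$values, 3))  # eigenvalues, largest first
print(round(W, 3))         # coefficients of each standardized indicator
```

Step 6 then amounts to reading the columns of `W`: indicators with large absolute coefficients in a column are the ones that drive that component.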

For example:

A single-cell RNA-seq project studies the gene expression patterns of cells at different stages of embryonic development and asks which genes regulate the stage that each cell sample is in.

The samples can be zygotes, 2-cell embryos, 4-cell embryos, 8-cell embryos, morulae, blastocysts, and cells from other stages.

In PCA, the genes can be clustered with the samples as the variables. The samples can also be clustered with the genes as the variables, and PC1 can then be used to find out which genes matter most for classifying the cells.
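A minimal sketch of the second use, with genes as the variables and samples as the observations. The expression matrix is simulated, and the stage labels, gene names, and effect size are all hypothetical; real RNA-seq data would first need normalization, which is omitted here.

```r
set.seed(42)
stages  <- rep(c("zygote", "2-cell", "4-cell", "8-cell"), each = 3)
n_genes <- 200

# Simulated log-expression matrix: rows are genes, columns are samples.
expr <- matrix(rnorm(n_genes * length(stages)), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               paste0(stages, "_", 1:3)))

# Make the first 20 genes rise with developmental stage so there is a signal.
stage_index <- as.integer(factor(stages, levels = unique(stages)))
expr[1:20, ] <- sweep(expr[1:20, ], 2, 3 * stage_index, "+")

# prcomp expects observations in rows, so transpose: samples x genes.
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)
pc1 <- pca$x[, 1]  # position of each sample along PC1

# Genes with the largest absolute weight on PC1.
top_genes <- names(sort(abs(pca$rotation[, 1]), decreasing = TRUE))[1:10]

print(round(pc1, 2))
print(top_genes)
```

In this simulation the samples line up along PC1 by stage, and the genes with the largest PC1 weights are the ones that were made stage-dependent.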

Additional Information: Bio-Information Overview


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
