Re-sequencing has become cheap, so population-level sequencing and analysis are growing as well. Population structure analysis is the most common component of a re-sequencing study, and it is used very widely. First, it is the most basic analysis in any study of population evolution. Second, when running a GWAS, the results of PCA or structure analysis must be included as covariates to correct for the false positives that population structure would otherwise introduce into the association analysis.
We call them the "Three Musketeers of population structure" because these three plots (or three analyses) almost always appear together in a single article. Although they usually appear together, the biological questions they can answer and the way they are drawn are different, so we explain them one at a time.
2.1 PCA plot (principal component analysis)
Visual appeal: ☆
Practicality: ☆☆☆☆
Interpreting PCA plots
PCA is a plain, unpretentious analysis, but it is widely used and can help us answer real biological questions. It is plain in the sense that the result is not fancy and is easy to read: it is just a scatter plot.
For example, in an article on giant panda re-sequencing [1], the authors plot principal component 1 (PC1) and principal component 2 (PC2) on the x-axis and y-axis of a scatter plot, where each point represents one sample. In such a PCA plot, the farther apart two samples are, the more their genetic backgrounds differ. Ideally, individuals with similar genetic backgrounds cluster together.
The panda individuals in this figure come from three different panda nature reserves. PCA also divides these individuals into three subgroups, which match the geographic origin of the pandas.
Figure 1 PCA accurately distinguishes pandas from three regions
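To make "farther apart means more different" concrete, here is a minimal sketch that measures the straight-line distance between samples in the PC1/PC2 plane. The sample names and coordinates are made up for illustration and do not come from the panda study. Keep in mind that such a distance only reflects the part of the variation captured by the two plotted components.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

# Hypothetical (PC1, PC2) coordinates; names and values are illustrative only.
pc_coords = {
    "reserveA_1": (-0.20, 0.05),
    "reserveA_2": (-0.18, 0.06),
    "reserveB_1": (0.21, -0.03),
}

def pc_distance(sample_a, sample_b, coords=pc_coords):
    """Distance between two samples in the plotted PC plane."""
    return dist(coords[sample_a], coords[sample_b])

within = pc_distance("reserveA_1", "reserveA_2")   # same reserve
between = pc_distance("reserveA_1", "reserveB_1")  # different reserves
```

In this toy data the two samples from the same reserve sit much closer together than the pair from different reserves, which is the pattern we look for when reading a PCA plot.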
But if you want more than a basic reading of the plot, you need at least a rudimentary understanding of how PCA works. PCA is a data-processing method from linear algebra. Its mathematics is fairly abstract and there is no room to go through it here; interested readers can look it up on Baidu. PCA is meant for situations where our data are too complex.
For example, re-sequencing a population yields millions of SNP sites. If we used all of these millions of SNPs as indicators to tell individuals apart, there would be too much information to see what matters. PCA extracts the key information from these millions of SNPs so that we can separate the samples effectively with far fewer variables (indicators). The extracted components are ordered by the size of their effect, from largest to smallest, and are called principal component 1 (PC1), principal component 2 (PC2), principal component 3 (PC3), and so on.
In real articles, we do not use only PC1 and PC2 to separate the sample population. Mathematically, PCA extracts the key information from a large number of data indicators, but PC1 or PC2 alone always explains only a limited part of the overall information. We call this part the percentage of total variance explained by PCn. In PCA results from typical re-sequencing projects, PC1 explains about 3~10% of the total variance. So we also need to look at how well the other principal components separate the samples.
For example, in the silkworm re-sequencing article [2], the authors plotted PC1 against PC2 (left) and PC3 against PC4 (right). The two plots tell different stories. In the PC1/PC2 plot, wild silkworms and domesticated silkworms separate into two groups. In the PC3/PC4 plot, two high-silk-yield Jiangnan varieties are separated out.
So, at the biological level, PCA is a process of concentrating information: it pulls together similar information from all the original SNP sites and condenses it into the new variables PC1, PC2, PC3, and so on. Different principal components may therefore (remember, only may) correspond to different biological meanings and produce different clustering patterns.
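The idea of condensing many SNPs into a few components can be sketched with NumPy. This is only a minimal illustration on a made-up 6-sample, 8-SNP matrix of allele counts (0, 1, 2), not the procedure used by any particular package; real tools usually also scale each SNP by its allele frequency and work on millions of sites. The function name `genotype_pca` is our own.

```python
import numpy as np

def genotype_pca(genotypes, n_components=2):
    """PCA of a samples x SNPs matrix of allele counts (0, 1, 2)."""
    g = np.asarray(genotypes, dtype=float)
    # Centre each SNP column so the PCs describe differences between samples.
    centred = g - g.mean(axis=0)
    # SVD of the centred matrix: u * s gives the sample coordinates on each
    # PC, and s**2 is proportional to the variance carried by each PC.
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = s**2 / np.sum(s**2)
    return scores, explained[:n_components]

# Toy data: the first three rows and the last three rows are two made-up subgroups.
toy = np.array([
    [0, 0, 1, 0, 2, 2, 1, 0],
    [0, 1, 0, 0, 2, 2, 1, 1],
    [0, 0, 0, 1, 2, 1, 1, 0],
    [2, 2, 1, 2, 0, 0, 1, 1],
    [2, 1, 2, 2, 0, 0, 1, 0],
    [2, 2, 2, 1, 0, 1, 1, 0],
])
scores, explained = genotype_pca(toy)
```

Here `scores[:, 0]` and `scores[:, 1]` are the PC1 and PC2 coordinates you would put on the x-axis and y-axis. In this toy matrix the two subgroups land on opposite sides of PC1, because most of the SNPs differ between them.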
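The percentage of variance explained by each PC can be worked out from the eigenvalues, as in the sketch below. The eigenvalues here are invented so that PC1 lands inside the 3~10% range mentioned above; they are not from a real dataset. One assumption worth stating loudly: the list must contain all eigenvalues, because dividing by the sum of only the top few would overstate every percentage.

```python
def variance_explained(eigenvalues):
    """Return each PC's share of the total variance, in percent."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive number")
    return [100.0 * ev / total for ev in eigenvalues]

# Hypothetical eigenvalues, largest first (not from any real dataset).
example_eigenvalues = [8.0, 4.0, 2.0, 1.0, 1.0] + [0.5] * 168
shares = variance_explained(example_eigenvalues)

# Running total, useful for deciding how many PCs are worth inspecting.
cumulative = []
running = 0.0
for share in shares:
    running += share
    cumulative.append(running)
```

With these made-up numbers, PC1 and PC2 together still leave most of the variance unexplained, which is why it can pay to look at PC3, PC4 and beyond.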
Fig. 2 Comparison of clustering results of silkworm population using different principal components
Uses in real-world cases
PCA is only a very simple mathematical method; its specific biological meaning has to be worked out case by case. The main applications of PCA in practice include:
1. Detection of outlier samples
For example, in Fig. 2 (right), the two high-yield varieties are outlier samples. If your material is known to come from a single source of the same species, such an outlier may mean that samples were mixed up during sampling or sequencing. If the material will later be used for GWAS, individual outlier samples like these should be considered for removal. Of course, if a large number of samples fall outside the clusters, or if there is population stratification (as in the left plot, which clearly splits into two subgroups), then the results of PCA or structure analysis need to be used as covariates in the subsequent association analysis to correct for their effect.
2. Inferring the evolutionary relationship between subgroups
For example, in this grape population study [3], the grape varieties come from three regions. The green western grapes and the red eastern grapes are clearly separated, but the blue central grapes are mixed into both the eastern and western subgroups and overlap heavily with each of them. The authors infer that grapes from the eastern and western regions spread into the central region with extensive hybridization along the way, so the pedigrees of the central varieties are mixed and never formed an independent subgroup of their own. In fact, I (the mouse writing this) have had my own genotype tested as well, and PCA finally placed me in the Jiangnan population. Of course, the result did not surprise me, because I am a genuine, money-back-guaranteed "big Hu Jianshen".
Fig. 3 Genetic admixture among the grape subgroups
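The outlier check described in point 1 can be sketched as a simple rule: flag any sample that lies more than a chosen number of standard deviations from the mean on any principal component. This is a hedged illustration of the idea, not the method of a specific package; the function name `flag_outliers`, the threshold and the toy coordinates are all our own assumptions.

```python
from statistics import mean, stdev

def flag_outliers(pc_scores, n_sd=3.0):
    """Return IDs of samples more than n_sd SDs from the mean on any PC.

    pc_scores maps sample ID -> list of PC coordinates (PC1, PC2, ...).
    """
    ids = list(pc_scores)
    n_pcs = len(pc_scores[ids[0]])
    flagged = set()
    for k in range(n_pcs):
        values = [pc_scores[s][k] for s in ids]
        centre, spread = mean(values), stdev(values)
        if spread == 0:
            continue  # every sample identical on this PC, nothing to flag
        for sample_id, value in zip(ids, values):
            if abs(value - centre) > n_sd * spread:
                flagged.add(sample_id)
    return sorted(flagged)

# Toy data: 19 tightly clustered samples plus one made-up sample mix-up.
toy_scores = {
    f"s{i:02d}": [0.01 * ((i % 5) - 2), 0.02 * ((i % 3) - 1)]
    for i in range(19)
}
toy_scores["mixup"] = [0.50, 0.00]
outliers = flag_outliers(toy_scores, n_sd=3.0)
```

One design caveat: a single extreme sample inflates the standard deviation it is measured against, so with only a few dozen samples a very strict threshold may never trigger. Looking at the plot by eye remains a sensible cross-check.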
How to run PCA and draw the plot
PCA is just a statistical method. We can use population genetics software to compute PC1~PCn for the population and then draw a scatter plot from the values (Excel is enough for the scatter plot; of course, it will look nicer if you use R).
For the software, we recommend the PCA module of GCTA (http://cnsgenomics.com/software/gcta/pca.html). GCTA has a Windows version, but like local BLAST (which we have covered before) it can only be run from the command line in a DOS window. There is also the older PCA package EIGENSOFT (http://www.hsph.harvard.edu/alkes-price/software/), but it is only available on Linux. In short, most bioinformatics software is not very user-friendly; that is the current state of the field.
Of course, PCA is not used only in re-sequencing. It is also used heavily in RNA-seq and 16S rDNA metagenomic sequencing, where the SNP information described above is simply replaced by expression levels or abundances. If PCA for RNA-seq or 16S metagenomic data is giving you a headache, you can use our new free online cloud analysis tools (www.omicshare.com/tools/), newly developed by us at Gene Denovo (基迪奥).
Long-suffering biologists ("bio dogs") abused by every kind of bioinformatics software, unite, and the Internationale shall surely be realized ... That feels a little off-topic, so let us shout the slogan again: all bio dogs who have been abused by bioinformatics software are welcome to send us feedback and change requests, and the user-friendliness of the OS tools will certainly keep improving.
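Once the software has written out the principal components, turning them into plot-ready data takes only a few lines. The sketch below assumes a plain-text eigenvector file with one sample per row: family ID, individual ID, then the PC values, separated by whitespace and with no header. That is the layout we understand GCTA's PCA output to have, but please check it against your own file. The population labels and function names are hypothetical.

```python
def read_eigenvec(lines):
    """Parse whitespace-separated rows: family ID, sample ID, PC1, PC2, ..."""
    samples = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        samples[fields[1]] = [float(x) for x in fields[2:]]
    return samples

def group_for_plot(samples, population_of, x_pc=1, y_pc=2):
    """Collect (x, y) points per population, ready for any scatter-plot tool."""
    series = {}
    for sample_id, pcs in samples.items():
        pop = population_of.get(sample_id, "unknown")
        series.setdefault(pop, []).append((pcs[x_pc - 1], pcs[y_pc - 1]))
    return series

# Made-up rows standing in for the contents of an eigenvector file.
example_rows = [
    "fam1 s1 -0.21 0.05 0.01",
    "fam2 s2 -0.19 0.07 -0.02",
    "fam3 s3 0.22 -0.04 0.03",
    "",
]
example_pops = {"s1": "reserve_A", "s2": "reserve_A", "s3": "reserve_B"}
series = group_for_plot(read_eigenvec(example_rows), example_pops)
```

Each entry of `series` is one coloured group of points. Passing `x_pc=3, y_pc=4` gives the data for a PC3/PC4 plot like the right-hand panel of Fig. 2. The points can then be drawn in Excel, R, or any plotting library.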
References:
[1] Zhao S, et al. Whole-genome sequencing of giant pandas provides insights into demographic history and local adaptation. Nature Genetics, 2013, 45(1): 67-71.
[2] Xia Q, Guo Y, Zhang Z, et al. Complete resequencing of 40 genomes reveals domestication events and genes in silkworm (Bombyx). Science, 2009, 326(5951): 433-436.
[3] Myles S, Boyko A R, Owens C L, et al. Genetic structure and domestication history of the grape. Proceedings of the National Academy of Sciences, 2011, 108(9): 3530-3535.
Reposted from:
The Three Musketeers of population structure plots: the PCA plot
http://www.omicshare.com/forum/thread-816-1-180.html
(Source: Omicshare Forum)