R Language Multivariate Analysis Series


R Language Multivariate Analysis Series One: Principal Component Analysis

Principal component analysis (PCA) is a technique for analyzing and simplifying data sets. It transforms the original data into a new coordinate system such that the greatest variance of any projection of the data lies on the first coordinate (called the first principal component), the second greatest variance on the second coordinate (the second principal component), and so on. PCA is often used to reduce the number of dimensions in a dataset while preserving as much of the data's variance as possible; this is done by keeping the low-order principal components and ignoring the higher-order ones, since the low-order components tend to retain the most important aspects of the data. It does not work well, however, when the number of observations is smaller than the number of variables, as with genetic data.

In R, principal component analysis can be done with the base princomp function; passing its result to summary and plot gives the analysis results and the scree plot, respectively. The psych extension package, however, is more flexible.
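A minimal sketch of this base workflow, using the USJudgeRatings data that appears later in this post (with its first column dropped, as in the psych examples below):
pc.base <- princomp(USJudgeRatings[,-1], cor = TRUE)  # PCA on the correlation matrix
summary(pc.base)                                      # variance explained by each component
plot(pc.base, type = "lines")                         # scree plot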


1 Selecting the number of principal components
There are usually several criteria for selecting the number of principal components:
    • Choose based on prior experience and theory.
    • Choose based on the cumulative variance contribution, for example the number of components that together explain 80% of the cumulative variance.
    • Choose based on the eigenvalues of the correlation matrix, keeping the components whose eigenvalue is greater than 1 (see the sketch after this list).
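A minimal sketch (not part of the original example) of the eigenvalue-greater-than-1 rule, applied to the correlation matrix of the USJudgeRatings data used below:
ev <- eigen(cor(USJudgeRatings[,-1]))$values  # eigenvalues of the correlation matrix
sum(ev > 1)                                   # number of components this rule suggests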
A more advanced approach is parallel analysis. The method generates a number of random matrices with the same structure as the original data, computes their eigenvalues and averages them, and compares the averages with the eigenvalues of the real data; the number of principal components is chosen from where the two curves cross. Using the USJudgeRatings dataset as an example, we first load the psych package and then call the fa.parallel function to draw the plot. The first principal component lies above the red line and the second lies below it, so one principal component is retained.

library(psych)
fa.parallel(USJudgeRatings[,-1], fa="pc", n.iter=100, show.legend=FALSE)
2 Extracting principal components
pc <- principal(USJudgeRatings[,-1], nfactors=1)
    PC1   h2     u2
1  0.92 0.84 0.1565
2  0.91 0.83 0.1663
3  0.97 0.94 0.0613
4  0.96 0.93 0.0720
5  0.96 0.92 0.0763
6  0.98 0.97 0.0299
7  0.98 0.95 0.0469
8  1.00 0.99 0.0091
9  0.99 0.98 0.0196
10 0.89 0.80 0.2013
11 0.99 0.97 0.0275

                 PC1
SS loadings    10.13
Proportion Var  0.92
From the above results, PC1 is the correlation between each observed variable and the principal component, h2 is the proportion of a variable's variance explained by the component, and u2 is the proportion left unexplained. The principal component explains 92% of the total variance. Note that this differs from the princomp output: princomp returns the linear combination coefficients of the principal components, whereas principal returns the correlations between the original variables and the components, which matches the interpretation used in factor analysis.

3 Rotating principal components
Rotation transforms the component loadings to make them easier to interpret while keeping the cumulative variance explained unchanged. After rotation the variance explained by each component is redistributed, so strictly speaking they can no longer be called "principal components" but merely "components". Rotations divide into orthogonal and oblique rotations; the most popular orthogonal rotation is varimax, obtained by adding the rotate="varimax" argument to principal. There is also a view that principal component analysis generally does not need rotation.
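A minimal sketch of a varimax rotation; two components are extracted here purely for illustration, since rotating the single component retained above has no effect:
pc.rot <- principal(USJudgeRatings[,-1], nfactors=2, rotate="varimax")
pc.rot$loadings   # rotated component loadings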

4 Calculating the principal component score
The principal component score is a linear combination of the variables; once computed, it can be used in further analyses such as regression. Note, however, that component scores cannot be computed if the input is not the raw data (for example, if only a correlation matrix is supplied). To obtain them, set the scores argument of principal to TRUE; the scores are then stored in the scores element of the result.
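A minimal sketch, assuming the raw USJudgeRatings data (not just a correlation matrix) are supplied so that scores can be computed; the scores could then serve, for example, as a predictor in a regression:
pc <- principal(USJudgeRatings[,-1], nfactors=1, scores=TRUE)
head(pc$scores)   # principal component scores of the first few judges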

R Language Multivariate Analysis Series Two: Exploratory Factor Analysis

Exploratory factor analysis (EFA) is a technique for uncovering the underlying structure of a set of observed variables and for reducing dimensionality; it condenses variables with intricate interrelations into a few core factors. The difference between EFA and PCA is that in PCA the principal components are linear combinations of the original variables, whereas in EFA the original variables are modelled as linear combinations of common factors, latent variables that influence the observed ones; the part that the factors cannot explain is called error, and neither the factors nor the errors can be observed directly. EFA requires a large sample: a common rule of thumb is that if n factors are to be estimated, a sample of 5n to 10n observations is needed.

Although EFA and PCA differ in these fundamentals, the analysis workflow is similar. Here we use Ability.cov, a psychometric example dataset whose six variables measure human abilities such as reading and spelling; the data come as a covariance matrix rather than raw observations. The factanal function in the stats package can do the job, but here we use the more flexible psych package.


1 Selecting the number of factors
The number of factors is generally chosen from the eigenvalues of the correlation matrix, keeping the factors whose eigenvalue is greater than 0. Here we again use parallel analysis: random matrices with the same structure as the original data are generated, their eigenvalues are computed and averaged, and the averages are compared with the eigenvalues of the real data, the number of factors being chosen from where the curves cross. Judging from the relationship between the eigenvalues and the red line, two factors lie above the red line, so two factors should be selected.
library(psych)
covariances <- Ability.cov$cov
correlations <- cov2cor(covariances)
fa.parallel(correlations, n.obs=112, fa="fa", n.iter=100, show.legend=FALSE)

2 Extracting factors
The psych package extracts factors with the fa function. The nfactors argument sets the number of factors to 2, the rotate argument selects the varimax rotation, and the fm argument specifies the estimation method; because maximum likelihood sometimes fails to converge, the iterated principal axis method ("pa") is used here. From the results below, the two factors explain 60% of the total variance. The reading and vocab variables load on the first factor, picture, blocks and maze load on the second, and general is related to both factors.
fa <- fa(correlations, nfactors=2, rotate="varimax", fm="pa")
        PA1  PA2   h2    u2
general 0.49 0.57 0.57 0.432
picture 0.16 0.59 0.38 0.623
blocks  0.18 0.89 0.83 0.166
maze    0.13 0.43 0.20 0.798
reading 0.93 0.20 0.91 0.089
vocab   0.80 0.23 0.69 0.313

                PA1  PA2
SS loadings    1.83 1.75
Proportion Var 0.30 0.29
Cumulative Var 0.30 0.60
If you use the base function factanal for factor analysis, the call would be factanal(covmat=correlations, factors=2, rotation="varimax"), which gives the same result. We can also use a plot to show the relationship between the factors and the variables.
factor.plot(fa, labels=rownames(fa$loadings))
3 Factor scores
Once the common factors are obtained, the factor score of each sample can be computed, just as in principal component analysis. If raw data are supplied, the fa function can return factor scores directly (via its scores argument). If, as in this example, only a correlation matrix is supplied, only the factor score weights (coefficients) can be returned.
fa$weights

                 PA1         PA2
general  0.017702900  0.21504415
picture -0.007986044  0.09687725
blocks  -0.198309764  0.79392660
maze     0.019155930  0.03027495
reading  0.841777373 -0.22404221
vocab    0.190592536 -0.02040749

Reference: R in Action

R Language Multivariate Analysis Series Three: Multidimensional Scaling

Multidimensional scaling (MDS) is a data analysis method that represents objects from a high-dimensional space in a low-dimensional space while preserving the original relationships between the objects.

Imagine we know the coordinates of some points in Euclidean space; we can then compute the Euclidean distances between them. Conversely, knowing the distances should also let us recover the relationships between the points. The distance can be the classical Euclidean distance or some generalized "distance". MDS tries to preserve this high-dimensional "distance" while displaying the data in a low-dimensional space. In this sense, principal component analysis is a special case of multidimensional scaling.
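A tiny illustration of this idea (not from the original post): compute Euclidean distances from known two-dimensional coordinates, then let classical MDS recover an equivalent configuration from the distances alone.
set.seed(1)
coords <- matrix(rnorm(20), ncol = 2)        # 10 points with known 2-D coordinates
recovered <- cmdscale(dist(coords), k = 2)   # recovered coordinates, up to rotation/reflection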


1 Measuring distance
The distances commonly used in multivariate analysis include absolute distance, Euclidean distance (euclidean), Manhattan distance (manhattan), binary distance (binary), and Minkowski distance (minkowski). In R, the dist function is typically used to compute the distances between samples. MDS analyzes this distance matrix in order to display and interpret the internal structure of the data.
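A minimal sketch of dist with several of these measures, using the numeric columns of iris as example data:
d.euc <- dist(iris[, 1:4], method = "euclidean")
d.man <- dist(iris[, 1:4], method = "manhattan")
d.min <- dist(iris[, 1:4], method = "minkowski", p = 3)  # Minkowski distance with p = 3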

In classical MDS, the distances are treated as numeric and are assumed to be Euclidean. The cmdscale function in the stats package implements classical MDS: given the Euclidean distances between points, it finds coordinates for each point in a low-dimensional space while trying to keep the distances unchanged.

In non-metric MDS, the distances are treated not as numeric but as ordinal data. For example, in a psychological experiment subjects may only be able to answer on a scale such as strongly agree, agree, disagree, strongly disagree; in such cases classical MDS is no longer appropriate. Kruskal proposed an algorithm to solve this problem in 1964. The isoMDS function in the MASS package implements this algorithm, and another popular algorithm is implemented by the sammon function.
2 Classical MDS
Let's use the watervoles data from the HSAUR2 package as an example. The data form a dissimilarity matrix describing how different the water vole populations of various regions are. Load the data first and then analyze it with cmdscale.
library(ggplot2)
data("watervoles", package = "HSAUR2")
voles.mds <- cmdscale(watervoles, k=13, eig=TRUE)
The following computes the proportion of the first two eigenvalues among all eigenvalues, to check whether the distances in the high-dimensional space can be represented by distances in two dimensions; a value of about 0.8 or above is considered adequate.
sum(abs(voles.mds$eig[1:2])) / sum(abs(voles.mds$eig))
sum(voles.mds$eig[1:2]^2) / sum(voles.mds$eig^2)
The coordinates of the first two dimensions are then extracted from the result and plotted with the ggplot2 package.
x <- voles.mds$points[,1]
y <- voles.mds$points[,2]
p <- ggplot(data.frame(x, y), aes(x, y, label = colnames(watervoles)))
p + geom_point(shape=16, size=3, colour='red') +
  geom_text(hjust=-0.1, vjust=0.5, alpha=0.5)

3 Non-metric MDS
The data in this second example form a dissimilarity matrix of the voting behaviour of New Jersey congressmen; here we use the isoMDS function in the MASS package for the analysis.
Library ("MASS") data (voting, package = "HSAUR2") Voting_mds = Isomds (voting) x = Voting_mds$points[,1]y = voting_mds$ Points[,2]g=ggplot (Data.frame (x, y), AES (X,y,label = colnames (voting))) G+geom_point (shape=16,size=3,colour= ' red ') +  Geom_text (hjust=-0.1,vjust=0.5,alpha=0.5)

Resources:
A Handbook of Statistical Analyses Using R
Multivariate Statistical Analysis and R Language Modeling

R Language Multivariate Analysis Series Four: Discriminant Analysis

Discriminant analysis is a classification technique. It uses "training samples" of known class to build a discriminant rule, and then classifies data of unknown class from their predictor variables. There are three main approaches: Fisher discriminant, Bayes discriminant, and distance discriminant. The idea of Fisher discriminant is projection for dimensionality reduction, so that a multidimensional problem is reduced to a one-dimensional one: an appropriate projection axis is chosen so that, when all sample points are projected onto it, the dispersion of the projected values within each group is as small as possible while the dispersion between groups is as large as possible. Bayes discriminant starts from prior probabilities, derives posterior probabilities, and makes statistical inferences from the posterior distribution. Distance discriminant computes the centroid of each class from the data of known class, then computes the distance of a sample to each centroid and assigns the sample to the class whose centroid is nearest.
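As a toy sketch of the distance-discriminant idea only (not the method used below), each class centroid can be computed from the iris data and a new observation assigned to the nearest centroid; the observation newx is hypothetical:
centroids <- aggregate(iris[, 1:4], list(Species = iris$Species), mean)  # class centroids
newx <- c(6.0, 3.0, 4.5, 1.5)                                            # a hypothetical new flower
d <- apply(centroids[, -1], 1, function(m) sqrt(sum((newx - m)^2)))      # distance to each centroid
centroids$Species[which.min(d)]                                          # class of the nearest centroid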
1 Linear discriminant analysis

When the covariance matrices of the classes are the same, the lda function in the MASS package performs linear discriminant analysis in R. The lda function is based on Bayes discriminant theory; when there are only two classes and the population follows a multivariate normal distribution, Bayes discriminant is equivalent to Fisher discriminant and distance discriminant. This example uses the iris dataset to classify flower species. First the MASS package is loaded and the discriminant model is fitted, with the prior argument giving the prior probabilities. The table function is then used to build a confusion matrix comparing the true and predicted classes.
library(MASS)
model1 <- lda(Species ~ ., data=iris, prior=c(1,1,1)/3)
table(iris$Species, predict(model1)$class)
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         48         2
  virginica       0          1        49
From the above results, only three samples are misclassified. Once the discriminant model has been established, the discriminant scores can be plotted, much as in principal component analysis.
ld <- predict(model1)$x
p <- ggplot(cbind(iris, as.data.frame(ld)), aes(x=LD1, y=LD2))
p + geom_point(aes(colour=Species), alpha=0.8, size=3)
2 Quadratic discriminant analysis

When the covariance matrices of the classes are not the same, quadratic discriminant analysis should be used instead.
model2 <- qda(Species ~ ., data=iris, CV=TRUE)
Here the CV argument is set to TRUE, which performs leave-one-out cross-validation and automatically produces the predicted classes; a confusion matrix built from these cross-validated predictions is more reliable. The posterior probabilities can be extracted as well: with CV=TRUE they are returned directly in the result (model2$posterior), and otherwise predict(model)$posterior can be used.
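A minimal sketch of inspecting the cross-validated results returned by the CV=TRUE call above:
table(iris$Species, model2$class)   # cross-validated confusion matrix
head(model2$posterior)              # posterior probabilities for the first few samples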
Note that the lda and qda functions assume the population follows a multivariate normal distribution; they should be used with caution if this assumption is not met.
Resources:
Modern Applied Statistics with S
Data Analysis and Graphics Using R: An Example-Based Approach
R Language Multivariate Analysis Series Five: Cluster Analysis (end)

Cluster analysis is a multivariate statistical method that groups samples or indicators according to the principle of "birds of a feather flock together": without prior knowledge, samples are grouped in a reasonable way based on their own characteristics.

Cluster analysis has many applications. In business it is used to identify different customer groups and to characterize them through their purchasing patterns; in biology it is used to classify animals and plants and to understand the inherent structure of populations; on the Internet it is used to categorize documents on the Web for information retrieval.


There are two main computational approaches to cluster analysis: agglomerative hierarchical clustering and K-means clustering.

1 Hierarchical clustering
Hierarchical clustering, also known as systematic clustering, first defines a distance between samples: samples that are close together belong to the same class, and samples that are far apart belong to different classes. Distances that can be used include Euclidean distance (euclidean), Manhattan distance (manhattan), binary distance (binary), and Minkowski distance (minkowski); correlation coefficients and the cosine of the angle between vectors can also be used.

Hierarchical clustering starts by treating each sample as its own class, then repeatedly merges the two closest classes and recomputes the distances between classes; the process continues until all samples belong to a single class. There are six common methods for measuring the distance between classes: single linkage (shortest distance), complete linkage (longest distance), average linkage, the centroid method, the median method, and Ward's minimum variance method (based on the within-class sum of squares).

Below we use the iris dataset for cluster analysis; the R function used is hclust. First the four numeric variables of iris are extracted and the Euclidean distance matrix is computed. The matrix is then drawn as a heat map: the darker the colour, the smaller the distance between samples, and roughly three to four blocks of mutually similar samples can be distinguished.
data <- iris[,-5]
dist.e <- dist(data, method='euclidean')
heatmap(as.matrix(dist.e), labRow=NA, labCol=NA)
Next the hclust function is used to build the clustering model, which is stored in model1; the method argument ('ward.D') sets the between-class distance to Ward's minimum variance criterion. plot(model1) draws the dendrogram. If we want three classes, the cutree function extracts the class that each sample belongs to.
model1 <- hclust(dist.e, method='ward.D')
result <- cutree(model1, k=3)
To show the effect of clustering, we can combine multidimensional scaling with the clustering result. The data are first reduced to two dimensions by MDS; the original species are then shown with different shapes and the clustering result with different colours. The setosa samples are clustered very successfully, but some versicolor and virginica samples are mixed into each other's clusters.
mds <- cmdscale(dist.e, k=2, eig=TRUE)
x <- mds$points[,1]
y <- mds$points[,2]
library(ggplot2)
p <- ggplot(data.frame(x, y), aes(x, y))
p + geom_point(size=3, alpha=0.8,
               aes(colour=factor(result), shape=iris$Species))
2 K-means clustering
K-means clustering, also called dynamic clustering, is computationally simpler and does not require a distance matrix as input. We first specify the number of clusters k and randomly take k samples as the initial class centers; each sample is assigned to the nearest center, the centers are recomputed once all samples have been assigned, and the process repeats until the class centers no longer change.

In R, the kmeans function performs K-means clustering. The centers argument sets the number of classes, and the nstart argument sets the number of random initial configurations; its default is 1, but using more starts can improve the clustering result. model2$cluster extracts the class that each sample belongs to.
model2 <- kmeans(data, centers=3, nstart=10)
When using K-means clustering, note that it can only be applied when a mean is defined for the data, and the number of classes must be specified in advance. One approach is to determine the number of classes with hierarchical clustering first and then refine the result with K-means; another is to choose the number of classes with the silhouette coefficient. Clustering can also be improved by transforming the original data, for example by reducing its dimensionality before clustering.
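A minimal sketch of the silhouette idea (assuming the cluster package is installed): the average silhouette width is compared across several candidate values of k, reusing the data and dist.e objects defined above.
library(cluster)
avg.sil <- sapply(2:6, function(k) {
  km <- kmeans(data, centers = k, nstart = 10)
  mean(silhouette(km$cluster, dist.e)[, "sil_width"])  # average silhouette width for this k
})
(2:6)[which.max(avg.sil)]   # k with the largest average silhouette width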

The cluster extension package also provides many functions for clustering, such as agnes for agglomerative hierarchical clustering, pam for partitioning around medoids (a robust relative of K-means), and fanny for fuzzy clustering.
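A minimal sketch of pam as an alternative to kmeans on the same data (cluster package assumed installed):
library(cluster)
model3 <- pam(data, k = 3)                 # partitioning around 3 medoids
table(iris$Species, model3$clustering)     # compare the clusters with the true species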

