Multivariate statistical analysis in R

Last Update:2015-02-27 Source: Internet

Author: User

Tags rcolorbrewer

# Read the multivariate data into R
wine <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", sep = ",")
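# Because the file has no header row, read.table() names the columns V1, V2, ... automatically, which is why the code that follows refers to V1, V2 and so on. A minimal offline sketch of that behaviour, assuming two made-up rows (the values are illustrative and are not taken from the wine data):

```r
# Two made-up comma-separated rows standing in for a headerless data file
sample_text <- "1,14.2,1.7\n2,12.4,0.9"
toy <- read.table(text = sample_text, sep = ",")
# With no header row, read.table() assigns default column names
print(names(toy))
print(dim(toy))
```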
# Plotting multivariate data
# Matrix scatterplot
# A common way to plot multivariate data is a matrix scatterplot, which shows a scatterplot for every pair of variables.
# We can do this using the "scatterplotMatrix()" function in the "car" package in R.
library(car)
scatterplotMatrix(wine[2:6])
# Scatterplot with the data points labelled by their group
plot(wine$V4, wine$V5)
text(wine$V4, wine$V5, wine$V1, cex = 0.7, pos = 4, col = "red")

# Profile plot
# Another very useful chart type is the "profile plot", which shows the variation in each variable by plotting its value for each sample.
# The "makeProfilePlot()" function below draws a profile plot. It requires the "RColorBrewer" library.
makeProfilePlot <- function(mylist, names) {
  require(RColorBrewer)
  # Find out how many variables we want to include
  numvariables <- length(mylist)
  # Choose 'numvariables' distinct colours from the "Set1" palette
  colours <- brewer.pal(numvariables, "Set1")
  # Find out the minimum and maximum values of the variables:
  mymin <- 1e+20
  mymax <- -1e+20
  for (i in 1:numvariables) {
    vectori <- mylist[[i]]
    mini <- min(vectori)
    maxi <- max(vectori)
    if (mini < mymin) { mymin <- mini }
    if (maxi > mymax) { mymax <- maxi }
  }

  # Plot the variables
  for (i in 1:numvariables) {
    vectori <- mylist[[i]]
    namei <- names[i]
    colouri <- colours[i]

    # The first variable creates the plot; the others are added to it as lines
    if (i == 1) { plot(vectori, col = colouri, type = "l", ylim = c(mymin, mymax)) }
    else { points(vectori, col = colouri, type = "l") }

    # Label each line near its right-hand end
    lastxval <- length(vectori)
    lastyval <- vectori[length(vectori)]
    text((lastxval - 10), (lastyval), namei, col = "black", cex = 0.6)
  }
}
# For example, to draw a profile plot of the first five chemicals in the wine samples (they are stored in the V2, V3, V4, V5 and V6 columns of the "wine" variable), we enter:
library(RColorBrewer)
names <- c("V2", "V3", "V4", "V5", "V6")
mylist <- list(wine$V2, wine$V3, wine$V4, wine$V5, wine$V6)
makeProfilePlot(mylist, names)
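# A similar plot can probably be sketched with base R's matplot(), which draws one line per column and so needs no explicit loop. This is an alternative to the function above, not a replacement for it; the data frame "toy" and its values are made up for illustration.

```r
# Made-up data: three variables measured on six samples
toy <- data.frame(a = c(1, 3, 2, 4, 3, 5),
                  b = c(2, 2, 3, 3, 4, 4),
                  c = c(5, 4, 4, 3, 2, 1))
# matplot() plots each column of a matrix against the sample index
matplot(as.matrix(toy), type = "l", lty = 1, col = 1:3,
        xlab = "Sample", ylab = "Value")
# Identify the lines with a legend rather than text placed on the plot
legend("top", legend = names(toy), col = 1:3, lty = 1, horiz = TRUE)
```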

# Calculating summary statistics for multivariate data
# Another thing you might want to do is calculate summary statistics, such as the mean and standard deviation, for each variable in your multivariate data set.
sapply(wine[, 2:14], mean)
sapply(wine[, 2:14], sd)
# Standardising the variables makes them easier to compare. To do this, we rescale each variable so that it has a sample mean of 0 and a sample variance of 1.
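# One way to standardise is base R's scale() function. The sketch below uses a made-up data frame "toy"; for the wine data the equivalent call would presumably be scale(wine[2:14]).

```r
# Made-up data: two variables on very different scales
toy <- data.frame(x = c(10, 20, 30, 40), y = c(0.1, 0.4, 0.2, 0.3))
# scale() subtracts each column's mean and divides by its sample standard deviation
standardised <- as.data.frame(scale(toy))
print(sapply(standardised, mean))  # each close to 0
print(sapply(standardised, sd))    # each equal to 1, up to rounding
```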

# Means and variances per group
# It is often of interest to calculate the means and standard deviations for only the samples from a particular group, for example for the wine samples from each cultivar. The cultivar is stored in the "V1" column of the "wine" variable.
# To extract only the data for cultivar 2, we enter:
cultivar2wine <- wine[wine$V1 == 2, ]
sapply(cultivar2wine[2:14], mean)
sapply(cultivar2wine[2:14], sd)
# You can use the same approach to calculate the mean and standard deviation of the 13 chemical concentrations for cultivar 1 or cultivar 3.
# However, for convenience, you may want to print the means and standard deviations for every group in a data set using the following "printMeanAndSdByGroup()" function:
printMeanAndSdByGroup <- function(variables, groupvariable) {
  # Find the names of the variables
  variablenames <- c(names(groupvariable), names(as.data.frame(variables)))
  # Within each group, find the mean of each variable
  groupvariable <- groupvariable[, 1] # ensures groupvariable is not a list
  means <- aggregate(as.matrix(variables) ~ groupvariable, FUN = mean)
  names(means) <- variablenames
  print(paste("Means:"))
  print(means)
  # Within each group, find the standard deviation of each variable:
  sds <- aggregate(as.matrix(variables) ~ groupvariable, FUN = sd)
  names(sds) <- variablenames
  print(paste("Standard deviations:"))
  print(sds)
  # Within each group, find the number of samples:
  samplesizes <- aggregate(as.matrix(variables) ~ groupvariable, FUN = length)
  names(samplesizes) <- variablenames
  print(paste("Sample sizes:"))
  print(samplesizes)
}
printMeanAndSdByGroup(wine[2:14], wine[1])
# The function "printMeanAndSdByGroup()" also prints the number of samples in each group. In this example, we can see that cultivar 1 has 59 samples, cultivar 2 has 71 samples, and cultivar 3 has 48 samples.
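# The same kind of per-group summary can be sketched on a small made-up data frame using aggregate() with a "." formula, which may be easier to follow than the function above. The data frame "toy" and its column names are assumptions for illustration only.

```r
# Made-up data: two groups, two measured variables
toy <- data.frame(group = c(1, 1, 2, 2, 2),
                  x = c(1, 3, 10, 20, 30),
                  y = c(2, 4, 5, 5, 5))
# "." on the left-hand side means "every column other than the grouping column"
group_means <- aggregate(. ~ group, data = toy, FUN = mean)
group_sds   <- aggregate(. ~ group, data = toy, FUN = sd)
# table() counts the samples in each group
group_sizes <- table(toy$group)
print(group_means)
print(group_sds)
print(group_sizes)
```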

# Between-groups variance and within-groups variance for a variable
# If we want to calculate the within-groups variance for a particular variable (for example, for the concentration of a specific chemical), we can use the following "calcWithinGroupsVariance()" function:
calcWithinGroupsVariance <- function(variable, groupvariable) {
  # Find out how many values the group variable can take
  groupvariable2 <- as.factor(groupvariable[[1]])
  levels <- levels(groupvariable2)
  numlevels <- length(levels)
  # Get the mean and standard deviation for each group:
  numtotal <- 0
  denomtotal <- 0
  for (i in 1:numlevels) {
    leveli <- levels[i]
    levelidata <- variable[groupvariable == leveli, ]
    levelilength <- length(levelidata)
    # Get the mean and standard deviation for group i:
    meani <- mean(levelidata)
    sdi <- sd(levelidata)
    # Accumulate (n_i - 1) * s_i^2 and the total sample size
    numi <- (levelilength - 1) * (sdi * sdi)
    denomi <- levelilength
    numtotal <- numtotal + numi
    denomtotal <- denomtotal + denomi
  }
  # Calculate the within-groups variance
  Vw <- numtotal / (denomtotal - numlevels)
  return(Vw)
}
# For example, to calculate the within-groups variance of the V2 variable (the concentration of the first chemical), we enter:
calcWithinGroupsVariance(wine[2], wine[1]) # [1] 0.2620525
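# The loop above accumulates (n_i - 1) * s_i^2 over the groups and divides by N - k, where N is the total number of samples and k the number of groups. A vectorised sketch of the same formula using tapply(), on made-up vectors "x" and "g" (both are assumptions for illustration):

```r
# Made-up data: one variable measured in two groups
x <- c(1, 2, 3, 10, 12, 14, 16)
g <- c(1, 1, 1, 2, 2, 2, 2)
n_i <- tapply(x, g, length)  # number of samples in each group
v_i <- tapply(x, g, var)     # sample variance within each group
# Pooled within-groups variance: sum((n_i - 1) * s_i^2) / (N - k)
vw <- sum((n_i - 1) * v_i) / (length(x) - length(n_i))
print(vw)
```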
# We can calculate the between-groups variance for a particular variable (such as V2) using the "calcBetweenGroupsVariance()" function below:
calcBetweenGroupsVariance <- function(variable, groupvariable) {
  # Find out how many values the group variable can take
  groupvariable2 <- as.factor(groupvariable[[1]])
  levels <- levels(groupvariable2)
  numlevels <- length(levels)
  # Calculate the overall grand mean:
  grandmean <- mean(variable[, 1])
  # Get the mean and standard deviation for each group:
  numtotal <- 0
  denomtotal <- 0
  for (i in 1:numlevels) {
    leveli <- levels[i]
    levelidata <- variable[groupvariable == leveli, ]
    levelilength <- length(levelidata)
    # Get the mean and standard deviation for group i:
    meani <- mean(levelidata)
    sdi <- sd(levelidata)
    # Accumulate n_i * (mean_i - grand mean)^2 and the total sample size
    numi <- levelilength * ((meani - grandmean)^2)
    denomi <- levelilength
    numtotal <- numtotal + numi
    denomtotal <- denomtotal + denomi
  }
  # Calculate the between-groups variance
  Vb <- numtotal / (numlevels - 1)
  Vb <- Vb[[1]]
  return(Vb)
}
# You can use it like this to calculate the between-groups variance of V2:
calcBetweenGroupsVariance(wine[2], wine[1]) # [1] 35.39742
# We can calculate the "separation" achieved by a variable by dividing its between-groups variance by its within-groups variance. Thus, the separation achieved by V2 is:
calcBetweenGroupsVariance(wine[2], wine[1]) / calcWithinGroupsVariance(wine[2], wine[1])
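# As defined above, the between-groups and within-groups variances are the two mean squares of a one-way analysis of variance, so the separation should match the ANOVA F statistic. A small sketch of that check on made-up vectors "x" and "g" (assumptions for illustration, not the wine data):

```r
# Made-up data: one variable measured in two groups
x <- c(1, 2, 3, 10, 12, 14, 16)
g <- factor(c(1, 1, 1, 2, 2, 2, 2))
# One-way ANOVA table for x explained by group
fit <- anova(lm(x ~ g))
vb <- fit[["Mean Sq"]][1]  # between-groups mean square
vw <- fit[["Mean Sq"]][2]  # within-groups (residual) mean square
separation <- vb / vw
print(separation)
print(fit[["F value"]][1])  # the same value, reported as the F statistic
```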
# If we want to calculate the separations achieved by all of the variables in a multivariate data set, we can use the following "calcSeparations()" function:
calcSeparations <- function(variables, groupvariable) {
  # Find out how many variables we have
  variables <- as.data.frame(variables)
  numvariables <- length(variables)
  # Find the variable names
  variablenames <- colnames(variables)
  # Calculate the separation for each variable
  for (i in 1:numvariables) {
    variablei <- variables[i]
    variablename <- variablenames[i]
    Vw <- calcWithinGroupsVariance(variablei, groupvariable)
    Vb <- calcBetweenGroupsVariance(variablei, groupvariable)
    sep <- Vb / Vw
    print(paste("variable", variablename, "Vw=", Vw, "Vb=", Vb, "separation=", sep))
  }
}
# For example, to calculate the separation for each of the 13 chemical concentrations, we enter:
calcSeparations(wine[2:14], wine[1])
# Thus, the individual variable that gives the greatest separation between the groups (wine cultivars) is V8 (separation 233.9).
# As we will discuss below, the purpose of linear discriminant analysis (LDA) is to find the linear combination of the individual variables that gives the greatest separation between the groups (here, the cultivars).
# The hope is that this gives a better separation than the best achievable by any individual variable (currently 233.9, for V8).
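# To pick out the best-separating variable programmatically rather than by reading the printed lines, the separations can be collected into a named vector. A minimal sketch, assuming a made-up data frame "toy" and a hypothetical helper "separation()" that takes the one-way ANOVA F statistic as the ratio of between-groups to within-groups variance:

```r
# Made-up data: variable "a" separates the two groups well, "b" does not
toy <- data.frame(group = factor(c(1, 1, 1, 2, 2, 2)),
                  a = c(1, 2, 3, 7, 8, 9),
                  b = c(1, 5, 3, 2, 6, 4))
# Hypothetical helper: separation of one variable, taken from the ANOVA table
separation <- function(x, g) anova(lm(x ~ g))[["F value"]][1]
# Apply the helper to every measured column, giving a named vector
seps <- sapply(toy[c("a", "b")], separation, g = toy$group)
print(sort(seps, decreasing = TRUE))
print(names(which.max(seps)))
```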
