1. Descriptive statistics for quantitative variables
(1) Functions in the base installation:
The summary() function provides the minimum, maximum, quartiles, and mean for numeric variables, as well as frequency counts for factors and logical vectors;
The apply() or sapply() function computes any descriptive statistic you choose. The format is sapply(x, FUN, options), where x is your data frame (or matrix) and FUN is an arbitrary function. If options are specified, they are passed to FUN;
The fivenum() function returns Tukey's five-number summary (minimum, lower hinge, median, upper hinge, and maximum).
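A minimal sketch of these base functions; the built-in mtcars data set and the statistics chosen in mystats() are illustrative assumptions, not from the original text:
vars <- c("mpg", "hp", "wt")
summary(mtcars[vars])                        # min, quartiles, mean, max per variable
mystats <- function(x, na.omit = FALSE) {    # user-defined statistic for sapply()
  if (na.omit) x <- x[!is.na(x)]
  c(n = length(x), mean = mean(x), sd = sd(x))
}
sapply(mtcars[vars], mystats)
fivenum(mtcars$mpg)                          # Tukey's five-number summary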
(2) Functions in contributed packages:
The describe() function in the Hmisc package returns the number of variables and observations, the numbers of missing and unique values, the mean, the quantiles, and the five largest and five smallest values;
The psych package also has a function called describe(), which computes the number of non-missing values, the mean, standard deviation, median, trimmed mean, median absolute deviation, minimum, maximum, range, skewness, kurtosis, and the standard error of the mean.
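A short sketch, assuming the Hmisc and psych packages are installed and again using mtcars only for illustration:
library(Hmisc)
describe(mtcars[c("mpg", "hp", "wt")])        # n, missing, unique, mean, quantiles, extremes
library(psych)
psych::describe(mtcars[c("mpg", "hp", "wt")]) # namespace given explicitly: both packages define describe()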
(3) Descriptive statistics by group
The aggregate() function can only use functions that return a single value, so it cannot return several statistics at once;
The describeBy() function in the psych package computes the same descriptive statistics as describe(), stratified by one or more grouping variables. It does not allow an arbitrary function to be specified, so it is less general. If there is more than one grouping variable, you can write them as list(groupvar1, groupvar2, ..., groupvarN); however, this works only if there are no empty cells when the grouping variables are crossed.
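A sketch of both approaches, grouping mtcars by the transmission indicator am (an illustrative choice):
aggregate(mtcars[c("mpg", "hp")], by = list(am = mtcars$am), FUN = mean)  # one statistic per call
library(psych)
describeBy(mtcars[c("mpg", "hp")], group = list(am = mtcars$am))          # full describe() output per group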
2. Frequency and contingency tables for categorical variables
(1) Generating frequency tables
Note: the table() function ignores missing values (NA) by default. To treat NA as a valid category in the frequency counts, set the argument useNA="ifany".
One-way tables: use the prop.table() function to convert frequencies into proportions;
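A minimal sketch of a one-way table. The Arthritis data set from the vcd package is an assumed example; it contains the Treatment, Improved, and Sex variables referred to later in this section:
library(vcd)                                 # provides the Arthritis data
mytable <- table(Arthritis$Improved)
mytable                                      # frequencies
prop.table(mytable)                          # proportions
prop.table(mytable) * 100                    # percentages
table(Arthritis$Improved, useNA = "ifany")   # would also count NA values, if any were present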
Two-way contingency tables:
Methods one and two: table() and xtabs();
In the calls below, 1 refers to the first variable (rows) in the table() statement and 2 to the second (columns):
Row sums for the first variable: margin.table(mytable, 1)
Column sums for the second variable: margin.table(mytable, 2)
Row proportions for the first variable: prop.table(mytable, 1)
Column proportions for the second variable: prop.table(mytable, 2)
Proportion of each cell: prop.table(mytable)
Add row and column sums: addmargins(mytable)
Add the marginal sums for the first dimension only: addmargins(mytable, 1)
Compute row proportions and append the row sums as an extra column: addmargins(prop.table(mytable, 1), 2)
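A sketch of these calls on a two-way table, again assuming the Arthritis data:
library(vcd)
mytable <- xtabs(~ Treatment + Improved, data = Arthritis)   # method two; table() is method one
margin.table(mytable, 1)                 # row sums (Treatment)
margin.table(mytable, 2)                 # column sums (Improved)
prop.table(mytable, 1)                   # row proportions
prop.table(mytable, 2)                   # column proportions
prop.table(mytable)                      # cell proportions
addmargins(mytable)                      # add row and column sums
addmargins(prop.table(mytable, 1), 2)    # row proportions with a Sum column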
Method three: use the CrossTable() function in the gmodels package.
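A one-call sketch (gmodels assumed to be installed); CrossTable() prints cell counts together with row, column, and total proportions:
library(gmodels)
CrossTable(Arthritis$Treatment, Arthritis$Improved)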
Multidimensional contingency tables:
Both table() and xtabs() can generate multidimensional contingency tables from three or more categorical variables. The margin.table(), prop.table(), and addmargins() functions generalize naturally to more than two dimensions. In addition, the ftable() function prints a multidimensional table in a compact and attractive way.
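A sketch of a three-way table, using the same assumed Arthritis variables:
mytable <- xtabs(~ Treatment + Sex + Improved, data = Arthritis)
ftable(mytable)                           # compact "flat" printout
margin.table(mytable, c(1, 3))            # Treatment x Improved marginal frequencies
ftable(prop.table(mytable, c(1, 2)))      # proportions of Improved within each Treatment x Sex cell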
A contingency table tells you the frequency or proportions of each combination of the variables that compose it, but you may also be interested in whether the variables in the table are related or independent.
(2) Tests of independence
R provides a variety of methods for testing the independence of categorical variables. The three tests described in this section are the chi-square test of independence, Fisher's exact test, and the Cochran-Mantel-Haenszel test. Each assesses whether there is sufficient evidence to reject the null hypothesis that the variables are independent of each other.
Chi-square test of independence: use the chisq.test() function to test the independence of the row and column variables of a two-way table;
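A minimal sketch (Arthritis assumed as before); a small p-value would suggest rejecting the hypothesis of independence:
mytable <- xtabs(~ Treatment + Improved, data = Arthritis)
chisq.test(mytable)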
Fisher's exact test: the null hypothesis of Fisher's exact test is that the row and column variables of a contingency table with fixed margins are independent. Its calling format is fisher.test(mytable), where mytable is a two-way contingency table. The fisher.test() function can be applied to any two-way table with two or more rows and columns, not just a 2x2 table.
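A sketch, again assuming the Arthritis table:
fisher.test(xtabs(~ Treatment + Improved, data = Arthritis))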
Cochran-Mantel-Haenszel test: the mantelhaen.test() function performs the Cochran-Mantel-Haenszel chi-square test of the null hypothesis that two nominal variables are conditionally independent within each stratum of a third variable. The code below tests whether treatment and improvement are independent within each level of sex. The test assumes that there is no three-way interaction (treatment x improvement x sex).
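A sketch of that code, assuming the Arthritis data as before:
mytable <- xtabs(~ Treatment + Improved + Sex, data = Arthritis)
mantelhaen.test(mytable)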
(3) Measures of association
The assocstats() function in the vcd package computes the phi coefficient, the contingency coefficient, and Cramer's V for a two-way table. The vcd package also provides a Kappa() function that computes Cohen's kappa and the weighted kappa for a confusion matrix (for example, the confusion matrix formed when two judges classify the same series of objects, where kappa measures their agreement).
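A sketch; the Arthritis table is assumed as before, and the ratings matrix is invented purely to show the Kappa() call:
library(vcd)
assocstats(xtabs(~ Treatment + Improved, data = Arthritis))   # phi, contingency coefficient, Cramer's V
ratings <- matrix(c(20, 5, 3, 12), nrow = 2,                  # hypothetical 2x2 confusion matrix
                  dimnames = list(rater1 = c("yes", "no"),
                                  rater2 = c("yes", "no")))
Kappa(ratings)                                                # Cohen's kappa and weighted kappa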
(4) Visualization of the results
(5) Converting a table to a flat format: not covered here.
3. Correlation
Correlation coefficients can be used to describe the relationship between quantitative variables.
(1) Types of correlations
R can compute a variety of correlation coefficients, including the Pearson, Spearman, and Kendall correlation coefficients, partial correlation coefficients, polychoric correlation coefficients, and polyserial correlation coefficients. Let's look at each in turn.
Pearson, Spearman, and Kendall correlations
The Pearson product-moment correlation coefficient measures the degree of linear relationship between two quantitative variables. Spearman's rank-order correlation coefficient measures the degree of relationship between two rank-ordered variables. Kendall's tau is also a nonparametric measure of rank correlation.
The cor() function computes all three correlation coefficients, and the cov() function computes covariances. Both functions have many arguments, but those relevant to computing correlations can be simplified to:
cor(x, use=, method=)
By default, the result is a square matrix (all variables are correlated with each other pairwise). You can also supply a second matrix or data frame of variables, in which case cor(x, y) returns a non-square matrix; this is useful when you are interested in the relationships between one set of variables and another.
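A sketch of both functions, using the built-in state.x77 data purely as an illustration:
states <- state.x77[, 1:6]
cov(states)                               # covariance matrix
cor(states)                               # Pearson correlations (the default)
cor(states, method = "spearman")          # Spearman rank correlations
x <- states[, c("Population", "Income", "Illiteracy", "HS Grad")]
y <- states[, c("Life Exp", "Murder")]
cor(x, y)                                 # non-square: one set of variables against another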
Partial correlation
Partial correlation is the correlation between two quantitative variables while controlling for one or more other quantitative variables. You can use the pcor() function in the ggm package to compute partial correlation coefficients.
Its calling format is:
pcor(u, S)
where u is a vector of integers: the first two entries are the indices of the variables to be correlated, and the remaining entries are the indices of the conditioning variables (that is, the variables being partialed out). S is the covariance matrix among the variables. An example will help clarify the usage:
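A sketch using the same assumed state.x77 data (the indices refer to the column order of states):
library(ggm)
states <- state.x77[, 1:6]
# partial correlation of Population (1) and Murder (5),
# controlling for Income (2), Illiteracy (3), and HS Grad (6)
pcor(c(1, 5, 2, 3, 6), cov(states))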
Other types of correlations
The hetcor() function in the polycor package computes a heterogeneous correlation matrix containing the Pearson product-moment correlations between numeric variables, polyserial correlations between numeric and ordinal variables, polychoric correlations between ordinal variables, and tetrachoric correlations between two dichotomous variables. Polyserial, polychoric, and tetrachoric correlations assume that the ordinal or dichotomous variables are derived from underlying normal distributions.
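A sketch with a small synthetic data frame (the variables and their generation are invented only to show the mix of types hetcor() handles):
library(polycor)
set.seed(1)
z  <- rnorm(200)
df <- data.frame(
  x = z + rnorm(200),                                 # numeric
  o = cut(z + rnorm(200), 3, ordered_result = TRUE),  # ordered factor
  b = factor(z + rnorm(200) > 0)                      # dichotomous factor
)
hetcor(df)   # Pearson, polyserial, polychoric/tetrachoric correlations as appropriate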
(2) Testing correlations for significance
The use= and method= arguments described here belong to the cor() function above: use= can be "pairwise" or "complete" (pairwise or listwise deletion of missing values, respectively), and method= can be "pearson" (the default), "spearman", or "kendall".
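The usual tools for the tests this heading refers to are cor.test() from base R (a single correlation) and corr.test() from the psych package (a whole matrix of correlations); a sketch, with state.x77 again only as an illustration:
states <- state.x77[, 1:6]
cor.test(states[, "Life Exp"], states[, "Murder"])   # H0: the correlation is 0
library(psych)
corr.test(states, use = "complete")                  # correlations, sample sizes, and p-values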
Significance test for a partial correlation coefficient:
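A sketch, assuming the pcor.test() function in the ggm package, which tests the partial correlation returned by pcor() under an assumption of multivariate normality:
library(ggm)
states <- state.x77[, 1:6]
pc <- pcor(c(1, 5, 2, 3, 6), cov(states))   # partial correlation from the earlier example
pcor.test(pc, q = 3, n = nrow(states))      # q = number of conditioning variables, n = sample size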
Other significance tests:
The r.test() function in the psych package provides a variety of useful significance tests.
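A sketch of r.test(); the sample sizes and correlation values below are invented purely to show the calling format:
library(psych)
r.test(n = 50, r12 = .45)                       # is a correlation of .45 (n = 50) significant?
r.test(n = 50, r12 = .45, r34 = .25, n2 = 60)   # do two independent correlations differ?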
4. t-tests
The most common activity in research is comparing two groups. Do patients receiving a new drug show greater improvement than patients using an existing drug? Here we focus on group comparisons in which the outcome variable is continuous and assumed to be normally distributed.
(1) Independent-samples t-test
Example: testing whether Southern and non-Southern US states have the same probability of imprisonment.
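A sketch assuming the UScrime data from the MASS package, which matches this example (Prob is the probability of imprisonment, So indicates a Southern state):
library(MASS)
t.test(Prob ~ So, data = UScrime)   # Welch two-sample t-test by default (unequal variances)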
(2) Dependent-samples t-test
Pre-post designs and repeated-measures designs produce dependent (non-independent) groups. The dependent-samples t-test assumes that the differences between the groups are normally distributed.
The calling format for this test is:
t.test(y1, y2, paired=TRUE)
where y1 and y2 are the numeric vectors for the two dependent groups.
Example: testing whether the mean unemployment rate is the same for older and younger males.
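A sketch, again assuming UScrime, where U1 and U2 are the unemployment rates of younger (ages 14-24) and older (ages 35-39) males:
library(MASS)
sapply(UScrime[c("U1", "U2")], mean)           # group means
with(UScrime, t.test(U1, U2, paired = TRUE))   # the two rates are measured on the same states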
(3) More than two groups
When comparing more than two groups, if the data are independently sampled from normal populations, you can use analysis of variance (ANOVA).
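A minimal one-way illustration (mtcars again as an assumed example):
fit <- aov(mpg ~ factor(cyl), data = mtcars)   # does mean mpg differ across cylinder groups?
summary(fit)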
5. Nonparametric tests of group differences
If the data do not satisfy the parametric assumptions of the t-test or ANOVA, you can use nonparametric methods instead.
(1) Comparing two groups
Independent samples: the Wilcoxon rank-sum test (also known as the Mann-Whitney U test) assesses whether two independent samples were drawn from the same distribution; it is performed with wilcox.test().
Dependent samples:
The Wilcoxon signed-rank test is a nonparametric alternative to the dependent-samples t-test. It is appropriate for paired data and for situations where normality cannot be assumed. Its calling format is exactly the same as for the Mann-Whitney U test, but with the added argument paired=TRUE.
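A sketch of both tests, reusing the assumed UScrime examples from the t-test section:
library(MASS)
wilcox.test(Prob ~ So, data = UScrime)              # independent samples: Mann-Whitney U test
sapply(UScrime[c("U1", "U2")], median)              # group medians
with(UScrime, wilcox.test(U1, U2, paired = TRUE))   # dependent samples: Wilcoxon signed-rank test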
(2) Comparing more than two groups
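For more than two groups, the usual nonparametric choices are the Kruskal-Wallis test for independent groups (kruskal.test()) and the Friedman test for dependent groups (friedman.test()). A sketch, assuming the built-in state.x77 data and the state.region grouping factor:
states <- data.frame(state.region, state.x77)
kruskal.test(Illiteracy ~ state.region, data = states)   # does illiteracy differ across US regions?
# Friedman test format for dependent groups: friedman.test(y ~ A | B, data = ...)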