Data Analysis (7): Correlation Analysis

Source: Internet
Author: User

Correlation analysis is a basic method of data analysis. It is used to find correlations between variables, that is, the similarity in how they change together, which can be described by correlation coefficients. Finding correlations helps you predict the future; discovering causality means you can change the world.

One, covariance and correlation coefficients

If the random variables X and Y are independent of each other, then the covariance

Cov(X, Y) = E{[X - E(X)][Y - E(Y)]} = 0.

This means that when the covariance Cov(X, Y) is not equal to 0, X and Y are not independent of each other but have some relationship; X and Y are then said to be correlated. Statistically, covariance and correlation coefficients are used to describe the correlation between the random variables X and Y:

Covariance: if two variables tend to change in the same direction, that is, when one is greater than its own expectation the other also tends to be greater than its own expectation, then the covariance between the two variables is positive. If the two variables tend to change in opposite directions, that is, when one is greater than its own expectation the other tends to be less than its own expectation, then the covariance between them is negative. Numerically, the greater the absolute value of the covariance, the stronger the co-movement of the two variables.

Cov(X, Y) = E[(X - μ_X)(Y - μ_Y)], where μ is the expectation of the variable.

Correlation coefficient: the correlation coefficient eliminates the influence of the two variables' amplitudes (scales) and measures only the similarity of the two variables' per-unit changes.

ρ(X, Y) = Cov(X, Y) / (σ_X σ_Y), where σ is the standard deviation of the variable.

Correlation coefficients describe the relationship between quantitative variables: the sign (+, -) of the correlation coefficient indicates the direction of the relationship (positive or negative correlation), and the absolute value indicates its strength (0 means completely uncorrelated, 1 means perfectly correlated).
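The two definitions above can be checked numerically. The following sketch (Python with NumPy; the data are made up for illustration) computes the sample covariance and correlation coefficient directly from the formulas and compares them with NumPy's built-ins:

```python
import numpy as np

# Two small samples that move in the same direction (hypothetical data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 8.0])

# Covariance from the definition E[(X - mu_X)(Y - mu_Y)]
# (sample version, dividing by n - 1).
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# Correlation coefficient: covariance divided by both standard deviations,
# which removes the influence of each variable's scale.
corr_xy = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

print(cov_xy, corr_xy)
# The same values come from NumPy's built-ins:
print(np.cov(x, y)[0, 1], np.corrcoef(x, y)[0, 1])
```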

For example, in the following two cases it is easy to see that X and Y change in the same direction, and this "same-direction change" has a very notable feature: the changes of X and Y are highly similar.

1, Observe the covariance. The covariance of case one is:

The covariance of case two is:

The two covariance values differ by a factor of 10,000. From the covariances alone we can only judge that X and Y change in the same direction in both cases; we cannot see how similar the changes of X and Y are in the two cases.

2, Observe the correlation coefficient. The correlation coefficient of case one is:

The correlation coefficient of case two is:

Although the covariances of the two cases differ by a factor of 10,000, their correlation coefficients are the same, indicating that the change of X is highly similar to the change of Y in both cases.
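This effect is easy to reproduce. In the sketch below (Python with NumPy, with hypothetical data standing in for the two cases), scaling both variables by 100 multiplies the covariance by 10,000 while leaving the correlation coefficient unchanged:

```python
import numpy as np

# Case one: two variables that change in the same direction (made-up data).
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([12.0, 25.0, 31.0, 44.0, 58.0])

# Case two: the same pattern of change, but 100 times the amplitude.
x2, y2 = 100 * x, 100 * y

cov1 = np.cov(x, y)[0, 1]
cov2 = np.cov(x2, y2)[0, 1]
r1 = np.corrcoef(x, y)[0, 1]
r2 = np.corrcoef(x2, y2)[0, 1]

# Covariance scales with amplitude: multiplying both variables by 100
# multiplies the covariance by 100 * 100 = 10,000.
print(cov2 / cov1)
# The correlation coefficient is unchanged.
print(r1, r2)
```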

Two, types of correlation coefficients

R can compute a variety of correlation coefficients, including the Pearson correlation coefficient, the Spearman correlation coefficient, the Kendall correlation coefficient, partial correlation coefficients, polychoric correlation coefficients, and polyserial correlation coefficients. Let's look at these correlation coefficients in turn.

1, Pearson, Spearman, and Kendall correlation coefficients

The Pearson correlation coefficient measures the degree of linear correlation between two quantitative variables; the Spearman rank correlation coefficient measures the correlation between rank-ordered (ordinal) variables; and the Kendall correlation coefficient is also a nonparametric measure of rank correlation. The cor() function computes all three correlation coefficients, and the cov() function computes the covariance.

cor(x, y = NULL, use = "everything", method = c("pearson", "kendall", "spearman"))
cov(x, y = NULL, use = "everything", method = c("pearson", "kendall", "spearman"))

Parameter comment:

    • x: a matrix or data frame.
    • y: NULL by default, which means y = x, so correlations are computed pairwise among all variables of x; another matrix or data frame can be supplied instead, in which case correlations are computed pairwise between the variables of x and y.
    • use: specifies how missing data are handled. The options are all.obs (an error is raised when missing data are encountered), everything (correlations involving missing data are set to missing), complete.obs (listwise deletion), and pairwise.complete.obs (pairwise deletion).
    • method: specifies the type of correlation coefficient; the options are "pearson", "kendall", and "spearman".
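For readers working outside R, the same three coefficients are available in SciPy. The sketch below (with synthetic data) computes the Pearson, Spearman, and Kendall correlations for a noisy linear relationship:

```python
import numpy as np
from scipy import stats

# Synthetic data: y is linearly related to x, plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 2 * x + rng.normal(size=30)

# Pearson: linear correlation between two quantitative variables.
r_pearson, _ = stats.pearsonr(x, y)
# Spearman: rank correlation, suitable for ordinal data or monotone relations.
r_spearman, _ = stats.spearmanr(x, y)
# Kendall: another nonparametric rank-correlation measure.
r_kendall, _ = stats.kendalltau(x, y)

print(r_pearson, r_spearman, r_kendall)
```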

For example, the state.x77 dataset in the base R installation provides data such as population (Population), income (Income), illiteracy rate (Illiteracy), life expectancy (Life Exp), murder rate (Murder), and high-school graduation rate (HS Grad) for the 50 states of the USA.

> states <- state.x77[, 1:6]
> cor(states)
             Population     Income Illiteracy    Life Exp     Murder     HS Grad
Population   1.00000000  0.2082276  0.1076224 -0.06805195  0.3436428 -0.09848975
Income       0.20822756  1.0000000 -0.4370752  0.34025534 -0.2300776  0.61993232
Illiteracy   0.10762237 -0.4370752  1.0000000 -0.58847793  0.7029752 -0.65718861
Life Exp    -0.06805195  0.3402553 -0.5884779  1.00000000 -0.7808458  0.58221620
Murder       0.34364275 -0.2300776  0.7029752 -0.78084575  1.0000000 -0.48797102
HS Grad     -0.09848975  0.6199323 -0.6571886  0.58221620 -0.4879710  1.00000000

We can see that there is a strong positive correlation between income and the high-school graduation rate (about 0.620), a strong positive correlation between the illiteracy rate and the murder rate (about 0.703), a strong negative correlation between the illiteracy rate and the high-school graduation rate (about -0.657), and a strong negative correlation between life expectancy and the murder rate (about -0.781).

2, Partial correlation

Partial correlation refers to the correlation between two quantitative variables while controlling for one or more other quantitative variables (called conditioning variables). You can use the pcor() function in the ggm package to compute partial correlation coefficients.

pcor(u, S)

Parameter comment:

    • u: a vector of integer positions; the first two integers are the indices of the variables whose partial correlation is to be computed, and the remaining integers are the indices of the conditioning variables.
    • S: the covariance matrix of the dataset, for example the result of the cov() function.

For example, controlling for income, the illiteracy rate, and the high-school graduation rate, the partial correlation coefficient between population and the murder rate is about 0.346:

> library(igraph)
> library(ggm)
> colnames(states)
[1] "Population" "Income"     "Illiteracy" "Life Exp"   "Murder"     "HS Grad"
> pcor(c(1, 5, 2, 3, 6), cov(states))
[1] 0.3462724
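The quantity pcor() computes can also be derived by hand from a covariance matrix: take the sub-covariance-matrix of the two target variables plus the conditioning variables, invert it, and read the partial correlation off the resulting precision matrix. A sketch in Python with NumPy (the data are random stand-ins, since state.x77 is an R dataset):

```python
import numpy as np

def pcor_from_cov(S, idx):
    """Partial correlation between the first two variables in idx,
    controlling for the remaining ones (mirrors ggm::pcor's convention)."""
    sub = S[np.ix_(idx, idx)]      # covariance of just the selected variables
    P = np.linalg.inv(sub)         # precision (inverse covariance) matrix
    return -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])

# Hypothetical data: 50 observations of 6 variables, standing in for states.
rng = np.random.default_rng(1)
data = rng.normal(size=(50, 6))
S = np.cov(data, rowvar=False)     # like cov(states) in R

# Analogue of pcor(c(1, 5, 2, 3, 6), cov(states)) -- 0-based indices here.
r_partial = pcor_from_cov(S, [0, 4, 1, 2, 5])
print(r_partial)
```

With only two indices and no conditioning variables, the formula reduces to the ordinary correlation coefficient, which is a handy sanity check.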
Three, significance tests for correlation

After a correlation coefficient has been computed, it must be tested for significance. The usual null hypothesis is that the variables are uncorrelated (that is, the population correlation coefficient is 0). The cor.test() function can be used to test an individual Pearson, Spearman, or Kendall correlation coefficient, checking whether the null hypothesis holds. If the p-value is small, there is a correlation between the variables, and its strength is given by the correlation coefficient.

In the result of a significance test, the p-value is the probability of obtaining sample observations at least as extreme as those actually observed, assuming the null hypothesis is true. A very small p-value means that what was observed would be very unlikely if the null hypothesis were true; by the small-probability principle, we then have reason to reject the null hypothesis, and the smaller the p-value, the stronger the grounds for rejecting it.

The small-probability principle refers to the fact that, in statistics, events whose probability is less than about 5% are often treated as practically "impossible events". The significance level is usually set to 0.05 or 0.025. When the p-value is less than the significance level, the observed data would constitute a small-probability event under the null hypothesis, so the null hypothesis is rejected.

1, The cor.test() test

cor.test() can test only one correlation at a time. The null hypothesis is that there is no correlation between the two variables, that is, that the population correlation coefficient is 0.

cor.test(x, y, alternative = c("two.sided", "less", "greater"),
         method = c("pearson", "kendall", "spearman"),
         exact = NULL, conf.level = 0.95, continuity = FALSE, ...)

Parameter comment:

    • alternative: specifies a two-sided or one-sided test; the valid values are "two.sided", "greater", and "less". For a one-sided test, use alternative = "less" when the population correlation coefficient is expected to be less than 0, and alternative = "greater" when it is expected to be greater than 0. The default, alternative = "two.sided", tests whether the population correlation coefficient is not equal to 0.
    • method: specifies the type of correlation coefficient to compute.
    • exact: a logical value indicating whether an exact p-value should be computed.
    • conf.level: the confidence level for the returned confidence interval.

For example, the following code tests the hypothesis that the Pearson correlation coefficient between the illiteracy rate and the murder rate is 0.

> cor.test(states[, 3], states[, 5])

        Pearson's product-moment correlation

data:  states[, 3] and states[, 5]
t = 6.8479, df = 48, p-value = 1.258e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5279280 0.8207295
sample estimates:
      cor
0.7029752

The result of the test: the p-value is 1.258e-08 and the sample estimate of the correlation coefficient (cor) is 0.703, which indicates:

Assuming that the population correlation is 0, we would expect to see a sample correlation as large as 0.703 less than once in 10 million samples. Since this is practically impossible, we reject the null hypothesis and conclude that the population correlation between the illiteracy rate and the murder rate is not 0.
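That output can be checked by hand: for a sample correlation r from n observations, the test statistic is t = r * sqrt(n - 2) / sqrt(1 - r^2), which follows a t distribution with n - 2 degrees of freedom under the null hypothesis. Plugging in the reported values (a quick check in Python with SciPy):

```python
import numpy as np
from scipy import stats

# Values reported by cor.test() above.
r, n = 0.7029752, 50

# t statistic for testing rho = 0.
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
# Two-sided p-value from the t distribution with n - 2 degrees of freedom.
p = 2 * stats.t.sf(abs(t), df=n - 2)

print(t, p)   # recovers t ~ 6.85 and p ~ 1.26e-08, matching the R output
```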

2, The corr.test() test

The corr.test() function in the psych package computes a correlation matrix together with the corresponding significance levels, for Pearson, Spearman, or Kendall correlations.

" pairwise ", method="Pearson", adjust="Holm", alpha=. , Ci=true)

Parameter comment:

    • use: specifies how missing data are handled; the default is "pairwise" (pairwise deletion), and "complete" performs listwise deletion.
    • method: the correlation method, "pearson" (the default), "spearman", or "kendall".

3, Significance test for partial correlation

Under the assumption of multivariate normality, the pcor.test() function in the ggm package tests whether two variables are conditionally independent when one or more conditioning variables are controlled.

pcor.test(r, q, n)

Parameter comment:

    • r: the partial correlation coefficient computed by the pcor() function.
    • q: the number of conditioning variables.
    • n: the sample size.
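The underlying computation is the same t test as before, with the degrees of freedom reduced by the number of conditioning variables: t = r * sqrt(n - q - 2) / sqrt(1 - r^2), with n - q - 2 degrees of freedom. A sketch (Python with SciPy, plugging in the values from the earlier pcor() example):

```python
import numpy as np
from scipy import stats

# Values from the partial-correlation example above:
# partial r, number of conditioning variables, sample size.
r, q, n = 0.3462724, 3, 50

# Degrees of freedom shrink by one per conditioning variable.
df = n - q - 2
t = r * np.sqrt(df) / np.sqrt(1 - r**2)
p = 2 * stats.t.sf(abs(t), df=df)

print(t, p)
```

With these numbers the partial correlation is significant at the 0.05 level but not at the 0.01 level.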

