Statistical tests: normality test and homogeneity-of-variance test in R

Source: Internet
Author: User
Tags: natural logarithm, square root

I. Basic principles of statistics
1 Two-sample t test conditions: ① both populations follow a normal distribution; ② the two population variances are equal, i.e. homogeneity of variance.
2 Paired t test condition: the population of paired differences follows a normal distribution.
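As a rough illustration of the two tests whose conditions are listed above, here is a minimal sketch on simulated data. The variable names and the simulated values are assumptions made for this example only.

```r
# Simulated data; names and values are illustrative only.
set.seed(42)
group_a <- rnorm(20, mean = 10, sd = 2)
group_b <- rnorm(20, mean = 12, sd = 2)

# Two-sample t test under the equal-variance assumption (pooled variance).
two_sample <- t.test(group_a, group_b, var.equal = TRUE)

# Paired t test: only the differences need to be (approximately) normal.
before <- rnorm(15, mean = 50, sd = 5)
after  <- before + rnorm(15, mean = 2, sd = 1)
paired <- t.test(after, before, paired = TRUE)

# The paired test is the same as a one-sample t test on the differences.
on_diff <- t.test(after - before, mu = 0)

print(two_sample$p.value)
print(paired$p.value)
print(on_diff$p.value)
```

The last two p-values should agree, which is why the paired test only needs normality of the differences.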

II. Using R for normality tests and homogeneity-of-variance tests

1 Normality Test

1.1 ① Shapiro-Wilk test (W test): n ≤ 50; ② Shapiro-Francia test (W' test): 50

# R implementation: W/W' test, sample size 3 ≤ n ≤ 5000
shapiro.test(data)
If p > 0.05, normality is not rejected, i.e. the data are consistent with a normal distribution.
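A minimal sketch of how the test behaves on a normal and on a clearly skewed sample. The data are simulated and the variable names are assumptions for this example.

```r
# Simulated samples; names are illustrative only.
set.seed(1)
normal_sample <- rnorm(100)   # drawn from a normal distribution
skewed_sample <- rexp(100)    # drawn from an exponential distribution

res_normal <- shapiro.test(normal_sample)
res_skewed <- shapiro.test(skewed_sample)

# W statistic and p-value for each sample.
print(c(W = unname(res_normal$statistic), p = res_normal$p.value))
print(c(W = unname(res_skewed$statistic), p = res_skewed$p.value))
```

A small p-value for the skewed sample is evidence against normality; a large p-value does not prove normality, it only fails to reject it.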

R help: Shapiro-Wilk Normality Test http://127.0.0.1:19715/library/stats/html/shapiro.test.html

Tips: Variable transformation

1 Logarithmic transformation (natural or common logarithm): log(data), log10(data)
2 Square-root transformation, applicable to data following a Poisson distribution: sqrt(data)
3 Arcsine transformation, applicable to rate or percentage data
4 Reciprocal transformation, applicable to data with large fluctuations at both ends: 1/data
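The four transformations above can be sketched as follows. The data are simulated, the variable names are assumptions, and the arcsine transformation is written in its common arcsine-square-root form, which the source does not spell out.

```r
# Simulated data; names are illustrative only.
set.seed(2)
skewed <- rlnorm(200, meanlog = 0, sdlog = 1)   # right-skewed, strictly positive
counts <- rpois(200, lambda = 4)                # Poisson counts
rates  <- c(0.05, 0.20, 0.50, 0.80, 0.95)       # proportions in [0, 1]

log_natural <- log(skewed)        # natural logarithm
log_common  <- log10(skewed)      # common (base-10) logarithm
sqrt_counts <- sqrt(counts)       # square-root transformation
asin_rates  <- asin(sqrt(rates))  # arcsine square-root transformation (assumed form)
reciprocal  <- 1 / skewed         # reciprocal transformation

# Compare the normality test before and after the log transformation.
print(shapiro.test(skewed)$p.value)
print(shapiro.test(log_natural)$p.value)
```

Whether a transformation helps should be checked on the transformed data, for example by rerunning shapiro.test as in the last two lines.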

1.2 Normal QQ plot
The normal QQ plot (quantile-quantile plot) can be used to judge whether the sample data are approximately normally distributed.

library(car)
# Draw the QQ plot
pdf("qqplot.pdf")
qqnorm(data)
# Draw the reference line that goes with the QQ plot
qqline(data, col = "red")
dev.off()

Note: a straight reference line is added to the plot so that you can see whether the points fall near it. The line is determined by two points, the first-quartile point and the third-quartile point. The vertical coordinate of the first-quartile point is the first quartile of the actual data (quantile(data, 0.25)), and its horizontal coordinate is the first quartile of the theoretical distribution (qF(0.25), which is qnorm(0.25) for the normal distribution); the third-quartile point is defined in the same way. These two points determine the line in the QQ plot.
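The construction of the reference line can be sketched by hand and compared with qqline(). The data are simulated and the variable names are assumptions for this example.

```r
# Simulated data; names are illustrative only.
set.seed(3)
data <- rnorm(100, mean = 5, sd = 2)

probs <- c(0.25, 0.75)
y <- quantile(data, probs, names = FALSE)  # sample quartiles (vertical axis)
x <- qnorm(probs)                          # theoretical quartiles (horizontal axis)

# Line through the two quartile points.
slope     <- diff(y) / diff(x)
intercept <- y[1] - slope * x[1]

qqnorm(data)
abline(intercept, slope, col = "red")  # line built by hand
qqline(data, lty = 2)                  # built-in line, should overlap the red one
```

For roughly normal data the slope estimates the standard deviation and the intercept estimates the mean, using quartiles rather than moments.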

There are many tests for a statistical distribution, such as the KS (Kolmogorov-Smirnov) test and the chi-square test. From a graphical point of view, the QQ plot (quantile-quantile plot) can also be used to check whether data follow a given distribution.

The principle of the QQ plot is not complicated: if a batch of data x1, x2, ..., xn follows some theoretical distribution, then plotting the sorted data x(1), x(2), ..., x(n) against the quantiles q(1/n), q(2/n), ..., q(n/n) of the theoretical distribution gives n points that should lie roughly on the diagonal, since the two sets of numbers should be roughly equal.

From another point of view, checking whether a batch of data follows a theoretical distribution means checking whether its empirical distribution agrees with the theoretical distribution. The sorted data x(1), x(2), ..., x(n) can be regarded as the 1/n, 2/n, ..., n/n quantiles of the empirical distribution; if these agree with the theoretical quantiles, the empirical distribution and the theoretical distribution are similar.
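The principle described above can be sketched by building a normal QQ plot by hand. One practical detail is an addition here rather than something the source states: the n/n quantile of the normal distribution is infinite, so R uses slightly shifted probabilities from ppoints() instead of 1/n, ..., n/n. The data and names are assumptions for this example.

```r
# Simulated data; names are illustrative only.
set.seed(4)
x <- rnorm(50)
n <- length(x)

sorted_x    <- sort(x)      # x(1), x(2), ..., x(n)
p           <- ppoints(n)   # plotting positions, strictly inside (0, 1)
theoretical <- qnorm(p)     # theoretical quantiles at those positions

plot(theoretical, sorted_x,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
abline(0, 1)  # the diagonal: where points fall if x is standard normal

# qqnorm() uses the same theoretical quantiles.
qq <- qqnorm(x, plot.it = FALSE)
print(all.equal(sort(qq$x), theoretical))
```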

2 Homogeneity-of-variance test

2.0 F test
Condition: for two populations; the data follow a normal distribution.

R help: F Test to Compare Two Variances http://127.0.0.1:19715/library/stats/html/var.test.html
var: abbreviation for variance
var.test()
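A minimal sketch of var.test() on two simulated samples with different spreads. The data and variable names are assumptions for this example.

```r
# Simulated samples; names are illustrative only.
set.seed(5)
x <- rnorm(30, mean = 0, sd = 1)
y <- rnorm(30, mean = 0, sd = 3)

res <- var.test(x, y)

# The F statistic is the ratio of the two sample variances.
print(unname(res$statistic))
print(var(x) / var(y))
print(res$p.value)
```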

2.1 Bartlett test
Condition: for multiple populations; the data follow a normal distribution.

# For a single independent variable
bartlett.test(bdnf ~ state, data = conc)
or:
bartlett.test(BDNF$ACUTE ~ BDNF$CTL)

# For multiple independent variables: use the interaction() function to collapse them into a single variable that represents each combination of levels. Otherwise the degrees of freedom of the test will be wrong, which leads to a wrong p value.
bartlett.test(bdnf ~ interaction(state, bmi), data = conc)
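To make the point about degrees of freedom concrete, here is a minimal sketch. The data frame conc and its columns bdnf, state and bmi are simulated stand-ins for the data used above, not the original data.

```r
# Hypothetical stand-in for the 'conc' data frame used in the text.
set.seed(6)
conc <- data.frame(
  state = factor(rep(c("acute", "ctl"), each = 20)),
  bmi   = factor(rep(c("low", "high"), times = 20)),
  bdnf  = rnorm(40, mean = 20, sd = 3)
)

# One factor: 2 groups.
single <- bartlett.test(bdnf ~ state, data = conc)

# Two factors collapsed into one: 2 x 2 = 4 cells.
combined <- bartlett.test(bdnf ~ interaction(state, bmi), data = conc)

print(single$parameter)    # degrees of freedom = number of groups - 1
print(combined$parameter)  # degrees of freedom = number of cells - 1
```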

2.2 Levene test
The Levene test is more robust and does not depend on the population distribution, so it is the preferred method for testing homogeneity of variance. It can be used for two populations as well as for multiple populations. In R it is provided by the car package.

# For a single independent variable:
library(car)
leveneTest(bdnf ~ state, data = conc)

# For multiple independent variables: the interaction() function is not needed.
leveneTest(bdnf ~ state * bmi, data = conc)
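The idea behind the test can be sketched in base R as a one-way ANOVA on the absolute deviations from each group's median. This is the median-centered (Brown-Forsythe) variant, which is also the default centering of car::leveneTest; it is an illustration, not the package's code. The data frame and the extra "chronic" level are assumptions for this example.

```r
# Hypothetical stand-in data; the third group has a larger spread.
set.seed(7)
conc <- data.frame(
  state = factor(rep(c("acute", "ctl", "chronic"), each = 15)),
  bdnf  = c(rnorm(15, 20, 2), rnorm(15, 20, 2), rnorm(15, 20, 6))
)

# Absolute deviation of each observation from its group median.
group_median <- ave(conc$bdnf, conc$state, FUN = median)
conc$abs_dev <- abs(conc$bdnf - group_median)

# One-way ANOVA on the absolute deviations gives the Levene F statistic.
levene_table <- anova(lm(abs_dev ~ state, data = conc))
print(levene_table)

# If the car package happens to be installed, compare with its result.
if (requireNamespace("car", quietly = TRUE)) {
  print(car::leveneTest(bdnf ~ state, data = conc))
}
```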

2.3 Fligner-Killeen test
The Fligner-Killeen test is a nonparametric method that does not depend on the population distribution at all.

# For a single independent variable:
fligner.test(bdnf ~ state, data = conc)
or:
fligner.test(BDNF$ACUTE ~ BDNF$CTL)

# For multiple independent variables: use the interaction() function to collapse them into a single variable.
fligner.test(bdnf ~ interaction(state, bmi), data = conc)
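A minimal sketch that runs the Fligner-Killeen test next to the Bartlett test on heavy-tailed data whose two groups have the same scale. The data frame is a simulated stand-in, and no particular outcome is claimed for this seed.

```r
# Hypothetical stand-in data: heavy-tailed (t with 3 df), same scale in both groups.
set.seed(8)
conc <- data.frame(
  state = factor(rep(c("acute", "ctl"), each = 30)),
  bdnf  = rt(60, df = 3)
)

fk <- fligner.test(bdnf ~ state, data = conc)   # rank-based, distribution-free
bt <- bartlett.test(bdnf ~ state, data = conc)  # assumes normality

print(c(fligner = fk$p.value, bartlett = bt$p.value))
```

Because the Bartlett test assumes normality, its p value on data like these is harder to trust than the Fligner-Killeen result.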

The null hypothesis (H0) of the above three methods for testing homogeneity of variance is "the population variances of all groups are equal".

In addition, var.test and bartlett.test check the variances of the original data, whereas leveneTest checks homogeneity of the variance of the model residuals. Homogeneity of variance is generally understood as homogeneity of the residual variance, so most statistical software uses the Levene test.

"Illustrated" homogeneity-of-variance test - Baidu Wenku https://wenku.baidu.com/view/f225b6b8e87101f69f31951a.html

Homogeneity-of-variance test in the R language | Data Analysis College - Jianshu https://www.jianshu.com/p/dc8896fcd505

What to do when the t test cannot be used

For measurement data that do not satisfy the conditions of a parametric test, first try a variable transformation so that the conditions are met; if that does not work, use a nonparametric test.
For ordinal (graded) data, a nonparametric test is commonly used.

1 Transform the variable so that it meets the t test conditions

2 Nonparametric test

Nonparametric tests make no strict assumptions about the population distribution and are therefore also called distribution-free tests; they test hypotheses about the population distribution directly. Their advantage is that they are not restricted by the population distribution and have a wide range of application.

The most commonly used nonparametric tests are rank-based (rank-transformation) tests. They infer whether the median M (a non-parameter) of a population distribution differs from a known M0, or whether the distributions of two or more populations differ.

2.1 (Small-sample) measurement data: use a rank-based nonparametric test when the t test or F test cannot be used. If the distribution is known but does not satisfy the normality and homogeneity-of-variance conditions, use a rank-based nonparametric test; if the distribution is unknown, choose a rank-based nonparametric test directly. For data with indeterminate values at one or both ends (such as <0.5 or >5.0), only a rank-based nonparametric test can be used, whether or not the data are normally distributed.

Choosing a rank-based nonparametric test reduces the efficiency of the test. So if an (approximate) t test or F test can be used, do not use a rank-based nonparametric test.

2.2 Ordinal (graded) data:
Chi-square test of R × C table data: infers differences in composition ratios.
Rank-based nonparametric test: infers differences in grade intensity.
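A minimal sketch of the chi-square test on an R × C table. The counts, group labels and grade labels are made up for this example.

```r
# Hypothetical 2 x 3 table of counts: two groups by three grades.
tab <- matrix(c(30, 20, 10,
                20, 25, 15),
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"),
                              grade = c("mild", "moderate", "severe")))

res <- chisq.test(tab)

print(res$parameter)  # degrees of freedom = (rows - 1) * (columns - 1)
print(res$expected)   # expected counts under H0 (same composition in both groups)
print(res$p.value)
```

This test ignores the ordering of the grades, which is why a rank-based test is the one used for differences in grade intensity.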

Rank-based nonparametric tests

Principle of rank-based nonparametric tests: first convert numerical data (ordered from small to large) or ordinal data (ordered from weak to strong) into ranks, then compute the test statistic. A characteristic of these tests is that the result is insensitive to differences in the shape of the population distributions but sensitive to differences in their location.

1 Wilcoxon signed-rank test for comparison of paired samples

Wilcoxon rank-sum test: used to infer, from two independent samples of measurement or ordinal data, whether the locations of the two population distributions differ.

H0: the locations of the two population distributions are the same.

Scope of the signed-rank test: comparing the median of paired-sample differences with 0, and comparing the median of a single sample with a population median.

Comparing the median of paired-sample differences with 0: the aim is to infer whether the population median of the paired differences differs from 0, that is, whether the medians of the two populations from which the paired samples come differ.

Comparing the median of a single sample with a population median: the aim is to infer whether the median M of the population from which the sample comes differs from a known population median M0. The test uses the differences between the sample values and M0, that is, it infers whether the population median of these differences differs from 0.

# Rank-sum test
wilcox.test(x, y, exact = FALSE)
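The three uses described above map onto different arguments of wilcox.test(). This is a minimal sketch on simulated data; the variable names and the value 10 used for M0 are assumptions for this example.

```r
# Simulated data; names are illustrative only.
set.seed(9)
x <- rnorm(12, mean = 10)
y <- rnorm(12, mean = 11)

# Rank-sum test: two independent samples.
rank_sum <- wilcox.test(x, y, exact = FALSE)

# Signed-rank test: paired samples, median of the differences versus 0.
before <- rnorm(12, mean = 50, sd = 5)
after  <- before + rnorm(12, mean = 1, sd = 2)
signed_rank <- wilcox.test(after, before, paired = TRUE, exact = FALSE)

# The same test written directly on the differences.
on_diff <- wilcox.test(after - before, mu = 0, exact = FALSE)

# Signed-rank test: single sample versus a known median M0 (here 10).
one_sample <- wilcox.test(x, mu = 10, exact = FALSE)

print(rank_sum$method)
print(signed_rank$method)
print(c(paired = signed_rank$p.value, on_diff = on_diff$p.value))
```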

2 Wilcoxon rank-sum test for comparison of two independent samples
3 Kruskal-Wallis H test for comparison of multiple samples in a completely randomized design
4 Nemenyi test for pairwise comparisons among multiple independent samples
5 Friedman M test for comparison of multiple samples in a randomized block design
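Items 3 and 5 in the list above are available in base R as kruskal.test() and friedman.test(). This is a minimal sketch on simulated data with made-up group, block and treatment names. The Nemenyi test is not in base R and is not shown here.

```r
# Simulated data; names are illustrative only.
set.seed(10)

# Kruskal-Wallis H test: three independent groups.
values <- c(rnorm(10, mean = 5), rnorm(10, mean = 6), rnorm(10, mean = 7))
group  <- factor(rep(c("g1", "g2", "g3"), each = 10))
kw <- kruskal.test(values ~ group)

# Friedman test: rows are blocks, columns are treatments.
blocks <- matrix(rnorm(8 * 3), nrow = 8,
                 dimnames = list(paste0("block", 1:8), c("t1", "t2", "t3")))
fr <- friedman.test(blocks)

print(c(H = unname(kw$statistic), df = unname(kw$parameter), p = kw$p.value))
print(c(M = unname(fr$statistic), df = unname(fr$parameter), p = fr$p.value))
```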

The difference between parametric and nonparametric tests

Characteristic values of a population are called parameters, and particular distributions have their own parameters; for example, the normal distribution is determined by the two parameters μ and σ. A parameter is to the population what a statistic is to the sample.

A parametric test tests a hypothesis about parameters, whereas a nonparametric test tests a hypothesis about the population distribution; this is an important feature that distinguishes the two.

The fundamental difference is that a parametric test uses population information (the population distribution and population parameters such as the variance) together with sample information to infer population parameters, whereas a nonparametric test does not need population information and infers the population distribution from sample information alone.

Parametric tests can only be used for interval and ratio data. Nonparametric tests are mainly used for count data; they can also be used for interval and ratio data, but with reduced precision.

Nonparametric tests usually do not assume a type of population distribution; they test hypotheses about certain features of the population distribution directly (such as symmetry or the size of a quantile). The most common nonparametric test statistics fall into 3 categories: count statistics, rank statistics, and signed-rank statistics.

Use parametric tests for normally distributed data and nonparametric tests for non-normally distributed data.

VI. F test / analysis of variance

The F test is also called analysis of variance (ANOVA).

Conditions for applying analysis of variance to compare multiple sample means: ① the samples are independent random samples; ② each sample comes from a normally distributed population; ③ the population variances are equal, i.e. homogeneity of variance. In short: independence, randomness, normality, and homogeneity of variance.

The general purpose of analysis of variance is the same as that of the t test, namely to compare sample means, but the t test compares the means of two samples whereas analysis of variance compares the means of multiple samples.

Experimental design: the subjects are divided into several treatment groups that receive different interventions. The intervention is called the treatment, and the treatment factor has at least two levels. Statistical analysis of such data infers, from the sample information, whether the differences among the treatment group means are statistically significant, that is, whether the treatment has an effect. The commonly used method is analysis of variance (ANOVA), also called the F test in honor of Fisher.

The basic idea of analysis of variance (completely randomized design with a single treatment factor):
Suppose the treatment factor has g (g ≥ 2) levels. The subjects are randomly divided into g groups, each receiving a different level of the intervention. The sample size of group i (i = 1, 2, ..., g) is n_i, and the j-th observation (j = 1, 2, ..., n_i) in treatment group i is denoted X_ij. The aim of analysis of variance is to infer, by analyzing the differences among the treatment group means X̄_i, whether H0: μ1 = μ2 = ... = μg holds, that is, whether the g population means differ, and thus whether the treatment factor has an effect.

Some of the most important formulas in analysis of variance:
SS_total = SS_between + SS_within
ν_total = ν_between + ν_within
MS_between = SS_between / ν_between
MS_within = SS_within / ν_within
Statistic F = MS_between / MS_within

If the F value is close to 1, there is no reason to reject H0; conversely, the larger the F value, the stronger the grounds for rejecting H0.
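The decomposition into between-group and within-group sums of squares, and the resulting F statistic, can be computed by hand and checked against R's built-in ANOVA table. This is a minimal sketch on simulated data; the group labels, sample sizes and means are assumptions for this example.

```r
# Simulated one-factor design with g = 3 treatment groups.
set.seed(11)
g     <- 3
n_i   <- c(8, 10, 12)                              # group sample sizes
group <- factor(rep(paste0("trt", 1:g), times = n_i))
x     <- rnorm(sum(n_i), mean = rep(c(10, 11, 13), times = n_i), sd = 2)

grand_mean  <- mean(x)
group_means <- tapply(x, group, mean)              # one mean per treatment group

# Sums of squares: total = between + within.
ss_total   <- sum((x - grand_mean)^2)
ss_between <- sum(n_i * (group_means - grand_mean)^2)
ss_within  <- sum((x - ave(x, group))^2)

# Degrees of freedom and mean squares.
df_between <- g - 1
df_within  <- sum(n_i) - g
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F statistic and its p value.
f_stat  <- ms_between / ms_within
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

# Built-in ANOVA table for comparison.
fit <- anova(lm(x ~ group))
print(c(F_by_hand = f_stat, F_builtin = fit$`F value`[1]))
print(c(p_by_hand = p_value, p_builtin = fit$`Pr(>F)`[1]))
```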

"Data Analysis R language Combat" Learning notes eighth chapter variance analysis and R implementation-JPLD-Blog Park https://www.cnblogs.com/jpld/p/4594003.html
