Chi-Square Test

Source: Internet
Author: User

The chi-square test is a hypothesis testing method based on the χ2 distribution. It is widely used, especially for the analysis of discrete variables. The χ2 distribution was first derived by F. R. Helmert in 1875, who showed that the sample variance of a normal population follows a χ2 distribution. In 1900 Karl Pearson arrived at the χ2 distribution again while studying goodness of fit, and proposed the χ2 statistic for hypothesis testing.

"The main uses of the chi-square test"

1. Testing whether the distribution of a continuous variable agrees with a theoretical distribution, for example whether it follows a normal, uniform, or Poisson distribution.

2. Testing whether the probability of each category of an unordered categorical variable equals a specified probability, for example whether each face of a die comes up with probability 1/6, or whether each side of a coin comes up with probability 0.5.

3. Testing independence or consistency between the categories of two unordered categorical variables, and measuring the strength of their association.

4. Testing independence or consistency between the categories of two unordered categorical variables, and measuring the strength of their association, after controlling for another classification factor.

5. Testing whether different methods applied to the same subjects give consistent results, for example whether two treatments given to the same group of patients have the same effect.

Of these uses, only the first concerns continuous variables; the rest concern unordered categorical variables. The chi-square test is therefore used mostly for categorical data.

=======================================

"Basic idea of the chi-square test"

The chi-square test relies on the asymptotic χ2 distribution. Its null hypothesis H0 is: there is no difference between the observed frequencies and the expected frequencies.
A χ2 statistic is constructed, its P-value is obtained, and the hypothesis is tested.

Strictly speaking, any test built on a χ2 statistic is a chi-square test, so "chi-square test" names a family of tests (the Greek letter χ is read "chi", pronounced roughly like "kai"). To tell the members of the family apart, each is given a specific name, for example the Pearson chi-square, the McNemar paired chi-square, and the likelihood ratio chi-square. Because Pearson was the first to use a chi-square statistic for hypothesis testing, "chi-square test" without qualification usually means the Pearson chi-square.

=======================================

"Calculation and meaning of the χ2 statistic"

The χ2 statistic measures how far the observed values deviate from the theoretical values. The formula is

$$\chi^2 = \sum_{i=1}^{k} \frac{(A_i - E_i)^2}{E_i} = \sum_{i=1}^{k} \frac{(A_i - n p_i)^2}{n p_i}$$

Here A_i is the observed frequency at level i, E_i is the expected frequency at level i, n is the total frequency, p_i is the expected probability of level i, and k is the number of cells. The expected frequency E_i equals the total frequency n multiplied by the expected probability p_i. When n is large, the χ2 statistic approximately follows a chi-square distribution with k − 1 degrees of freedom, reduced further by the number of parameters estimated from the data in order to compute E_i.

The formula shows that χ2 is 0 when the observed frequencies equal the expected frequencies exactly. The smaller the difference between observed and expected frequencies, the smaller the χ2 value; the larger the difference, the larger the χ2 value. A large χ2 value therefore means the observations are far from what the hypothesis predicts, and a small χ2 value means they are close to it. In this sense χ2 measures the distance between the observed and expected frequencies, and is the yardstick for judging whether the hypothesis holds. If χ2 is small, the researcher tends not to reject H0; if χ2 is large, the researcher tends to reject H0. How large χ2 must be to reject H0 in a particular study is determined by finding the corresponding P-value from the chi-square distribution.
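The behaviour described above can be sketched in a few lines of Python. This is only an illustration: the function name `chi_square_statistic` and all counts are hypothetical, and the statistic is taken to be the sum over cells of (observed − expected)² / expected.

```python
def chi_square_statistic(observed, expected):
    """Sum of (A_i - E_i)^2 / E_i over all cells."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((a - e) ** 2 / e for a, e in zip(observed, expected))


expected = [20, 20, 20]  # hypothetical expected frequencies

# Hypothetical observed frequencies that move further and further
# away from the expected ones; the statistic grows accordingly.
for observed in ([20, 20, 20], [25, 15, 20], [30, 10, 20]):
    print(observed, chi_square_statistic(observed, expected))
```

The first table matches the expectation exactly and gives 0; the other two give increasingly large values.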

======================================

"Sample size requirements for the chi-square test"
The chi-square distribution is continuous, but in the analysis of categorical data the frequencies can only be integers, so the computed statistic is discrete. The difference between the two can be ignored only when the sample is large enough; otherwise the result may be noticeably biased. Specifically, it is generally held that the P-value computed from the chi-square distribution is accurate when every cell has an expected frequency greater than 1 and at least 4/5 of the cells have an expected frequency greater than 5. If the data do not meet these requirements, the exact probability method can be used instead.
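As a rough sketch, the rule of thumb above can be written as a small check on a table of expected frequencies. The function name `expected_counts_ok` and the example tables are hypothetical; the thresholds are taken literally from the text (every cell greater than 1, at least 4/5 of the cells greater than 5).

```python
def expected_counts_ok(expected):
    """Check the rule of thumb: every expected count > 1 and at least 4/5 of them > 5."""
    cells = [e for row in expected for e in row]
    every_cell_above_one = min(cells) > 1
    cells_above_five = sum(1 for e in cells if e > 5)
    # Integer comparison avoids floating-point issues with the 4/5 threshold.
    enough_large_cells = 5 * cells_above_five >= 4 * len(cells)
    return every_cell_above_one and enough_large_cells


# Hypothetical tables of expected frequencies (rows x columns).
print(expected_counts_ok([[11.0, 14.0], [11.0, 14.0]]))  # all cells are large
print(expected_counts_ok([[0.8, 9.2], [7.2, 82.8]]))     # one cell is below 1
```

When the check fails, the text's advice is to fall back on the exact probability method.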

======================================

"Some other tests based on the chi-square distribution"

As noted above, the chi-square test is a family of tests. Pearson put forward the idea of the chi-square test, and many others have extended it, producing a variety of tests based on the chi-square distribution.

1. Yates correction

Also known as the Yates chi-square test or the Yates continuity correction, proposed by the British statistician Frank Yates. Yates argued that the chi-square distribution is continuous whereas the statistic computed from categorical data is discrete. If the expected frequency of a cell is less than 5, the assumption that the χ2 statistic follows the asymptotic chi-square distribution is not reliable, so a continuity correction is needed: 0.5 is subtracted from the absolute residual of each cell.

The formula is:

$$\chi^2_{\text{Yates}} = \sum \frac{(|A - T| - 0.5)^2}{T}$$

where A is the observed frequency and T is the expected frequency.

The Yates correction has certain conditions of use:

1. It applies to the fourfold table (2x2 contingency table)
2. The sample size n is greater than 40
3. The expected frequency of every cell is greater than 1
4. No more than 1/5 of the cells have an expected frequency below 5

Yates's reasoning is sound, but the corrected P-value can be overly conservative, which raises the risk of a type II error (H0 is false but is not rejected). For this reason the Yates correction is not needed when the sample is small. When the sample is large enough the correction can be applied, and it is advisable to compare the corrected and uncorrected results; if they disagree, the conclusion should be treated with caution or the analysis redesigned.
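The comparison recommended above can be sketched as follows. The 2x2 table is hypothetical, the function names are illustrative, and the correction is applied as described in the text: 0.5 is subtracted from the absolute residual of each cell before squaring.

```python
def expected_2x2(table):
    """Expected counts of a 2x2 table under independence."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    return [[rows[i] * cols[j] / n for j in range(2)] for i in range(2)]


def pearson_chi2(table):
    """Uncorrected Pearson chi-square for a 2x2 table."""
    exp = expected_2x2(table)
    return sum((table[i][j] - exp[i][j]) ** 2 / exp[i][j]
               for i in range(2) for j in range(2))


def yates_chi2(table):
    """Chi-square with the 0.5 continuity correction on each absolute residual."""
    exp = expected_2x2(table)
    return sum((abs(table[i][j] - exp[i][j]) - 0.5) ** 2 / exp[i][j]
               for i in range(2) for j in range(2))


table = [[15, 10], [7, 18]]  # hypothetical counts, n = 50
print("uncorrected:", pearson_chi2(table))
print("Yates:      ", yates_chi2(table))
```

The corrected statistic is always the smaller of the two here, which is why the corrected P-value is the more conservative one.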

2. Likelihood ratio test

Abbreviated LRT. Like the Pearson chi-square, the LRT takes independence of the row and column variables as its null hypothesis and constructs a χ2 statistic that asymptotically follows a chi-square distribution. The difference lies in the formula used to compute the statistic.

The basic idea of the likelihood ratio test is:
If a parameter constraint is valid, imposing it should not cause a significant drop in the maximum of the likelihood function. In essence, the test compares the maximum of the likelihood function under the constraint with its maximum without the constraint. The likelihood ratio is defined as the ratio of the constrained maximum likelihood to the unconstrained maximum likelihood.

A statistic that asymptotically follows a chi-square distribution can be constructed from the likelihood ratio, so the likelihood ratio test is also based on the chi-square distribution.
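The text does not give the LRT formula. A minimal sketch, assuming the commonly used form for contingency tables, G² = 2 Σ A ln(A / E), where A is an observed count and E the expected count under independence, might look like this. The table and function names are hypothetical.

```python
import math


def expected_counts(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def likelihood_ratio_chi2(table):
    """G^2 = 2 * sum(A * ln(A / E)); empty cells contribute 0 (assumed form)."""
    exp = expected_counts(table)
    g = 0.0
    for obs_row, exp_row in zip(table, exp):
        for a, e in zip(obs_row, exp_row):
            if a > 0:
                g += a * math.log(a / e)
    return 2.0 * g


table = [[15, 10], [7, 18]]  # hypothetical counts
print("likelihood ratio chi-square:", likelihood_ratio_chi2(table))
```

For this table the value is close to, but not identical with, the Pearson statistic computed from the same counts, which is the usual pattern when the sample is reasonably large.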

3. Fisher's exact test

Also known as the Fisher exact probability method, proposed by R. A. Fisher. The Pearson, Yates-corrected, and likelihood ratio statistics only follow the chi-square distribution asymptotically (some statistical software, such as SPSS, labels these P-values "asymptotic" in its output). When the sample is too small or a cell has an expected frequency less than 1, the asymptotic assumption may not hold. Fisher's exact test is based on the hypergeometric distribution rather than on the asymptotic chi-square distribution, so its result is more accurate than those of the other three.

Hypergeometric distribution:

The hypergeometric distribution is a discrete probability distribution. It describes the number of objects of a specified type obtained when n objects are drawn from a finite population without replacement.
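A short sketch of the hypergeometric probability, using only Python's standard library. The function name, parameter names, and the urn sizes are hypothetical choices made for illustration.

```python
from math import comb


def hypergeom_pmf(k, population, successes, draws):
    """P(exactly k objects of the specified type among `draws` objects
    drawn without replacement from `population` objects, `successes`
    of which are of the specified type)."""
    return (comb(successes, k) * comb(population - successes, draws - k)
            / comb(population, draws))


# Hypothetical urn: 20 objects, 7 of the specified type, 5 drawn.
probabilities = [hypergeom_pmf(k, 20, 7, 5) for k in range(0, 6)]
for k, p in enumerate(probabilities):
    print(k, p)
print("total:", sum(probabilities))
```

The probabilities over all possible values of k add up to 1, which is the property Fisher's exact test relies on when it enumerates every table with the same marginal totals.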

Calculation method of Fisher's exact test:

There are two algorithms for Fisher's exact test:

1. One sums the probabilities of the possible tables according to how their probabilities compare with the probability of the observed table; this is called the SF algorithm

2. The other sums the probabilities of the possible tables according to how far their frequencies deviate from the theoretical frequencies, compared with the observed table; this is called the TF algorithm

In most cases the two give the same result, but they can differ when the hypergeometric distribution is strongly asymmetric. Most statistical software, such as SPSS and SAS, uses the SF algorithm, so only the SF algorithm is described below.

With the marginal totals of the fourfold table held fixed, compute the probability Pi of every possible combination of the four cell frequencies, then obtain the one-sided or two-sided cumulative probability P according to the test hypothesis, and compare it with the test level α to reach a conclusion.

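The formula is:

$$P_i = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}$$

where a, b, c, d are the four cell frequencies of the table and n = a + b + c + d.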

To illustrate:

(1) Calculation of the probability Pi of each combination

With the marginal totals of the fourfold table held fixed, the number of possible combinations of the four cell frequencies a, b, c, d equals the smallest marginal total plus 1. In this example the smallest marginal total is 9, so there are 9 + 1 = 10 possible tables.

The probabilities Pi of the combinations follow the hypergeometric distribution and sum to 1.

(2) Calculation of the cumulative probability (one-sided and two-sided tests differ)

Let D* = a*·d* − b*·c* be the cross-product difference of the observed fourfold table (the asterisk marks the observed table) and let P* be its probability. For each of the other possible tables, the cross-product difference is denoted Di and the probability Pi.

(1) One-sided test
If D* > 0 for the observed table, compute the cumulative probability of the tables satisfying Di >= D* and Pi <= P*. If D* < 0, compute the cumulative probability of the tables satisfying Di <= D* and Pi <= P*.

(2) Two-sided test
Compute the cumulative probability of the tables satisfying |Di| >= |D*| and Pi <= P*. When a+b = c+d or a+c = b+d, the probabilities of the possible tables are symmetrically distributed; in that case the one-sided cumulative probability is computed and then multiplied by 2.

In this example D* = −66 and P* = 0.08762728.

Compute the cumulative probability of the tables satisfying both |Di| >= 66 and Pi <= P*. In this example P1, P2, P3, P4, P5 and P10 meet the conditions, so the cumulative probability is P = P1 + P2 + P3 + P4 + P5 + P10 ≈ 0.121 > 0.05, and H0 is not rejected at the α = 0.05 test level.
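The two-sided procedure above can be sketched by enumerating every table that shares the observed marginal totals. The observed table (a, b, c, d) = (4, 18, 5, 6) used here is an assumption: the original table is not reproduced in the text, and these counts were chosen only because they are consistent with the figures quoted above (smallest marginal total 9, D* = −66, P* = 0.08762728). The function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact test for the table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same marginal totals
    whose |cross-product difference| is at least as large as the observed
    one and whose probability does not exceed the observed probability.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)
    lowest, highest = max(0, col1 - row2), min(row1, col1)

    tables = []
    for x in range(lowest, highest + 1):   # x plays the role of cell a
        xb, xc = row1 - x, col1 - x
        xd = row2 - xc
        prob = Fraction(comb(row1, x) * comb(row2, xc), denom)
        tables.append((x * xd - xb * xc, prob))

    d_star = a * d - b * c
    p_star = Fraction(comb(row1, a) * comb(row2, c), denom)
    p_two_sided = sum(p for diff, p in tables
                      if abs(diff) >= abs(d_star) and p <= p_star)
    return d_star, p_star, p_two_sided, len(tables)


# Assumed table, chosen to match the figures quoted in the text.
d_star, p_star, p_two_sided, n_tables = fisher_exact_2x2(4, 18, 5, 6)
print("possible tables:", n_tables)
print("D* =", d_star, " P* =", float(p_star))
print("two-sided P =", float(p_two_sided))
```

Exact fractions are used so that the comparison Pi <= P* is not affected by floating-point rounding. With the assumed table the sketch gives 10 possible tables and a two-sided P of about 0.121, in line with the worked example.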

Because the hypergeometric distribution enumerates every possibility, its result is more accurate. The algorithm is also general: it applies to R x C tables and to paired tables. Its drawback is the amount of computation, especially for larger R x C tables, so when the sample is large enough and the cell frequencies meet the requirements, the other three tests can be used instead. With the development of computers this drawback has become less important, and Fisher's exact test is now also used for medium-sized samples and even for multi-dimensional tables.

======================================

"How to use the chi-square test"

1. Goodness-of-fit test

Comparing the observed counts obtained in an experiment with the counts expected under the null hypothesis is called the chi-square goodness-of-fit test. It tests how close the two are, that is, it uses sample data to test whether the population follows a specific distribution. For continuous variables it can serve as a distribution test (alongside other parametric or non-parametric tests), such as a normality test. For discrete (categorical) variables it can test whether the probability of each category of an unordered categorical variable equals a specified probability. Note that the variable must be an unordered categorical variable, not an ordered categorical variable or a binary variable: each of those has better-suited tests, and applying the chi-square test to them gives larger errors.

The goodness-of-fit test is used for the categories of a single variable. For two or more variables, one of the other chi-square methods described below is needed.

Using the chi-square test to check the distribution of a continuous variable was covered in the earlier article on normality testing and is not repeated here; this section deals with the goodness-of-fit test for discrete variables.

"Example" Suppose a die is thrown 120 times. In the worksheet, column A holds the number of points, column B the number of times each number of points came up, and column C the expected count, 120 x 1/6 = 20.

The six outcomes 1 to 6 can be regarded as the categories of the die. In theory each comes up with the same probability, 1/6, and this is the hypothesis to be tested.

Null hypothesis H0: the observed distribution equals the expected distribution (observation agrees with theory)

Alternative hypothesis H1: the observed distribution does not equal the expected distribution

Calculate the chi-square test statistic in column D:

D2 = (B2-C2)^2/C2

D8 = SUM(D2:D7)

Determine the degrees of freedom, 6 − 1 = 5, and choose the significance level α = 0.05.

Use Excel's CHIINV function to find the critical value: type "=CHIINV(0.05,5)" in cell D9 and press Enter. The critical value is 11.07.

Compare the critical value with the statistic: 11.07 > 2.3, that is, the statistic is below the critical value, so the difference is not significant and H0 is not rejected.
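The same steps can be sketched in Python with the standard library only. The observed counts are hypothetical: the worksheet data are not shown in the text, and these values were chosen so that the statistic comes out at the 2.3 quoted above. The helpers `chi2_sf` (upper-tail probability for integer degrees of freedom) and `chi2_critical` (a bisection stand-in for CHIINV) are illustrative names, not library functions.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) of a chi-square variable, integer df >= 1."""
    half = x / 2.0
    if df % 2:
        q, a = math.erfc(math.sqrt(half)), 0.5   # df = 1
    else:
        q, a = math.exp(-half), 1.0              # df = 2
    while a < df / 2.0:                          # step df up by 2 each time
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q


def chi2_critical(alpha, df):
    """Value whose upper-tail probability is alpha, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_sf(hi, df) > alpha:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


observed = [24, 16, 23, 18, 19, 20]        # hypothetical counts, total 120
expected = [sum(observed) / 6] * 6         # 120 x 1/6 = 20 per face
statistic = sum((b - c) ** 2 / c for b, c in zip(observed, expected))
df = len(observed) - 1
critical = chi2_critical(0.05, df)

print("statistic:", statistic)
print("critical value:", critical)
print("p-value:", chi2_sf(statistic, df))
print("reject H0:", statistic > critical)
```

Comparing the statistic with the critical value and comparing the P-value with α are two views of the same decision.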

2. Test of independence between two unordered categorical variables

The goodness-of-fit test analyzes how close a sample is to a population, and the idea can be extended to comparing samples with each other. Such comparisons mainly concern independence and consistency. The observed values for the categories of two or more variables can usually be arranged in an R x C table, called a contingency table; the most common is the 2x2 table, also called the fourfold table or cross table. As mentioned earlier, the chi-square test places requirements on the sample size. It also places requirements on the type of categorical variable: for binary variables and ordered categorical variables the chi-square test is not the best choice. It suits unordered categorical variables best, so the variable type should be checked before the test is used rather than applying it blindly.

"Example" An organization wants to know whether people think gender is related to income. It randomly sampled 500 people and asked for their view. Two variables are involved, gender and view: gender has two categories (male, female) and view has three (related, unrelated, don't know). The sample data are arranged in a contingency table.

Null hypothesis H0: gender is not related to income.

Alternative hypothesis H1: gender is related to income.

Determine the degrees of freedom, (3-1) x (2-1) = 2, and choose the significance level α = 0.05.

Calculate the expected counts of men and women holding each view. Each expected count is the column total multiplied by the row total and divided by the grand total, as shown in Figure 4. Type "=B5*E3/E5" in cell B9, and similarly for the other cells (the leading equals sign is typed into the cell):
B10: "=B5*E4/E5",
C9: "=C5*E3/E5",
C10: "=C5*E4/E5",
D9: "=D5*E3/E5",
D10: "=D5*E4/E5".

Calculate the statistic with the chi-square formula: type "=(B3-B9)^2/B9" in cell B15, fill the remaining cells in the same way, and sum the results.

Finally, the χ2 value calculated from the chi-square formula is 21.4675.

Use Excel's CHIINV function to find the critical value of the chi-square distribution with 2 degrees of freedom at the 0.05 significance level: type "=CHIINV(0.05,2)" in a cell and press Enter. The critical value is 5.9915.

Compare the statistic with the critical value: the statistic 21.4675 is greater than the critical value 5.9915, so the null hypothesis is rejected, that is, gender and income are related.
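The worksheet steps can be sketched in Python as follows. The counts are hypothetical (the sampled data are not reproduced in the text), so the statistic differs from the 21.4675 quoted above; only the procedure is the same. The sketch uses the fact that, for 2 degrees of freedom specifically, the upper-tail probability of the chi-square distribution is exp(−x/2), so no statistics library is needed.

```python
import math


def expected_counts(table):
    """Expected counts: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_chi2(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    exp = expected_counts(table)
    return sum((a - e) ** 2 / e
               for obs_row, exp_row in zip(table, exp)
               for a, e in zip(obs_row, exp_row))


# Hypothetical survey of 500 people.
# Rows: male, female. Columns: related, unrelated, don't know.
table = [[150, 70, 30],
         [110, 100, 40]]

statistic = pearson_chi2(table)
df = (len(table) - 1) * (len(table[0]) - 1)

# Valid for df = 2 only: P(X >= x) = exp(-x / 2).
p_value = math.exp(-statistic / 2.0)
critical = -2.0 * math.log(0.05)   # same closed form, solved for x

print("df:", df)
print("statistic:", statistic)
print("critical value:", critical)
print("p-value:", p_value)
print("reject H0:", statistic > critical)
```

The critical value obtained this way agrees with the 5.9915 returned by "=CHIINV(0.05,2)".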

3. Test of consistency between two unordered categorical variables

The consistency test for two unordered categorical variables is in effect a generalization of the goodness-of-fit test. The goodness-of-fit test compares observed frequencies with expected frequencies and is a one-sample test; the consistency test compares the observed frequencies of one sample with the observed frequencies of another sample and is a two-sample test.

"Example" A consulting firm wants to know whether residents of Nanjing and Beijing are equally satisfied with the minimum living security scheme. It sampled 600 residents from Nanjing and 600 from Beijing; each resident chose exactly one satisfaction level (very satisfied, satisfied, dissatisfied, very dissatisfied). The results were entered into an Excel worksheet.

The steps for solving this problem in Excel are as follows.

(1) Null hypothesis H0: residents of Nanjing and Beijing are equally satisfied with the minimum living security scheme.

(2) Determine the degrees of freedom, (4-1) x (2-1) = 3, and choose the significance level α = 0.05.

(3) Find the critical value of the chi-square test: type "=CHIINV(0.05,3)" in a cell and press Enter. The critical value is 7.81.

(4) Calculate the expected counts of each satisfaction level in Nanjing and Beijing: type "=$B$7*D3/$D$7" and "=$C$7*D3/$D$7" in cells B11 and C11 respectively, select B11:C11, and drag the fill handle at the lower right corner of C11 down to C14.

(5) Calculate the chi-square statistic: type "=(B3-B11)^2/B11" in cell B19, fill the remaining cells in the same way, and sum the results.

Finally, the χ2 value calculated from the chi-square formula is 1.3875.

(6) Compare the statistic with the critical value: the statistic 1.3875 is less than the critical value 7.81, so the null hypothesis is not rejected, that is, residents of Nanjing and Beijing are equally satisfied with the minimum living security scheme.
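A sketch of the same two-sample comparison in Python follows. The counts are hypothetical (the survey results are not reproduced in the text), so the statistic differs from the 1.3875 quoted above. `chi2_sf` is an illustrative standard-library helper for the upper-tail probability with integer degrees of freedom, not a library function.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) of a chi-square variable, integer df >= 1."""
    half = x / 2.0
    if df % 2:
        q, a = math.erfc(math.sqrt(half)), 0.5
    else:
        q, a = math.exp(-half), 1.0
    while a < df / 2.0:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q


def pearson_chi2(table):
    """Pearson chi-square with expected = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum((a - r * c / n) ** 2 / (r * c / n)
               for row, r in zip(table, row_totals)
               for a, c in zip(row, col_totals))


# Hypothetical answers from 600 residents per city.
# Rows: very satisfied, satisfied, dissatisfied, very dissatisfied.
# Columns: Nanjing, Beijing.
table = [[100, 110],
         [300, 290],
         [150, 160],
         [50, 40]]

statistic = pearson_chi2(table)
df = (len(table) - 1) * (len(table[0]) - 1)
p_value = chi2_sf(statistic, df)

print("df:", df)
print("statistic:", statistic)
print("p-value:", p_value)
print("reject H0 at 0.05:", p_value < 0.05)
```

The arithmetic is identical to the independence test; what changes is the sampling design, with a fixed number of respondents drawn from each city.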
