[Essential Knowledge for Data Analysis/Mining] Statistics: The Chi-Square Distribution

Last Update:2015-08-09 Source: Internet

Author: User

Statistics: The Chi-Square Distribution

Author: Bai Ningsu
August 9, 2015 22:33:00

Abstract: This article summarizes my study of the chi-square distribution. It first explains what the chi-square distribution is and what it is used for. It then introduces the topic through a real-life scenario, so that the concepts, their uses, and the problem-solving method are easier to follow. The main thread follows the two principal uses of the chi-square test: testing goodness of fit and testing the independence of two variables. If this is the first time you have met these concepts, don't worry; they are introduced step by step. The article closes with a summary of the concepts, followed by extensions of the core content with code implementations or experimental verification where needed. This is an original article; please credit the source when reposting.

Related articles:

Statistics: the geometric, binomial, and Poisson distributions

Contents:

Introduction and basic knowledge
Chi-square test of goodness of fit
Chi-square test of the independence of two variables
Chapter summary
Extended content
References

I. Introduction and basic knowledge

What is the chi-square distribution?

If n independent random variables ξ_1, ξ_2, ..., ξ_n each follow the standard normal distribution (that is, they are independent and identically distributed standard normal variables), then the sum of their squares $$Q=\sum_{i=1}^{n}\xi_i^2$$ is a new random variable. Its distribution is called the chi-square distribution ($\chi^2$ distribution), and the parameter n is called the degrees of freedom. Just as a different mean or variance gives a different normal distribution, a different number of degrees of freedom gives a different $\chi^2$ distribution. This is written $Q\sim\chi^2(n)$. The chi-square distribution is thus a new distribution constructed from the normal distribution, and when the degrees of freedom n is large, the $\chi^2$ distribution is approximately normal. For every positive integer n, the chi-square distribution with n degrees of freedom is the probability distribution of such a random variable Q.
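The definition can be checked numerically. The following is a minimal sketch, not part of the original article: it draws standard normal values with `java.util.Random`, sums their squares to get one value of Q, and averages many such values. The class name `ChiSquareSimulation`, the method names, the seed, and the sample size are all illustrative assumptions.

```java
import java.util.Random;

// Illustrative sketch: simulate Q = xi_1^2 + ... + xi_n^2 for standard normal xi_i.
// Class name, seed, and sample size are assumptions made for this example.
final class ChiSquareSimulation {

    // Draws one value of Q with n degrees of freedom.
    static double sampleQ(Random rng, int n) {
        double q = 0.0;
        for (int i = 0; i < n; i++) {
            double xi = rng.nextGaussian(); // one standard normal draw
            q += xi * xi;
        }
        return q;
    }

    // Averages `samples` independent draws of Q.
    static double sampleMean(int n, int samples, long seed) {
        Random rng = new Random(seed);
        double total = 0.0;
        for (int s = 0; s < samples; s++) {
            total += sampleQ(rng, n);
        }
        return total / samples;
    }

    public static void main(String[] args) {
        int n = 3;
        double mean = sampleMean(n, 200_000, 42L);
        System.out.println("n = " + n + ", sample mean of Q = " + mean);
    }
}
```

With a large number of samples, the printed mean should come out close to n, which is a quick sanity check that the simulated values behave like a chi-square variable with n degrees of freedom.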

Why introduce the chi-square distribution?

When a situation is modeled with a specific probability distribution, its long-run results are stable and well understood. But what if what actually happens differs from what is expected? Is the deviation ordinary small fluctuation, or is the model wrong? This is where the chi-square distribution comes in: it lets us analyze the results and flag suspicious ones. "The chi-square distribution is used to check observed facts against expectations."

Where does this come up in everyday life (the mystery of the lottery machine)?

Lottery machines are a familiar sight; some shopping malls now place them at the entrance. Under normal circumstances the probability of each prize is fixed and the operator makes a profit overall. If for some period the machine keeps paying out prizes far more often than usual, something is abnormal: is this just a run of low-probability events, or has someone tampered with the machine? Questions of this kind can be answered with the chi-square test. Before looking at how the test works, we first fill in the background knowledge and then solve the problem step by step. "When ordinary events show unusual behavior, use the chi-square distribution to check whether something is wrong."

Problem description: the mystery of the lottery machine

Question one: chi-square test of goodness of fit

The following is the expected probability distribution of a lottery machine, where x is the net gain per game (each game is an independent event):

In practice, the observed frequencies of players' gains are:

At the 5% significance level, check whether there is enough evidence that the lottery machine has been tampered with.

1. Based on the probability distribution, what is the expected frequency of each value of x, and how does the observed frequency compare with it?
2. Using the table of observed and expected frequencies for the lottery machine, calculate the test statistic.
3. What is the null hypothesis to be tested? What is the alternative hypothesis?
4. What is the rejection region with 4 degrees of freedom at the 5% significance level?
5. What is the test statistic?
6. Is the test statistic inside or outside the rejection region?
7. Do you accept or reject the null hypothesis?

Question two: chi-square test of independence

The following table shows the observed frequencies for each dealer:

Perform a hypothesis test at the 1% significance level to see whether the outcome of a game is independent of which dealer is running it.

1. Your task is to calculate all the expected frequencies.
2. Using these expected frequencies, calculate the test statistic χ².
3. Determine the hypothesis to be tested and the alternative hypothesis.
4. Find the expected frequencies and the degrees of freedom.
5. Determine the rejection region used to make the decision.
6. Calculate the test statistic χ².
7. See whether the test statistic lies in the rejection region.
8. Make the decision.

II. Chi-square test of goodness of fit (question one)

Problem summary: the lottery machine normally makes money for the operator, but for some time it has kept paying out prizes. Events that should be rare are occurring frequently, so we use the chi-square goodness-of-fit test to see whether there is enough evidence that the machine has been tampered with.

Background: calculating expected frequencies

Expected frequency = (total observed frequency) × (probability of each outcome). For example, for x = -2 the expected frequency is 977 = 0.977 × 1000.
A chi-square hypothesis test examines the difference between the observed frequencies and the expected frequencies.
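The expected-frequency rule can be sketched in a few lines of Java. This is only an illustration: the probability 0.977 and the total of 1000 come from the example above, while the other two probabilities, the class name `ExpectedFrequencies`, and the method name are assumptions invented so that the distribution sums to 1.

```java
// Illustrative sketch: expected frequency = total observations x probability.
final class ExpectedFrequencies {

    // Returns total * p for every probability p in the distribution.
    static double[] fromProbabilities(double[] probabilities, double total) {
        double[] expected = new double[probabilities.length];
        for (int i = 0; i < probabilities.length; i++) {
            expected[i] = total * probabilities[i];
        }
        return expected;
    }

    public static void main(String[] args) {
        // 0.977 and the total of 1000 are taken from the text;
        // 0.013 and 0.010 are made up so that the probabilities sum to 1.
        double[] probabilities = {0.977, 0.013, 0.010};
        double[] expected = fromProbabilities(probabilities, 1000);
        for (double e : expected) {
            System.out.printf("%.1f%n", e);
        }
    }
}
```

Because the probabilities sum to 1, the expected frequencies always add up to the total number of observations, which is the restriction used later when counting degrees of freedom.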

1. Based on the probability distribution, what is the expected frequency of each value of x, and how does the observed frequency compare with it?
Answer:

Background: assessing differences with the chi-square test

Chi-square distribution: a test statistic is used to compare the expected results with the actual results, and then the probability of observing frequencies at least this extreme is obtained.
Steps to calculate the statistic (the sum of the expected frequencies equals the sum of the observed frequencies):
1. Fill in a table with the corresponding observed and expected frequencies.
2. Use the chi-square formula to calculate the test statistic:
$$\chi^2=\sum\frac{(O-E)^2}{E}$$
Note: χ² is the test statistic, O is an observed frequency, and E is the corresponding expected frequency.
That is: for each outcome of the probability distribution, take the difference between the observed frequency and the expected frequency, square it, divide by the expected frequency, and then add up all the results.
Meaning of the test statistic: the smaller the differences between O and E, the smaller the test statistic. Dividing by E expresses each squared difference in proportion to the expected frequency.
Chi-square test criterion: a very small value of χ² indicates that the difference between the observed and expected frequencies is not significant; the larger the statistic, the more significant the difference.
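As a worked example of the formula, here is a small sketch in Java. The frequencies are made up for illustration and are not the lottery machine data; the class name `ChiSquareStatistic` and the method name `of` are likewise assumptions.

```java
// Illustrative sketch of x^2 = sum((O - E)^2 / E).
final class ChiSquareStatistic {

    // observed[i] is O and expected[i] is E for class i.
    static double of(double[] observed, double[] expected) {
        if (observed.length != expected.length) {
            throw new IllegalArgumentException("observed and expected must have the same length");
        }
        double statistic = 0.0;
        for (int i = 0; i < observed.length; i++) {
            double diff = observed[i] - expected[i];
            statistic += diff * diff / expected[i]; // squared difference relative to E
        }
        return statistic;
    }

    public static void main(String[] args) {
        // Hypothetical frequencies chosen for easy arithmetic.
        double[] observed = {18, 22, 20};
        double[] expected = {20, 20, 20};
        System.out.println("x^2 = " + of(observed, expected));
    }
}
```

By hand, the three terms are 4/20, 4/20 and 0/20, so the statistic is 0.4: the observed frequencies are close to the expected ones and the statistic is small, as the criterion above describes.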

2. Using the table of observed and expected frequencies for the lottery machine, calculate the test statistic.
Answer:

Background: chi-square hypothesis tests

Uses of the chi-square distribution: checking whether the actual results differ significantly from the expected results.
1. Testing goodness of fit: testing how well a given set of data agrees with a specified distribution. For example, testing how well the observed frequencies of the lottery machine's payouts match what we expect.
2. Testing the independence of two variables: checking whether there is some association between the variables.
Degrees of freedom ν: the number of independent quantities used to calculate the test statistic.
1. Degrees of freedom are denoted by the Greek letter ν, read "nu"; ν affects the shape of the probability distribution.
2. When ν equals 1 or 2: the chi-square density is a smooth curve that starts high and then falls, so the probability that the test statistic takes a small value is far greater than the probability that it takes a large value; that is, the observed frequencies are likely to be close to the expected frequencies. Graph:

3. When ν is greater than 2: the chi-square density starts low, rises, and then falls again. Its shape is positively skewed, but when ν is very large the curve is close to a normal distribution. Graph:

4. A test statistic that follows the chi-square distribution with parameter ν can be written as:

5. Calculating ν (example: ν = 5 - 1):
ν = (number of classes) - (number of restrictions)
Significance: how large the difference between the observed and expected frequencies must be to count as significant depends, as in other hypothesis tests, on the significance level.

1. To test at significance level α, we write (common significance levels are 1% and 5%):

2. Test criterion: the chi-square test is a one-tailed test that uses the right (upper) tail as the rejection region. We therefore judge how plausible the expected distribution is by checking whether the test statistic lies within the rejection region in the right tail.

3. Using the chi-square probability table: critical values can be looked up in a table of chi-square critical values.

For example, to test at the 5% significance level with 8 degrees of freedom, the table gives 15.51, so the test statistic lies in the rejection region whenever it is greater than 15.51.
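The table lookup can be sketched as a small Java helper. This is an illustration under stated assumptions: the class and method names are invented here, and the table is limited to 1 through 10 degrees of freedom at the 5% and 1% levels, using the usual two-decimal values from a printed chi-square table (the 15.51 entry matches the example above).

```java
// Illustrative sketch: right-tail critical values of the chi-square distribution.
final class ChiSquareCriticalValues {

    // Standard table values for 1 to 10 degrees of freedom, rounded to two decimals.
    private static final double[] AT_5_PERCENT =
        {3.84, 5.99, 7.81, 9.49, 11.07, 12.59, 14.07, 15.51, 16.92, 18.31};
    private static final double[] AT_1_PERCENT =
        {6.63, 9.21, 11.34, 13.28, 15.09, 16.81, 18.48, 20.09, 21.67, 23.21};

    // percent must be 5 or 1; degreesOfFreedom must be between 1 and 10.
    static double criticalValue(int degreesOfFreedom, int percent) {
        if (degreesOfFreedom < 1 || degreesOfFreedom > 10) {
            throw new IllegalArgumentException("this sketch only covers 1 to 10 degrees of freedom");
        }
        if (percent == 5) {
            return AT_5_PERCENT[degreesOfFreedom - 1];
        }
        if (percent == 1) {
            return AT_1_PERCENT[degreesOfFreedom - 1];
        }
        throw new IllegalArgumentException("this sketch only covers the 5% and 1% levels");
    }

    // The rejection region is the right tail beyond the critical value.
    static boolean inRejectionRegion(double statistic, int degreesOfFreedom, int percent) {
        return statistic > criticalValue(degreesOfFreedom, percent);
    }

    public static void main(String[] args) {
        System.out.println(criticalValue(8, 5));
        System.out.println(inRejectionRegion(16.0, 8, 5));
    }
}
```

A hard-coded table keeps the sketch dependency-free; a statistics library would normally compute these values from the distribution itself.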

Chi-square hypothesis test (always use the right tail)
Steps:
1. Decide on the hypothesis to be tested (H0) and its alternative hypothesis (H1).
2. Find the expected frequencies E and the degrees of freedom ν.
3. Determine the rejection region (right tail) used to make the decision.
4. Calculate the test statistic.
5. Check whether the test statistic lies in the rejection region.
6. Make the decision.
The chi-square test is really just a special form of hypothesis test.
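The six steps can be strung together in code. The sketch below is one possible arrangement, not the author's implementation: the class name `GoodnessOfFitTest`, its method names, and all of the data are hypothetical. Five classes with one restriction are used so that ν = 5 - 1 = 4, and the critical value 9.49 is the standard 5% table value for 4 degrees of freedom.

```java
// Illustrative sketch of the six steps for a goodness-of-fit test.
final class GoodnessOfFitTest {

    // Step 2: expected frequencies E = total x probability.
    static double[] expected(double[] probabilities, double total) {
        double[] e = new double[probabilities.length];
        for (int i = 0; i < probabilities.length; i++) {
            e[i] = total * probabilities[i];
        }
        return e;
    }

    // Step 2: degrees of freedom = number of classes - number of restrictions.
    static int degreesOfFreedom(int classes, int restrictions) {
        return classes - restrictions;
    }

    // Step 4: test statistic x^2 = sum((O - E)^2 / E).
    static double statistic(double[] observed, double[] expected) {
        double x2 = 0.0;
        for (int i = 0; i < observed.length; i++) {
            double diff = observed[i] - expected[i];
            x2 += diff * diff / expected[i];
        }
        return x2;
    }

    // Steps 5 and 6: reject H0 when the statistic lies beyond the critical value.
    static boolean rejectNull(double statistic, double criticalValue) {
        return statistic > criticalValue;
    }

    public static void main(String[] args) {
        // Step 1: H0 = the observations follow the stated distribution; H1 = they do not.
        double[] probabilities = {0.50, 0.20, 0.15, 0.10, 0.05}; // hypothetical distribution
        double[] observed = {90, 45, 32, 20, 13};                // hypothetical counts
        double total = 0.0;
        for (double o : observed) {
            total += o;
        }
        double[] e = expected(probabilities, total);
        int v = degreesOfFreedom(observed.length, 1); // one restriction: the totals must match
        double critical = 9.49;                       // Step 3: 5% table value for v = 4
        double x2 = statistic(observed, e);
        System.out.println("v = " + v + ", x^2 = " + x2 + ", reject H0: " + rejectNull(x2, critical));
    }
}
```

For these made-up counts the statistic is about 2.66, well below 9.49, so it falls outside the rejection region and H0 is not rejected.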

3. What is the null hypothesis to be tested? What is the alternative hypothesis?
Answer:

Background: finding the rejection region

For example, to test at the 5% significance level with 8 degrees of freedom, the table gives 15.51, so the test statistic lies in the rejection region whenever it is greater than 15.51.

4. What is the rejection region with 4 degrees of freedom at the 5% significance level?
Answer:

Background: calculating the test statistic

This was covered above.

5. What is the test statistic?
Answer:

Background: deciding whether the test statistic lies in the rejection region

1. Calculate the test statistic a.
2. Use the degrees of freedom and the significance level to find the critical value b of the rejection region.
3. If a > b, the statistic lies in the rejection region; otherwise it lies outside the rejection region.

6. Is the test statistic inside or outside the rejection region?
Answer:

Background: the decision rule

If the statistic lies in the rejection region, we reject the null hypothesis H0 and accept H1.
If it does not lie in the rejection region, we accept the null hypothesis H0 and reject H1.

7. Do you accept or reject the null hypothesis?
Answer:

Note: as long as you have a set of observed frequencies and can calculate the expected frequencies, the chi-square test can be used to test goodness of fit for any probability distribution.

The answer: the lottery machine has been tampered with!

III. Chi-square test of the independence of two variables (question two)

Problem summary: the tampering with the lottery machine has been dealt with by the technical staff, but now a new problem has come up. The boss has noticed that the payouts at the blackjack tables are higher than is reasonable, and suspects that one of the dealers is an insider. Does the outcome of a game depend on which dealer is running it, that is, is some dealer secretly influencing the results? Solving this case requires the chi-square test of independence.

The table below shows the observed frequencies for each dealer:

Perform a hypothesis test at the 1% significance level to see whether the outcome of a game is independent of which dealer is running it.

Background: using probability to find the expected frequencies

1. Test of independence: used to determine whether two factors are independent of each other or whether they are associated.
2. Steps for finding the expected frequencies:
1) Calculate the total frequency for each game outcome and for each dealer, plus the grand total. A table like the one below is called a contingency table.

2) Calculate the expected number of wins for dealer A.
a. Find the probability of a win: P(win) = total wins / grand total
b. Find the probability that dealer A is running the game: P(A) = total for A / grand total
c. Assuming dealer A and the game outcome are independent, the probability of a win with dealer A running the game is P(A and win) = P(win) × P(A)
d. Expected frequency of wins with dealer A = grand total × P(A and win)
That is:

3. Generalizing: expected frequency = row total × column total / grand total
4. Find the test statistic (same as before):
$$\chi^2=\sum\frac{(O-E)^2}{E}$$
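The generalized rule, expected frequency = row total × column total / grand total, can be sketched as follows. The contingency table in the example is entirely hypothetical (rows stand for game outcomes, columns for dealers) and the class name `ContingencyExpected` is an assumption for illustration.

```java
// Illustrative sketch: expected frequencies for a contingency table.
final class ContingencyExpected {

    // Returns a table of the same shape holding row total x column total / grand total.
    static double[][] expected(double[][] observed) {
        int rows = observed.length;
        int cols = observed[0].length;
        double[] rowTotals = new double[rows];
        double[] colTotals = new double[cols];
        double grandTotal = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                rowTotals[i] += observed[i][j];
                colTotals[j] += observed[i][j];
                grandTotal += observed[i][j];
            }
        }
        double[][] e = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                e[i][j] = rowTotals[i] * colTotals[j] / grandTotal;
            }
        }
        return e;
    }

    public static void main(String[] args) {
        // Hypothetical observed frequencies: 2 outcomes (rows) x 3 dealers (columns).
        double[][] observed = {
            {40, 30, 30},
            {10, 20, 20}
        };
        for (double[] row : expected(observed)) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

In this made-up table every dealer runs 50 of the 150 games, so under independence each dealer is expected to see one third of each row total: about 33.3 for the first row and about 16.7 for the second.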

1. Your task is to calculate all the expected frequencies.
Answer:
2. Using these expected frequencies, calculate the test statistic χ².
Answer:
3. Determine the hypothesis to be tested and the alternative hypothesis.
Answer:
4. Find the expected frequencies and the degrees of freedom.
Answer:
5. Determine the rejection region used to make the decision.
Answer:
6. Calculate the test statistic χ².
Answer:
7. See whether the test statistic lies in the rejection region.
Answer:
8. Make the decision.
Answer:

Summary of how to calculate the degrees of freedom:

For a contingency table with h rows and k columns:
ν = (h-1) × (k-1). Note: in each row, once all entries but the last have been calculated, the last one is fixed as the row total minus the others, which is one restriction per row; the same holds for each column. Hence the formula above.

Note:

1. In a goodness-of-fit test, ν = number of classes - number of restrictions.
2. In a test of independence of two variables, where the contingency table has h rows and k columns, ν = (h-1) × (k-1).
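Putting the independence test together, a possible sketch looks like this. The class name `IndependenceTest`, the method names, and the observed table are hypothetical; the critical value 9.21 is the standard 1% table value for 2 degrees of freedom.

```java
// Illustrative sketch: chi-square test of independence on a contingency table.
final class IndependenceTest {

    // Degrees of freedom for a table with h rows and k columns.
    static int degreesOfFreedom(int h, int k) {
        return (h - 1) * (k - 1);
    }

    // Computes x^2 = sum((O - E)^2 / E) with E = row total x column total / grand total.
    static double statistic(double[][] observed) {
        int rows = observed.length;
        int cols = observed[0].length;
        double[] rowTotals = new double[rows];
        double[] colTotals = new double[cols];
        double grandTotal = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                rowTotals[i] += observed[i][j];
                colTotals[j] += observed[i][j];
                grandTotal += observed[i][j];
            }
        }
        double x2 = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                double e = rowTotals[i] * colTotals[j] / grandTotal;
                double diff = observed[i][j] - e;
                x2 += diff * diff / e;
            }
        }
        return x2;
    }

    public static void main(String[] args) {
        // Hypothetical observed frequencies: 2 outcomes (rows) x 3 dealers (columns).
        double[][] observed = {
            {40, 30, 30},
            {10, 20, 20}
        };
        int v = degreesOfFreedom(observed.length, observed[0].length);
        double x2 = statistic(observed);
        double critical = 9.21; // 1% table value for v = 2
        System.out.println("v = " + v + ", x^2 = " + x2 + ", reject H0: " + (x2 > critical));
    }
}
```

For this made-up table ν = (2-1) × (3-1) = 2 and the statistic works out to 6, which is below 9.21, so at the 1% level there is not enough evidence to reject independence.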

IV. Chapter summary

Why introduce the chi-square distribution?

When a situation is modeled with a specific probability distribution, its long-run results are stable and well understood. But what if what actually happens differs from what is expected? How do we tell whether the deviation is ordinary small fluctuation or a modeling error? This is where the chi-square distribution comes in: it lets us analyze the results and flag suspicious ones. "The chi-square distribution is used to check observed facts against expectations."

Chi-square test of goodness of fit

Calculating expected frequencies:

Expected frequency = (total observed frequency) × (probability of each outcome). For example, for x = -2: 977 = 0.977 × 1000.

Chi-square distribution

A test statistic is used to compare the expected results with the actual results, and then the probability of observing frequencies at least this extreme is obtained.

Steps to calculate the statistic (the sum of the expected frequencies equals the sum of the observed frequencies):

1. Fill in a table with the corresponding observed and expected frequencies.
2. Use the chi-square formula to calculate the test statistic (O is an observed frequency, E the corresponding expected frequency):

$$\chi^2=\sum\frac{(O-E)^2}{E}$$

That is: for each outcome of the probability distribution, take the difference between the observed frequency and the expected frequency, square it, divide by the expected frequency, and then add up all the results.

Meaning of the test statistic

The smaller the differences between O and E, the smaller the test statistic. Dividing by E expresses each squared difference in proportion to the expected frequency. Chi-square test criterion: a very small value of χ² indicates that the difference between the observed and expected frequencies is not significant; the larger the statistic, the more significant the difference.

Uses of the chi-square distribution

Checking whether the actual results differ significantly from the expected results.
1. Testing goodness of fit: testing how well a given set of data agrees with a specified distribution. For example, testing how well the observed frequencies of the lottery machine's payouts match what we expect.
2. Testing the independence of two variables: checking whether there is some association between the variables.

Degrees of freedom ν

The number of independent quantities used to calculate the test statistic.
1. Degrees of freedom are denoted by the Greek letter ν, read "nu"; ν affects the shape of the probability distribution.
2. When ν equals 1 or 2: the chi-square density is a smooth curve that starts high and then falls, so the probability that the test statistic takes a small value is far greater than the probability that it takes a large value; that is, the observed frequencies are likely to be close to the expected frequencies.
3. When ν is greater than 2: the chi-square density starts low, rises, and then falls again. Its shape is positively skewed, but when ν is very large the curve is close to a normal distribution.
4. Notation for a test statistic that follows the chi-square distribution with parameter ν.
5. Calculating ν (example: ν = 5 - 1):
ν = (number of classes) - (number of restrictions)

Significance

How large the difference between the observed and expected frequencies must be to count as significant depends, as in other hypothesis tests, on the significance level.

1. Tests are carried out at a significance level α (common significance levels are 1% and 5%).
2. Test criterion: the chi-square test is a one-tailed test that uses the right (upper) tail as the rejection region. We therefore judge how plausible the expected distribution is by checking whether the test statistic lies within the rejection region in the right tail.
3. Using the chi-square probability table: critical values can be looked up in a table of chi-square critical values.

Steps of a chi-square hypothesis test (always use the right tail)

1. Decide on the hypothesis to be tested (H0) and its alternative hypothesis (H1).
2. Find the expected frequencies E and the degrees of freedom ν.
3. Determine the rejection region (right tail) used to make the decision.
4. Calculate the test statistic.
5. Check whether the test statistic lies in the rejection region.
6. Make the decision.
The chi-square test is really just a special form of hypothesis test.

Decision rule

If the statistic lies in the rejection region, we reject the null hypothesis H0 and accept H1. If it does not lie in the rejection region, we accept the null hypothesis H0 and reject H1.

Chi-square test of the independence of two variables (question two)

Test of independence:

Used to determine whether two factors are independent of each other or whether they are associated.

Steps for finding the expected frequencies:

1. Calculate the total frequency for each game outcome and for each dealer, plus the grand total. Such a table is called a contingency table.

2. Calculate the expected number of wins for dealer A.
a. Find the probability of a win: P(win) = total wins / grand total
b. Find the probability that dealer A is running the game: P(A) = total for A / grand total
c. Assuming dealer A and the game outcome are independent, the probability of a win with dealer A running the game is P(A and win) = P(win) × P(A)
d. Expected frequency of wins with dealer A = grand total × P(A and win)
That is:

Generalizing:

Expected frequency = (row total × column total) / grand total

Finding the test statistic (as before):

$$\chi^2=\sum\frac{(O-E)^2}{E}$$

Summary of how to calculate the degrees of freedom:

For a contingency table with h rows and k columns:

ν = (h-1) × (k-1)

Note: in each row, once all entries but the last have been calculated, the last one is fixed as the row total minus the others, which is one restriction per row; the same holds for each column. Hence the formula above.

Note:

1. In a goodness-of-fit test, ν = number of classes - number of restrictions.
2. In a test of independence of two variables, where the contingency table has h rows and k columns, ν = (h-1) × (k-1).

V. Extended content

Java implementation of the test statistic

    /**
     * Test statistic: x^2 = sum((O - E)^2 / E),
     * where x^2 is the test statistic, O an observed frequency and E an expected frequency.
     * @param data data[0] holds the observed frequencies O, data[1] the expected frequencies E
     * @return ts = x^2, rounded to three decimal places
     */
    public static double testStatistic(double[][] data) {
        int len = data[0].length;
        double ts = 0; // test statistic
        // accumulate (O - E)^2 / E over all classes
        for (int i = 0; i < len; i++) {
            ts += Math.pow(data[0][i] - data[1][i], 2) / data[1][i];
        }
        // keep three decimal places; the original called NumFormat.decFormat(3, ts),
        // a helper whose source is not shown, so Math.round stands in for it here
        ts = Math.round(ts * 1000) / 1000.0;
        System.out.println("Test statistic: " + ts);
        return ts;
    }

Java implementation of the expected frequency for the chi-square test of independence

    /**
     * Expected frequency for the chi-square test of independence.
     * Formula: expected frequency = (row total * column total) / grand total
     * @param sum1 double, row total
     * @param sum2 double, column total
     * @param sum double, grand total
     * @return the expected frequency, rounded to two decimal places
     */
    public static double expFre(double sum1, double sum2, double sum) {
        // "enum" is a reserved word in Java, so the result is named expected
        double expected = (sum1 * sum2) / sum;
        // keep two decimal places (stands in for the NumFormat.decFormat helper, not shown)
        expected = Math.round(expected * 100) / 100.0;
        System.out.println("Independence expected frequency: " + expected);
        return expected;
    }

Java implementation of the degrees of freedom calculation

    /**
     * Degrees of freedom for the test of independence:
     * the number of independent quantities used to calculate the test statistic.
     * Formula: v = (h - 1) * (k - 1)
     * @param h int, number of rows in the contingency table
     * @param k int, number of columns in the contingency table
     * @return v, the degrees of freedom
     */
    public static int niheFreeNum(int h, int k) {
        int v = (h - 1) * (k - 1);
        System.out.println("Degrees of freedom: v=" + v);
        return v;
    }

Expectation and variance of the chi-square distribution

The mean of the distribution is the degrees of freedom n: $$E(\chi^2)=n$$
The variance of the distribution is twice the degrees of freedom (2n), written $$D(\chi^2)=2n$$
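Both results follow from the definition of the chi-square variable as a sum of n independent squared standard normal variables, $Q=\sum_{i=1}^{n}\xi_i^2$. As a short sketch, using the standard facts that a standard normal variable has $E(\xi_i^2)=1$ and $E(\xi_i^4)=3$, and that the variance of a sum of independent variables is the sum of their variances:

$$E(Q)=\sum_{i=1}^{n}E(\xi_i^2)=n\cdot 1=n$$

$$D(Q)=\sum_{i=1}^{n}D(\xi_i^2)=\sum_{i=1}^{n}\left[E(\xi_i^4)-\left(E(\xi_i^2)\right)^2\right]=n(3-1)=2n$$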

Properties

1) The distribution lies in the first quadrant: chi-square values are non-negative and the distribution is positively skewed (skewed to the right). As the parameter n increases, the distribution tends toward the normal distribution. The area under the chi-square density curve is 1.
2) From the mean and variance it can be seen that, as the degrees of freedom n increases, the χ² distribution extends toward positive infinity (because the mean n keeps growing) and the curve becomes lower and wider (because the variance 2n keeps growing).
3) Different degrees of freedom give different chi-square distributions; the smaller the degrees of freedom, the more skewed the distribution.

VI. References

1. Chi-square distribution
2. Chi-square test for fourfold (2×2) tables
3. Chi-square test of the difference between observed percentages and theoretical values
4. Chi-square test for related samples
5. Chi-square test of whether categorical variables are associated
6. Stratified chi-square test
7. Several common cases of misuse of the chi-square test
8. ----Chi-square test
9. Think tank
10. Chi-square test in SPSS

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
