Chi-Square Test principle

Source: Internet
Author: User

Recently my advisor asked me to work on text classification, and I happened to come across the chi-square test, which I did not understand. (I really am a beginner: what other blogs treat as basics sends me searching everywhere for material.) So I found some blog articles and summarized them here, along with some of my own takeaways.

Introduction

First, what is the chi-square test? By definition, the chi-square test checks whether the actual distribution of the data matches a theoretical distribution. This is rather abstract, so here is a concrete example:

  Consider the annual precipitation in an area. Suppose the region has precipitation on 180 of the 365 days in a year, so its precipitation probability is approximately 50%. Then, for each month, does the frequency of precipitation match the expected 50% (that is, about 15 days)?

The chi-square test is used to solve such problems.
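A minimal sketch of the monthly precipitation check, in Python. The monthly rainy-day counts are hypothetical (the text gives only the yearly total of 180) and were chosen so that they add up to 180. The expected count for each month is assumed to be the yearly rate times the number of days in that month, and the statistic is the sum of squared deviations relative to the expected counts, which the Principle section explains.

```python
# Hypothetical data: rainy days observed in each month (sums to 180).
days_in_month = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
rainy_days = [10, 9, 12, 15, 18, 20, 22, 21, 17, 14, 12, 10]

# Yearly precipitation rate from the text: 180 / 365, roughly 50%.
rate = sum(rainy_days) / sum(days_in_month)

# Expected rainy days per month if every month followed the yearly rate.
expected = [days * rate for days in days_in_month]

# Chi-square statistic: squared deviation relative to the expected count.
chi2 = sum((obs - exp) ** 2 / exp for obs, exp in zip(rainy_days, expected))

print(f"yearly rate = {rate:.3f}")
print(f"chi-square over 12 months = {chi2:.2f}")
```

The larger the statistic, the more the monthly counts deviate from what the yearly rate alone would predict.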

Principle

 We will again use an example, this time from text classification, to illustrate the principle of the chi-square test (or rather, how it is used).

We have a collection of texts that have already been labeled. For simplicity there are only two categories, technology and non-technology. We notice that the term "machine learning" appears frequently in our corpus, so we want to study whether the two propositions "the text contains 'machine learning'" and "the text belongs to the technology category" are related.

Let's take some text samples and make up a four-grid table like this:

Table 1. Actual sample results (four-grid table)

Group                               | Technology category | Non-technology category | Total
Does not contain "machine learning" | 19                  | 24                      | 43
Contains "machine learning"         | 34                  | 10                      | 44
Total                               | 53                  | 34                      | 87

Judging from the sample, whether a text contains "machine learning" does seem to affect whether it belongs to the technology category: 34 of the 44 texts that contain the term (77.3%) are technology texts, compared with 19 of the 43 texts that do not (44.2%). However, this difference could be caused by sampling error. To examine the relationship further, we first assume that containing "machine learning" is unrelated to whether a text belongs to the technology category. Under that assumption, the probability that any text belongs to the technology category is (19 + 34) / 87 = 60.9%, so we can build a four-grid table of expected results (that is, results under the assumption) to compare with the actual sampling results in Table 1:

Table 2. Expected sample results (four-grid table)

Group                               | Technology category | Non-technology category | Total
Does not contain "machine learning" | 43 * 60.9% = 26.2   | 43 - 26.2 = 16.8        | 43
Contains "machine learning"         | 53 - 26.2 = 26.8    | 44 - 26.8 = 17.2        | 44
Total                               | 53                  | 34                      | 87

Note: Once one cell has been computed from the probability, the other cells follow by subtracting from the row and column totals.
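The expected table can also be computed directly from the row and column totals. The sketch below is one way to do it in Python; the function name `expected_counts` is only an illustrative choice. Each expected cell is row total times column total divided by the grand total, which is the same as multiplying the row total by the pooled probability 53 / 87.

```python
def expected_counts(observed):
    """Expected cell counts if rows and columns were unrelated."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Table 1: rows = (no "machine learning", has "machine learning"),
# columns = (technology, non-technology).
observed = [[19, 24], [34, 10]]
expected = expected_counts(observed)

p_tech = (19 + 34) / 87  # pooled probability of the technology category
print(f"P(technology) = {p_tech:.1%}")
for row in expected:
    print([round(cell, 1) for cell in row])
```

Rounded to one decimal place, the cells agree with Table 2.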

Now we have the "expected results" and "actual results" that the chi-square test requires.

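So how do we determine by calculation whether the two propositions are related? This is where the calculation formula comes in. In the general case, summing over all k cells being compared:

$$\chi^2 = \sum_{i=1}^{k} \frac{(A_i - T_i)^2}{T_i}$$

Our four-grid table has four cells, so in our case the formula becomes:

$$\chi^2 = \sum_{i=1}^{4} \frac{(A_i - T_i)^2}{T_i}$$

Here A is the actual value and T is the expected (theoretical) value.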

χ² measures how far the actual values deviate from the theoretical values (the core idea of the chi-square test). It captures two pieces of information:

    • The absolute size of the deviation between the actual value and the theoretical value (squaring magnifies the differences and removes their sign)

    • The size of that deviation relative to the theoretical value (each squared difference is divided by T)

For the scenario above, the calculated χ² value is 10.01.
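The value 10.01 can be reproduced with a few lines of Python. This is a sketch, with `chi_square` as an illustrative helper name: it sums (A - T)^2 / T over the four cells, first with the rounded expected values from Table 2 and then with unrounded expected values for comparison.

```python
def chi_square(actual, theoretical):
    """Sum of (A - T)^2 / T over all cells of two same-shaped tables."""
    return sum(
        (a - t) ** 2 / t
        for row_a, row_t in zip(actual, theoretical)
        for a, t in zip(row_a, row_t)
    )

actual = [[19, 24], [34, 10]]                # Table 1
expected_rounded = [[26.2, 16.8], [26.8, 17.2]]  # Table 2, as printed

# Unrounded expected values: row total * column total / grand total.
rows, cols, n = [43, 44], [53, 34], 87
expected_exact = [[r * c / n for c in cols] for r in rows]

chi2_rounded = chi_square(actual, expected_rounded)
chi2_exact = chi_square(actual, expected_exact)
print(f"with rounded expected values:   {chi2_rounded:.2f}")
print(f"with unrounded expected values: {chi2_exact:.2f}")
```

The rounded table gives 10.01 as in the text; the unrounded values give a slightly smaller figure, about 10.00, which does not change the conclusion.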

  

Next is how to use the result, that is, how to judge whether 10.01 is large or small. This requires the concept of "degrees of freedom".

The degrees of freedom are V = (number of rows - 1) * (number of columns - 1); for a four-grid table, V = 1.

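For V = 1, the critical values of the chi-square distribution (from a standard chi-square table) are:

Probability P  | 0.10 | 0.05 | 0.025 | 0.01 | 0.005
Critical value | 2.71 | 3.84 | 5.02  | 6.63 | 7.88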

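Using the degrees of freedom, we look up the critical value table and find that 10.01 > 7.88. This means that if our earlier assumption (that the two propositions are unrelated) were true, the probability of seeing a deviation this large would be less than 0.005, that is, 0.5%. We therefore reject the assumption: it is very likely that containing "machine learning" and belonging to the technology category are related.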
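Instead of a lookup table, the tail probability for V = 1 can be computed with the Python standard library. The sketch below relies on the fact that a chi-square variable with one degree of freedom is the square of a standard normal variable, so P(χ² >= x) = erfc(sqrt(x / 2)). It only covers V = 1; the helper names are illustrative.

```python
import math

def degrees_of_freedom(n_rows, n_cols):
    """V = (rows - 1) * (columns - 1) for a contingency table."""
    return (n_rows - 1) * (n_cols - 1)

def chi2_tail_df1(x):
    """P(chi-square >= x) for 1 degree of freedom only."""
    return math.erfc(math.sqrt(x / 2.0))

v = degrees_of_freedom(2, 2)
p_value = chi2_tail_df1(10.01)

print(f"V = {v}")
print(f"tail probability at 7.88:  {chi2_tail_df1(7.88):.4f}")  # about 0.005
print(f"tail probability at 10.01: {p_value:.4f}")
print("reject the 'unrelated' assumption at the 0.005 level:", p_value < 0.005)
```

For tables with more than one degree of freedom a statistics library would be needed; this shortcut does not generalize.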

Application

 From the case above we obtained the chi-square statistic, and the size of this value indicates how strongly the two propositions are related. In the text classification setting, we can compute this statistic between the occurrence of every word in a dictionary and whether an article belongs to the technology category, then sort the words in descending order. The words at the top are the ones most strongly associated with a given category of article.
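A small sketch of this ranking in Python. Only the "machine learning" counts come from Table 1; the counts for "algorithm" and "the" are hypothetical and exist purely to illustrate the sort. For a four-grid table the statistic can be written in the equivalent shortcut form n(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)), which is what `chi2_2x2` (an illustrative name) computes without rounding the expected values.

```python
N_TECH, N_NON_TECH = 53, 34  # category sizes from Table 1

def chi2_2x2(a, b, c, d):
    """Chi-square for the 2x2 table [[a, b], [c, d]] (no rounding)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # word appears in every text or in none
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

# word -> (technology texts containing it, non-technology texts containing it)
doc_counts = {
    "machine learning": (34, 10),  # from Table 1
    "algorithm": (30, 5),          # hypothetical
    "the": (50, 32),               # hypothetical
}

scores = {
    word: chi2_2x2(tech, non, N_TECH - tech, N_NON_TECH - non)
    for word, (tech, non) in doc_counts.items()
}

# Descending order: most strongly associated words first.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for word, score in ranked:
    print(f"{word:18s} {score:6.2f}")
```

Note that the statistic measures the strength of the association, not its direction: a word that is unusually rare in technology texts would also score highly, so the per-category counts still need to be checked.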

Postscript

I seem to have studied the chi-square test in my sophomore course on probability theory and mathematical statistics, but I learned it only vaguely at the time, and picking it up again now feels really difficult. Looking back, in three years of undergraduate study I really only "memorized" things without ever "using" them.
