Recently my tutor had me work on a text classification task, and along the way I ran into the chi-square test. I did not understand it (my fundamentals are honestly weak, and even the basics covered in blogs read like an indecipherable book to me), so I found some blog articles and summarized them here along with a few of my own takeaways.
Introduction
First, what is the chi-square test? By definition, the chi-square test checks whether the actual distribution of the data matches the theoretical distribution. That is rather abstract, so here is a concrete example:
Consider the number of days with precipitation in a region over a year. Suppose the region has precipitation on 180 of the 365 days in a year, so the probability of precipitation on any given day is roughly 50%. Then, for each month, does the frequency of precipitation match the expected 50% (that is, about 15 days)?
The chi-square test is used to solve such problems.
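As a rough sketch of the precipitation example, the comparison can be written as a goodness-of-fit calculation. The monthly rainy-day counts below are invented purely for illustration (the example only gives the yearly total of 180), and the names `observed`, `expected`, and `chi2_stat` are my own:

```python
# Days in each month of a non-leap year.
days_in_month = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

# HYPOTHETICAL rainy days per month, invented for illustration only.
# They are chosen to add up to the 180 rainy days in the example.
observed = [10, 9, 14, 16, 18, 20, 22, 21, 17, 13, 11, 9]

# If rain were spread evenly over the year, each month would get a share
# of the 180 rainy days proportional to its length.
p_rain = 180 / 365
expected = [d * p_rain for d in days_in_month]

# Sum of (actual - expected)^2 / expected over the 12 months.
chi2_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# 12 categories with a fixed total leave 11 degrees of freedom.
degrees_of_freedom = len(observed) - 1

print(f"chi-square = {chi2_stat:.2f}, degrees of freedom = {degrees_of_freedom}")
```

The larger the statistic, the less the monthly pattern looks like "about 50% every month".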
Principle
We again use an example, this time from text classification, to illustrate the principle of the chi-square test (or rather, how it is used).
We have a collection of labeled texts. For simplicity there are only two categories, technology and non-technology. We notice that the term "machine learning" appears frequently in our corpus, so we want to study whether the two propositions "the text contains 'machine learning'" and "the text belongs to the technology category" are related.
Let's take some text samples and build a four-grid (2x2) table like this:
Table 1 Actual sample results four-grid table
| Group | Technology category | Non-technology category | Total |
|---|---|---|---|
| Does not include "machine learning" | 19 | 24 | 43 |
| Includes "machine learning" | 34 | 10 | 44 |
| Total | 53 | 34 | 87 |
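A minimal sketch of Table 1 as data, with the totals recomputed from the four cells. The variable names are my own choice:

```python
# Rows: does not include / includes "machine learning".
# Columns: technology / non-technology.
observed = [
    [19, 24],
    [34, 10],
]

row_totals = [sum(row) for row in observed]        # per-row totals
col_totals = [sum(col) for col in zip(*observed)]  # per-column totals
n = sum(row_totals)                                # grand total

# Share of technology texts within each row.
tech_rate_without = observed[0][0] / row_totals[0]
tech_rate_with = observed[1][0] / row_totals[1]

print(row_totals, col_totals, n)
print(f"technology share without the term: {tech_rate_without:.1%}")
print(f"technology share with the term:    {tech_rate_with:.1%}")
```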
Judging from the sample, whether a text contains "machine learning" does seem to affect whether it belongs to the technology category: in the table, texts containing "machine learning" are technology texts with a noticeably higher probability than texts that do not contain it (34/44 versus 19/43). However, this result may be caused by sampling error. To examine the relationship further, we first assume that containing "machine learning" is unrelated to (independent of) belonging to the technology category. Under that assumption, the probability that any text belongs to the technology category is (19+34)/87 = 60.9%, which gives us a four-grid table of expected results (that is, results under the hypothesis) to set against the actual sampling results in Table 1:
Table 2 Expected sample results four-grid table
| Group | Technology category | Non-technology category | Total |
|---|---|---|---|
| Does not include "machine learning" | 43*60.9%=26.2 | 43-26.2=16.8 | 43 |
| Includes "machine learning" | 53-26.2=26.8 | 44-26.8=17.2 | 44 |
| Total | 53 | 34 | 87 |
Note: once one cell has been obtained from the probability, the remaining cells can be obtained by subtracting from the row and column totals.
Now we have the "expected results" and "actual results" that the chi-square test requires.
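The expected table can be reproduced with a few lines. This is only a sketch under the independence assumption described above: each expected cell is row total * column total / grand total, which is the same as multiplying the row total by the overall 60.9% (or 39.1%) share. Names are my own:

```python
# Actual counts from Table 1.
# Rows: does not include / includes "machine learning".
# Columns: technology / non-technology.
observed = [
    [19, 24],
    [34, 10],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Under independence, expected count = row total * column total / n.
expected = [
    [r * c / n for c in col_totals]
    for r in row_totals
]

for row in expected:
    print([round(cell, 1) for cell in row])
```

Rounded to one decimal place, these values agree with Table 2.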
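So how do we determine by calculation whether these two propositions are related? Here is the formula:
$$\chi^2 = \sum_{i=1}^{k} \frac{(A_i - n p_i)^2}{n p_i}$$
This is the general case, where n is the total number of samples and p_i is the theoretical probability of cell i. Since n p_i is just the expected count, the formula simplifies in our case to
$$\chi^2 = \sum \frac{(A - T)^2}{T}$$
Here A is the actual result and T is the expected result, and the sum runs over the four cells of the table.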
χ² measures the degree of difference between the actual values and the theoretical values (the core idea of the chi-square test). It captures two pieces of information:
The absolute size of the deviation between the actual value and the theoretical value (because of the square, the difference is magnified)
The size of that deviation relative to the theoretical value
For the scenario above, the calculated χ² value is 10.01.
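The value 10.01 can be checked by summing (actual - expected)² / expected over the four cells. The sketch below does this twice: once with the expected values rounded to one decimal as in Table 2, and once with unrounded expected values, which is my own addition for comparison:

```python
# Actual counts (Table 1) and rounded expected counts (Table 2),
# listed cell by cell in the same order.
actual = [19, 24, 34, 10]
expected_rounded = [26.2, 16.8, 26.8, 17.2]

chi2_rounded = sum((a - t) ** 2 / t for a, t in zip(actual, expected_rounded))

# Same calculation with unrounded expected counts:
# row total * column total / grand total.
row_totals = [43, 43, 44, 44]
col_totals = [53, 34, 53, 34]
n = 87
expected_exact = [r * c / n for r, c in zip(row_totals, col_totals)]

chi2_exact = sum((a - t) ** 2 / t for a, t in zip(actual, expected_exact))

print(f"with rounded expected values:   {chi2_rounded:.2f}")
print(f"with unrounded expected values: {chi2_exact:.2f}")
```

The rounded expected values give 10.01, while the unrounded ones give a value just under 10.00. The small gap comes only from rounding and does not change the conclusion.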
The next question is how to use this result, that is, how to judge whether 10.01 is large or small. This requires the concept of "degrees of freedom".
The degrees of freedom are V = (number of rows - 1) * (number of columns - 1). For a four-grid table, V = 1.
For V = 1, the critical values of the chi-square distribution include 3.84 at a probability of 0.05, 6.63 at 0.01, and 7.88 at 0.005.
Looking up the critical value table for this degree of freedom, we find 10.01 > 7.88. So if our earlier assumption (that the two propositions are unrelated) were true, the probability of seeing a deviation this large would be less than 0.005, that is, 0.5%. Clearly, it is very likely that the two propositions are related.
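Instead of looking the value up in a table, the tail probability can be computed directly. The sketch below relies on an identity that holds only for one degree of freedom: a chi-square variable with V = 1 is the square of a standard normal variable, so P(χ² > x) = erfc(sqrt(x / 2)). The function name `chi2_tail_df1` is my own:

```python
import math


def chi2_tail_df1(x: float) -> float:
    """Probability that a chi-square variable with 1 degree of freedom exceeds x.

    Valid ONLY for 1 degree of freedom, where chi-square is the square of a
    standard normal variable.
    """
    return math.erfc(math.sqrt(x / 2))


# Degrees of freedom for a table with 2 rows and 2 columns.
rows, cols = 2, 2
v = (rows - 1) * (cols - 1)

p_value = chi2_tail_df1(10.01)      # tail probability for our statistic
p_at_critical = chi2_tail_df1(7.88)  # should be close to 0.005

print(f"V = {v}")
print(f"P(chi2 > 10.01) = {p_value:.4f}")
print(f"P(chi2 > 7.88)  = {p_at_critical:.4f}")
```

For tables with more rows or columns this shortcut no longer applies, and a statistics library or a printed table is needed.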
Application
From the previous case we obtained the chi-square statistic, and the size of this value indicates how strongly the two propositions are related. In the text classification example, we can compute this statistic between the occurrence of every word in a dictionary and whether the article is in the technology category, then sort the words in descending order of the statistic to learn which words are most characteristic of a certain type of article.
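A small sketch of this ranking idea follows. Only the counts for "machine learning" come from the example above; the counts for "football" and "the" are invented for illustration. For a four-grid table the statistic can also be written with the standard shortcut n(ad - bc)² / ((a+b)(c+d)(a+c)(b+d)), which the hypothetical helper `chi2_2x2` uses:

```python
def chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic for a 2x2 table.

    a: contains term, technology      b: contains term, non-technology
    c: lacks term, technology         d: lacks term, non-technology
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a row or column is empty, so no association can be measured
    return n * (a * d - b * c) ** 2 / denominator


# term -> (a, b, c, d). Only "machine learning" uses counts from the article;
# the other two rows are HYPOTHETICAL.
term_counts = {
    "machine learning": (34, 10, 19, 24),
    "football": (2, 15, 51, 19),  # hypothetical
    "the": (50, 32, 3, 2),        # hypothetical
}

scores = {term: chi2_2x2(*counts) for term, counts in term_counts.items()}

# Sort terms from most to least strongly associated with the category split.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

for term, score in ranked:
    print(f"{term:18s} {score:8.3f}")
```

Note that the statistic only measures the strength of the association, not its direction. In this made-up data "football" scores highest because it points strongly toward the non-technology category, so the counts still need to be checked to see which category a word favors.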
Postscript
I apparently studied the chi-square test in the probability theory and mathematical statistics course in my sophomore year, but I only half understood it then, and picking it up again now feels really difficult. On the other hand, in 3 years of undergraduate study I really only "memorized" things and never actually "used" them.