As mentioned above, besides the classification algorithm itself, the algorithms used to extract features from the texts being classified also have a huge impact on the final results. These algorithms fall into feature selection and feature extraction; among the feature selection algorithms alone there are more than ten kinds, such as document frequency, mutual information, information gain, and the chi-square (χ²) test. This time we will first introduce one feature selection method with comparatively good results: the chi-square test.
We should remember that the chi-square test is a common method in mathematical statistics for testing the independence of two variables. (What? You are a student of literature and history and have never studied mathematical statistics? Then what are you doing working on text classification?)
The basic idea of the chi-square test is to judge whether a theory is correct from the degree of deviation between the actual values and the theoretical values. Concretely, we often first assume that the two variables are indeed independent (this assumption is called the "null hypothesis"), and then observe how far the actual values (also called the observed values) deviate from the theoretical values (the values that should hold if the two really were independent). If the deviation is small enough, we regard it as natural sampling error, caused by imprecise measurement or by pure chance; the two really are independent, and we accept the null hypothesis. If the deviation is large enough that it is unlikely to have arisen by chance or from imprecise measurement, we conclude that the two are actually related, that is, we reject the null hypothesis and accept the alternative hypothesis.
So what do we use to measure the degree of deviation? Suppose the theoretical value is E (E is also the standard symbol for mathematical expectation) and the actual value is x. If we simply took the sum of the differences between the observed values of all samples and the theoretical value,

$$\sum_i (x_i - E)$$

a single observed value would be fine, but when there are several observed values x1, x2, x3, ..., the terms x1 - E, x2 - E, x3 - E are very likely to have mixed signs and cancel each other out, making the final result look as if the deviation were 0, when in fact every observation deviates, and not by a little! The natural fix is to use squared differences, as in a variance, which solves the sign-cancellation problem; that is, to use

$$\sum_i (x_i - E)^2$$
At this point a new problem emerges. Relative to a mean of 500, a difference of 5 is actually very small (a 1% difference), while relative to a mean of 20, the same 5 amounts to a 25% difference, and the squared sum alone cannot reflect this. We should therefore improve the formula above so that the magnitude of the theoretical value does not distort our judgment of the degree of difference:
$$\chi^2 = \sum_i \frac{(x_i - E)^2}{E} \qquad (1)$$
The formula above is already quite good, and in fact it is exactly the deviation measure used by the chi-square test. Given the observed values of several samples x1, x2, ..., xi, ..., xn, substituting them into formula (1) yields the chi-square value. We then compare this value against a preset threshold: if it is greater than the threshold (that is, the deviation is too large), the null hypothesis does not hold; otherwise, the null hypothesis holds.
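To make formula (1) concrete, here is a minimal Python sketch; the function name, the observed numbers, and the threshold are all my own inventions for illustration, not something from the text:

```python
def chi_square_statistic(observed, expected):
    """Formula (1): sum over all observations of (x_i - E)^2 / E."""
    return sum((x - expected) ** 2 / expected for x in observed)

# Invented numbers, purely for illustration: three observed values
# against a theoretical value E = 20.
stat = chi_square_statistic([25, 14, 22], expected=20)
print(stat)  # 3.25

threshold = 5.99  # a preset threshold (how to choose it is not covered here)
if stat > threshold:
    print("deviation too large: reject the null hypothesis")
else:
    print("deviation small enough: accept the null hypothesis")
```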
In the feature selection stage of text classification, what we mainly care about is whether a word t (a random variable) and a class c (another random variable) are independent. If they are independent, then t has no bearing on class c, that is, we cannot use the presence of t to decide whether a document belongs to class c. But unlike the most common use of the chi-square test, we do not need to set a threshold, because it is hard to say how strongly a word t must be associated with class c before it qualifies as a feature; we just want to use this method to pick out the words that are most strongly associated.
At this point we still need to settle what the null hypothesis for feature selection should be. Because the larger the computed chi-square value, the larger the deviation from the null hypothesis, the more we lean toward the opposite of the null hypothesis being true. Could we take "word t is related to class c" as the null hypothesis? In principle, of course we could; that too is a right (laugh) granted to every citizen of a sound democratic society. But you would then discover that you have no way of knowing the theoretical values, and you would have argued yourself into a dead end. So we generally take "word t is unrelated to class c" as the null hypothesis. The selection process then becomes: compute the chi-square value between every word and class c, sort the words in descending order (the larger the chi-square value, the more relevant the word), and take the first k (the value of k can be chosen according to your own needs, which is likewise a right granted to every citizen of a sound democratic society).
Now that the principle is in place, let's use an example to illustrate the actual computation.
Suppose there are now N documents, M of which are about sports, and we want to examine the correlation between the word "basketball" and the category "Sports" (anyone can tell at a glance that the two are highly relevant, but unfortunately, while we are intelligent creatures, the computer is not; it cannot see this at all, and the only way to make it aware is to let it compute). We have four observed values available:
1. The number of documents that contain "basketball" and belong to the "Sports" category; call it A.
2. The number of documents that contain "basketball" but do not belong to the "Sports" category; call it B.
3. The number of documents that do not contain "basketball" but belong to the "Sports" category; call it C.
4. The number of documents that neither contain "basketball" nor belong to the "Sports" category; call it D.
Use the following table to make it clearer:
| Feature selection | Belongs to "Sports" | Not "Sports" | Total |
|---|---|---|---|
| Contains "basketball" | A | B | A + B |
| Does not contain "basketball" | C | D | C + D |
| Total | A + C | B + D | N |
In case some of the relationships aren't obvious, let's spell them out. First, A + B + C + D = N (no surprise there). Second, A + C is actually "the number of sports articles", so it equals M, and likewise B + D equals N - M.
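As a sketch of how the four counts might be gathered in practice (representing each document as a set of words and each label as a string is my assumption, not something the text prescribes):

```python
def contingency_counts(docs, labels, term, category):
    """Tally A, B, C, D for one term and one category.

    docs   -- one set of words per document (presence only: how many
              times the word occurs inside a document is ignored)
    labels -- the category label of each document
    """
    A = B = C = D = 0
    for words, label in zip(docs, labels):
        has_term = term in words
        in_class = label == category
        if has_term and in_class:
            A += 1
        elif has_term:
            B += 1
        elif in_class:
            C += 1
        else:
            D += 1
    return A, B, C, D

# Tiny invented corpus: N = 4 documents, M = 2 of them about sports.
docs = [{"basketball", "game"}, {"basketball", "price"},
        {"match", "team"}, {"bank", "loan"}]
labels = ["Sports", "Finance", "Sports", "Finance"]
print(contingency_counts(docs, labels, "basketball", "Sports"))  # (1, 1, 1, 1)
```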
Good. So what are the theoretical values? Take the number of documents that contain "basketball" and belong to the "Sports" category as an example. If the null hypothesis holds, that is, "basketball" has no correlation with sports articles, then the word "basketball" should appear with equal probability in all articles, sports or not. We don't know what that probability is, but it should be reflected in the observations (just as the probability of a coin landing heads is 1/2, which can be roughly determined by observing the results of many throws), so we can say the probability is approximately

$$p \approx \frac{A + B}{N}$$

(A + B is the number of articles that contain "basketball", and dividing by the total number of documents gives the probability that "basketball" appears; here we consider it enough for the word to appear in an article at all, no matter how many times). The number of sports articles is A + C, so among them there should theoretically be

$$E_{11} = (A + C)\,\frac{A + B}{N}$$

articles containing the word "basketball" (count multiplied by probability).
But how many are there actually? A little quiz for you. (Reader: come on, of course it's A, it's right there in the table......)
So the deviation for this case, computed with formula (1), is

$$D_{11} = \frac{\left(A - \frac{(A+C)(A+B)}{N}\right)^2}{\frac{(A+C)(A+B)}{N}}$$
Similarly, we can compute the deviations D12, D21, and D22 for the remaining three cases. Smart readers can work them out themselves (Reader: come on, you're obviously just too lazy to write them out......). With the deviations of all four observed values in hand, we can compute the chi-square value of "basketball" with respect to "Sports" articles: it is simply D11 + D12 + D21 + D22.
Substituting the values of D11, D12, D21, and D22 and simplifying, we obtain

$$\chi^2(\text{basketball}, \text{Sports}) = \frac{N\,(AD - BC)^2}{(A+C)(A+B)(B+D)(C+D)}$$

which, for a word t and a class c, can be written in the more general form

$$\chi^2(t, c) = \frac{N\,(AD - BC)^2}{(A+C)(A+B)(B+D)(C+D)} \qquad (2)$$
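Transcribed directly into Python (a sketch building on the counting helper above; the function name is my own):

```python
def chi_square(A, B, C, D):
    """Formula (2): chi-square value of word t with respect to class c."""
    N = A + B + C + D
    return N * (A * D - B * C) ** 2 / ((A + C) * (A + B) * (B + D) * (C + D))

# For the perfectly balanced toy counts from before, A*D - B*C = 0,
# so the chi-square value is 0 -- word and class look independent.
print(chi_square(1, 1, 1, 1))  # 0.0
```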
Next, we can compute the chi-square values of other words, such as "volleyball", "product", and "bank", with respect to Sports in the same way, sort them by size, and select the largest however-many words we need as feature terms.
Actually, formula (2) can be simplified further. Notice that once a document set (such as our training set) and a class are given, N, M, and N - M (that is, A + C and B + D) are the same for every word; and since we only care about the relative order of the chi-square values of a bunch of words with respect to one class, not about the specific values, these factors can all be dropped from formula (2). So in actual computation we use
$$\chi^2(t, c) = \frac{(AD - BC)^2}{(A+B)(C+D)} \qquad (3)$$
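Here is a sketch of formula (3) together with the top-k selection described earlier, reusing the `contingency_counts` helper from above (all names are my own inventions):

```python
def chi_square_rank(A, B, C, D):
    """Formula (3): keeps only the factors of formula (2) that vary per word,
    so it is valid only for ranking words against one fixed class and corpus."""
    if (A + B) == 0 or (C + D) == 0:
        return 0.0  # guard for degenerate counts (term in no doc / every doc)
    return (A * D - B * C) ** 2 / ((A + B) * (C + D))

def select_features(vocabulary, docs, labels, category, k):
    """Score every word with formula (3) and keep the k highest-scoring ones."""
    scored = []
    for term in vocabulary:
        A, B, C, D = contingency_counts(docs, labels, term, category)
        scored.append((chi_square_rank(A, B, C, D), term))
    scored.sort(reverse=True)  # largest chi-square value first
    return [term for _, term in scored[:k]]
```

Note that the scores returned by `chi_square_rank` are only comparable within one class on one corpus, exactly because the constant factors were dropped.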
See, it's not that advanced, right?
Experimental results on English text show that among feature selection methods, the chi-square test and information gain perform best (using the same classification algorithm and comparing the results obtained with different feature selection algorithms); document frequency performs roughly as well as the first two; term strength performs moderately; and mutual information performs worst (see [17]).
However, the chi-square test is not perfect either. Look back at how the values of A and B are obtained: they count whether word t appears in a document, no matter how many times t appears in it. This biases the test toward low-frequency words (it exaggerates their role). It can happen that a word appears only once in every document of a class, yet its chi-square value is larger than that of a word appearing ten times in each of 99% of that class's documents. The latter word is actually more representative, but merely because its document count is smaller than the former's by that "1", it may be screened out during feature selection while the former is kept. This is the famous "low-frequency word defect" of the chi-square test. For this reason, the chi-square test is often combined with other factors, such as word frequency, to play to its strengths and cover its weaknesses.
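The defect is easy to see with numbers. In the invented scenario below (100 sports documents and 100 non-sports documents), word w1 appears once in every sports document, while word w2 appears ten times each in 99 of the 100 sports documents; since formula (3) ignores within-document frequency entirely, w1 still scores higher:

```python
# w1: once in every one of the 100 sports documents -> A=100, B=0, C=0, D=100
# w2: ten times each, but only in 99 of them        -> A=99,  B=0, C=1, D=100
print(chi_square_rank(100, 0, 0, 100))  # 10000.0
print(chi_square_rank(99, 0, 1, 100))   # ~9801.98: lower, although w2 is
                                        # actually the more representative word
```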
Well, that's about it for the chi-square test. We'll introduce other feature selection algorithms when we get the chance.
Appendix: A few more words for readers who are proficient in statistics. Formula (1) is really the formula for computing the statistic of continuous random variables, while the "document counts" we are tallying here are obviously discrete values (all integers), so strictly speaking a correction step should be applied when computing the statistic (Yates' continuity correction is the usual one). However, this correction only changes the specific chi-square value without changing the relative order, so it is customarily omitted in text classification.