Getting Started with Text Classification: The Chi-Square Test for Feature Selection


Http://www.blogjava.net/zhenandaci/archive/2008/08/31/225966.html

As mentioned in the previous post, besides the classification algorithm itself, the feature extraction step in text classification has a great impact on the final result. Feature extraction algorithms fall into two categories, feature selection and feature extraction proper; feature selection methods include mutual information, document frequency, information gain, the chi-square test, and roughly a dozen others. This post introduces the chi-square test, one of the more effective feature selection methods.

It should be mentioned first that the chi-square test is a method commonly used in mathematical statistics to test the independence of two variables. (What, you majored in history and literature and never studied mathematical statistics? Then what are you doing with text categorization? What a mess!)

The basic idea of the chi-square test is to judge whether a theory is correct by looking at the deviation between actual values and theoretical values. In practice we first assume that the two variables really are independent (in jargon, the "null hypothesis"), and then observe the deviation between the actual values (also called the observed values) and the theoretical values (the values we would expect if the two really were independent). If the deviation is small enough, we attribute it to natural sampling error, imprecise measurement, or chance, conclude that the two really are independent, and accept the null hypothesis. If the deviation is so large that it is unlikely to have arisen from chance or measurement error, we conclude that the two are in fact related; that is, we reject the null hypothesis and accept the alternative hypothesis.

So how do we measure the degree of deviation? Suppose the theoretical value is E (which is also the symbol for mathematical expectation) and the actual value is x. If we simply summed the differences between the observed values and the theoretical value,

x - E,

a single observation would be fine, but with multiple observations x1, x2, x3, ... the differences x1 - E, x2 - E, x3 - E are likely to be partly positive and partly negative and cancel each other out, making the total deviation look like 0 even though each individual deviation may be anything but small! The obvious fix is to square each difference before summing, which eliminates the positive/negative cancellation, i.e. to use

sum_i (xi - E)^2

But this raises a new problem: against a mean of 500, a difference of 5 is actually tiny (1%), while against a mean of 20 the same difference of 5 amounts to 25%, and squared differences alone do not reflect this. So we should refine the formula above so that the size of the mean does not distort our judgment of the degree of deviation.

D = sum_i (xi - E)^2 / E    (formula (1))

This formula is pretty good; in fact, it is exactly the deviation measure used in the chi-square test. Given the observed values x1, x2, ..., xi, ..., xn of several samples, substituting them into formula (1) yields the chi-square value, which is compared against a preset threshold: if it exceeds the threshold (i.e., the deviation is large), the null hypothesis is rejected; otherwise the null hypothesis is accepted.
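As a small sketch (the function name is my own invention), formula (1) can be computed directly, and it also illustrates the point above about the mean: dividing by E makes the same absolute spread count far more against a small mean than against a large one.

```python
def deviation(observed, expected):
    """Formula (1): sum of squared deviations from the expected
    value, each term normalized by the expected value."""
    return sum((x - expected) ** 2 / expected for x in observed)

# A spread of +/-5 around a mean of 500 yields a small value,
# while the same spread around a mean of 20 yields a large one.
print(deviation([495, 505], 500))  # small
print(deviation([15, 25], 20))     # much larger
```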

In the feature selection phase of text classification, what we mainly care about is whether a word t (one random variable) is independent of a class c (another random variable). If they are independent, we can conclude that t has no power to characterize class c at all; that is, we simply cannot judge whether a document belongs to class c based on whether t appears in it. But unlike an ordinary chi-square test, we do not need to set a threshold here, because it is hard to say how large the chi-square value of t and c must be before t counts as characterizing c; we only want to use the method to pick out the words most strongly related to the class.

We still need to decide what the null hypothesis for feature selection should be, because the larger the computed chi-square value, the greater the deviation from the null hypothesis, and the more we lean toward the opposite of the null hypothesis being true. Could we take "word t is related to class c" as the null hypothesis? In principle, of course, we could; that too is a right a healthy democratic society grants every citizen (laughter). But you would immediately find that you have no way of knowing the theoretical values, and you would have talked yourself into a dead end. So in practice we take "word t is unrelated to class c" as the null hypothesis. The selection process then becomes: compute, with a single formula, the chi-square value of every word against class c, sort the words from largest to smallest (the larger the chi-square value, the more related the word), and keep the top k (the value of k you may choose according to your own needs, which is likewise a right a healthy democratic society grants every citizen).

With the principle in hand, an example will show exactly how it is done.

Suppose there are N documents, M of which are about sports, and we want to examine how related the word "basketball" is to the class "sports" (anyone can see that the two are highly related, but unfortunately we are intelligent creatures and the computer is not; it cannot see this, and to make it realize it, we can only make it count). We have four observed counts to work with:

1. The number of documents that contain "basketball" and belong to "sports", called A.

2. The number of documents that contain "basketball" but do not belong to "sports", called B.

3. The number of documents that do not contain "basketball" but belong to "sports", called C.

4. The number of documents that neither contain "basketball" nor belong to "sports", called D.

Expressed as a table, this is clearer:

                                     Belongs to "sports"   Does not belong   Total
    Contains "basketball"                     A                   B           A+B
    Does not contain "basketball"             C                   D           C+D
    Total                                    A+C                 B+D           N

If some of the relationships here are not obvious at a glance: first, A+B+C+D = N (no surprise there). Second, A+C is the number of documents belonging to "sports", so it equals M; likewise B+D equals N-M.
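A minimal sketch of obtaining the four counts from a labeled corpus (the function name and the toy corpus are invented for illustration; only document presence is counted, never how many times the term occurs within a document):

```python
def contingency_counts(docs, term, category):
    """docs is a list of (words, label) pairs, where words is the set of
    distinct terms in one document.  Returns the four cell counts
    A, B, C, D of the 2x2 table above."""
    A = B = C = D = 0
    for words, label in docs:
        if term in words:
            if label == category:
                A += 1
            else:
                B += 1
        else:
            if label == category:
                C += 1
            else:
                D += 1
    return A, B, C, D

# Toy corpus: N = 5 documents, M = 3 of them about sports.
docs = [
    ({"basketball", "game"}, "sports"),
    ({"basketball", "score"}, "sports"),
    ({"match", "team"}, "sports"),
    ({"bank", "product"}, "finance"),
    ({"basketball", "brand"}, "finance"),
]
A, B, C, D = contingency_counts(docs, "basketball", "sports")
# A=2, B=1, C=1, D=1; note A+B+C+D = N and A+C = M.
```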

Okay, so what are the theoretical values? Take as an example the number of documents that contain "basketball" and belong to "sports". If the null hypothesis holds, i.e. "basketball" is unrelated to sports articles, then the word "basketball" should appear with equal probability in every article, whether or not it is a sports article. Exactly what that probability is we do not know, but it should be reflected in the observations (just as the probability of a coin landing heads, one half, can be determined by observing many tosses), so we can say it is approximately

(A+B) / N

(because A+B is the number of articles containing "basketball", and dividing by the total number of documents gives the probability that "basketball" appears in an article; of course, "appears" means appears at all, no matter how many times). Since the number of articles belonging to "sports" is A+C, among those documents there should theoretically be

E11 = (A+C) * (A+B) / N

documents containing the word "basketball" (count times probability).

But how many are there actually? Test yourself. (Reader: come on, it's A, of course; it's right there in the table...)

The deviation for this cell then follows from formula (1); it should be

D11 = (A - E11)^2 / E11

In the same way we can compute the deviations D12, D21, D22 for the remaining three cells; the clever reader can work them out (reader: come on, you're obviously just too lazy to write them...). With the deviations of all four observations in hand, summing them gives the chi-square value of "basketball" and the "sports" class.

Substituting the values of D11, D12, D21, D22 and simplifying, the general form of the chi-square value of a word t and a class c can be written as

chi^2(t, c) = N * (A*D - B*C)^2 / ((A+C) * (A+B) * (B+D) * (C+D))    (formula (2))

Next we can compute in the same way the chi-square values of other words such as "volleyball", "product", "bank" and so on against the sports class, sort them by size, and choose the largest ones, as many as we need, as features.
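Formula (2) is straightforward to implement. The following sketch (the function name and the cell counts are invented for illustration) ranks a few candidate words against the "sports" class by their chi-square values:

```python
def chi_square(A, B, C, D):
    """Formula (2): chi-square value of a word/class pair, computed
    from the four cells of its 2x2 contingency table."""
    N = A + B + C + D
    return N * (A * D - B * C) ** 2 / ((A + C) * (A + B) * (B + D) * (C + D))

# Hypothetical cell counts (A, B, C, D) for three words against
# "sports": N = 1000 documents, M = A + C = 100 of them sports.
cells = {
    "basketball": (60, 40, 40, 860),
    "volleyball": (30, 20, 70, 880),
    "bank":       (2, 198, 98, 702),
}
ranked = sorted(cells, key=lambda w: chi_square(*cells[w]), reverse=True)
print(ranked)  # keep the top-k entries as features
```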

Formula (2) can be simplified further. Note that, given a document collection (such as our training set) and a class, the quantities N, M and N-M (i.e., A+C and B+D) are the same for every word, and we only care about the relative order of the chi-square values of a batch of words against the same class, not their specific values. These constant factors can therefore safely be dropped from formula (2), and what we actually compute is

chi^2(t, c) = (A*D - B*C)^2 / ((A+B) * (C+D))    (formula (3))
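Since N, A+C and B+D are identical for every word once the corpus and the class are fixed, dropping them cannot change the ordering; formula (3) is just formula (2) scaled by a word-independent constant. A quick check of that claim (all cell counts invented, each with N = 1000 and A+C = 100):

```python
def chi_square(A, B, C, D):
    """Formula (2): the full chi-square value."""
    N = A + B + C + D
    return N * (A * D - B * C) ** 2 / ((A + C) * (A + B) * (B + D) * (C + D))

def chi_square_simplified(A, B, C, D):
    """Formula (3): only the word-dependent part of formula (2)."""
    return (A * D - B * C) ** 2 / ((A + B) * (C + D))

# Several words over the same corpus: N = 1000, M = A + C = 100.
words = [(60, 40, 40, 860), (30, 20, 70, 880),
         (2, 198, 98, 702), (90, 310, 10, 590)]
full = sorted(words, key=lambda w: chi_square(*w))
short = sorted(words, key=lambda w: chi_square_simplified(*w))
assert full == short  # both formulas induce the same ranking
```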

See, not so deep, is it?

Experiments on plain English text show that, as feature selection methods, the chi-square test and information gain perform best (comparing results obtained with the same classification algorithm but different feature selection algorithms); document frequency performs nearly as well as those two; the term strength method performs only moderately; and mutual information performs worst [17].

But the chi-square test is not perfect. Look back at how the values of A and B were obtained: they count whether the word t appears in a document, no matter how many times t appears in that document. This biases the measure toward low-frequency words (it exaggerates their role). There can even be cases where a word that appears just once in every document of a class gets a larger chi-square value than a word that appears ten times each in 99% of that class's documents, even though the latter is clearly more representative; merely because the former's document count is larger by that last 1%, feature selection may keep the former and sift out the latter. This is the chi-square test's famous "low-frequency word defect". For this reason, the chi-square test is often combined with other measures, such as term frequency, to compensate for this weakness.
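The scenario above can be reproduced numerically (the counts are invented for illustration). Word P appears once in every one of a class's documents, while word Q appears ten times per document but in only 99% of them; the chi-square test sees only document presence, so P wins:

```python
def chi_square(A, B, C, D):
    """Formula (2): chi-square value from the 2x2 cell counts."""
    N = A + B + C + D
    return N * (A * D - B * C) ** 2 / ((A + C) * (A + B) * (B + D) * (C + D))

# N = 1000 documents, 100 of them sports.
# Word P: once in each of the 100 sports documents -> A=100, B=0.
# Word Q: ten times per document, but in only 99 of them -> A=99, B=0.
chi_P = chi_square(100, 0, 0, 900)
chi_Q = chi_square(99, 0, 1, 900)
# P's within-document frequency is ten times lower, yet its
# chi-square value is higher: the low-frequency word defect.
assert chi_P > chi_Q
```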

Well, so much for the chi-square test; other feature selection algorithms will be introduced as the opportunity arises.

Appendix: a few words for readers well versed in statistics. Formula (1) is really the statistic for continuous random variables, while our "document counts" are obviously discrete (all integers), so in real statistical practice a continuity correction is applied when computing it. That correction, however, only affects the specific values, not their relative order, so in text classification it is simply omitted.

