A Popular-Science Series on Feature Word Selection Algorithms in Text Classification (Preface and Part 1)


(Please credit the source when reprinting. Author: finallyliuyu)

Preface:

I have learned that many colleagues in the community who are already working are interested in information retrieval and natural language processing, as are practitioners in many related fields. I am currently doing research on text feature selection, so I plan to write a series of accessible blog posts on this topic, both to share my insights and to exchange ideas about these algorithms with people working in the field.

The plan for this series is to introduce various feature word selection methods, following Yiming Yang's 1997 paper "A Comparative Study on Feature Selection in Text Categorization".

More specifically, a Chinese corpus (for which I would like to thank Sogou Lab for its generous contribution) is used to verify the findings in Yiming Yang's paper.

Lu You has a line of poetry: "What comes from paper is always shallow; to truly understand something, you must practice it yourself." That is the motivation for this series. These posts will not only compare the effectiveness of various feature word selection algorithms, but also provide the corpus (in LIBSVM data format; note: the corpus is provided by finallyliuyu) for researchers and learners to download. The series has three goals:

1. Beginners like me will no longer have to piece together the relative effectiveness of feature word selection methods from black-and-white textbooks and papers. In these posts you will see color plots and gain first-hand intuition about the problem; more importantly, you can download the corpus, run LIBSVM for classification, and plot the results in MATLAB to experience the behavior of the various feature word selection algorithms yourself.

2. The data is released in LIBSVM format, so LIBSVM beginners will no longer be limited to the sample datasets provided on its official website.

3. Because of the blog platform's upload limits, all corpus files will be hosted on the CSDN download channel, free of charge (no points required).
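For readers new to the LIBSVM data format mentioned above: each line of a data file is `<label> <index>:<value> ...`, with feature indices in ascending order and zero-valued features omitted. A minimal sketch (pure Python, with a hypothetical document) of rendering a bag-of-words count vector in this format:

```python
def to_libsvm_line(label, counts):
    """Render one document as a LIBSVM line: '<label> <idx>:<val> ...'.
    `counts` maps 1-based feature indices to non-zero values;
    LIBSVM requires the indices to appear in ascending order."""
    feats = " ".join(f"{i}:{v}" for i, v in sorted(counts.items()))
    return f"{label} {feats}"

# Hypothetical document: class +1, word 3 appears twice, word 17 once.
print(to_libsvm_line(1, {17: 1, 3: 2}))  # -> "1 3:2 17:1"
```

One line per document, and the whole file can be fed directly to LIBSVM's `svm-train`.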

Given my science-and-engineering background, my writing may be rough, so I hope you will bear with me. I also hope for, and welcome, any objections, criticism, and corrections to the content of this blog.

(1) Is feature word selection useful?

Some people (myself included) have doubted whether so-called feature word selection algorithms can really maintain or improve classification accuracy while reducing the feature dimension. Is it true that more feature words always mean higher classification accuracy?

See the following charts.

N denotes the size of the document set; M denotes the feature dimension.

Figures 1 and 2 show the 5-fold cross-validation accuracy curves obtained without any feature word selection algorithm, simply taking the first M words from the bag of words as the feature words.
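The experiment behind these curves relies on a standard k-fold split: partition the documents into 5 folds, train on 4, test on the held-out one, and average the accuracies. A minimal pure-Python sketch of the splitting step (the classifier itself, e.g. LIBSVM, is assumed to be external):

```python
def k_fold_indices(n_docs, k=5):
    """Partition document indices 0..n_docs-1 into k contiguous folds.
    Each fold serves once as the test set; the remaining k-1 folds
    form the training set for that round."""
    fold_size, rem = divmod(n_docs, k)
    folds, start = [], 0
    for i in range(k):
        # Spread the remainder over the first `rem` folds.
        end = start + fold_size + (1 if i < rem else 0)
        folds.append(list(range(start, end)))
        start = end
    return folds

print(k_fold_indices(10, k=5))  # -> [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```

In practice the documents would be shuffled before splitting so that each fold mixes both classes.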

(Figure 1 and Figure 2: 5-fold cross-validation accuracy curves)
Figure 3: using the IG method to select M-dimensional feature words

In Figures 1 and 2, the M feature words are selected by taking words from the first class's bag of words; if M exceeds the total number of words in the first class's bag, the remainder is taken from the second class's bag. Two observations follow:

(i) The classification accuracy never drops below 50%. This is not hard to understand: in the worst case, all M selected feature words come from the bag of words built from the first class's training documents, so these words still guarantee good predictions for the first class's test documents.

(ii) As the number of feature words increases, the overall classification accuracy trends upward, but it remains modest: at a feature dimension of 3000, the highest accuracy is only (91 ± 1)%.

As Figure 3 shows:

The feature word selection algorithm is effective: the feature words chosen by the algorithm improve classification accuracy. As shown in Figure 2, when the document set size is 200, the classification accuracy falls steadily as the feature word dimension increases.
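The IG (information gain) criterion used for Figure 3 scores a term by how much knowing its presence or absence reduces uncertainty about the class: IG(t) = H(C) − P(t)·H(C|t) − P(t̄)·H(C|t̄). A minimal sketch in pure Python, with a hypothetical toy corpus (the actual experiments use the Chinese corpus described above):

```python
from math import log2

def information_gain(docs, term):
    """Information gain of `term` for the class variable.
    `docs` is a list of (label, set_of_words) pairs.
    IG(t) = H(C) - P(t) * H(C|t) - P(~t) * H(C|~t)."""
    n = len(docs)
    labels = {lab for lab, _ in docs}

    def entropy(subset):
        # Entropy of the class distribution within `subset`.
        if not subset:
            return 0.0
        h = 0.0
        for lab in labels:
            p = sum(1 for l, _ in subset if l == lab) / len(subset)
            if p > 0:
                h -= p * log2(p)
        return h

    with_t = [d for d in docs if term in d[1]]
    without_t = [d for d in docs if term not in d[1]]
    return entropy(docs) - (len(with_t) / n) * entropy(with_t) \
                         - (len(without_t) / n) * entropy(without_t)

# Toy corpus (hypothetical): "ball" perfectly separates the two classes,
# "team" occurs once in each class and so tells us nothing.
corpus = [("sports", {"ball", "team"}), ("sports", {"ball", "win"}),
          ("politics", {"vote", "law"}), ("politics", {"vote", "team"})]
print(information_gain(corpus, "ball"))  # -> 1.0 (term fully predicts class)
print(information_gain(corpus, "team"))  # -> 0.0 (term is uninformative)
```

To select M feature words, score every word in the vocabulary this way and keep the M with the highest IG.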

 
