Text classification series (final edition): a comprehensive open-source C++ text classification codebase, with clear explanations of the experiments.


The final version of the text classification code, the corpus, and the intermediate files have been released as open source: http://www.cnblogs.com/finallyliuyu/archive/2012/01/15/2322721.html. Because the data and programs are fairly large, they are not uploaded to Blog Park (cnblogs) itself. You will need to register and download them yourself.

(Note: when reposting, please credit the author and source. Author: finallyliuyu. Source: Blog Park.)

Intended audience: text classification beginners, novices, and hobbyists

Objectives: 1. To implement in code what the textbooks say about text classification, such as classifiers and feature word selection algorithms, so as to gain a concrete, intuitive understanding of the subject. After all, mathematical formulas are quite abstract;

2. "Better to have no books at all than to believe everything in them." "What is learned on paper always feels shallow; to truly understand something, you must practice it yourself." This platform can be used to verify the conclusions that books draw about classifiers and feature word selection algorithms;

3. Written for myself, to experience the magic of mathematics.

1. Obtain a corpus

Method 1: the Sogou 2008 corpus. For the processing procedure, see the post "A novice learns C++ programming: the Sogou 2008 corpus, obtaining a classification corpus".

Method 2: the corpus provided by finallyliuyu in his Blog Park space. See the post "A second Chinese news classification corpus for natural language processing hobbyists".

2. Text Classification System Design Framework

 

[Figure: preprocessing module flowchart]

[Figure: classification module flowchart]

 

 

3. Code explanation

Preprocessing module

3.1 Creating the dictionary

3.2 Global DF (document frequency) feature word selection algorithm

3.3 Local DF (per-category DF) feature word selection algorithm

3.4 Chi-square feature word selection algorithm

3.5 Information gain and pointwise mutual information methods

3.6 VSM (vector space model) construction method
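
As a rough illustration of the chi-square feature word selection named in 3.4, here is a minimal C++ sketch. It is not taken from the author's code: the function name `chiSquare` and its parameters are assumptions made for this example, and it uses the textbook 2x2 contingency-table form of the statistic, which may differ in detail from what the linked post implements.

```cpp
#include <cassert>

// Chi-square score of one term for one category (hypothetical helper,
// textbook formula: N * (AD - BC)^2 / ((A+C)(B+D)(A+B)(C+D))).
// a: documents in the category that contain the term
// b: documents outside the category that contain the term
// c: documents in the category that do not contain the term
// d: documents outside the category that do not contain the term
double chiSquare(double a, double b, double c, double d) {
    assert(a >= 0 && b >= 0 && c >= 0 && d >= 0);
    const double n = a + b + c + d;
    const double denom = (a + c) * (b + d) * (a + b) * (c + d);
    if (denom == 0.0) {
        return 0.0;  // degenerate table: some row or column total is zero
    }
    const double diff = a * d - b * c;
    return n * diff * diff / denom;
}
```

A higher score suggests a stronger association between the term and the category. A typical use is to score every dictionary word against every category and keep the top-scoring words as features.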

Classification Module

3.7 KNN Classification Algorithm

3.8 Precision, recall, and F-value calculation
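
The evaluation measures named in 3.8 can be sketched in C++ as follows. This is a minimal example under stated assumptions, not the author's implementation: the `Prf` struct and the `evaluate` function are hypothetical names, the first measure is taken to mean per-category precision, and the F value is taken to be the balanced F1 score.

```cpp
#include <cassert>

// Per-category evaluation result (hypothetical type for this sketch).
struct Prf {
    double precision;
    double recall;
    double f1;
};

// tp: documents correctly assigned to the category
// fp: documents wrongly assigned to the category
// fn: documents of the category that were assigned elsewhere
Prf evaluate(int tp, int fp, int fn) {
    assert(tp >= 0 && fp >= 0 && fn >= 0);
    Prf r{0.0, 0.0, 0.0};
    if (tp + fp > 0) {
        r.precision = static_cast<double>(tp) / (tp + fp);
    }
    if (tp + fn > 0) {
        r.recall = static_cast<double>(tp) / (tp + fn);
    }
    if (r.precision + r.recall > 0.0) {
        // Harmonic mean of precision and recall.
        r.f1 = 2.0 * r.precision * r.recall / (r.precision + r.recall);
    }
    return r;
}
```

The zero checks keep the function defined for categories to which the classifier assigned no documents, or which have no documents in the test set.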

 

4. How to run the program:

4.1 Text classification step by step, part 1

4.2 Text classification step by step, part 2

 

 

5. Some insights on feature word selection algorithms for classification problems (highly recommended)

6. Download resources (right-click and download with the Thunder download software):

Hosting provided by Blog Park. Thanks again to the Blog Park team and Dudu.

Note: the corpus used in the experiments is in MS SQL Server 2000 backup format. If you want to restore it, please look up the relevant instructions yourself. This is well covered on the Internet, so I will not go into detail here.

Program resources
