The final version of the text classification code, corpus, and intermediate files has been released as open source: http://www.cnblogs.com/finallyliuyu/archive/2012/01/15/2322721.html. Because the data and program are relatively large, they are not uploaded to Blog Park. You can register and download them yourself.
(Note: when reposting, please credit the author and source. Author: finallyliuyu. Source: Blog Park.)
Intended audience: text classification beginners, rookies (cainiao), and amateurs
Objectives: 1. To implement in code the text classification knowledge found in books, such as classifiers and feature word selection algorithms, so that the author gains a concrete, intuitive understanding of text classification. After all, mathematical formulas are quite abstract;
2. "It is better to have no books than to believe everything in them." "What is learned on paper always feels shallow; to truly understand something you must practice it yourself." This platform can be used to verify the conclusions that books draw about classifiers and feature word selection algorithms;
3. To write it for myself and experience the magic of mathematics.
1. Obtain a corpus
Method 1: the Sogou 2008 corpus. For the processing procedure, see "Cainiao Learning C++ Programming: Sogou 2008 Corpus, Acquiring the Classification Corpus".
Method 2: the corpus provided by finallyliuyu in his Blog Park space. See "The Second Chinese News Classification Corpus, for Amateurs Keen on Natural Language Processing".
2. Text Classification System Design Framework
Preprocessing module flowchart:
Classification module flowchart:
3. Code explanation
Preprocessing module
3.1 Create a dictionary
3.2 Global DF feature word selection algorithm
3.3 Local DF (per-category DF) feature word selection algorithm
3.4 Chi-square feature word selection algorithm
3.5 Information gain and pointwise mutual information methods
3.6 VSM model creation method
Classification module
3.7 KNN classification algorithm
3.8 Precision, recall, and F-measure calculation
4. Program usage instructions:
4.1 text classification Step by Step 1
4.2 text classification Step by Step 2
5. Some insights on feature word selection algorithms for classification problems (highly recommended)
6. Download resources (right-click and download with the Thunder software):
Hosting provided by: Blog Park. Thanks again to the Blog Park team and Dudu.
Note: The corpus used in the experiment is in MS SQL Server 2000 backup format. To restore it, please consult the relevant material yourself. This is well documented on the Internet, so I will not go into detail here.
Program resources