This post uses the mini version of the Sogou Chinese text corpus, which contains nine categories (finance, IT, health, sports, tourism, education, recruitment, culture, military), with 1,990 texts per category. Before the experiment, a Python script captures the first 500 texts of each category as the training set.
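As a rough sketch of that capture step, the following assumes the corpus is laid out as one sub-folder per category (the folder name `corpus` and the layout are assumptions, not the blogger's actual paths):

```python
import os

CORPUS_DIR = 'corpus'   # assumed path: one sub-folder per category
TRAIN_SIZE = 500        # the post takes the first 500 texts of each category

def build_training_set(corpus_dir=CORPUS_DIR, n=TRAIN_SIZE):
    """Return {category: [file paths]} with the first n texts per category."""
    training = {}
    for category in sorted(os.listdir(corpus_dir)):
        cat_dir = os.path.join(corpus_dir, category)
        if not os.path.isdir(cat_dir):
            continue
        # sort so "first 500" is deterministic across runs
        files = sorted(os.listdir(cat_dir))[:n]
        training[category] = [os.path.join(cat_dir, f) for f in files]
    return training
```

With the nine Sogou categories in place, `build_training_set()` would return a dict of nine lists of 500 paths each.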
Data preprocessing includes text segmentation, stop-word removal, word-frequency statistics, feature selection, and representing documents with the vector space model. The next few posts will walk through these steps one by one.
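To make two of those steps concrete, here is a minimal sketch of stop-word removal and word-frequency statistics. The stop-word set below is a tiny illustrative stand-in; a real Chinese stop-word list would be loaded from a file:

```python
from collections import Counter

STOPWORDS = {'的', '了', '和', '是'}   # illustrative only, not a real list

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop stop words and empty/whitespace tokens."""
    return [t for t in tokens if t not in stopwords and t.strip()]

def term_frequencies(tokens):
    """Count how often each remaining term occurs."""
    return Counter(tokens)

# Example on an already-segmented token list
tokens = ['体育', '的', '新闻', '体育', '了']
filtered = remove_stopwords(tokens)
freq = term_frequencies(filtered)
```

Here `freq` maps each surviving term to its count, which later feeds feature selection and the vector space representation.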
Text segmentation is done mainly by calling, from Python, the word-segmentation functions of NLPIR/ICTCLAS, the Chinese lexical analysis system from the Chinese Academy of Sciences. Because each category of the Sogou corpus used here contains many texts, segmentation requires traversing the text folders, segmenting the texts in batches, and saving the results locally. A user dictionary can be added during this process so that words you want to keep are not split apart. The segmentation result contains all the characters in the text, including punctuation marks.
The blogger is using a 32-bit Windows system, and the following is the code for text segmentation:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
__author__ = 'peter_howe'
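The rest of the listing is cut off above. As a rough sketch of the batch flow it describes (traverse the corpus folders, segment each text, save the result locally), the following stubs out the NLPIR call with a whitespace split; a real version would invoke the NLPIR/ICTCLAS segmenter, e.g. via the `pynlpir` wrapper:

```python
import os

def segment(text):
    """Placeholder for the NLPIR/ICTCLAS call (e.g. pynlpir.segment(text)).
    Splitting on whitespace here is only a stand-in so the sketch runs."""
    return text.split()

def batch_segment(src_dir, dst_dir):
    """Walk src_dir, segment every .txt file, and mirror the results
    (tokens joined by spaces) under dst_dir."""
    for root, _, files in os.walk(src_dir):
        rel = os.path.relpath(root, src_dir)
        out_root = os.path.join(dst_dir, rel)
        os.makedirs(out_root, exist_ok=True)
        for name in files:
            if not name.endswith('.txt'):
                continue
            with open(os.path.join(root, name), encoding='utf-8') as f:
                tokens = segment(f.read())
            with open(os.path.join(out_root, name), 'w', encoding='utf-8') as f:
                f.write(' '.join(tokens))
```

Mirroring the category sub-folders under `dst_dir` keeps the segmented texts aligned with their original labels for the later preprocessing steps.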