Calling NLPIR/ICTCLAS from Python for Chinese word segmentation

Source: Internet
Author: User

This post uses the mini version of the Sogou Chinese text corpus, which covers nine categories (finance, IT, health, sports, tourism, education, recruitment, culture, military), each containing 1,990 texts. Before the experiment, a .py script extracts the first 500 texts as the training set.
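The extraction step described above can be sketched as follows. This is only an illustration, not the author's script: the function name `build_training_set`, the one-folder-per-category layout, and the reading of "first 500" as "first 500 per category by sorted file name" are all assumptions.

```python
import shutil
import tempfile
from pathlib import Path


def build_training_set(corpus_dir, train_dir, per_category=500):
    """Copy the first `per_category` files of each category folder into train_dir.

    Assumes corpus_dir contains one sub-folder per category (hypothetical layout).
    """
    corpus_dir, train_dir = Path(corpus_dir), Path(train_dir)
    copied = {}
    for category in sorted(p for p in corpus_dir.iterdir() if p.is_dir()):
        # Sort by file name so that "first" is deterministic across runs.
        # Note: this is a lexicographic sort, so "10.txt" comes before "2.txt".
        files = sorted(f for f in category.iterdir() if f.is_file())[:per_category]
        target = train_dir / category.name
        target.mkdir(parents=True, exist_ok=True)
        for f in files:
            shutil.copy2(f, target / f.name)
        copied[category.name] = len(files)
    return copied


# Small demonstration on a throwaway directory: 3 files per category, keep 2.
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "corpus"
    for name in ("IT", "sports"):
        (corpus / name).mkdir(parents=True)
        for i in range(3):
            (corpus / name / f"{i}.txt").write_text("sample", encoding="utf-8")
    demo_counts = build_training_set(corpus, Path(tmp) / "train", per_category=2)
```

For the real corpus, the call would look like `build_training_set("path/to/corpus", "path/to/train")`, with both paths chosen by the reader.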

Data preprocessing includes word segmentation, stop-word removal, word-frequency statistics, feature selection, representing documents with the vector space model, and so on. The next few posts will follow these steps to preprocess the text.
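As a rough illustration of two of these steps, stop-word removal and word-frequency counting, here is a minimal sketch using only the standard library. The token list and the stop-word set are made-up examples, and the helper names are hypothetical; a real run would use the segmenter output and a full stop-word list.

```python
from collections import Counter


def remove_stop_words(tokens, stop_words):
    """Drop empty tokens and any token found in the stop-word set."""
    return [t for t in tokens if t.strip() and t not in stop_words]


def term_frequencies(tokens):
    """Count how often each remaining token occurs."""
    return Counter(tokens)


# Made-up segmenter output; punctuation is still present, as it would be
# in raw segmentation results.
tokens = ["我们", "的", "电脑", "，", "电脑", "很", "快"]
stop_words = {"的", "很", "，"}

filtered = remove_stop_words(tokens, stop_words)
freq = term_frequencies(filtered)
```

Putting punctuation marks into the stop-word set is one simple way to get rid of them, since the segmenter keeps them in its output.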

Text segmentation is done by calling, from Python, the segmentation function of NLPIR/ICTCLAS, the Chinese lexical analysis system from the Chinese Academy of Sciences. Because each category of the Sogou corpus used here contains many texts, segmentation requires traversing the corpus folders, segmenting the texts in batch, and saving the results locally. A user dictionary can be added as needed so that words you want to keep intact are not split. The segmentation output contains every character of the text, including punctuation marks.
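The traverse, segment, and save loop described above could be sketched like this. It is not the author's code: `segment_folder` and `placeholder_segment` are hypothetical names, the placeholder only splits on whitespace so that the sketch runs without NLPIR installed, and the default encoding is an assumption. In practice, `segment_fn` would be replaced by a function that calls the NLPIR/ICTCLAS segmenter, with any user dictionary loaded beforehand.

```python
import os
import tempfile


def placeholder_segment(text):
    """Hypothetical stand-in for an NLPIR/ICTCLAS call: splits on whitespace only."""
    return text.split()


def segment_folder(src_root, dst_root, segment_fn=placeholder_segment, encoding="utf-8"):
    """Segment every file under src_root and mirror the results under dst_root.

    dst_root should lie outside src_root, otherwise the output would be walked too.
    """
    processed = 0
    for dirpath, _dirnames, filenames in os.walk(src_root):
        # Recreate the category sub-folder structure in the output directory.
        rel = os.path.relpath(dirpath, src_root)
        out_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(out_dir, exist_ok=True)
        for name in filenames:
            with open(os.path.join(dirpath, name), encoding=encoding, errors="ignore") as fh:
                words = segment_fn(fh.read())
            # Store the segmented text as space-separated words.
            with open(os.path.join(out_dir, name), "w", encoding=encoding) as fh:
                fh.write(" ".join(words))
            processed += 1
    return processed


# Small demonstration on a throwaway directory with a single file.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "corpus")
    dst = os.path.join(tmp, "segmented")
    os.makedirs(os.path.join(src, "IT"))
    with open(os.path.join(src, "IT", "0.txt"), "w", encoding="utf-8") as fh:
        fh.write("python calls nlpir")
    demo_count = segment_folder(src, dst)
    with open(os.path.join(dst, "IT", "0.txt"), encoding="utf-8") as fh:
        demo_output = fh.read()
```

Passing the segmenter in as a parameter keeps the folder traversal independent of how NLPIR is loaded, which may matter on a 32-bit system where the matching library build has to be used.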

The author is using a 32-bit Windows system; the following is the code for text segmentation:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
__author__ = 'peter_howe'


