PyMining: open-source Chinese text data mining platform, ver 0.1 released

Source: Internet
Author: User

Project homepage:

http:// (access may require a proxy to get past the firewall)

Tutorials and other content have been added to the Google Code project. You can read them in the wiki.


Project introduction (copied from the project homepage):

This is a platform that takes source data, either a matrix that can be represented in CSV format or a set of Chinese documents, and runs algorithms on it to produce results.

Algorithms can be run one after another, in the order given in the XML configuration file. For example, we can first run principal component analysis for feature selection and then run the random forest algorithm for classification.
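To make the idea of chaining steps from an XML file concrete, here is a minimal sketch. It does not use PyMining's real configuration schema: the `<pipeline>`/`<step>` layout, the step names, and the two toy steps (which stand in for PCA and random forest) are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical config layout; PyMining's real schema may differ.
CONFIG_XML = """
<pipeline>
  <step name="scale" factor="0.5"/>
  <step name="threshold" cutoff="1.0"/>
</pipeline>
"""

def scale(rows, factor):
    """Multiply every value by a constant factor (stand-in for a transform step)."""
    return [[v * float(factor) for v in row] for row in rows]

def threshold(rows, cutoff):
    """Label a row 1 if its sum exceeds the cutoff, else 0 (stand-in for a classifier)."""
    return [1 if sum(row) > float(cutoff) else 0 for row in rows]

# Registry mapping step names used in the XML to Python callables.
STEPS = {"scale": scale, "threshold": threshold}

def run_pipeline(xml_text, data):
    """Run each <step> in document order, feeding its output to the next step."""
    root = ET.fromstring(xml_text)
    for step in root.findall("step"):
        params = dict(step.attrib)
        func = STEPS[params.pop("name")]
        data = func(data, **params)
    return data

result = run_pipeline(CONFIG_XML, [[1.0, 2.0], [0.5, 0.5], [4.0, 0.0]])
print(result)  # [1, 0, 1]
```

Reordering or swapping steps then only requires editing the XML, not the driver code.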

Currently, the algorithms mainly target tasks that can be completed on a single machine. The architecture is designed to be extensible, so you can implement the algorithm you want in a short time and put it to use in engineering work (believe me, it will be faster and better than WEKA). Another feature of this project is its support for classifying and clustering Chinese text.

With just the following program you can get striking results (it selects text features, trains a Naive Bayes classification model, and makes predictions):

# Load the configuration
config = Configuration.FromFile("conf/test.xml")
PyMining.Init(config, "__global__")

# Build the training matrix from the source text
matCreater = ClassifierMatrix(config, "__matrix__")
[trainx, trainy] = matCreater.CreateTrainMatrix("data/train.txt")

# Train the chi-square feature filter
chiFilter = ChiSquareFilter(config, "__filter__")
chiFilter.TrainFilter(trainx, trainy)

# Train the Naive Bayes model
nbModel = TwcNaiveBayes(config, "twc_naive_bayes")
nbModel.Train(trainx, trainy)

# Use the model to predict the target class of unseen documents
[testx, testy] = matCreater.CreatePredictMatrix("data/test.txt")
[testx, testy] = chiFilter.MatrixFilter(testx, testy)
retY = nbModel.TestMatrix(testx, testy)


Current version:

Ver 0.1 (second development release)



Features of the previous version:

    • Supports Chinese text input, word segmentation, and related operations to produce the source data for classification
    • Feature selection with the chi-square test
    • Parameter tuning through the XML configuration file
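As an illustration of chi-square feature selection, the sketch below scores each term by how strongly its presence is associated with a class and keeps the top-scoring terms. It is a generic textbook formulation on a toy dataset, not PyMining's `ChiSquareFilter`; the function names and the sample data are assumptions.

```python
def chi_square(docs, labels, term, target):
    """Chi-square statistic for the association between `term` and class `target`."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == target
        if has_term and in_class:
            a += 1  # term present, document in class
        elif has_term:
            b += 1  # term present, document in another class
        elif in_class:
            c += 1  # term absent, document in class
        else:
            d += 1  # term absent, document in another class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def select_features(docs, labels, k):
    """Keep the k terms with the highest chi-square score over any class."""
    vocab = sorted({t for tokens in docs for t in tokens})
    classes = sorted(set(labels))
    scores = {t: max(chi_square(docs, labels, t, c) for c in classes) for t in vocab}
    return sorted(vocab, key=lambda t: -scores[t])[:k]

docs = [{"goal", "match"}, {"goal", "team"}, {"stock", "market"}, {"stock", "team"}]
labels = ["sports", "sports", "finance", "finance"]
print(select_features(docs, labels, 2))  # ['goal', 'stock']
```

Here "team" appears equally often in both classes, so its score is zero and it is filtered out first.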


New features:

    • Added the K-means algorithm for text clustering.
    • Added a complement-based Naive Bayes algorithm, which substantially improves classification accuracy. On the Sogou Lab text classification data (about 20,000 articles in 8 categories), its prediction accuracy is currently about 90%.
    • Added an import tool for the Sogou Lab text classification data, to make further experiments easier.
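The complement idea behind this kind of Naive Bayes can be sketched as follows: each class's word weights are estimated from the documents outside that class, and a document is assigned to the class whose complement fits it worst. This is a bare-bones illustration that assumes plain term counts with Laplace smoothing; it omits the term-frequency, inverse-document-frequency, and weight-normalization steps of the full transformed variant, and it is not PyMining's `TwcNaiveBayes` code.

```python
import math
from collections import Counter

def train_complement_nb(docs, labels, alpha=1.0):
    """Estimate per-class log weights from the documents outside each class."""
    vocab = sorted({t for doc in docs for t in doc})
    weights = {}
    for c in sorted(set(labels)):
        counts = Counter()
        for doc, label in zip(docs, labels):
            if label != c:
                counts.update(doc)  # count terms only in the complement of c
        total = sum(counts.values()) + alpha * len(vocab)
        weights[c] = {t: math.log((counts[t] + alpha) / total) for t in vocab}
    return weights

def predict(weights, doc):
    """Pick the class whose complement explains the document worst."""
    def score(c):
        # Terms never seen in training contribute nothing.
        return sum(weights[c].get(t, 0.0) for t in doc)
    return min(sorted(weights), key=score)

docs = [["goal", "match", "goal"], ["team", "goal"],
        ["stock", "market"], ["stock", "price", "stock"]]
labels = ["sports", "sports", "finance", "finance"]
model = train_complement_nb(docs, labels)
print(predict(model, ["goal", "team"]))    # sports
print(predict(model, ["stock", "price"]))  # finance
```

Estimating from the complement gives each class more training text to draw on, which tends to help when class sizes are uneven.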


Obtain PyMining:

You can obtain the latest ver 0.1 release from the project site (access may require a proxy to get past the firewall).

Version that does not require a proxy:
