PyMining: an open-source Chinese text data mining platform, ver 0.1 released

Last Update:2018-12-07 Source: Internet

Author: User


Project homepage:

http://code.google.com/p/python-data-mining-platform/ (you may need a proxy to get past the firewall)

The tutorial and other material have been added to the Google Code project; you can read them in its wiki.

Project introduction (copied from the project homepage):

This is a platform that takes source data, either a matrix in CSV format or a set of Chinese documents, runs algorithms on it, and produces results.

Algorithms can be run one after another, as configured in an XML file. For example, we can first run principal component analysis for feature selection and then run a random forest for classification.
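The idea of chaining algorithms through an XML file can be sketched roughly as follows. This is only an illustration and not PyMining's actual configuration schema: the `<pipeline>` and `<step>` elements, the `REGISTRY` dictionary, and the two toy stages are all hypothetical, standing in for real stages such as principal component analysis and a random forest.

```python
import xml.etree.ElementTree as ET

# Hypothetical config: element and attribute names are invented for this sketch.
CONFIG_XML = """
<pipeline>
    <step name="scale" factor="2"/>
    <step name="threshold" cutoff="5"/>
</pipeline>
"""

def scale(rows, factor="1"):
    """Toy stand-in for a feature-transform stage: multiply every value."""
    f = float(factor)
    return [[v * f for v in row] for row in rows]

def threshold(rows, cutoff="0"):
    """Toy stand-in for a classifier stage: label a row 1 if its sum exceeds cutoff."""
    c = float(cutoff)
    return [1 if sum(row) > c else 0 for row in rows]

# Maps the step names used in the XML to Python callables.
REGISTRY = {"scale": scale, "threshold": threshold}

def run_pipeline(xml_text, data):
    """Run each <step> in document order, feeding each output to the next step."""
    root = ET.fromstring(xml_text)
    for step in root.findall("step"):
        params = dict(step.attrib)          # XML attributes arrive as strings
        func = REGISTRY[params.pop("name")]
        data = func(data, **params)
    return data

result = run_pipeline(CONFIG_XML, [[1.0, 1.0], [2.0, 1.0]])
print(result)
```

Keeping the order and parameters of the stages in the XML file means an experiment can be changed without touching the Python code.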

At present, the algorithms mainly target tasks that can be completed on a single machine. The architecture is highly extensible, so you can implement the algorithm you need in a short time and put it to use in engineering work (believe me, it will be faster and better than WEKA). Another feature of the project is its support for classification and clustering of Chinese text.

With just the following program you can get impressive results (select features from the text, train a naive Bayes classification model, and make predictions):

    # Load config
    config = Configuration.FromFile("conf/test.xml")
    PyMining.Init(config, "__global__")

    # Get matrix from source text
    matCreater = ClassifierMatrix(config, "__matrix__")
    [trainx, trainy] = matCreater.CreateTrainMatrix("data/train.txt")

    # Get chi-square filter
    chiFilter = ChiSquareFilter(config, "__filter__")
    chiFilter.TrainFilter(trainx, trainy)

    # Train the naive Bayes model
    nbModel = TwcNaiveBayes(config, "twc_naive_bayes")
    nbModel.Train(trainx, trainy)

    # Use the model to predict the target class of unseen documents
    [testx, testy] = matCreater.CreatePredictMatrix("data/test.txt")
    [testx, testy] = chiFilter.MatrixFilter(testx, testy)
    retY = nbModel.TestMatrix(testx, testy)

Current version:

Ver 0.1 (second development release)

Features:

Features from the previous version:

Supports Chinese text input, word segmentation, and related operations to produce the source data for classification
Feature selection with the chi-square test
Parameter tuning through the XML configuration file
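As a rough illustration of chi-square feature selection, the sketch below scores each term by the chi-square statistic of its 2x2 contingency table (term present or absent, document in the target class or not). It is not PyMining's implementation: the function name `chi_square_scores`, its arguments, and the tiny English corpus are assumptions made for the example, and real input would be segmented Chinese text.

```python
from collections import Counter

def chi_square_scores(docs, labels, target):
    """Score each term by the chi-square statistic of its 2x2 table against `target`.

    docs   -- list of token lists (already segmented)
    labels -- class label of each document
    target -- the class to score terms against
    """
    n = len(docs)
    n_target = sum(1 for y in labels if y == target)
    in_target = Counter()   # target-class documents that contain the term
    in_other = Counter()    # other documents that contain the term
    for tokens, y in zip(docs, labels):
        for term in set(tokens):          # document frequency, not term frequency
            (in_target if y == target else in_other)[term] += 1

    scores = {}
    for term in set(in_target) | set(in_other):
        a = in_target[term]
        b = in_other[term]
        c = n_target - a                  # target documents without the term
        d = (n - n_target) - b            # other documents without the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores

docs = [["football", "match"], ["football", "goal"],
        ["stock", "market"], ["stock", "match"]]
labels = ["sports", "sports", "finance", "finance"]
scores = chi_square_scores(docs, labels, "sports")
# Keep the two highest-scoring terms as features.
top_terms = sorted(scores, key=scores.get, reverse=True)[:2]
print(top_terms)
```

Terms that appear equally often inside and outside the class (here `match`) score zero and are the first to be filtered out.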

New features:

Added the K-means algorithm for text clustering.
Added a complement naive Bayes algorithm, which greatly improves classification accuracy. On the Sogou Lab text classification data (about 20,000 articles in 8 categories), prediction accuracy is about 90%.
Added an import tool for the Sogou Lab text classification data, to make further experiments easier.
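The naive Bayes variant added in this release is usually called complement naive Bayes: the weights for each class are estimated from the documents of all the other classes, and a document is assigned to the class whose complement fits it worst. The sketch below shows only that core idea, under stated assumptions. It omits the term-frequency transforms and weight normalization that fuller variants add, it is not PyMining's code, and the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter

def train_complement_nb(docs, labels, alpha=1.0):
    """Return per-class log weights estimated from every *other* class's documents."""
    vocab = sorted({t for tokens in docs for t in tokens})
    total = Counter()                       # term counts over the whole corpus
    per_class = {c: Counter() for c in set(labels)}
    for tokens, y in zip(docs, labels):
        total.update(tokens)
        per_class[y].update(tokens)

    weights = {}
    for c, counts in per_class.items():
        # Complement counts: occurrences of each term outside class c.
        comp = {t: total[t] - counts[t] for t in vocab}
        denom = sum(comp.values()) + alpha * len(vocab)   # additive smoothing
        weights[c] = {t: math.log((comp[t] + alpha) / denom) for t in vocab}
    return weights

def predict_complement_nb(weights, tokens):
    """Pick the class whose complement explains the document worst (lowest score)."""
    def score(c):
        w = weights[c]
        return sum(w[t] for t in tokens if t in w)   # unseen terms are ignored
    return min(sorted(weights), key=score)

docs = [["ball", "goal", "ball"], ["goal", "team"],
        ["stock", "price"], ["price", "bank", "stock"]]
labels = ["sports", "sports", "finance", "finance"]
model = train_complement_nb(docs, labels)
prediction = predict_complement_nb(model, ["ball", "team", "goal"])
print(prediction)
```

Estimating from the complement gives every class roughly the same amount of training data, which tends to help when the classes are unevenly sized.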

Getting PyMining:

The latest version, Ver 0.1, can be downloaded from http://code.google.com/p/python-data-mining-platform/downloads/detail?name=pymining_0_1.zip&can=2&q=#makechanges (you may need a proxy to get past the firewall).

Mirror that does not require a proxy: http://files.cnblogs.com/LeftNotEasy/pymining_0_1.zip

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
