Project homepage:
http://code.google.com/p/python-data-mining-platform/ (a proxy may be needed to reach it)
The tutorial and other material have been added to the Google Code site; see the project Wiki.
Project introduction (copied from the project homepage):
This is a platform that takes source data, either a matrix in CSV format or a set of Chinese documents, runs algorithms on it, and produces results.
Algorithms can be chained one after another through an XML configuration file. For example, we can first run principal component analysis for feature selection and then run a random forest for classification.
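To make the idea of chaining steps from an XML file concrete, here is a minimal sketch. The config layout, the step names `scale` and `offset`, and the `run_pipeline` helper are all hypothetical stand-ins invented for illustration; they are not PyMining's actual schema or API, and the two toy steps merely take the place of heavier stages such as PCA or a random forest.

```python
import xml.etree.ElementTree as ET

# Hypothetical config layout; the real PyMining schema may differ.
CONFIG_XML = """
<config>
  <pipeline>
    <step name="scale" factor="2"/>
    <step name="offset" amount="1"/>
  </pipeline>
</config>
""".strip()


def scale(rows, params):
    """Toy step: multiply every value by the configured factor."""
    factor = float(params["factor"])
    return [[v * factor for v in row] for row in rows]


def offset(rows, params):
    """Toy step: add the configured amount to every value."""
    amount = float(params["amount"])
    return [[v + amount for v in row] for row in rows]


# Registry mapping step names in the config to Python callables.
STEPS = {"scale": scale, "offset": offset}


def run_pipeline(xml_text, rows):
    """Run each <step> in document order, feeding its output to the next."""
    root = ET.fromstring(xml_text)
    for step in root.findall("./pipeline/step"):
        params = dict(step.attrib)
        name = params.pop("name")
        rows = STEPS[name](rows, params)
    return rows


result = run_pipeline(CONFIG_XML, [[1.0, 2.0], [3.0, 4.0]])
```

Keeping the step order and parameters in the config file means a different algorithm sequence can be tried without touching the code that drives it.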
At present the algorithms target tasks that can be completed on a single machine. The architecture is easy to extend, so you can implement the algorithm you want in a short time and use it in engineering work (believe me, it will be faster and better than WEKA). Another feature of this project is its support for classification and clustering of Chinese text.
With just the following program you can get impressive results (select text features, train a naive Bayes classification model, and make predictions):
# Load the configuration
config = Configuration.FromFile("conf/test.xml")
PyMining.Init(config, "__global__")

# Build the training matrix from the source text
matCreater = ClassifierMatrix(config, "__matrix__")
[trainx, trainy] = matCreater.CreateTrainMatrix("data/train.txt")

# Train the chi-square feature filter
chiFilter = ChiSquareFilter(config, "__filter__")
chiFilter.TrainFilter(trainx, trainy)

# Train the naive Bayes model
nbModel = TwcNaiveBayes(config, "twc_naive_bayes")
nbModel.Train(trainx, trainy)

# Use the model to predict the target class of unseen documents
[testx, testy] = matCreater.CreatePredictMatrix("data/test.txt")
[testx, testy] = chiFilter.MatrixFilter(testx, testy)
retY = nbModel.TestMatrix(testx, testy)
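The chi-square filter step in the program above scores how strongly each term is associated with a class. The sketch below shows the standard 2x2 chi-square statistic in plain Python; the function name `chi_square_scores` and the toy documents are made up for illustration, and this is not PyMining's own implementation.

```python
from collections import Counter


def chi_square_scores(docs, labels, target):
    """Chi-square score of every term with respect to the class `target`.

    docs is a list of token lists, labels the matching class names.
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == target)
    n_neg = n - n_pos

    # Document frequency of each term inside and outside the target class.
    pos_df, neg_df = Counter(), Counter()
    for words, y in zip(docs, labels):
        bucket = pos_df if y == target else neg_df
        bucket.update(set(words))

    scores = {}
    for term in set(pos_df) | set(neg_df):
        a = pos_df[term]   # in class, contains term
        b = neg_df[term]   # not in class, contains term
        c = n_pos - a      # in class, lacks term
        d = n_neg - b      # not in class, lacks term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


docs = [["goal", "match"], ["goal", "team"],
        ["stock", "market"], ["stock", "team"]]
labels = ["sports", "sports", "finance", "finance"]
scores = chi_square_scores(docs, labels, "sports")

# Keep the two highest-scoring terms as features.
top = sorted(scores, key=scores.get, reverse=True)[:2]
```

A term such as "team" that appears equally often in both classes scores zero and would be filtered out, while class-specific terms score highest.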
Current version:
Ver 0.1 (Second Development Edition)
Features:
Features from the previous version:
- Supports Chinese text input, word segmentation, and related operations to build the source data for classification
- Feature selection with the chi-square test
- Parameter tuning through the XML configuration file
New features:
- Added the K-means algorithm for text clustering.
- Added a complement-based naive Bayes algorithm that greatly improves classification accuracy. On the Sogou Lab text classification data (about 20,000 articles in 8 categories), its prediction accuracy is about 90%.
- Added an import tool for the Sogou Lab text classification data, to make further experiments easier.
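As a rough illustration of the complement idea behind the naive Bayes variant listed above, here is a minimal sketch. It estimates each class's word weights from the documents that do not belong to that class and predicts the class whose complement fits the document worst. The function names and toy data are hypothetical, and the sketch leaves out the term-frequency, IDF, and weight-normalization steps that a full transformed weight-normalized variant would add, so it should not be read as PyMining's implementation.

```python
import math
from collections import Counter


def train_complement_nb(docs, labels, alpha=1.0):
    """Return log-weights per class, estimated from the OTHER classes."""
    vocab = sorted({w for words in docs for w in words})
    total = Counter(w for words in docs for w in words)

    per_class = {c: Counter() for c in set(labels)}
    for words, y in zip(docs, labels):
        per_class[y].update(words)

    weights = {}
    for c, counts in per_class.items():
        # Word counts over every document that is not in class c.
        comp = {w: total[w] - counts[w] for w in vocab}
        denom = sum(comp.values()) + alpha * len(vocab)
        weights[c] = {w: math.log((comp[w] + alpha) / denom) for w in vocab}
    return weights


def predict_complement_nb(weights, words):
    """Pick the class whose complement explains the document least well."""
    def complement_score(c):
        # Words unseen in training contribute 0 to every class.
        return sum(weights[c].get(w, 0.0) for w in words)
    return min(sorted(weights), key=complement_score)


docs = [["goal", "match"], ["goal", "team"],
        ["stock", "market"], ["stock", "team"]]
labels = ["sports", "sports", "finance", "finance"]
model = train_complement_nb(docs, labels)

sports_guess = predict_complement_nb(model, ["goal", "team"])
finance_guess = predict_complement_nb(model, ["stock", "market"])
```

Estimating from the complement gives each class more training text to draw on, which is the usual argument for why this variant copes better with unevenly sized categories.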
Obtain PyMining:
The latest release, Ver 0.1, can be downloaded from http://code.google.com/p/python-data-mining-platform/downloads/detail?name=pymining_0_1.zip&can=2&q=#makechanges (a proxy may be needed to reach it).
Mirror that does not require a proxy: http://files.cnblogs.com/LeftNotEasy/pymining_0_1.zip