Python Data Mining Domain Toolkit

Source: Internet
Author: User


In scientific computing, Python has two important extension modules: NumPy and SciPy. NumPy is the fundamental scientific computing package for Python. It includes:

    • A powerful n-dimensional array object;
    • Sophisticated (broadcasting) functions;
    • Tools for integrating C/C++ and Fortran code;
    • Practical linear algebra, Fourier transform, and random number generation functions.
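As a minimal sketch of the NumPy features listed above, the following example touches each one in turn. The array values, the 2x2 system, and the variable names are made up for illustration and do not come from the original article.

```python
import numpy as np

# An n-dimensional array object: a 2-D array with shape (2, 3).
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Broadcasting: the 1-D row is stretched across both rows of `a`.
row = np.array([10.0, 20.0, 30.0])
shifted = a + row

# Linear algebra: solve the 2x2 system M @ x = b.
M = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(M, b)

# Fourier transform: a sinusoid with 2 cycles over 8 samples
# concentrates its energy in frequency bin 2.
t = np.arange(8)
signal = np.sin(2 * np.pi * 2 * t / 8)
spectrum = np.abs(np.fft.rfft(signal))
peak_bin = int(np.argmax(spectrum))

# Random number generation, seeded so the run is reproducible.
rng = np.random.default_rng(0)
samples = rng.normal(size=1000)

print(shifted)
print(x, peak_bin, samples.mean())
```

Broadcasting is what lets `a + row` work without an explicit loop: the shapes (2, 3) and (3,) are compatible, so the row is applied to every row of `a`.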

SciPy is an open-source Python library of algorithms and mathematical tools. It contains modules for optimization, linear algebra, integration, interpolation, special functions, fast Fourier transforms, signal and image processing, ordinary differential equation solving, and other computations common in science and engineering. Its functionality is comparable to that of MATLAB, Scilab, and GNU Octave.
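A short sketch of a few of the SciPy modules mentioned above (optimization, integration, interpolation, and ODE solving). The test functions are arbitrary choices made for illustration, picked because their exact answers are known.

```python
import numpy as np
from scipy import integrate, interpolate, optimize

# Optimization: minimize a quadratic whose minimum is at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2)

# Integration: integrate sin(x) from 0 to pi (the exact value is 2).
area, abs_err = integrate.quad(np.sin, 0.0, np.pi)

# Interpolation: a cubic spline through five samples of x**2.
xs = np.linspace(0.0, 4.0, 5)
spline = interpolate.CubicSpline(xs, xs ** 2)
value = float(spline(2.5))

# ODE solving: dy/dt = -y with y(0) = 1, so y(1) should be exp(-1).
sol = integrate.solve_ivp(lambda t, y: -y, (0.0, 1.0), [1.0],
                          rtol=1e-8, atol=1e-10)
y_end = sol.y[0, -1]

print(result.x, area, value, y_end)
```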

NumPy and SciPy are often used together, and most Python machine learning libraries depend on both. Plotting and visualization rely on the Matplotlib module, whose style resembles MATLAB's. Python has a large number of machine learning libraries, most of them open source. The main ones are:

1. Scikit-learn

scikit-learn is an open-source machine learning module built on SciPy and NumPy. It covers classification, regression, and clustering; its main algorithms include SVM, logistic regression, naive Bayes, k-means, and DBSCAN. It is currently funded by INRIA, with occasional grants from Google.
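To give a feel for how the classification and clustering algorithms named above are used, here is a minimal sketch. The synthetic dataset, the cluster centers, and the hyperparameters are assumptions chosen for illustration, not recommendations from the original article.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data: two well-separated 2-D clusters (illustrative only).
X, y = make_blobs(n_samples=200, centers=[[-5, -5], [5, 5]],
                  cluster_std=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Classification: estimators share the same fit/predict/score interface.
svm_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)
logreg_acc = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)

# Clustering: k-means ignores the labels and groups points by distance.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_sizes = sorted(int((kmeans.labels_ == k).sum()) for k in range(2))

print(svm_acc, logreg_acc, cluster_sizes)
```

Because every estimator exposes the same `fit` and `score` methods, swapping the SVM for logistic regression is a one-line change.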

Project homepage:



2. NLTK

NLTK (Natural Language Toolkit) is a natural language processing module for Python that includes a range of string processing tools and statistical language models. It is widely used in research and teaching in linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning. NLTK provides more than 50 corpora and lexical resources, along with text processing libraries for classification, tokenization, stemming, parsing, and semantic reasoning. It runs stably on Windows, Mac OS X, and Linux.

Project homepage:

3. mlpy

mlpy is a Python machine learning module built on NumPy/SciPy and implemented with Cython extensions. The machine learning algorithms it includes are:

    • Regression: least squares, ridge regression, least angle regression, elastic net, kernel ridge regression, support vector machines (SVM), partial least squares (PLS)
    • Classification: linear discriminant analysis (LDA), basic perceptron, elastic net, logistic regression, (kernel) support vector machines (SVM), diagonal linear discriminant analysis (DLDA), Golub classifier, Parzen-based classifier, (kernel) Fisher discriminant classifier, k-nearest neighbor, iterative RELIEF, classification tree, maximum likelihood classifier
    • Clustering: hierarchical clustering, memory-saving hierarchical clustering, k-means
    • Dimension reduction: (kernel) Fisher discriminant analysis (FDA), spectral regression discriminant analysis (SRDA), (kernel) principal component analysis (PCA)

Project homepage:


4. Shogun

Shogun is an open-source toolkit for large-scale machine learning. Its functionality is currently organized into several parts: features, feature preprocessing, kernel representations, kernel normalization, distance representations, classifiers, clustering methods, distributions, performance evaluation methods, regression methods, and structured output learners.

The core of Shogun is implemented in C++ and provides interfaces for MATLAB, R, Octave, and Python. It is mainly used on Linux.

Project homepage:

5. MDP

The Modular toolkit for Data Processing (MDP) is a Python data processing framework.

From the user's point of view, MDP is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feedforward network architectures. Computations are performed efficiently in terms of both speed and memory requirements. From the scientific developer's point of view, MDP is a modular framework that can be extended easily: implementing a new algorithm is simple and intuitive, and a newly implemented unit is automatically integrated with the rest of the library. MDP was written in the context of neuroscience research, but it is designed to be useful wherever trainable data processing algorithms are used. Its simplicity on the user side, the variety of readily available algorithms, and the reusability of the implemented units also make it a useful teaching tool.

Project homepage:

6. PyBrain

PyBrain (Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library) is a machine learning module for Python. Its goal is to provide flexible, easy-to-use, yet powerful algorithms for machine learning tasks. (Quite a bold name.)

As its name suggests, PyBrain includes neural networks, reinforcement learning (and combinations of the two), unsupervised learning, and evolutionary algorithms. Because many current problems involve continuous state and action spaces, function approximators (such as neural networks) are needed to cope with high-dimensional data. PyBrain is built around neural networks: all of its training methods take a neural network as the instance to be trained.

Project homepage:


7. BigML

BigML makes machine learning easy for data-driven decisions and predictions, creating elegant predictive models through easy-to-understand interactive operations. BigML provides Python bindings.

Project homepage:


8. PyML

PyML is a Python machine learning toolkit that provides a flexible framework for classification and regression methods. It mainly offers feature selection, model selection, classifier combination, and classification evaluation.

Project homepage:

9. Milk

Milk is a Python machine learning toolkit that focuses on supervised classification, with several effective classifiers: SVMs (based on LIBSVM), k-NN, random forests, and decision trees. It also supports feature selection. These classifiers can be combined in many ways to form different classification systems.

For unsupervised learning, it provides k-means and affinity propagation clustering algorithms.

Project homepage:



10. PyMVPA

PyMVPA (MultiVariate Pattern Analysis in Python) is a Python toolkit for statistical learning analyses of large datasets, built as a flexible and extensible framework. It provides classification, regression, feature selection, data import and export, visualization, and other functions.

Project homepage:


11. Pattern

Pattern is a web mining module for Python. It bundles tools for the Google, Twitter, and Wikipedia APIs, a web crawler, and an HTML parser. Its text analysis tools include rule-based shallow parsing, a WordNet interface, syntactic and semantic analysis, TF-IDF, and LSA. It also provides clustering, classification, and graph network visualization.
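To make the TF-IDF weighting mentioned above concrete, here is a minimal pure-Python sketch. It does not use Pattern's API: the function `tf_idf`, the whitespace tokenization, and the sample documents are all hypothetical, and real libraries typically apply smoothing and normalization that are omitted here.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    tf  = term count / number of terms in the document
    idf = log(number of documents / number of documents containing the term)
    """
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "python web mining",
    "python machine learning",
    "web crawler and html parsing",
]
scores = tf_idf(docs)
print(scores[0])
```

In the first document, "mining" appears in only one of the three documents and so receives a higher weight than "python" or "web", which each appear in two.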

Project homepage:



12. Pyrallel

Pyrallel (Parallel Data Analytics in Python) is an experimental project exploring distributed computing patterns for machine learning and semi-interactive data analysis. It can run on small clusters, and its scope is:

    • Focus on small to medium datasets that fit in memory on a small (10+ nodes) to medium (100+ nodes) cluster.
    • Focus on small to medium data (with data locality when possible).
    • Focus on CPU-bound tasks (e.g. training random forests) while keeping disk and network access to a minimum.
    • Do not focus on HA/fault tolerance (yet).
    • Do not try to invent a new set of high-level programming abstractions (yet): use a low-level programming model (IPython.parallel) to finely control the cluster elements and the messages transferred, and to help identify the practical underlying constraints in a distributed machine learning setting.

Project homepage:



13. Monte

Monte (machine learning in pure Python) is a pure Python machine learning library. It can quickly build neural networks, conditional random fields, and logistic regression models, uses inline C for optimization, and is easy to use and extend.

Project homepage:




14. Orange

Orange is a component-based data mining and machine learning software suite. It features a friendly yet powerful, fast, and versatile visual programming front end for exploratory data analysis and visualization, together with Python bindings for scripting. It contains a complete set of components for data preprocessing, feature scoring and filtering, modeling, model evaluation, and exploration techniques. It is written in C++ and Python, and its graphical interface is built on the cross-platform Qt framework.

Project homepage:

15. Theano

Theano is a Python library for defining, optimizing, and evaluating mathematical expressions, in particular those involving multidimensional arrays, efficiently. Features of Theano:

    • Tight integration with NumPy
    • Efficient GPU computation for data-intensive work
    • Efficient symbolic differentiation
    • Speed and stability optimizations
    • Dynamic C code generation
    • Extensive unit testing and self-verification

Since 2007, Theano has been widely used in scientific computing. Theano makes it easier to build deep learning models and can quickly implement the following:

    • Logistic regression
    • Multilayer perceptron
    • Deep convolutional network
    • Autoencoders, denoising autoencoders
    • Stacked denoising autoencoders
    • Restricted Boltzmann machines
    • Deep belief networks
    • HMC sampling
    • Contractive autoencoders

Theano was a Greek beauty, the daughter of Milo, the strongest man of Croton, and later became Pythagoras' wife.

Project homepage:


16. Pylearn2

Pylearn2 is built on Theano and partly depends on scikit-learn. It is still under development and is intended to handle vectors, images, video, and other data, and to provide deep learning models such as MLP, RBM, and SDA. Pylearn2's goals are:

  • Researchers add features as they need them. We avoid getting bogged down by too much top-down planning in advance.
  • A Machine Learning Toolbox for easy scientific experimentation.
  • All models/algorithms published by the LISA lab should have reference implementations in Pylearn2.
  • Pylearn2 may wrap other libraries such as scikits.learn when this is practical.
  • Pylearn2 differs from scikits.learn in that Pylearn2 aims to provide great flexibility and make it possible for a researcher to do almost anything, while scikits.learn aims to work as a "black box" that can produce good results even if the user does not understand the implementation.
  • Dataset interface for vectors, images, video, ...
  • A small framework covering everything needed for a normal MLP/RBM/SDA/convolution experiment.
  • Easy reuse of sub-components of Pylearn2.
  • Using one sub-component of the library does not force you to use or learn to use all of the other sub-components if you choose not to.
  • Support cross-platform serialization of learned models.
  • Remain approachable enough to be used in the classroom (IFT6266 at the University of Montreal).

Project homepage:


There are other Python machine learning libraries, such as:


Pymining

Ease

Textmining

More machine learning libraries can be found through Https://
