Python is widely used in scientific computing: computer vision, artificial intelligence, mathematics, astronomy, and so on. Unsurprisingly, it is also widely used in machine learning.
This article lists and describes the most useful machine learning tools and libraries for Python. We do not require these libraries to be written in Python; it is enough that they provide a Python interface.
Our goal is not to list every machine learning library in Python (the Python Package Index (PyPI) returns 139 results for "machine learning"), but rather to list the useful, well-maintained ones we know.
In addition, although some modules can be used for a variety of machine learning tasks, we only list libraries whose main focus is machine learning. For example, although SciPy contains some clustering algorithms, its main focus is not machine learning but a comprehensive set of scientific computing tools, so we excluded it (even though we use it too!).
Another thing worth mentioning: we also evaluate these libraries on how well they integrate with other scientific computing libraries, because machine learning (supervised or unsupervised) is only one part of a data processing system. If the library you use does not fit with the rest of your data processing stack, you will spend a lot of time building glue layers between libraries. Having a great library in your toolset is important, but it is just as important that the library integrates well with the others.
If you are proficient in another language but still want to use Python packages, we also briefly describe how to call the libraries listed here from your language via Python.
Scikit-learn
Scikit-learn is the machine learning tool we use at CB Insights. We use it for classification, feature selection, feature extraction, and clustering.
One of our favorite things about it is its consistent, easy-to-use API and the wealth of out-of-the-box evaluation, diagnostic, and cross-validation methods it provides (sound familiar? Python itself follows a "batteries included" philosophy). The icing on the cake is that it uses SciPy data structures under the hood, so it fits well with the rest of the Python scientific computing stack: SciPy, NumPy, Pandas, and Matplotlib.
So, if you want to visualize your classifier's performance (for example, with a precision-recall chart or a receiver operating characteristic (ROC) curve), Matplotlib makes quick visualization easy.
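As a quick sketch (assuming scikit-learn is installed; the labels and scores below are made up for illustration), the points of a ROC curve come straight out of scikit-learn's metrics module, ready for Matplotlib:

```python
from sklearn.metrics import roc_curve, roc_auc_score

# True labels and classifier scores (illustrative values)
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3]

# fpr/tpr are the curve's coordinates; thresholds are the cut-offs used
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(auc)
```

Plotting is then a one-liner: `plt.plot(fpr, tpr)`.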
Considering how much time is spent cleaning and structuring data, the fact that this library integrates tightly with the other scientific computing packages makes it very convenient to use.
In addition, it offers limited natural language processing feature extraction capabilities: bag-of-words, TF-IDF (term frequency-inverse document frequency), and preprocessing (stop-word removal, custom preprocessing, tokenization).
Moreover, if you want to quickly run different benchmarks on small (toy) datasets, its own datasets module provides a set of common, useful datasets. You can also build your own small datasets from them, so you can test whether your model behaves as expected before applying it to the real world. For parameter optimization and tuning, it provides both grid search and random search.
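Putting those pieces together, a minimal sketch of the workflow might look like this (the hyper-parameter grid below is illustrative, not a recommendation):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Load one of the bundled toy datasets
X, y = load_iris(return_X_y=True)

# The consistent estimator API: every model exposes fit / predict / score
clf = SVC()
scores = cross_val_score(clf, X, y, cv=5)

# Grid search over hyper-parameters, with cross-validation built in
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```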
None of this would be possible without strong community support and active maintenance. We look forward to its first stable release.
Statsmodels
Statsmodels is another powerful library focused on statistical models and is used mainly for predictive and exploratory analysis. If you want to fit a linear model, run statistical analyses, or do predictive modeling, Statsmodels is ideal. It provides a fairly comprehensive set of statistical tests covering most validation tasks.
If you are an R or S user, it also offers R-style formula syntax for some statistical models. Its models accept NumPy arrays and Pandas data frames directly, making intermediate data structures a thing of the past!
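For instance, a minimal sketch of the R-style formula API on a Pandas data frame (the simulated data is made up for illustration, assuming statsmodels is installed):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate data with a known linear relationship: y = 2x + 1 + noise
rng = np.random.default_rng(42)
df = pd.DataFrame({"x": np.linspace(0, 10, 50)})
df["y"] = 2.0 * df["x"] + 1.0 + rng.normal(scale=0.1, size=50)

# R-style formula on a pandas DataFrame -- no intermediate structures
model = smf.ols("y ~ x", data=df).fit()
print(model.params["Intercept"], model.params["x"])
```

Calling `model.summary()` then gives the kind of full statistical report R users expect.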
PyMC
PyMC is the go-to tool for Bayesian inference. It includes Bayesian models, statistical distributions, and model convergence diagnostic tools, as well as a number of hierarchical models. If you want to do Bayesian analysis, you should take a look at it.
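To give a flavor of what Bayesian analysis means (PyMC automates this kind of computation for far more complex models), here is a hand-rolled conjugate update in plain Python, not PyMC code: a Beta prior over a coin's bias updated with observed flips.

```python
# Beta-Binomial conjugate update: Beta(a, b) prior, observe heads/tails
def posterior(a_prior, b_prior, heads, tails):
    """Return the Beta posterior parameters and the posterior mean."""
    a, b = a_prior + heads, b_prior + tails
    return a, b, a / (a + b)

# Uniform prior Beta(1, 1); observe 7 heads out of 10 flips
a, b, mean = posterior(1, 1, heads=7, tails=3)
print(a, b, mean)  # the posterior mean shrinks 0.7 toward the prior
```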
Shogun
Shogun is a machine learning toolkit focused on support vector machines (SVM), written in C++. It is actively developed and maintained, provides a Python interface, and that interface is well documented. However, compared with scikit-learn, we found its API harder to use. Also, fewer diagnostic and evaluation algorithms are available out of the box. Its speed, however, is a big advantage.
Gensim
Gensim describes itself as "topic modeling for humans". Its home page highlights its focus on latent Dirichlet allocation (LDA) and its variants. Unlike other packages, it targets natural language processing, which makes it easy to combine NLP with other machine learning algorithms.
If your domain is NLP and you want to do clustering and basic classification, take a look. It recently introduced word2vec, Google's neural-network-based text representation. This library is written purely in Python.
Orange
Orange is the only library in this list with a graphical user interface (GUI). It is quite comprehensive, offering cross-validation methods plus classification, clustering, and feature selection methods. In some respects it is better than scikit-learn (classification methods, some preprocessing capabilities), but it does not fit as well with the rest of the scientific computing ecosystem (NumPy, SciPy, Matplotlib, Pandas) as scikit-learn does.
Having a GUI, however, is an important advantage. You can visualize the results of cross-validation, models, and feature selection methods (some features require installing Graphviz). For most algorithms, Orange has its own data structures, so you need to wrap your data into Orange-compatible structures, which makes the learning curve steeper.
PyMVPA
PyMVPA is another statistical learning library whose API closely resembles scikit-learn's. It includes cross-validation and diagnostic tools, but is not as comprehensive as scikit-learn.
Deep learning
Although deep learning is a subfield of machine learning, we give it its own section here because it has recently attracted a lot of attention, thanks in part to hiring by Google and Facebook.
Theano
Theano is the most mature deep learning library. It provides a good data structure (the tensor) for representing the layers of a neural network, which is efficient for linear algebra and similar to NumPy arrays. Be aware that its API may not feel very intuitive, so the learning curve is steep. Many Theano-based libraries build on its data structures. It also supports GPU programming out of the box.
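Theano builds symbolic expressions over tensors, but the computation a single logistic (sigmoid) layer performs can be sketched with plain NumPy as a rough analogue (this is not Theano code):

```python
import numpy as np

def logistic_layer(X, W, b):
    """One dense layer with a sigmoid non-linearity: sigma(X @ W + b)."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # 4 samples, 3 input features
W = rng.normal(size=(3, 2))   # 3 inputs -> 2 hidden units
b = np.zeros(2)

out = logistic_layer(X, W, b)
print(out.shape)  # one activation per sample per hidden unit
```

In Theano, the same expression would be written once symbolically and compiled to run on the CPU or GPU.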
Pylearn2
Pylearn2 is another Theano-based library. It adds modularity and configurability to Theano: you can create neural networks through configuration files, which makes it easier to experiment with different parameters. Arguably, separating a neural network's parameters and properties into a configuration file makes it more modular.
DeCAF
DeCAF is a deep learning library recently released by UC Berkeley. Tested in the ImageNet classification challenge, its neural network implementation is state of the art.
nolearn
If you want to use the excellent scikit-learn API for deep learning, nolearn, which wraps DeCAF, makes it easier to use. It is a (mostly) scikit-learn-compatible wrapper around DeCAF, which makes DeCAF even better.
OverFeat
OverFeat is the recent winner of the Dogs vs. Cats Kaggle competition. It is written in C++ and includes a Python wrapper (along with MATLAB and Lua wrappers). It uses the GPU through the Torch library, so it is fast. It also won the detection and localization tasks of the ImageNet challenge. If you work in computer vision, you may want to take a look.
Hebel
Hebel is another neural network library with out-of-the-box GPU support. You can use YAML files (similar to Pylearn2) to define a neural network's properties, providing a friendly way to separate the network from the code and to run models quickly. Because it is under heavy development, documentation is scarce in both depth and breadth. It is also limited in the neural network models it offers, since only one model, the feed-forward network, is supported.
However, it is written in pure Python and promises to be a very friendly library, since it contains many useful functions, such as schedulers and monitors, that we did not find in other libraries.
Neurolab
Neurolab is another neural network library with a friendly, MATLAB-like API. Unlike other libraries, it contains implementations of several variants of recurrent neural networks (RNN). If you want to use RNNs, this library is one of the best choices, thanks to its consistent API.
Integration with other languages
Don't know Python but good at another language? Don't despair! One of Python's strengths (among others) is that it is a perfect glue language: you can access these libraries from your favorite programming language through Python. The following packages can be used to combine other languages with Python:
R - RPython
MATLAB - matpython
Java - Jython
Lua - Lunatic Python
Julia - PyCall.jl
Inactive libraries
These libraries have not released any updates for more than a year. We list them because you might still find them useful, but they are unlikely to receive bug fixes, let alone new features.
MDP
MLPy
FFnet
PyBrain