Python is widely used in scientific computing: computer vision, artificial intelligence, mathematics, astronomy, and more. Unsurprisingly, it is also widely used for machine learning.
This article lists and describes the most useful machine learning tools and libraries for Python. A library does not need to be written in Python to make the list, as long as it has a Python interface.
Our goal is not to list every machine learning library available for Python (a search for "machine learning" on the Python Package Index (PyPI) returns 139 results), but rather to list the ones we know to be useful and well maintained.
In addition, although some modules can be used for a variety of machine learning tasks, we only list libraries whose main focus is machine learning. For example, although SciPy contains some clustering algorithms, its main focus is not machine learning but a comprehensive set of scientific computing tools. So we excluded SciPy (although we use it too!).
Another thing worth mentioning is that we also evaluate these libraries on how well they integrate with other scientific computing libraries, because machine learning (supervised or unsupervised) is just one part of a data processing pipeline. If the library you use does not fit with the other libraries in the pipeline, you will spend a lot of time building intermediate layers between them. It is important to have a great library in your toolset, but it is just as important that the library integrates well with the others.
If you are proficient in another language but want to use Python packages, we also briefly describe how to integrate that language with Python so that you can use the libraries listed in this article.
Scikit-Learn
Scikit-learn is the machine learning tool we use at CB Insights. We use it for classification, feature selection, feature extraction, and clustering.
What we like most about it is its consistent, easy-to-use API and the many evaluation, diagnostic, and cross-validation methods it provides out of the box (sound familiar? Python takes a batteries-included approach as well). The icing on the cake is that it uses SciPy data structures under the hood, so it fits in well with the rest of the Python scientific computing stack: SciPy, NumPy, pandas, and Matplotlib.
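As a rough illustration of the consistent API and the built-in cross-validation helpers, here is a minimal sketch. The synthetic dataset and the two classifiers are our own choices for the example, not something the libraries above prescribe.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data so the example is self-contained (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Every estimator exposes the same fit/predict interface, so swapping
# models only changes the constructor call.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    # 5-fold cross-validation returns one accuracy score per fold.
    fold_scores = cross_val_score(model, X, y, cv=5)
    scores[name] = fold_scores.mean()
    print(f"{name}: mean accuracy = {scores[name]:.3f}")
```

Because both models go through the same `cross_val_score` call, comparing a further estimator should only require adding one more entry to the dictionary.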
If you want to visualize the performance of a classifier (for example, with a precision-recall curve or a receiver operating characteristic (ROC) curve), Matplotlib lets you produce the plot quickly.
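A minimal sketch of such a plot, assuming a synthetic dataset and a logistic regression classifier chosen only for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption for this sketch).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Probability of the positive class for each test sample.
y_score = clf.predict_proba(X_test)[:, 1]

# roc_curve returns plain NumPy arrays that Matplotlib can plot directly.
fpr, tpr, _ = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)

fig, ax = plt.subplots()
ax.plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc:.2f})")
ax.plot([0, 1], [0, 1], linestyle="--", label="chance")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.legend(loc="lower right")
# Call fig.savefig(...) or plt.show() to render the figure.
```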
Considering how much time is spent cleaning and structuring data, this tight integration with other scientific computing packages makes the library very convenient to use.
In addition, it offers limited feature extraction capabilities for natural language processing: bag of words, tf-idf (term frequency-inverse document frequency), and preprocessing (stop words, custom preprocessing, analyzers).
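The text feature extraction mentioned above might look roughly like this. The three toy documents are invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical documents, only for illustration).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

# Bag of words, dropping words from the built-in English stop-word list.
bow = CountVectorizer(stop_words="english")
counts = bow.fit_transform(docs)
print(sorted(bow.vocabulary_))

# tf-idf weighting over the same tokens; the result is a sparse matrix
# with one row per document and one column per vocabulary term.
tfidf = TfidfVectorizer(stop_words="english")
weights = tfidf.fit_transform(docs)
print(weights.shape)
```

Both vectorizers also accept `preprocessor` and `analyzer` arguments for custom preprocessing.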
In addition, if you want to quickly benchmark different approaches on small (toy) datasets, its datasets module provides common and useful ones. You can also build your own toy datasets from them, so that you can check for your own purposes whether a model behaves as expected before applying it to real-world data. For parameter optimization and tuning, it provides grid search and random search.
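As a sketch of how a toy dataset and the two search strategies fit together, the following tunes a support vector classifier on the bundled iris data. The parameter grid is an arbitrary choice for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

# One of the toy datasets bundled with the library.
X, y = load_iris(return_X_y=True)

# Candidate hyperparameters (arbitrary values chosen for this sketch).
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Grid search tries all 6 combinations with 5-fold cross-validation.
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X, y)
print("grid search:", grid.best_params_, round(grid.best_score_, 3))

# Random search samples only n_iter of the combinations.
random_search = RandomizedSearchCV(
    SVC(), param_grid, n_iter=4, cv=5, random_state=0
)
random_search.fit(X, y)
print("random search:", random_search.best_params_)
```

Random search tends to be the cheaper option when the grid is large, at the cost of possibly missing the best combination.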
None of these features would be possible without strong community support and good maintenance. We look forward to its first stable release.
Statsmodels
Statsmodels is another powerful library, focused on statistical models and used mainly for predictive and exploratory analysis. If you want to fit linear models, perform statistical analysis, or do some predictive modeling, Statsmodels is a great fit. The statistical tests it provides are quite comprehensive and cover validation tasks for most cases.
If you are an R or S user, it also accepts R-style syntax for some of its statistical models. Its models also accept NumPy arrays and pandas data frames, making intermediate data structures a thing of the past!
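A minimal sketch of the R-style formula interface applied to a pandas data frame. The data is synthetic, and the column names `x` and `y` are made up for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y is roughly 2*x + 1 plus noise (an assumption for this sketch).
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": np.linspace(0, 10, 50)})
df["y"] = 2.0 * df["x"] + 1.0 + rng.normal(scale=0.5, size=len(df))

# The formula string follows R conventions; the data frame is passed as is.
results = smf.ols("y ~ x", data=df).fit()
print(results.params)
print("R-squared:", round(results.rsquared, 3))
```

`results.summary()` prints a fuller report with standard errors and test statistics.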
PyMC
PyMC is a tool for Bayesian analysis. It includes Bayesian models, statistical distributions, and diagnostic tools for model convergence, as well as some hierarchical models. If you want to do Bayesian analysis, you should take a look.
Shogun
Shogun is a machine learning toolbox focused on support vector machines (SVMs) and written in C++. It is under active development and maintenance. It provides a Python interface, which is also its best-documented interface. However, compared with Scikit-learn, we find its API relatively difficult to use. It also does not provide many diagnostic and evaluation algorithms out of the box. Speed, however, is a great advantage.
Gensim
Gensim describes itself as "topic modeling for humans". As stated on its homepage, its focus is on Latent Dirichlet Allocation (LDA) and its variants. Unlike other packages, it supports natural language processing, which makes it easier to combine NLP with other machine learning algorithms.
If your domain is NLP and you want to do clustering and basic classification, you may want to take a look. It has also recently added word2vec, Google's neural-network-based text representation. The library is written purely in Python.
Orange
Orange is the only library listed in this article that has a graphical user interface (GUI). It is quite comprehensive in terms of classification, clustering, and feature selection methods, and it has some cross-validation methods as well. In some respects it is better than Scikit-learn (classification methods, some preprocessing capabilities), but it does not fit in with the rest of the scientific computing ecosystem (NumPy, SciPy, Matplotlib, pandas) as well as Scikit-learn does.
Having a GUI, however, is an important advantage. You can visualize cross-validation results, models, and feature selection methods (some features require Graphviz to be installed). Orange has its own data structures for most algorithms, so you need to wrap your data in Orange-compatible data structures, which makes the learning curve steeper.
PyMVPA
PyMVPA is another statistical learning library. Its API is similar to Scikit-learn's. It also includes cross-validation and diagnostic tools, but they are not as comprehensive as Scikit-learn's.
Deep learning
Although deep learning is a subfield of machine learning, we created a separate section for it here because it has recently attracted a lot of attention from the recruiting departments of Google and Facebook.
Theano
Theano is the most mature deep learning library. It provides a good data structure (the tensor) for representing the layers of a neural network; tensors are efficient for linear algebra and similar to NumPy arrays. Note that its API may not be very intuitive, which steepens the learning curve for users. Many libraries are built on top of Theano and use its data structures. It also supports GPU programming out of the box.
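To make the idea of a layer as a tensor operation concrete, here is a small sketch written with plain NumPy arrays. It is not Theano code and does not use Theano's API; the shapes, the `dense_layer` helper, and the tanh activation are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A batch of 4 samples with 3 input features each.
batch = rng.normal(size=(4, 3))
# Layer parameters: 3 inputs feeding 2 hidden units.
weights = rng.normal(size=(3, 2))
bias = np.zeros(2)


def dense_layer(x, w, b):
    """Affine transform followed by a tanh nonlinearity (hypothetical helper)."""
    return np.tanh(x @ w + b)


hidden = dense_layer(batch, weights, bias)
print(hidden.shape)
```

The whole layer is a single matrix product plus an elementwise function, which is why array-like tensors are a natural representation for it.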
Pylearn2
Pylearn2 is another Theano-based library that brings modularity and configurability to Theano. You can create a neural network through different configuration files, which makes it easier to experiment with different parameters. Arguably, separating the parameters and properties of the neural network into a configuration file makes it more modular.
Decaf
Decaf is a deep learning library recently released by UC Berkeley. Its neural network implementation was state of the art in the ImageNet classification challenge.
Nolearn
If you want to use the excellent Scikit-learn API for deep learning as well, Nolearn wraps Decaf to make that easier for you. It is a wrapper around Decaf that is (mostly) compatible with Scikit-learn, which makes Decaf even more awesome.
OverFeat
OverFeat is the recent winner of the Dogs vs. Cats Kaggle challenge. It is written in C++ and also comes with a Python wrapper (as well as Matlab and Lua wrappers). It uses the GPU through the Torch library, so it is quite fast. It also won the detection and localization challenges of the ImageNet classification competition. If your field is computer vision, you may want to take a look.
Hebel
Hebel is another neural network library with GPU support out of the box. You can specify the properties of a neural network in a YAML file (similar to Pylearn2), which provides a nice way to separate the neural network from the code and to run models quickly. Because it has only been under development for a short time, its documentation lacks depth and breadth. It is also limited in terms of models, as it supports only one type of neural network (feed-forward).
However, it is written in pure Python and is a friendly library, because it contains many utility functions, such as schedulers and monitors, that we did not find in other libraries.
Neurolab
NeuroLab is another neural network library with a friendly API (similar to Matlab's). Unlike other libraries, it implements several variants of recurrent neural networks (RNNs). If you want to use an RNN, this library is one of the best choices thanks to its simple API.
Integration with other languages
Do you not know Python but are very good at another language? Do not despair! One of Python's strengths (among others) is that it is a perfect glue language: you can access these libraries from the programming language you normally use. The following packages can be used to combine other languages with Python:
R -> RPython
Matlab -> matpython
Java -> Jython
Lua -> Lunatic Python
Julia -> PyCall.jl
Inactive libraries
These libraries have not released any updates for more than a year. We list them because they may still be useful, but they are unlikely to receive bug fixes, let alone enhancements, in the future.
MDP
MlPy
FFnet
PyBrain