"Reprint" Python's weapon spectrum in big data analysis and machine learning

Source: Internet
Author: User
Tags: svm, theano, nltk



Flask: a lightweight web framework for Python.


1. Web Crawler toolset


Scrapy

Recommended reading: an early article by the well-known blogger pluskid, "Easily customize a web crawler with Scrapy".

Beautiful Soup Objectively speaking, Beautiful Soup is not a complete crawler toolkit by itself; it needs to be paired with urllib for fetching. Rather, it is a set of tools for parsing, cleaning, and extracting data from HTML/XML.
Python-goose Goose was originally written in Java and later rewritten in Scala, as a Scala project. Python-goose is a Python rewrite that depends on Beautiful Soup. I used it a while back and it felt very good: given an article URL, it extracts the article's title and body text very conveniently.
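As a hedged sketch of Beautiful Soup's parse-and-extract workflow: the HTML snippet below is made up, and in a real crawler it would come from urllib rather than a string literal.

```python
from bs4 import BeautifulSoup

# A small in-memory HTML snippet standing in for a fetched page; in a real
# crawler it would come from urllib.request.urlopen(url).read().
html = """
<html><head><title>Example Article</title></head>
<body>
  <h1 class="headline">Hello, crawlers</h1>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
  <a href="https://example.com/next">next page</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

title = soup.title.string                           # text of the <title> tag
headline = soup.find("h1", class_="headline").get_text()
paragraphs = [p.get_text() for p in soup.find_all("p")]
links = [a["href"] for a in soup.find_all("a")]     # hrefs to crawl next
```

The same find/find_all calls work regardless of where the HTML came from, which is why Beautiful Soup pairs so naturally with a separate fetching layer.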


2. Text Processing


NLTK - Natural Language Toolkit

Two recommended books:

1. The official book "Natural Language Processing with Python" introduces NLTK's main features and usage, interleaved with some Python background. Chen Tao, a student in China, kindly translated a Chinese version, available here: recommended "Natural Language Processing with Python" Chinese translation - the NLTK companion book;

2. "Python Text Processing with NLTK 2.0 Cookbook" goes deeper: it covers NLTK's code structure and also shows how to build custom corpora and models. Quite good.

Pattern Produced by the CLiPS laboratory at the University of Antwerp in Belgium. Objectively speaking, Pattern is not just a text-processing toolkit but a web data-mining suite: it includes web modules (Google, Twitter, and Wikipedia APIs, plus crawlers and HTML parsers), text-processing modules (part-of-speech tagging, sentiment analysis, etc.), machine-learning modules (VSM, clustering, SVM), and visualization modules. In fact, Pattern's overall organization mirrors the organization of this article, but here we file it under text processing. Personally I mainly use its English-processing module pattern.en, which offers many very good text-processing functions, including basic tokenization, part-of-speech tagging, sentence segmentation, grammar checking, spelling correction, sentiment analysis, and syntactic parsing. Quite good.
TextBlob TextBlob is an interesting Python text-processing toolkit that is actually a wrapper around the two toolkits above, NLTK and Pattern ("TextBlob stands on the giant shoulders of NLTK and Pattern, and plays nicely with both"). It provides many text-processing interfaces, including POS tagging, noun-phrase extraction, sentiment analysis, text classification, and spell checking, and even translation and language detection; the latter are based on Google's API, which limits the number of calls. TextBlob is relatively young; interested readers may want to keep an eye on it.
MBSP for Python MBSP shares its origins with Pattern, coming from the same CLiPS laboratory at the University of Antwerp. It provides basic text-processing functions such as tokenization, sentence segmentation, POS tagging, chunking, lemmatization, and syntactic parsing. Worth watching for interested readers.
Gensim Gensim is a thoroughly professional topic-modeling toolkit for Python, excellent in both code and documentation. We already covered installing and using Gensim in "How to calculate the similarity of two documents", so there is not much more to say here.
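To make the document-similarity idea concrete, here is a toy, standard-library-only sketch of the bag-of-words TF-IDF and cosine-similarity computation that Gensim implements at scale (the three example sentences are made up):

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """One sparse tf-idf vector (term -> weight dict) per tokenized doc."""
    n = len(docs)
    df = Counter()                          # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)                   # raw term frequency
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values()))
    norm *= math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

docs = ["the cat sat on the mat".split(),
        "the cat lay on the mat".split(),
        "stock markets fell sharply today".split()]
vecs = tf_idf_vectors(docs)
```

With these made-up sentences, the two cat sentences score much closer to each other than either does to the stock-market sentence; Gensim adds streaming corpora, LSI/LDA models, and efficient similarity indexes on top of this basic idea.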
langid.py Language detection is a very interesting but relatively mature topic; there are many solutions and many good open-source toolkits. For Python I have used langid and am very happy to recommend it. langid currently supports detection of 97 languages and offers many easy-to-use features, including launching a server, calling its API via JSON, and training a custom language-detection model. It can fairly be called small but complete.
Jieba: "stuttering" Chinese word segmentation Finally, a domestic Python text-processing toolkit: jieba. Its features include three segmentation modes (accurate mode, full mode, and search-engine mode), support for traditional Chinese, support for custom dictionaries, and more. It is an excellent Chinese word-segmentation solution for Python.
Xtas Colleagues on our team recently released xtas, another Python-based text-mining toolkit; you are welcome to try it. Link: http://t.cn/RPbEZOW. It looks promising; I will give it a try when I get the chance.



3. Python Scientific Computing Toolkit



Numpy, Scipy, Matplotlib, IPython
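These four form the foundation of the Python scientific stack: NumPy for arrays, SciPy for numerical algorithms, matplotlib for plotting, and IPython for interactive work. A minimal sketch of the NumPy idioms everything below builds on (the numbers are arbitrary examples):

```python
import numpy as np

# Vectorized arithmetic: no Python-level loops.
x = np.linspace(0.0, 1.0, 5)          # 5 evenly spaced points in [0, 1]
y = x ** 2                            # squared elementwise

# Broadcasting: a (3, 1) column combines with a (4,) row into a (3, 4) grid.
grid = np.arange(3).reshape(3, 1) * np.arange(4)

# Basic linear algebra: solve A @ b = rhs for b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
rhs = np.array([9.0, 8.0])
b = np.linalg.solve(A, rhs)
```

SciPy layers optimization, integration, statistics, and sparse matrices on top of these arrays, and matplotlib plots them.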


4. Python machine learning and Data Mining toolkit


Scikit-learn The famous scikit-learn: an open-source machine learning toolkit built on NumPy, SciPy, and matplotlib. It mainly covers classification, regression, and clustering algorithms, such as SVMs, logistic regression, naive Bayes, random forests, and k-means; the code and documentation are both very good, and it is used in many Python projects. For example, in the familiar NLTK, the classifier module has a dedicated interface for scikit-learn that can call scikit-learn's classification algorithms and train classifier models on your data. Also recommended is a video I shared when I first encountered scikit-learn: "A Python machine learning toolkit: scikit-learn and related videos - Tutorial: scikit-learn - Machine Learning in Python". Official homepage: http://scikit-learn.org/
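A minimal sketch of scikit-learn's fit/score interface, using the bundled iris dataset and one of the classifiers mentioned above (logistic regression); the split ratio and random seed are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the bundled iris dataset and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Every scikit-learn estimator follows the same fit/predict/score pattern.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)   # mean accuracy on held-out data
```

Swapping in an SVM or random forest means changing only the estimator line; this uniform interface is a large part of scikit-learn's appeal.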
Pandas Pandas is also built on NumPy and matplotlib, and is mainly used for data analysis and data visualization. Its DataFrame structure closely resembles R's data.frame, and it has a particularly strong built-in mechanism for analyzing time-series data. Very good. A book recommendation: "Python for Data Analysis", written by the main developer of pandas, covers IPython, NumPy, and pandas in turn, along with data visualization, data cleaning and processing, and time-series handling; the examples include mining financial stock data. Quite good. Official homepage: http://pandas.pydata.org/
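A small sketch of the DataFrame and time-series machinery described above, using a made-up six-day price series:

```python
import pandas as pd

# A tiny invented daily "price" series indexed by a DatetimeIndex.
idx = pd.date_range("2024-01-01", periods=6, freq="D")
df = pd.DataFrame({"close": [10.0, 10.5, 10.2, 11.0, 11.4, 11.1]}, index=idx)

# Time-series conveniences: day-over-day change and a rolling window.
df["return"] = df["close"].pct_change()        # NaN for the first day
df["ma3"] = df["close"].rolling(3).mean()      # 3-day moving average

# Resampling downsamples the daily series to weekly means.
weekly_mean = df["close"].resample("W").mean()
```

These few lines (datetime index, rolling windows, resampling) are the core of the stock-data examples in "Python for Data Analysis".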
Mlpy Official homepage: http://mlpy.sourceforge.net/
Mdp "MDP, the Modular Toolkit for Data Processing, is a Python data processing framework. From the user's point of view, MDP is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures; computation is carried out efficiently with respect to speed and memory requirements. From the scientific developer's point of view, MDP is a modular framework that is easy to extend: implementing a new algorithm is easy and intuitive, and a newly implemented unit is automatically integrated with the rest of the library's components. MDP was written in the context of neuroscience research, but it is designed to be useful wherever trainable data processing algorithms apply. Its user-side simplicity, variety of readily available algorithms, and reusable units also make it a useful teaching tool." Official homepage: http://mdp-toolkit.sourceforge.net/
Pybrain "PyBrain (Python-Based Reinforcement Learning, Artificial Intelligence and Neural Networks) is a machine learning module for Python whose goal is to offer flexible, easy-to-use, yet powerful algorithms for machine learning tasks. (The backronym is rather imposing.) As the name suggests, PyBrain covers neural networks, reinforcement learning (and the combination of the two), unsupervised learning, and evolutionary algorithms. Because many current problems involve continuous state and action spaces, function approximators (such as neural networks) must be used to cope with high-dimensional data; PyBrain takes the neural network as its core, and all of its training methods operate on neural networks." Official homepage: http://www.pybrain.org/
Pyml "PyML is a Python machine learning toolkit that provides a flexible architecture for classification and regression methods. It mainly offers feature selection, model selection, classifier combination, and classification evaluation."
Milk

Machine Learning Toolkit in Python.

"Milk is a machine learning toolkit for Python that focuses on supervised classification, with several effective classifiers: SVMs (based on libsvm), k-NN, random forests, and decision trees. It also supports feature selection. These classifiers can be combined in many ways to form different classification systems. For unsupervised learning, it provides k-means and affinity propagation clustering."

Official homepage: http://luispedro.org/software/milk
Pymvpa

Multivariate Pattern Analysis (MVPA) in Python

PyMVPA (Multivariate Pattern Analysis in Python) is a Python toolkit that provides statistical learning analyses for large datasets within a flexible and extensible framework. It offers classification, regression, feature selection, data import and export, visualization, and more. Official homepage: http://www.pymvpa.org/
Pyrallel

Parallel Data Analytics in Python

"Pyrallel (Parallel Data Analytics in Python) is an experimental project exploring distributed computing patterns for machine learning and semi-interactive analysis; it can run on a small cluster." GitHub code page: http://github.com/pydata/pyrallel
Monte

Gradient based Learning in Python

"Monte (gradient-based learning in Python) is a pure-Python machine learning library. It can quickly build neural networks, conditional random fields, and logistic regression models; it uses inline C for optimization and is easy to use and extend." Official homepage: http://montepython.sourceforge.net
Theano Theano is a Python library that defines, optimizes, and evaluates mathematical expressions, especially for efficient computation over multi-dimensional arrays. Theano's features: tight integration with NumPy, efficient data-intensive GPU computing, efficient symbolic differentiation, speed and stability optimizations, dynamic C code generation, and extensive unit testing and self-verification. Theano has been widely used in scientific computation since 2007. Theano makes it easier to build deep learning models and to implement many model variants quickly. PS: Theano was a Greek beauty, the daughter of Milon of Croton, who later became Pythagoras's wife.
Pylearn2 "Pylearn2 is built on Theano and partly depends on scikit-learn. Pylearn2 is still under development; it aims to handle vectors, images, video, and other data, and provides deep learning models such as MLPs, RBMs, and SDAs." Official homepage: http://deeplearning.net/software/pylearn2/