[Resource] Python for Web Crawling & Text Processing & Scientific Computing & Machine Learning & Data Mining: An Arsenal of Tools

Source: Internet
Author: User
Tags: svm, python, web crawler, theano, nltk



Reference: http://www.52nlp.cn/python-%e7%bd%91%e9%a1%b5%e7%88%ac%e8%99%ab-%e6%96%87%e6%9c%ac%e5%a4%84%e7%90%86-%e7%a7%91%e5%ad%a6%e8%ae%a1%e7%ae%97-%e6%9c%ba%e5%99%a8%e5%ad%a6%e4%b9%a0-%e6%95%b0%e6%8d%ae%e6%8c%96%e6%8e%98






I. Python Web Crawler Toolset



A real project must start with getting the data. Whether for text processing, machine learning, or data mining, you need data first. Apart from buying or downloading professional data through certain channels, you often have to crawl the data yourself, and this is where crawlers become especially important. Fortunately, Python provides many good web crawler tools and frameworks, which can not only crawl data but also help obtain and clean it. We start here:



1. Scrapy


Scrapy, a fast high-level screen scraping and web crawling framework for Python.


The famous Scrapy: I believe many students have heard of it. Many of the courses on Course Map were grabbed with Scrapy, and there are plenty of introductory articles about it; recommended is an early post by the expert pluskid, "Scrapy: Easily Customize a Web Crawler", which has stayed fresh over the years. A minimal spider sketch follows.
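As a quick taste, here is a minimal spider sketch (assuming a reasonably recent Scrapy version; quotes.toscrape.com is just a demo target, and the selectors match its markup):

    import scrapy

    class QuotesSpider(scrapy.Spider):
        # run with: scrapy runspider quotes_spider.py -o quotes.json
        name = "quotes"
        start_urls = ["http://quotes.toscrape.com/"]

        def parse(self, response):
            # CSS selectors pull the fields we want out of each quote block
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
            # follow the pagination link, if any, and parse it the same way
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)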



Official homepage: http://scrapy.org/
GitHub code page: https://github.com/scrapy/scrapy



2. Beautiful Soup


You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects.


I learned of Beautiful Soup while reading the book "Programming Collective Intelligence", and have used it from time to time since; it is a very good set of tools. Objectively speaking, Beautiful Soup is not entirely a crawler toolkit: it needs to be paired with urllib, being a set of tools for parsing, cleaning, and extracting HTML/XML data.
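A minimal sketch of the usual urllib + Beautiful Soup pairing (Python 3 module names; the URL is a placeholder):

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    # download the page with urllib, then hand the raw HTML to Beautiful Soup
    html = urlopen("http://example.com/").read()
    soup = BeautifulSoup(html, "html.parser")

    print(soup.title.string)            # text of the <title> element
    for link in soup.find_all("a"):     # every hyperlink on the page
        print(link.get("href"))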



Official homepage: http://www.crummy.com/software/BeautifulSoup/



3. Python-goose


Html content/article Extractor, web scrapping Lib in Python


Goose was originally written in Java and later rewritten in Scala, making it a Scala project. Python-goose is a Python rewrite that depends on Beautiful Soup. I used it a while back and it felt very good: given an article URL, getting the article's title and body text is very convenient.
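A small sketch of that convenience, following python-goose's documented usage (a Python 2 era library; the URL is a placeholder):

    from goose import Goose

    g = Goose()
    # extract() downloads the page and strips the boilerplate,
    # keeping the article title and body text
    article = g.extract(url="http://example.com/some-news-story.html")
    print(article.title)
    print(article.cleaned_text[:200])   # first 200 characters of the body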



GitHub Home: https://github.com/grangier/python-goose



II. Python Text Processing Toolset



After obtaining text data from the web, different tasks call for different basic text processing. In English, for example, you need basic tokenization; for Chinese, you need the usual Chinese word segmentation. Going further, in both English and Chinese you can do part-of-speech tagging, syntactic parsing, keyword extraction, text classification, sentiment analysis, and so on. In this area, especially for English, there are many excellent toolkits, which we introduce one by one.



1. NLTK – Natural Language Toolkit


NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and an active discussion forum.


Students of natural language processing will all know NLTK; there is not much to say here. However, two books are recommended for those who have just encountered NLTK or want to know it better. One is the official "Natural Language Processing with Python", which introduces NLTK's functionality along with some Python knowledge; Chen Tao has kindly translated a Chinese version, see: the recommendation of the Chinese translation of "Natural Language Processing with Python", NLTK's companion book. The other is "Python Text Processing with NLTK 2.0 Cookbook", which goes deeper, touching on NLTK's code structure and showing how to customize your own corpora and models; quite good.
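A small sketch of the basics (tokenization and POS tagging), assuming a recent NLTK and that the required data packages have been downloaded:

    import nltk

    # one-time downloads of the models used below
    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")

    text = "NLTK is a leading platform for building Python programs."
    tokens = nltk.word_tokenize(text)   # basic English tokenization
    tagged = nltk.pos_tag(tokens)       # part-of-speech tagging
    print(tagged[:4])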



Official homepage: http://www.nltk.org/
GitHub code page: https://github.com/nltk/nltk



2. Pattern


Pattern is a web mining module for the Python programming language.

It has tools for data mining (Google, Twitter and Wikipedia API, a web crawler, a HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis and canvas visualization.


Pattern, produced by the CLiPS laboratory at the University of Antwerp in Belgium, is objectively speaking not just a text processing toolkit but a whole web data mining suite. It includes data crawling modules (Google, Twitter, and Wikipedia APIs, plus a crawler and an HTML parser), text processing modules (part-of-speech tagging, sentiment analysis, etc.), machine learning modules (VSM, clustering, SVM), and visualization modules. In fact, Pattern's overall organization is also the organizing logic of this article, but here we place Pattern in the text processing section. I mainly use its English processing module pattern.en, which offers many very good text processing functions, including basic tokenization, part-of-speech tagging, sentence segmentation, grammar checking, spelling correction, sentiment analysis, syntactic parsing, and so on; quite good.
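A minimal sketch of pattern.en (note that Pattern is a Python 2 era library; the sample sentence is arbitrary):

    from pattern.en import tag, sentiment, parse

    s = "The movie attempts to be surreal by incorporating various time paradoxes."
    print(tag(s))        # list of (word, part-of-speech) pairs
    print(sentiment(s))  # (polarity, subjectivity) scores
    print(parse(s))      # tokenized, tagged and chunked output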



Official homepage: http://www.clips.ua.ac.be/pattern



3. TextBlob: Simplified Text Processing


TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.


TextBlob is an interesting Python text processing toolkit. It is actually a wrapper around the two toolkits above, NLTK and Pattern ("TextBlob stands on the giant shoulders of NLTK and Pattern, and plays nicely with both"), and provides many text processing interfaces, including POS tagging, noun phrase extraction, sentiment analysis, text classification, and spell checking, and even translation and language detection; the latter, however, are based on Google's API, with a limit on the number of calls. TextBlob is relatively young; interested students can keep an eye on it.
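A small sketch (assuming the corpora have been fetched once with "python -m textblob.download_corpora"):

    from textblob import TextBlob

    blob = TextBlob("TextBlob sits on the shoulders of NLTK and Pattern. "
                    "It makes simple NLP tasks very pleasant.")
    print(blob.tags)          # part-of-speech tags
    print(blob.noun_phrases)  # noun phrase extraction
    print(blob.sentiment)     # Sentiment(polarity=..., subjectivity=...)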



Official homepage: http://textblob.readthedocs.org/en/dev/
GitHub code page: https://github.com/sloria/textblob



4. MBSP for Python


MBSP is a text analysis system based on the TIMBL and MBT memory based learning applications developed at CLiPS and ILK. It provides tools for tokenization and sentence splitting, part of speech tagging, chunking, lemmatization, relation finding and prepositional phrase attachment.


MBSP shares its origins with Pattern, coming from the same CLiPS laboratory at the University of Antwerp in Belgium. It provides basic text processing functions such as tokenization, sentence segmentation, POS tagging, chunking, lemmatization, and syntactic parsing; interested students can take a look.



Official homepage: http://www.clips.ua.ac.be/pages/MBSP



5. Gensim: Topic Modeling for Humans



Gensim is a fairly professional topic-modeling Python toolkit; both its code and its documentation are excellent. We introduced Gensim's installation and use in "How to Compute the Similarity of Two Documents", so there is not much more to say here.
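To give the flavor, a condensed sketch of that document-similarity recipe, using a toy three-document corpus:

    from gensim import corpora, models, similarities

    documents = [
        "human machine interface for lab computer applications",
        "a survey of user opinion of computer system response time",
        "the generation of random binary unordered trees",
    ]
    texts = [doc.lower().split() for doc in documents]

    dictionary = corpora.Dictionary(texts)                  # token -> id mapping
    corpus = [dictionary.doc2bow(text) for text in texts]   # bag-of-words vectors

    tfidf = models.TfidfModel(corpus)                       # reweight raw counts
    lsi = models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=2)

    # rank the corpus documents by similarity to a new query
    index = similarities.MatrixSimilarity(lsi[tfidf[corpus]])
    query = lsi[tfidf[dictionary.doc2bow("computer system interface".split())]]
    print(list(index[query]))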



Official homepage: http://radimrehurek.com/gensim/index.html
GitHub code page: https://github.com/piskvorky/gensim



6. langid.py: Stand-alone Language Identification System



Language detection is a very interesting topic, but a relatively mature one; there are many solutions and plenty of good open source toolkits. For Python, I have used langid and am very happy to recommend it. langid currently supports detection of 97 languages and offers many easy-to-use features, including launching a simple server and calling it through a JSON API, and training custom language identification models of your own; small as it is, it has everything it needs.
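Basic usage is a one-liner (a sketch; classify() returns a language code plus a confidence score):

    import langid

    print(langid.classify("This is an English sentence."))    # ('en', score)
    print(langid.classify("Ceci est une phrase française."))  # ('fr', score)

    # optionally restrict the set of candidate languages
    langid.set_languages(["en", "fr", "zh"])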



GitHub Home: https://github.com/saffsd/langid.py



7. Jieba: "Stuttering" Chinese Word Segmentation


"Stuttering" Chinese word segmentation: Do the best python Chinese sub-phrase "Jieba" (Chinese for "to Stutter") Chinese text segmentation:built to be the better Python chines E Word segmentation module.


Well, finally we can mention a home-grown Python text processing toolkit: the "stuttering" word segmenter jieba. Its features include support for three segmentation modes (accurate mode, full mode, search engine mode), support for traditional Chinese, support for custom dictionaries, and so on. It is currently a very good Python solution for Chinese word segmentation.
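The three modes look like this (a sketch following the README's classic example sentence, "I came to Tsinghua University in Beijing"):

    # -*- coding: utf-8 -*-
    import jieba

    sentence = u"我来到北京清华大学"

    print("/".join(jieba.cut(sentence)))                # accurate mode (default)
    print("/".join(jieba.cut(sentence, cut_all=True)))  # full mode
    print("/".join(jieba.cut_for_search(sentence)))     # search engine mode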



GitHub Home: https://github.com/fxsjy/jieba



8. Xtas


xtas, the eXtensible Text Analysis Suite, a distributed text analysis package based on Celery and Elasticsearch.


Thanks to a Weibo friend for the tip: "Our group colleagues previously released xtas, also a Python-based text mining toolkit; you are welcome to use it. Link: http://t.cn/RPbEZOW". It looks good; I will try it out later.



GitHub code page: https://github.com/NLeSC/xtas



III. Python Scientific Computing Toolkit



Speaking of scientific computing, the first thing that comes to mind is MATLAB, which combines numerical computation, visualization tools, and interaction in one, but is unfortunately a commercial product. In open source, besides GNU Octave, which tries to be a MATLAB-like toolkit, several Python toolkits taken together can also replace the corresponding functionality of MATLAB: NumPy + SciPy + matplotlib + IPython. At the same time, these toolkits, especially NumPy and SciPy, are also the foundation of many Python text processing, machine learning, and data mining toolkits, and so are very important. Finally, the series "Using Python for Scientific Computing" is recommended; it covers NumPy, SciPy, and matplotlib and can serve as a reference.



1. NumPy


NumPy is the fundamental package for scientific computing with Python. It contains among other things:
1) A powerful N-dimensional array object
2) sophisticated (broadcasting) functions
3) tools for integrating C/C++ and Fortran code
4) Useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.


NumPy is an almost unavoidable scientific computing toolkit. Its most commonly used feature is probably the N-dimensional array object; it also includes a mature function library, tools for integrating C/C++ and Fortran code, and linear algebra, Fourier transform, and random number generation functions. NumPy provides two basic object types: ndarray (N-dimensional array object) and ufunc (universal function object). An ndarray is a multidimensional array that stores elements of a single data type, while a ufunc is a function that can operate on arrays.
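A small sketch of both object types plus the submodules mentioned:

    import numpy as np

    a = np.arange(12).reshape(3, 4)   # ndarray: a 3x4 array of one dtype
    print(a.dtype, a.shape)

    print(np.sin(a))       # np.sin is a ufunc, applied elementwise
    print(a.sum(axis=0))   # column sums

    # a taste of the linear algebra and random-number submodules
    b = np.random.rand(4, 3)
    print(np.dot(a, b))                   # matrix product (3x3)
    print(np.linalg.det(np.dot(b.T, b)))  # determinant of a 3x3 matrix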



Official homepage: http://www.numpy.org/



2. SciPy: Scientific Computing Tools for Python


SciPy refers to several related but distinct entities:

1) The SciPy Stack, a collection of open source software for scientific computing in Python, and particularly a specified set of core packages.
2) The community of people who use and develop this stack.
3) Several conferences dedicated to scientific computing in Python – SciPy, EuroSciPy and SciPy.in.
4) The SciPy library, one component of the SciPy stack, providing many numerical routines.


"SciPy is an open source Python algorithm library and Math Toolkit, SCIPY contains modules with optimizations, linear algebra, integrals, interpolation, special functions, fast Fourier transforms, signal processing and image processing, ordinary differential equation solving, and other commonly used calculations in science and engineering. Its functionality is similar to software matlab, Scilab, and GNU Octave. NumPy and scipy are often used in conjunction, and most of Python's machine learning libraries rely on these two modules. "--Quote from" Python Machine Learning Library "



Official homepage: http://www.scipy.org/



3. Matplotlib


Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shell (ala MATLAB® or Mathematica®), web application servers, and six graphical user interface toolkits.


Matplotlib is Python's most famous plotting library. It provides a set of command APIs similar to MATLAB's, making it well suited to interactive plotting, and it can also be conveniently embedded in GUI applications as a plotting control. Matplotlib can be used together with the IPython shell, providing a plotting experience every bit as good as MATLAB's; in short, a pleasure to use.
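A sketch of the MATLAB-like command style:

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0, 2 * np.pi, 200)
    plt.plot(x, np.sin(x), label="sin(x)")
    plt.plot(x, np.cos(x), label="cos(x)")
    plt.xlabel("x")
    plt.legend()
    plt.savefig("waves.png")   # or plt.show() in an interactive session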



Official homepage: http://matplotlib.org/



4. IPython


IPython provides a rich architecture for interactive computing with:

1) Powerful interactive shells (terminal and Qt-based).
2) A browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media.
3) support for interactive data visualization and use of GUI toolkits.
4) Flexible, embeddable interpreters to load into your own projects.
5) Easy-to-use, high-performance tools for parallel computing.


"IPython is a python interactive shell that works much better and more powerful than the default Python shell. She supports syntax highlighting, auto-completion, code debugging, object introspection, support for Bash shell commands, built-in many useful features and functions, etc., and is very easy to use. "Ipython–pylab" with this command when starting the Ipython, the Matploblib drawing interaction is turned on by default, which is convenient to use.



Official homepage: http://ipython.org/



IV. Python Machine Learning & Data Mining Toolkit



The two concepts of machine learning and data mining are not that easy to tell apart, so they are put together here. There are a lot of open source Python toolkits in this area; we start with the familiar ones and then supplement with leads from other sources.



1. scikit-learn: Machine Learning in Python


Scikit-learn (formerly scikits.learn) is an open source machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, logistic regression, naive Bayes, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.


The first recommendation is the famous scikit-learn. scikit-learn is an open source machine learning toolkit based on NumPy, SciPy, and matplotlib, mainly covering classification, regression, and clustering algorithms such as SVM, logistic regression, naive Bayes, random forests, and k-means; the code and documentation are both very good, and it is used in many Python projects. For example, in the familiar NLTK, the classifier module has an interface specifically for scikit-learn, which can call scikit-learn's classification algorithms and train models on your training data. A video is also recommended here, which I recommended early on when I first encountered scikit-learn: recommending the Python machine learning toolkit scikit-learn along with the video Tutorial: scikit-learn – Machine Learning in Python.
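A minimal sketch of the usual fit/score workflow on the built-in iris dataset (recent scikit-learn module paths; older versions import train_test_split from sklearn.cross_validation):

    from sklearn import datasets
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    iris = datasets.load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.25, random_state=0)

    clf = SVC(kernel="linear")        # a support vector classifier
    clf.fit(X_train, y_train)         # train on the training split
    print(clf.score(X_test, y_test))  # accuracy on the held-out split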



Official homepage: http://scikit-learn.org/



2. Pandas: Python Data Analysis Library


Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.


My first contact with Pandas came about because a project in Udacity's data analysis course "Introduction to Data Science" required the Pandas library, so I learned a bit of Pandas. Pandas is also built on NumPy and matplotlib, and is mainly used for data analysis and data visualization. Its DataFrame structure is very similar to R's data.frame, and it has its own set of analysis mechanisms especially for time series data, which is very good. A book is recommended here: "Python for Data Analysis", whose author is the main developer of Pandas. It introduces, in turn, IPython, NumPy, and Pandas functionality, data visualization, data cleaning and processing, time series handling, and examples that include financial stock data mining; quite good.
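A small sketch of the DataFrame and its time-series conveniences (recent pandas method names; the data is random):

    import numpy as np
    import pandas as pd

    # a DataFrame indexed by dates, much like R's data.frame
    dates = pd.date_range("2014-01-01", periods=6, freq="D")
    df = pd.DataFrame(np.random.randn(6, 2),
                      index=dates, columns=["price", "volume"])

    print(df.describe())                   # summary statistics per column
    print(df["price"].rolling(3).mean())   # 3-day moving average
    print(df.resample("2D").mean())        # downsample to 2-day bins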



Official homepage: http://pandas.pydata.org/



=====================================================================
Dividing line: the toolkits above are basically ones I use myself. Those below are drawn from other people's leads, in particular "Python Machine Learning Library" and "23 Python Machine Learning Packages", with a few additions, deletions, and modifications; further additions are welcome.
=====================================================================



3. mlpy – Machine Learning Python


mlpy is a Python module for Machine Learning built on top of NumPy/SciPy and the GNU Scientific Libraries.

mlpy provides a wide range of state-of-the-art machine learning methods for supervised and unsupervised problems and it is aimed at finding a reasonable compromise among modularity, maintainability, reproducibility, usability and efficiency. mlpy is multiplatform, it works with Python 2 and 3 and it is Open Source, distributed under the GNU General Public License version 3.


Official homepage: http://mlpy.sourceforge.net/



4. MDP: The Modular Toolkit for Data Processing


Modular toolkit for Data Processing (MDP) is a Python data processing framework.
From the user's perspective, MDP is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures.
From the scientific developer's perspective, MDP is a modular framework, which can easily be expanded. The implementation of new algorithms is easy and intuitive. The newly implemented units are then automatically integrated with the rest of the library.
The base of available algorithms is steadily increasing and includes signal processing methods (Principal Component Analysis, Independent Component Analysis, Slow Feature Analysis), manifold learning methods ([Hessian] Locally Linear Embedding), several classifiers, probabilistic methods (Factor Analysis, RBM), data pre-processing methods, and many others.


"MDP's modular Toolkit for data processing, a Python data processing framework. From the user's point of view, MDP is a group of supervised learning and unsupervised learning algorithms and other data processing units that can be integrated into data processing sequences and more complex feedforward network structures. Calculations are performed efficiently according to speed and memory requirements. From a scientific developer's point of view, MDP is a modular framework that can be easily extended. The implementation of the new algorithm is easy and intuitive. The newly implemented unit is then automatically integrated with the rest of the library's components. MDP was written in the context of neuroscience research, but it has been designed to be useful in any situation where training data processing algorithms can be used. Its simplicity on the user side, various readily available algorithms, and reusability of the application unit make it a useful teaching tool. ”



Official homepage: http://mdp-toolkit.sourceforge.net/



5. PyBrain


PyBrain is a modular machine learning library for Python. Its goal is to offer flexible, easy-to-use yet still powerful algorithms for machine learning tasks and a variety of predefined environments to test and compare your algorithms.

PyBrain is short for Python-Based Reinforcement Learning, Artificial Intelligence and Neural Networks Library. In fact, we came up with the name first and later reverse-engineered this quite descriptive "backronym".


"Pybrain (python-based reinforcement Learning, Artificial Intelligence and Neural Network) is a machine learning module for Python, Its goal is to provide a flexible, easy-to-apply, powerful machine learning algorithm for machine learning tasks. (the name is very domineering)



As its name suggests, PyBrain includes neural networks, reinforcement learning (and their combination), unsupervised learning, and evolutionary algorithms. Because many current problems require handling continuous state and action spaces, function approximators (such as neural networks) must be used to cope with high-dimensional data. PyBrain takes neural networks as its core; all of its training methods take a neural network as their instance."
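The classic XOR example gives the flavor (a sketch of PyBrain's shortcut API):

    from pybrain.tools.shortcuts import buildNetwork
    from pybrain.datasets import SupervisedDataSet
    from pybrain.supervised.trainers import BackpropTrainer

    net = buildNetwork(2, 3, 1)      # a 2-3-1 feed-forward network
    ds = SupervisedDataSet(2, 1)     # two inputs, one target
    for inp, target in [((0, 0), (0,)), ((0, 1), (1,)),
                        ((1, 0), (1,)), ((1, 1), (0,))]:
        ds.addSample(inp, target)

    trainer = BackpropTrainer(net, ds)
    for _ in range(1000):            # backpropagation epochs
        trainer.train()
    print(net.activate((0, 1)))      # should be close to 1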



Official homepage: http://www.pybrain.org/



6. PyML – Machine Learning in Python


PyML is an interactive object oriented framework for machine learning written in Python. PyML focuses on SVMs and other kernel methods. It is supported on Linux and Mac OS X.


"Pyml is a Python machine learning toolkit that provides a flexible architecture for each classification and regression approach. It mainly provides feature selection, model selection, combinatorial classifier, classification evaluation and other functions. ”



Project home: http://pyml.sourceforge.net/



7. Milk: Machine Learning Toolkit in Python


Its focus is on supervised classification with several classifiers available:
SVMs (based on libsvm), k-NN, random forests, decision trees. It also performs
feature selection. These classifiers can be combined in many ways to form
different classification systems.


"Milk is a machine learning toolkit for Python that focuses on providing supervised taxonomies with several effective classification analyses: SVMs (based on LIBSVM), K-nn, stochastic forest economics and decision trees. It also allows for feature selection. These classifications can be combined in many ways to form different classification systems. For unsupervised learning, it provides k-means and affinity propagation clustering algorithms. ”



Official homepage: http://luispedro.org/software/milk






8. PyMVPA: Multivariate Pattern Analysis (MVPA) in Python


PyMVPA is a Python package intended to ease statistical learning analyses of large datasets. It offers an extensible framework with a high-level interface to a broad range of algorithms for classification, regression, feature selection, data import and export. It is designed to integrate well with related software packages, such as scikit-learn and MDP. While it is not limited to the neuroimaging domain, it is eminently suited for such datasets. PyMVPA is free software and requires nothing but free software to run.


PyMVPA (Multivariate Pattern Analysis in Python) is a Python toolkit providing statistical learning analyses for large datasets, with a flexible and extensible framework. It offers classification, regression, feature selection, data import and export, visualization, and other functions.



Official homepage: http://www.pymvpa.org/



9. Pyrallel – Parallel Data Analytics in Python


Experimental project to investigate distributed computation patterns for machine learning and other semi-interactive data analytics tasks.


"Pyrallel (Parallel Data Analytics in Python) is a machine learning and semi-interactive pilot project based on a distributed computing model that can be run on a small cluster"



GitHub code page: http://github.com/pydata/pyrallel



10. Monte – Gradient-Based Learning in Python


Monte (python) is a Python framework for building gradient based learning machines, like neural networks, conditional random fields, logistic regression, etc. Monte contains modules (that hold parameters, a cost-function and a gradient-function) and trainers (that can adapt a module's parameters by minimizing its cost-function on training data).

Modules are usually composed of other modules, which can in turn contain other modules, etc. Gradients of decomposable systems like these can be computed with back-propagation.


"Learning in pure Python is a pure Python machine learning Library. It can quickly build neural networks, conditional random-airports, logistic regression models, use INLINE-C optimization, easy to use and expand. ”



Official homepage: http://montepython.sourceforge.net



11. Theano


Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Theano features:
1) Tight integration with NumPy – use numpy.ndarray in Theano-compiled functions.
2) Transparent use of a GPU – perform data-intensive calculations up to 140x faster than with CPU. (float32 only)
3) Efficient symbolic differentiation – Theano does your derivatives for functions with one or many inputs.
4) Speed and stability optimizations – get the right answer for log(1+x) even when x is really tiny.
5) Dynamic C code generation – evaluate expressions faster.
6) Extensive unit-testing and self-verification – detect and diagnose many types of mistakes.
Theano has been powering large-scale computationally intensive scientific investigations since 2007. But it is also approachable enough to be used in the classroom (IFT6266 at the University of Montreal).


"Theano is a Python library for defining, optimizing, and evaluating mathematical expressions, allowing efficient computation on multidimensional arrays. Theano's features: tight integration with NumPy, efficient data-intensive GPU computing, efficient symbolic differentiation, speed and stability optimizations, dynamic C code generation, and extensive unit testing and self-verification. Theano has been powering large-scale computationally intensive scientific research since 2007. It also makes building deep learning models much easier and lets you implement many models quickly. PS: Theano was a Greek beauty, the daughter of Milo, the most powerful man of Croton, who later became Pythagoras' wife."
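A small sketch of the symbolic differentiation feature mentioned above:

    import theano
    import theano.tensor as T

    x = T.dscalar("x")                 # a symbolic double-precision scalar
    y = x ** 2 + T.log(1 + x)          # a symbolic expression in x
    dy = T.grad(y, x)                  # symbolic derivative: 2x + 1/(1+x)

    f = theano.function([x], [y, dy])  # compile to fast (optionally GPU) code
    print(f(3.0))                      # [value, derivative] at x = 3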



12. Pylearn2


Pylearn2 is a machine learning library. Most of its functionality is built on top of Theano. This means you can write Pylearn2 plugins (new models, algorithms, etc.) using mathematical expressions, and Theano will optimize and stabilize those expressions for you, and compile them to a backend of your choice (CPU or GPU).


"Pylearn2 built on Theano, part of the reliance on Scikit-learn, the current Pylearn2 is in development, will be able to deal with vectors, images, video and other data, to provide MLP, RBM, SDA and other deep learning model. ”



Official homepage: http://deeplearning.net/software/pylearn2/



Others are welcome as additions; this article will also continue to be updated here.



Note: original article; when reproducing, please credit the source "I Love Natural Language Processing": www.52nlp.cn



This article's link: http://www.52nlp.cn/python-%e7%bd%91%e9%a1%b5%e7%88%ac%e8%99%ab-%e6%96%87%e6%9c%ac%e5%a4%84%e7%90%86-%e7%a7%91%e5%ad%a6%e8%ae%a1%e7%ae%97-%e6%9c%ba%e5%99%a8%e5%ad%a6%e4%b9%a0-%e6%95%b0%e6%8d%ae%e6%8c%96%e6%8e%98





