We have now entered the age of big data. Data processing and analysis begin with acquiring data, and web crawlers are an important source of it. Because Python is simple and efficient, and has strong support for crawling, we choose Python as the main programming language. The Python version used here is 2.7.
The main content of this article is adapted from http://cuiqingcai.com/category/technique/python. The original author's introduction is very detailed, so much of it is reproduced directly here, in the hope that more people can learn and make progress from it.
First, what is a crawler? A crawler is a program or script that automatically harvests network information according to certain rules. The important topics for a Python crawler are: the basics of Python, the use of the urllib and urllib2 libraries, the use of regular expressions, learning and using the crawler framework Scrapy, and some more advanced features. Most important of all is hands-on practice.
1. Basics of Python
For the basics, I have already written a few simple blog posts; here are some entry-level learning sites for everyone: the imooc Python tutorial http://www.imooc.com/view/177, Liao Xuefeng's Python tutorial http://www.liaoxuefeng.com/wiki/001374738125095c955c1e6d8bb493182103fac9270762a000, and the concise Python tutorial http://woodpecker.org.cn/abyteofpython_cn/chinese/pr01.html#s01
2. Usage of the urllib and urllib2 libraries
urllib and urllib2 are the most basic libraries for learning Python crawling. With them we can fetch the content of a web page, then extract and analyze that content with regular expressions to get the results we want.
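As a minimal sketch of this first step, the snippet below fetches a page with urllib2 under Python 2.7 as the article assumes; the URL and the User-Agent string are illustrative assumptions, and the fallback import lets the same code run on Python 3, where urllib2 was merged into urllib.request.

```python
# Minimal page-fetching sketch for Python 2.7 (urllib2); the fallback
# import keeps the same code working on Python 3 (urllib.request).
try:
    import urllib2 as request_lib          # Python 2.7
except ImportError:
    import urllib.request as request_lib   # Python 3 fallback

url = "http://www.example.com"             # illustrative URL, not from the article

# Build a request with a browser-like User-Agent header, since many
# sites block the default Python one.
req = request_lib.Request(url, headers={"User-Agent": "Mozilla/5.0"})

try:
    # urlopen returns a file-like object; read() gives the raw page bytes.
    page = request_lib.urlopen(req, timeout=10).read()
    print("fetched %d bytes" % len(page))
except Exception as exc:                   # no network, DNS failure, HTTP error, ...
    print("fetch failed: %s" % exc)
```

Once the page content is in hand, the next section shows how to pick the interesting parts out of it with regular expressions.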
3. Python regular expressions
A regular expression is a powerful tool for matching strings. The idea behind it is to use a descriptive language to define a rule for strings: any string that conforms to the rule is considered a "match", and any other string is rejected.
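To make this concrete, here is a small sketch of extracting links from fetched HTML with a regular expression; the sample HTML and the pattern are illustrative assumptions, and for complex real-world pages a proper HTML parser is usually more robust than a regex.

```python
# Sketch: one regex rule extracts every link (href + text) from sample HTML.
import re

html = '''
<a href="http://example.com/page1">First</a>
<a href="http://example.com/page2">Second</a>
'''

# Group 1 captures the href value, group 2 the link text.
# .*? is non-greedy, so each match stops at its own closing quote/tag.
pattern = re.compile(r'<a href="(.*?)">(.*?)</a>')

links = pattern.findall(html)
for href, text in links:
    print(href, text)
```

findall returns every non-overlapping match as a tuple of the captured groups, which is exactly the "extract and analyze" step described above.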
4. The crawler framework Scrapy
If you are already comfortable with Python and have mastered the basic crawler knowledge, then look for a Python framework; the framework we choose is Scrapy. What powerful features does this framework have? Here is an excerpt from its official introduction:
Built-in support for selecting and extracting data from HTML and XML sources
Provides a set of reusable filters (called Item Loaders) shared between spiders, with built-in support for intelligently processing scraped data
Built-in support for exporting data in multiple formats (JSON, CSV, XML) and storing it in multiple backends (FTP, S3, local filesystem) via feed exports
A media pipeline that automatically downloads images (or other resources) associated with the scraped data
High extensibility: you can plug in your own functionality using signals and well-defined APIs (middleware, extensions, and pipelines)
The built-in middleware and extensions provide support for the following features:
Cookies and session handling
HTTP Compression
HTTP Authentication
HTTP caching
User-agent Simulation
Robots.txt
Crawl depth Limit
Robust encoding support and auto-detection, for dealing with foreign, non-standard, and broken encoding declarations
Support for creating spiders from templates, which speeds up spider creation while keeping the code in large projects more consistent; see the genspider command for details
Extensible stats collection for evaluating the performance of multiple spiders and detecting when they fail
An interactive shell console for trying out XPath expressions, which is very convenient for writing and debugging spiders
A system service designed to ease deployment and operation in production environments
A built-in web service for monitoring and controlling your crawler
A built-in Telnet console that hooks into the Python console inside the Scrapy process, letting you inspect and debug your crawler
Logging facilities that make it easy to catch errors during crawling
Support for Sitemaps crawling
DNS Resolver with cache
Official documentation: http://doc.scrapy.org/en/latest/
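To give a taste of the framework, below is a minimal spider sketch under the assumption that Scrapy is installed (pip install scrapy); the spider name, start URL, and CSS selectors are illustrative assumptions, not from the article. This is a framework fragment, meant to be run by Scrapy rather than directly.

```python
# A minimal Scrapy spider sketch; assumes Scrapy is installed.
# The name, start URL, and CSS selectors below are illustrative.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"                              # run as: scrapy crawl quotes
    start_urls = ["http://quotes.toscrape.com"]  # illustrative start page

    def parse(self, response):
        # Scrapy calls parse() with each downloaded response; CSS
        # selectors extract the data, and yielded dicts become items.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").extract_first(),
                "author": quote.css("small.author::text").extract_first(),
            }
```

Running it with `scrapy crawl quotes -o quotes.json` inside a Scrapy project would also exercise the feed-export support listed above.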
This concludes the introductory overview of Python crawlers.