An Introductory Summary of Python Crawlers


We have now entered the age of big data. Data processing and analysis begin with data, and web crawlers are an important way to obtain it. Because Python is simple and efficient and has strong support for crawling, we choose Python as the main programming language. The Python version used here is 2.7.

The main content of this article is adapted from http://cuiqingcai.com/category/technique/python. The blog author's introduction is very detailed, so much of it is reproduced directly here, in the hope that more people can learn from it and make progress together.

First, what is a crawler? A crawler is a program or script that automatically fetches information from the network according to certain rules. The important topics for Python crawlers are the basics of Python, the use of the urllib and urllib2 libraries, regular expressions, the crawler framework Scrapy, and a few more advanced features; most important of all is hands-on practice.

1. Basic knowledge of Python

For the basics, a few short blog posts have already been written; here are some entry-level learning sites: the imooc Python tutorial http://www.imooc.com/view/177, Liao Xuefeng's Python tutorial http://www.liaoxuefeng.com/wiki/001374738125095c955c1e6d8bb493182103fac9270762a000, and the concise Python tutorial A Byte of Python http://woodpecker.org.cn/abyteofpython_cn/chinese/pr01.html#s01

2. Usage of the urllib and urllib2 libraries

The urllib and urllib2 libraries are the most basic libraries for Python crawlers. With them we can fetch the content of a web page, then extract and analyze that content with regular expressions to get the results we want.
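As a minimal sketch of what this looks like (Python 2.7; the URL is a placeholder, not from the original article):

```python
# Minimal sketch (Python 2.7): fetch a web page with urllib2.
# The URL below is a placeholder; substitute the page you want to crawl.
import urllib2

request = urllib2.Request('http://example.com')
try:
    response = urllib2.urlopen(request)
    html = response.read()  # the raw bytes of the page body
    print html
except urllib2.URLError as e:
    print 'Request failed:', e.reason
```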

3. Python regular expressions

A regular expression is a powerful weapon for matching strings. Its design idea is to use a descriptive language to define a rule for strings: any string that conforms to the rule is considered a "match", and any string that does not is rejected.
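For example, a minimal sketch of extracting links from fetched HTML with the re module (the sample HTML string is invented for illustration):

```python
# Minimal sketch: pull the href targets out of some HTML with a regex.
# The sample HTML is invented for illustration; real pages are messier.
# The non-greedy group (.*?) keeps each match inside a single tag.
import re

html = '<a href="http://example.com/a">A</a><a href="http://example.com/b">B</a>'
links = re.findall(r'<a href="(.*?)">', html)
print links  # ['http://example.com/a', 'http://example.com/b']
```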

4. The crawler framework Scrapy

If you are comfortable with Python and have mastered the basics of crawling, then it is time to look at a Python framework; the framework we choose is Scrapy. What powerful features does this framework have? Here is its official introduction (a minimal spider sketch follows the list):

Built-in support for selecting and extracting data from HTML and XML sources

A series of reusable filters (Item Loaders) shared between spiders, with built-in support for intelligently processing scraped data

Built-in support for exporting data in multiple formats (JSON, CSV, XML) and storing it in multiple backends (FTP, S3, local filesystem) via feed exports

A media pipeline for automatically downloading images (or other resources) associated with the scraped data

High extensibility: you can plug in your own functionality using signals and well-defined APIs (middlewares, extensions, pipelines)

The built-in middleware and extensions provide support for the following features:

Cookies and session handling

HTTP Compression

HTTP Authentication

HTTP caching

User-agent spoofing

Robots.txt

Crawl depth restriction

Robust encoding support and auto-detection, for dealing with non-standard and broken encoding declarations

Support for creating spiders from templates, which speeds up spider creation and keeps the code more consistent across large projects. See the genspider command for details.

An extensible stats collection facility for evaluating the performance of multiple crawlers and detecting when they break

An interactive shell console for trying out XPath expressions, which is very convenient for writing and debugging your spiders

A system service designed to ease deployment and operation in production environments

A built-in Web service for monitoring and controlling your crawler

A built-in Telnet console that hooks into the Python process running inside Scrapy, so you can inspect and debug your crawler

Logging facilities that make it easy to catch errors that occur during crawling

Support for Sitemaps crawling

DNS Resolver with cache
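To make the list concrete, here is a minimal spider sketch, assuming Scrapy 1.0 or later is installed (pip install scrapy); the spider name, start URL, and XPath expression are hypothetical placeholders:

```python
# Minimal spider sketch, assuming Scrapy is installed (pip install scrapy).
# The name, start URL, and XPath below are placeholders for illustration.
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Select the text of every link on the page with an XPath expression.
        for title in response.xpath('//a/text()').extract():
            yield {'title': title}
```

Saved as example_spider.py, this can be run without a full project via `scrapy runspider example_spider.py -o titles.json`, which also exercises the feed export feature mentioned above.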

Official documentation: http://doc.scrapy.org/en/latest/
