An Introductory Summary of Python Crawlers


We have now entered the age of big data. Data processing and analysis begin with data, and web crawlers are an important way to obtain it. Because Python is simple and efficient and has strong support for crawling, we choose Python as the main programming language. The Python version used here is 2.7.

The main content of this article is adapted from http://cuiqingcai.com/category/technique/python. The blog author's introduction is very detailed, so much of it is reproduced directly here, in the hope that more people can learn from it and make progress together.

First, what is a crawler? A crawler is a program or script that automatically fetches information from the network according to certain rules. The important topics for Python crawlers are the basics of Python, the use of the urllib and urllib2 libraries, regular expressions, the crawler framework Scrapy, and a few more advanced features; most important of all is hands-on practice.

1. Basic knowledge of Python

For the basics, a few short blog posts have already been written; here are some entry-level learning sites: the imooc Python tutorial http://www.imooc.com/view/177, Liao Xuefeng's Python tutorial http://www.liaoxuefeng.com/wiki/001374738125095c955c1e6d8bb493182103fac9270762a000, and the concise Python tutorial A Byte of Python http://woodpecker.org.cn/abyteofpython_cn/chinese/pr01.html#s01

2. Usage of the urllib and urllib2 libraries

The urllib and urllib2 libraries are the most basic libraries for Python crawlers. With them we can fetch the content of a web page, then extract and analyze that content with regular expressions to get the results we want.
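As a minimal sketch of what this looks like (Python 2.7; the URL is a placeholder, not from the original article):

```python
# Minimal sketch (Python 2.7): fetch a web page with urllib2.
# The URL below is a placeholder; substitute the page you want to crawl.
import urllib2

request = urllib2.Request('http://example.com')
try:
    response = urllib2.urlopen(request)
    html = response.read()  # the raw bytes of the page body
    print html
except urllib2.URLError as e:
    print 'Request failed:', e.reason
```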

3. Python regular expressions

A regular expression is a powerful weapon for matching strings. Its design idea is to use a descriptive language to define a rule for strings: any string that conforms to the rule is considered a "match", and any string that does not is rejected.
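For example, a minimal sketch of extracting links from fetched HTML with the re module (the sample HTML string is invented for illustration):

```python
# Minimal sketch: pull the href targets out of some HTML with a regex.
# The sample HTML is invented for illustration; real pages are messier.
# The non-greedy group (.*?) keeps each match inside a single tag.
import re

html = '<a href="http://example.com/a">A</a><a href="http://example.com/b">B</a>'
links = re.findall(r'<a href="(.*?)">', html)
print links  # ['http://example.com/a', 'http://example.com/b']
```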

4. The crawler framework Scrapy

If you are comfortable with Python and have mastered the basics of crawling, then it is time to look at a Python framework; the framework we choose is Scrapy. What powerful features does this framework have? Here is its official introduction (a minimal spider sketch follows the list):

Built-in support for selecting and extracting data from HTML and XML sources

A series of reusable filters (Item Loaders) shared between spiders, with built-in support for intelligently processing scraped data

Built-in support for exporting data in multiple formats (JSON, CSV, XML) and storing it in multiple backends (FTP, S3, local filesystem) via feed exports

A media pipeline for automatically downloading images (or other resources) associated with the scraped data

High extensibility: you can plug in your own functionality using signals and well-defined APIs (middlewares, extensions, pipelines)

The built-in middleware and extensions provide support for the following features:

Cookies and session handling

HTTP Compression

HTTP Authentication

HTTP caching

User-agent spoofing

Robots.txt

Crawl depth restriction

Robust encoding support and auto-detection, for dealing with non-standard and broken encoding declarations

Support for creating spiders from templates, which speeds up spider creation and keeps the code more consistent across large projects. See the genspider command for details.

An extensible stats collection facility for evaluating the performance of multiple crawlers and detecting when they break

An interactive shell console for trying out XPath expressions, which is very convenient for writing and debugging your spiders

A system service designed to ease deployment and operation in production environments

A built-in Web service for monitoring and controlling your crawler

A built-in Telnet console that hooks into the Python process running inside Scrapy, so you can inspect and debug your crawler

Logging facilities that make it easy to catch errors that occur during crawling

Support for Sitemaps crawling

DNS Resolver with cache
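To make the list concrete, here is a minimal spider sketch, assuming Scrapy 1.0 or later is installed (pip install scrapy); the spider name, start URL, and XPath expression are hypothetical placeholders:

```python
# Minimal spider sketch, assuming Scrapy is installed (pip install scrapy).
# The name, start URL, and XPath below are placeholders for illustration.
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Select the text of every link on the page with an XPath expression.
        for title in response.xpath('//a/text()').extract():
            yield {'title': title}
```

Saved as example_spider.py, this can be run without a full project via `scrapy runspider example_spider.py -o titles.json`, which also exercises the feed export feature mentioned above.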

Official documentation: http://doc.scrapy.org/en/latest/
