Python Crawler Introduction (1): A Summary


To learn how to write Python crawlers, we need to cover the following points:

  • Python basics
  • Usage of the urllib and urllib2 libraries in Python
  • Python regular expressions
  • The Python crawler framework Scrapy
  • More advanced features of Python crawlers
    1. Python Basics

    First of all, to write crawlers in Python we need to understand the basics of Python; a towering building needs a solid foundation. So let me share some of the Python tutorials I have gone through, which you can use as a reference.

    1) MU-class Network Python tutorial

    I went through the basic syntax on the MU-class Network site. It comes with some exercises that you can use for practice after each lesson, and I found the effect quite good. The one small regret is that the content is quite basic, so treat it as a starting point.

    Learning site: MU-class Network Python tutorial

    2) Liao Xuefeng's Python tutorial

    Later I found Liao Xuefeng's Python tutorial, which is very easy to understand and very well written. If you want to learn Python in more depth, take a look at it.

    Learning site: Liao Xuefeng's Python tutorial

    3) Concise Python tutorial

    I have also read A Concise Python Tutorial and found it quite good.

    Learning site: A Concise Python Tutorial

    4) Wang Hai's Laboratory

    This blog belongs to a senior from my undergraduate laboratory. I got started by reading his articles and then re-summarized them myself; this series of articles adds some content on top of his.

    Learning site: Wang Hai's Laboratory

    2. Usage of Python's urllib and urllib2 Libraries

    The urllib and urllib2 libraries are the most fundamental libraries for learning Python crawlers. With them we can fetch the content of a web page, then extract and analyze that content with regular expressions to get the results we want. I will share this with you in later posts.
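
    As a first taste, here is a minimal sketch of fetching a page with urllib2 (Python 2); the URL is only a placeholder for illustration:

        # -*- coding: utf-8 -*-
        # Fetch a page with urllib2 (Python 2); the URL is a placeholder.
        import urllib2

        request = urllib2.Request('http://example.com')
        response = urllib2.urlopen(request, timeout=10)  # give up after 10 seconds
        print response.getcode()      # HTTP status code, e.g. 200
        print response.read()[:200]   # first 200 bytes of the page content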

    3. Python Regular Expressions

    Python regular expressions are a powerful weapon for matching strings. The design idea is to use a descriptive language to define a rule for strings: any string that conforms to the rule is considered a "match", and any other string is not. This will be covered in a later post.
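
    For example, here is a small sketch of extracting a link from HTML with the re module; the HTML snippet and the pattern are made up for illustration:

        # -*- coding: utf-8 -*-
        # Extract the URL and text of a link with a regular expression.
        import re

        html = '<a href="http://example.com/page1">Page 1</a>'
        pattern = re.compile(r'<a href="(.*?)">(.*?)</a>')  # non-greedy capture groups
        match = pattern.search(html)
        if match:
            print match.group(1)  # http://example.com/page1
            print match.group(2)  # Page 1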

    4. The Crawler Framework Scrapy

    If you are already proficient in Python and have mastered basic crawler knowledge, then look into a Python framework. The one I chose is Scrapy. What powerful features does this framework have? Here is its official introduction:

    • Built-in support for selecting and extracting data from HTML and XML sources
    • A set of reusable filters (i.e. Item Loaders) shared between spiders, with built-in support for intelligent processing of scraped data
    • Built-in support for exporting data in multiple formats (JSON, CSV, XML) and storing it in multiple backends (FTP, S3, local filesystem) via feed exports
    • A media pipeline for automatically downloading images (or other resources) associated with the scraped data
    • High extensibility: you can plug in your own functionality using signals and a well-defined API (middlewares, extensions, and pipelines)
    • Built-in middlewares and extensions that provide support for:
      • cookies and session handling
      • HTTP compression
      • HTTP authentication
      • HTTP caching
      • user-agent spoofing
      • robots.txt
      • crawl depth restriction
    • Robust encoding support and auto-detection, for dealing with foreign, non-standard, and broken encoding declarations
    • Support for creating spiders based on templates, which speeds up spider creation while keeping the code in large projects more consistent (see the genspider command for more information)
    • Extensible stats collection, useful for monitoring the performance of multiple crawlers and detecting when they break
    • An interactive shell console for trying out XPath expressions, which is very convenient for writing and debugging spiders
    • A system service designed to ease deployment and operation in production environments
    • A built-in web service for monitoring and controlling your crawler
    • A built-in Telnet console for hooking into a Python console running inside the Scrapy process, to inspect and debug your crawler
    • Logging facilities for catching errors during crawling
    • Support for crawling based on Sitemaps
    • A DNS resolver with caching

    Official Document: http://doc.scrapy.org/en/latest/

    Once we have mastered the basic knowledge, we can put the Scrapy framework to use!
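
    To give a feel for the framework, here is a minimal spider sketch using a modern Scrapy API; the site and the parsing logic are placeholders for illustration, not part of the original article:

        # A minimal Scrapy spider: collect the text and URL of every link.
        import scrapy

        class LinkSpider(scrapy.Spider):
            name = 'links'
            start_urls = ['http://example.com']

            def parse(self, response):
                # Iterate over all <a> elements with XPath selectors.
                for link in response.xpath('//a'):
                    yield {
                        'text': link.xpath('text()').extract_first(),
                        'url': link.xpath('@href').extract_first(),
                    }

    Saved as link_spider.py, it can be run without a full project via: scrapy runspider link_spider.py -o links.json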

    I have rambled on quite a bit without saying much concrete, so enough rambling!

    Let's set off on our crawler journey now!

  • There are also powerful third-party crawler libraries such as Requests and lxml; still, it is recommended to learn Python's built-in urllib and urllib2 libraries first (a small Requests sketch follows this list).
  • From: Jingmi (静觅): A Summary of the Introduction to Python Crawlers
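
    For comparison, here is a tiny sketch of the third-party Requests library mentioned above; the URL is a placeholder:

        # -*- coding: utf-8 -*-
        # Fetch a page with the third-party Requests library.
        import requests

        response = requests.get('http://example.com', timeout=10)
        print response.status_code   # e.g. 200
        print response.text[:200]    # first 200 characters of the decoded page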

