Python Tutorial --- Crawler Introductory Tutorial, Part One


The Python version used in this tutorial is 2.7!

When I started college, I kept seeing crawlers mentioned online. At the time I was still learning C++ and had no time to learn Python, so I never got around to crawlers either. I am now taking advantage of this project to learn the basics of Python, which rekindled my interest in learning crawlers, so I wrote this series of blog posts to record what I accumulate along the way.

Now, to cut to the chase:

What is a crawler?

A web crawler (also known as a web spider or web robot, and more often called a web chaser in the FOAF community) is a program or script that automatically crawls information from the World Wide Web according to certain rules.

What knowledge do you need when learning crawlers?

    • Python basics
    • Usage of the urllib and urllib2 libraries in Python
    • Python regular expressions
    • The Python crawler framework Scrapy
    • More advanced features of Python crawlers

1. Learning basic Python

These are resources I often used when learning online:

a) Liao Xuefeng's Python tutorial

b) Python official documentation

2. Using the urllib and urllib2 libraries

There are tutorials online, and later posts in this blog will include my own introduction, but the best way to learn is from the official documentation.
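As a quick illustration, here is a minimal sketch of fetching a page with urllib2 under Python 2.7; the URL and header value are placeholders for illustration, not from the original post:

    # Fetch a page with urllib2 (Python 2.7)
    import urllib2

    request = urllib2.Request('http://www.example.com')
    request.add_header('User-Agent', 'Mozilla/5.0')  # present a browser-like user agent
    response = urllib2.urlopen(request)
    print response.getcode()  # HTTP status code, e.g. 200
    html = response.read()    # raw page content as a string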

3. Regular expressions

As I am still a beginner and only understand a little, I cannot yet give good learning advice; but with heavier use of search engines, you should be able to pick them up quickly.
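For instance, a common crawler task is pulling links out of fetched HTML. Here is a minimal sketch using Python's built-in re module; the sample HTML string is made up for illustration:

    # Extract href values from an HTML string with a regular expression
    import re

    html = '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>'
    # (.*?) is a non-greedy group matching everything up to the next quote
    links = re.findall(r'href="(.*?)"', html)
    print links  # ['http://example.com/a', 'http://example.com/b']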

4. The crawler framework Scrapy

Once you have fully mastered the basics of crawlers, you can try using a framework to accomplish more. What I used while learning is the Scrapy framework, whose official documentation describes its features as follows (a minimal spider sketch follows the list):

    • Built-in support for selecting and extracting data from HTML and XML sources
    • A set of reusable filters (called Item Loaders) shared between spiders, with built-in support for intelligently processing scraped data
    • Built-in support for exporting data in multiple formats (JSON, CSV, XML) and storing it in multiple backends (FTP, S3, local filesystem) via feed exports
    • A media pipeline for automatically downloading images (or other resources) associated with the scraped data
    • High extensibility: you can plug in your own functionality using signals and well-designed APIs (middlewares, extensions, and pipelines)
    • Built-in middlewares and extensions that provide support for: cookies and session handling, HTTP compression, HTTP authentication, HTTP caching, user-agent spoofing, robots.txt, and crawl depth restriction
    • Robust encoding support and auto-detection, for dealing with foreign, non-standard, and broken encoding declarations
    • Support for creating spiders from templates, which speeds up spider creation while keeping code more consistent in large projects (see the genspider command)
    • An extensible stats-collection facility for performance evaluation and failure detection across multiple spiders
    • An interactive shell console for testing XPath expressions, which is very convenient for writing and debugging spiders
    • A system service designed to ease deployment and operation in production environments
    • A built-in web service for monitoring and controlling your crawler
    • A built-in Telnet console that hooks into the Python console running inside the Scrapy process, so you can inspect and debug your crawler
    • Logging facilities for catching errors during crawling
    • Support for crawling from Sitemaps
    • A DNS resolver with caching
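To make this concrete, here is a minimal Scrapy spider sketch; the spider name, file name, and URL are illustrative, not from the original post. It can be run with "scrapy runspider example_spider.py":

    # example_spider.py -- a minimal Scrapy spider
    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = 'example'
        start_urls = ['http://www.example.com']

        def parse(self, response):
            # Select every link on the page with an XPath expression
            for href in response.xpath('//a/@href').extract():
                yield {'link': href}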

Scrapy official documentation

Reference Blog: A summary of the Python crawler introduction
