The Python version used for this tutorial is 2.7!!!
At the beginning of college I kept seeing mentions of crawlers online, but at the time I was still learning C++ and had no time to pick up Python, so I never got around to crawlers. Taking advantage of this project, I learned the basics of Python, which sparked my interest in crawlers, and I wrote this series of blog posts to record what I have learned.
Now let's get down to business:
What is a crawler?
A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically crawls information from the World Wide Web according to certain rules.
What knowledge do you need when learning crawlers?
- Python Basics
- Usage of the urllib and urllib2 libraries in Python
- Python Regular Expressions
- The Python crawler framework Scrapy
- More advanced features of Python crawlers
1. Learning the basics of Python
These are resources I often used while learning:
a) Liao Xuefeng's Python tutorial
b) Python official documentation
2. Using the urllib and urllib2 libraries
There are tutorials on the Internet, and later posts in this series will include my own notes, but the best way to learn is from the official documentation.
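As a quick taste, here is a minimal sketch of fetching a page with urllib2 under Python 2.7; the URL and the User-Agent string are just placeholders for illustration, not part of any particular tutorial.

```python
# -*- coding: utf-8 -*-
import urllib2

# Build a request with a custom User-Agent header (the URL is a placeholder)
request = urllib2.Request(
    "http://example.com",
    headers={"User-Agent": "Mozilla/5.0"},
)
response = urllib2.urlopen(request, timeout=10)

print response.getcode()   # HTTP status code, e.g. 200
html = response.read()     # raw bytes of the page
print html[:200]           # preview the first 200 bytes
```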
3. Regular expressions
As I am still a beginner myself and only understand a little, I cannot offer much learning advice yet, but with practice and the help of a search engine you should be able to pick it up quickly.
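For example, here is a minimal sketch of pulling links out of a page with the re module; the HTML snippet is made up for illustration.

```python
# -*- coding: utf-8 -*-
import re

# A made-up HTML snippet to demonstrate matching
html = '<a href="http://example.com/page1">Page 1</a><a href="http://example.com/page2">Page 2</a>'

# Non-greedy groups capture the href value and the link text
pattern = re.compile(r'<a href="(.*?)">(.*?)</a>')
for href, text in pattern.findall(html):
    print href, text
```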
4. The crawler framework Scrapy
Once you have a solid grasp of the basics of crawlers, you can try using a framework to accomplish more. What I used in my own learning is the Scrapy framework, which the official documentation describes as follows:
- Built-in support for selecting and extracting data from HTML and XML sources
- A series of reusable filters shared between spiders (i.e. Item Loaders), providing built-in support for intelligently processing crawled data
- Built-in support for multiple formats (JSON, CSV, XML) and multiple storage backends (FTP, S3, local file system) via feed exports
- A media pipeline that can automatically download images (or other resources) from crawled data
- High extensibility: you can customize functionality using signals and a well-designed API (middlewares, extensions, pipelines)
- Built-in middlewares and extensions that support cookies and session handling, HTTP compression, HTTP authentication, HTTP caching, user-agent simulation, robots.txt, and crawl depth limits
- Automatic detection and robust encoding support for non-standard or incorrect encoding declarations in non-English languages
- Support for creating crawlers from templates, which speeds up crawler creation while keeping the code in large projects more consistent; see the genspider command for more information
- An extensible stats collection tool for performance evaluation and failure detection across multiple crawlers
- An interactive shell terminal, which is a great convenience for testing XPath expressions and for writing and debugging crawlers
- A system service to simplify deployment and operation in production environments
- A built-in web service that lets you monitor and control your machine
- A built-in Telnet console that lets you inspect and debug crawlers by hooking into the Python console of the Scrapy process
- Logging, which makes it convenient to catch errors during crawling
- Support for crawling Sitemaps
- A DNS resolver with caching
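To give a sense of what a Scrapy spider looks like, here is a minimal sketch; the spider name and start URL are placeholders, and this is only an illustration, not code from the official documentation.

```python
# -*- coding: utf-8 -*-
import scrapy


class LinkSpider(scrapy.Spider):
    name = "link_example"                    # placeholder spider name
    start_urls = ["http://example.com"]      # placeholder start URL

    def parse(self, response):
        # Use XPath selectors to yield the text and href of every link on the page
        for link in response.xpath("//a"):
            text = link.xpath("text()").extract()
            href = link.xpath("@href").extract()
            yield {
                "text": text[0] if text else None,
                "href": href[0] if href else None,
            }
```

If you save this as a single file, you can run it with `scrapy runspider link_spider.py -o links.json` to dump the results as JSON via the feed exports mentioned above.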
Scrapy official documentation
Reference Blog: A summary of the Python crawler introduction
Python Tutorial: Crawler Introductory Tutorial One