The Python version used for this tutorial is 2.7!!!
At the beginning of college I kept seeing mentions of crawlers online, but at the time I was still learning C++ and had no time to pick up Python, so I never got around to crawlers. Taking advantage of this project, I learned the basics of Python, which sparked my interest in crawlers, and I wrote this series of blog posts to record what I have learned.
Now let's get down to business:
What is a crawler?
A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically crawls information from the World Wide Web according to certain rules.
What knowledge do you need when learning crawlers?
- Python Basics
- Usage of the urllib and urllib2 libraries in Python
- Python Regular Expressions
- The Python crawler framework Scrapy
- More advanced features of Python crawlers
1. Learning the basics of Python
These are resources I often used while learning:
a) Liao Xuefeng's Python tutorial
b) Python official documentation
2. Using the urllib and urllib2 libraries
There are tutorials on the Internet, and later posts in this series will include my own notes, but the best way to learn is from the official documentation.
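As a quick taste, here is a minimal sketch of fetching a page with urllib2 under Python 2.7; the URL and the User-Agent string are just placeholders for illustration, not part of any particular tutorial.

```python
# -*- coding: utf-8 -*-
import urllib2

# Build a request with a custom User-Agent header (the URL is a placeholder)
request = urllib2.Request(
    "http://example.com",
    headers={"User-Agent": "Mozilla/5.0"},
)
response = urllib2.urlopen(request, timeout=10)

print response.getcode()   # HTTP status code, e.g. 200
html = response.read()     # raw bytes of the page
print html[:200]           # preview the first 200 bytes
```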
3. Regular expressions
As I am still a beginner myself and only understand a little, I cannot offer much learning advice yet, but with practice and the help of a search engine you should be able to pick it up quickly.
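For example, here is a minimal sketch of pulling links out of a page with the re module; the HTML snippet is made up for illustration.

```python
# -*- coding: utf-8 -*-
import re

# A made-up HTML snippet to demonstrate matching
html = '<a href="http://example.com/page1">Page 1</a><a href="http://example.com/page2">Page 2</a>'

# Non-greedy groups capture the href value and the link text
pattern = re.compile(r'<a href="(.*?)">(.*?)</a>')
for href, text in pattern.findall(html):
    print href, text
```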
4. The crawler framework Scrapy
Once you have a solid grasp of the basics of crawlers, you can try using a framework to accomplish more. What I used in my own learning is the Scrapy framework, which the official documentation describes as follows:
- Built-in support for selecting and extracting data from HTML and XML sources
- A series of reusable filters shared between spiders (i.e. Item Loaders), providing built-in support for intelligently processing crawled data
- Built-in support for multiple formats (JSON, CSV, XML) and multiple storage backends (FTP, S3, local file system) via feed exports
- A media pipeline that can automatically download images (or other resources) from crawled data
- High extensibility: you can customize functionality using signals and a well-designed API (middlewares, extensions, pipelines)
- Built-in middlewares and extensions that support cookies and session handling, HTTP compression, HTTP authentication, HTTP caching, user-agent simulation, robots.txt, and crawl depth limits
- Automatic detection and robust encoding support for non-standard or incorrect encoding declarations in non-English languages
- Support for creating crawlers from templates, which speeds up crawler creation while keeping the code in large projects more consistent; see the genspider command for more information
- An extensible stats collection tool for performance evaluation and failure detection across multiple crawlers
- An interactive shell terminal, which is a great convenience for testing XPath expressions and for writing and debugging crawlers
- A system service to simplify deployment and operation in production environments
- A built-in web service that lets you monitor and control your machine
- A built-in Telnet console that lets you inspect and debug crawlers by hooking into the Python console of the Scrapy process
- Logging, which makes it convenient to catch errors during crawling
- Support for crawling Sitemaps
- A DNS resolver with caching
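To give a sense of what a Scrapy spider looks like, here is a minimal sketch; the spider name and start URL are placeholders, and this is only an illustration, not code from the official documentation.

```python
# -*- coding: utf-8 -*-
import scrapy


class LinkSpider(scrapy.Spider):
    name = "link_example"                    # placeholder spider name
    start_urls = ["http://example.com"]      # placeholder start URL

    def parse(self, response):
        # Use XPath selectors to yield the text and href of every link on the page
        for link in response.xpath("//a"):
            text = link.xpath("text()").extract()
            href = link.xpath("@href").extract()
            yield {
                "text": text[0] if text else None,
                "href": href[0] if href else None,
            }
```

If you save this as a single file, you can run it with `scrapy runspider link_spider.py -o links.json` to dump the results as JSON via the feed exports mentioned above.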
Scrapy official documentation
Reference Blog: A summary of the Python crawler introduction
Python Tutorial: Crawler Introductory Tutorial One