Transferred from: http://cuiqingcai.com/927.html
Hello everyone. I have recently been studying Python, and along the way I ran into some problems and gained some experience, so I am organizing what I have learned into a systematic series. If you are interested in learning about crawlers, you can use these articles as a reference, and you are welcome to share your own learning experience.
Python version: 2.7. For Python 3, please look for another blog post.
First, what is a crawler?
A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically crawls information from the World Wide Web according to certain rules.
In my experience, to learn Python crawling we need to study the following topics:
- Python Basics
- Usage of the urllib and urllib2 libraries in Python
- Python Regular Expressions
- The Python crawler framework Scrapy
- More advanced Python crawler topics

1. Python Basics
First of all, if we want to write crawlers in Python, we must understand the basics of Python; a tall building cannot stand without its foundation, haha. Below I share some of the Python tutorials I have read, which you can use as a reference.
1) imooc Python tutorial. The imooc site covers the basic syntax and includes some exercises you can use for practice after studying; the effect is quite good. The one regret is that the content is mostly just the very basics, so treat it as a starting point.
Learning site: imooc Python tutorial
2) Liao Xuefeng's Python tutorial. Later I discovered Liao Xuefeng's Python tutorial, which is very easy to understand and very popular; it is also very good, so if you want to learn Python in more depth, take a look at this one.
Learning site: Liao Xuefeng's Python tutorial
3) Concise Python tutorial. There is also a concise Python tutorial that I have read, and I found it quite good as well.
Learning site: Concise Python tutorial
2. Using the urllib and urllib2 libraries in Python

The urllib and urllib2 libraries are the most basic libraries for Python crawlers. With them we can fetch the content of a web page, and then extract the parts we want with regular expressions to get the results we need. I will share more about this as the series goes on.
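To give a taste of what this looks like, here is a minimal sketch (Python 2.7) of fetching a page with urllib2; the URL is only a placeholder for illustration.

```python
# -*- coding: utf-8 -*-
# Minimal sketch (Python 2.7): fetch a page with urllib2.
# The URL below is just a placeholder for illustration.
import urllib2

request = urllib2.Request("http://example.com")   # build the request object
response = urllib2.urlopen(request)                # send the request, get the response
html = response.read()                             # the raw HTML of the page
print html[:200]                                   # show the first 200 characters
```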
3. Python regular expressions

Python regular expressions are a powerful weapon for matching strings. The design idea is to use a descriptive language to define a rule for strings; any string that conforms to the rule is considered a "match", otherwise it is not. This will be covered in later posts.
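As a small illustration, here is a minimal sketch of pulling links out of an HTML snippet with a regular expression; the HTML string and the pattern are made up for this example.

```python
# -*- coding: utf-8 -*-
# Minimal sketch: extract links from an HTML snippet with a regular expression.
# The HTML string and the pattern are illustrative only.
import re

html = '<a href="http://example.com/1">Page 1</a><a href="http://example.com/2">Page 2</a>'
pattern = re.compile(r'<a href="(.*?)">(.*?)</a>')  # non-greedy groups: URL and link text
for url, text in pattern.findall(html):
    print url, text
```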
4. The crawler framework Scrapy

If you are already comfortable with Python and have mastered the basic crawler knowledge, then it is time to look at a Python framework. The framework I chose is Scrapy. What powerful features does this framework have? Here is its official introduction:
- Built-in support for selecting and extracting data from HTML and XML sources.
- A set of reusable filters (Item Loaders) shared between spiders, with built-in support for intelligently processing scraped data.
- Multi-format export (JSON, CSV, XML) via Feed Exports, with built-in support for multiple storage backends (FTP, S3, local file system).
- A media pipeline that automatically downloads images (or other resources) associated with the scraped data.
- High extensibility: customize functionality through signals and a well-designed API (middlewares, extensions, pipelines).
- Built-in middlewares and extensions that support cookies and sessions, HTTP compression, HTTP authentication, HTTP caching, user-agent spoofing, robots.txt, and crawl depth limits.
- Automatic detection and robust encoding support for non-standard or incorrect encoding declarations in non-English languages.
- Support for creating spiders from templates, which speeds up spider creation while keeping the code in large projects more consistent (see the genspider command for details).
- An extensible stats-collection tool for performance evaluation and failure detection across multiple spiders.
- An interactive shell terminal that makes testing XPath expressions and writing and debugging spiders much more convenient.
- A System service that simplifies deployment and operation in production environments.
- A built-in web service for monitoring and controlling your crawler.
- A built-in Telnet console that hooks into the Python terminal inside the Scrapy process, so you can inspect and debug your crawler.
- Logging support that makes it convenient to catch errors during the crawl.
- Support for crawling Sitemaps.
- A DNS resolver with caching.
Official Document: http://doc.scrapy.org/en/latest/
Once we have mastered the basics, we will put the Scrapy framework to use!
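Just to show the flavor of it, here is a minimal Scrapy spider sketch; the spider name, start URL, and CSS selector are placeholders, not part of any real project.

```python
# -*- coding: utf-8 -*-
# Minimal Scrapy spider sketch; the name, start URL and selector are placeholders.
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://example.com"]

    def parse(self, response):
        # Yield the page title as a tiny demonstration item.
        yield {"title": response.css("title::text").extract_first()}
```

Saved as a single file, a spider like this can be run with `scrapy runspider` and its items exported to JSON or CSV, which is one of the conveniences the framework provides out of the box.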
I have rambled on for quite a while without saying much of practical use, so let's stop rambling!
Let's begin our crawler journey now!