Hello everyone! I have been studying Python recently, and along the way I ran into some problems and gained some experience. In this series I will organize what I have learned systematically; if you are interested in learning about crawlers, you can use these articles as a reference, and you are welcome to share your own learning experience.
Python version: 2.7. If you use Python 3, please look for another blog post.
First, what is a crawler?
A web crawler (also known as a web spider or web robot, and more often called a web chaser in the FOAF community) is a program or script that automatically crawls World Wide Web information according to certain rules.
Based on my experience, to learn Python crawling we need to cover the following points:
- Python Basics
- Usage of the urllib and urllib2 libraries in Python
- Python Regular Expressions
- Python crawler Frame Scrapy
- More advanced features of Python crawlers
1. Python Basics
First, since we will be writing crawlers in Python, we certainly need to understand the basics of Python; a tall building cannot stand without its foundation, haha. So let me share some of the Python tutorials I have used, which you can take as a reference.
1) imooc Python tutorial
I watched some basic syntax lessons on imooc, which come with exercises you can use for practice afterwards; I felt the effect was pretty good. The one small regret is that the content is mostly at the most basic, getting-started level.
Learning website: imooc Python Tutorials
2) Liao Xuefeng's Python tutorial
Later I found Liao Xuefeng's Python tutorial, which is very easy to understand and also feels very good. If you want to learn Python in more depth, take a look at this one.
Learning website: Liao Xuefeng's Python Tutorials
3) Concise Python tutorial
There is another one I have read, the Concise Python Tutorial, which I also found quite good.
Learning website: Concise Python Tutorials
2. Usage of Python's urllib and urllib2 Libraries
The urllib and urllib2 libraries are the most fundamental libraries for learning Python crawling. With them we can fetch the content of a web page, then extract and analyze that content with regular expressions to get the results we want. I will share this with you over the course of the series; a small taste is sketched below.
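As a preview, here is a minimal sketch in Python 2.7 of fetching a page with urllib2 (the URL is just a placeholder):

```python
# -*- coding: utf-8 -*-
import urllib2

# Build a request for the page; example.com is a placeholder URL
request = urllib2.Request('http://example.com')
# Send the request and get a response object back
response = urllib2.urlopen(request)
# read() returns the raw HTML of the page as a string
html = response.read()
print html
```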
3. Python Regular Expressions
Python regular expressions are a powerful weapon for matching strings. The design idea is to use a descriptive language to define a rule for strings: any string that conforms to the rule is considered a "match", and any other string does not. This will be shared in later posts; a small example follows.
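For a quick preview, here is a minimal sketch using the re module (the HTML snippet and the pattern are made up for illustration):

```python
# -*- coding: utf-8 -*-
import re

# A made-up HTML snippet for illustration
html = '<a href="/page1">First</a> <a href="/page2">Second</a>'

# Non-greedy groups capture each link's href and anchor text
pattern = re.compile(r'<a href="(.*?)">(.*?)</a>')
for href, text in re.findall(pattern, html):
    print href, text
```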
4. The Crawler Framework Scrapy
If you are already comfortable with Python and have mastered basic crawler knowledge, then look into a Python framework. The framework I chose is Scrapy. What powerful features does this framework have? Here is the official introduction:
- Built-in support for selecting and extracting data from HTML and XML sources
- Provides a set of reusable filters (Item Loaders) shared between spiders, with built-in support for intelligent processing of scraped data
- Built-in support for exporting data in multiple formats (JSON, CSV, XML) to multiple storage backends (FTP, S3, local file system) via Feed exports
- Provides a media pipeline that can automatically download images (or other resources) from the scraped data
- High extensibility: you can customize your own functionality using signals and well-designed APIs (middlewares, extensions, and pipelines)
- Built-in middlewares and extensions provide support for:
  - cookies and session handling
  - HTTP compression
  - HTTP authentication
  - HTTP caching
  - user-agent spoofing
  - robots.txt
  - crawl depth restriction
- Automatic detection and robust encoding support for non-standard or incorrect encoding declarations in non-English languages
- Supports creating spiders from templates, which speeds up spider creation while keeping the code in large projects more consistent (see the genspider command for details)
- Provides an extensible stats collection tool for evaluating the performance of multiple spiders and detecting failures
- Provides an interactive shell terminal, which is a great convenience for testing XPath expressions and for writing and debugging spiders
- Provides a system service that simplifies deployment and operation in production environments
- Built-in web service lets you monitor and control your crawler
- Built-in telnet console lets you hook into the Python console inside the Scrapy process to inspect and debug your crawler
- Logging makes it easy to catch errors during crawling
- Supports crawling Sitemaps
- DNS resolver with caching
Once we have mastered the basics, we will use this Scrapy framework! A minimal example of what a spider looks like is sketched below.
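To give a first impression, here is a minimal sketch of a Scrapy spider (the spider name, domain, and start URL are all placeholders, not from any real project):

```python
# -*- coding: utf-8 -*-
import scrapy

class ExampleSpider(scrapy.Spider):
    # name, allowed_domains, and start_urls are placeholders
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    def parse(self, response):
        # Extract the text and href of every link with XPath
        for link in response.xpath('//a'):
            yield {
                'text': link.xpath('text()').extract_first(),
                'href': link.xpath('@href').extract_first(),
            }
```

Saved as example_spider.py, this can be run with `scrapy runspider example_spider.py -o links.json`.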
I have rambled on quite a bit without saying much of substance, so let's stop rambling!
Let's set off on our crawler journey now!
Introduction to Python Crawlers, Part 1: Overview