First, what is a web crawler? A web crawler (also known as a web spider or web robot, and in the FOAF community often called a web chaser) is a program or script that automatically crawls information from the World Wide Web according to certain rules. In my experience, learning to write Python crawlers involves a few topics:

1. Python basics
2. Usage of the Python urllib and urllib2 libraries
3. Python regular expressions
4. The Python crawler framework Scrapy
5. More advanced crawler topics

1. Python basics

Since we are going to write crawlers in Python, we first need to understand the basics of the language; a tall building cannot stand without its foundation. Here are some Python tutorials I have read, which you can use as a reference.

1) Imooc Python tutorial
I learned some basic syntax on Imooc (慕课网). Each lesson comes with exercises you can use for practice afterwards, and I found it quite effective. My only regret is that the content is very basic; it is an introduction and nothing more.
Learning site: Imooc Python tutorial

2) Liao Xuefeng's Python tutorial
Later I found Liao Xuefeng's Python tutorial, which is very popular. It is easy to understand and very well written; read it if you want a deeper understanding of Python.
Learning site: Liao Xuefeng's Python tutorial

3) Concise Python tutorial
I also read the Concise Python tutorial and felt it was good.
Learning site: Concise Python tutorial

4) Wang Hai's lab
This is the lab of a senior from my undergraduate years. I referred to his articles when I was getting started, made my own summaries, and then added some content of my own on top of his series.
Learning site: Wang Hai's lab

2. Usage of the Python urllib and urllib2 libraries

The urllib and urllib2 libraries are the most basic libraries for learning Python crawlers. With them we can fetch the content of a web page, then extract and analyze that content with regular expressions to obtain the results we want.
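The article targets Python 2, where the libraries were named urllib and urllib2; in Python 3 they were merged into urllib.request. Below is a minimal runnable sketch of the fetch-then-extract-with-regex workflow just described. The data: URL is used only so the example works offline; in practice you would pass a real http:// URL.

```python
import re
import urllib.request  # Python 3 equivalent of the article's urllib/urllib2

# An inline data: URL standing in for a real web page, so the sketch runs offline.
url = ("data:text/html,<html>"
       "<a href='http://example.com/a'>A</a>"
       "<a href='http://example.com/b'>B</a></html>")

# Step 1: fetch the page content (urlopen works the same way for http:// URLs).
html = urllib.request.urlopen(url).read().decode("utf-8")

# Step 2: extract the pieces we want with a regular expression.
links = re.findall(r"href='([^']+)'", html)
print(links)  # ['http://example.com/a', 'http://example.com/b']
```

For real pages, dedicated HTML parsers are more robust than regular expressions, but this two-step pattern (fetch, then extract) is exactly the one the article builds on.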
I will share more on this later in this series.

3. Python regular expressions

Regular expressions in Python are a powerful weapon for matching strings. The design idea is to use a descriptive language to define a rule for strings: any string that conforms to the rule is considered a "match", and any other string is rejected. This will also be covered in later posts.

4. The crawler framework Scrapy

If you have already mastered basic Python and crawler knowledge, it is time to look at a crawler framework; the one I chose is Scrapy. What are its powerful features? Here is its official introduction:

- Built-in support for selecting and extracting data from HTML and XML sources
- A set of reusable filters shared between spiders (known as Item Loaders), with built-in support for intelligently processing crawled data
- Built-in support for exporting data in multiple formats (JSON, CSV, XML) and to multiple storage backends (FTP, S3, local file system) via feed exports
- A media pipeline that automatically downloads images (or other resources) associated with crawled data
- High extensibility: you can customize functionality via signals and well-designed APIs (middleware, extensions, pipelines)
- Built-in middleware and extensions that provide support for: cookies and session handling, HTTP compression, HTTP authentication, HTTP caching, user-agent spoofing, robots.txt, and crawl depth limits
- Automatic detection of, and robust support for, non-standard or incorrect encoding declarations in non-English pages
- Support for generating spiders from templates, which speeds up spider creation while keeping the code of large projects consistent (see the genspider command for details)
- Extensible stats-collection tools for performance evaluation and failure detection across multiple crawlers
- An interactive shell console for testing XPath expressions, which is very convenient when writing and debugging spiders
- A system service that simplifies deployment and operation in production environments
- A built-in web service for monitoring and controlling your crawler
- A built-in Telnet console that lets you hook into the Python console inside the Scrapy process to inspect and debug your crawler
- Logging facilities that make it convenient to catch errors during the crawl
- Support for crawling from Sitemaps
- A DNS resolver with caching
The above is a summary of getting started with Python crawlers.