In-depth analysis of the structure and operation process of the Python crawler framework Scrapy


Web crawlers (spiders) are robots that crawl the web. Of course, they are usually not physical robots, since the network itself is a virtual thing, so this "robot" is really just a program, and it does not crawl aimlessly: it has a purpose and collects certain information along the way. For example, Google runs a large number of crawlers that collect web pages and the links between them; some less well-intentioned crawlers collect things like email addresses in the form foo@bar.com or foo [at] bar [dot] com. There are also custom crawlers that target one specific website. For example, Robbin of JavaEye wrote several blog posts a while ago about dealing with malicious crawlers (the original links seem to have expired), and sites such as "niche software" or LinuxToy have frequently had their entire content crawled and republished under another name. In terms of basic principles, crawlers are very simple: as long as a program can access the network and parse web pages, it can crawl. Most languages now have convenient HTTP client libraries for fetching pages, and the simplest HTML analysis can be done directly with regular expressions, so writing a rudimentary web crawler is actually very easy. Writing a high-quality spider, however, is very difficult.

The first part of crawling is downloading web pages, and there are many issues to consider: how to make the best use of local bandwidth, and how to schedule requests to different sites so as not to overburden the remote servers. In a high-performance crawler system, even DNS queries can become a bottleneck that urgently needs optimization. There are also "rules" that should be respected, such as robots.txt. The analysis after fetching a page is just as complex: the web is full of strange things, including HTML pages with hundreds of errors of every kind, so parsing them all correctly is nearly impossible. With the popularity of AJAX, obtaining content dynamically generated by JavaScript has become another major challenge. The web also contains all sorts of spider traps, created intentionally or not; a crawler that blindly follows hyperlinks will get stuck in such a trap and never get out. For example, after Google claimed that the number of unique URLs on the Internet had reached one trillion, the owner of one website proudly announced that his site would provide the second trillion. :D
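
As a small illustration of the robots.txt point above (my addition, not part of the original article), here is a minimal sketch using Python's standard-library robotparser module (named urllib.robotparser in Python 3); the mindhacks.cn URLs are only placeholders:

import robotparser  # Python 2 name; in Python 3 this lives at urllib.robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://mindhacks.cn/robots.txt')  # placeholder: the site used later in this article
rp.read()
# Only fetch a page if robots.txt allows our user agent to do so.
print(rp.can_fetch('*', 'http://mindhacks.cn/page/2/'))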

However, few people need a crawler as comprehensive as Google's. Usually we write a crawler to crawl one particular website or one particular kind of website, so we can analyze the structure of the target site in advance, which makes things much easier. By analyzing the site and following only the valuable links, we can skip unnecessary links and spider traps. And if the structure of the site allows us to choose an appropriate path, we can crawl the items we are interested in in a fixed order, which lets us skip the bookkeeping needed to detect duplicate URLs.

For example, if we want to crawl the blog posts on pongba's blog mindhacks.cn, we quickly find that there are two kinds of pages we are interested in:

The article list pages, such as the home page or pages whose URL looks like /page/\d+. With Firebug we can see that the link to each article is in an a tag under an h1 (note that the HTML shown on Firebug's HTML panel may differ from what View Source gives you: if JavaScript modifies the DOM tree of the page dynamically, the former is the modified version, and it is also normalized by Firebug, e.g. attributes are always enclosed in quotation marks, whereas the latter is usually the original content your spider fetches; if you analyze the page with regular expressions, or if the HTML parser you use behaves differently from Firefox, pay special attention to this). In addition, the links to the other list pages are in a div whose class is wp-pagenavi.
The article content pages. Every blog post has such a page, for example /2008/09/11/machine-learning-and-ai-resources/, which contains the complete article text; this is what we are really interested in.
Therefore, starting from the home page, we can use the links in wp-pagenavi to reach the other article list pages. In particular, we define a path: follow only the "next page" link. That way we walk through the list pages from start to end in order, and there is no need to check whether a page has already been crawled. The links to the individual articles on each list page point to the content pages whose data we actually want to save.

With this plan, it would not be hard to write an ad hoc crawler in a scripting language to do the job. Today's protagonist, however, is Scrapy, a crawler framework written in Python. It is simple, lightweight, and very convenient, and according to the official website it is already used in actual production, so it is not a toy-level thing. There is no release version yet, though; you can install it directly from the source in their Mercurial repository. It can also be used without installing it, which makes it easy to keep up to date. The documentation is very thorough, so I will not repeat it here.

Scrapy uses the asynchronous networking library Twisted to handle network communication. The architecture is clear, and it provides various middleware interfaces that make it flexible enough for all kinds of requirements. The figure below shows the overall architecture:

[Figure: Scrapy overall architecture]

The green lines are the data flow. Starting from the initial URL, the Scheduler hands it to the Downloader for downloading; after downloading, the response is handed to the Spider for analysis. The Spider produces two kinds of results: links that need further crawling, such as the "next page" link mentioned earlier, which are sent back to the Scheduler; and data to be saved, which is sent to the Item Pipeline, where the data is post-processed (analyzed, filtered, stored, and so on). In addition, various middleware can be installed in the data flow channels to do whatever processing is necessary.

The individual components are described in more detail in the appendix at the end of this article (PS1 and PS2).

It looks complicated, but it is actually very simple to use, much like Rails. First, create a project:

scrapy-admin.py startproject blog_crawl
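
For orientation, here is a rough sketch of the generated layout (my reconstruction from the files referenced later in this article; the exact contents depend on the Scrapy version):

blog_crawl/
  scrapy-ctl.py       # control script for the whole project
  blog_crawl/         # Python package holding the project code
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
      __init__.py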

A blog_crawl directory is created, containing a scrapy-ctl.py, which is the control script for the whole project; all of the code goes into the blog_crawl subdirectory. To crawl mindhacks.cn, we create a mindhacks_spider.py file in the spiders directory and define our Spider in it as follows:

from scrapy.spider import BaseSpider

class MindhacksSpider(BaseSpider):
    domain_name = "mindhacks.cn"
    start_urls = ["http://mindhacks.cn/"]

    def parse(self, response):
        return []

SPIDER = MindhacksSpider()

Our MindhacksSpider inherits from BaseSpider (usually it is easier to inherit directly from the richer scrapy.contrib.spiders.CrawlSpider, but to show how data is parsed we use BaseSpider here). The meaning of the domain_name and start_urls variables is easy to see; the parse method is the callback we need to define, and it is called once the response for the default request arrives. Here we have to parse the page and return two kinds of results (links to crawl further and data to save). This feels a little strange to me: in Scrapy's interface definition the two kinds of results are returned mixed together in one list. It is not clear why it is designed this way; don't they have to be separated again in the end anyway? For now we just write an empty function that returns an empty list. We also define a "global" variable SPIDER, which is instantiated when Scrapy imports this module and is found automatically by the Scrapy engine. With that, we can already run the crawler and try it:

./scrapy-ctl.py crawl mindhacks.cn

There will be a bunch of output, and we can see that http://mindhacks.cn is fetched, since it is the initial URL. But because our parse function returns no URLs to crawl further, the whole crawl ends right after fetching the home page. The next step is to analyze the page. Scrapy provides a very convenient shell (it requires IPython) that lets us experiment. Start it with the following command:

./scrapy-ctl.py shell http://mindhacks.cn

This starts the crawler, fetches the page given on the command line, and then drops into a shell. As the prompt tells us, many ready-made variables are available. One of them is hxs, an HtmlXPathSelector. The HTML pages of mindhacks.cn are fairly standard, so we can analyze them directly with XPath. From Firebug we can see that the link to each blog post is under an h1, so we can test this XPath expression in the shell:

In [1]: hxs.x('//h1/a/@href').extract()
Out[1]: [u'http://mindhacks.cn/2009/07/06/why-you-should-do-it-yourself/',
 u'http://mindhacks.cn/2009/05/17/seven-years-in-nju/',
 u'http://mindhacks.cn/2009/03/28/effective-learning-and-memorization/',
 u'http://mindhacks.cn/2009/03/15/preconception-explained/',
 u'http://mindhacks.cn/2009/03/09/first-principles-of-programming/',
 u'http://mindhacks.cn/2009/02/15/why-you-should-start-blogging-now/',
 u'http://mindhacks.cn/2009/02/09/writing-is-better-thinking/',
 u'http://mindhacks.cn/2009/02/07/better-explained-conflicts-in-intimate-relationship/',
 u'http://mindhacks.cn/2009/02/07/independence-day/',
 u'http://mindhacks.cn/2009/01/18/escape-from-your-shawshank-part1/']

These are exactly the URLs we need. Next, we look for the "next page" link. It sits in a div together with links to several other pages, but the "next page" link has no title attribute, so the XPath can be written as

//div[@class="wp-pagenavi"]/a[not(@title)]

However, if you go one page further you will find that the "previous page" link looks the same, so we also have to check that the text on the link is the next-page arrow u'\xbb'. That could have been written into the XPath as well, but it seems to be a unicode escape character and, for encoding reasons I could not sort out, I simply check it in Python instead. The parse function then looks like this:

def parse(self, response):
    items = []
    hxs = HtmlXPathSelector(response)
    posts = hxs.x('//h1/a/@href').extract()
    items.extend([self.make_requests_from_url(url).replace(callback=self.parse_post)
                  for url in posts])

    page_links = hxs.x('//div[@class="wp-pagenavi"]/a[not(@title)]')
    for link in page_links:
        if link.x('text()').extract()[0] == u'\xbb':
            url = link.x('@href').extract()[0]
            items.append(self.make_requests_from_url(url))

    return items

The first half collects the links to the blog posts we want to crawl, and the second half collects the link to the next page. One thing to note: the entries in the returned list are not URL strings. Scrapy expects Request objects, which can carry more than a bare URL, such as cookies or a callback function. You can see that when the Requests for the blog posts are created, the callback is replaced, because the default callback parse is meant for parsing pages like the article list. parse_post is defined as follows:

def parse_post(self, response):
    item = BlogCrawlItem()
    item.url = unicode(response.url)
    item.raw = response.body_as_unicode()
    return [item]

This is straightforward: it returns a BlogCrawlItem with the captured data inside. We could do some parsing here, for example extract the body and the title with XPath, but I prefer to do that later, e.g. in an Item Pipeline or in a later offline stage. BlogCrawlItem is an empty class that Scrapy automatically defines for us; it inherits from ScrapedItem. In items.py I added a little something:

from scrapy.item import ScrapedItem

class BlogCrawlItem(ScrapedItem):
    def __init__(self):
        ScrapedItem.__init__(self)
        self.url = ''

    def __str__(self):
        return 'BlogCrawlItem(url: %s)' % self.url

A __str__ function is defined that prints only the URL, because the default __str__ displays all of the data; without it, the log output on the console would go crazy while the crawl is running, dumping the content of every captured web page. -.-bb

With that, the data is obtained and only storage is left, which we implement by adding a Pipeline. Since Python ships with SQLite support (sqlite3) in the standard library, I use an SQLite database to store the data. Replace the content of pipelines.py with the following code:

import sqlite3
from os import path

from scrapy.core import signals
from scrapy.xlib.pydispatch import dispatcher

class SQLiteStorePipeline(object):
    filename = 'data.sqlite'

    def __init__(self):
        self.conn = None
        dispatcher.connect(self.initialize, signals.engine_started)
        dispatcher.connect(self.finalize, signals.engine_stopped)

    def process_item(self, domain, item):
        self.conn.execute('insert into blog values(?,?,?)',
                          (item.url, item.raw, unicode(domain)))
        return item

    def initialize(self):
        if path.exists(self.filename):
            self.conn = sqlite3.connect(self.filename)
        else:
            self.conn = self.create_table(self.filename)

    def finalize(self):
        if self.conn is not None:
            self.conn.commit()
            self.conn.close()
            self.conn = None

    def create_table(self, filename):
        conn = sqlite3.connect(filename)
        conn.execute("""create table blog
                 (url text primary key, raw text, domain text)""")
        conn.commit()
        return conn

In the __init__ function we use dispatcher to connect two signals to functions that open and close the database connection (remember to commit before closing; it does not seem to commit automatically, and if you just close the connection it looks like all the data is lost -.-). When data passes through the pipeline, the process_item function is called. Here we simply store the raw data in the database without any processing. If needed, you can add more pipelines to extract and filter the data, for example as sketched below; I will not go into detail here.
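
As an illustration of such an additional pipeline (my sketch, not part of the original project), here is a hypothetical TitleExtractionPipeline that pulls the <title> element out of the raw HTML with a regular expression and attaches it to the item, written against the same old-style process_item(self, domain, item) signature used above:

import re

class TitleExtractionPipeline(object):
    # Hypothetical example: extracts the <title> element from the raw HTML
    # and attaches it to the item before it reaches the storage pipeline.
    def process_item(self, domain, item):
        match = re.search(r'<title>(.*?)</title>', item.raw,
                          re.IGNORECASE | re.DOTALL)
        item.title = match.group(1).strip() if match else u''
        return item

To actually use it, it would have to be listed in ITEM_PIPELINES in front of the SQLite pipeline (and the blog table would need an extra column if the title should be stored as well).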

Finally, we list our pipeline in settings.py:

ITEM_PIPELINES = ['blog_crawl.pipelines.SQLiteStorePipeline']

Run the crawler again. OK!
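
If you want to check what ended up in the database afterwards, a few lines of plain Python with the standard sqlite3 module are enough (my addition; data.sqlite is the file name defined in the pipeline above):

import sqlite3

conn = sqlite3.connect('data.sqlite')
# Print one line per stored blog post: its URL and the domain it came from.
for url, domain in conn.execute('select url, domain from blog'):
    print('%s  (%s)' % (url, domain))
conn.close()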

PS1: Scrapy Components

1. Scrapy Engine

The Scrapy engine controls the data processing flow of the whole system and triggers the individual transactions. See the data processing flow below (PS2) for more details.

2. Scheduler

The scheduler accepts requests from the Scrapy engine, queues and orders them, and returns them to the engine when the engine asks for them.

3. Downloader

The downloader's main responsibility is to fetch web pages and return their content to the spiders.

4. Spiders

A spider is a class defined by the Scrapy user to parse web pages and extract content from the responses of crawled URLs. Each spider can handle one domain name or a group of domain names; in other words, a spider defines the crawling and parsing rules for a particular website.

5. Item Pipeline

The item pipeline's main responsibility is to process the items extracted from web pages by the spiders; its main tasks are cleaning, validating, and storing the data. After a page is parsed by a spider, the resulting items are sent to the item pipeline and processed by its components in a specific order. Each item pipeline component is a Python class with a single simple method; it receives an item, does its work, and decides whether the item should continue to the next step of the pipeline or be dropped without further processing.

An item pipeline typically performs the following steps:

Cleaning the HTML data
Validating the parsed data (checking that the item contains the required fields)
Checking the parsed data for duplicates and dropping repeated items (see the sketch below)
Storing the parsed data in the database
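
For instance, the duplicate check could be a tiny pipeline of its own. The following is a minimal sketch (mine, not from the original article), written against the same old-style process_item(self, domain, item) signature used in the SQLiteStorePipeline above; the DropItem import path is an assumption, since in recent Scrapy versions it lives in scrapy.exceptions and the old version used in this article may differ:

from scrapy.exceptions import DropItem  # assumption: import path may differ in old Scrapy versions

class DuplicatesPipeline(object):
    def __init__(self):
        self.seen_urls = set()

    def process_item(self, domain, item):
        # Drop any item whose URL has already been seen during this crawl.
        if item.url in self.seen_urls:
            raise DropItem('Duplicate item: %s' % item.url)
        self.seen_urls.add(item.url)
        return item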

6. Middlewares

Middlewares are hooks that sit between the Scrapy engine and the other components; their main purpose is to let you extend Scrapy's functionality with custom code.
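
As a small illustration (my sketch, not from the original article), here is what a downloader middleware that sets a default User-Agent header could look like. The process_request hook and the DOWNLOADER_MIDDLEWARES setting mentioned below are taken from current Scrapy documentation and may differ in the old version described in this article:

class DefaultUserAgentMiddleware(object):
    # Called for every request on its way from the engine to the downloader.
    def process_request(self, request, spider):
        # Set a User-Agent only if the request does not carry one already.
        request.headers.setdefault('User-Agent', 'blog_crawl (example bot)')
        return None  # None tells Scrapy to keep processing the request normally

In current Scrapy, such a class would be enabled by listing it with a priority number in the DOWNLOADER_MIDDLEWARES setting in settings.py.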

PS2: Scrapy Data Processing Process

The entire data processing flow of Scrapy is controlled by the Scrapy engine. It mainly operates as follows:

1. The engine opens a domain, locates the spider that handles that domain, and asks the spider for the first URLs to crawl.
2. The engine gets the first URLs to crawl from the spider and schedules them as requests in the scheduler.
3. The engine asks the scheduler for the next URLs to crawl.
4. The scheduler returns the next URLs to crawl to the engine, and the engine sends them to the downloader through the downloader middleware.
5. Once the downloader has downloaded a page, it sends the response back to the engine through the downloader middleware.
6. The engine receives the response from the downloader and sends it to the spider for processing through the spider middleware.
7. The spider processes the response and returns scraped items as well as new requests to the engine.
8. The engine sends the scraped items to the item pipeline and sends the new requests to the scheduler.
9. The process repeats from step 2 until there are no more requests in the scheduler, at which point the engine closes the connection to that domain.


