Python Scrapy makes it easy to build customized web crawlers.

Tags: xpath, python, scrapy

Web crawlers (spiders) are robots that crawl the network. They are usually not physical robots, of course, since the network itself is a virtual thing, so this "robot" is really just a program, and it does not crawl around chaotically: it crawls with a purpose and collects certain information along the way. For example, Google runs a large number of crawlers that collect page content and the links between pages across the Internet; crawlers with less honorable motives collect things like foo@bar.com or foo [at] bar [dot] com email addresses. There are also custom crawlers aimed at one particular website: javaeye's Robbin, for instance, wrote several blog posts about malicious crawlers a while back (the original links seem to have expired), and sites such as Niche Software or LinuxTOY regularly find their entire contents crawled and republished under another name. In terms of basic principles, a crawler is very simple: it only needs to access the network and analyze web pages. Most languages now have convenient HTTP client libraries for fetching pages, and the most basic HTML analysis can be done directly with regular expressions, so writing the simplest possible web crawler is very easy. Writing a high-quality spider, however, is very hard.
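To make the "simplest web crawler" end of that spectrum concrete, here is a deliberately naive sketch using nothing but the Python 3 standard library and a regular expression (the target URL is only an example); it is a rough illustration, not something the framework discussed below relies on:

import re
import urllib.request

def fetch_links(url):
    # Download the page: no politeness, no error handling, the bare minimum.
    html = urllib.request.urlopen(url).read().decode('utf-8', errors='replace')
    # "Parse" it with a regular expression; fine for a toy, not for real-world HTML.
    return re.findall(r'href="(http[^"]+)"', html)

if __name__ == '__main__':
    for link in fetch_links('http://mindhacks.cn/'):
        print(link)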

The first part of a crawler is downloading web pages, and there is a lot to think about: how to make the most of the local bandwidth, and how to schedule requests to different sites so as not to overburden the remote servers. In a high-performance crawler system, even DNS lookups can become a bottleneck in urgent need of optimization, and there are also rules to respect (such as robots.txt). Analyzing the pages once you have them is just as complicated: the Internet is full of oddities and broken HTML, and parsing all of it reliably is nearly impossible. On top of that comes the popularity of Ajax, plus the various intentional and unintentional spider traps scattered around the web; a crawler that blindly follows hyperlinks will get stuck in them. For example, after Google announced that the number of unique URLs on the Internet had reached 1 trillion, one such site proudly declared that it was hosting the second trillion.
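Of the rules just mentioned, robots.txt is the easiest one to honor, and the Python standard library already knows how to read it; a minimal check might look like the sketch below (the user-agent name and URLs are only placeholders):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://mindhacks.cn/robots.txt')   # placeholder target site
rp.read()

# Fetch a page only if the site's robots.txt allows it for our user agent.
if rp.can_fetch('my-crawler', 'http://mindhacks.cn/page/2/'):
    print('allowed to crawl')
else:
    print('disallowed by robots.txt')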

However, few people need a crawler as general as Google's. Usually we build a crawler to crawl one specific website, or a certain kind of website, and in that case we can analyze the structure of the target site in advance, which makes things much easier. By analyzing the structure and following only the valuable links, we can avoid unnecessary links and spider traps. And if the site's structure lets us pick a suitable path, we can visit the things we are interested in in a fixed order, which spares us from having to check for duplicate URLs at all.

For example, if we want to crawl the blog posts on pongba's blog mindhacks.cn, we quickly find that there are two kinds of pages we care about:

  1. Article list pages, such as the homepage or URLs of the form /page/\d+/.
     Firebug shows that the link to each article sits in an a tag under an h1. (Note that the HTML shown in Firebug's HTML panel can differ slightly from what View Source shows: if the page contains JavaScript that modifies the DOM, the former is the modified version, normalized by Firebug, for example with attribute values wrapped in quotes, while the latter is usually the raw content your spider actually fetches. If you analyze pages with regular expressions, or with an HTML parser that behaves differently from Firefox, you need to watch out for these differences.) In addition, a div whose class is wp-pagenavi contains links to the other list pages.
  2. Article content pages. Every blog post has one, for example
     /2008/09/11/machine-learning-and-ai-resources/, which contains the complete article text; these are the pages we are actually interested in.

Therefore, starting from the homepage, we can follow the links in wp-pagenavi to reach the other article list pages. More precisely, we define a path: follow only the "next page" link, so that the list pages are visited in order from start to finish, and we are spared the trouble of checking for repeated crawling. The pages behind the individual article links on each list page are then the data pages we really want to save.

Put this way, it would not be hard to write an ad hoc crawler in a scripting language to finish the job, but today's protagonist is Scrapy, a crawler framework written in Python that is simple, lightweight, and very convenient. According to its website it is already used in real production, so it is not a toy. There is no released version yet, though; you can install it straight from the source in their Mercurial repository, or simply use it without installing, which makes it easy to stay up to date. The documentation is very thorough, so I will not repeat it here.

Scrapy uses Twisted, an asynchronous networking library, to handle network communication. Its architecture is clean, with various middleware interfaces that can flexibly accommodate all kinds of requirements. The overall architecture is as follows:

The green lines are the data flow. Starting from the initial URLs, the Scheduler hands requests to the Downloader to fetch; once downloaded, the response is handed to the Spider for analysis. The spider produces two kinds of results: links that need further crawling, such as the "next page" link analyzed earlier, which are sent back to the Scheduler; and data to be saved, which is sent to the Item Pipeline, where it is post-processed (analyzed in detail, filtered, stored, and so on). In addition, various middleware can be installed along the data flow to do whatever processing is necessary.
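As a rough mental model of that loop (a toy sketch only, not Scrapy's actual implementation; the function names here are invented), the flow can be pictured like this: a scheduler queue feeds a downloader, the spider callback yields either new URLs or items, new URLs go back into the queue, and items fall through to a pipeline:

from collections import deque

def crawl(start_urls, download, parse, store):
    # scheduler -> downloader -> spider, with links fed back and items stored
    scheduler = deque(start_urls)
    seen = set(start_urls)
    while scheduler:
        url = scheduler.popleft()
        response = download(url)               # the downloader
        for result in parse(url, response):    # the spider callback
            if isinstance(result, str):        # a link to crawl further
                if result not in seen:
                    seen.add(result)
                    scheduler.append(result)
            else:                              # an item to save
                store(result)                  # the item pipeline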

It looks rather involved, but using it is actually very simple, much as with Rails. First, create a project:

scrapy-admin.py startproject blog_crawl

This creates a blog_crawl directory. Inside it, scrapy-ctl.py is the control script for the whole project, and all the code goes into the blog_crawl subdirectory. To crawl mindhacks.cn, we create mindhacks_spider.py in the spiders subdirectory and define our spider as follows:

from scrapy.spider import BaseSpider

class MindhacksSpider(BaseSpider):
    domain_name = "mindhacks.cn"
    start_urls = ["http://mindhacks.cn/"]

    def parse(self, response):
        return []

SPIDER = MindhacksSpider()

Our MindhacksSpider inherits from BaseSpider (usually it is more convenient to inherit directly from scrapy.contrib.spiders.CrawlSpider, which is richer in features, but to show how data is parsed we stick with BaseSpider here). The variables domain_name and start_urls are easy to understand, and parse is the callback we need to define: it is called once the default request has received its response, and this is where we analyze the page. It returns two kinds of results, links to crawl further and data to save, which strikes me as a little odd: in the interface definition the two are returned mixed together in a single list. It is not clear why it was designed this way; don't they have to be separated again in the end anyway? For now we just write an empty function that returns an empty list. We also define a global variable SPIDER, which is instantiated when scrapy imports this module and is automatically picked up by the scrapy engine. With that in place, we can run the crawler and give it a try:

./scrapy-ctl.py crawl mindhacks.cn

This produces a pile of output, in which you can see that http://mindhacks.cn was fetched, since it is the initial URL. Because our parse function returns no URLs to crawl further, the whole run simply fetches the homepage and ends. The next step is to analyze the page, and scrapy provides a very convenient shell for experimenting with this (IPython is required). Start it with the following command:

./scrapy-ctl.py shell http://mindhacks.cn

This starts the crawler, fetches the page given on the command line, and then drops you into a shell. As the prompt explains, many ready-made variables are available; one of them is hxs, an HtmlXPathSelector. The HTML on mindhacks is fairly well-formed, so we can analyze it directly with XPath. Firebug shows that the link to each blog article sits under an h1, so we test the following XPath expression in the shell:

In [1]: hxs.x('//h1/a/@href').extract()
Out[1]: [u'http://mindhacks.cn/2009/07/06/why-you-should-do-it-yourself/',
 u'http://mindhacks.cn/2009/05/17/seven-years-in-nju/',
 u'http://mindhacks.cn/2009/03/28/effective-learning-and-memorization/',
 u'http://mindhacks.cn/2009/03/15/preconception-explained/',
 u'http://mindhacks.cn/2009/03/09/first-principles-of-programming/',
 u'http://mindhacks.cn/2009/02/15/why-you-should-start-blogging-now/',
 u'http://mindhacks.cn/2009/02/09/writing-is-better-thinking/',
 u'http://mindhacks.cn/2009/02/07/better-explained-conflicts-in-intimate-relationship/',
 u'http://mindhacks.cn/2009/02/07/independence-day/',
 u'http://mindhacks.cn/2009/01/18/escape-from-your-shawshank-part1/']

These are exactly the URLs we need. We can also see that the link to the next page sits, together with the links to several other list pages, in the same div, but unlike the others the "next page" link has no title attribute, so the XPath can be written as:

//div[@class="wp-pagenavi"]/a[not(@title)]

However, if you page backwards once, you will find that the "previous page" link has no title attribute either, so we additionally need to check that the text of the link is the "next page" arrow, u'\xbb'. This could have been written into the XPath itself, but the character seems to be a unicode escape and something went wrong because of encoding (I did not dig into why), so it is simpler to extract the link and do the comparison in code. The final parse function looks like this:

def parse(self, response):
    items = []
    hxs = HtmlXPathSelector(response)
    posts = hxs.x('//h1/a/@href').extract()
    items.extend([self.make_requests_from_url(url).replace(callback=self.parse_post)
                  for url in posts])

    page_links = hxs.x('//div[@class="wp-pagenavi"]/a[not(@title)]')
    for link in page_links:
        if link.x('text()').extract()[0] == u'\xbb':
            url = link.x('@href').extract()[0]
            items.append(self.make_requests_from_url(url))

    return items

The first half builds the requests for the blog post pages we want to crawl, and the second half builds the request for the next list page. One thing to note is that the entries in the returned list are not URL strings: what scrapy wants are Request objects, which can carry more than a bare URL, such as cookies or a callback function. You can see that when creating the Requests for the posts we replace the callback, because the default callback parse is reserved for parsing pages like the article list, while parse_post is defined as follows:

def parse_post(self, response):
    item = BlogCrawlItem()
    item.url = unicode(response.url)
    item.raw = response.body_as_unicode()
    return [item]

It is very simple: it returns a BlogCrawlItem with the captured data stuffed inside. You could do some parsing here, for example extracting the body and the title with XPath, but I prefer to do that later, in an item pipeline or in a later offline stage. BlogCrawlItem is an empty class, inheriting from ScrapedItem, that scrapy defined for us automatically in items.py; here I add a little to it:

from scrapy.item import ScrapedItem

class BlogCrawlItem(ScrapedItem):
    def __init__(self):
        ScrapedItem.__init__(self)
        self.url = ''

    def __str__(self):
        return 'BlogCrawlItem(url: %s)' % self.url

I define a __str__ that prints only the URL, because the default __str__ would display all of the data, which means that during a crawl the console log would be flooded with the content of every captured page. -.-bb

With that, the data is obtained, and only storage remains, which we implement with a pipeline. Since Python ships with sqlite3 in its standard library, I use an SQLite database to store the data. Replace the content of pipelines.py with the following code:

import sqlite3
from os import path

from scrapy.core import signals
from scrapy.xlib.pydispatch import dispatcher

class SQLiteStorePipeline(object):
    filename = 'data.sqlite'

    def __init__(self):
        self.conn = None
        dispatcher.connect(self.initialize, signals.engine_started)
        dispatcher.connect(self.finalize, signals.engine_stopped)

    def process_item(self, domain, item):
        self.conn.execute('insert into blog values(?,?,?)',
                          (item.url, item.raw, unicode(domain)))
        return item

    def initialize(self):
        if path.exists(self.filename):
            self.conn = sqlite3.connect(self.filename)
        else:
            self.conn = self.create_table(self.filename)

    def finalize(self):
        if self.conn is not None:
            self.conn.commit()
            self.conn.close()
            self.conn = None

    def create_table(self, filename):
        conn = sqlite3.connect(filename)
        conn.execute("""create table blog
                     (url text primary key, raw text, domain text)""")
        conn.commit()
        return conn

In the __init__ function, dispatcher connects two signals to the functions that initialize and close the database connection. (Remember to commit in the closing function: it does not appear to commit automatically, and if you simply close the connection all the data seems to be lost. -.-bb) When data passes through the pipeline, the process_item function is called; here we just store the raw data in the database without any processing. If needed, you can add extra pipelines to extract and filter the data, but I will not go into that here.
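As an example of what such an extra pipeline might look like (a hypothetical sketch with an invented name, using the same process_item(domain, item) signature as the pipeline above), a stage that merely trims the captured page before it reaches the SQLite pipeline could be as small as this:

class StripRawPipeline(object):
    # Hypothetical cleanup stage: runs before SQLiteStorePipeline if listed
    # first in ITEM_PIPELINES, and only strips surrounding whitespace.
    def process_item(self, domain, item):
        item.raw = item.raw.strip()
        return item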

Finally, list our pipeline in settings.py:

ITEM_PIPELINES = ['blog_crawl.pipelines.SQLiteStorePipeline']
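Once the crawler has been re-run with this setting in place, a few lines with the same sqlite3 module are enough to confirm what ended up in data.sqlite (a minimal check, assuming the blog table created above):

import sqlite3

conn = sqlite3.connect('data.sqlite')
# The table was created as (url text primary key, raw text, domain text).
for url, domain in conn.execute('select url, domain from blog'):
    print(domain, url)
conn.close()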

Run the crawler again, and everything works. Finally, to sum up: a high-quality crawler is an extremely complex project, but with a good tool in hand, building a dedicated crawler is much easier. Scrapy is a very lightweight crawler framework that greatly simplifies crawler development, and its documentation is also very detailed; if you feel I have glossed over something here, I recommend reading its Tutorial.
