Scrapy: Easily Build a Custom Web Crawler

A web crawler (Spider) is a robot that crawls across the network. Of course it is not usually a physical robot, since the network itself is a virtual thing, so this "robot" is really a program; and it does not crawl around aimlessly, it has a purpose, and it collects information as it goes. Google, for example, runs a large number of crawlers on the Internet collecting web page content and the links between pages; some malicious crawlers, on the other hand, roam the web harvesting things like email addresses (even ones written as foo [at] bar [dot] com to evade them). There are also custom crawlers aimed at particular sites: a few years ago Robbin of Javaeye wrote several blog posts specifically about dealing with malicious crawlers (the original links seem to be dead, so I won't give them), and sites such as Niche Software or LinuxToy are frequently crawled wholesale and put back up under someone else's name. The basic principle of a crawler is actually very simple: as long as you can access the network and analyze web pages, you can build one. Most languages have convenient HTTP client libraries for fetching pages, and the simplest HTML analysis can be done directly with regular expressions, so writing a rudimentary crawler is a simple thing. Writing a high-quality spider, however, is very hard.
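To show just how rudimentary such a crawler can be, here is a minimal sketch (Python 2, to match the era of the code later in this post) that fetches a single page and pulls out its links with a regular expression; fetch_links is a made-up name, and real-world HTML will break this approach quickly:

import re
import urllib2

def fetch_links(url):
    # fetch the page with the standard-library HTTP client
    html = urllib2.urlopen(url).read()
    # naive link extraction: assumes double-quoted, absolute hrefs
    return re.findall(r'href="(http[^"]+)"', html)

print fetch_links('http://mindhacks.cn/')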

A crawler has two parts. The first is downloading web pages, and there is a lot to consider there: how to make the most of the local bandwidth, and how to schedule requests to different sites so as to lighten the load on the other side's servers. In a high-performance crawler system, DNS lookups also become a bottleneck worth optimizing, and there are some "rules of the road" to follow, such as robots.txt. Analyzing the fetched pages is also very complicated: the Internet is a strange place, HTML pages riddled with every kind of error are out there, and parsing them all completely is nearly impossible. Moreover, with the popularity of AJAX, obtaining content generated dynamically by Javascript has become a big problem. There are also all kinds of intentional or unintentional spider traps on the Internet; if you blindly follow hyperlinks, you will get stuck in a trap. One site, for example, after Google reportedly announced that the number of unique URLs on the Internet had reached 1 trillion, proudly announced that it was hosting the second trillion.
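On the robots.txt "rules of the road": a small sketch using Python 2's standard-library robotparser module (the user agent string here is just an example):

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://mindhacks.cn/robots.txt')
rp.read()

# fetch a URL only if the site's robots.txt allows it for our user agent
if rp.can_fetch('my-crawler', 'http://mindhacks.cn/page/2/'):
    pass  # safe to download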

In practice, however, few people need a general-purpose crawler like Google's. Usually we write a crawler to crawl one particular site, or one particular kind of site. Know your enemy and know yourself, and you will win every battle: we can analyze the structure of the target site in advance, and things become much easier. Through analysis we can pick out the valuable links to follow and avoid many unnecessary links and spider traps; and if the site's structure permits us to choose a suitable path, we can crawl the things we are interested in, in order, so that even the check for repeated URLs can be skipped.

For example, suppose we want to crawl the blog posts on pongba's blog, mindhacks.cn. A quick look shows there are two kinds of pages we are interested in:

    1. Article list pages, such as the homepage or pages whose URL matches /page/\d+/. Through Firebug we can see that the link to each article is in an a tag under an h1. (Note that the HTML shown in Firebug's HTML panel may differ somewhat from what View Source shows: if the page has Javascript that modifies the DOM tree, the former is the modified version, tidied up by Firebug as well, e.g. attributes are quoted, while the latter is usually the original content your spider actually fetches. If you analyze the page with regular expressions, or if your HTML parser and Firefox disagree in some way, pay special attention to this.) In addition, the links to the other list pages are in a div whose class is wp-pagenavi.
    2. Article content pages: every blog post has one, for example /2008/09/11/machine-learning-and-ai-resources/, containing the complete article. This is the content we are interested in. (A small sketch of telling these two URL patterns apart follows this list.)
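A minimal sketch of classifying URLs by the two patterns above; the patterns are inferred from the examples, and classify is a made-up helper name:

import re

LIST_PAGE = re.compile(r'/page/\d+/?$')
POST_PAGE = re.compile(r'/\d{4}/\d{2}/\d{2}/[^/]+/?$')

def classify(url):
    if url.rstrip('/') == 'http://mindhacks.cn' or LIST_PAGE.search(url):
        return 'list'   # article list page
    if POST_PAGE.search(url):
        return 'post'   # article content page
    return 'other'

print classify('http://mindhacks.cn/2008/09/11/machine-learning-and-ai-resources/')  # -> 'post'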

So we start from the homepage and use the links in wp-pagenavi to reach the other article list pages. In particular, we define the path as: follow only the "next page" link. That way we walk through all the list pages in order, from start to finish, and are spared the trouble of checking for repeated fetches. The article links on the list pages then lead to the pages that hold the data we actually want to save.

For a case like this, writing an ad hoc crawler in a scripting language is not hard at all, but today's protagonist is Scrapy, a crawler framework written in Python: simple, lightweight and very convenient, and the official site says it is used in actual production, so it is not a toy-level thing. However, there is no released version yet; you can install it straight from a checkout of their Mercurial repository. You can also use it without installing it, which makes it convenient to update at any time. The documentation is very detailed, so I won't repeat it here.

Scrapy uses Twisted, an asynchronous networking library, to handle network communication. Its structure is clear, and it provides various middleware interfaces through which all kinds of requirements can be met flexibly. The overall architecture looks like this:
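(A rough text stand-in for the architecture diagram, following the description in the next paragraph:)

initial URLs
     |
     v
 Scheduler <------ further Requests (e.g. "next page" links) ------+
     |                                                             |
     v                                                             |
 Downloader --(downloaded pages)--> Spider ------------------------+
                                      |
                                      v
                               Item Pipeline
                (post-processing: analysis, filtering, storage)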

The green lines in the original diagram mark the data flow. Starting from the initial URLs, the Scheduler hands them to the Downloader to fetch; once downloaded, the pages are handed to the Spider for analysis. The Spider produces two kinds of results: links to crawl further, such as the "next page" links analyzed earlier, which are sent back to the Scheduler; and data to be saved, which is delivered to the Item Pipeline, the place where data is post-processed (detailed analysis, filtering, storage and so on). In addition, various middleware can be installed along the data flow to do whatever processing is necessary.

It may look complicated, but using it is simple, much like Rails: first create a new project:

scrapy-admin.py startproject blog_crawl

This creates a blog_crawl directory containing a scrapy-ctl.py script for controlling the whole project, with the project code placed in the blog_crawl subdirectory.
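The resulting layout looks roughly like this; only the files referred to later in this post are shown, and the exact scaffold (including the __init__.py files) depends on the Scrapy version:

blog_crawl/
    scrapy-ctl.py
    blog_crawl/
        items.py
        pipelines.py
        settings.py
        spiders/
            mindhacks_spider.py   # <- the file we add next

To crawl mindhacks.cn, we create this new mindhacks_spider.py inside the spiders directory and define our Spider in it as follows: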

from scrapy.spider import BaseSpider

class MindhacksSpider(BaseSpider):
    domain_name = "mindhacks.cn"
    start_urls = ["http://mindhacks.cn/"]

    def parse(self, response):
        return []

SPIDER = MindhacksSpider()

Our MindhacksSpider inherits from BaseSpider (usually it is more convenient to inherit directly from scrapy.contrib.spiders.CrawlSpider, which has richer features, but BaseSpider is used here to show how the data is parsed). The variables domain_name and start_urls are easy enough to understand, and parse is the callback we need to define: it is called by default when the response to a request arrives, and in it we need to analyze the page and return two kinds of results (links to crawl further, and data to save). What strikes me as somewhat strange is that, in its interface, these two kinds of results are returned mixed together in a single list. It is not clear why it was designed this way; doesn't it have to go to the trouble of separating them again in the end? Anyway, for now we write an empty function that returns only an empty list. We also define a "global" variable SPIDER, which is instantiated when Scrapy imports this module and is automatically picked up by the Scrapy engine. So we can already run the crawler and give it a try:

./scrapy-ctl.py crawl mindhacks.cn

There will be a bunch of output, and you can see that http://mindhacks.cn is fetched, since it is the initial URL; but because our parse function does not return any URLs to crawl further, the whole crawl fetches only the homepage and then ends. The next step is to analyze the page, and for this Scrapy provides a handy shell (which requires IPython) that lets us experiment. Start it with the following command:

./scrapy-ctl.py shell http://mindhacks.cn

It starts the crawler, fetches the page given on the command line, and then drops into a shell. As the prompt tells us, a number of ready-made variables are available. One of them is hxs, an HtmlXPathSelector; mindhacks' HTML pages are fairly well-formed, so they can be analyzed very conveniently with XPath directly. As Firebug shows, the link to each blog post is under an h1, so test this XPath expression in the shell:

In [1]: hxs.x('//h1/a/@href').extract()
Out[1]:
[u'http://mindhacks.cn/2009/07/06/why-you-should-do-it-yourself/',
 u'http://mindhacks.cn/2009/05/17/seven-years-in-nju/',
 u'http://mindhacks.cn/2009/03/28/effective-learning-and-memorization/',
 u'http://mindhacks.cn/2009/03/15/preconception-explained/',
 u'http://mindhacks.cn/2009/03/09/first-principles-of-programming/',
 u'http://mindhacks.cn/2009/02/15/why-you-should-start-blogging-now/',
 u'http://mindhacks.cn/2009/02/09/writing-is-better-thinking/',
 u'http://mindhacks.cn/2009/02/07/better-explained-conflicts-in-intimate-relationship/',
 u'http://mindhacks.cn/2009/02/07/independence-day/',
 u'http://mindhacks.cn/2009/01/18/escape-from-your-shawshank-part1/']

These are exactly the URLs we need. In addition, we can see that the "next page" link is in the same div as the links to several other list pages, but the "next page" link has no title attribute, so the XPath can be written as:

div[@class = "Wp-pagenavi"]/a[not (@title)]

However, if you go back a page, you will find that the "previous page" link looks the same, so we also need to check that the link text is the "next page" arrow u'\xbb'. This could have been written into the XPath as well, but it seems to be the Unicode escape character itself and, for encoding reasons I haven't figured out, it didn't work there, so the check is simply done outside. The final parse function is as follows:

def parse(self, response):
    items = []
    hxs = HtmlXPathSelector(response)

    posts = hxs.x('//h1/a/@href').extract()
    items.extend([self.make_requests_from_url(url).replace(callback=self.parse_post)
                  for url in posts])

    page_links = hxs.x('//div[@class="wp-pagenavi"]/a[not(@title)]')
    for link in page_links:
        if link.x('text()').extract()[0] == u'\xbb':
            url = link.x('@href').extract()[0]
            items.append(self.make_requests_from_url(url))

    return items

The first half extracts the links to the article bodies that need to be fetched, and the second half extracts the link to the next page. One thing to note is that the returned list does not simply hold URL strings: Scrapy wants Request objects, which can carry more than a bare URL, such as cookies or a callback function. You can see that when creating the Requests for the article bodies we replace the callback, because the default callback, parse, is dedicated to parsing pages like the article list, while parse_post is defined as follows:

def parse_post(self, response):
    item = BlogCrawlItem()
    item.url = unicode(response.url)
    item.raw = response.body_as_unicode()
    return [item]
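As an aside on the point that Scrapy wants Request objects rather than bare URL strings: the article requests could equally be built by constructing a Request directly with its callback, instead of make_requests_from_url(...).replace(...). A minimal sketch, assuming the scrapy.http.Request interface of the version used in this post (make_post_request is a made-up helper name):

from scrapy.http import Request

def make_post_request(url, callback):
    # a Request carries more than the URL: here, the callback that should
    # handle the downloaded response (cookies, headers, etc. also fit)
    return Request(url, callback=callback)

# e.g. inside parse():  items.append(make_post_request(url, self.parse_post))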

parse_post itself is very simple: it returns a BlogCrawlItem with the captured data inside. We could do a bit of parsing here, for example extracting the body and the title with XPath, but I prefer to do these things later, for example in an Item Pipeline or in a later offline stage. BlogCrawlItem is an empty class, inheriting from ScrapedItem, that Scrapy automatically defines for us in items.py; here I added a little to it:

from scrapy.item import ScrapedItem

class BlogCrawlItem(ScrapedItem):
    def __init__(self):
        ScrapedItem.__init__(self)
        self.url = ''

    def __str__(self):
        return 'BlogCrawlItem(url: %s)' % self.url

The __str__ function is defined to show only the URL, because the default __str__ would show all the data, and during the crawl you would see the console log spewing madly: that would be it printing the content of every page it fetched. -.-bb

With that, the data is captured, and all that remains is storing it, which we do by adding a Pipeline. Since Python's standard library ships with sqlite3 support, I use an SQLite database to store the data. Replace the contents of pipelines.py with the following code:

import sqlite3
from os import path

from scrapy.core import signals
from scrapy.xlib.pydispatch import dispatcher

class SQLiteStorePipeline(object):
    filename = 'data.sqlite'

    def __init__(self):
        self.conn = None
        dispatcher.connect(self.initialize, signals.engine_started)
        dispatcher.connect(self.finalize, signals.engine_stopped)

    def process_item(self, domain, item):
        self.conn.execute('insert into blog values(?,?,?)',
                          (item.url, item.raw, unicode(domain)))
        return item

    def initialize(self):
        if path.exists(self.filename):
            self.conn = sqlite3.connect(self.filename)
        else:
            self.conn = self.create_table(self.filename)

    def finalize(self):
        if self.conn is not None:
            self.conn.commit()
            self.conn.close()
            self.conn = None

    def create_table(self, filename):
        conn = sqlite3.connect(filename)
        conn.execute("""create table blog
                     (url text primary key, raw text, domain text)""")
        conn.commit()
        return conn

In the __init__ function, dispatcher is used to connect two signals to the designated functions, which initialize and close the database connection respectively (remember to commit before closing: it does not seem to commit automatically, and just calling close appears to lose all the data -.-). When data passes through the pipeline, the process_item function is called; here we simply store the raw data in the database without any processing. If you need to, you can add more pipelines to extract data, filter it, and so on; I won't go into that here.
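And a small sketch of the "later offline stage" mentioned earlier: reading the saved pages back out of the data.sqlite file created by the pipeline above for further parsing (Python 2, like the rest of the code here):

import sqlite3

conn = sqlite3.connect('data.sqlite')
for url, raw, domain in conn.execute('select url, raw, domain from blog'):
    # `raw` is the page HTML saved by parse_post; extract titles/bodies here
    print url, domain
conn.close()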

Finally, list our pipeline in settings.py:

ITEM_PIPELINES = ['blog_crawl.pipelines.SQLiteStorePipeline']

Run the crawler again, and we're done! Finally, a summary: a high-quality crawler is an extremely complex project, but with good tools it is fairly easy to build a dedicated crawler. Scrapy is a very lightweight crawler framework that greatly simplifies the crawler development process. In addition, Scrapy's documentation is very detailed; if you feel my introduction left something out or unclear, I recommend reading its Tutorial.
