Web crawling with Scrapy


Scrapy is a framework for crawling websites. All you need to do is define a spider for the target site, the rules for crawling it, and the data you want to extract; Scrapy takes care of the rest of the complex work, such as issuing concurrent requests and saving the extracted data.

Scrapy's authors claim they "stole" their inspiration from Django. Although the two frameworks aim at different things, if you already know Django the structure of Scrapy will feel very familiar. Scrapy also has the concept of a project: a project can contain multiple spiders, the item definitions describing the data structures to crawl, and some configuration.

The Scrapy crawl flow: a spider defines the sites to crawl and stores the data it extracts into items; a pipeline then takes the data out of the items and saves it to a file or a database. (See the Scrapy tutorial.)

Here we follow the example in the Scrapy tutorial and capture data from the Open Directory Project (DMOZ).

First, create a new project called dmoz:

scrapy startproject dmoz

A directory called dmoz will be created with the following structure:

dmoz/
   scrapy.cfg   
   dmoz/
       __init__.py
       items.py
       pipelines.py
       settings.py
       spiders/
           __init__.py
           ...
scrapy.cfg: the project configuration file (you can basically leave it alone)
items.py: defines the data structures to be extracted
pipelines.py: pipeline definitions, used to further process the data extracted into items
settings.py: holds the project configuration
spiders/: the directory that holds the spiders

Then, in items.py, define the data we want to crawl:

from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

Here we want to get the title, link, and description of each entry on the DMOZ pages, so we define a corresponding item structure. Unlike model definitions in Django, there are not many kinds of field; there is only one, called Field(). Slightly more advanced, a Field can accept a default value.
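
As a quick illustration (a minimal sketch, not part of the original tutorial): an Item instance behaves much like a dictionary, so once DmozItem is defined you can assign and read its fields with the usual bracket syntax.

from dmoz.items import DmozItem   # assumes the project module is named dmoz

item = DmozItem()
item['title'] = 'Example book'        # only fields declared on DmozItem may be set
item['link'] = 'http://example.com/'
print(item['title'])                  # -> 'Example book'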

Next, start writing the spider:

A spider is just a Python class that inherits from scrapy.spider.BaseSpider, with three members that must be defined: name, the identifier of this spider; start_urls, a list of URLs from which the spider starts crawling; and parse(), a method that is called with the content of each page fetched from start_urls, parses the page, and returns either the next pages that need to be crawled or a list of items (for returning both, see the FAQ below).

So create a new spider, dmoz_spider.py, in the spiders directory:

from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz.org"
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)

Next, extract the data into the items, mainly using XPath to extract data from the web pages:

Scrapy has two XPath selectors, HtmlXPathSelector and XmlXPathSelector, one for HTML and one for XML. An XPath selector has three methods:

select(xpath): returns a list of selectors relative to the currently selected node (one XPath expression may select multiple nodes)
extract(): returns the string(s) of the node(s) corresponding to the selector(s)
re(regex): returns a list of strings (the grouped matches) for a regular expression

A good way to do this is to test the XPath inside the shell:

scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
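
For example, inside the shell (which, in the old Scrapy versions this article targets, provides an HtmlXPathSelector named hxs for the fetched page) you can try the three selector methods; the exact expressions here are only illustrative:

hxs.select('//title/text()').extract()       # list containing the page title text
hxs.select('//ul/li/a/@href').extract()      # href attributes of the links in the listing
hxs.select('//title/text()').re(r'(\w+):')   # groups matched by a regex inside the title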

Now modify the parse() method to see how to extract the data into the items:

# requires: from scrapy.selector import HtmlXPathSelector
#           and: from dmoz.items import DmozItem
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//ul/li')
    items = []
    for site in sites:
        item = DmozItem()
        item['title'] = site.select('a/text()').extract()
        item['link'] = site.select('a/@href').extract()
        item['desc'] = site.select('text()').extract()
        items.append(item)
    return items

Finally, save the crawled data:

Scrapy provides several options for saving the scraped data, for example as a JSON, CSV, or XML file. To run the spider defined above in dmoz_spider.py (note that its name is dmoz.org) and save the crawled data as JSON, execute the following command in the dmoz directory:

scrapy crawl dmoz.org --set FEED_URI=items.json --set FEED_FORMAT=json

If you need to further process the item data, for example to save it directly to a database, use pipelines.

FAQ not covered in the manual

How is continuously crawling the next links implemented, and how do the items get saved?

A word about the parse() method: parse can return either a list of Requests or a list of items. If Requests are returned, they are put into the queue of pages to crawl next; if items are returned, they are handed to the pipelines (or saved directly if the default feed exporter is used). So if parse() returns the next links to crawl, how do the items get returned and saved? A Request object accepts a callback parameter that specifies which parse function should handle the content of the page that the request returns (in fact, the default callback for start_urls is the parse method), so you can have parse() return Requests and designate another method, parse_item, to return the items:

# requires: from scrapy.http import Request
def parse(self, response):
    # do something
    return [Request(url, callback=self.parse_item)]

def parse_item(self, response):
    # item['key'] = value
    return [item]

As for the return value of these parse functions, instead of returning a list you can also use a generator, which is equivalent:

def parse(self, response):
    # do something
    yield Request(url, callback=self.parse_item)

def parse_item(self, response):
    yield item

How do you pass a value between parse functions?

A common situation: some fields of an item are filled in inside parse(), but other values need to be extracted in parse_item(), so the item built in parse() has to be handed over to parse_item() for further processing; obviously you cannot just add an extra parameter to parse_item(). The Request object accepts a meta parameter, a dictionary, and the Response object has a meta attribute that returns the meta dictionary passed in by the corresponding request. So the problem above can be solved like this:

def parse(self, response):
    # item = ItemClass()
    yield Request(url, meta={'item': item}, callback=self.parse_item)

def parse_item(self, response):
    item = response.meta['item']
    item['field'] = value
    yield item

How do you use pipelines.py?

See http://doc.scrapy.org/topics/item-pipeline.html for the details; you only need to enable the defined pipeline components in settings.py. What may be confusing is what effect pipelines have on item processing if the default feed exporter is also specified. The answer is that pipelines replace the default feed exporter: every item returned by the spiders in the project (for example from parse_item) is eventually passed into the process_item() method defined in the pipelines for further processing.
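
As a rough sketch of what such a pipeline could look like (the class name DmozPipeline and the module path are assumptions for illustration, not part of the original tutorial):

# dmoz/pipelines.py -- a hypothetical pipeline that drops items without a title
from scrapy.exceptions import DropItem

class DmozPipeline(object):
    def process_item(self, item, spider):
        if not item.get('title'):
            raise DropItem("missing title in %s" % item)
        return item

# settings.py -- enable the pipeline (older Scrapy versions use a list of class
# paths like this; newer versions use a dict mapping the path to a priority)
ITEM_PIPELINES = ['dmoz.pipelines.DmozPipeline']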

Other Tricks

How do I handle extract() returning an empty list?

Because the extract() method returns a list of strings, an empty list is often encountered when the selector does not match the content of any node:

item['field'] = ex_data[0].strip() if len(ex_data) > 0 else ''

A better way to handle it:

item['field'] = ''.join(ex_data).strip()

How do I set a default value for an XPath selection?

When XPath selects the text inside a node, if the node's content is empty, XPath does not return an empty string; it returns nothing at all, which in the resulting list means the corresponding entry is simply missing. Sometimes an empty string is needed as a default value in this situation. XPath has a concat() function that achieves this effect:

text = hxs.select('concat(//span/text(), "")').extract()

For an empty span, an empty string is returned.

scrapy.log is a very useful debugging tool.

You need to set LOG_LEVEL in settings.py; the default is 'DEBUG', so everything each item picks up is printed to the screen during the crawl, and when a lot of content is scraped, occasional abnormal messages can get drowned out. Sometimes you therefore need to set a higher level, such as 'WARNING', and then in the spider use log.msg('some info', level=log.WARNING) wherever you need to output the useful information.
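
A minimal sketch of what that looks like with the old scrapy.log API this article uses (newer Scrapy versions rely on the standard logging module instead):

from scrapy import log

# Emit the message at WARNING level so it stays visible even when
# LOG_LEVEL in settings.py has been raised above DEBUG.
log.msg("finished parsing a DMOZ listing page", level=log.WARNING)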

Another convenient debugging method: invoke the interactive shell environment from inside the spider.

Insert the following where you want to break for debugging:

from scrapy.shell import inspect_response
inspect_response(response)

This interrupts the crawl and drops you into a shell, where response holds the content of the URL currently being crawled.
