Scrapy Learning Notes (IV) – Scrapy Two-Way Crawling

Source: Internet
Author: User

Summary: Describes how to use Scrapy for two-way crawling of a classified-ads site.

The so-called two-way crawl refers to the following situation: I want to crawl data from a classified-ads site, for example the rental listings section. I open the index page of that section and want to crawl the detail page of every entry listed on it (vertical crawling), then jump to the next index page via the pager (horizontal crawling), crawl the details of every entry on that second page, and so on until the last entry.

This defines a two-way crawl:

    • Horizontal direction – from one index page to another index page

    • Vertical direction – from an index page to an entry's detail page

In this section:

The XPath expression that takes us from an index page to the next index page is: '//*[contains(@class,"next")]//@href'

The XPath expression that takes us from an index page to an entry's detail page is: '//*[@itemprop="url"]/@href'
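Before wiring these XPath expressions into a spider, it is worth sanity-checking them, either with scrapy shell against one of the index pages or, as in the minimal sketch below, against a hand-written HTML fragment (the fragment is invented here and only mimics the structure of the book's sample pages):

from scrapy import Selector

# A made-up fragment shaped like the sample site's index pages
html = (
    '<ul class="pagination"><li class="next">'
    '<a href="index_00001.html">next</a></li></ul>'
    '<a itemprop="url" href="property_000000.html">Lovely flat</a>'
)

sel = Selector(text=html)
print(sel.xpath('//*[contains(@class,"next")]//@href').extract())  # ['index_00001.html']
print(sel.xpath('//*[@itemprop="url"]/@href').extract())           # ['property_000000.html']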

Source code address of the manual.py file:

https://github.com/Kylinlin/scrapybook/blob/master/ch03%2Fproperties%2Fproperties%2Fspiders%2Fmanual.py

Copy the previous basic.py file to manual.py and make the following changes:

    • Import Request: from scrapy.http import Request

    • Change the spider's name to manual

    • Change start_urls to 'http://web:9312/properties/index_00000.html'

    • Rename the original parse function to parse_item, and create a new parse function with the following code:

# This function extracts the hyperlink of each item detail page on the index page,
# as well as the hyperlink of the next index page
def parse(self, response):
    # Get the next index URLs and yield Requests
    next_selector = response.xpath('//*[contains(@class,"next")]//@href')
    for url in next_selector.extract():
        yield Request(urlparse.urljoin(response.url, url))
        # Request() is given no callback here, and the default callback is the parse
        # function itself, so the statement above is equivalent to:
        # yield Request(urlparse.urljoin(response.url, url), callback=self.parse)

    # Get item URLs and yield Requests
    item_selector = response.xpath('//*[@itemprop="url"]/@href')
    for url in item_selector.extract():
        yield Request(urlparse.urljoin(response.url, url), callback=self.parse_item)
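For reference, the whole manual.py then has roughly the shape sketched below. This is only a sketch: parse_item is abbreviated to a single assumed 'title' field, and class and item names may differ from the linked source, so check the GitHub file above for the real code. Note that besides Request, the file also needs urlparse for urljoin:

import urlparse  # Python 2, as in the book; on Python 3 use response.urljoin() instead

import scrapy
from scrapy.http import Request

from properties.items import PropertiesItem  # the project's item class from items.py


class ManualSpider(scrapy.Spider):
    name = "manual"
    allowed_domains = ["web"]

    # Start on the first index page of the rental listings
    start_urls = ['http://web:9312/properties/index_00000.html']

    def parse(self, response):
        # Horizontal crawl: the next index page, handled again by parse()
        for url in response.xpath('//*[contains(@class,"next")]//@href').extract():
            yield Request(urlparse.urljoin(response.url, url))

        # Vertical crawl: each item detail page, handled by parse_item()
        for url in response.xpath('//*[@itemprop="url"]/@href').extract():
            yield Request(urlparse.urljoin(response.url, url),
                          callback=self.parse_item)

    def parse_item(self, response):
        # Field extraction as in the earlier basic.py; only an assumed 'title'
        # field is shown here
        item = PropertiesItem()
        item['title'] = response.xpath('//*[@itemprop="name"]/text()').extract()
        yield item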

Running manual directly would crawl all of the pages. Since this is still the test phase, we tell the spider to stop after scraping a specific number of items, using the setting -s CLOSESPIDER_ITEMCOUNT=10.

Run command: $ scrapy crawl manual -s CLOSESPIDER_ITEMCOUNT=10


The spider runs like this: first, a request is issued for the URL in start_urls, and the downloader returns a response (the response contains the page's source code and other information). The spider then automatically calls the parse function with that response as its argument.

This is how the parse function runs:

1. First, extract from the response the href of the element whose class attribute contains "next" (the "next page" link in the pager); on the first run this is 'index_00001.html'.

2. In the first for loop, build the full URL ('http://web:9312/scrapybook/properties/index_00001.html'), construct a Request object from it, and put that object into a queue (at this point it is the first element of the queue).

3. Next, extract from the response the hyperlinks of the elements whose itemprop attribute equals "url" (the detail page of each entry), for example 'property_000000.html'.

4. In the second for loop, build the full URL (for example 'http://web:9312/scrapybook/properties/property_000000.html'), construct a Request object from it with parse_item as the callback, and put that object into the same queue.

5. At this point the queue looks like this:

Request(http://...index_00001.html)
Request(http://...property_000000.html)
...
Request(http://...property_000029.html)

6. When the hyperlink of the last item detail page (property_000029.html) has been put into the queue, the scheduler starts processing it. It works from the back of the queue: the last element is handed to the downloader first and its response is passed to the callback function (parse_item), and so on, all the way forward to the first element (index_00001.html). That request has no callback specified, so the default callback, the parse function itself, is used, which takes us back to step 1; this time the extracted hyperlink is 'index_00002.html', and the loop continues.
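To make the scheduling order concrete, here is a toy illustration in plain Python. It is not Scrapy's actual scheduler, just the last-in-first-out behaviour described in the steps above:

# Toy model of the queue built in steps 1-5
queue = []
queue.append('index_00001.html')            # step 2: the next index page goes in first
for i in range(30):                         # step 4: then the 30 detail pages
    queue.append('property_%06d.html' % i)

# The scheduler works from the back of the queue (LIFO), so every detail page
# is downloaded before the next index page is touched.
while queue:
    print(queue.pop())                      # property_000029 ... property_000000, index_00001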

The parse function is roughly equivalent to this non-generator version:

next_requests = []
for url in ...:
    next_requests.append(Request(...))
for url in ...:
    next_requests.append(Request(...))
return next_requests

You can see that the biggest benefit of the LIFO queue is that the spider starts working on the entries of an index page as soon as that index page has been processed, instead of building up an extremely long queue of pending requests, which saves memory. Does the parse function above still feel a bit hard to follow? There is in fact a simpler way: for this kind of two-way crawl you can use the crawl template.

First, create a spider named easy from the crawl template on the command line:

$ scrapy genspider -t crawl easy web

Open the generated file:

...
class EasySpider(CrawlSpider):
    name = 'easy'
    allowed_domains = ['web']
    start_urls = ['http://www.web/']

    rules = (
        Rule(LinkExtractor(allow=r'items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        ...

You can see that the code above was generated automatically. Note that this spider inherits from the CrawlSpider class, and CrawlSpider already provides a default parse function, so we do not need to write one ourselves; we only need to configure the rules variable as follows:

rules = (
    Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"next")]')),
    Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
         callback='parse_item'),
)
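Putting it together, easy.py then looks roughly like the sketch below. The import paths shown are those of recent Scrapy releases; the Scrapy version used in the book imports the same classes from scrapy.contrib.spiders and scrapy.contrib.linkextractors. The start_urls value is assumed to point at the same index page as manual.py, and parse_item is left as a stub:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class EasySpider(CrawlSpider):
    name = 'easy'
    allowed_domains = ['web']
    start_urls = ['http://web:9312/properties/index_00000.html']  # assumed, as in manual.py

    rules = (
        # Horizontal: follow the pager's "next" links (no callback, so they are only followed)
        Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"next")]')),
        # Vertical: hand each item detail page to parse_item
        Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
             callback='parse_item'),
    )

    def parse_item(self, response):
        # Same field extraction as in manual.py's parse_item (omitted here)
        pass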

Run command: $ scrapy crawl easy -s CLOSESPIDER_ITEMCOUNT=90

This approach differs from the previous one in the following ways:

    • The two XPath expressions differ from the ones used earlier in that they no longer contain the a and href constraints. Because LinkExtractor is designed specifically for extracting hyperlinks, it automatically reads the href attribute of a tags; you can extract links from other tags or attributes by changing the tags and attrs parameters of LinkExtractor.

    • Also note that the value of callback here is a string, not a reference to the function.

    • When a callback is set in Rule(), the spider does not, by default, follow the other hyperlinks on the pages that rule targets (meaning those pages are handled by the callback and the crawl stops there). If you set a callback and still want the links followed, either set the follow parameter to True or return/yield those links from the callback function, as sketched below.
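For example, if you wanted the item-page rule to keep following links found on the detail pages as well (not something this example needs, just an illustration of the follow parameter):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = (
    Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"next")]')),
    Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
         callback='parse_item',
         follow=True),  # follow links on the item pages too, even though a callback is set
)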
