Summary: Describes how to use Scrapy for two-way crawling (on classified-information / listing sites).
The so-called two-way crawl refers to the following situation: I want to crawl data from a classified-information site, for example the rental listings column. Starting from the column's index page, I first crawl the detail page of every entry listed on it (the vertical crawl), then jump to the next index page through the pager (the horizontal crawl), crawl the details of every entry on that page, and so on until the last entry.
That is what this section means by a two-way crawl.
In this section:
The XPath that extracts the link to the next index page from an index page is: '//*[contains(@class,"next")]//@href'
The XPath that extracts the links to the entry detail pages from an index page is: '//*[@itemprop="url"]/@href'
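These expressions can be checked interactively before writing the spider. A quick sanity test in the Scrapy shell might look like this (assuming the book's test server at web:9312 is running and the index page URL below is correct):

$ scrapy shell http://web:9312/properties/index_00000.html
>>> response.xpath('//*[contains(@class,"next")]//@href').extract()
>>> response.xpath('//*[@itemprop="url"]/@href').extract()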
Source code address of the manual.py file:
https://github.com/Kylinlin/scrapybook/blob/master/ch03%2Fproperties%2Fproperties%2Fspiders%2Fmanual.py
Copy the previous basic.py file to a new file named manual.py and make the following changes:
Import Request: from scrapy.http import Request (the parse code below also calls urlparse.urljoin, so urlparse needs to be imported as well)
Change the spider's name to manual
Change start_urls to 'http://web:9312/properties/index_00000.html'
Rename the original parse function to parse_item and create a new parse function with the following code:
# This function extracts, from an index page, the hyperlink to the next
# index page and the hyperlinks to each item detail page
def parse(self, response):
    # Get the next index URLs and yield Requests
    next_selector = response.xpath('//*[contains(@class,"next")]//@href')
    for url in next_selector.extract():
        yield Request(urlparse.urljoin(response.url, url))
        # Request() is given no callback here, so the default callback
        # (the parse function) is used; the statement above is equivalent to:
        # yield Request(urlparse.urljoin(response.url, url), callback=self.parse)

    # Get item URLs and yield Requests
    item_selector = response.xpath('//*[@itemprop="url"]/@href')
    for url in item_selector.extract():
        yield Request(urlparse.urljoin(response.url, url),
                      callback=self.parse_item)
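On newer Scrapy versions (1.4 and later) the same logic can be written without urlparse by using response.follow(), which resolves relative URLs itself. This is just an alternative sketch, not the book's code:

def parse(self, response):
    # Horizontal crawl: the pager's "next" link; no callback, so self.parse is used
    for url in response.xpath('//*[contains(@class,"next")]//@href').extract():
        yield response.follow(url)
    # Vertical crawl: hand each item detail page to parse_item
    for url in response.xpath('//*[@itemprop="url"]/@href').extract():
        yield response.follow(url, callback=self.parse_item)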
If you run manual directly it will crawl all of the pages. Since we are only testing at this point, we tell the spider to stop after crawling a specific number of items, via the parameter -s CLOSESPIDER_ITEMCOUNT=10.
Run command: $ scrapy crawl manual -s CLOSESPIDER_ITEMCOUNT=10
Its output is as follows:
The spider runs like this: first a request is made for the URL in start_urls, and the downloader returns a response (which contains the page source and other information). The spider then automatically calls the parse function with that response as its argument.
This is how the parse function runs:
1. First, extract from the response the href of the tag whose class attribute contains "next" (the "next page" link in the pager); on the first run this is 'index_00001.html'.
2. In the first for loop, build the full URL ('http://web:9312/scrapybook/properties/index_00001.html'), construct a Request object from it, and put that object into a queue (at this point it is the first element of the queue).
3. Then extract from the response the hrefs of the tags whose itemprop attribute equals "url" (the detail page of each entry), for example 'property_000000.html'.
4. In the second for loop, build the full URL for each of these (for example 'http://web:9312/scrapybook/properties/property_000000.html'), construct a Request object from it, and put it into the same queue, after the earlier elements.
5. At this point the queue looks like this:
Request(http://...index_00001.html)
Request(http://...property_000000.html)
...
Request(http://...property_000029.html)
6. Once the last item detail page hyperlink (property_000029.html) has been placed into the queue, the scheduler starts processing it. Requests are taken from the back of the queue (last in, first out): each one is sent to the downloader and its response is passed to the callback function (parse_item), until the first element (index_00001.html) is reached. Since that request has no callback specified, the default callback, the parse function itself, is used, and we are back at step 1; this time the extracted hyperlink is 'index_00002.html', and the whole process loops from there.
The parse function executes as if it had been written to collect the requests into a list and return it:

next_requests = []
for url in ...:
    next_requests.append(Request(...))
for url in ...:
    next_requests.append(Request(...))
return next_requests
As you can see, the biggest benefit of using a LIFO queue is that the spider starts working on the entries of an index page as soon as that page has been processed, instead of building up an extra-long queue first, which saves memory. If the parse function above feels a bit cumbersome to write, there is a simpler way: for this kind of two-way crawl you can use the crawl template.
First, create a spider named easy from the crawl template on the command line:
$ scrapy genspider -t crawl easy web
Open the generated file:
...
class EasySpider(CrawlSpider):
    name = 'easy'
    allowed_domains = ['web']
    start_urls = ['http://www.web/']

    rules = (
        Rule(LinkExtractor(allow=r'items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        ...
You can see that the above code is generated automatically. Note that the spider inherits from the CrawlSpider class, which already provides a parse function by default, so we do not need to write one ourselves; we only need to configure the rules variable:
rules = (
    Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"next")]')),
    Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
         callback='parse_item'),
)
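Putting the pieces together, the edited easy.py could look roughly like the sketch below; the start_urls value and the field extracted in parse_item are illustrative assumptions, not the book's exact code:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class EasySpider(CrawlSpider):
    name = 'easy'
    allowed_domains = ['web']
    start_urls = ['http://web:9312/properties/index_00000.html']  # assumption

    rules = (
        # Horizontal crawl: no callback, so follow defaults to True and the
        # pager's "next" links are crawled recursively
        Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"next")]')),
        # Vertical crawl: each item detail page is handed to parse_item
        Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
             callback='parse_item'),
    )

    def parse_item(self, response):
        # Illustrative extraction of a single field from the detail page
        yield {'title': response.xpath('//*[@itemprop="name"]/text()').extract()}

Note how the two rules divide the work: the first one only follows links, the second one only parses pages.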
Run command: $ scrapy crawl easy -s CLOSESPIDER_ITEMCOUNT=90
This method differs from the previous one in the following ways:
The two XPath expressions differ from the ones used earlier in that they no longer contain the a and href constraints. This is because LinkExtractor is designed specifically for extracting hyperlinks, so it automatically picks up the href values of a tags. You can, of course, extract hyperlinks from other tags or attributes by setting the tags and attrs parameters of LinkExtractor.
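For instance, LinkExtractor can be pointed at a different tag/attribute pair. The snippet below is an illustration only (not from the book): it extracts URLs from img/src instead of the default a/href; deny_extensions=[] is needed because common image extensions are otherwise filtered out by default:

from scrapy.linkextractors import LinkExtractor

image_links = LinkExtractor(tags=['img'], attrs=['src'], deny_extensions=[])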
Also note that the value of the callback here is a string, not a reference to the function.
When a callback is set in a Rule(), the spider does not, by default, follow the other hyperlinks found in the pages matched by that rule (that is, the links extracted from those pages will not be crawled further; the crawl ends at such pages). If you have set a callback but still want the links to be followed, you can either set the follow parameter to True, or return/yield those requests from the callback function yourself.
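For instance (an illustrative variant, not the book's final rules), to have the detail pages both parsed and followed further, the second rule could be written as:

Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
     callback='parse_item', follow=True)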
Scrapy Learning Notes (4): Scrapy two-way crawling