A previous post, http://zhouxi2010.iteye.com/blog/1450177, introduced using Scrapy to crawl web pages, but it only covered following ordinary HTML links; pages fetched through AJAX requests were not captured. Since AJAX requests are very common in real applications, this post records a method for crawling AJAX pages: find the URL that the page's JavaScript requests, issue that request yourself from the spider, and parse the response in a second callback.
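The core pattern, stripped of Scrapy itself, can be sketched as a chain of two callbacks that hand a partially filled item along. All names, URLs, and fields below are illustrative stand-ins, not taken from a real spider, and the synchronous `fetch` function merely simulates what the crawler's downloader does asynchronously:

```python
# Sketch of the pattern: the first callback parses the HTML page, then
# requests the AJAX URL it discovered, passing the half-filled item along
# (in Scrapy this hand-off happens via request.meta).

def parse_page(fetch):
    item = {'title': 'some book'}      # fields scraped from the HTML page
    # The AJAX URL would be found by watching the browser's network traffic;
    # this one is made up for the sketch.
    ajax_url = 'http://example.com/callback.php?ajax=true'
    body = fetch(ajax_url)             # the crawler would do this asynchronously
    return parse_ajax(body, item)

def parse_ajax(body, item):
    # Extract whatever you need from the XHR response body.
    item['ajaxdata'] = body.strip()
    return item

# A simulated fetcher standing in for the downloader:
fake_fetch = lambda url: '  price: 9.99  '
print(parse_page(fake_fetch))
```

In real Scrapy code the two callbacks never call each other directly; the framework schedules the second request and delivers its response, which is why the item must travel inside the request's metadata.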
As before, the spider lives in spiders/book.py:
```python
class BookSpider(CrawlSpider):
    ...

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = BookItem()
        ...
        # The URL here is made up; change it for real use.
        url = "http://test_url/callback.php?ajax=true"
        # Having found the AJAX URL, issue a request for it; the framework
        # executes the request and hands the response to the callback.
        request = Request(url, callback=self.parse_ajax)
        request.meta['item'] = item
        yield request

    def parse_ajax(self, response):
        data = response.body
        # Write a regex here, or use an XPath selector, to capture the data
        # you want (omitted).
        ajaxdata = get_data(data)
        # The response may be JavaScript; one could drive a JS interpreter
        # from Python, but it is simpler (if lazier) to convert it via JSON.
        if ajaxdata:
            x = '{"data": "' + ajaxdata.replace('\n', '') + '"}'
            ajaxdata = simplejson.loads(x)['data']
        else:
            ajaxdata = ''
        item = response.meta['item']
        item['ajaxdata'] = ajaxdata
        for key in item:
            if isinstance(item[key], unicode):
                item[key] = item[key].encode('utf8')
        # At this point every field of the item has been captured, so return
        # the item to have it saved.
        return item
```
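The JSON-wrapping trick in `parse_ajax` can be tried in isolation. The helper below is hypothetical (the post's `get_data()` regex step is assumed to have already isolated the interesting substring), and it uses the standard-library `json` module on Python 3 where the post uses `simplejson` on Python 2; the idea is the same:

```python
import json

def extract_ajax_payload(raw):
    """Wrap the raw AJAX response text in a JSON object so the parser can
    decode escape sequences (e.g. \\uXXXX), then pull the value back out.
    Note: text containing unescaped quotes or stray backslashes would break
    this naive wrapping, a limitation the original approach shares."""
    if not raw:
        return ''
    wrapped = '{"data": "' + raw.replace('\n', '') + '"}'
    return json.loads(wrapped)['data']

print(extract_ajax_payload('line1\nline2'))            # -> line1line2
print(extract_ajax_payload('hello \\u4e16\\u754c'))    # unicode escapes decoded
```

A sturdier alternative, when the endpoint returns real JSON rather than a JS fragment, is simply `json.loads(response.body)` with no wrapping at all.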