Summary: Describes how to use Scrapy to process JSON APIs and AJAX pages
Sometimes you will find that the page you want to crawl does not contain its data in the HTML source code. For example, open http://localhost:9312/static/ in the browser, right-click on an empty area of the page, and select "View page source":
You'll find it is blank.
Notice that a file named api.json is referenced (the red line in the screenshot), so open the network panel in the browser's developer tools and find the request named api.json.
In the red box you can find the contents of the original page: it is a simple JSON API. Some complex APIs will require you to log in first, send POST requests, or return more interesting data structures. Python provides a library for parsing JSON that can transform JSON data into Python objects with json.loads(response.body).
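To make the json.loads step concrete, here is a minimal sketch using Python's standard json module; the sample payload below is hypothetical, only mimicking the shape of api.json:

```python
import json

# Hypothetical response body, shaped like the entries in api.json
body = b'[{"id": 0, "title": "set unique family well"}, {"id": 1, "title": "belfast"}]'

items = json.loads(body)     # transforms the JSON data into a Python list of dicts
first_id = items[0]["id"]    # fields are then accessed like any dict
print(first_id, items[0]["title"])
```

In a spider you would pass response.body instead of the hard-coded bytes above.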
Source code address of the api.py file:
https://github.com/Kylinlin/scrapybook/blob/master/ch05%2Fproperties%2Fproperties%2Fspiders%2Fapi.py
Copy the manual.py file, rename it to api.py, and make the following changes:
start_urls = ('http://web:9312/properties/api.json',)
If you need to log in before acquiring this JSON API, use the start_requests() function (refer to Learning Scrapy Notes (v) - Scrapy login website).
- Modify the parse function:
def parse(self, response):
    base_url = "http://web:9312/properties/"
    js = json.loads(response.body)
    for item in js:
        id = item["id"]
        # build a full URL for each entry
        url = base_url + "property_%06d.html" % id
        yield Request(url, callback=self.parse_item)
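The URL-building step above can be checked in plain Python. The entries below are hypothetical stand-ins for the parsed JSON list; the point is how "%06d" zero-pads each id to six digits:

```python
# Sketch of the URL construction from parse(), with made-up ids
base_url = "http://web:9312/properties/"
js = [{"id": 0}, {"id": 7}, {"id": 29}]  # stand-in for json.loads(response.body)

# "%06d" left-pads the id with zeros to six digits
urls = [base_url + "property_%06d.html" % item["id"] for item in js]
print(urls[0])  # http://web:9312/properties/property_000000.html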
The js variable above is a list in which each element represents one entry; this can be verified using the Scrapy shell tool:
scrapy shell http://web:9312/properties/api.json
Run the spider: scrapy crawl api
You can see that a total of 31 requests were sent and 30 items were obtained (one request for api.json itself, plus one per entry).
On closer inspection with the Scrapy shell, you can see that besides the id field, each entry in the js variable also contains a title field. So the parse function can extract the title as well and pass its value to the parse_item function, where it is used to populate the item (eliminating the step of extracting the title with XPath in parse_item). Modify the parse function as follows:
title = item["title"]
# the meta attribute is a dictionary used to pass data to the callback function
yield Request(url, meta={"title": title}, callback=self.parse_item)
In the Parse_item function, you can extract this field from the response
l.add_value('title', response.meta['title'], MapCompose(unicode.strip, unicode.title))
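To show what the processors in add_value() do to the title pulled from response.meta, here is a simplified stand-in for Scrapy's MapCompose (the real one also flattens results and drops None values). The book targets Python 2 (unicode.strip / unicode.title); on Python 3 the equivalent built-ins are str.strip and str.title:

```python
# Simplified sketch of MapCompose: apply each function to every value in turn
def map_compose(*funcs):
    def process(values):
        for f in funcs:
            values = [f(v) for v in values]
        return values
    return process

# strip whitespace, then title-case, as in the add_value() call
clean = map_compose(str.strip, str.title)
print(clean(["  set unique family well  "]))  # ['Set Unique Family Well']
```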