Learning Scrapy Notes (vi): Scrapy Processing JSON APIs and AJAX Pages

Source: Internet
Author: User

Summary: Describes how to use Scrapy to process JSON APIs and AJAX pages

Sometimes you will find that the page you want to crawl has no HTML source code. For example, open http://localhost:9312/static/ in the browser, right-click on an empty area of the page, and select "View page source":

You'll find it is blank.

Notice that a file named api.json is referenced in the red line, so open the Network panel in the browser's developer tools and find the request named api.json.

In the red box you can find the content of the original page: it is a simple JSON API. Some more complex APIs will require you to log in first, send POST requests, or will return more interesting data structures. Python provides a library for parsing JSON that can turn JSON data into Python objects with the statement json.loads(response.body).
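That statement can be sketched in isolation. The sample body below is hypothetical, standing in for what the demo server returns:

```python
import json

# Hypothetical sample of a JSON API response body (a list of entries),
# shaped like the api.json served by the book's demo server
body = b'[{"id": 0, "title": "set unique family well"}, {"id": 1, "title": "dup to pay"}]'

# json.loads turns the JSON data into Python objects: here, a list of dicts
js = json.loads(body)
print(len(js))                      # 2
print(js[0]["id"], js[0]["title"])  # 0 set unique family well
```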

Source code address of the api.py file:

https://github.com/Kylinlin/scrapybook/blob/master/ch05%2Fproperties%2Fproperties%2Fspiders%2Fapi.py

Copy the manual.py file, rename it to api.py, and make the following changes:

    • Modify the spider name to api

    • Modify start_urls to the URL of the JSON API, as follows:

start_urls = ('http://web:9312/properties/api.json',)

If you need to log in before acquiring this JSON API, use the start_requests() function (refer to Learning Scrapy Notes (v): Scrapy Login to Websites).

    • Modify the parse function, as follows:

def parse(self, response):
    base_url = "http://web:9312/properties/"
    js = json.loads(response.body)
    for item in js:
        id = item["id"]
        # build a full URL for each entry
        url = base_url + "property_%06d.html" % id
        yield Request(url, callback=self.parse_item)
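As a quick check of the URL pattern used here, %06d zero-pads the numeric id to six digits, matching filenames like property_000029.html on the demo server:

```python
base_url = "http://web:9312/properties/"

# %06d zero-pads the id to six digits when building each entry's URL
for entry_id in (0, 29):
    print(base_url + "property_%06d.html" % entry_id)
# prints:
# http://web:9312/properties/property_000000.html
# http://web:9312/properties/property_000029.html
```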

The js variable above is a list; each element represents an entry. You can verify this using the Scrapy shell tool:

scrapy shell http://web:9312/properties/api.json

Run the spider: scrapy crawl api

You can see that a total of 31 requests were sent and 30 items were obtained (the extra request is the one that fetches api.json itself).

Looking again at the js variable in the Scrapy shell, you can see that besides the id field, each entry also contains a title field. So the parse function can also extract the title and pass its value on to the parse_item function, which uses it to populate the item (eliminating the step of extracting the title with XPath in parse_item). Modify the parse function as follows:

title = item["title"]
# the meta variable is a dictionary that is used to pass data to the callback function
yield Request(url, meta={"title": title}, callback=self.parse_item)

In the parse_item function, you can extract this field from the response:

l.add_value('title', response.meta['title'],
            MapCompose(unicode.strip, unicode.title))
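For illustration, here is a minimal pure-Python sketch of what MapCompose(unicode.strip, unicode.title) does to the value (using str in place of unicode, as on Python 3). The map_compose helper below is a hypothetical stand-in, not Scrapy's actual implementation:

```python
# Hypothetical stand-in for Scrapy's MapCompose processor: apply each
# function in turn to every value in the list, dropping None results
def map_compose(*funcs):
    def _process(values):
        for f in funcs:
            values = [f(v) for v in values if v is not None]
        return values
    return _process

# str.strip removes surrounding whitespace, str.title capitalizes each word
process = map_compose(str.strip, str.title)
print(process(["  set unique family well  "]))  # ['Set Unique Family Well']
```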
