Using Scrapy to crawl Ajax sites

Source: Internet
Author: User

The target site is http://www.ccgp-hubei.gov.cn. Inspecting the HTML shows that the page contains an iframe, and the iframe holds the site's actual content, so the first step is to find the iframe's real URL.
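As a rough illustration of that first step, the sketch below pulls the src of every iframe out of a page and resolves it against the page URL, using only the Python standard library. The sample markup and the names IframeSrcParser and find_iframe_urls are assumptions made for this example; the real page's markup may differ. Inside a Scrapy callback the same idea can be expressed with response.xpath('//iframe/@src') and response.urljoin().

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeSrcParser(HTMLParser):
    """Collects the src attribute of every <iframe> tag in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_iframe_urls(html, base_url):
    """Return absolute URLs for all iframes found in the given HTML."""
    parser = IframeSrcParser()
    parser.feed(html)
    return [urljoin(base_url, src) for src in parser.sources]


# Hypothetical markup for illustration only; the real page may differ.
SAMPLE_HTML = (
    '<html><body>'
    '<iframe src="/fnoticeAction!listFNotice.action"></iframe>'
    '</body></html>'
)

if __name__ == "__main__":
    for url in find_iframe_urls(SAMPLE_HTML, "http://www.ccgp-hubei.gov.cn"):
        print(url)
```

The resolved URL is the one to use as the spider's starting point, rather than the outer page that only wraps the iframe.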

Take http://www.ccgp-hubei.gov.cn/fnoticeAction!listFNotice.action as an example. The main structure of this page is a paginated list, with "previous page", "next page", and similar buttons. Inspecting the HTML elements of these two buttons shows that each one calls a JS function.
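One way to confirm that pagination is driven by JavaScript rather than by ordinary links is to list the anchors whose href starts with "javascript:" or that carry an onclick handler. The following is a minimal sketch using the standard library. The markup and the function name gotoPage are invented for illustration and are not taken from the real site.

```python
from html.parser import HTMLParser


class JsLinkParser(HTMLParser):
    """Records the JavaScript attached to <a> tags that do not use plain URLs."""

    def __init__(self):
        super().__init__()
        self.js_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = (attributes.get("href") or "").strip()
        onclick = (attributes.get("onclick") or "").strip()
        if href.lower().startswith("javascript:"):
            self.js_links.append(href)
        elif onclick:
            self.js_links.append(onclick)


def find_js_links(html):
    """Return the JavaScript snippets that drive navigation in the given HTML."""
    parser = JsLinkParser()
    parser.feed(html)
    return parser.js_links


# Hypothetical markup; gotoPage is an invented name, not the site's function.
SAMPLE_HTML = (
    '<a href="javascript:gotoPage(1)">Previous page</a>'
    '<a href="#" onclick="gotoPage(3)">Next page</a>'
    '<a href="/about.html">About</a>'
)

if __name__ == "__main__":
    for snippet in find_js_links(SAMPLE_HTML):
        print(snippet)
```

Links found this way cannot be followed by requesting their href, which is why a plain link-following crawl does not move past the first page.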

At first I thought that, since this is a JS function, I could solve it with the scrapyjs + Splash combination, but experiments revealed a problem. Inside the Splash script I could change things such as document.title = "Hello", but after calling the JS function that navigates to another page, what came back was still the HTML of the first page, whether I read it with splash:html() or with document.body.innerHTML. I could not tell whether the JS function had failed to run or whether Splash itself was at fault. Searching online showed that many people had hit the same problem, but nobody had a workable solution. Finally, an expert on GitHub told me that a splash:mouse_click() function was under development for Splash and that I would just have to wait.

Out of options, I changed my approach: I pressed F12 to open the browser's developer tools and clicked the "Next page" button to see what the browser actually did, as shown in the figure:

This is the data the browser sends in its POST request. I tried sending the same data directly to the server:

# -*- coding: utf-8 -*-
from scrapy.spiders import Spider
from scrapy.http import FormRequest
from scrapy.shell import inspect_response


class ThirdSpider(Spider):
    name = 'thirdspider'
    download_delay = 0
    start_urls = ['http://www.ccgp-hubei.gov.cn/fnoticeAction!listFNotice.action']

    def parse(self, response):
        # Form fields taken from the POST data seen in the browser.
        # The server may be case-sensitive, so copy the field names and
        # values exactly as the developer tools show them.
        formdata = {
            "queryinfo.begintime1": "",
            "queryinfo.begintime2": "",
            "queryinfo.cgfs": "",
            "queryinfo.cglx": "",
            "queryinfo.endtime1": "",
            "queryinfo.endtime2": "",
            "queryinfo.fbrmc": "",
            "queryinfo.gglx": "",
            "queryinfo.qybm": "",
            "queryinfo.title": "",
            "queryinfo.curpage": "2",  # the page being requested
            "queryinfo.pagesize": "A",
            "rank": "",
        }
        # Submit the page's form with these values as a POST request.
        yield FormRequest.from_response(response,
                                        formdata=formdata,
                                        callback=self.parse_item)

    def parse_item(self, response):
        # Open an interactive shell to inspect the returned page.
        inspect_response(response, self)

The response shows that the list has advanced to the second page, which solves the problem.
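The same idea extends from page 2 to any page: only the page field changes between requests. The sketch below builds the form data for a range of pages and shows the encoded POST body, using only the standard library. The field names are written in lowercase and reduced to three for brevity, which is an assumption; the exact names and casing should be copied from the request shown in the browser. The helper names build_formdata and formdata_for_pages are also made up for this example.

```python
from urllib.parse import urlencode

# Empty search filters, mirroring a subset of the fields the browser posts.
# The spelling and casing of these names is an assumption; copy the real
# ones from the request shown in the browser's developer tools.
EMPTY_FILTERS = {
    "queryinfo.title": "",
    "rank": "",
}
PAGE_FIELD = "queryinfo.curpage"


def build_formdata(page, filters=EMPTY_FILTERS):
    """Return the form fields that request the given 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    formdata = dict(filters)          # copy so the defaults stay untouched
    formdata[PAGE_FIELD] = str(page)  # form values are sent as strings
    return formdata


def formdata_for_pages(first, last):
    """Return one form-data dict per page from first to last, inclusive."""
    return [build_formdata(page) for page in range(first, last + 1)]


if __name__ == "__main__":
    # Show the URL-encoded body that would be posted for page 2.
    print(urlencode(build_formdata(2)))
    # Show how many requests a crawl of pages 2 through 5 would need.
    print(len(formdata_for_pages(2, 5)))
```

In the spider, each dictionary could be passed as the formdata argument of FormRequest.from_response, in the same way the single request for page 2 is built above.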
