The target site is http://www.ccgp-hubei.gov.cn. Checking the HTML showed that the page contains an iframe, and the iframe holds the site's actual content, so the first step is to find the real URL.
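Finding the iframe's address can also be done in code rather than by reading the page source by hand. The following is only a minimal sketch using the Python standard library; the class name IframeFinder, the function iframe_urls, and the sample markup are hypothetical and are not taken from the real site:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeFinder(HTMLParser):
    """Collect the src attribute of every <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_urls(html, base_url):
    """Return the absolute URLs of all iframes found in html."""
    finder = IframeFinder()
    finder.feed(html)
    # Resolve relative src values against the page's own URL.
    return [urljoin(base_url, src) for src in finder.sources]


if __name__ == "__main__":
    # Hypothetical markup standing in for the real home page.
    sample = '<html><body><iframe src="/list.action"></iframe></body></html>'
    print(iframe_urls(sample, "http://www.ccgp-hubei.gov.cn"))
```

Each URL returned this way is a candidate for the spider's start_urls.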
Take http://www.ccgp-hubei.gov.cn/fnoticeAction!listFNotice.action as an example. The main structure of this URL is a paginated list, with "Previous page" and "Next page" buttons, among others. Inspecting the HTML elements of these buttons shows that each one calls a JS function.
At first I thought that, since this is a JS function, I could solve it with the ScrapyJS + Splash combination, but experiments turned up a problem. Inside the Splash script, changes such as document.title = "Hello" do take effect, but after calling the JS function that jumps to another page, what comes back is still the HTML of the first page, whether I use splash:html() or document.body.innerHTML. I don't know whether the JS function failed to run or whether the problem is in Splash itself. Searching online, I found that many people have run into this problem, but there was no workable solution. Finally, an expert on GitHub told me that a splash:mouse_click() function is under development in Splash and that I should wait patiently.
With no other option, I changed my approach: open the developer tools (F12), click the "Next page" button, and see what the browser does, as shown in the figure:
This is the data the browser sends in its POST request, so I tried sending the same data directly to the server:
# -*- coding: utf-8 -*-
from scrapy.spiders import Spider
from scrapy.http import FormRequest
from scrapy.shell import inspect_response


class ThirdSpider(Spider):
    name = 'Thirdspider'
    download_delay = 0
    start_urls = ['http://www.ccgp-hubei.gov.cn/fnoticeAction!listFNotice.action']

    def parse(self, response):
        # POST fields as captured from the browser; the names must match
        # the captured request exactly, including letter case.
        formdata = {
            "queryinfo.begintime1": "",
            "queryinfo.begintime2": "",
            "queryinfo.cgfs": "",
            "queryinfo.cglx": "",
            "queryinfo.endtime1": "",
            "queryinfo.endtime2": "",
            "queryinfo.fbrmc": "",
            "queryinfo.gglx": "",
            "queryinfo.qybm": "",
            "queryinfo.title": "",
            "queryinfo.curpage": "2",
            "queryinfo.pagesize": "A",
            "rank": "",
        }
        # Submit the form found in the response with these values.
        yield FormRequest.from_response(response, formdata=formdata, callback=self.parse_item)

    def parse_item(self, response):
        # Open a Scrapy shell to inspect the returned page.
        inspect_response(response, self)
The result is as follows: the page number has moved to the second page, and the problem is solved.
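The same idea should extend beyond the second page by generating the POST data for any page number. The sketch below is only an illustration under assumptions: build_formdata and EMPTY_FILTERS are hypothetical names, the lower-case spelling of the field names is assumed and has to match what the browser's developer tools show, the page size of 20 in the usage line is a made-up value, and it is assumed that only the current-page field changes between pages.

```python
from urllib.parse import urlencode

# Field names from the captured POST data; their exact spelling and
# letter case are an assumption and must match the browser's request.
EMPTY_FILTERS = [
    "queryinfo.begintime1", "queryinfo.begintime2", "queryinfo.cgfs",
    "queryinfo.cglx", "queryinfo.endtime1", "queryinfo.endtime2",
    "queryinfo.fbrmc", "queryinfo.gglx", "queryinfo.qybm",
    "queryinfo.title", "rank",
]


def build_formdata(page, page_size):
    """Return the POST fields for one list page (hypothetical helper)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # All search filters stay empty, as in the captured request.
    formdata = {name: "" for name in EMPTY_FILTERS}
    # Form values are sent as text, so convert the numbers to strings.
    formdata["queryinfo.curpage"] = str(page)
    formdata["queryinfo.pagesize"] = str(page_size)
    return formdata


if __name__ == "__main__":
    # Print the url-encoded body that would be posted for pages 2 to 4.
    # The page size of 20 is a made-up value for illustration.
    for page in range(2, 5):
        print(urlencode(build_formdata(page, 20)))
```

Inside the spider, each dictionary could be passed as formdata to FormRequest.from_response, one request per page, in the same way the parse method does for page 2.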