When reprinting, please cite the source: http://www.cnblogs.com/codefish/p/4993809.html
People in the group have recently been asking a lot about handling Ajax and JavaScript. As we all know, many pages now load their content dynamically. That makes for a good user experience, but it causes considerable trouble for crawlers, because a direct GET of the page URL does not return the dynamically loaded content. So how do we deal with this?
Start with a straightforward analysis. When crawling data we usually consider the mobile version of a site first, because data is relatively easy to get from mobile sites, especially WAP sites: their paging mostly does not use Ajax paging or waterfall-flow loading, so they are easier to crawl. Beyond that, by inspecting the request headers we can easily find the address of the Ajax request. Call that address directly and check whether it returns normal data. Then switch to a different browser and call it once more. If the data is still normal, the URL is probably public, and as long as you keep the access frequency reasonable you can get the data easily. Below I use an example to illustrate how to analyze the request.
1. Open the target website and see how it loads:
https://www.abdserotec.com/primary-antibodies-monoclonal-polyclonal.html#productType=Monoclonal%20Antibody
2. Analyze the website
When I open the site, it is obvious that the data is loaded as you scroll down: reaching the bottom of the list triggers an Ajax event that requests more data. So let's look at what actually happens when that request is sent.
We get the request address:
https://api.uk-plc.net/product_tables/v1/abd?filter={%22producttype%22:{%22$in%22:[%22monoclonal%20antibody%22]}}&skip=360&limit=40&sort=[[%22specificitysummary%22,1],[%22host%22,1],[%22uniquename%22,1],[%22format%22,1]]
Then I open it directly in the browser:
The response is clearly a familiar JSON-formatted string.
Not so fast. Next we need to open this API URL in a different browser. Why? Because the browser we have been using sends the request with certain parameters attached (for example, session state). Switching browsers clears them. If the request still returns data, the API is itself public and the administrator has not put any filtering on it.
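The "switch browsers" check can also be approximated in code by sending the request from a bare client that carries no cookies or browser headers. The following is only a sketch under assumptions: it uses the standard-library urllib rather than requests to stay dependency-free, the helper names looks_like_public_json and check_endpoint are hypothetical, and "status 200 plus a body that parses as JSON" is a heuristic, not proof that the interface is public.

```python
import json
import urllib.error
import urllib.request


def looks_like_public_json(status_code, body_text):
    """Heuristic: a 200 response whose body parses as JSON suggests the
    endpoint answered without needing cookies or session state."""
    if status_code != 200:
        return False
    try:
        json.loads(body_text)
    except ValueError:
        return False
    return True


def check_endpoint(url):
    """Fetch url with a bare client (no cookies, no browser headers)."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            body = response.read().decode("utf-8", errors="replace")
            return looks_like_public_json(response.status, body)
    except urllib.error.URLError:
        # Covers HTTP error statuses (403, 404, ...) and connection failures.
        return False
```

Calling check_endpoint with the API address found above would perform the same test as opening it in a second browser; a False result may mean the interface needs cookies, a referer, or some other header.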
3. Analyze the parameters further
OK, so this interface can be accessed directly. But how do we page through it?
Let's look at the parameters in the URL:
https://api.uk-plc.net/product_tables/v1/abd?filter={%22producttype%22:{%22$in%22:[%22monoclonal%20antibody%22]}}&skip=360&limit=40&sort=[[%22specificitysummary%22,1],[%22host%22,1],[%22uniquename%22,1],[%22format%22,1]]
skip=360
limit=40
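To make the parameters easier to read, the query string can be split and decoded with the standard library. This is a minimal sketch: describe_query is a hypothetical helper name, and it assumes that any value which parses as JSON (such as filter and sort here) should be shown decoded.

```python
import json
from urllib.parse import parse_qs, urlsplit

API_URL = (
    "https://api.uk-plc.net/product_tables/v1/abd"
    "?filter={%22producttype%22:{%22$in%22:[%22monoclonal%20antibody%22]}}"
    "&skip=360&limit=40"
    "&sort=[[%22specificitysummary%22,1],[%22host%22,1],"
    "[%22uniquename%22,1],[%22format%22,1]]"
)


def describe_query(url):
    """Return the query parameters, JSON-decoding the values that hold JSON."""
    raw = parse_qs(urlsplit(url).query)
    decoded = {}
    for name, values in raw.items():
        value = values[0]  # each parameter appears only once in this URL
        try:
            decoded[name] = json.loads(value)
        except ValueError:
            decoded[name] = value  # keep non-JSON values as plain strings
    return decoded


params = describe_query(API_URL)
for name, value in params.items():
    print(name, "=", value)
```

Seen this way, filter and sort are JSON structures that stay the same between requests, while skip and limit are the two plain numbers that control paging.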
This resembles the way paging is done with LINQ in C#, so I make a bold assumption:
limit is page_count, the number of records per page
skip is the number of records to skip, that is, the records on all the preceding pages
page_index is the page number, starting from 1
So the parameters for fetching a given page are:
skip = (page_index - 1) * page_count
limit = 40
For example, skip=360 with limit=40 corresponds to page_index = 10.
Verify that data still comes back with these values.
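The paging arithmetic can be written out as two small functions. This is a sketch of the assumed relationship only; the names page_to_skip and skip_to_page are hypothetical, and page numbers are taken to start at 1.

```python
def page_to_skip(page_index, page_count):
    """Number of records to skip to reach the first record of a 1-based page."""
    if page_index < 1 or page_count < 1:
        raise ValueError("page_index and page_count must be at least 1")
    return (page_index - 1) * page_count


def skip_to_page(skip, page_count):
    """Inverse mapping: the 1-based page that begins at the given skip value."""
    if skip < 0 or page_count < 1:
        raise ValueError("skip must be >= 0 and page_count at least 1")
    return skip // page_count + 1


print(page_to_skip(10, 40))   # skip value for page 10 at 40 records per page
print(skip_to_page(360, 40))  # page that the captured request was asking for
```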
4. Write the code
Based on this, I wrote a simple script in Python:

__author__ = 'Bruce'
import requests

page_count = 20  # records per page
page_index = 1   # 1-based page number

# The literal braces in the filter rule out str.format, so the two
# placeholders are filled in with str.replace instead.
url_template = (
    'https://api.uk-plc.net/product_tables/v1/abd'
    '?filter={%22producttype%22:{%22$in%22:[%22monoclonal%20antibody%22,%22polyclonal%20antibody%22]}}'
    '&skip={page_index}&limit={page_count}'
    '&sort=[[%22specificitysummary%22,1],[%22host%22,1],[%22uniquename%22,1],[%22format%22,1]]'
)


def get_json_from_url(url):
    # Fetch the URL and return the list stored under the 'results' key.
    r = requests.get(url)
    return r.json()['results']


def init_url_by_parms(page_count=40, page_index=1):
    # Build the URL for one page: skip = (page_index - 1) * page_count.
    if not page_count or not page_index:
        return ''
    return (url_template
            .replace('{page_index}', str((page_index - 1) * page_count))
            .replace('{page_count}', str(page_count)))


if __name__ == '__main__':
    url = init_url_by_parms(page_count=page_count, page_index=page_index)
    print(url)
    objs = get_json_from_url(url)
    if objs:
        for obj in objs:
            print('####################################')
            for k, v in obj.items():
                print(k, ':', v)
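As an alternative to replacing placeholders in a hand-encoded template, the same kind of URL can be assembled from Python data structures and encoded by the standard library. This is a sketch, not the author's method: build_page_url is a hypothetical helper, and it assumes the API accepts a fully percent-encoded query (braces and brackets included), which has not been verified here.

```python
import json
from urllib.parse import quote, urlencode

BASE_URL = "https://api.uk-plc.net/product_tables/v1/abd"


def build_page_url(product_types, page_index=1, page_count=40):
    """Build the request URL for one 1-based page of results."""
    query_filter = {"producttype": {"$in": list(product_types)}}
    sort = [["specificitysummary", 1], ["host", 1],
            ["uniquename", 1], ["format", 1]]
    params = {
        # Compact separators keep the JSON free of extra spaces.
        "filter": json.dumps(query_filter, separators=(",", ":")),
        "skip": (page_index - 1) * page_count,
        "limit": page_count,
        "sort": json.dumps(sort, separators=(",", ":")),
    }
    # quote_via=quote encodes spaces as %20, as in the captured URL.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


print(build_page_url(["monoclonal antibody", "polyclonal antibody"], page_index=2))
```

Building the query this way means the filter can be changed as ordinary Python data, without editing percent-encoded text by hand.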
Some friends also asked how to get the total number of pages. Suppose each page holds 40 records and we guess that there are 100 pages in total. If page 100 has data, visit page 200. If page 200 has no data, visit page (100 + 200) / 2 = 150, and so on. After roughly log2(n) requests we have the total number of pages. This is simply binary search.
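The search for the last page can be sketched as follows. The function name find_last_page and the has_data callback are hypothetical; the sketch assumes pages are contiguous (every page up to the last one has data and every later page is empty) and doubles the upper bound until it finds an empty page before bisecting.

```python
def find_last_page(has_data, start=100):
    """Return the highest 1-based page for which has_data(page) is True,
    or 0 if even the first page is empty."""
    if not has_data(1):
        return 0
    low, high = 1, start  # low is always a page known to have data
    # Grow the upper bound until it lands on an empty page.
    while has_data(high):
        low, high = high, high * 2
    # Invariant: page low has data, page high does not.
    while high - low > 1:
        mid = (low + high) // 2
        if has_data(mid):
            low = mid
        else:
            high = mid
    return low


# Stand-in for a real check: pretend the site has 137 pages.
print(find_last_page(lambda page: page <= 137))
```

In a real crawl, has_data(page) would request that page and report whether the result list is non-empty, so each probe costs one HTTP request.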
Summary:
This article analyzed the case where the Ajax interface can be called directly, and walked through the process of analyzing the request. In my view, thinking the problem through before writing code beats grinding away at the code by brute force. Next time I will look at the case where calling the Ajax interface directly does not work.
(9) Distributed crawling with Scrapy: handling Ajax crawling (part 1)