(9) Distributed crawling with Scrapy -- handling Ajax-loaded content (Part 1)


Please credit the source when reprinting: http://www.cnblogs.com/codefish/p/4993809.html

Recently, people in the group have frequently asked about handling Ajax and JavaScript. As we all know, many pages nowadays are loaded dynamically. That makes for a good browsing experience, but it also brings considerable trouble when crawling: a plain GET of the page URL will not return the dynamically loaded content. So how do we deal with this?

An intuitive first step: when crawling data, consider the mobile version of the site, because mobile sites are generally easier to scrape. WAP sites in particular mostly do not use Ajax paging or infinite scroll, so they are comparatively easy to crawl. Beyond that, by inspecting the request headers we can often find the Ajax request address directly. Call that address and see whether it returns normal data; then switch to a different browser and call it again. If the data still comes back, the URL is probably public, and as long as we keep the access frequency reasonable we can fetch the data easily. Below I use an example to walk through this kind of request analysis.

First, open the target website and see how it loads:

https://www.abdserotec.com/primary-antibodies-monoclonal-polyclonal.html#productType=Monoclonal%20Antibody

Second, analyze the website

When I open the site, it is obvious that the list data is requested via an Ajax event that fires when you scroll to the end of the drop-down list. So let's look at what actually happens when that request is made.

Here is the request address we captured:

https://api.uk-plc.net/product_tables/v1/abd?filter={%22producttype%22:{%22$in%22:[%22monoclonal%20antibody%22]}}&skip=360&limit=40&sort=[[%22specificitysummary%22,1],[%22host%22,1],[%22uniquename%22,1],[%22format%22,1]]

Then I open it directly in the browser to see:

We immediately see a familiar JSON-formatted response.

Don't stop there: now open the same API endpoint in a different browser. Why do this? Our first request was made with certain session-specific parameters attached; switching browsers clears them out. If we still get the data after that, the API endpoint itself is public and the administrator has not put any filtering in front of it.
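Those %22 sequences scattered through the captured URL are just percent-encoded double quotes: the filter parameter is an ordinary JSON object. A quick sketch of how such a URL could be assembled in Python (the field names come from the captured URL above; note that the browser capture leaves braces and $ unencoded, but servers decode either form):

```python
import json
from urllib.parse import quote

# The filter parameter is plain JSON; %22 in the captured URL is an encoded `"`.
filter_obj = {"producttype": {"$in": ["monoclonal antibody"]}}
filter_str = quote(json.dumps(filter_obj, separators=(',', ':')))

url = ('https://api.uk-plc.net/product_tables/v1/abd'
       '?filter=' + filter_str + '&skip=360&limit=40')
print(url)
```

Building the URL this way, instead of hand-editing percent escapes, makes it easy to change the filter later (for example to add polyclonal antibodies to the $in list).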

Third, further analysis of the parameters

OK, this confirms that we can call the interface directly. But how do we page through it?

Let's see what the parameters are in the URL:

https://api.uk-plc.net/product_tables/v1/abd?filter={%22producttype%22:{%22$in%22:[%22monoclonal%20antibody%22]}}&skip=360&limit=40&sort=[[%22specificitysummary%22,1],[%22host%22,1],[%22uniquename%22,1],[%22format%22,1]]

skip=360

limit=40

This looks just like paging with LINQ in C#, so I can make a bold assumption:

limit is the page size, i.e. the number of items per page

skip is the number of items to skip before the current page

page_index is the page number

So to fetch a given page, the parameters are:

skip = (page_index - 1) * limit

limit = 40

Change the parameters, request again, and verify that data still comes back.
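As a quick sanity check on the formula, here is a small helper (skip and limit are the names from the URL; 40 items per page matches the captured request, and the function name is mine):

```python
def skip_for_page(page_index, per_page=40):
    """Items to skip so the response starts at the given 1-based page."""
    return (page_index - 1) * per_page

# Page 1 starts at item 0; page 10 of 40-item pages starts at item 360,
# which matches skip=360 in the URL we captured.
print(skip_for_page(1), skip_for_page(10))
```

In other words, the skip=360 we saw in the browser corresponds exactly to page 10 of a 40-per-page listing.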

Fourth, write the code

For this case, I wrote a simple script in Python:

__author__ = 'Bruce'

import requests

page_count = 20
page_index = 1
url_template = ('https://api.uk-plc.net/product_tables/v1/abd?filter={%22producttype%22:'
                '{%22$in%22:[%22monoclonal%20antibody%22,%22polyclonal%20antibody%22]}}'
                '&skip={page_index}&limit={page_count}'
                '&sort=[[%22specificitysummary%22,1],[%22host%22,1],'
                '[%22uniquename%22,1],[%22format%22,1]]')


def get_json_from_url(url):
    # Fetch the API endpoint and return the list of result objects.
    r = requests.get(url)
    return r.json()['results']


def init_url_by_parms(page_count=40, page_index=1):
    # Translate (page_index, page_count) into the skip/limit values the API expects.
    if not page_count or not page_index:
        return ''
    return (url_template
            .replace('{page_index}', str((page_index - 1) * page_count))
            .replace('{page_count}', str(page_count)))


if __name__ == '__main__':
    url = init_url_by_parms(page_count=page_count, page_index=page_index)
    print(url)
    objs = get_json_from_url(url)
    for obj in objs or []:
        print('####################################')
        for k, v in obj.items():
            print(k, ':', v)

In addition, friends have asked: how do you get the total number of pages? Suppose each page holds 40 items and we have already seen data on page 100. Then visit page 200; if that page is empty, try page (100 + 200) / 2, and so on, narrowing the range each time. This is just binary search, so roughly log2(n) requests are enough to find the last page.
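Here is a sketch of that doubling-then-bisecting search. The has_data predicate is a hypothetical stand-in for a real request: in practice it would call get_json_from_url for the given page and report whether the result list is non-empty.

```python
def find_last_page(has_data):
    # Returns the highest page for which has_data(page) is True, 0 if none.
    if not has_data(1):
        return 0
    lo = 1
    while has_data(lo * 2):      # double until we pass the last page
        lo *= 2
    hi = lo * 2                  # lo has data, hi does not
    while hi - lo > 1:           # binary search for the boundary
        mid = (lo + hi) // 2
        if has_data(mid):
            lo = mid
        else:
            hi = mid
    return lo

# Simulate a site with 137 pages of data.
print(find_last_page(lambda page: page <= 137))  # 137
```

Each probe is one HTTP request, so finding the last of n pages costs about 2 * log2(n) requests instead of n.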

Summary:

This article analyzed the case where the Ajax endpoint can be called directly, walking through the request analysis step by step. In my view, thinking a problem through before writing code beats grinding out code by brute force. Next time I will look at the case where calling the Ajax interface directly does not work.

