(9) How to build a distributed crawler with Scrapy: handling Ajax crawling (Part 1)

Last Update: 2015-11-25 Source: Internet

Author: User


When reprinting, please cite the source: http://www.cnblogs.com/codefish/p/4993809.html

People in the group have recently been asking a lot about how to handle Ajax and JavaScript. As we all know, many pages now load their content dynamically. That makes for a good page experience, but it also causes considerable trouble for crawling, because a plain GET of the page URL does not return that content. So how do we deal with it?

Start with a direct analysis. When crawling data, we usually consider the mobile version of a site first, because data is relatively easy to get from mobile sites, especially WAP sites: their paging is mostly not Ajax paging or waterfall flow (infinite scroll), so they are easier to crawl. Beyond that, by analyzing the request headers we can easily find the address of the Ajax request. Call that address directly and see whether it returns normal data. Then switch to a different browser and call it again. If the data is still normal, the URL is probably public, and as long as you keep the request frequency reasonable you can get the data easily. Below I use an example to walk through the request analysis.

One, open the target website and see how it loads its data:

https://www.abdserotec.com/primary-antibodies-monoclonal-polyclonal.html#productType=Monoclonal%20Antibody

Two, analyze the website

When I open the site, it is obvious that the data is loaded by scrolling down the list: reaching the bottom triggers an Ajax event that requests more data. So let's look at what actually happens when that request is made.

We capture the requested address:

https://api.uk-plc.net/product_tables/v1/abd?filter={%22producttype%22:{%22$in%22:[%22monoclonal%20antibody%22]}}&skip=360&limit=40&sort=[[%22specificitysummary%22,1],[%22host%22,1],[%22uniquename%22,1],[%22format%22,1]]
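To make the pieces of this address easier to read, here is a minimal sketch that unpacks its query string with Python's standard library. It only decodes the URL shown above and sends no request; the variable names are my own.

```python
import json
from urllib.parse import urlsplit, parse_qs

# The address captured from the browser, copied from the text above.
url = ('https://api.uk-plc.net/product_tables/v1/abd'
       '?filter={%22producttype%22:{%22$in%22:[%22monoclonal%20antibody%22]}}'
       '&skip=360&limit=40'
       '&sort=[[%22specificitysummary%22,1],[%22host%22,1],'
       '[%22uniquename%22,1],[%22format%22,1]]')

# parse_qs percent-decodes each value and returns a dict of lists.
params = parse_qs(urlsplit(url).query)

# filter and sort are JSON documents packed into the query string.
filter_doc = json.loads(params['filter'][0])
sort_doc = json.loads(params['sort'][0])
skip = int(params['skip'][0])
limit = int(params['limit'][0])

print(filter_doc)   # the product-type filter
print(sort_doc)     # list of [field, direction] pairs
print(skip, limit)  # the two paging parameters
```

Seen this way, the request has four parts: a JSON filter, a JSON sort order, and the two numbers skip and limit.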

Then I open it directly in the browser to see:

A familiar JSON-formatted string is plainly visible.

Don't rush ahead yet. We first need to open this API address in a different browser. Why? Because the browser we have been using sends the request with certain state attached (such as cookies), and switching browsers clears that state. If the data still comes back, the API itself is public and the administrator has not put any filtering on it.
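The "change the browser" check can also be approximated from a script. The sketch below is only an illustration using the standard library: it sends a request that carries no cookies or Referer and reports whether JSON comes back. The function names and the generic User-Agent value are my own assumptions, not part of the site's API.

```python
import json
import urllib.error
import urllib.request


def build_clean_request(url):
    """Build a GET request with no cookies, Referer, or other browser state."""
    # Only a generic User-Agent is sent; the value is an illustrative choice.
    return urllib.request.Request(
        url, headers={'User-Agent': 'Mozilla/5.0'}, method='GET')


def is_json_payload(body):
    """Return True if the text parses as JSON."""
    try:
        json.loads(body)
    except ValueError:
        return False
    return True


def endpoint_looks_public(url, timeout=10):
    """Fetch url with a clean request; True if it answers 200 with JSON."""
    try:
        with urllib.request.urlopen(build_clean_request(url),
                                    timeout=timeout) as resp:
            body = resp.read().decode('utf-8', errors='replace')
            return resp.status == 200 and is_json_payload(body)
    except urllib.error.URLError:
        # Covers HTTP errors (403, 404, ...) and connection failures.
        return False
```

Calling endpoint_looks_public with the API address above plays the same role as opening it in a second browser. A True result suggests, but does not prove, that the interface is unrestricted.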

Three, analyze the parameters further

OK, this shows the interface can be accessed directly. So how do we page through it?

Let's look at the parameters in the URL:

skip=360

limit=40

This resembles the Skip/Take style of paging in C# LINQ, so I can boldly assume:

limit is PageCount, the number of records per page

skip is the number of records to skip, that is, the data on all earlier pages

PageIndex is the page number

So to fetch a given page, the parameters are:

skip = (PageIndex - 1) * PageCount

limit = PageCount (40 here)

Testing confirms that data still comes back.
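As a small worked example of the paging formula, the sketch below converts between a page number and the skip/limit pair. The function names are hypothetical, and it rests on the same assumption as the text: that skip counts records rather than pages.

```python
def page_to_params(page_index, page_count=40):
    """Return the skip/limit values for a 1-based page number."""
    if page_index < 1 or page_count < 1:
        raise ValueError('page_index and page_count must be at least 1')
    return {'skip': (page_index - 1) * page_count, 'limit': page_count}


def params_to_page(skip, limit):
    """Inverse of page_to_params: which page does this skip/limit show?"""
    return skip // limit + 1


print(page_to_params(10))      # parameters for page 10 at 40 records per page
print(params_to_page(360, 40)) # page number for the URL captured earlier
```

With 40 records per page, skip=360 from the captured URL corresponds to page 10, since (10 - 1) * 40 = 360.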

Four, write the code

For this, I wrote a simple script in Python:

__author__ = 'Bruce'

import requests

page_count = 20
page_index = 1

# {page_index} and {page_count} are placeholders filled in by init_url_by_parms
url_template = 'https://api.uk-plc.net/product_tables/v1/abd?filter={%22producttype%22:{%22$in%22:[%22monoclonal%20antibody%22,%22polyclonal%20antibody%22]}}&skip={page_index}&limit={page_count}&sort=[[%22specificitysummary%22,1],[%22host%22,1],[%22uniquename%22,1],[%22format%22,1]]'


def get_json_from_url(url):
    # Fetch the URL and return the list stored under the 'results' key
    r = requests.get(url)
    return r.json()['results']


def init_url_by_parms(page_count=40, page_index=1):
    # Build the request URL for a 1-based page number
    if not page_count or not page_index:
        return ''
    # skip = (page_index - 1) * page_count, limit = page_count
    return url_template.replace('{page_index}', str((page_index - 1) * page_count)).replace('{page_count}', str(page_count))


if __name__ == '__main__':
    url = init_url_by_parms(page_count=page_count, page_index=page_index)
    print(url)
    objs = get_json_from_url(url)
    if objs:
        for obj in objs:
            print('####################################')
            for k, v in obj.items():
                print(k, ':', v)
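As an alternative to string replacement, the URL can be assembled from structured data so that the JSON and percent-encoding are generated rather than typed by hand. This is a sketch only: build_url and BASE_URL are names I made up, and the output encodes more characters (braces, brackets, commas) than the address captured from the browser. It decodes to the same values, but whether the server treats both forms identically is an assumption.

```python
import json
from urllib.parse import urlencode, quote

BASE_URL = 'https://api.uk-plc.net/product_tables/v1/abd'

# Sort order copied from the captured request.
SORT = [['specificitysummary', 1], ['host', 1],
        ['uniquename', 1], ['format', 1]]


def build_url(product_types, page_index=1, page_count=40):
    """Build the API URL for a 1-based page of the given product types."""
    params = {
        # Compact JSON, matching the shape of the captured filter.
        'filter': json.dumps({'producttype': {'$in': product_types}},
                             separators=(',', ':')),
        'skip': (page_index - 1) * page_count,
        'limit': page_count,
        'sort': json.dumps(SORT, separators=(',', ':')),
    }
    # quote_via=quote encodes spaces as %20 instead of '+'.
    return BASE_URL + '?' + urlencode(params, quote_via=quote)


print(build_url(['monoclonal antibody', 'polyclonal antibody'], page_index=3))
```

The design choice here is that changing the filter or the page becomes a matter of passing different arguments, with no risk of breaking the quoting inside a long template string.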

Friends have also asked how to get the total number of pages. Assume each page holds 40 records and guess that the total is 100 pages. If page 100 has data, try page 200. If page 200 has no data, try page (100 + 200) / 2 = 150, and so on. After roughly log2(N) requests you have the total number of pages. This is simply binary search.

Summary:

This article analyzed the case where the Ajax interface can be called directly, along with the process of analyzing the request. In my view, thinking a problem through before writing code beats grinding away at the code by brute force. Next time I will look at the case where calling the Ajax interface directly does not work.
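The page-count search can be sketched as follows. To keep the example runnable without network access, the check "does this page return data" is passed in as a function. has_data is a hypothetical callback, and the sketch assumes that once a page is empty, every later page is empty too.

```python
def find_last_page(has_data, start=100):
    """Return the highest page number that has data, or 0 if page 1 is empty.

    has_data(page) must return True for every page up to the last one
    and False for every page after it.
    """
    if not has_data(1):
        return 0
    low = 1        # a page known to have data
    high = start   # first guess at a page beyond the end
    # Keep doubling the guess until we land on an empty page.
    while has_data(high):
        low = high
        high *= 2
    # Binary search between the last full page and the first empty one.
    while high - low > 1:
        mid = (low + high) // 2
        if has_data(mid):
            low = mid
        else:
            high = mid
    return low


# Simulated site with 137 pages, standing in for real requests.
print(find_last_page(lambda page: page <= 137))
```

In a real crawl, has_data would request the page's URL and report whether the returned results list is non-empty. Each call costs one request, so the search needs on the order of log2(N) requests.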

