Let's take Weibo as an example and use Python to emulate its Ajax requests, crawling the posts I have sent.
1. Analyze the Request
Open the developer tools, enable the XHR filter, and scroll the page to load new Weibo content. You can see that an Ajax request is issued each time new content loads.
Select one of the requests and analyze its parameters. Click the request to open its details page, as shown in Figure 6-11.
You can see that this is a GET request, and that the request link is https://m.weibo.cn/api/container/getIndex?type=uid&value=2830678474&containerid=1076032830678474&page=2. The request carries four parameters: type, value, containerid, and page.
Looking at the other requests, you can see that their type, value, and containerid stay the same. type is always uid, and value is the number in the profile page link, which is in fact the user's id. There is also containerid; on inspection, it turns out to be 107603 followed by the user id. The only value that changes is page, and this parameter obviously controls paging: page=1 is the first page, page=2 is the second page, and so on.
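As a quick check of this rule, a minimal sketch (plain Python, nothing Weibo-specific):

uid = '2830678474'            # the user id taken from the profile page link
containerid = '107603' + uid  # fixed prefix 107603 plus the user id
print(containerid)            # prints 1076032830678474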
2. Analyze the Response
Next, observe the response content of this request, as shown in Figure 6-12.
The content is in JSON format, and the browser's developer tools parse it automatically so it is easier to read. The two most important pieces of information are cardlistInfo and cards. The former contains an important field, total; observation shows that it is the total number of Weibo posts, so we can estimate the number of pages from it. The latter is a list of 10 elements. Let's expand one of them and take a look.
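For example, since each page carries 10 posts, the number of pages could be estimated from total along these lines (a sketch; estimate_pages is a hypothetical helper, and the field path follows the structure just described):

import math

def estimate_pages(json, per_page=10):
    # cardlistInfo.total is the total number of posts; dividing by the
    # page size (10 here) and rounding up gives the page count
    total = json.get('data').get('cardlistInfo').get('total')
    return math.ceil(total / per_page)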
You can see that each element has an important field, mblog. Expanding it reveals the post's details, such as attitudes_count (number of likes), comments_count (number of comments), reposts_count (number of reposts), created_at (time of publication), text (body of the post), and so on, all as structured fields.
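Schematically, the part of each card we care about looks like this (the field names come from the response; the values are made up for illustration):

card = {
    'mblog': {
        'id': '4134879836735238',
        'created_at': '...',                 # time of publication
        'text': 'some post text, possibly with <a href="...">HTML</a>',
        'attitudes_count': 3,                # likes
        'comments_count': 1,                 # comments
        'reposts_count': 0,                  # reposts
    }
}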
So a single request to this interface returns 10 posts, and the only thing that needs to change between requests is the page parameter. In that case, a simple loop is all it takes to get all the posts.
3. Practical Walkthrough
Here we'll write a program to simulate these Ajax requests and crawl the first 10 pages of my Weibo posts.
First, define a method to get the result of each request. Since page is the variable parameter, we pass it in as an argument to the method. The relevant code is as follows:
from urllib.parse import urlencode
import requests

base_url = 'https://m.weibo.cn/api/container/getIndex?'
headers = {
    'Host': 'm.weibo.cn',
    'Referer': 'https://m.weibo.cn/u/2830678474',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

def get_page(page):
    # only page varies between requests; the other parameters are fixed
    params = {
        'type': 'uid',
        'value': '2830678474',
        'containerid': '1076032830678474',
        'page': page
    }
    url = base_url + urlencode(params)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print('Error', e.args)
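As a quick sanity check, the method can be called on its own; assuming the request succeeds, the payload sits under the data key described earlier:

json_data = get_page(1)
if json_data:
    # cardlistInfo and cards should both appear under 'data'
    print(list(json_data.get('data').keys()))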
Here, base_url is defined first to represent the first half of the request URL. Next, the parameter dictionary is constructed, in which type, value, and containerid are fixed and page is variable. The urlencode() method is then called to convert the parameters into a GET query string of the form type=uid&value=2830678474&containerid=1076032830678474&page=2, and base_url is concatenated with it to form the full URL. Next, we use requests to fetch this link, passing in the headers parameter. The status code of the response is then checked: if it is 200, the json() method is called to parse the content as JSON and return it; otherwise nothing is returned. If an exception occurs, it is caught and its message is printed.
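To see urlencode() on its own, a standalone snippet:

from urllib.parse import urlencode

params = {'type': 'uid', 'value': '2830678474',
          'containerid': '1076032830678474', 'page': 2}
print(urlencode(params))
# type=uid&value=2830678474&containerid=1076032830678474&page=2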
Then we need to define a parsing method to extract the desired information from the result. This time we want to keep each post's id, body text, number of likes, number of comments, and number of reposts. To do so, we can iterate over cards, read these fields out of each element's mblog, and assign them to a new dictionary to yield:
from pyquery import PyQuery as pq

def parse_page(json):
    if json:
        items = json.get('data').get('cards')
        for item in items:
            item = item.get('mblog')
            weibo = {}
            weibo['id'] = item.get('id')
            weibo['text'] = pq(item.get('text')).text()
            weibo['attitudes'] = item.get('attitudes_count')
            weibo['comments'] = item.get('comments_count')
            weibo['reposts'] = item.get('reposts_count')
            yield weibo
Here pyquery is used to strip the HTML tags out of the body text.
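A tiny illustration of this step (the HTML snippet here is invented):

from pyquery import PyQuery as pq

html = 'Share a post <a href="https://m.weibo.cn/">with a link</a>'
print(pq(html).text())  # -> Share a post with a link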
Finally, traverse page, 10 pages in total, and print out the extracted results:
if __name__ == '__main__':
    for page in range(1, 11):
        json = get_page(page)
        results = parse_page(json)
        for result in results:
            print(result)
In addition, we can add a method to save the results to a MongoDB database:
from pymongo import MongoClient

client = MongoClient()
db = client['weibo']
collection = db['weibo']

def save_to_mongo(result):
    if collection.insert(result):
        print('Saved to Mongo')
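The main loop can then call save_to_mongo() on each record as it is printed, which is what produces the interleaved output below. One caveat: collection.insert() matches the pymongo API at the time of writing; it has since been removed in pymongo 4.x, where insert_one() is the replacement. A sketch of the combined loop:

if __name__ == '__main__':
    for page in range(1, 11):
        json = get_page(page)
        results = parse_page(json)
        for result in results:
            print(result)
            save_to_mongo(result)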
This completes all the functionality. After running the program, sample output looks like this:
{'id': '4134879836735238', 'text': 'Surprised or not, thrilled or not, unexpected or not, moved or not', 'attitudes': 3, 'comments': 1, 'reposts': 0}
Saved to Mongo
{'id': '4143853554221385', 'text': 'Once dreamed of roaming the world with a sword; later it got taken away at security. Sharing a single', 'attitudes': 5, 'comments': 1, 'reposts': 0}
Saved to Mongo
Checking MongoDB, we can see that the corresponding data has been saved there as well.
In this way, we successfully crawled the list of Weibo posts by analyzing the Ajax requests and writing a crawler. The code for this section is available at https://github.com/Python3WebSpider/WeiboList.
The purpose of this section is to demonstrate the process of simulating Ajax requests; the crawl results themselves are not the focus. There is still plenty of room to improve the program, such as calculating the number of pages dynamically and fetching the full text of long posts. If you are interested, give these a try.
Through this example, we mainly learned how to analyze Ajax requests and how to simulate them in a program to crawl data. With the principle understood, the Ajax walkthrough in the next section will come much more easily.