Sesame HTTP: Ajax Result Extraction


Taking Weibo as an example, we will use Python to simulate these Ajax requests and crawl the posts I have published.

1. Analyze the Request

Open the XHR filter in the browser's developer tools and scroll the page to load new Weibo content. You can see that Ajax requests are issued continuously as new content loads.

Select one of the requests and analyze its parameters. Click on the request to open its details page, as shown in Figure 6-11.

You can see that this is a GET request, and the request URL is https://m.weibo.cn/api/container/getIndex?type=uid&value=2830678474&containerid=1076032830678474&page=2. There are four request parameters: type, value, containerid, and page.

Then look at the other requests, and you can see that their type, value, and containerid parameters stay the same. type is always uid, and value is the number in the page URL, which is in fact the user's id. There is also containerid, which turns out to be 107603 followed by the user id. The only value that changes is page; clearly this parameter controls paging, with page=1 for the first page, page=2 for the second page, and so on.
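As a quick sanity check of that pattern, here is a minimal sketch (the uid is simply the one from this example, and the variable names are illustrative):

uid = '2830678474'            # the number taken from the profile page URL
containerid = '107603' + uid  # observed pattern: '107603' followed by the user id
print(containerid)            # 1076032830678474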

2. Analyze the Response

Next, observe the response content of this request, as shown in Figure 6-12.

The content is in JSON format, and the browser's developer tools parse it automatically to make it easier to read. The two most important pieces of information are cardlistInfo and cards. The former contains an important value, total, which on inspection is the total number of Weibo posts; we can use it to estimate how many pages there are. The latter is a list containing 10 elements. Expand one of them and take a look.

This element has an important field, mblog. Expand it and you can see that it contains information about the post, such as attitudes_count (number of likes), comments_count (number of comments), reposts_count (number of reposts), created_at (publishing time), and text (post body), all in structured form.

This means a single request to this interface returns 10 posts, and the only thing we need to change between requests is the page parameter.

In that case, a simple loop is all we need to retrieve all the posts.

3. Practical Walkthrough

Here we write a program to simulate these Ajax requests and crawl the first 10 pages of my Weibo posts.

First, define a method that fetches the result of each request. Since page is the only variable parameter, we pass it in as an argument to the method. The relevant code is as follows:

from urllib.parse import urlencode
import requests

base_url = 'https://m.weibo.cn/api/container/getIndex?'

headers = {
    'Host': 'm.weibo.cn',
    'Referer': 'https://m.weibo.cn/u/2830678474',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

def get_page(page):
    params = {
        'type': 'uid',
        'value': '2830678474',
        'containerid': '1076032830678474',
        'page': page
    }
    url = base_url + urlencode(params)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print('Error', e.args)

Here, base_url is defined first and represents the fixed first half of the request URL. Next, a parameter dictionary is constructed, in which type, value, and containerid are fixed and page is variable. The urlencode() method is then called to convert the parameters into the URL's GET query string, in the form type=uid&value=2830678474&containerid=1076032830678474&page=2, and this string is concatenated with base_url to form the complete URL. Next, requests is used to request this URL, with the headers added. The response status code is then checked: if it is 200, json() is called to parse the content as JSON and return it; otherwise nothing is returned. If an exception occurs, it is caught and its message is printed.
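As a standalone illustration of that conversion step (the values are just the ones used in this example), urlencode() from the standard library produces exactly such a query string:

from urllib.parse import urlencode

params = {
    'type': 'uid',
    'value': '2830678474',
    'containerid': '1076032830678474',
    'page': 2
}
# Prints: type=uid&value=2830678474&containerid=1076032830678474&page=2
print(urlencode(params))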

Next, we define a parsing method to extract the desired information from the result. This time we want to save the post's id, body text, number of likes, number of comments, and number of reposts, so we iterate over cards, take the mblog field from each item, assign its values to a new dictionary, and yield it:

from pyquery import PyQuery as pq

def parse_page(json):
    if json:
        items = json.get('data').get('cards')
        for item in items:
            item = item.get('mblog')
            weibo = {}
            weibo['id'] = item.get('id')
            weibo['text'] = pq(item.get('text')).text()
            weibo['attitudes'] = item.get('attitudes_count')
            weibo['comments'] = item.get('comments_count')
            weibo['reposts'] = item.get('reposts_count')
            yield weibo

Here, pyquery is used to strip the HTML tags from the post body.
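As an illustration of what pyquery does here (the HTML snippet is made up for the example; real text values contain similar markup such as mention links):

from pyquery import PyQuery as pq

html = '<div>Reply <a href="https://m.weibo.cn/n/somebody">@somebody</a>: nice post</div>'
# .text() keeps only the visible text, with the <a> tag stripped
print(pq(html).text())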

Finally, iterate over page for all 10 pages and print each extracted result:

if __name__ == '__main__':
    for page in range(1, 11):
        json = get_page(page)
        results = parse_page(json)
        for result in results:
            print(result)

In addition, we can add a method to save the results to MongoDB:

from pymongo import MongoClient

client = MongoClient()
db = client['weibo']
collection = db['weibo']

def save_to_mongo(result):
    # Note: insert() is deprecated in newer pymongo versions; insert_one() works the same way here
    if collection.insert(result):
        print('Saved to Mongo')
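One straightforward way to wire this in, and what produces the interleaved output shown below, is to call save_to_mongo() alongside print() in the main loop (a minimal sketch reusing the functions defined above):

if __name__ == '__main__':
    for page in range(1, 11):
        json = get_page(page)
        results = parse_page(json)
        for result in results:
            print(result)
            save_to_mongo(result)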

With this, all of the functionality is complete. After running the program, the sample output is as follows:

{'id': '4134879836735238', 'text': 'Surprised or not, thrilled or not, unexpected or not, touched or not', 'attitudes': 3, 'comments': 1, 'reposts': 0}
Saved to Mongo
{'id': '4143853554221385', 'text': 'Once I dreamed of roaming the world with a sword; later it was confiscated at the security check. Sharing the single "Fly Away"', 'attitudes': 5, 'comments': 1, 'reposts': 0}
Saved to Mongo

Checking MongoDB, we can see that the corresponding data has been saved there as well.

In this way, by analyzing the Ajax requests and writing a crawler, we successfully crawled the Weibo post list. The code for this section is available at https://github.com/Python3WebSpider/WeiboList.

The purpose of this section is to demonstrate how to simulate Ajax requests; the crawled results themselves are not the focus. The program still has plenty of room for improvement, such as calculating the number of pages dynamically and fetching the full text of long posts. If you are interested, give it a try.
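For instance, the dynamic page-count idea could be sketched like this, assuming total sits under data -> cardlistInfo in the response as observed earlier (get_total_pages is an illustrative helper, not part of the original code):

from math import ceil

def get_total_pages():
    # Fetch the first page and read the total number of posts from cardlistInfo
    first = get_page(1)
    total = int(first.get('data').get('cardlistInfo').get('total'))
    # Each page carries 10 posts, so round up
    return ceil(total / 10)

# The main loop could then iterate over range(1, get_total_pages() + 1)
# instead of a hard-coded 10 pages.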

Through this example, we learned how to analyze Ajax requests and how to write a program to simulate and crawl them. With the crawling principle understood, the Ajax walkthrough in the next section will be much easier.
