Sesame HTTP: Ajax Result Extraction

Taking Weibo as an example, we use Python to simulate these Ajax requests and crawl the Weibo posts I sent.

1. Analyze the request

Open the browser developer tools, switch to the XHR filter for Ajax requests, and keep scrolling the page to load new Weibo content. You can see that Ajax requests are sent continuously.

Select one request and analyze its parameters. Click the request to open its details page, as shown in Figure 6-11.

It can be found that this is a GET request, and the request URL is https://m.weibo.cn/api/container/getIndex?type=uid&value=2830678474&containerid=1076032830678474&page=2. There are four request parameters: type, value, containerid, and page.

Then let's look at the other requests: their type, value, and containerid stay the same. type is always uid, and value is the number in the page link, which is in fact the user's id. There is also containerid; after observation, we can find that it is 107603 followed by the user id. The only value that changes is page. Obviously, this parameter controls paging: page=1 is the first page, page=2 is the second page, and so on; a sketch of this pattern follows below.
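To make this parameter pattern concrete, here is a minimal sketch (build_params is a hypothetical helper, not part of the code that follows; the uid is the example account from this section):

def build_params(uid, page):
    # containerid is the fixed prefix 107603 followed by the user id
    return {
        'type': 'uid',
        'value': uid,
        'containerid': '107603' + uid,
        'page': page,
    }

print(build_params('2830678474', 2))
# {'type': 'uid', 'value': '2830678474', 'containerid': '1076032830678474', 'page': 2}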

2. Analyze the response

Then, observe the response content of the request, as shown in Figure 6-12.

The content is in JSON format, and the browser developer tools parse it automatically to make it easier to view. As you can see, the two most important parts of the information are cardlistInfo and cards. The former contains an important field, total; after observation, we can find that it is actually the total number of Weibo posts, so we can estimate the number of pages from it. The latter is a list containing 10 elements; let's expand one of them and take a look.
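As a sketch of how these two parts can be read from the parsed response (inspect_response is a hypothetical helper; the field paths follow the structure just described):

def inspect_response(response_json):
    # pull the two key parts out of a parsed getIndex response
    data = response_json.get('data')
    total = data.get('cardlistInfo').get('total')  # total number of posts, useful for estimating pages
    cards = data.get('cards')                      # list of (usually 10) card objects, each carrying an mblog field
    return total, cards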

It can be found that this element has a very important field, mblog. Expanding it shows that it contains information about the Weibo post, such as attitudes_count (number of likes), comments_count (number of comments), reposts_count (number of reposts), created_at (posting time), and text (the post body), all of it in formatted content.

In this way, we can get 10 Weibo posts from a single request to this interface, and the only thing we need to change in the request is the page parameter.

With this, a simple loop over page is enough to get all the Weibo posts.

3. Hands-on practice

Here we use a program to simulate these Ajax requests and crawl the first 10 pages of my Weibo posts.

First, define a method to get the result of each request. In the request, page is the only variable parameter, so we pass it in as a method argument. The relevant code is as follows:

from urllib.parse import urlencode
import requests

base_url = 'https://m.weibo.cn/api/container/getIndex?'
headers = {
    'Host': 'm.weibo.cn',
    'Referer': 'https://m.weibo.cn/u/2830678474',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

def get_page(page):
    params = {
        'type': 'uid',
        'value': '2830678474',
        'containerid': '1076032830678474',
        'page': page
    }
    url = base_url + urlencode(params)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print('Error', e.args)

Here, we first define base_url to represent the first half of the request URL. Next, we construct a parameter dictionary, where type, value, and containerid are fixed parameters and page is the variable one. Then we call the urlencode() method to convert the dictionary into URL GET parameters of the form type=uid&value=2830678474&containerid=1076032830678474&page=2, and concatenate the result with base_url to form the full URL. Next, we request this link with requests, passing the headers argument. Then we check the response status code: if it is 200, we call the json() method to parse the content as JSON and return it; otherwise nothing is returned. If an exception occurs, we catch it and print the exception information.
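To see the urlencode() step in isolation, this snippet reproduces the conversion on its own:

from urllib.parse import urlencode

params = {
    'type': 'uid',
    'value': '2830678474',
    'containerid': '1076032830678474',
    'page': 2,
}
print(urlencode(params))
# type=uid&value=2830678474&containerid=1076032830678474&page=2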

Then, we need to define a parsing method to extract the desired information from the result: the id, body text, number of likes, number of comments, and number of reposts. To do this, we can iterate over cards, take the mblog field of each card, and assign its values to a new dictionary:

from pyquery import PyQuery as pq

def parse_page(json):
    if json:
        items = json.get('data').get('cards')
        for item in items:
            item = item.get('mblog')
            # skip cards that do not carry a Weibo post
            if not item:
                continue
            weibo = {}
            weibo['id'] = item.get('id')
            weibo['text'] = pq(item.get('text')).text()
            weibo['attitudes'] = item.get('attitudes_count')
            weibo['comments'] = item.get('comments_count')
            weibo['reposts'] = item.get('reposts_count')
            yield weibo

Here, we use pyquery to strip the HTML tags from the body text.
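For example (a made-up fragment, not from a real Weibo response), pyquery's text() method collapses HTML to its plain text:

from pyquery import PyQuery as pq

html = '<p>Hello <b>world</b></p>'
print(pq(html).text())  # Hello world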

Finally, traverse page (10 pages in total) and print the extracted results:

if __name__ == '__main__':
    for page in range(1, 11):
        json = get_page(page)
        results = parse_page(json)
        for result in results:
            print(result)

In addition, we can add a method to save the results to a MongoDB database:

from pymongo import MongoClient

client = MongoClient()
db = client['weibo']
collection = db['weibo']

def save_to_mongo(result):
    # insert_one() replaces the insert() method that was deprecated and later removed in pymongo
    if collection.insert_one(result):
        print('Saved to Mongo')
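To produce the interleaved output shown below, save_to_mongo() can be called from the main loop right after printing each result, for example:

if __name__ == '__main__':
    for page in range(1, 11):
        for result in parse_page(get_page(page)):
            print(result)
            save_to_mongo(result)  # persist each post immediately after printing it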

With this, all the functions are complete. After running the program, the sample output looks as follows:

{'id': '000000', 'text': 'Surprise, no stimulus, no surprise, no emotion.', 'attitudes': 3, 'comments': 1, 'reposts': 0}
Saved to Mongo
{'id': '000000', 'text': 'Once dreamed of roaming the world with a sword, then it was taken away at the security check. Sharing the single Fly Far Away.', 'attitudes': 5, 'comments': 1, 'reposts': 0}
Saved to Mongo

Check MongoDB, and you can see that the data has been saved there.

In this way, by analyzing the Ajax requests we wrote a crawler that crawls the Weibo list. Finally, here is the code address for this section: https://github.com/python3webspider/weibolist.

The purpose of this section is to demonstrate how to simulate Ajax requests; the crawling result itself is not the focus. There is still much that could be improved in this program, such as calculating the page count dynamically and fetching the full text of long Weibo posts. If you are interested, give it a try.
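For instance, dynamic page number calculation could reuse the total field from the first response instead of hard-coding 10 pages. A rough sketch building on get_page() and parse_page() above (crawl_all is a hypothetical helper; the field path follows the response structure described earlier):

import math

def crawl_all():
    # fetch the first page just to learn the total number of posts
    first = get_page(1)
    total = first.get('data').get('cardlistInfo').get('total')
    pages = math.ceil(total / 10)  # 10 posts per Ajax page
    for page in range(1, pages + 1):
        for result in parse_page(get_page(page)):
            print(result)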

Through this example, we mainly learned how to analyze Ajax requests and how to simulate them with a program. After understanding the crawling principle, the Ajax practice in the next section will go much more smoothly.
