Let's take Weibo as an example and use Python to emulate its Ajax requests, crawling the posts I have sent.
1. Analyze the Request
Open the developer tools, enable the XHR filter, and scroll the page to load new Weibo content. You can see that an Ajax request is issued each time new content loads.
Select one of the requests and analyze its parameters. Click the request to open its details page, as shown in Figure 6-11.
You can see that this is a GET request, and that the request link is https://m.weibo.cn/api/container/getIndex?type=uid&value=2830678474&containerid=1076032830678474&page=2. The request carries four parameters: type, value, containerid, and page.
Looking at the other requests, you can see that their type, value, and containerid stay the same. type is always uid, and value is the number in the profile page link, which is in fact the user's id. There is also containerid; on inspection, it turns out to be 107603 followed by the user id. The only value that changes is page, and this parameter obviously controls paging: page=1 is the first page, page=2 is the second page, and so on.
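As a quick check of this rule, a minimal sketch (plain Python, nothing Weibo-specific):

uid = '2830678474'            # the user id taken from the profile page link
containerid = '107603' + uid  # fixed prefix 107603 plus the user id
print(containerid)            # prints 1076032830678474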
2. Analyze the Response
Next, observe the response content of this request, as shown in Figure 6-12.
The content is in JSON format, and the browser's developer tools parse it automatically so it is easier to read. The two most important pieces of information are cardlistInfo and cards. The former contains an important field, total; observation shows that it is the total number of Weibo posts, so we can estimate the number of pages from it. The latter is a list of 10 elements. Let's expand one of them and take a look.
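For example, since each page carries 10 posts, the number of pages could be estimated from total along these lines (a sketch; estimate_pages is a hypothetical helper, and the field path follows the structure just described):

import math

def estimate_pages(json, per_page=10):
    # cardlistInfo.total is the total number of posts; dividing by the
    # page size (10 here) and rounding up gives the page count
    total = json.get('data').get('cardlistInfo').get('total')
    return math.ceil(total / per_page)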
You can see that each element has an important field, mblog. Expanding it reveals the post's details, such as attitudes_count (number of likes), comments_count (number of comments), reposts_count (number of reposts), created_at (time of publication), text (body of the post), and so on, all as structured fields.
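Schematically, the part of each card we care about looks like this (the field names come from the response; the values are made up for illustration):

card = {
    'mblog': {
        'id': '4134879836735238',
        'created_at': '...',                 # time of publication
        'text': 'some post text, possibly with <a href="...">HTML</a>',
        'attitudes_count': 3,                # likes
        'comments_count': 1,                 # comments
        'reposts_count': 0,                  # reposts
    }
}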
So a single request to this interface returns 10 posts, and the only thing that needs to change between requests is the page parameter. In that case, a simple loop is all it takes to get all the posts.
3. Practical Walkthrough
Here we'll write a program to simulate these Ajax requests and crawl the first 10 pages of my Weibo posts.
First, define a method to get the result of each request. Since page is the variable parameter, we pass it in as an argument to the method. The relevant code is as follows:
from urllib.parse import urlencode
import requests

base_url = 'https://m.weibo.cn/api/container/getIndex?'
headers = {
    'Host': 'm.weibo.cn',
    'Referer': 'https://m.weibo.cn/u/2830678474',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

def get_page(page):
    # only page varies between requests; the other parameters are fixed
    params = {
        'type': 'uid',
        'value': '2830678474',
        'containerid': '1076032830678474',
        'page': page
    }
    url = base_url + urlencode(params)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print('Error', e.args)
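As a quick sanity check, the method can be called on its own; assuming the request succeeds, the payload sits under the data key described earlier:

json_data = get_page(1)
if json_data:
    # cardlistInfo and cards should both appear under 'data'
    print(list(json_data.get('data').keys()))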
Here, base_url is defined first to represent the first half of the request URL. Next, the parameter dictionary is constructed, in which type, value, and containerid are fixed and page is variable. The urlencode() method is then called to convert the parameters into a GET query string of the form type=uid&value=2830678474&containerid=1076032830678474&page=2, and base_url is concatenated with it to form the full URL. Next, we use requests to fetch this link, passing in the headers parameter. The status code of the response is then checked: if it is 200, the json() method is called to parse the content as JSON and return it; otherwise nothing is returned. If an exception occurs, it is caught and its message is printed.
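To see urlencode() on its own, a standalone snippet:

from urllib.parse import urlencode

params = {'type': 'uid', 'value': '2830678474',
          'containerid': '1076032830678474', 'page': 2}
print(urlencode(params))
# type=uid&value=2830678474&containerid=1076032830678474&page=2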
Then we need to define a parsing method to extract the desired information from the result. This time we want to keep each post's id, body text, number of likes, number of comments, and number of reposts. To do so, we can iterate over cards, read these fields out of each element's mblog, and assign them to a new dictionary to yield:
from pyquery import PyQuery as pq

def parse_page(json):
    if json:
        items = json.get('data').get('cards')
        for item in items:
            item = item.get('mblog')
            weibo = {}
            weibo['id'] = item.get('id')
            weibo['text'] = pq(item.get('text')).text()
            weibo['attitudes'] = item.get('attitudes_count')
            weibo['comments'] = item.get('comments_count')
            weibo['reposts'] = item.get('reposts_count')
            yield weibo
Here pyquery is used to strip the HTML tags out of the body text.
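A tiny illustration of this step (the HTML snippet here is invented):

from pyquery import PyQuery as pq

html = 'Share a post <a href="https://m.weibo.cn/">with a link</a>'
print(pq(html).text())  # -> Share a post with a link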
Finally, traverse page, 10 pages in total, and print out the extracted results:
if __name__ == '__main__':
    for page in range(1, 11):
        json = get_page(page)
        results = parse_page(json)
        for result in results:
            print(result)
In addition, we can add a method to save the results to a MongoDB database:
from pymongo import MongoClient

client = MongoClient()
db = client['weibo']
collection = db['weibo']

def save_to_mongo(result):
    if collection.insert(result):
        print('Saved to Mongo')
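The main loop can then call save_to_mongo() on each record as it is printed, which is what produces the interleaved output below. One caveat: collection.insert() matches the pymongo API at the time of writing; it has since been removed in pymongo 4.x, where insert_one() is the replacement. A sketch of the combined loop:

if __name__ == '__main__':
    for page in range(1, 11):
        json = get_page(page)
        results = parse_page(json)
        for result in results:
            print(result)
            save_to_mongo(result)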
This completes all the functionality. After running the program, sample output looks like this:
{'id': '4134879836735238', 'text': 'Surprised or not, thrilled or not, unexpected or not, moved or not', 'attitudes': 3, 'comments': 1, 'reposts': 0}
Saved to Mongo
{'id': '4143853554221385', 'text': 'Once dreamed of roaming the world with a sword; later it got taken away at security. Sharing a single', 'attitudes': 5, 'comments': 1, 'reposts': 0}
Saved to Mongo
Checking MongoDB, we can see that the corresponding data has been saved there as well.
In this way, we successfully crawled the list of Weibo posts by analyzing the Ajax requests and writing a crawler. The code for this section is available at https://github.com/Python3WebSpider/WeiboList.
The purpose of this section is to demonstrate the process of simulating Ajax requests; the crawl results themselves are not the focus. There is still plenty of room to improve the program, such as calculating the number of pages dynamically and fetching the full text of long posts. If you are interested, give these a try.
Through this example, we mainly learned how to analyze Ajax requests and how to simulate them in a program to crawl data. With the principle understood, the Ajax walkthrough in the next section will come much more easily.