When you learn Python web crawling, you will find that many sites load their content dynamically: the page issues Ajax requests and asynchronously renders the returned JSON data, so the techniques used to crawl static web pages cannot fetch this content. This article explains how to crawl Ajax-generated data with Python.
For reading the contents of static web pages, interested readers can refer to the earlier article on that topic.
Here we take crawling Taobao comments as an example to explain how to do it.
There are four main steps here:
1. Get the Ajax request link (URL) for Taobao comments
2. Get the JSON data returned by the Ajax request
3. Use Python to parse the JSON data
4. Save the parsed results
Step One: Get the Ajax request link (URL) for Taobao comments

I use the Chrome browser for this. Open Taobao, search for a product such as "shoes" in the search box, and choose the first item in the results.
The browser then jumps to a new page. Because we want to crawl the user comments, we click on the accumulated reviews tab.
Now we can see how users rate the product. Right-click on the page and choose Inspect (or simply press F12 to open the developer tools), then select the Network tab, as shown in the picture:
In the user comments area, scroll to the bottom and click next page (or page 2). A few new entries appear in the Network panel; choose the one whose name begins with list_detail_rate.htm?itemid=35648967399.
Click that entry, and the request details appear in the pane on the right. Copy the link shown in the Request URL field.
Paste the URL we just obtained into the browser's address bar and open it. The page that comes back is exactly the data we need, but it looks very messy, because it is JSON data.
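More precisely, the response is a JSONP call: a callback name wrapping a JSON object. With the field values shortened to "..." purely for illustration, its shape is roughly:

jsonp2143({"rateDetail": {"rateList": [{"appendComment": {"content": "..."}}]}})

The jsonp2143(...) wrapper around the JSON object is what makes the page look messy; we will strip it off with a regular expression in Step Three.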
Step Two: Get the JSON data returned by the Ajax request
Next, we fetch the JSON data from that URL. The Python editor I use is PyCharm. Here is the Python code:
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

import requests

# The URL here is quite long
url = 'https://rate.tmall.com/list_detail_rate.htm?itemid=35648967399&spuid=226460655&sellerid=1809124267&order=3&currentpage=1&append=0&content=1&tagid=&posi=&picture=&ua=011uw5tcymnyqwiaiwqrhhbfef8qxthcklnmwc%3d%7cum5ocktyt3zcf0b9qn9gec4%3d%7cu2xmhdj7g2ahyg8has8wkaymcfq1uz9yjlxyjhi%3d%7cvghxd1llxgvyyvvov2pvaffvwgvhe0z%2frhfmeub4qhxcdkh8sxjccg%3d%3d%7cvwldfs0rmq47asedjwcpsddnpm4lnba7rijldxijzbk3ytc%3d%7cvmhigcufobgkgimxnwswczalkxcpeikjmwg9hsefjb8%2fbtopwq8%3d%7cv29phzefp29vbfz2snbkdiaapr0zht0boqi8a1ud%7cwgfbet8rmqszdy8qlxuudjijnqa1yzu%3d%7cwwbaed4qmau%2baseylbksddaeoga1yzu%3d%7cwmjcejwsmmjxb1d3t3jmc1nmwgjaefhmw2jcfezmwgw6gichkqcngcudibpmgg%3d%3d%7cw2jfykj%2fx2bafev5wwdfzuv8xgbudebgvxvjciq%3d&isg=82b6a3a1ed52a6996bca2111c9daaee6&_ksts=1440490222698_2142&callback=jsonp2143'

content = requests.get(url).content
print content  # what is printed is the JSON data we obtained in the browser earlier, including the users' comments
The content here is the JSON data we need, and the next step is to parse the JSON data.
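A side note: Taobao may reject requests that do not look like they come from a browser. If the request above returns an empty or blocked response, sending a browser-like User-Agent header may help. A minimal sketch, where the header value is only an example:

import requests

# retry the request with a browser-like User-Agent header
# (this exact header value is an example, not the only one that works)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36'}
content = requests.get(url, headers=headers).content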
Step Three: Use Python to parse the JSON data
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

import requests
import json
import re

url = 'https://rate.tmall.com/list_detail_rate.htm?itemid=35648967399&spuid=226460655&sellerid=1809124267&order=3&currentpage=1&append=0&content=1&tagid=&posi=&picture=&ua=011uw5tcymnyqwiaiwqrhhbfef8qxthcklnmwc%3d%7cum5ocktyt3zcf0b9qn9gec4%3d%7cu2xmhdj7g2ahyg8has8wkaymcfq1uz9yjlxyjhi%3d%7cvghxd1llxgvyyvvov2pvaffvwgvhe0z%2frhfmeub4qhxcdkh8sxjccg%3d%3d%7cvwldfs0rmq47asedjwcpsddnpm4lnba7rijldxijzbk3ytc%3d%7cvmhigcufobgkgimxnwswczalkxcpeikjmwg9hsefjb8%2fbtopwq8%3d%7cv29phzefp29vbfz2snbkdiaapr0zht0boqi8a1ud%7cwgfbet8rmqszdy8qlxuudjijnqa1yzu%3d%7cwwbaed4qmau%2baseylbksddaeoga1yzu%3d%7cwmjcejwsmmjxb1d3t3jmc1nmwgjaefhmw2jcfezmwgw6gichkqcngcudibpmgg%3d%3d%7cw2jfykj%2fx2bafev5wwdfzuv8xgbudebgvxvjciq%3d&isg=82b6a3a1ed52a6996bca2111c9daaee6&_ksts=1440490222698_2142&callback=jsonp2143'

cont = requests.get(url).content
rex = re.compile(r'\w+[(]{1}(.*)[)]{1}')
content = rex.findall(cont)[0]
con = json.loads(content, 'GBK')
count = len(con['rateDetail']['rateList'])
for i in xrange(count):
    print con['rateDetail']['rateList'][i]['appendComment']['content']
Explanation:
First import the packages you need: re is the package required for regular expressions, and parsing the JSON data requires import json.
cont = requests.get(url).content  # get the JSON data from the page
rex = re.compile(r'\w+[(]{1}(.*)[)]{1}')  # the regular expression removes the redundant JSONP wrapper from cont, so the data becomes true JSON-formatted data such as {"a": "b", "c": "d"}
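To see what the regular expression does, here is a tiny self-contained demonstration; the callback name jsonp2143 and the payload are made-up values for illustration:

import re

# sample JSONP response: a callback name wrapping a JSON object
sample = 'jsonp2143({"a": "b", "c": "d"})'
rex = re.compile(r'\w+[(]{1}(.*)[)]{1}')
print rex.findall(sample)[0]  # prints: {"a": "b", "c": "d"}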
con = json.loads(content, 'GBK')  # json's loads function converts content into a data structure the json library functions can handle; 'GBK' is passed as the encoding of the data, because Windows systems default to GBK
count = len(con['rateDetail']['rateList'])  # get the number of user comments (on the current page only)
for i in xrange(count):
    print con['rateDetail']['rateList'][i]['appendComment']['content']
# loop over the users' comments and print them (you can also save the data as needed; see Step Four)
The difficulty here is finding the path to the user comments inside the cluttered JSON data.
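One way to find that path is to pretty-print the parsed structure and read the key names off the output. This is only an exploration aid, not part of the crawl itself:

import json

# pretty-print the parsed object so the nesting (rateDetail -> rateList -> appendComment -> content) is easy to read
print json.dumps(con, indent=2, ensure_ascii=False)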
Step Four: Save the parsed results
Here you can save the users' comment information locally, for example in CSV format.
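A minimal sketch, assuming con has already been parsed as in Step Three (the file name comments.csv is arbitrary):

# -*- coding: utf-8 -*-
import csv

# assumes `con` is the parsed JSON object from Step Three
rate_list = con['rateDetail']['rateList']
with open('comments.csv', 'wb') as f:  # 'wb' avoids extra blank lines with Python 2's csv module
    writer = csv.writer(f)
    writer.writerow(['content'])  # header row
    for rate in rate_list:
        comment = rate['appendComment']['content']
        writer.writerow([comment.encode('utf-8')])  # Python 2's csv module expects byte strings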
That is the whole of this article; I hope you find it helpful.