When you learn Python web crawling, you will find that many sites load their content dynamically: the page issues Ajax requests and asynchronously renders the returned JSON data, so the techniques used to crawl static web pages cannot fetch this content. This article explains how to crawl Ajax-generated data with Python.
For reading the contents of static web pages, interested readers can refer to the earlier article on that topic.
Here we take crawling Taobao comments as an example to explain how to do it.
There are four main steps here:
1. Get the Ajax request link (URL) for Taobao comments
2. Get the JSON data returned by the Ajax request
3. Use Python to parse the JSON data
4. Save the parsed results
Step One: Get the Ajax request link (URL) for Taobao comments

I use the Chrome browser for this. Open Taobao, search for a product such as "shoes" in the search box, and choose the first item in the results.
The browser then jumps to a new page. Because we want to crawl the user comments, we click on the accumulated reviews tab.
Now we can see how users rate the product. Right-click on the page and choose Inspect (or simply press F12 to open the developer tools), then select the Network tab, as shown in the picture:
In the user comments area, scroll to the bottom and click next page (or page 2). A few new entries appear in the Network panel; choose the one whose name begins with list_detail_rate.htm?itemid=35648967399.
Click that entry, and the request details appear in the pane on the right. Copy the link shown in the Request URL field.
Paste the URL we just obtained into the browser's address bar and open it. The page that comes back is exactly the data we need, but it looks very messy, because it is JSON data.
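More precisely, the response is a JSONP call: a callback name wrapping a JSON object. With the field values shortened to "..." purely for illustration, its shape is roughly:

jsonp2143({"rateDetail": {"rateList": [{"appendComment": {"content": "..."}}]}})

The jsonp2143(...) wrapper around the JSON object is what makes the page look messy; we will strip it off with a regular expression in Step Three.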
Step Two: Get the JSON data returned by the Ajax request
Next, we fetch the JSON data from that URL. The Python editor I use is PyCharm. Here is the Python code:
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

import requests

# The URL here is quite long
url = 'https://rate.tmall.com/list_detail_rate.htm?itemid=35648967399&spuid=226460655&sellerid=1809124267&order=3&currentpage=1&append=0&content=1&tagid=&posi=&picture=&ua=011uw5tcymnyqwiaiwqrhhbfef8qxthcklnmwc%3d%7cum5ocktyt3zcf0b9qn9gec4%3d%7cu2xmhdj7g2ahyg8has8wkaymcfq1uz9yjlxyjhi%3d%7cvghxd1llxgvyyvvov2pvaffvwgvhe0z%2frhfmeub4qhxcdkh8sxjccg%3d%3d%7cvwldfs0rmq47asedjwcpsddnpm4lnba7rijldxijzbk3ytc%3d%7cvmhigcufobgkgimxnwswczalkxcpeikjmwg9hsefjb8%2fbtopwq8%3d%7cv29phzefp29vbfz2snbkdiaapr0zht0boqi8a1ud%7cwgfbet8rmqszdy8qlxuudjijnqa1yzu%3d%7cwwbaed4qmau%2baseylbksddaeoga1yzu%3d%7cwmjcejwsmmjxb1d3t3jmc1nmwgjaefhmw2jcfezmwgw6gichkqcngcudibpmgg%3d%3d%7cw2jfykj%2fx2bafev5wwdfzuv8xgbudebgvxvjciq%3d&isg=82b6a3a1ed52a6996bca2111c9daaee6&_ksts=1440490222698_2142&callback=jsonp2143'

content = requests.get(url).content
print content  # what is printed is the JSON data we obtained in the browser earlier, including the users' comments
The content here is the JSON data we need, and the next step is to parse the JSON data.
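A side note: Taobao may reject requests that do not look like they come from a browser. If the request above returns an empty or blocked response, sending a browser-like User-Agent header may help. A minimal sketch, where the header value is only an example:

import requests

# retry the request with a browser-like User-Agent header
# (this exact header value is an example, not the only one that works)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36'}
content = requests.get(url, headers=headers).content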
Step Three: Use Python to parse the JSON data
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

import requests
import json
import re

url = 'https://rate.tmall.com/list_detail_rate.htm?itemid=35648967399&spuid=226460655&sellerid=1809124267&order=3&currentpage=1&append=0&content=1&tagid=&posi=&picture=&ua=011uw5tcymnyqwiaiwqrhhbfef8qxthcklnmwc%3d%7cum5ocktyt3zcf0b9qn9gec4%3d%7cu2xmhdj7g2ahyg8has8wkaymcfq1uz9yjlxyjhi%3d%7cvghxd1llxgvyyvvov2pvaffvwgvhe0z%2frhfmeub4qhxcdkh8sxjccg%3d%3d%7cvwldfs0rmq47asedjwcpsddnpm4lnba7rijldxijzbk3ytc%3d%7cvmhigcufobgkgimxnwswczalkxcpeikjmwg9hsefjb8%2fbtopwq8%3d%7cv29phzefp29vbfz2snbkdiaapr0zht0boqi8a1ud%7cwgfbet8rmqszdy8qlxuudjijnqa1yzu%3d%7cwwbaed4qmau%2baseylbksddaeoga1yzu%3d%7cwmjcejwsmmjxb1d3t3jmc1nmwgjaefhmw2jcfezmwgw6gichkqcngcudibpmgg%3d%3d%7cw2jfykj%2fx2bafev5wwdfzuv8xgbudebgvxvjciq%3d&isg=82b6a3a1ed52a6996bca2111c9daaee6&_ksts=1440490222698_2142&callback=jsonp2143'

cont = requests.get(url).content
rex = re.compile(r'\w+[(]{1}(.*)[)]{1}')
content = rex.findall(cont)[0]
con = json.loads(content, 'GBK')
count = len(con['rateDetail']['rateList'])
for i in xrange(count):
    print con['rateDetail']['rateList'][i]['appendComment']['content']
Explanation:
First import the packages you need: re is the package required for regular expressions, and parsing the JSON data requires import json.
cont = requests.get(url).content  # get the JSON data from the page
rex = re.compile(r'\w+[(]{1}(.*)[)]{1}')  # the regular expression removes the redundant JSONP wrapper from cont, so the data becomes true JSON-formatted data such as {"a": "b", "c": "d"}
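To see what the regular expression does, here is a tiny self-contained demonstration; the callback name jsonp2143 and the payload are made-up values for illustration:

import re

# sample JSONP response: a callback name wrapping a JSON object
sample = 'jsonp2143({"a": "b", "c": "d"})'
rex = re.compile(r'\w+[(]{1}(.*)[)]{1}')
print rex.findall(sample)[0]  # prints: {"a": "b", "c": "d"}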
con = json.loads(content, 'GBK')  # json's loads function converts content into a data structure the json library functions can handle; 'GBK' is passed as the encoding of the data, because Windows systems default to GBK
count = len(con['rateDetail']['rateList'])  # get the number of user comments (on the current page only)
for i in xrange(count):
    print con['rateDetail']['rateList'][i]['appendComment']['content']
# loop over the users' comments and print them (you can also save the data as needed; see Step Four)
The difficulty here is finding the path to the user comments inside the cluttered JSON data.
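One way to find that path is to pretty-print the parsed structure and read the key names off the output. This is only an exploration aid, not part of the crawl itself:

import json

# pretty-print the parsed object so the nesting (rateDetail -> rateList -> appendComment -> content) is easy to read
print json.dumps(con, indent=2, ensure_ascii=False)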
Step Four: Save the parsed results
Here you can save the users' comment information locally, for example in CSV format.
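A minimal sketch, assuming con has already been parsed as in Step Three (the file name comments.csv is arbitrary):

# -*- coding: utf-8 -*-
import csv

# assumes `con` is the parsed JSON object from Step Three
rate_list = con['rateDetail']['rateList']
with open('comments.csv', 'wb') as f:  # 'wb' avoids extra blank lines with Python 2's csv module
    writer = csv.writer(f)
    writer.writerow(['content'])  # header row
    for rate in rate_list:
        comment = rate['appendComment']['content']
        writer.writerow([comment.encode('utf-8')])  # Python 2's csv module expects byte strings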
That is the whole of this article; I hope you find it helpful.