Crawling Ajax-Generated Data with Python, Using Taobao Reviews as an Example

Source: Internet
Author: User
Tags: python, ajax, comments

When learning Python you will run into sites whose content is requested dynamically via Ajax and rendered from asynchronously returned JSON data, so the techniques used to crawl static web pages no longer work. This article explains how to crawl Ajax-generated data with Python.

If you are interested in how to crawl the contents of static web pages, see that earlier article.

Here we take crawling Taobao comments as an example to explain how to do it.

There are four main steps:

1. Get the URL of the AJAX request that loads the Taobao comments
2. Get the JSON data returned by the AJAX request
3. Parse the JSON data with Python
4. Save the parsed results

Step 1: Get the AJAX request URL

I use the Chrome browser for this. Open Taobao, search for a product in the search box, for example "shoes", and pick the first item in the results.

This jumps to a new page. Since we want to crawl user comments, we click on the accumulated evaluations (the review tab).

There we can see how users rate the product. Right-click the page and choose Inspect (or simply press F12), then select the Network tab, as shown in the picture:

Scroll to the bottom of the comments and click the next page (or page 2). A few new entries appear in the Network panel; choose the one whose name starts with list_detail_rate.htm?itemId=35648967399.

Click that entry, and the request details appear in the panel on the right; copy the link shown in the Request URL field.

Paste the URL we just copied into the browser's address bar and open it. The page that comes back is exactly the data we need, although it looks very messy, because it is JSON data (wrapped in a JSONP callback).
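The copied URL is just the endpoint plus a query string, and the page number lives in the currentPage parameter, so later review pages can be fetched by rebuilding the URL. A minimal sketch, using the example itemId/sellerId values from above (the helper name is mine, and a real fetch would also need the session tokens from the full copied URL appended):

```python
from urllib.parse import urlencode

# Hypothetical helper: build the review-feed URL for a given page.
# Only the stable parameters are shown; the session tokens (ua, isg,
# _ksTS) from the copied Request URL must be added for a real request.
def rate_url(item_id, seller_id, page):
    base = "https://rate.tmall.com/list_detail_rate.htm"
    params = {
        "itemId": item_id,        # which product
        "sellerId": seller_id,    # which shop
        "currentPage": page,      # which page of comments to load
        "content": 1,             # ask for the comment text itself
    }
    return base + "?" + urlencode(params)

print(rate_url("35648967399", "1809124267", 2))
```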

Step 2: Get the JSON data returned by the AJAX request

Next, we fetch the JSON data from that URL. The Python editor I'm using is PyCharm. Here is the Python code:

    # -*- coding: utf-8 -*-
    import requests

    # Request URL copied from Chrome's Network panel; it is quite long.
    # itemId/sellerId identify the product and shop; ua, isg and _ksTS
    # are session tokens and will be different for you.
    url = ('https://rate.tmall.com/list_detail_rate.htm?itemId=35648967399'
           '&spuId=226460655&sellerId=1809124267&order=3&currentPage=1'
           '&append=0&content=1&tagId=&posi=&picture='
           '&ua=011uw5tcymnyqwiaiwqrhhbfef8qxthcklnmwc%3d%7cum5ocktyt3zcf0b9qn9gec4%3d'
           '%7cu2xmhdj7g2ahyg8has8wkaymcfq1uz9yjlxyjhi%3d'
           '%7cvghxd1llxgvyyvvov2pvaffvwgvhe0z%2frhfmeub4qhxcdkh8sxjccg%3d%3d'
           '%7cvwldfs0rmq47asedjwcpsddnpm4lnba7rijldxijzbk3ytc%3d'
           '%7cvmhigcufobgkgimxnwswczalkxcpeikjmwg9hsefjb8%2fbtopwq8%3d'
           '%7cv29phzefp29vbfz2snbkdiaapr0zht0boqi8a1ud'
           '%7cwgfbet8rmqszdy8qlxuudjijnqa1yzu%3d'
           '%7cwwbaed4qmau%2baseylbksddaeoga1yzu%3d'
           '%7cwmjcejwsmmjxb1d3t3jmc1nmwgjaefhmw2jcfezmwgw6gichkqcngcudibpmgg%3d%3d'
           '%7cw2jfykj%2fx2bafev5wwdfzuv8xgbudebgvxvjciq%3d'
           '&isg=82b6a3a1ed52a6996bca2111c9daaee6&_ksTS=1440490222698_2142'
           '&callback=jsonp2143')

    content = requests.get(url).text
    print(content)  # what is printed is the JSON data we saw in the browser, including the user comments

The content printed here is the JSON data we need; the next step is to parse it.

Step 3: Parse the JSON data with Python

    # -*- coding: utf-8 -*-
    import json
    import re

    import requests

    # Same Request URL as in step 2.
    url = ('https://rate.tmall.com/list_detail_rate.htm?itemId=35648967399'
           '&spuId=226460655&sellerId=1809124267&order=3&currentPage=1'
           '&append=0&content=1&tagId=&posi=&picture='
           '&ua=011uw5tcymnyqwiaiwqrhhbfef8qxthcklnmwc%3d%7cum5ocktyt3zcf0b9qn9gec4%3d'
           '%7cu2xmhdj7g2ahyg8has8wkaymcfq1uz9yjlxyjhi%3d'
           '%7cvghxd1llxgvyyvvov2pvaffvwgvhe0z%2frhfmeub4qhxcdkh8sxjccg%3d%3d'
           '%7cvwldfs0rmq47asedjwcpsddnpm4lnba7rijldxijzbk3ytc%3d'
           '%7cvmhigcufobgkgimxnwswczalkxcpeikjmwg9hsefjb8%2fbtopwq8%3d'
           '%7cv29phzefp29vbfz2snbkdiaapr0zht0boqi8a1ud'
           '%7cwgfbet8rmqszdy8qlxuudjijnqa1yzu%3d'
           '%7cwwbaed4qmau%2baseylbksddaeoga1yzu%3d'
           '%7cwmjcejwsmmjxb1d3t3jmc1nmwgjaefhmw2jcfezmwgw6gichkqcngcudibpmgg%3d%3d'
           '%7cw2jfykj%2fx2bafev5wwdfzuv8xgbudebgvxvjciq%3d'
           '&isg=82b6a3a1ed52a6996bca2111c9daaee6&_ksTS=1440490222698_2142'
           '&callback=jsonp2143')

    cont = requests.get(url).text
    # The response is JSONP -- jsonp2143({...}) -- so strip the callback
    # wrapper with a regular expression to get plain JSON.
    rex = re.compile(r'\w+\((.*)\)')
    content = rex.findall(cont)[0]
    con = json.loads(content)
    count = len(con['rateDetail']['rateList'])  # comments on this page only
    for i in range(count):
        print(con['rateDetail']['rateList'][i]['appendComment']['content'])

Explanation:

First import the packages we need: re is the package for regular expressions, and parsing the JSON data requires importing json.

cont = requests.get(url).text  # fetch the raw response body from the page

rex = re.compile(r'\w+\((.*)\)')  # regular expression that strips the JSONP wrapper from cont, leaving real JSON data of the form {"a": "b", "c": "d"}

con = json.loads(content) uses the json module's loads function to convert the JSON string into Python dictionaries and lists. (Python 2's json.loads accepted an encoding such as "GBK" as a second argument, since Windows defaults to GBK; in Python 3, requests has already decoded the text.)

count = len(con['rateDetail']['rateList'])  # the number of user comments (on the current page only)

for i in range(count):
    print(con['rateDetail']['rateList'][i]['appendComment']['content'])

The loop walks through the users' comments and prints them (you can also save the data as needed; see step 4).

The tricky part here is finding the path to the users' comments inside the cluttered JSON data.
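One way to make that path easier to find is to pretty-print the parsed object and read the nesting off the indentation. A small sketch on a made-up object with the same shape (the sample dict and its comment strings are illustrative, not real Tmall data):

```python
import json

# Illustrative stand-in for the parsed response; the real object has the
# same nesting: rateDetail -> rateList -> [i] -> appendComment -> content.
sample = {
    "rateDetail": {
        "rateList": [
            {"appendComment": {"content": "Comfortable shoes, true to size."}},
            {"appendComment": {"content": "Color faded after one wash."}},
        ]
    }
}

# Pretty-printing exposes the key path at a glance.
print(json.dumps(sample, indent=2, ensure_ascii=False))

# Once the path is known, iterating over the list is straightforward.
for rate in sample["rateDetail"]["rateList"]:
    print(rate["appendComment"]["content"])
```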

Step 4: Save the parsed results

Here you can save the users' comment information locally, for example in CSV format.
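As a sketch, the comments extracted in step 3 could be written out with the standard csv module. The comment list is shown here as a literal so the snippet runs on its own (in the real script you would build it from con['rateDetail']['rateList']), and the filename is arbitrary:

```python
import csv

# Stand-in for the comments collected in step 3; in the real script,
# collect con['rateDetail']['rateList'][i]['appendComment']['content']
# into this list instead.
comments = [
    "Comfortable shoes, true to size.",
    "Color faded after one wash.",
]

# Write one comment per row, with a header row, as UTF-8 CSV.
with open("taobao_comments.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["comment"])
    for text in comments:
        writer.writerow([text])
```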

That is the whole of this article; I hope you find it useful.
