Python Crawler Development, Dynamic Web Page Scraping: Crawling Blog Comment Data

Last updated: 2018-04-14. Source: Internet

Uploaded by: User



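This example crawls the comments on the personal blog of the author of the book "Python 網路爬蟲：從入門到實踐" (Python Web Crawler: From Introduction to Practice). URL: http://www.santostang.com/2017/03/02/hello-world/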

1) "Packet capture": find the real data address

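Right-click the page and choose "Inspect", open the "Network" tab, and select the "JS" filter. Refresh the page, then select the JS file named list?callback.... that is returned during the refresh. In the panel on the right, select Headers.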

There, the Request URL is the real data address.

Scrolling down in the same panel reveals the User-Agent.
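To see what the Request URL actually carries, it can be split into its host, path, and query parameters. The following is a small sketch using only the standard library; the URL is the one that appears in this article's own code, and the variable names are chosen here for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# Request URL copied from the Headers tab (the same URL used in this article's code).
request_url = (
    "https://api-zero.livere.com/v1/comments/list"
    "?callback=jQuery112405600294326674093_1523687034324"
    "&limit=10&offset=2&repSeq=3871836"
    "&requestPath=%2Fv1%2Fcomments%2Flist"
    "&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154"
    "&_=1523687034329"
)

parts = urlsplit(request_url)

# parse_qs maps each parameter name to a list of values; keep the first value of each.
params = {name: values[0] for name, values in parse_qs(parts.query).items()}

print(parts.netloc)  # host that serves the comment data
print(parts.path)    # API endpoint path
for name, value in params.items():
    print(f"{name} = {value}")
```

Listing the parameters this way makes it easier to spot which ones are likely to change between requests, such as limit and offset.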

2) Related code:

import requests
import json

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
link = "https://api-zero.livere.com/v1/comments/list?callback=jQuery112405600294326674093_1523687034324&limit=10&offset=2&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1523687034329"
r = requests.get(link, headers=headers)

# Extract the JSON string: drop the callback prefix and the trailing two characters
json_string = r.text
json_string = json_string[json_string.find('{'):-2]
json_data = json.loads(json_string)

# Print the text of every comment on this page
comment_list = json_data['results']['parents']
for eachone in comment_list:
    message = eachone['content']
    print(message)
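The slice in the code above drops everything before the first "{" and the last two characters, which presumably are the ");" that closes the callback wrapper. A slightly more tolerant way to unwrap such a response is to take the text between the first "(" and the last ")". The sketch below shows that variant; strip_jsonp is a hypothetical helper name, and the sample payload is made up for illustration (only the callback name is taken from the article's URL).

```python
import json


def strip_jsonp(text):
    """Return the parsed JSON wrapped in a response of the form callback(...);"""
    start = text.index("(") + 1   # just past the opening parenthesis of the callback
    end = text.rindex(")")        # the closing parenthesis of the callback
    return json.loads(text[start:end])


# Made-up sample response, NOT real data from the comment service.
sample = (
    'jQuery112405600294326674093_1523687034324('
    '{"results": {"parents": [{"content": "Hello"}, {"content": "Nice post (thanks)"}]}}'
    ');'
)

data = strip_jsonp(sample)
for comment in data["results"]["parents"]:
    print(comment["content"])
```

With a real response, strip_jsonp(r.text) would take the place of the slicing and json.loads steps.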

From observation, the offset parameter in the real data address is the page number.
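Since only offset changes from page to page, the address for any page can be generated from a dictionary of parameters instead of being edited by hand. This is a minimal sketch using the standard library; build_comment_url and BASE_URL are names introduced here, and the parameter values are copied from the Request URL used in this article.

```python
from urllib.parse import urlencode

BASE_URL = "https://api-zero.livere.com/v1/comments/list"


def build_comment_url(offset):
    """Build the data address for one page; only offset varies between pages."""
    params = {
        "callback": "jQuery112405600294326674093_1523687034324",
        "limit": 10,
        "offset": offset,
        "repSeq": 3871836,
        "requestPath": "/v1/comments/list",  # urlencode escapes "/" as %2F
        "consumerSeq": 1020,
        "livereSeq": 28583,
        "smartloginSeq": 5154,
        "_": 1523687034329,
    }
    return BASE_URL + "?" + urlencode(params)


for page in range(1, 4):
    print(build_comment_url(page))
```

Calling build_comment_url(2) reproduces the address captured from the browser, character for character.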

Crawl the comments on every page (pages 1 to 3 in this example):

import requests
import json

def single_page_comment(link):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
    r = requests.get(link, headers=headers)
    # Extract the JSON string: drop the callback prefix and the trailing two characters
    json_string = r.text
    json_string = json_string[json_string.find('{'):-2]
    json_data = json.loads(json_string)
    comment_list = json_data['results']['parents']
    for eachone in comment_list:
        message = eachone['content']
        print(message)

# Build the address for each page by inserting the page number as the offset
for page in range(1, 4):
    link1 = "https://api-zero.livere.com/v1/comments/list?callback=jQuery112405600294326674093_1523687034324&limit=10&offset="
    link2 = "&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1523687034329"
    page_str = str(page)
    link = link1 + page_str + link2
    print(link)
    single_page_comment(link)
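The loop above hard-codes the page range. When the number of pages is not known in advance, one option is to keep requesting pages until one comes back without comments. The sketch below assumes that an empty parents list marks the end, which is an assumption and not something the article confirms. crawl_all_comments, fetch_page, and the stand-in data are all hypothetical, and no network request is made.

```python
def crawl_all_comments(fetch_page, max_pages=100):
    """Collect comment texts page by page until a page comes back empty.

    fetch_page is any callable that takes a page number and returns the list
    of comment dicts for that page, for example a wrapper around requests.get
    plus the JSON parsing shown above.
    """
    messages = []
    for page in range(1, max_pages + 1):
        comments = fetch_page(page)
        if not comments:  # assumption: an empty list means there are no more pages
            break
        messages.extend(comment["content"] for comment in comments)
    return messages


# Stand-in data so the sketch runs without touching the network.
fake_pages = {
    1: [{"content": "a"}, {"content": "b"}],
    2: [{"content": "c"}],
}


def fake_fetch(page):
    return fake_pages.get(page, [])


print(crawl_all_comments(fake_fetch))
```

The max_pages cap is a safety limit so that the loop still ends if the service never returns an empty page.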

Reference: Tang Song, "Python 網路爬蟲：從入門到實踐" (Python Web Crawler: From Introduction to Practice)
