Crawl Bilibili video comments with Python and create a word cloud

Source: Internet
Author: User

Python owes much of its strength as a crawling tool to its many powerful third-party libraries. Today's topic is crawling the comments on a Bilibili video. The real focus is analyzing the comment data, which arrives as nested dictionaries, and pulling out the content we want. The layers of nesting can be dizzying, so the analysis has to be meticulous. The steps are as follows:

  1. Press F12 to open the developer tools
    Go to the Bilibili page of the video you want to watch (for example, I opened a video by Bite the Cat). With the developer tools open, scroll down to the video comments so that they load. The Network tab of the developer tools now lists the many requests the page made to the site. Search through them carefully to find the one we want, such as:

    The response contains the content of the comment area. Copy the request URL from the request headers (https://api.bilibili.com/x/v2/reply?callback=jQuery172048896660782015544_1512700122908&jsonp=jsonp&pn=1&type=1&oid=11022534&sort=0&_=1512700148066) and paste it into the browser to view one page of comments. After removing the unnecessary parameters, what remains is https://api.bilibili.com/x/v2/reply?pn=1&type=1&oid=11022534, where pn is the comment page number (here, the first page) and oid is the av number of the video.
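The URL clean-up described above can also be done programmatically. The following is a minimal sketch using only the standard library. The helper name `clean_reply_url` and the `KEEP` tuple are illustrative assumptions, chosen to match the shortened URL shown above.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Parameters kept in the shortened URL from the text. Everything else
# (callback, jsonp, sort and the "_" parameter) is dropped.
KEEP = ("pn", "type", "oid")


def clean_reply_url(raw_url):
    """Return raw_url with only the pn, type and oid query parameters."""
    parts = urlsplit(raw_url)
    params = dict(parse_qsl(parts.query))
    query = urlencode([(k, params[k]) for k in KEEP if k in params])
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


raw = ("https://api.bilibili.com/x/v2/reply"
       "?callback=jQuery172048896660782015544_1512700122908"
       "&jsonp=jsonp&pn=1&type=1&oid=11022534&sort=0&_=1512700148066")
print(clean_reply_url(raw))
# prints https://api.bilibili.com/x/v2/reply?pn=1&type=1&oid=11022534
```

Changing the value of pn in the result then selects a different page of comments for the same video.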
  2. Parse the response: the content is in dictionary format, so work out how the nested content is contained
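To make the nesting concrete, here is a small sketch. The `sample` dictionary is hand-written for illustration and contains only the keys that the code in step 3 reads (`data`, `hots`, `replies`, `content`, `message`); a real response carries many more fields. The helper `collect_messages` is a hypothetical name, and it walks the replies recursively rather than with the two nested loops used in step 3.

```python
def collect_messages(comments):
    """Flatten comment dicts and their nested replies into a list of strings."""
    messages = []
    # Assumption: a missing or None list is treated as "no comments"
    for comment in comments or []:
        messages.append(comment["content"]["message"])
        messages.extend(collect_messages(comment.get("replies")))
    return messages


# Hand-written sample that mimics the nesting; not real API output
sample = {
    "data": {
        "hots": [
            {"content": {"message": "hot comment"},
             "replies": [{"content": {"message": "reply to hot"},
                          "replies": None}]},
        ],
        "replies": [
            {"content": {"message": "first comment"}, "replies": []},
            {"content": {"message": "second comment"},
             "replies": [{"content": {"message": "reply to second"},
                          "replies": None}]},
        ],
    }
}

data = sample["data"]
hot = collect_messages(data.get("hots"))
normal = collect_messages(data.get("replies"))
print(hot)     # ['hot comment', 'reply to hot']
print(normal)  # ['first comment', 'second comment', 'reply to second']
```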
  3. Fetch the content with code and write it to a local file
    import json

    import requests


    def gethtml(html):
        count = 1
        # 'with' closes the file even if an error interrupts the loop
        with open('Bilibili.txt', 'w', encoding='utf-8') as fi:
            while True:
                resp = requests.get(html + str(count))
                if resp.status_code != 200:
                    break
                cont = json.loads(resp.text)
                # Treat a null reply list as empty so that the loops cannot fail
                replies = cont['data']['replies'] or []
                if count == 1:
                    # Hot comments are only read from the first page
                    try:
                        for hot in cont['data']['hots']:
                            # Hot comment content
                            fi.write(hot['content']['message'] + '\n')
                            # Replies to the hot comment
                            for rp in hot['replies'] or []:
                                fi.write(rp['content']['message'] + '\n')
                    except (KeyError, TypeError):
                        # No hot comments in this response
                        pass
                if not replies:
                    # An empty page means there are no more comments
                    break
                for com in replies:
                    fi.write(com['content']['message'] + '\n')
                    for rp in com['replies'] or []:
                        fi.write(rp['content']['message'] + '\n')
                print("page %d write succeeded!" % count)
                count += 1
        print(count - 1, 'pages of comments written successfully!')


    url = "https://api.bilibili.com/x/v2/reply?type=1&oid="
    # The oid parameter is the av number of the video, not a full URL
    av = input("Input the av number of the video: ")
    html = url + av + '&pn='
    gethtml(html)
    (Figure: the retrieved comment content)

  4. Draw a word cloud
    Drawing the word cloud involves the following: read the text for the cloud, use jieba (a third-party Chinese word-segmentation library) to split the text into words so that the high-frequency ones can be picked out, set a background image for the word cloud (optional), then view and save the result.
    The code is as follows:
    from os import path

    import jieba
    import matplotlib.pyplot as plt
    import numpy as np
    from PIL import Image
    from wordcloud import WordCloud, ImageColorGenerator

    lj = path.dirname(__file__)  # current file path
    # Text to read
    with open(path.join(lj, 'Bilibili.txt'), encoding='utf-8') as f:
        text = f.read()
    # Add words that jieba cannot segment correctly on its own. The two
    # strings below are as they appear in this translated text; use the
    # terms exactly as they are written in the comments being analyzed.
    jieba.add_word('Bite the Cat.')
    jieba.add_word('Meow sauce')
    jbText = ' '.join(jieba.cut(text))
    imgMask = np.array(Image.open(path.join(lj, 'Msk.png')))  # read the background picture
    wc = WordCloud(
        background_color='white',
        max_words=500,
        font_path='MSYH.TTC',  # the default font does not support Chinese
        mask=imgMask,          # use the background picture as the mask
        random_state=30        # fixed seed so the result is reproducible
    ).generate(jbText)
    # Create the word colors from the picture and apply them; without
    # recolor() the generated colors are never used
    image_colors = ImageColorGenerator(imgMask)
    wc.recolor(color_func=image_colors)
    # plt.imshow(wc)
    # plt.axis('off')
    # plt.show()
    wc.to_file(path.join(lj, 'Bilidm.png'))
    print('Saved the word cloud picture successfully!')
    (Figure: the rendered word cloud)
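As a rough check of which words will dominate the cloud, the space-separated text produced by the segmentation step can be counted directly. This sketch uses only the standard library. `top_words` is a hypothetical helper, and the sample string stands in for the output of ' '.join(jieba.cut(text)). WordCloud does its own counting when it generates the image, so this is for inspection only.

```python
from collections import Counter


def top_words(segmented_text, n=2, min_len=2):
    """Return the n most frequent tokens that have at least min_len characters."""
    tokens = [t for t in segmented_text.split() if len(t) >= min_len]
    return Counter(tokens).most_common(n)


# Stand-in for segmented comment text; not real data
sample = "cat video cat cute a cute cat dance"
print(top_words(sample))  # [('cat', 3), ('cute', 2)]
```

Raising min_len filters out short tokens, which often carry little meaning in a word cloud.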

With only a small amount of code, Python can do such amazing work. Life is short, I use Python.

Original work takes effort, so please respect copyright. When reprinting, please cite the source: http://www.cnblogs.com/xsmile/
