Crawl Weibo data with Python and generate word clouds

Quite a while ago I wrote an article on how to turn Weibo data into a word cloud image, but it was incomplete and only worked with my own data. I have now reorganized it so that any Weibo user's data can be used, and the timing could hardly be more fitting.

The annual "couples' showing-off festival", Qixi (Chinese Valentine's Day), is here again. Will you keep squatting in the corner silently "eating dog food" (watching couples flaunt their affection), or take the initiative to say goodbye to the single life? Wondering what Qixi gift shows the most thought? Programmers can try a special way to express feelings for the goddess: use a word cloud to showcase her past Weibo posts. This article teaches you how to quickly create such a word cloud with Python; even a Python novice can do it in minutes.

Preparatory work

This environment is based on Python 3; in theory Python 2.7 should also work. First install the necessary third-party dependency packages:

# requirement.txt
jieba==0.38
matplotlib==2.0.2
numpy==1.13.1
pyparsing==2.2.0
requests==2.18.4
scipy==0.19.1
wordcloud==1.3.1

The requirement.txt file contains the dependency packages listed above. If installation with pip fails, Anaconda is recommended instead.

pip install -r requirement.txt
Step one: Analyze the URL

Open the Weibo mobile site https://m.weibo.cn/searchs, search for the goddess's Weibo ID, go to her Weibo homepage, and analyze the requests the browser sends.

Open the Chrome developer tools, select the Network tab, and observe that the interface that returns the Weibo data is https://m.weibo.cn/api/container/getIndex, followed by a series of parameters. Some of the parameters change with the user and some are fixed; extract them first.

uid=1192515960&luicode=10000011&lfid=100103type%3D3%26q%3D%E6%9D%8E%E5%86%B0%E5%86%B0&featurecode=20000320&type=user&containerid=1076031192515960
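To see what each parameter actually carries, the query string can be split and decoded with Python's standard library; a quick illustration (the string is the one captured above):

from urllib.parse import parse_qs

query = ("uid=1192515960&luicode=10000011"
         "&lfid=100103type%3D3%26q%3D%E6%9D%8E%E5%86%B0%E5%86%B0"
         "&featurecode=20000320&type=user&containerid=1076031192515960")
# parse_qs splits on & and percent-decodes each value
for key, values in parse_qs(query).items():
    print(key, "=", values[0])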

Next, analyze the results the interface returns. The returned data is a JSON dictionary structure: total is the total number of Weibo posts, each individual post is encapsulated in the cards array, and its actual content is in the text field. (A lot of irrelevant information has been omitted below.)

{"Cardlistinfo":{"Containerid":"1076031192515960", "Total": 4754, "page": 2 }, "cards": [ { "Card_type"  : 9, "Mblog": { "created_at": "08-26", "Idstr": " 4145069944506080 ", " text ": " Swiss day tour ends successfully ... ",      }}  
Step two: Build the request header and query parameters

Having analyzed the web page, we use requests to simulate the browser and build the crawler to fetch the data. Because fetching a user's data here does not require logging in to Weibo, we don't need to construct cookie information; basic request headers are enough, and exactly which header fields are needed can be read from the browser. First construct the required request parameters, including the request headers and query parameters:

Headers={"Host":"M.weibo.cn","Referer":"https://m.weibo.cn/u/1705822647","User-agent":"Mozilla/5.0 (IPhone; CPU iPhone os 9_1 like Mac os X applewebkit/601.1.46 (khtml, like Gecko) ""Version/9.0 mobile/13b143 safari/601.1",}params = { " UID ": " {uid} "" Luicode ":  "20000174"  "Featurecode" :  " type ": " UID "  "value" :  "1705822647"  "Containerid ": " {Containerid} "" page ": span class= "S2" > "{page}" }            
    • uid is the Weibo user's ID
    • containerid has no obvious meaning, but it is also a parameter tied to the specific user
    • page is the paging parameter
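The snippets that follow also rely on a handful of imports and on a url variable pointing at the interface found in step one. The original post doesn't show them, so here is a minimal set, assuming the packages pinned in requirement.txt:

import codecs

import jieba.analyse
import matplotlib.pyplot as plt
import requests
from scipy.misc import imread   # imread lives in scipy.misc in the pinned scipy==0.19.1
from wordcloud import WordCloud

# The endpoint identified in step one
url = "https://m.weibo.cn/api/container/getIndex"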
Step three: Construct a simple crawler

From the returned data we can read total, the number of Weibo posts. Fetch the data directly with the method requests provides, convert the JSON into a Python dictionary object, extract the value of every text field, and append it to the blogs list, running a simple filter before extraction to strip the useless markup. Along the way, write the data to a file, so later conversions don't need to crawl again.

def fetch_data(uid=None, container_id=None):
    """Crawl the data and save it to a text file."""
    page = 0
    total = 4754
    blogs = []
    for i in range(0, total // 10):
        params['uid'] = uid
        params['page'] = str(page)
        params['containerid'] = container_id
        res = requests.get(url, params=params, headers=HEADERS)
        cards = res.json().get("cards")
        for card in cards:
            # card_type 9 is an actual Weibo post; its content is in the text field
            if card.get("card_type") == 9:
                text = card.get("mblog").get("text")
                text = clean_html(text)
                blogs.append(text)
        page += 1
        print("Crawled page {page}, fetched {count} posts so far".format(
            page=page, count=len(blogs)))
    with codecs.open('weibo1.txt', 'w', encoding='utf-8') as f:
        f.write("\n".join(blogs))
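fetch_data calls a clean_html helper that the excerpt doesn't include; a minimal sketch that strips the HTML tags Weibo embeds in the text field could look like this:

import re

def clean_html(raw_html):
    # Remove markup such as <a ...> and <span ...>, keeping only the visible text
    return re.sub(r'<[^>]+>', '', raw_html)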

Step four: Word segmentation and building the word cloud

With all the data crawled, the first step is word segmentation. Here we use jieba ("stutter") segmentation, which splits sentences according to Chinese context and filters out stop words along the way. After processing, find a reference image, and then assemble the words into a picture following the shape of that image.
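jieba.analyse.extract_tags does both jobs at once: it segments a sentence and keeps only the topK keywords ranked by TF-IDF weight. A quick illustration (the sample sentence is made up):

import jieba.analyse

# Prints the three highest-weighted keywords of the sentence
print(jieba.analyse.extract_tags("今天和闺蜜去瑞士一日游, 风景美极了", topK=3))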

def generate_image():
    data = []
    jieba.analyse.set_stop_words("./stopwords.txt")

    with codecs.open("weibo1.txt", 'r', encoding="utf-8") as f:
        for text in f.readlines():
            data.extend(jieba.analyse.extract_tags(text, topK=20))
        data = " ".join(data)
        # The reference image whose shape the word cloud will fill
        mask_img = imread('./52f90c9a5131c.jpg', flatten=True)
        wordcloud = WordCloud(
            font_path='msyh.ttc',           # a font with Chinese glyphs
            background_color='white',
            mask=mask_img
        ).generate(data)
        plt.imshow(
            wordcloud.recolor(color_func=grey_color_func, random_state=3),
            interpolation="bilinear"
        )
        plt.axis('off')
        # The output filename was lost in the original; use any path you like
        plt.savefig('./wordcloud.jpg', dpi=1600)
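generate_image recolors the cloud with grey_color_func, which the excerpt doesn't define; the wordcloud project's examples use a random-grey coloring function along these lines (a sketch):

import random

def grey_color_func(word, font_size, position, orientation,
                    random_state=None, **kwargs):
    # Assign each word a random grey tone (HSL with zero saturation)
    return "hsl(0, 0%%, %d%%)" % random.randint(60, 100)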

The final result:

The full code can be obtained by replying "Qixi" in the WeChat public account (Zen of Python).
