A while ago I wrote an article on turning Weibo data into a word cloud. That version was incomplete and only worked with my own data; I have now reorganized it so that any Weibo account's data can be used, and today seems like a fitting occasion to publish it.
It is that annual festival when couples shower everyone else with affection: you can keep quietly eating dog food in the corner, or finally say goodbye to single life and start serving dog food yourself. Wondering what thoughtful gift to send for Qixi (Chinese Valentine's Day)? Programmers can try a special way to express their feelings for the goddess: use a word cloud to showcase her past Weibo posts. This article shows how to quickly create such a word cloud with Python; even a Python beginner can manage it in minutes.
Preparatory work
This environment is based on Python 3 (in theory Python 2.7 should also work). First install the necessary third-party dependencies:
```
# requirement.txt
jieba==0.38
matplotlib==2.0.2
numpy==1.13.1
pyparsing==2.2.0
requests==2.18.4
scipy==0.19.1
wordcloud==1.3.1
```
The requirement.txt file contains the dependencies listed above. If installation with pip fails, installing them through Anaconda is recommended instead.
pip install -r requirement.txt
Step one: Analyze the URL
Open the Weibo mobile site https://m.weibo.cn/searchs, search for the goddess's Weibo ID, go to her Weibo homepage, and analyze how the browser sends its requests.
Open the Chrome developer tools, select the Network tab, and observe that the endpoint returning the Weibo data is https://m.weibo.cn/api/container/getIndex, followed by a series of parameters. Some of these parameters vary with the user and some are fixed; extract them first.
uid=1192515960&luicode=10000011&lfid=100103type%3D3%26q%3D%E6%9D%8E%E5%86%B0%E5%86%B0&featurecode=20000320&type=user&containerid=1076031192515960
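To see which parameters are user-specific, it helps to decode the query string first. A minimal sketch using only the standard library, with the values taken from the captured request above:

```python
from urllib.parse import parse_qs

# Query string copied from the request captured in the browser
query = ("uid=1192515960&luicode=10000011"
         "&lfid=100103type%3D3%26q%3D%E6%9D%8E%E5%86%B0%E5%86%B0"
         "&featurecode=20000320&type=user&containerid=1076031192515960")

# parse_qs splits the string on "&" and decodes the percent-encoding
for key, values in parse_qs(query).items():
    print(key, "=", values[0])

# uid and containerid identify the specific user (see the notes in step two);
# the other parameters come straight from the captured request
```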
Next, analyze the data the endpoint returns. The response is a JSON dictionary: total is the total number of Weibo posts, each individual post is wrapped in the cards array, and the post content itself is in the text field. A lot of irrelevant information has been omitted below.
{"Cardlistinfo":{"Containerid":"1076031192515960", "Total": 4754, "page": 2 }, "cards": [ { "Card_type" : 9, "Mblog": { "created_at": "08-26", "Idstr": " 4145069944506080 ", " text ": " Swiss day tour ends successfully ... ", }}
Step two: Build the request header and query parameters
After analyzing the page, we use requests to simulate the browser and build the crawler. Because fetching this user's data does not require logging in to Weibo, we do not need to construct any cookie information; a basic request header is enough, and the exact header fields can be copied from the browser. First construct the required request parameters, namely the request headers and the query parameters.
Headers={"Host":"M.weibo.cn","Referer":"https://m.weibo.cn/u/1705822647","User-agent":"Mozilla/5.0 (IPhone; CPU iPhone os 9_1 like Mac os X applewebkit/601.1.46 (khtml, like Gecko) ""Version/9.0 mobile/13b143 safari/601.1",}params = { " UID ": " {uid} "" Luicode ": "20000174" "Featurecode" : " type ": " UID " "value" : "1705822647" "Containerid ": " {Containerid} "" page ": span class= "S2" > "{page}" }
- uid is the Weibo user's ID
- containerid has no obvious meaning, but it is also a parameter tied to a specific user (see the note after this list)
- page is the paging parameter
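In the request captured in step one, the containerid (1076031192515960) happens to be 107603 followed by the uid (1192515960). That is only an observation from this particular capture, not a documented rule, so it is safest to copy the containerid straight from the browser. A tiny sketch of filling in the placeholders with the captured values:

```python
uid = "1192515960"
containerid = "1076031192515960"   # copied from the captured request

# Observation from this capture only: containerid == "107603" + uid.
# Not a documented rule, so prefer the value seen in the browser.
assert containerid == "107603" + uid

params["uid"] = uid
params["containerid"] = containerid
params["page"] = "1"
```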
Step three: Construct a simple crawler
From the returned data we can read total, the number of Weibo posts, and then crawl them directly. The method provided by requests converts the JSON response into a Python dictionary object; we extract every value of the text field and append it to the blogs list, lightly filtering each text first to remove useless information. Along the way the data is written to a file, so that later conversions do not need to crawl again.
```python
import codecs

import requests

# Endpoint observed in the browser in step one
url = "https://m.weibo.cn/api/container/getIndex"


def fetch_data(uid=None, container_id=None):
    """Crawl the data and save it to a text file."""
    page = 0
    total = 4754
    blogs = []
    for i in range(0, total // 10):
        params['uid'] = uid
        params['page'] = str(page)
        params['containerid'] = container_id
        res = requests.get(url, params=params, headers=HEADERS)
        cards = res.json().get("cards")
        for card in cards:
            # Keep only the text content of each Weibo post
            if card.get("card_type") == 9:
                text = card.get("mblog").get("text")
                text = clean_html(text)
                blogs.append(text)
        page += 1
        print("Crawled page {page}, {count} Weibo posts fetched so far"
              .format(page=page, count=len(blogs)))

    with codecs.open('weibo1.txt', 'w', encoding='utf-8') as f:
        f.write("\n".join(blogs))
```
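The clean_html helper used above is not shown in this excerpt (the full code is available via the public account at the end). The text field of each post typically contains HTML markup such as links, so here is a minimal sketch of what clean_html could look like, assuming a simple tag-stripping regex is all the filtering needed:

```python
import re


def clean_html(raw_html):
    """Strip HTML tags (links, emoji markup, etc.) from a post's text."""
    return re.sub(r"<[^>]+>", "", raw_html)
```

With that in place, the crawl for the example user could be started with something like fetch_data(uid="1192515960", container_id="1076031192515960").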
Step four: Word segmentation and building the word cloud
Once all the data has been crawled, the first step is word segmentation. Here the jieba segmenter is used, which splits sentences according to Chinese context, and stop words are filtered out in the process. After that, find a reference image, and the segmented words are assembled into a picture shaped like that reference image.
```python
import codecs

import jieba.analyse
import matplotlib.pyplot as plt
from scipy.misc import imread
from wordcloud import WordCloud


def generate_image():
    data = []
    jieba.analyse.set_stop_words("./stopwords.txt")

    with codecs.open("weibo1.txt", 'r', encoding="utf-8") as f:
        for text in f.readlines():
            data.extend(jieba.analyse.extract_tags(text, topK=20))
        data = " ".join(data)
        # The reference image that shapes the word cloud
        mask_img = imread('./52f90c9a5131c.jpg', flatten=True)
        wordcloud = WordCloud(
            font_path='msyh.ttc',
            background_color='white',
            mask=mask_img
        ).generate(data)
        plt.imshow(wordcloud.recolor(color_func=grey_color_func, random_state=3),
                   interpolation="bilinear")
        plt.axis('off')
        # Output path not shown in the original; this filename is a placeholder
        plt.savefig('./word_cloud.png', dpi=1600)
```
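grey_color_func is referenced above but not defined in this excerpt. The wordcloud library's examples use a small colouring callback of this shape, which is presumably what is intended here; a sketch based on that example:

```python
import random


def grey_color_func(word, font_size, position, orientation,
                    random_state=None, **kwargs):
    """Return a random grey for each word so the cloud stays monochrome."""
    return "hsl(0, 0%%, %d%%)" % random.randint(60, 100)
```

Calling fetch_data(...) first and then generate_image() produces and saves the final word cloud.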
The final result:
The full code can be obtained by replying "Qixi" in the WeChat public account (Zen of Python).