A long time ago I wrote an article on how to turn Weibo data into a word cloud image. That version was incomplete and could only use my own data. I have now reorganized it so that anyone's Weibo data can be turned into a word cloud, and even a complete Python beginner can get it done in a few minutes.
Preparatory Work
The environment here is based on Python 3; in theory Python 2.7 also works. First, install the necessary third-party packages:
# requirement.txt
jieba==0.38
matplotlib==2.0.2
numpy==1.13.1
pyparsing==2.2.0
requests==2.18.4
scipy==0.19.1
wordcloud==1.3.1
The requirement.txt file lists the dependencies above. If installation with pip fails, installing them through Anaconda is recommended.
pip install -r requirement.txt
Step One: Analyze the Website
Open the Weibo mobile site Https://m.weibo.cn/searchs, find the goddess's Weibo ID, go to her Weibo homepage, and analyze the requests the browser sends in the process.
Open Chrome's developer tools, switch to the Network tab, and you can see that the endpoint that returns the Weibo data is Https://m.weibo.cn/api/container/getIndex, followed by a series of query parameters. Some of these parameters change with the user and some are fixed; extract them first.
uid=1192515960&
luicode=10000011&
lfid=100103type%3d3%26q%3d%e6%9d%8e%e5%86%b0%e5%86%b0&
featurecode=20000320&
type=user&
containerid=1076031192515960
Now analyze what the endpoint returns. The response is a JSON dictionary: total is the total number of posts, each individual post is wrapped in an element of the cards array, and the post body is in its text field. A lot of irrelevant fields have been omitted here.
{"
Cardlistinfo": {"
Containerid": "1076031192515960",
"Total": 4754,
"page": 2
},
"cards" : [
{
"Card_type": 9,
"Mblog": {
"created_at": "08-26",
"Idstr": "4145069944506080",
" Text ":" Swiss day tour successful end ... ",}
}]
}
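To make the structure concrete, here is a minimal sketch of how these fields could be read once the response has been parsed into a Python dictionary (the resp variable is assumed to hold such a parsed response; the actual request is built in the next step):

resp = res.json()  # res: a requests response from the getIndex endpoint, built in step two
total = resp.get("cardlistInfo", {}).get("total")   # total number of posts
for card in resp.get("cards", []):
    if card.get("card_type") == 9:                  # type-9 cards contain ordinary posts
        print(card["mblog"]["text"])                # the post body we want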
Step Two: Build the request header and query parameters
After analyzing the page, we use requests to simulate the browser and build the crawler. Since fetching a user's posts does not require logging in to Weibo, there is no need to construct cookie information; a basic request header is enough, and which header fields are needed can also be copied from the browser. First build the required request parameters, namely the request headers and query parameters.
headers = {
    "Host": "m.weibo.cn",
    "Referer": "https://m.weibo.cn/u/1705822647",
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) "
                  "AppleWebKit/601.1.46 (KHTML, like Gecko) "
                  "Version/9.0 Mobile/13B143 Safari/601.1",
}

params = {
    "uid": "{uid}",
    "luicode": "20000174",
    "featurecode": "20000320",
    "type": "uid",
    "value": "1705822647",
    "containerid": "{containerid}",
    "page": "{page}",
}
uid is the Weibo user's ID. containerid is not self-explanatory, but it is also a parameter tied to a specific user. page is the paging parameter.
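Before writing the full crawler, a single test request helps confirm that the headers and parameters work. This is only a sketch; the uid and containerid below are the example values from the analysis above (in that example, containerid is simply "107603" prepended to the uid) and should be replaced with the target user's values:

import requests

url = "https://m.weibo.cn/api/container/getIndex"

params["uid"] = "1192515960"                 # target user's uid
params["containerid"] = "1076031192515960"   # "107603" + uid, as observed above
params["page"] = "1"

res = requests.get(url, params=params, headers=headers)
print(res.status_code)
print(res.json().get("cardlistInfo", {}).get("total"))   # total post count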
Step Three: Build a Simple Crawler
From the returned data we can read the total number of posts. To crawl the data we simply use the methods provided by requests to convert the JSON into a Python dictionary object, extract the value of every text field into the blogs list, and do a simple filter on the text beforehand to remove useless markup. While we are at it, the data is written to a file so that later conversions do not need to crawl again.
import codecs
import requests

url = "https://m.weibo.cn/api/container/getIndex"


def fetch_data(uid=None, container_id=None):
    """Crawl the posts and save them to a text file."""
    page = 0
    total = 4754
    blogs = []
    for i in range(0, total // 10):
        params['uid'] = uid
        params['page'] = str(page)
        params['containerid'] = container_id
        res = requests.get(url, params=params, headers=headers)
        cards = res.json().get("cards")
        for card in cards:
            # the body content of each post
            if card.get("card_type") == 9:
                text = card.get("mblog").get("text")
                text = clean_html(text)
                blogs.append(text)
        page += 1
        print("Crawled page {page}, {count} posts collected so far".format(page=page, count=len(blogs)))
    with codecs.open('weibo1.txt', 'w', encoding='utf-8') as f:
        f.write("\n".join(blogs))
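The clean_html helper called above is not defined in the snippet. A minimal sketch, assuming the only cleanup needed is stripping the HTML tags (links, emoji spans and so on) that Weibo embeds in the text field, could be:

import re

def clean_html(text):
    """Remove HTML tags and surrounding whitespace from a post body."""
    return re.sub(r'<[^>]+>', '', text).strip()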
Step Four: Word Segmentation and Building the Word Cloud
After all the data has been crawled, the first step is word segmentation. Here jieba is used to split the sentences according to Chinese context, and stop words are filtered out during segmentation. After processing, find a reference image to use as a mask, and assemble the words into a picture shaped by it.
import codecs

import jieba.analyse
import matplotlib.pyplot as plt
from scipy.misc import imread
from wordcloud import WordCloud


def generate_image():
    data = []
    jieba.analyse.set_stop_words("./stopwords.txt")
    with codecs.open("weibo1.txt", 'r', encoding="utf-8") as f:
        for text in f.readlines():
            data.extend(jieba.analyse.extract_tags(text, topK=20))
    data = " ".join(data)
    mask_img = imread('./52f90c9a5131c.jpg', flatten=True)
    wordcloud = WordCloud(
        font_path='msyh.ttc',
        background_color='white',
        mask=mask_img,
    ).generate(data)
    plt.imshow(wordcloud.recolor(color_func=grey_color_func, random_state=3),
               interpolation="bilinear")
    plt.axis('off')
    plt.savefig('./heart2.jpg', dpi=1600)
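The grey_color_func passed to recolor above is likewise not defined in the snippet. A minimal sketch, following the grey-scale coloring example in the wordcloud documentation, might be:

import random

def grey_color_func(word, font_size, position, orientation,
                    random_state=None, **kwargs):
    """Return a random grey tone (HSL with zero saturation) for each word."""
    return "hsl(0, 0%%, %d%%)" % random.randint(60, 100)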
Here is the reference image used as the mask:
And the final result:
The full code can be obtained by replying "Qixi" in the public account (The Zen of Python).
Public account: The Zen of Python