The annual lovers' holiday that torments us singles has just passed, and WeChat Moments was full of showing off: showing off gifts, showing off food, showing off affection. What do programmers show off? Working overtime. Still, a gift is indispensable, so what to give? As a programmer, I prepared a special one: a "heart" generated from past Weibo data. I figured she would be moved to tears. Haha.
Preparatory work
With the idea settled, it was time to act. Python was naturally the first choice. The general plan: crawl the Weibo data, clean and process it, run word segmentation, then hand the processed text to a word cloud tool and use scientific computing and plotting tools to produce the image. The toolkits involved:
requests for the network requests that crawl the Weibo data, jieba for Chinese word segmentation, wordcloud for word cloud generation, Pillow for image processing, NumPy for scientific computing, and Matplotlib, a MATLAB-like 2D plotting library.
Tool installation
Installing these toolkits may raise different errors on different platforms. wordcloud, requests, and jieba can be installed online with the usual pip commands:
pip install wordcloud
pip install requests
pip install jieba
Installing Pillow, numpy, and matplotlib on Windows directly with pip online can cause all sorts of problems. One recommended approach is to download prebuilt .whl files from the third-party site "Python Extension Packages for Windows" and install those. Choose according to your environment: cp27 corresponds to Python 2.7, amd64 to a 64-bit system. After downloading locally, install:
pip install Pillow-4.0.0-cp27-cp27m-win_amd64.whl
pip install scipy-0.18.0-cp27-cp27m-win_amd64.whl
pip install numpy-1.11.3+mkl-cp27-cp27m-win_amd64.whl
pip install matplotlib-1.5.3-cp27-cp27m-win_amd64.whl
On other platforms, errors can usually be resolved with a quick Google search. Alternatively, develop directly on Anaconda, a Python distribution with many scientific computing and machine learning modules built in.
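For reference, a minimal sketch of the Anaconda route (assuming the conda tool is installed and on the PATH): the scientific packages ship with the distribution, and the rest install via pip:

conda install numpy scipy matplotlib pillow
pip install wordcloud requests jieba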
Get Data
The official Sina Weibo API is hopeless: it only returns a user's 5 most recent posts. So, settling for second best, I turned to a crawler. Before crawling, I assessed the difficulty and checked whether someone had already written a good one. A stroll around GitHub turned up nothing that met the need, but it gave me some ideas, so I decided to write my own crawler, targeting the mobile site http://m.weibo.cn. It turns out the endpoint http://m.weibo.cn/index/my?format=cards&page=1 returns Weibo data page by page, in JSON format, which makes things much easier. The endpoint does require the login cookie, though: after logging in to your account, you can find the cookie information in your Chrome browser.
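The article doesn't show how the cookie is wired up. As a minimal sketch (the key names below are placeholders; use whatever fields your logged-in session actually contains, copied from Chrome's developer tools):

import requests

# Placeholder cookie fields captured from Chrome after logging in to m.weibo.cn;
# your session will contain different names and values.
cookies = {
    "SUB": "<value-from-chrome>",
    "SUHB": "<value-from-chrome>",
}

resp = requests.get("http://m.weibo.cn/index/my?format=cards&page=1",
                    cookies=cookies)
print(resp.status_code)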
Implementation code:
# -*- coding: utf-8 -*-
import requests

def fetch_weibo():
    api = "http://m.weibo.cn/index/my?format=cards&page=%s"
    for i in range(1, 102):  # 101 pages in total
        # cookies is the dict captured from the logged-in browser session
        response = requests.get(url=api % i, cookies=cookies)
        data = response.json()[0]
        groups = data.get("card_group") or []
        for group in groups:
            text = group.get("mblog").get("text")
            text = text.encode("utf-8")
            text = cleanring(text).strip()  # strip punctuation, HTML tags, repost boilerplate
            yield text
Checking Weibo showed 101 pages in total. Since returning everything as one list would use too much memory, the function uses yield to return a generator. The text is also cleaned along the way: punctuation, HTML tags, and boilerplate words such as "转发微博" ("repost") are removed.
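The cleanring function itself isn't shown in the article. A minimal sketch of such a cleanup step (the regexes here are my assumption) could look like this:

# -*- coding: utf-8 -*-
import re

def cleanring(text):
    # Strip HTML tags that Weibo embeds in the text (links, emoticon spans).
    text = re.sub(r"<[^>]+>", "", text)
    # Drop the boilerplate word added to reposts.
    text = text.replace("转发微博", "")
    # Collapse any runs of whitespace left behind.
    return re.sub(r"\s+", " ", text)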
Save data
Once the data is fetched, save it offline for reuse so it doesn't have to be crawled again. I use CSV format, saved to the file weibo.csv. The file may look garbled when opened (in Excel, for example); that's fine, it displays correctly in Notepad++.
import codecs
import csv

def write_csv(texts):
    with codecs.open('weibo.csv', 'w') as f:
        writer = csv.DictWriter(f, fieldnames=["text"])
        writer.writeheader()
        for text in texts:
            writer.writerow({"text": text})

def read_csv():
    with codecs.open('weibo.csv', 'r') as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield row['text']
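Chaining the two steps together, the crawl only needs to run once. A sketch of the glue code, assuming fetch_weibo and write_csv as defined above:

if __name__ == '__main__':
    # Crawl all pages once and persist them for later runs.
    write_csv(fetch_weibo())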
Word segmentation
Each weibo read back from weibo.csv is segmented and then handed to wordcloud to generate the word cloud. jieba handles most Chinese segmentation scenarios well, and the stop-word list stopwords.txt filters out useless words (for example: "then", "because", and so on).
import jieba.analyse

def word_segment(texts):
    jieba.analyse.set_stop_words("stopwords.txt")
    for text in texts:
        # Keep the 20 highest-weighted keywords of each weibo.
        tags = jieba.analyse.extract_tags(text, topK=20)
        yield " ".join(tags)
Create a picture
After segmentation, the data can be handed to wordcloud, which sizes each keyword's font according to how often the word appears, that is, its weight. By default this produces a plain square image, which has no beauty to it; this is, after all, something to be shown off. So instead we use an artistic image as a mask template and trace out a better-looking picture: wordcloud leaves the white background of the mask blank and draws the words only inside the dark shape. I found a "heart" pattern online:
The image generation code:
import matplotlib.pyplot as plt
from scipy.misc import imread
from wordcloud import WordCloud

def generate_img(texts):
    data = " ".join(text for text in texts)
    mask_img = imread('./heart-mask.jpg', flatten=True)  # grayscale mask image
    wordcloud = WordCloud(
        font_path='msyh.ttc',          # a Chinese font, otherwise CJK text renders as boxes
        background_color='white',
        mask=mask_img
    ).generate(data)
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.savefig('./heart.jpg', dpi=600)
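Putting the offline half of the pipeline together, a sketch of the driver code, assuming the functions defined above:

if __name__ == '__main__':
    texts = read_csv()            # weibo texts saved earlier in weibo.csv
    tags = word_segment(texts)    # space-joined keywords per weibo
    generate_img(tags)            # render the heart-shaped word cloud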
Note that you need to specify a Chinese font for matplotlib, otherwise the Chinese text comes out garbled. Find the font in the fonts folder C:\Windows\Fonts (Microsoft YaHei UI) and copy it into matplotlib's font directory, e.g. C:\Python27\Lib\site-packages\matplotlib\mpl-data\fonts\ttf.
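An alternative to copying the font file, if you'd rather not touch the matplotlib install directory, is to point matplotlib at a Chinese font in code. A sketch:

import matplotlib.pyplot as plt

# Tell matplotlib to fall back to Microsoft YaHei for CJK glyphs.
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
plt.rcParams['axes.unicode_minus'] = False  # keep the minus sign renderable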
That's almost it.
When I proudly sent the picture to her, this conversation followed:
Her: What is it?
Me: A heart. I made it with my own hands.
Her: So professional, I'm so touched. But your eyes hold only Python, not me (laughs).
Me: There is Python inside the heart.
I seem to have said something wrong. Hahaha.
The full code can be downloaded by replying "H" in the public account.
This article was first published on the WeChat public account "a programmer's micro-station" (id: vttalk), which shares substantive Python content with warmth.
Blog Address: https://foofish.net/python-heart.html
Create a "heart" with Python based on Weibo data