Use Python to Crawl Weibo Data and Generate a Word Cloud Image: Example Code


Preface

I previously wrote an article about using Weibo data to create word cloud images, but it was incomplete and only worked with my own data. I have now reorganized the code so that it works with any user's Weibo data. With the annual "dog-food" festival (Chinese Valentine's Day) approaching, the question is whether to keep squatting in the corner silently eating dog food or to take the initiative, say goodbye to single life, and join the ranks of those handing it out. If you plan to send a message on Chinese Valentine's Day, you can try expressing your feelings for your goddess in a special way: one idea is to turn her past Weibo posts into a word cloud. This article shows you how to quickly create a word cloud with Python; even a Python beginner can finish one in minutes. Without further ado, let's look at the details.

Preparations

The environment is based on Python 3; in theory, Python 2.7 should also work. First, install the necessary third-party dependency packages:

# requirement.txt
jieba==0.38
matplotlib==2.0.2
numpy==1.13.1
pyparsing==2.2.0
requests==2.18.4
scipy==0.19.1
wordcloud==1.3.1

Save the packages above in a requirement.txt file and install them with pip. If the pip installation fails, we recommend installing them through Anaconda instead.

pip install -r requirement.txt

Step 1: Analyze the website

Open the Weibo mobile web site https://m.weibo.cn/searchs, find your goddess's Weibo ID, go to her Weibo profile page, and analyze the requests the browser sends.

Open Chrome's developer tools and select the Network tab. You will see that the interface used to fetch Weibo data is https://m.weibo.cn/api/container/getIndex, followed by a series of query parameters. Some of these parameters vary by user and some are fixed; extract them first.

uid=1192515960&luicode=10000011&lfid=100103type%3D3%26q%3D%E6%9D%8E%E5%86%B0%E5%86%B0&featurecode=20000320&type=user&containerid=1076031192515960

Then analyze the response returned by the interface. The data is a JSON dictionary: total is the total number of Weibo posts, the posts themselves are wrapped in the cards array, and the body of each post is in the text field, which also contains a lot of noise (HTML tags, repost markers, and so on).

{"CardlistInfo": {"containerid": "1076031192515960", "total": 4754, "page": 2}, "cards": [{"card_type": 9, "mblog": {"created_at": "08-26", "idstr": "4145069944506080", "text": "Switzerland day tour ends successfully... ",}}]}

Step 2: Construct request headers and query parameters

After analyzing the page, we can use requests to simulate the browser and build a crawler to fetch the data. Because fetching a user's posts does not require logging in to Weibo, there is no need to construct cookies; a basic request header, copied from the browser, is enough. First construct the required request parameters, namely the request headers and the query parameters.

headers = { "Host": "m.weibo.cn", "Referer": "https://m.weibo.cn/u/1705822647", "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) "   "Version/9.0 Mobile/13B143 Safari/601.1",}params = {"uid": "{uid}",  "luicode": "20000174",  "featurecode": "20000320",  "type": "uid",  "value": "1705822647",  "containerid": "{containerid}",  "page": "{page}"}
  • Uid is the id of a Weibo user.
  • Although containerid does not mean anything, it is also a parameter related to a specific user.
  • Page paging Parameters
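The placeholders in params ({uid}, {containerid}, {page}) need to be filled in for each request. Here is a minimal sketch of how they can be filled for a given user, assuming the containerid pattern observed above holds; the helper name build_params is hypothetical and not part of the original article:

def build_params(uid, page):
    """Fill the query-parameter template for one page of a user's Weibo."""
    # Assumption: containerid is "107603" + uid, as in the example above.
    return {
        "uid": uid,
        "luicode": "20000174",
        "featurecode": "20000320",
        "type": "uid",
        "value": uid,
        "containerid": "107603" + uid,
        "page": str(page),
    }

# Example: parameters for the first page of user 1192515960
params = build_params("1192515960", 1)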

Step 3: Construct a simple crawler

From the returned data we can read the total number of Weibo posts. While crawling, the JSON response is converted into a Python dictionary using the method provided by requests, and the values of all text fields are extracted and appended to the blogs list. Before collecting the text, simple filtering is applied to remove useless information. The data is also written to a file, so it does not have to be crawled again the next time the word cloud is generated.

def fetch_data(uid=None, container_id=None):
    """Fetch the data and save it to a text file.
    :return:
    """
    page = 0
    total = 4754
    blogs = []
    for i in range(0, total // 10):
        params['uid'] = uid
        params['page'] = str(page)
        params['containerid'] = container_id
        res = requests.get(url, params=params, headers=headers)
        cards = res.json().get("cards")
        for card in cards:
            # The body content of each Weibo post
            if card.get("card_type") == 9:
                text = card.get("mblog").get("text")
                text = clean_html(text)
                blogs.append(text)
        page += 1
        print("Crawled page {page}. A total of {count} Weibo posts fetched so far".format(page=page, count=len(blogs)))
    with codecs.open('weibo1.txt', 'w', encoding='utf-8') as f:
        f.write("\n".join(blogs))
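fetch_data relies on a small clean_html helper (shown in the complete listing at the end) to strip HTML tags, repost markers, and other noise from each post. With the uid and containerid identified in Step 1, the function is then called like this:

# uid and containerid taken from Step 1 (containerid = "107603" + uid)
fetch_data("1192515960", "1076031192515960")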

Step 4: Word segmentation and word cloud construction

After all the data has been crawled, the text is first segmented into words. Segmentation is done with Chinese-aware word segmentation, and stopwords are filtered out during the process. After that, find a reference image and assemble the word cloud into that shape, using the image as a mask.

def generate_image():
    data = []
    jieba.analyse.set_stop_words("./stopwords.txt")
    with codecs.open("weibo1.txt", 'r', encoding="utf-8") as f:
        for text in f.readlines():
            data.extend(jieba.analyse.extract_tags(text, topK=20))
    data = " ".join(data)
    mask_img = imread('./52f90c9a5131c.jpg', flatten=True)
    wordcloud = WordCloud(
        font_path='msyh.ttc',
        background_color='white',
        mask=mask_img
    ).generate(data)
    plt.imshow(wordcloud.recolor(color_func=grey_color_func, random_state=3),
               interpolation="bilinear")
    plt.axis('off')
    plt.savefig('./heart2.jpg', dpi=1600)
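generate_image recolors the word cloud with grey_color_func, a small coloring callback that only appears in the complete listing below. As reconstructed from that listing, it maps every word to the same HSL grey (lightness 0, i.e. black):

def grey_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    # Return an HSL color string; lightness is fixed at 0 for every word
    return "hsl(0, 0%%, %d%%)" % 0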

Finally:

The complete sample code is as follows:

# -*- coding: utf-8 -*-
import codecs
import re

import jieba.analyse
import matplotlib.pyplot as plt
import requests
from scipy.misc import imread
from wordcloud import WordCloud

__author__ = 'liuzhijun'

headers = {
    "Host": "m.weibo.cn",
    "Referer": "https://m.weibo.cn/u/1705822647",
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) "
                  "Version/9.0 Mobile/13B143 Safari/601.1",
}


def clean_html(raw_html):
    # Strip HTML tags, repost markers, full-width punctuation, "share an image" notices,
    # replies, and forwarded-by chains from the post body
    pattern = re.compile(r'<.*?>|Forward Weibo|//:|Repost|，|？|。|、|Share an image|Reply @.*?:|//@.*')
    text = re.sub(pattern, '', raw_html)
    return text


url = "https://m.weibo.cn/api/container/getIndex"

params = {
    "uid": "{uid}",
    "luicode": "20000174",
    "featurecode": "20000320",
    "type": "uid",
    "value": "1705822647",
    "containerid": "{containerid}",
    "page": "{page}",
}


def fetch_data(uid=None, container_id=None):
    """Fetch the data and save it to a text file.
    :return:
    """
    page = 0
    total = 4754
    blogs = []
    for i in range(0, total // 10):
        params['uid'] = uid
        params['page'] = str(page)
        params['containerid'] = container_id
        res = requests.get(url, params=params, headers=headers)
        cards = res.json().get("cards")
        for card in cards:
            # The body content of each Weibo post
            if card.get("card_type") == 9:
                text = card.get("mblog").get("text")
                text = clean_html(text)
                blogs.append(text)
        page += 1
        print("Crawled page {page}. A total of {count} Weibo posts fetched so far".format(page=page, count=len(blogs)))
    with codecs.open('weibo1.txt', 'w', encoding='utf-8') as f:
        f.write("\n".join(blogs))


def grey_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    s = "hsl(0, 0%%, %d%%)" % 0
    return s


def generate_image():
    data = []
    jieba.analyse.set_stop_words("./stopwords.txt")
    with codecs.open("weibo1.txt", 'r', encoding="utf-8") as f:
        for text in f.readlines():
            data.extend(jieba.analyse.extract_tags(text, topK=20))
    data = " ".join(data)
    mask_img = imread('./52f90c9a5425c.jpg', flatten=True)
    wordcloud = WordCloud(
        font_path='msyh.ttc',
        background_color='white',
        mask=mask_img
    ).generate(data)
    plt.imshow(wordcloud.recolor(color_func=grey_color_func, random_state=3),
               interpolation="bilinear")
    plt.axis('off')
    plt.savefig('./heart2.jpg', dpi=1600)


if __name__ == '__main__':
    fetch_data("1192515960", "1076031192515960")
    generate_image()

Summary

That is all the content of this article. I hope it helps you in your study or work. If you have any questions, please leave a message. Thank you for your support.
