Crawling Weibo Data with Python to Generate a Word Cloud

Source: Internet
Author: User

A long time ago I wrote an article on how to turn Weibo data into a word cloud image. That version was incomplete and could only work with my own data. I have now reorganized it so that anyone's Weibo data can be used, and even a Python beginner can get it done in minutes.

Preparation

This environment is based on Python 3 (Python 2.7 should also work in theory). First, install the required third-party packages:

# requirement.txt
jieba==0.38
matplotlib==2.0.2
numpy==1.13.1
pyparsing==2.2.0
requests==2.18.4
scipy==0.19.1
wordcloud==1.3.1

The requirement.txt file above lists these dependencies. If installation with pip fails, it is recommended to install them with Anaconda instead.

pip install -r requirement.txt
Step One: Analyze the Website

Open the Weibo mobile site https://m.weibo.cn/searchs, find the goddess's Weibo ID, enter her Weibo homepage, and analyze the requests the browser sends.

Open Chrome's developer tools, select the Network tab, and observe that the endpoint that returns the Weibo data is https://m.weibo.cn/api/container/getIndex, followed by a series of query parameters. Some of these parameters change with the user and some are fixed; extract them first.

uid=1192515960&
luicode=10000011&
lfid=100103type%3d3%26q%3d%e6%9d%8e%e5%86%b0%e5%86%b0&
featurecode=20000320&
type=user&
containerid=1076031192515960
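
As a side note, some of these values are percent-encoded. If you are curious what a parameter such as lfid actually contains, a quick sketch using the standard library's urllib.parse (not part of the original article) decodes it:

from urllib.parse import unquote

# lfid is percent-encoded; decoding shows it simply records the search that led to the profile page
print(unquote("100103type%3d3%26q%3d%e6%9d%8e%e5%86%b0%e5%86%b0"))
# -> 100103type=3&q=李冰冰

For the crawl itself, only uid, containerid, and page really matter.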

Analyzing what the endpoint returns, the response is a JSON dictionary: total is the number of tweets, each individual tweet is wrapped in the cards array, and the tweet body is in the text field. A lot of irrelevant information has been omitted below.

{"
    Cardlistinfo": {"
        Containerid": "1076031192515960",
        "Total": 4754,
        "page": 2
    },
    "cards" : [
        {
            "Card_type": 9,
            "Mblog": {
                "created_at": "08-26",
                "Idstr": "4145069944506080",
                " Text ":" Swiss day tour successful end ... ",}
        }]
}
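
As a minimal sketch of reading this structure (assuming the response has already been fetched with requests and parsed via res.json()), the fields we care about can be pulled out like this:

data = res.json()  # the JSON dictionary shown above

total = data.get("cardlistInfo", {}).get("total")  # number of tweets
for card in data.get("cards", []):
    if card.get("card_type") == 9:    # type 9 cards are actual tweets
        print(card["mblog"]["text"])  # the tweet body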
Step Two: Build the Request Headers and Query Parameters

After analyzing the page, we use requests to simulate the browser and build the crawler to fetch the data. Because fetching a user's data does not require logging in to Weibo, we do not need to construct cookie information; the basic request headers are enough, and which headers are needed can also be read from the browser. We first build the required request parameters, including the request headers and the query parameters.

headers = {"
    Host": "m.weibo.cn",
    "Referer": "https://m.weibo.cn/u/1705822647",
    "user-agent": "mozilla/ 5.0 (IPhone; CPU iPhone os 9_1 like Mac os X applewebkit/601.1.46 (khtml, like Gecko) "version/9.0 mobile/13b143 safari/601.1
                  " c4/>}

params = {"UID": "{uid}",
          "Luicode": "20000174",
          "Featurecode": "20000320",
          "type": "UID",
          ' value ': ' 1705822647 ', '
          containerid ': ' {containerid} ',
          ' page ': ' {page} '}
uid is the Weibo user's ID. containerid's meaning is not obvious, but it is also a parameter tied to the specific user. page is the paging parameter.

Step Three: Build a Simple Crawler

From the returned data we can read the total number of tweets. For crawling, we directly use the method requests provides to convert the JSON response into a Python dictionary, extract the value of every text field into the blogs list, and apply a simple filter to the text beforehand to remove useless information. Along the way the data is written to a file, so the next conversion step does not have to crawl again.

import codecs
import requests

# API endpoint identified in Step One
url = "https://m.weibo.cn/api/container/getIndex"


def fetch_data(uid=None, container_id=None):
    """Crawl the tweets and save them to a text file."""
    page = 0
    total = 4754
    blogs = []
    for i in range(0, total // 10):  # the API returns roughly 10 tweets per page
        params['uid'] = uid
        params['page'] = str(page)
        params['containerid'] = container_id
        res = requests.get(url, params=params, headers=headers)
        cards = res.json().get("cards")

        for card in cards:
            # The body content of each tweet
            if card.get("card_type") == 9:
                text = card.get("mblog").get("text")
                text = clean_html(text)  # strip HTML tags from the tweet body
                blogs.append(text)
        page += 1
        print("Crawled page {page}, {count} tweets fetched so far".format(
            page=page, count=len(blogs)))
        # Rewrite the file on every page so progress is saved as we go
        with codecs.open('weibo1.txt', 'w', encoding='utf-8') as f:
            f.write("\n".join(blogs))

Step Four: Word Segmentation and Building the Word Cloud

After crawling all the data, the first step is word segmentation. Here the jieba segmenter is used, which splits sentences according to Chinese context; stop words are filtered out during segmentation. After processing, find a reference image, and then assemble the words into a picture shaped by that reference image.
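
To get a feel for what jieba.analyse.extract_tags returns, here is a minimal standalone sketch; the sample sentence is made up for illustration:

import jieba.analyse

sample = "瑞士一日游圆满结束，回家路上看到了美丽的日落"
# Returns a list of keyword strings ranked by TF-IDF weight;
# the exact result depends on jieba's built-in dictionary
print(jieba.analyse.extract_tags(sample, topK=5))

The generate_image function below applies the same idea to the whole crawled file and renders the word cloud: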

import codecs

import jieba.analyse
import matplotlib.pyplot as plt
from scipy.misc import imread
from wordcloud import WordCloud


def generate_image():
    data = []
    jieba.analyse.set_stop_words("./stopwords.txt")

    with codecs.open("weibo1.txt", 'r', encoding="utf-8") as f:
        for text in f.readlines():
            data.extend(jieba.analyse.extract_tags(text, topK=20))
        data = " ".join(data)
        mask_img = imread('./52f90c9a5131c.jpg', flatten=True)
        wordcloud = WordCloud(
            font_path='msyh.ttc',
            background_color='white',
            mask=mask_img
        ).generate(data)
        plt.imshow(wordcloud.recolor(color_func=grey_color_func, random_state=3),
                   interpolation="bilinear")
        plt.axis('off')
        plt.savefig('./heart2.jpg', dpi=1600)
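
The recolor call above refers to a grey_color_func helper that is not shown in this excerpt (it is part of the full source). A common implementation, adapted from the wordcloud package's own examples, looks roughly like this:

import random

def grey_color_func(word, font_size, position, orientation,
                    random_state=None, **kwargs):
    # Return a random grey shade; WordCloud.recolor() calls this for every word
    return "hsl(0, 0%%, %d%%)" % random.randint(60, 100)

With that in place, the whole pipeline is just fetch_data(...) followed by generate_image().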

This is the original image:

The final effect:

The full code can be obtained by replying "Qixi" in the WeChat public account (The Zen of Python).
Public account: The Zen of Python
