A long time ago I wrote an article on how to turn Weibo data into a word cloud image. That version was incomplete and could only use my own data. I have now reorganized it so that anyone's Weibo data can be turned into a word cloud, and even a complete Python beginner can get it done in a few minutes.
Preparatory Work
The environment here is based on Python 3; in theory Python 2.7 also works. First, install the necessary third-party packages:
# requirement.txt
jieba==0.38
matplotlib==2.0.2
numpy==1.13.1
pyparsing==2.2.0
requests==2.18.4
scipy==0.19.1
wordcloud==1.3.1
The requirement.txt file lists the dependencies above. If installation with pip fails, installing them through Anaconda is recommended.
pip install -r requirement.txt
Step One: Analyze the Website
Open the Weibo mobile site Https://m.weibo.cn/searchs, find the goddess's Weibo ID, go to her Weibo homepage, and analyze the requests the browser sends in the process.
Open Chrome's developer tools, switch to the Network tab, and you can see that the endpoint that returns the Weibo data is Https://m.weibo.cn/api/container/getIndex, followed by a series of query parameters. Some of these parameters change with the user and some are fixed; extract them first.
uid=1192515960&
luicode=10000011&
lfid=100103type%3d3%26q%3d%e6%9d%8e%e5%86%b0%e5%86%b0&
featurecode=20000320&
type=user&
containerid=1076031192515960
Now analyze what the endpoint returns. The response is a JSON dictionary: total is the total number of posts, each individual post is wrapped in an element of the cards array, and the post body is in its text field. A lot of irrelevant fields have been omitted here.
{"
Cardlistinfo": {"
Containerid": "1076031192515960",
"Total": 4754,
"page": 2
},
"cards" : [
{
"Card_type": 9,
"Mblog": {
"created_at": "08-26",
"Idstr": "4145069944506080",
" Text ":" Swiss day tour successful end ... ",}
}]
}
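To make the structure concrete, here is a minimal sketch of how these fields could be read once the response has been parsed into a Python dictionary (the resp variable is assumed to hold such a parsed response; the actual request is built in the next step):

resp = res.json()  # res: a requests response from the getIndex endpoint, built in step two
total = resp.get("cardlistInfo", {}).get("total")   # total number of posts
for card in resp.get("cards", []):
    if card.get("card_type") == 9:                  # type-9 cards contain ordinary posts
        print(card["mblog"]["text"])                # the post body we want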
Step Two: Build the request header and query parameters
After analyzing the page, we use requests to simulate the browser and build the crawler. Since fetching a user's posts does not require logging in to Weibo, there is no need to construct cookie information; a basic request header is enough, and which header fields are needed can also be copied from the browser. First build the required request parameters, namely the request headers and query parameters.
headers = {
    "Host": "m.weibo.cn",
    "Referer": "https://m.weibo.cn/u/1705822647",
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) "
                  "AppleWebKit/601.1.46 (KHTML, like Gecko) "
                  "Version/9.0 Mobile/13B143 Safari/601.1",
}

params = {
    "uid": "{uid}",
    "luicode": "20000174",
    "featurecode": "20000320",
    "type": "uid",
    "value": "1705822647",
    "containerid": "{containerid}",
    "page": "{page}",
}
uid is the Weibo user's ID. containerid is not self-explanatory, but it is also a parameter tied to a specific user. page is the paging parameter.
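Before writing the full crawler, a single test request helps confirm that the headers and parameters work. This is only a sketch; the uid and containerid below are the example values from the analysis above (in that example, containerid is simply "107603" prepended to the uid) and should be replaced with the target user's values:

import requests

url = "https://m.weibo.cn/api/container/getIndex"

params["uid"] = "1192515960"                 # target user's uid
params["containerid"] = "1076031192515960"   # "107603" + uid, as observed above
params["page"] = "1"

res = requests.get(url, params=params, headers=headers)
print(res.status_code)
print(res.json().get("cardlistInfo", {}).get("total"))   # total post count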
Step Three: Build a Simple Crawler
From the returned data we can read the total number of posts. To crawl the data we simply use the methods provided by requests to convert the JSON into a Python dictionary object, extract the value of every text field into the blogs list, and do a simple filter on the text beforehand to remove useless markup. While we are at it, the data is written to a file so that later conversions do not need to crawl again.
import codecs
import requests

url = "https://m.weibo.cn/api/container/getIndex"


def fetch_data(uid=None, container_id=None):
    """Crawl the posts and save them to a text file."""
    page = 0
    total = 4754
    blogs = []
    for i in range(0, total // 10):
        params['uid'] = uid
        params['page'] = str(page)
        params['containerid'] = container_id
        res = requests.get(url, params=params, headers=headers)
        cards = res.json().get("cards")
        for card in cards:
            # the body content of each post
            if card.get("card_type") == 9:
                text = card.get("mblog").get("text")
                text = clean_html(text)
                blogs.append(text)
        page += 1
        print("Crawled page {page}, {count} posts collected so far".format(page=page, count=len(blogs)))
    with codecs.open('weibo1.txt', 'w', encoding='utf-8') as f:
        f.write("\n".join(blogs))
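The clean_html helper called above is not defined in the snippet. A minimal sketch, assuming the only cleanup needed is stripping the HTML tags (links, emoji spans and so on) that Weibo embeds in the text field, could be:

import re

def clean_html(text):
    """Remove HTML tags and surrounding whitespace from a post body."""
    return re.sub(r'<[^>]+>', '', text).strip()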
Step Four: Word Segmentation and Building the Word Cloud
After all the data has been crawled, the first step is word segmentation. Here jieba is used to split the sentences according to Chinese context, and stop words are filtered out during segmentation. After processing, find a reference image to use as a mask, and assemble the words into a picture shaped by it.
import codecs

import jieba.analyse
import matplotlib.pyplot as plt
from scipy.misc import imread
from wordcloud import WordCloud


def generate_image():
    data = []
    jieba.analyse.set_stop_words("./stopwords.txt")
    with codecs.open("weibo1.txt", 'r', encoding="utf-8") as f:
        for text in f.readlines():
            data.extend(jieba.analyse.extract_tags(text, topK=20))
    data = " ".join(data)
    mask_img = imread('./52f90c9a5131c.jpg', flatten=True)
    wordcloud = WordCloud(
        font_path='msyh.ttc',
        background_color='white',
        mask=mask_img,
    ).generate(data)
    plt.imshow(wordcloud.recolor(color_func=grey_color_func, random_state=3),
               interpolation="bilinear")
    plt.axis('off')
    plt.savefig('./heart2.jpg', dpi=1600)
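The grey_color_func passed to recolor above is likewise not defined in the snippet. A minimal sketch, following the grey-scale coloring example in the wordcloud documentation, might be:

import random

def grey_color_func(word, font_size, position, orientation,
                    random_state=None, **kwargs):
    """Return a random grey tone (HSL with zero saturation) for each word."""
    return "hsl(0, 0%%, %d%%)" % random.randint(60, 100)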
Here is the reference image used as the mask:
And the final result:
The full code can be obtained by replying "Qixi" in the public account (The Zen of Python).
Public account: The Zen of Python