Customizing word clouds with Python

Source: Internet
Author: User

I. Experiment introduction

1.1 Experiment content

In the Internet age, people get information in many ways, and a flood of it pours into view every day. How to extract the key points from this mass of information and filter out the junk is a constant concern. In an era of information explosion we have to keep updating our knowledge, and the network is the best learning platform; being able to filter and process information improves learning efficiency. The word cloud was born for this purpose. A word cloud gives visual prominence to the keywords that appear most frequently in a text, forming a "keyword cloud" or "keyword rendering". It filters out a large amount of low-value text, so that a reader can grasp the main idea of an article or web page at a glance. Beyond that, a well-made word cloud image can be worth a thousand words: used appropriately in a report or a slide deck, it makes the message clearer and adds to the presentation.

This experiment uses Python's wordcloud extension package to generate a word cloud and save it as an image. It then describes how to make wordcloud display Chinese characters, and finally how to use a picture of your choice to customize the outline of the word cloud.

1.2 Experimental Knowledge points
    • Basic steps and principles for making a word cloud
    • Python code for producing a word cloud
    • Use of the wordcloud extension package
    • Using a custom image to create a word cloud, and analyzing the keywords of the novel "Three-Body" I, II, III
1.3 Experimental environment

The experiment was completed on Ubuntu 14.04. Because Python is cross-platform, the code can also run on Windows and Mac; only the font handling needs to be adapted.

    • Python 2.7
    • Xfce Terminal
1.4 Target audience

This course is of average difficulty and belongs to the beginner level. It suits users who already know basic Python and want to consolidate and deepen that knowledge.

1.5 Code acquisition

You can use the following commands to download the code into the lab environment, as a reference to compare against while you learn.

$ wget http://labfile.oss.aliyuncs.com/courses/756/simple.py
$ wget http://labfile.oss.aliyuncs.com/courses/756/my_word_cloud.py
II. Principle of the experiment

A word cloud is built by counting word frequencies in the input text and then drawing each word at a size that depends on how often it occurs: high-frequency words appear large, low-frequency words appear small. The text can be local data or data fetched dynamically from the network by a crawler.
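The principle above can be illustrated with a few lines of standard-library Python. This is only a sketch of the idea: the helper names word_frequencies and font_sizes are hypothetical, and the linear scaling rule is an assumption chosen for clarity, not the layout algorithm that wordcloud actually uses.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lowercase word occurs in an English text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def font_sizes(freqs, min_size=10, max_size=100):
    """Scale counts linearly so that the most frequent word gets max_size.

    Assumption for illustration only: wordcloud itself uses a more
    elaborate sizing and placement scheme.
    """
    if not freqs:
        return {}
    top = max(freqs.values())
    return {
        word: max(min_size, int(round(max_size * count / float(top))))
        for word, count in freqs.items()
    }


sample = "the cat and the dog and the bird"
freqs = word_frequencies(sample)
sizes = font_sizes(freqs)

# Print the words from most to least frequent with their computed size.
for word, count in freqs.most_common():
    print("%-5s count=%d size=%d" % (word, count, sizes[word]))
```

The most frequent word is drawn largest and every other word shrinks in proportion to its count, which is the behavior described above.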

III. Development preparation

Open the Xfce terminal, enter the Code directory, create a work folder, and use it as the working directory for the course. Then download and install the extension packages required for the experiment. If you want to experiment on your own computer, whether it runs Windows, Linux, or Mac, installing Anaconda is strongly recommended. It is a Python distribution for scientific computing that bundles almost all of the commonly used extension packages, so you do not have to install them one by one. It is maintained by Anaconda, Inc. (formerly Continuum Analytics) and updated for all three platforms at the same time.

$ cd work
$ sudo apt-get update
$ sudo apt-get install python-dev
$ sudo pip install numpy
$ sudo apt-get install python-matplotlib
$ sudo apt-get install python-pil

Download the text of the novel "Three-Body" I, II, III.

$ wget http://labfile.oss.aliyuncs.com/courses/756/santi.txt
$ wget http://labfile.oss.aliyuncs.com/courses/756/santi2.txt
$ wget http://labfile.oss.aliyuncs.com/courses/756/santi3.txt

Install the wordcloud extension package.

$ sudo pip install wordcloud
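After installing, it can be useful to confirm that the packages import cleanly before running any example. The following is a minimal sketch and not part of the course code; the helper name check_modules is hypothetical.

```python
import importlib


def check_modules(names):
    """Return a dict mapping each module name to True if it can be imported."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


if __name__ == "__main__":
    # The modules this experiment relies on (PIL is provided by python-pil).
    required = ["numpy", "matplotlib", "PIL", "wordcloud"]
    for name, ok in sorted(check_modules(required).items()):
        print("%-12s %s" % (name, "ok" if ok else "MISSING"))
```

Any module reported as MISSING should be installed before moving on to the next step.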
IV. Experiment steps

5.1 Run a simple project to test that the extension package is installed properly

Before dealing with "Three-Body", let's run the official sample program to make sure the extension package is installed properly and works. In the work directory, create a new Python script named simple.py:

$ gedit simple.py

The code is as follows:

#!/usr/bin/env python
"""
Minimal Example
===============
Generating a square wordcloud from the US constitution using default arguments.
"""
from os import path
from wordcloud import WordCloud

d = path.dirname(__file__)

# Read the whole text.
text = open(path.join(d, 'constitution.txt')).read()

# Generate a word cloud image
wordcloud = WordCloud().generate(text)

# Display the generated image:
# the matplotlib way:
import matplotlib.pyplot as plt
plt.imshow(wordcloud)
plt.axis("off")

# lower max_font_size
wordcloud = WordCloud(max_font_size=40).generate(text)
plt.figure()
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

As the code shows, the program looks for a text file named constitution.txt in the directory where the script is located, so we need to put this file in the work folder before running the script. Download the text using the following command:

$ wget http://labfile.oss.aliyuncs.com/courses/756/constitution.txt

Open a console in the work folder and run the script:

$ python simple.py

If the extension package is installed properly, the program opens a window showing the generated word cloud.

So far, we have got an English word cloud.

5.2 Solving the Chinese display problem

We have installed the wordcloud extension package and run a sample file successfully, but the sample has several problems. First, it only displays English. When presenting a report to Chinese colleagues or managers, an English word cloud is clearly inappropriate, and many texts are written in Chinese and cannot be turned into an English word cloud at all. Second, the outline of the word cloud is a square, which looks stiff and unattractive. We will solve both problems, starting with Chinese display.

Let's first try the naive approach: feed a novel to simple.py without any other change and see what the output looks like. We only need to modify the source file as follows. Note that the file header must declare that the file uses utf8 encoding:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Minimal Example
===============
Generating a square wordcloud from the US constitution using default arguments.
"""
from os import path
from wordcloud import WordCloud

d = path.dirname(__file__)

# Read the whole text.
# text = open(path.join(d, 'constitution.txt')).read()
text = open(u"santi.txt").read()

# Generate a word cloud image
wordcloud = WordCloud().generate(text)

# Display the generated image:
# the matplotlib way:
import matplotlib.pyplot as plt
plt.imshow(wordcloud)
plt.axis("off")

# lower max_font_size
wordcloud = WordCloud(max_font_size=40).generate(text)
plt.figure()
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

Run the program in the console and look at the result.

Excited, thinking you have succeeded? Take a closer look at the word cloud: for a Chinese novel, the most frequent words turn out to be 3K and CPU? This is not a computer magazine! The result must be wrong; at the very least, the word cloud of this novel should consist of Chinese words.

There are two reasons for this odd result. The first is that the "Three-Body" file is not encoded in utf8; it uses the Chinese gbk encoding. Yet we read the file directly with this statement:

text = open(u"santi.txt").read()

When the file is read this way, everything except the English words comes out garbled. wordcloud cannot recognize the garbled text, so naturally it cannot count its frequency. So let's change the code. You probably already know the fix:

text = open(u"santi.txt").read().decode('gbk')
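To see why the encoding matters, the sketch below encodes a short Chinese string as gbk and then tries to decode the bytes both ways. The sample string and the helper names try_decode and read_gbk are assumptions for illustration. Using io.open with an encoding argument is an alternative to .read().decode('gbk') that should behave the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import io

original = u"三体"
raw = original.encode("gbk")  # the bytes as they would be stored in a gbk file


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_gbk = try_decode(raw, "gbk")     # decodes back to the original string
as_utf8 = try_decode(raw, "utf-8")  # fails: these bytes are not valid utf-8


def read_gbk(filename):
    """Read a whole gbk-encoded text file and return it as unicode text."""
    with io.open(filename, encoding="gbk") as f:
        return f.read()


print(as_gbk == original)
print(as_utf8 is None)
```

With this helper, text = read_gbk("santi.txt") could stand in for the one-line fix above.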

Now look at the output again. We can boldly guess that English words will no longer be the largest this time. Indeed: this time the largest items in the output are empty boxes.

What's going on here? Looking closely, we can still find an English word or two, which means the Chinese characters are now being read correctly. Since the boxes appear more frequently than the English words, they must be the Chinese characters. But why are they rendered as boxes? This is the second reason: the default font that wordcloud uses has no glyphs for Chinese characters. wordcloud was not written with Chinese display in mind, and a stock Ubuntu system ships with relatively few Chinese fonts. However, since Python itself can handle Chinese, wordcloud should be able to as well. Let's find a way. Looking at our code carefully, the key statement that generates the word cloud is:

wordcloud = WordCloud(max_font_size=40).generate(text)

Let's take a closer look at the WordCloud class. Among the arguments it accepts, one is key:

    font_path : string
        Font path to the font that will be used (OTF or TTF).
        Defaults to DroidSansMono path on a Linux machine. If you are on
        another OS or don't have this font, you need to adjust this path.

So we can specify a font file for wordcloud to use instead of its default font. The problem becomes finding a font file on Linux that supports Chinese characters. I found the font DroidSansFallbackFull.ttf installed on an Ubuntu system. To avoid trouble on other Linux systems that lack this font, we simply put the font file in the work folder next to the source file, so that it can always be found. Download the font file using the following command:

$ wget http://labfile.oss.aliyuncs.com/courses/756/DroidSansFallbackFull.ttf

Place the font file in the folder where the Python script is located, then modify the source code to locate our own font file first:

import os
font = os.path.join(os.path.dirname(__file__), "DroidSansFallbackFull.ttf")

Then, in the two places where our source program instantiates the WordCloud class, tell it to use our own font file:

wordcloud = WordCloud(font_path=font).generate(text)
wordcloud = WordCloud(font_path=font, max_font_size=40).generate(text)
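If the font file is missing from the script folder, the failure only shows up later, when the word cloud is rendered. A small check makes the problem obvious earlier. This is a sketch only; the helper name find_font is hypothetical and not part of wordcloud.

```python
import os


def find_font(script_dir, font_name="DroidSansFallbackFull.ttf"):
    """Return the absolute path of font_name inside script_dir.

    Raises IOError with a readable message if the file is not there.
    """
    font_path = os.path.abspath(os.path.join(script_dir, font_name))
    if not os.path.isfile(font_path):
        raise IOError("Font file not found: %s" % font_path)
    return font_path


# Typical use inside the script (assumes the font sits next to the script):
#   font = find_font(os.path.dirname(os.path.abspath(__file__)))
#   wordcloud = WordCloud(font_path=font).generate(text)
```

The returned path can be passed as font_path exactly as in the two lines above.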

OK, let's look at the effect after the changes.

The Chinese words are now displayed correctly: we have achieved the goal we were hoping for!

5.3 Custom Word Cloud

The word clouds we see on the Internet often come in all kinds of unusual shapes.

Compared with them, doesn't our own square word cloud feel a bit plain, even embarrassing to show? No problem: we can make a word cloud with an irregular outline too. To customize the shape, we need a mask picture. The mask defines the region available to the word cloud, so that words are drawn only inside that region, which gives an effect similar to those word clouds. Our mask picture is the helmet of a Star Wars stormtrooper.

To do the experiment, please download it here:

$ wget http://labfile.oss.aliyuncs.com/courses/756/stormtrooper_mask.png
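A mask does not have to come from an image file. According to the wordcloud documentation, pure white entries of the mask array are treated as "masked out" and every other entry is free to draw on. The sketch below builds a circular mask with NumPy under that assumption; the helper name circular_mask and the size of 200 pixels are hypothetical choices for illustration.

```python
import numpy as np


def circular_mask(size=200):
    """Return a size x size uint8 array for use as a word cloud mask.

    Pixels inside the circle are 0 (free to draw on); pixels outside
    are 255 (white, i.e. masked out).
    """
    yy, xx = np.ogrid[:size, :size]
    center = (size - 1) / 2.0
    radius = size / 2.0
    outside = (xx - center) ** 2 + (yy - center) ** 2 > radius ** 2
    mask = np.zeros((size, size), dtype=np.uint8)
    mask[outside] = 255
    return mask


mask = circular_mask(200)
print(mask.shape)
```

Passing such an array as the mask argument of WordCloud should confine the words to the circle, in the same way that the stormtrooper image confines them to the helmet.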

To use the mask, we add the image to our code. The modified code is as follows:

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Using custom colors
====================
Using the recolor method and custom coloring functions.
"""
import numpy as np
from PIL import Image
from os import path
import matplotlib.pyplot as plt
import random
import os
from wordcloud import WordCloud, STOPWORDS

font = os.path.join(os.path.dirname(__file__), "DroidSansFallbackFull.ttf")


def grey_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    # Random grey with lightness between 60% and 100%
    # (upper bound taken from the upstream wordcloud example).
    return "hsl(0, 0%%, %d%%)" % random.randint(60, 100)


d = path.dirname(__file__)
mask = np.array(Image.open(path.join(d, "stormtrooper_mask.png")))
text = open(u"santi.txt").read().decode('gbk')

# preprocessing the text a little bit:
# count "Cheng Xin said / and / asked" toward the name "Cheng Xin"
text = text.replace(u"程心说", u"程心")
text = text.replace(u"程心和", u"程心")
text = text.replace(u"程心问", u"程心")

# adding movie script specific stopwords
stopwords = set(STOPWORDS)
stopwords.add("int")
stopwords.add("ext")

wc = WordCloud(font_path=font, max_words=2000, mask=mask, stopwords=stopwords,
               margin=10, random_state=1).generate(text)

# store default colored image
default_colors = wc.to_array()
plt.title("Custom colors")
plt.imshow(wc.recolor(color_func=grey_color_func, random_state=3))
wc.to_file("a_new_hope.png")
plt.axis("off")
plt.figure()
plt.title(u"Three-Body word frequency statistics")
plt.imshow(default_colors)
plt.axis("off")
plt.show()

Besides adding the mask, this code makes some small adjustments to the text before the word cloud is generated. For example, "Cheng Xin said" (程心说) should count toward the frequency of the name "Cheng Xin" (程心) instead of being counted as a separate word, so the program does a simple replacement. You may not be able to type Chinese characters in the test environment, so just run the code as it is to get a feel for it; later, on your own computer with a Chinese input method installed, you can modify the Chinese strings. Running the code produces a word cloud in the shape of the mask.
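The effect of the replacement step can be tried in isolation. In the sketch below, the helper names merge_variants and count_tokens are hypothetical, the sample sentence is made up, and splitting on whitespace is only a stand-in for real Chinese word segmentation.

```python
# -*- coding: utf-8 -*-
from collections import Counter


def merge_variants(text, variants, canonical):
    """Replace every string in variants with canonical so they are counted together."""
    for variant in variants:
        text = text.replace(variant, canonical)
    return text


def count_tokens(text):
    """Count whitespace-separated tokens (a stand-in for real word segmentation)."""
    return Counter(text.split())


# Made-up sample: "Cheng Xin said", "hello", "Cheng Xin asked", "why", "Cheng Xin"
sample = u"程心说 你好 程心问 为什么 程心"
before = count_tokens(sample)
after = count_tokens(merge_variants(sample, [u"程心说", u"程心问"], u"程心"))

print(before[u"程心"])
print(after[u"程心"])
```

Before merging, the name is counted once and its variants are counted separately; after merging, all three occurrences count toward the same name.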

We used the helmet of a Star Wars stormtrooper as the shape of the word cloud. Doesn't it look good?

However, our task is not over. We said we wanted a custom word cloud, so the mask picture itself should also be chosen to our own taste. To turn a favorite picture into a mask picture on Ubuntu, we need to install the GIMP software:

$ sudo apt-get install gimp

Then I searched Baidu for a picture I liked, a portrait of a girl.

Use the following command to get the picture, which is in jpg format:

$ wget http://labfile.oss.aliyuncs.com/courses/756/04.jpg

What are the requirements for the picture? Ideally the background is a solid color, which makes it easy to process; of course, if you have a Photoshop expert to help, any background will do. We use GIMP to turn everything except the girl, that is, the background, white; the result can then be used as a mask image. Let's get started. First, in the Layers panel, right-click the image and add an alpha channel:

Then, using the magic wand tool in the toolbox on the left, click on the girl to select her:

Then right-click in the blank area, choose Edit, and fill the background with white:

When this is done, the girl stands on a pure white background:

Note that when exporting the png format, choose the following settings:

This gives us the mask picture, which I renamed 04.png. We modify the code to use this image as the mask:

mask = np.array(Image.open(path.join(d, "04.png")))

Please use the following command to get the picture 04.png :

$ wget http://labfile.oss.aliyuncs.com/courses/756/04.png
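For pictures with a nearly uniform background, the background can also be turned white in code instead of in GIMP. This is a different technique from the manual steps above, shown only as a sketch: the helper name whiten_background, the tolerance value, and the tiny demo image are all assumptions for illustration. Note that any part of the subject that is close to the background color would be whitened as well.

```python
import numpy as np


def whiten_background(rgb, bg_color, tolerance=30):
    """Return a copy of an H x W x 3 uint8 image with near-bg_color pixels set to white.

    A pixel counts as background when each of its channels differs from
    bg_color by at most tolerance.
    """
    rgb = np.asarray(rgb, dtype=np.uint8)
    diff = np.abs(rgb.astype(np.int16) - np.array(bg_color, dtype=np.int16))
    is_background = diff.max(axis=-1) <= tolerance
    result = rgb.copy()
    result[is_background] = 255
    return result


# Tiny 2 x 2 demo image: three greenish background pixels and one subject pixel.
demo = np.array([[[10, 200, 10], [12, 198, 14]],
                 [[90, 40, 30], [10, 200, 10]]], dtype=np.uint8)
cleaned = whiten_background(demo, bg_color=(10, 200, 10))
print(cleaned[1, 0])
```

For a real picture, the input array could come from np.array(Image.open(...).convert("RGB")), and the result could be used as the mask.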

Look at the results of our operation:

Well, so far we've got a custom word cloud.

V. Summary of the Experiment

We implemented a custom word cloud using the wordcloud extension package, solved the problem that wordcloud does not display Chinese by default, and then used GIMP to prepare a custom image as a mask. Finally, we used the custom word cloud to analyze the word frequencies of the novel "Three-Body" and produced its word cloud.
