I. Experiment Introduction
1.1 Experiment content
In the Internet age, people receive information through many channels, and far more of it than they can read. Extracting the key points from this volume of text and filtering out the noise is a constant concern. In an era of information explosion we need to keep updating what we know, and the network is the best learning platform; the ability to filter and process information directly improves learning efficiency. The word cloud was created for this purpose. A word cloud makes the high-frequency keywords of a text visually prominent, forming a "keyword cloud" (or "keyword rendering") that filters out most of the low-value text, so that a reader can grasp the gist of an article or web page at a glance. A well-made word cloud can also be worth a thousand words: used appropriately in a report or a PPT, it makes the presentation clearer and fuller and adds to the speaker's message.
This experiment uses Python's wordcloud extension package to generate a word cloud and save it as a picture. It then describes how to make the wordcloud package display Chinese characters, and finally how to use a picture of your choice to customize the outline of the word cloud.
1.2 Knowledge points
- Basic steps and principles for making a word cloud
- Generating a word cloud with Python code
- Using the wordcloud extension package
- Using a custom image to shape a word cloud, and analyzing the keywords of "The Three-Body Problem" I, II, and III
1.3 Experimental environment
The experiment was completed on Ubuntu 14.04. Because Python is cross-platform, the code can also run on Windows and Mac systems; only the font handling needs to be adjusted for each system.
1.4 Target audience
This course is of average difficulty and belongs to the entry level. It is suitable for users who already know basic Python and want to consolidate and deepen that knowledge.
1.5 Getting the code
You can download the code into the lab environment with the following commands and use it as a reference for comparison while you learn.
$ wget http://labfile.oss.aliyuncs.com/courses/756/simple.py
$ wget http://labfile.oss.aliyuncs.com/courses/756/my_word_cloud.py
II. Experiment Principle
A word cloud is built by counting word frequencies in the input text and then drawing each word at a size proportional to how often it occurs: high-frequency words are drawn large and low-frequency words small. The text can be local data or data fetched dynamically from the network by a crawler.
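To make the principle concrete, here is a minimal sketch of the two steps, counting frequencies and mapping them to font sizes. It is only an illustration: the function names, the linear scaling, and the size limits are assumptions made for this example, not how the wordcloud package does it internally.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lowercase word occurs in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def font_sizes(frequencies, min_size=10, max_size=60):
    """Map each word to a font size that grows linearly with its frequency."""
    if not frequencies:
        return {}
    top = max(frequencies.values())
    sizes = {}
    for word, count in frequencies.items():
        # The most frequent word gets max_size; rarer words shrink toward min_size.
        scaled = min_size + (max_size - min_size) * count / float(top)
        sizes[word] = int(round(scaled))
    return sizes


freqs = word_frequencies("the cat and the hat and the bat")
print(font_sizes(freqs))
```

A real word cloud also has to place the words without overlap, which this sketch leaves out.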
III. Development Preparation
Open the Xfce terminal, enter the Code directory, and create a work folder to use as the working directory for the course. Then download and install the extension packages required for the experiment. If you want to experiment on your own computer, whether it runs Windows, Linux, or Mac, it is strongly recommended to install Anaconda. Anaconda is a Python distribution for scientific computing that bundles almost all of the commonly used extension packages, so you do not have to install them one by one. It is maintained by Anaconda, Inc. (formerly Continuum Analytics) and released for all three platforms.
$ cd Code
$ mkdir work
$ cd work
$ sudo apt-get update
$ sudo apt-get install python-dev
$ sudo pip install numpy
$ sudo apt-get install python-matplotlib
$ sudo apt-get install python-pil
Download the novel "The Three-Body Problem" I, II, and III.
$ wget http://labfile.oss.aliyuncs.com/courses/756/santi.txt
$ wget http://labfile.oss.aliyuncs.com/courses/756/santi2.txt
$ wget http://labfile.oss.aliyuncs.com/courses/756/santi3.txt
Install the wordcloud extension package.
$ sudo pip install wordcloud
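Before moving on, a quick check can confirm that the packages installed above are importable. The helper below is a hypothetical convenience written for this tutorial, not part of any of the packages; it simply tries to import each name and reports the ones that fail.

```python
import importlib


def missing_packages(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Standard-library names are used here so the demo runs anywhere.
print(missing_packages(["os", "json"]))
```

In the experiment environment, calling missing_packages(["numpy", "matplotlib", "PIL", "wordcloud"]) should return an empty list if every install step succeeded.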
IV. Experiment Steps
5.1 Run a simple project to test that the extension package is installed properly
Before processing "The Three-Body Problem", let's run the official sample program to make sure the extension package is installed properly and works. Create a new Python script named simple.py in the work directory:
$ gedit simple.py
The code is as follows:
#!/usr/bin/env python
"""
Minimal Example
===============
Generating a square wordcloud from the US constitution using default arguments.
"""
from os import path
from wordcloud import WordCloud

d = path.dirname(__file__)

# Read the whole text.
text = open(path.join(d, 'constitution.txt')).read()

# Generate a word cloud image
wordcloud = WordCloud().generate(text)

# Display the generated image:
# the matplotlib way:
import matplotlib.pyplot as plt
plt.imshow(wordcloud)
plt.axis("off")

# lower max_font_size
wordcloud = WordCloud(max_font_size=40).generate(text)
plt.figure()
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
As the code shows, the program looks for the text file constitution.txt in the directory where the script is located, so we need to put this file in the work folder before running the script. Download the text with the following command:
$ wget http://labfile.oss.aliyuncs.com/courses/756/constitution.txt
Open a console in the work folder and run the script:
$ python simple.py
If the extension package is installed properly, the program opens a window that displays the generated word cloud.
We now have an English word cloud.
5.2 Solving the Chinese display problem
We have installed the wordcloud extension package and run a sample file successfully. But the sample has several problems. First, it only shows English words: when presenting a report to Chinese colleagues or managers, an English word cloud is clearly inappropriate, and many texts are written in Chinese and cannot be turned into an English word cloud at all. Second, the outline of the word cloud is a square, which looks stiff and unattractive. We will solve these problems in turn, starting with Chinese display. Let's first see what the output looks like if we do nothing except feed a novel into simple.py. We only need to change the source file as follows; note that the file header declares that the file uses utf8 encoding:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Minimal Example
===============
Generating a square wordcloud from the US constitution using default arguments.
"""
from os import path
from wordcloud import WordCloud

d = path.dirname(__file__)

# Read the whole text.
#text = open(path.join(d, 'constitution.txt')).read()
text = open(u"santi.txt").read()

# Generate a word cloud image
wordcloud = WordCloud().generate(text)

# Display the generated image:
# the matplotlib way:
import matplotlib.pyplot as plt
plt.imshow(wordcloud)
plt.axis("off")

# lower max_font_size
wordcloud = WordCloud(max_font_size=40).generate(text)
plt.figure()
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
Run the program in the console and look at the result.
Excited, thinking you've succeeded? Take a closer look at the word cloud. For a Chinese novel, the most frequent words turn out to be 3K and CPU? This is not a computer magazine! So something must be wrong: at the very least, the word cloud of this novel should consist of Chinese characters.
There are two reasons for this bizarre result. The first is that the text file of the novel is not encoded in utf8 but in gbk, an encoding for Chinese characters. This is the statement we used to read the file:
text = open(u"santi.txt").read()
Read this way, everything except the English words comes out garbled. wordcloud cannot recognize the garbled bytes as words, so naturally it cannot count their frequency. You probably already know how to fix this: decode the text as gbk when reading it, like this:
text = open(u"santi.txt").read().decode('gbk')
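The statement above relies on Python 2, where read() returns a byte string that has a decode method. As a hedged alternative, the sketch below does the same decoding with io.open, which accepts an encoding argument on both Python 2 and Python 3. The helper name read_text is made up for this example.

```python
import io


def read_text(filename, encoding="gbk"):
    """Read a text file and return its contents as a Unicode string."""
    # io.open decodes the bytes while reading, so no separate decode step is needed.
    with io.open(filename, encoding=encoding) as handle:
        return handle.read()
```

With this helper, the read step becomes text = read_text(u"santi.txt"). If a file contains bytes that are not valid gbk, the call raises UnicodeDecodeError instead of silently producing garbled text.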
Now run the program again. We can boldly guess that the largest words will no longer be English. Indeed: this time the largest items in the output are rows of empty boxes.
What is going on? Looking closely, we find a few English words among the boxes, which means the Chinese characters have been decoded correctly: the boxes appear more often than the English words, so they must be the Chinese characters. But why are they displayed as boxes? This is the second reason: wordcloud has no font capable of displaying Chinese characters by default. Ubuntu was developed outside China, so its support for Chinese, especially the fonts, is less complete than for English, and wordcloud was likewise developed abroad, so neither of them handles Chinese display out of the box. However, since Python itself can display Chinese, wordcloud should be able to as well. Let's find a way. Looking at our code carefully, we see that the key statement that generates the word cloud is:
wordcloud = WordCloud(max_font_size=40).generate(text)
Let's take a closer look at the WordCloud class. Among the arguments it accepts there is a key parameter:
font_path : string
    Font path to the font that will be used (OTF or TTF). Defaults to DroidSansMono path on a Linux machine. If you are on another OS or don't have this font, you need to adjust this path.
So we can give it a font file to use instead of the default font. The problem becomes finding a font file on Linux that supports Chinese characters. I found the font DroidSansFallbackFull.ttf installed on an Ubuntu system. To avoid trouble on other Linux systems that do not have this font, we simply put the font file in the work folder next to the source file, so that it can always be found. Download the font file with the following command:
$ wget http://labfile.oss.aliyuncs.com/courses/756/DroidSansFallbackFull.ttf
Place the font file in the folder where the Python script is located, and then modify the source code so that it locates our own font file first:
import os
font = os.path.join(os.path.dirname(__file__), "DroidSansFallbackFull.ttf")
Then, in the two places where our program instantiates the WordCloud class, tell it to use our own font file:
wordcloud = WordCloud(font_path=font).generate(text)
wordcloud = WordCloud(font_path=font, max_font_size=40).generate(text)
Now run the program again to see the effect of the changes.
The Chinese word cloud we were aiming for is now displayed!
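If you run the code on another system, the font file may live somewhere else. The sketch below shows one way to handle that: try a list of candidate paths and use the first one that exists. The helper pick_font is hypothetical, and the two system paths are only assumed example locations that may not match your machine.

```python
import os


def pick_font(candidates):
    """Return the first font path in candidates that exists, or None."""
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


# Candidate list: the font downloaded into the working folder comes first.
# The system paths below are assumptions for illustration only.
FONT_CANDIDATES = [
    os.path.join(os.getcwd(), "DroidSansFallbackFull.ttf"),
    "/usr/share/fonts/truetype/droid/DroidSansFallbackFull.ttf",  # assumed Ubuntu path
    "C:/Windows/Fonts/simhei.ttf",  # assumed Windows path
]

font = pick_font(FONT_CANDIDATES)
print(font)
```

The resulting font value can be passed as font_path to WordCloud as before; if it is None, no candidate was found and the list needs to be adjusted.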
5.3 Customizing the word cloud
The word clouds we see on the Internet often come in all kinds of unusual shapes.
Compared with those, doesn't our own square word cloud look a little plain? Too embarrassed to show it off? It doesn't matter: we can make a word cloud with an irregular outline too. To customize the shape, we need a mask picture. The role of the mask is to define a region for the word cloud, so that words are drawn only inside that region, which produces the shaped effect. Our mask picture is the helmet of a Star Wars stormtrooper.
Download it for the experiment with the following command:
$ wget http://labfile.oss.aliyuncs.com/courses/756/stormtrooper_mask.png
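As a rough picture of what a mask does: according to the wordcloud documentation, pure white entries in the mask are treated as masked out, and all other entries are free to draw on. The toy sketch below applies that rule to a tiny grayscale grid written as nested lists. It only illustrates the idea; the real package works on a numpy array, and the function names here are made up.

```python
WHITE = 255


def drawable_cells(mask_rows):
    """Return a grid of booleans: True where words may be placed."""
    return [[pixel != WHITE for pixel in row] for row in mask_rows]


def drawable_fraction(mask_rows):
    """Return the share of the mask area that is free to draw on."""
    cells = drawable_cells(mask_rows)
    total = sum(len(row) for row in cells)
    free = sum(sum(row) for row in cells)  # each True counts as 1
    return free / float(total) if total else 0.0


# A 4x4 mask with a dark 2x2 square in the middle of a white background.
tiny_mask = [
    [255, 255, 255, 255],
    [255,   0,   0, 255],
    [255,   0,   0, 255],
    [255, 255, 255, 255],
]

print(drawable_fraction(tiny_mask))
```

This is why a mask picture needs a white background: only the non-white shape in the middle receives words.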
To use the mask, we modify our code to load the mask image, as follows:
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Using custom colors
====================
Using the recolor method and custom coloring functions.
"""
import numpy as np
from PIL import Image
from os import path
import matplotlib.pyplot as plt
import random
import os

from wordcloud import WordCloud, STOPWORDS

font = os.path.join(os.path.dirname(__file__), "DroidSansFallbackFull.ttf")


def grey_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    # random grey lightness; upper bound 100 as in the official wordcloud example
    return "hsl(0, 0%%, %d%%)" % random.randint(60, 100)

d = path.dirname(__file__)

mask = np.array(Image.open(path.join(d, "stormtrooper_mask.png")))

text = open(u"santi.txt").read().decode('gbk')

# preprocessing the text a little bit
text = text.replace(u"程心说", u"程心")  # "Cheng Xin said" -> "Cheng Xin"
text = text.replace(u"程心和", u"程心")  # "Cheng Xin and" -> "Cheng Xin"
text = text.replace(u"程心问", u"程心")  # "Cheng Xin asked" -> "Cheng Xin"

# adding movie script specific stopwords
stopwords = set(STOPWORDS)
stopwords.add("int")
stopwords.add("ext")

wc = WordCloud(font_path=font, max_words=2000, mask=mask, stopwords=stopwords,
               margin=10, random_state=1).generate(text)
# store default colored image
default_colors = wc.to_array()
plt.title("Custom colors")
plt.imshow(wc.recolor(color_func=grey_color_func, random_state=3))
wc.to_file("a_new_hope.png")
plt.axis("off")
plt.figure()
plt.title(u"Three-Body word frequency statistics")
plt.imshow(default_colors)
plt.axis("off")
plt.show()
This code also makes a few small adjustments to the text before the word cloud is generated. For example, occurrences of "Cheng Xin said" should count toward the frequency of "Cheng Xin" rather than being counted as a separate word, so the program does a simple replacement. The test environment may not support typing Chinese characters, so just run this code as it is there; later, on your own computer with a Chinese input method installed, you can modify the Chinese strings.
We used the stormtrooper helmet from Star Wars as the shape of the word cloud. Doesn't it look good?
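The replacement step in the script can be generalized a little. The sketch below keeps the variants in a list and applies them in order, so adding a new variant only means adding one pair. The function name and the English sample sentence are assumptions made for this illustration; in the experiment the strings would be the Chinese ones.

```python
def normalize(text, replacements):
    """Apply each (variant, canonical) replacement to the text in order."""
    for variant, canonical in replacements:
        text = text.replace(variant, canonical)
    return text


# Hypothetical English stand-ins for the Chinese variants used in the script.
REPLACEMENTS = [
    ("Cheng Xin said", "Cheng Xin"),
    ("Cheng Xin asked", "Cheng Xin"),
]

sample = "Cheng Xin said hello. Cheng Xin asked why."
print(normalize(sample, REPLACEMENTS))
```

After normalization both sentences contribute to the count of the same name, which is the effect the simple replace calls in the script aim for.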
However, our task is not finished. We said we want a custom word cloud, so even the mask picture should be chosen according to our own taste. To turn a favorite picture into a mask picture on Ubuntu, we need to install the gimp software:
$ sudo apt-get install gimp
Then I searched Baidu for a nice picture to use.
Use the following command to get this picture in jpg format:
$ wget http://labfile.oss.aliyuncs.com/courses/756/04.jpg
What are the requirements for the picture? Ideally the background is a solid color, which makes it easy to process; of course, with help from a Photoshop expert, any background will do. We use gimp to process the image: once everything except the girl is turned white, the picture can be used as a mask image. Let's get started.
First, in the Layers panel, right-click the image and add an alpha channel.
Then, using the magic wand tool in the toolbox on the left, click on the girl to select her.
Then right-click in the blank area, choose Edit, and fill the background with white.
The processed picture is now ready.
Note the export settings when exporting to png format.
This gives us the mask picture, which I renamed 04.png. We modify the code to use this image as the mask picture:
mask = np.array(Image.open(path.join(d, "04.png")))
Use the following command to get the picture 04.png:
$ wget http://labfile.oss.aliyuncs.com/courses/756/04.png
Run the program and look at the result.
We now have a custom word cloud.
V. Experiment Summary
We used the wordcloud extension package to build a custom word cloud, solved the problem that wordcloud does not display Chinese by default, and used the gimp software to prepare custom images for the mask function. Finally, we used the custom word cloud to analyze the word frequencies of the novel "The Three-Body Problem" and produced its word cloud.
Customizing word clouds with Python