Making a word cloud for a novel with Python

Source: Internet
Author: User

  In my spare time I like to read novels, and it occurred to me to build a word cloud for a novel to show what it is mainly about. The language is Python, and the main libraries are wordcloud, jieba, and scipy. The code is simple: segment the text with jieba.cut(), join the tokens into a space-delimited string, create a WordCloud object from it, and save the result as an image.

    # coding: utf-8
    import sys
    import jieba
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud, ImageColorGenerator
    # scipy.misc.imread was removed in SciPy 1.2; on newer versions load the
    # mask another way, e.g. numpy.array(PIL.Image.open(imgmask)).
    from scipy.misc import imread
    from datetime import datetime

    novel = sys.argv[1]    # e.g. 'assz.txt'
    imgmask = sys.argv[2]  # e.g. 'assz.jpg'

    # Output name: word_<novel>_<month><day><hour><minute><second>.jpg
    t = datetime.now()
    resimg = ("word_" + novel.split('.')[0] + "_" + str(t.month) + str(t.day)
              + str(t.hour) + str(t.minute) + str(t.second) + ".jpg")

    # Segment the novel and join the tokens with spaces.
    with open(novel) as f:
        novletext = f.read()
    hmseg = jieba.cut(novletext)
    seg_space = ' '.join(hmseg)

    # The mask image gives the cloud both its shape and its colors.
    alice_color = imread(imgmask)

    # wordcloud does not support Chinese by default: font_path must point to a
    # Chinese font, otherwise the resulting word cloud is all garbled.
    fwc = WordCloud(font_path='msyh.ttc', max_words=700, background_color='white',
                    mask=alice_color, max_font_size=100, font_step=1).generate(seg_space)
    imagecolor = ImageColorGenerator(alice_color)
    plt.imshow(fwc.recolor(color_func=imagecolor))
    plt.axis("off")
    plt.show()
    fwc.to_file(resimg)

The results are as follows

  The result is not ideal, for two reasons. First, character names get split: "Lucien" (路西恩) is cut into 路西 + 恩, or into 路 + 西恩. Second, common words such as "such", "so", and "they" occur so often that they crowd out everything else, so the cloud says little about the content of the novel.
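The first problem can be reproduced with a toy segmenter. The sketch below is a simple forward maximum-matching segmenter, which is not the algorithm jieba uses; the function name `segment`, the tiny vocabulary, and the sample string are all assumptions made for illustration. It only shows why a name missing from the dictionary gets cut into pieces, and why registering the name keeps it whole.

```python
def segment(text, vocab, max_len=4):
    """Toy forward maximum matching: at each position take the longest
    dictionary word, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical dictionary that knows a fragment of the name but not the name.
vocab = {"路西", "他们"}
text = "路西恩他们"

before = segment(text, vocab)   # the name is split into two fragments

vocab.add("路西恩")              # register the full name
after = segment(text, vocab)    # the name now survives as one token

print(before)
print(after)
```

With a real segmenter the mechanics differ, but the remedy is the same in spirit: make the name known to the dictionary before cutting the text.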

To deal with the common words, build a filter table before generating the word cloud, so that words like "such", "like", and "they" are left out of the display. I chose five books, "Broken Sky", "back to the past to become a cat", "Arcane", "the Catalogue of the Battle", and "The First World", computed and sorted the word frequencies of each, and took the 1500 most frequent words per book. Each book contributes a given word at most once, so a word that appears more than twice among these 7,500 words is in the top list of at least three books. Such a word is treated as a high-frequency common word and written to the filter table.

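    # coding: utf-8
    import jieba

    def ff(dd):
        # Sort key: the count in a (word, count) pair.
        return dd[1]

    def array2dic(arr):
        # Count occurrences, skipping single-character tokens.
        segdict = {}
        for seg in arr:
            if len(seg) < 2:
                continue
            if seg in segdict:
                segdict[seg] += 1
            else:
                segdict[seg] = 1
        return segdict

    # The first two file names appear machine-translated in the source;
    # substitute the real paths of the novels.
    novels = ['bucket breaking sky.txt', 'go back to the past and become a cat.txt',
              'assz.txt', 'mytl.txt', 'yszz.txt']
    freq = []

    for novel in novels:
        maotext = open(novel).read()
        seglist = jieba.cut(maotext)
        segdict = array2dic(seglist)

        # Keep the 1500 most frequent words of this book.
        c = 1
        segsort = sorted(segdict.items(), key=ff, reverse=True)
        for item in segsort:
            # print(item[0] + " " + str(item[1]))
            freq.append(item[0])
            if c == 1500:
                break
            c += 1

    # freq holds each book's top words once, so these counts are the number
    # of books whose top list contains the word.
    freqdict = array2dic(freq)
    freqsort = sorted(freqdict.items(), key=ff, reverse=True)

    # Note: the text describes the threshold as "more than two"; this version
    # of the script uses > 3 (and writes to filter3.txt).
    k = 1
    f = open('filter3.txt', 'w+')
    for item in freqsort:
        if item[1] > 3:
            f.write(item[0] + "  ")
        if k % 5 == 0:
            f.write("\n")
        k += 1
    f.close()
    print('OK')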
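The filter-table idea can also be written compactly with `collections.Counter`. This is a minimal sketch on made-up, already-segmented token lists, not the author's script: the helper names `top_words` and `build_filter`, the parameters `n` and `min_books`, and the toy data are assumptions for illustration.

```python
from collections import Counter

def top_words(tokens, n):
    """Set of the n most frequent tokens, ignoring single-character tokens."""
    counts = Counter(tok for tok in tokens if len(tok) >= 2)
    return {word for word, _ in counts.most_common(n)}

def build_filter(books, n, min_books):
    """Words found in the top-n list of more than min_books books."""
    book_counts = Counter()
    for tokens in books:
        # Each book contributes a word at most once.
        book_counts.update(top_words(tokens, n))
    return {word for word, k in book_counts.items() if k > min_books}

# Hypothetical pre-segmented "books" (English stand-ins for the real tokens).
books = [
    ["they", "so", "they", "Lucien", "so", "magic"],
    ["they", "so", "cat", "they", "so", "they"],
    ["so", "they", "sword", "so", "they", "so"],
]

common = build_filter(books, n=3, min_books=2)
print(sorted(common))
```

Words that rank high in only one book, such as a character name, never reach the threshold, so they stay out of the filter table and remain visible in the cloud.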

In addition, new words are registered with jieba before segmentation so that names are segmented correctly. The modified code is as follows:

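    # coding: utf-8
    import sys
    import jieba
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud, ImageColorGenerator
    # scipy.misc.imread was removed in SciPy 1.2; on newer versions load the
    # mask another way, e.g. numpy.array(PIL.Image.open(imgmask)).
    from scipy.misc import imread
    from datetime import datetime

    # Register words that jieba would otherwise split.
    jieba.add_word('路西恩')        # "Lucien"
    jieba.add_word('So horrible')  # machine-translated in the source; use the original Chinese word

    def customfilter(segs):
        # Load the filter table as a set of whole words. A membership test
        # against the raw file text would be a substring test and would also
        # drop tokens that merely occur inside a filter word.
        # The frequency script above writes filter3.txt; point this at the
        # filter table you actually generated.
        with open('filter.txt') as f:
            filter_words = set(f.read().split())
        resseg = ""
        for seg in segs:
            if seg not in filter_words:
                resseg += ' ' + seg
        return resseg

    novel = sys.argv[1]    # e.g. 'assz.txt'
    imgmask = sys.argv[2]  # e.g. 'assz.jpg'
    t = datetime.now()
    resimg = ("word_" + novel.split('.')[0] + "_" + str(t.month) + str(t.day)
              + str(t.hour) + str(t.minute) + str(t.second) + ".jpg")

    with open(novel) as f:
        novletext = f.read()
    hmseg = jieba.cut(novletext)

    # Drop the common words and join the remaining tokens with spaces.
    seg_space = customfilter(hmseg)

    alice_color = imread(imgmask)

    fwc = WordCloud(font_path='msyh.ttc', max_words=700, background_color='white',
                    mask=alice_color, max_font_size=100, font_step=1).generate(seg_space)
    imagecolor = ImageColorGenerator(alice_color)
    plt.imshow(fwc.recolor(color_func=imagecolor))
    plt.axis("off")
    plt.show()
    fwc.to_file(resimg)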
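One detail of the filter step is worth illustrating: testing a token against the raw text of the filter file is a substring test, while testing it against a set of words matches whole words only. The sketch below uses a made-up filter string and English stand-in tokens; both are assumptions for illustration.

```python
# Hypothetical contents of a filter file: words separated by whitespace.
filter_text = "they  such  like\n"

tokens = ["they", "he", "Lucien", "such", "magic", "like"]

# Substring test against the raw text: "he" is dropped too,
# because it occurs inside "they".
kept_substring = [tok for tok in tokens if tok not in filter_text]

# Whole-word test against a set built from the same text.
filter_words = set(filter_text.split())
kept_set = [tok for tok in tokens if tok not in filter_words]

print(kept_substring)
print(kept_set)
```

The set version also makes each lookup independent of the size of the filter table, which matters when every token of a long novel is checked.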

The result is a little better than before.

  The word clouds reveal some interesting patterns. For example, in a novel with a female lead, her name usually ranks second only to the protagonist's: Lucien and Natasha, Shengen and Vivian. In 全职高手, however, it is Chen Guo whose word frequency gets the female-lead treatment, while the designated female lead, Su Mucheng, can only be found by looking closely.
