In my spare time I like reading novels, so I thought of building a word cloud from a novel to show its main content. The language is Python, and the main libraries are wordcloud, jieba and scipy. The code is simple: call jieba.cut() to segment the text into words, join the result into a space-delimited string, create a WordCloud object from it, and save the result as an image.
# coding: utf-8
import sys
import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
# Note: scipy.misc.imread was removed in later SciPy releases;
# an older SciPy (or another image reader) is needed for this import.
from scipy.misc import imread
from datetime import datetime

novel = sys.argv[1]    # e.g. 'assz.txt'
imgmask = sys.argv[2]  # e.g. 'assz.jpg'
t = datetime.now()
resimg = "word_" + novel.split('.')[0] + "_" + str(t.month) + str(t.day) + str(t.hour) + str(t.minute) + str(t.second) + ".jpg"

# Segment the novel into words
novletext = open(novel).read()
hmseg = jieba.cut(novletext)

# Join the segments into one space-delimited string
seg_space = ' '.join(hmseg)

# The mask image gives the cloud its shape and colors
alice_color = imread(imgmask)
# WordCloud does not support Chinese by default; font_path must point to a Chinese font, otherwise the resulting word cloud is all garbled
fwc = WordCloud(font_path='msyh.ttc', max_words=700, background_color='white', mask=alice_color, max_font_size=100, font_step=1).generate(seg_space)
imagecolor = ImageColorGenerator(alice_color)
plt.imshow(fwc.recolor(color_func=imagecolor))
plt.axis("off")
plt.show()
fwc.to_file(resimg)
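One detail of the script worth a second look is the output file name: it concatenates month, day, hour, minute and second without padding, so different timestamps can produce the same digits (January 12 and November 2 both start with "112"). A minimal sketch of a zero-padded variant follows; the helper name `result_name` is hypothetical and not part of the original script.

```python
import os
from datetime import datetime

def result_name(novel_path, now=None):
    """Build 'word_<basename>_<MMDDHHMMSS>.jpg' (hypothetical helper)."""
    now = now or datetime.now()
    # splitext strips only the final extension of the file name
    base = os.path.splitext(os.path.basename(novel_path))[0]
    # strftime zero-pads every field, so the stamp always has ten digits
    return "word_{}_{}.jpg".format(base, now.strftime("%m%d%H%M%S"))

print(result_name("assz.txt", datetime(2017, 1, 2, 3, 4, 5)))
```

Unlike `novel.split('.')[0]`, `os.path.splitext` keeps any earlier dots in the file name, which may or may not be what is wanted.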
The results are as follows
The result is not ideal, for two reasons. First, character names get split apart: "Lucien", for example, is cut into fragments such as "Road West" and "en", or "Road" and "Sean". Second, common words such as "such", "so" and "they" occur so often that they crowd out everything else, and the reader cannot tell what the novel is about.
Therefore, before generating the word cloud, we build a filter table so that common words such as "such", "like" and "they" are removed and do not take part in the word cloud. I chose five books: "Broken Sky", "Back to the Past to Become a Cat", "Arcane", "The Catalogue of the Battle" and "The First World". For each book I count word frequencies, sort them, and take the 1,500 most frequent words. If a word appears more than twice among these 7,500 words (that is, it is in the top list of more than two books, since each book contributes a word at most once), it is considered a high-frequency common word and written to the filter table.
# coding: utf-8
import os
import jieba

def ff(dd):
    # Sort key: the count in a (word, count) pair
    return dd[1]

def array2dic(arr):
    # Count occurrences, skipping segments shorter than two characters
    segdict = {}
    for seg in arr:
        if len(seg) < 2:
            continue
        if seg in segdict:
            segdict[seg] += 1
        else:
            segdict[seg] = 1
    return segdict

novels = ['bucket breaking sky.txt', 'go back to the past and become a cat .', 'assz.txt', 'mytl.txt', 'yszz.txt']
freq = []
for novel in novels:
    maotext = open(novel).read()
    seglist = jieba.cut(maotext)
    segdict = array2dic(seglist)

    # Collect the 1500 most frequent words of this book
    c = 1
    segsort = sorted(segdict.items(), key=ff, reverse=True)
    for item in segsort:
        # print(item[0] + " " + str(item[1]))
        freq.append(item[0])
        if c == 1500:
            break
        c += 1

# Count in how many books each word made the top 1500
freqdict = array2dic(freq)
freqsort = sorted(freqdict.items(), key=ff, reverse=True)
k = 1
f = open('filter3.txt', 'w+')
for item in freqsort:
    # Keep words that are in the top list of more than 3 books
    if item[1] > 3:
        f.write(item[0] + " ")
        # Five words per line
        if k % 5 == 0:
            f.write("\n")
        k += 1
f.close()
print('ok')
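The same idea can be sketched more compactly with `collections.Counter`. This is only an illustration under stated assumptions: the function names `top_words` and `common_words` are hypothetical, and the toy token lists stand in for the output of jieba.cut on real books. Because the text describes a threshold of more than two books while the script uses `> 3`, the threshold is left as a parameter here.

```python
from collections import Counter

def top_words(tokens, n):
    """Return the n most frequent tokens that are at least two characters long."""
    counts = Counter(tok for tok in tokens if len(tok) >= 2)
    return [word for word, _ in counts.most_common(n)]

def common_words(books, n, min_books):
    """Return words found in the top-n list of more than min_books books."""
    seen = Counter()
    for tokens in books:
        # Each book contributes a word at most once
        seen.update(set(top_words(tokens, n)))
    return sorted(word for word, k in seen.items() if k > min_books)

# Toy token lists standing in for three segmented books (assumed data).
books = [
    ["they", "they", "they", "such", "such", "magic"],
    ["they", "they", "they", "such", "such", "cat"],
    ["they", "they", "they", "arcane", "arcane", "such"],
]
print(common_words(books, n=2, min_books=2))
```

With the real data the call would look like `common_words(books, n=1500, min_books=2)` or `min_books=3`, depending on which threshold is wanted.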
In addition, new words are added to jieba's dictionary before segmentation so that names are segmented correctly. The modified code is as follows:
# coding: utf-8
import sys
import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
# Note: scipy.misc.imread was removed in later SciPy releases;
# an older SciPy (or another image reader) is needed for this import.
from scipy.misc import imread
from datetime import datetime

# Register new words so that jieba keeps them in one piece
jieba.add_word('Lucien')
jieba.add_word('So horrible .')

def customfilter(segs):
    # Drop every segment that occurs in the filter table
    filter = open('filter.txt').read()
    resseg = ""
    for seg in segs:
        if seg not in filter:
            resseg += ' ' + seg
    return resseg

novel = sys.argv[1]    # e.g. 'assz.txt'
imgmask = sys.argv[2]  # e.g. 'assz.jpg'
t = datetime.now()
resimg = "word_" + novel.split('.')[0] + "_" + str(t.month) + str(t.day) + str(t.hour) + str(t.minute) + str(t.second) + ".jpg"

novletext = open(novel).read()
hmseg = jieba.cut(novletext)

# Filter out common words and join the rest with spaces
seg_space = customfilter(hmseg)

alice_color = imread(imgmask)

fwc = WordCloud(font_path='msyh.ttc', max_words=700, background_color='white', mask=alice_color, max_font_size=100, font_step=1).generate(seg_space)
imagecolor = ImageColorGenerator(alice_color)
plt.imshow(fwc.recolor(color_func=imagecolor))
plt.axis("off")
plt.show()
fwc.to_file(resimg)
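One thing to be aware of in `customfilter`: the filter table is read as a single string, so `seg not in filter` is a substring test, and a segment is also dropped when it merely forms part of a longer filter word. Whether that is intended is not stated. If exact matching is wanted, a minimal sketch could look like this; `load_filter` and `custom_filter` are hypothetical names, and the filter text is made up for the example.

```python
def load_filter(text):
    """Split the filter table's text on whitespace into a set of words."""
    return set(text.split())

def custom_filter(segs, filter_words):
    """Keep non-blank segments that are not exactly a filter word; join with spaces."""
    return " ".join(
        seg for seg in segs
        if seg.strip() and seg not in filter_words
    )

# Assumed filter content, in the same layout the filter script writes.
filter_words = load_filter("they such like\nsomething therefore")
print(custom_filter(["they", "the", "some", "Lucien", " "], filter_words))
```

Here "the" and "some" survive because they are not filter words themselves, whereas a substring test against the raw text would have removed them. Set lookups are also faster than scanning the whole string for every segment.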
The result is a little better than before.
The word clouds reveal some interesting patterns. For example, in a novel with a female lead, her name is usually second in frequency only to the protagonist's, as with Lucien and Natasha, or Shengen and Vivian. In "Full-time", however, it is Chen Guo who gets the female lead's treatment in word frequency, while the author's designated female lead, Su Mucheng, can only be found by looking carefully.
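An observation like this can also be checked directly on the segmented text instead of by eye. The sketch below ranks chosen names by frequency; the function name `name_ranks` is hypothetical and the token list is toy data, not counts from any of the novels.

```python
from collections import Counter

def name_ranks(tokens, names):
    """Map each name that occurs in tokens to its 1-based frequency rank."""
    counts = Counter(tokens)
    ranked = [word for word, _ in counts.most_common()]
    return {name: ranked.index(name) + 1 for name in names if name in counts}

# Toy tokens standing in for a filtered, segmented novel (assumed data).
tokens = ["Lucien"] * 5 + ["Natasha"] * 3 + ["magic"] * 2 + ["church"]
print(name_ranks(tokens, ["Lucien", "Natasha", "Vivian"]))
```

Names that never occur are simply left out of the result, so a missing key means the name was filtered out or split during segmentation.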
Making a word cloud for a novel with Python