Three segmentation modes are supported:
Precise mode: tries to cut the sentence into the most accurate segmentation; suitable for text analysis.
Full mode: scans out every fragment of the sentence that could form a word; very fast, but it cannot resolve ambiguity.
Search-engine mode: based on precise mode, long words are segmented again to improve recall; suitable for search-engine indexing.
Keywords: HMM (hidden Markov model)
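The HMM keyword refers to jieba's use of a hidden Markov model (decoded with the Viterbi algorithm) to discover words that are not in its dictionary; this can be toggled through the HMM parameter of jieba.cut. A minimal sketch, reusing the out-of-vocabulary example from the jieba README ("杭研" is not a dictionary entry):

# -*- coding: utf-8 -*-
import jieba

text = "他来到了网易杭研大厦"  # "He came to the NetEase Hangyan Building"

print(" | ".join(jieba.cut(text, HMM=False)))  # without the HMM, unseen words fall apart into single characters
print(" | ".join(jieba.cut(text, HMM=True)))   # with the HMM (the default), "杭研" can be recognized as one word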
The three segmentation modes in code:
# -*- coding: utf-8 -*-
import jieba
# jieba.initialize()  # optional: load the dictionary up front

seg_list = jieba.cut("中华人民共和国万岁!", cut_all=False)  # precise mode (default); "Long live the People's Republic of China!"
print(" | ".join(seg_list))

seg_list = jieba.cut("中华人民共和国万岁!", cut_all=True)   # full mode
print(" | ".join(seg_list))

seg_list = jieba.cut_for_search("中华人民共和国万岁!")      # search-engine mode
print(" | ".join(seg_list))
Results:
中华人民共和国 | 万岁 | !
中华 | 中华人民 | 中华人民共和国 | 华人 | 人民 | 人民共和国 | 共和 | 共和国 | 万岁 | !
中华 | 华人 | 人民 | 共和 | 共和国 | 中华人民共和国 | 万岁 | !
The results can also be returned directly as a list (lcut / lcut_for_search); cut itself returns a generator:
seg_list = jieba.cut("中华人民共和国万岁!")   # default precise mode
print(seg_list)                               # this returns a generator

seg_list = jieba.lcut("中华人民共和国万岁!")  # same segmentation, returned as a list
print(seg_list)

seg_list = jieba.lcut_for_search("中华人民共和国万岁!")  # search-engine mode as a list
print(seg_list)
Results:
<generator object Tokenizer.cut at 0x0000000003972150>
['中华人民共和国', '万岁', '!']
['中华', '华人', '人民', '共和', '共和国', '中华人民共和国', '万岁', '!']
"Custom word breaker dictionary (For additional additions)"
The default tokenizer is jieba.dt. You can load a custom dictionary to add words that are not in the built-in lexicon; the file must be encoded in UTF-8. Its format is the same as dict.txt: one word per line, where each line has three parts separated by spaces: the word, its frequency (may be omitted), and its part of speech (may be omitted).
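For illustration, a user dictionary file might look like the sketch below (the entries are borrowed from the jieba README's userdict example; frequency and part of speech may each be omitted):

云计算 5
李小福 2 nr
创新办 3 i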
In my test file dict.txt I added only the single word 和国 ("and country"):
jieba.load_userdict("c:/users/huangzecheng/desktop/dict.txt")
seg_list = jieba.cut("中华人民共和国万岁!", cut_all=True)  # full mode again
print(" | ".join(seg_list))
Results:
中华 | 中华人民 | 中华人民共和国 | 华人 | 人民 | 人民共和国 | 共和 | 共和国 | 和国 | 万岁 | !
"Add Delete word breaker"
A loaded custom dictionary is merged into the default tokenizer. You can also add a word, delete a word, or tune a word's frequency (so that it can, or can no longer, be split out) with these functions:
add_word(word, freq=None, tag=None)
del_word(word)
suggest_freq(segment, tune=True)
jieba.add_word('中华人民共和')
print(" | ".join(jieba.cut("中华人民共和国万岁!", cut_all=True)))

Result: 中华 | 中华人民 | 中华人民共和 | 中华人民共和国 | 华人 | 人民 | 人民共和国 | 共和 | 共和国 | 和国 | 万岁 | !

jieba.del_word('共和国')
print(" | ".join(jieba.cut("中华人民共和国万岁!", cut_all=True)))

Result: 中华 | 中华人民 | 中华人民共和 | 中华人民共和国 | 华人 | 人民 | 人民共和国 | 共和 | 和国 | 万岁 | !

jieba.add_word('共和国')
jieba.suggest_freq('国万岁', tune=True)
print(" | ".join(jieba.cut("中华人民共和国万岁!", cut_all=True)))

Result: 中华 | 中华人民 | 中华人民共和 | 中华人民共和国 | 华人 | 人民 | 人民共和国 | 共和 | 共和国 | 和国 | 国万岁 | 万岁 | !
"Most used participle"
import jieba.analyse
jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
sentence: the text to extract keywords from.
topK: return the topK keywords with the largest TF/IDF weights; the default is 20.
withWeight: whether to return each keyword's weight along with it; the default is False (see the sketch after this list).
allowPOS: include only words with the specified parts of speech; the default is empty, i.e. no filtering.
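As a quick illustration of withWeight=True, which returns (keyword, weight) pairs instead of bare strings, a minimal sketch with an arbitrary sample sentence:

import jieba.analyse

text = "自然语言处理是人工智能的一个重要方向"  # arbitrary sample text
for tag, weight in jieba.analyse.extract_tags(text, topK=3, withWeight=True):
    print("%s %.4f" % (tag, weight))  # each result is a (keyword, TF-IDF weight) pair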
Test: take the 5 most heavily weighted keywords from a piece of text:
str ="TopK to return several TF/IDF weights the most important keyword, the default value is:"Str= str +"withweight to return the keyword weight value, the default value is False"Str= str +"Allowpos only includes words with the specified part of speech, the default value is empty, i.e. not filtered"Str= str +"Jieba.analyse.TFIDF (idf_path=none) New TFIDF instance, Idf_path as IDF frequency file"Tags= Jieba.analyse.extract_tags (str, topk=5)Print(" | ". Join (tags) result: Default value| TFIDF | IDF | IDF | Path
As noted above, each dictionary entry has three parts: the word, its frequency (optional), and its part of speech (optional). The TextRank extractor can use the part of speech to filter the keywords it returns:
jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))
tags = jieba.analyse.textrank(str, topK=5, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))
print(" | ".join(tags))

Result: 关键词 | 词性 | 频率 | 新建
"Take word and part of speech"
import jieba.posseg

words = jieba.posseg.cut("中华人民共和国万岁!")
for word, flag in words:
    print('%s %s' % (word, flag))

Result:
中华人民共和国 ns
万岁 m
! x
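Because every token carries a POS flag, filtering by word class is a one-liner; a minimal sketch that keeps only nouns (flags beginning with 'n', such as the 'ns' place name above):

import jieba.posseg

text = "中华人民共和国万岁!"
nouns = [word for word, flag in jieba.posseg.cut(text) if flag.startswith('n')]
print(nouns)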
"Take word and start and end position"
words = jieba.tokenize("中华人民共和国万岁!")
for w in words:
    print("word %s\t\t start: %d \t\t end: %d" % (w[0], w[1], w[2]))

Result:
word 中华人民共和国   start: 0   end: 7
word 万岁             start: 7   end: 9
word !                start: 9   end: 10
words = jieba.tokenize("中华人民共和国万岁!", mode='search')  # search mode
for w in words:
    print("word %s\t\t start: %d \t\t end: %d" % (w[0], w[1], w[2]))

Result:
word 中华             start: 0   end: 2
word 华人             start: 1   end: 3
word 人民             start: 2   end: 4
word 共和             start: 4   end: 6
word 共和国           start: 4   end: 7
word 中华人民共和国   start: 0   end: 7
word 万岁             start: 7   end: 9
word !                start: 9   end: 10
A simple end-to-end example: read a text column from a SQL Server database, extract keywords with jieba, and draw a word cloud with a custom shape.
# -*- coding: utf-8 -*-
import pymssql
import jieba
import jieba.analyse
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from scipy.misc import imread  # deprecated in newer SciPy; imageio.imread is the modern replacement

host = "localhost"
user = "KK"
passwd = "KK"
dbname = "HZC"

conn = None
try:
    conn = pymssql.connect(host=host, user=user, password=passwd, database=dbname)
    cur = conn.cursor()
    cur.execute("select col from jieba;")
    rows = cur.fetchall()

    tagsall = u""
    # tagsall = open('filepath.txt', 'r').read()  # alternatively, read the text from a file
    for row in rows:
        tags = jieba.analyse.extract_tags(row[0], topK=20)
        tagsjoin = u" ".join(tags)
        tagsall = tagsall + " " + tagsjoin
        # print(tagsjoin)

    # font download: http://labfile.oss.aliyuncs.com/courses/756/DroidSansFallbackFull.ttf
    wc_cfg = WordCloud(
        font_path="D:/python35/tools/whl/DroidSansFallbackFull.ttf",  # font (required for Chinese text)
        mask=imread("D:/python35/tools/whl/bg.png"),  # background template (black and white only)
        background_color="white",  # background color
        max_words=1000,            # maximum number of words
        mode="RGBA",               # RGBA allows a transparent background (with background_color=None); default is RGB
        width=500,                 # width
        height=400,                # height
        max_font_size=100          # maximum font size
    )
    wc = wc_cfg.generate(tagsall)
    plt.imshow(wc)
    plt.axis("off")
    plt.show()
finally:
    if conn:
        conn.close()
Python word segmentation and word cloud plotting