Installing Jieba
pip3 install jieba
jieba supports three segmentation modes:
Precise mode: tries to cut the sentence into the most accurate segments; suitable for text analysis
Full mode: scans out every possible word in the sentence; very fast, but cannot resolve ambiguity
Search engine mode: builds on precise mode by splitting long words again to improve recall; suitable for search engine tokenization
The jieba.cut method takes three parameters: the first is the string to segment, the second, cut_all, controls whether full mode is used, and the third, HMM, controls whether the HMM model is used
# -*- coding: utf-8 -*-
__author__ = "MuT6 sch01ar"

import jieba

# In English: "Fujian's geography is defined by mountains and sea; 90% of its land is
# mountains and hills, hence the saying 'eight parts mountain, one part water, one part farmland'."
text = "福建的地理特点是依山傍海，九成陆地面积为山地丘陵地带，被称为八山一水一分田"

word_list = jieba.cut(text)
print("Default mode: " + "/".join(word_list))  # cut_all=False is the default option

word_list_1 = jieba.cut(text, cut_all=True)
print("Full mode: " + "/".join(word_list_1))  # cut_all=True
The jieba.cut_for_search method takes two parameters: the first is the string to segment, and the second, HMM, controls whether the HMM model is used
This method is suited to segmenting text for a search engine's inverted index
# -*- coding: utf-8 -*-
__author__ = "MuT6 sch01ar"

import jieba

# In English: "Fujian's geography is defined by mountains and sea; 90% of its land is
# mountains and hills, hence the saying 'eight parts mountain, one part water, one part farmland'."
text = "福建的地理特点是依山傍海，九成陆地面积为山地丘陵地带，被称为八山一水一分田"

word_list_2 = jieba.cut_for_search(text)
print("Search mode: " + "/".join(word_list_2))
jieba.cut and jieba.cut_for_search both return an iterable generator
jieba.lcut and jieba.lcut_for_search return a list
Keyword extraction
Based on TF-IDF
The jieba.analyse.extract_tags method takes four parameters: the first is the string to analyze; the second, topK, is the number of keywords with the highest TF-IDF weights to return (default 20); the third, withWeight, controls whether each keyword's weight is returned along with it (default False); the fourth, allowPOS, restricts the results to words with the specified parts of speech (default empty, i.e. no filtering)
# -*- coding: utf-8 -*-
__author__ = "MuT6 sch01ar"

import jieba.analyse

# In English: "Fujian's geography is defined by mountains and sea; 90% of its land is
# mountains and hills, hence the saying 'eight parts mountain, one part water, one part farmland'."
string = "福建的地理特点是依山傍海，九成陆地面积为山地丘陵地带，被称为八山一水一分田"

# allowPOS=() means no part-of-speech filtering
a = jieba.analyse.extract_tags(string, topK=20, withWeight=False, allowPOS=())
print(a)
Based on TextRank
jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))
# -*- coding: utf-8 -*-
__author__ = "MuT6 sch01ar"

import jieba.analyse

# In English: "Fujian's geography is defined by mountains and sea; 90% of its land is
# mountains and hills, hence the saying 'eight parts mountain, one part water, one part farmland'."
string = "福建的地理特点是依山傍海，九成陆地面积为山地丘陵地带，被称为八山一水一分田"

a = jieba.analyse.textrank(string, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))
print(a)
Part-of-speech tagging
# -*- coding: utf-8 -*-
__author__ = "MuT6 sch01ar"

import jieba.posseg

# In English: "Fujian's geography is defined by mountains and sea; 90% of its land is
# mountains and hills, hence the saying 'eight parts mountain, one part water, one part farmland'."
strings = jieba.posseg.cut("福建的地理特点是依山傍海，九成陆地面积为山地丘陵地带，被称为八山一水一分田")

# Each item unpacks into the word and its part-of-speech tag
for word, flag in strings:
    print("%s %s" % (word, flag))
List of parts of speech
n  noun
nr  personal name
nr1  Chinese surname
nr2  Chinese given name
nrj  Japanese personal name
nrf  transliterated personal name
ns  place name
nsf  transliterated place name
nt  organization or group name
nz  other proper noun
nl  nominal idiom
ng  nominal morpheme
t  time word
tg  time-word morpheme
s  locative word (home, outside, inside, West ...)
f  direction or position word
v  verb
vd  adverbial verb
vn  nominal verb
vshi  the verb 是 ("to be")
vyou  the verb 有 ("to have")
vf  directional verb
vx  formal verb
vi  intransitive verb
vl  verbal idiom
vg  verbal morpheme
a  adjective
ad  adverbial adjective
an  nominal adjective
ag  adjectival morpheme
al  adjectival idiom
b  distinguishing word (main, whole, all ...)
bl  distinguishing-word idiom
z  status word
r  pronoun
rr  personal pronoun
rz  demonstrative pronoun
rzt  temporal demonstrative pronoun
rzs  locative demonstrative pronoun
rzv  predicative demonstrative pronoun
ry  interrogative pronoun
ryt  temporal interrogative pronoun
rys  locative interrogative pronoun
ryv  predicative interrogative pronoun
rg  pronominal morpheme
m  numeral
mq  numeral-classifier compound
q  classifier (measure word)
qv  verbal classifier
qt  temporal classifier
d  adverb
p  preposition
pba  the preposition 把
pbei  the preposition 被
c  conjunction
cc  coordinating conjunction
u  particle
uzhe  着
ule  了, 喽
uguo  过
ude1  的, 底
ude2  地
ude3  得
usuo  所
udeng  等, 等等, 云云 ("and so on")
uyy  一样, 一般, 似的, 般 ("the same as", "like")
udh  的话 ("if")
uls  来讲, 来说, 而言, 说来 ("speaking of")
uzhi  之
ulian  连 ("even", as in "even elementary school students ...")
e  interjection
y  modal particle (delete yg)
o  onomatopoeia
h  prefix
k  suffix
x  string
xx  non-morpheme character
xu  URL
w  punctuation
wkz  opening bracket; full-width: （ 〔 ［ ｛ 《 【 〖 〈; half-width: ( [ { <
wky  closing bracket; full-width: ） 〕 ］ ｝ 》 】 〗 〉; half-width: ) ] } >
wyz  opening quotation mark; full-width: “ ‘ 『
wyy  closing quotation mark; full-width: ” ’ 』
wj  full stop; full-width: 。
ww  question mark; full-width: ？; half-width: ?
wt  exclamation mark; full-width: ！; half-width: !
wd  comma; full-width: ，; half-width: ,
wf  semicolon; full-width: ；; half-width: ;
wn  enumeration comma; full-width: 、
wm  colon; full-width: ：; half-width: :
ws  ellipsis; full-width: …… …
wp  dash; full-width: —— －－ ——－; half-width: --- ----
wb  percent and per-mille signs; full-width: ％ ‰; half-width: %
wh  unit symbol; full-width: ￥ ＄ ￡ ° ℃; half-width: $
Loading dictionaries
jieba.load_userdict(file_name): file_name is a file-like object or the path to a custom dictionary
If file_name is a path or a file opened in binary mode, the file must be UTF-8 encoded
Each word occupies one line, and each line has three parts: the word, its frequency (optional), and its part of speech (optional), separated by spaces; the order cannot be changed
Example:
创新办 3 i
云计算 5
凱特琳 nz
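The line format described above can be illustrated with a small parser. This is only a sketch of the format, not jieba's own loading code: parse_userdict_line and DictEntry are hypothetical names, and the sample entries are assumed Chinese equivalents of the example above.

```python
from typing import NamedTuple, Optional


class DictEntry(NamedTuple):
    word: str
    freq: Optional[int]  # None when the frequency is omitted
    tag: Optional[str]   # None when the part of speech is omitted


def parse_userdict_line(line: str) -> DictEntry:
    """Hypothetical helper: split one dictionary line into word, frequency, tag."""
    parts = line.strip().split(" ")
    word, rest = parts[0], parts[1:]
    freq = None
    tag = None
    # The frequency, if present, comes right after the word and is numeric.
    if rest and rest[0].isdigit():
        freq = int(rest[0])
        rest = rest[1:]
    # Whatever follows is taken as the part-of-speech tag.
    if rest:
        tag = rest[0]
    return DictEntry(word, freq, tag)


# Assumed sample entries: word + frequency + tag, word + frequency, word + tag.
sample = ["创新办 3 i", "云计算 5", "凱特琳 nz"]
entries = [parse_userdict_line(line) for line in sample]
for entry in entries:
    print(entry)
```

A file made of such lines, saved as UTF-8, is what gets passed to jieba.load_userdict by path.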
Python Module-Jieba