Python Module-Jieba

Last Update:2018-09-09 Source: Internet

Author: User

Tags idf

Installing Jieba

pip3 install jieba

jieba supports three word segmentation modes:

Precise mode: cuts the sentence into the most accurate segmentation; suitable for text analysis.

Full mode: scans out every possible word in the sentence; very fast, but cannot resolve ambiguity.

Search engine mode: builds on precise mode by splitting long words again to improve recall; suitable for search engine tokenization.
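To make the difference between the three modes concrete, here is a minimal toy sketch in plain Python. It does not use jieba and is not how jieba is implemented: the tiny dictionary, the function names, and the use of forward maximum matching as a stand-in for precise mode are all assumptions made for illustration.

```python
# Toy dictionary (hypothetical); a real segmenter ships a large frequency dictionary.
TOY_DICT = {"中国", "科学", "学院", "科学院", "中国科学院", "计算", "计算所"}
MAX_LEN = max(len(w) for w in TOY_DICT)


def full_mode(text):
    """Full mode: list every dictionary word found at every start position."""
    words = []
    for i in range(len(text)):
        for j in range(i + 1, min(len(text), i + MAX_LEN) + 1):
            if text[i:j] in TOY_DICT:
                words.append(text[i:j])
    return words


def precise_mode(text):
    """Precise mode stand-in: forward maximum matching (longest word first)."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            # Fall back to a single character when nothing longer matches.
            if text[i:j] in TOY_DICT or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words


def search_mode(text):
    """Search mode: precise mode, plus shorter dictionary words inside long words."""
    words = []
    for word in precise_mode(text):
        for n in (2, 3):
            if len(word) > n:
                for k in range(len(word) - n + 1):
                    if word[k:k + n] in TOY_DICT:
                        words.append(word[k:k + n])
        words.append(word)
    return words


sample = "中国科学院计算所"
print("/".join(precise_mode(sample)))
print("/".join(full_mode(sample)))
print("/".join(search_mode(sample)))
```

Precise mode returns one non-overlapping segmentation, full mode returns overlapping candidates, and search mode keeps the precise result while adding the shorter words that a search query might contain.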

The jieba.cut method takes three parameters: the first is the string to segment, the second, cut_all, controls whether full mode is used, and the third, HMM, controls whether the HMM model is used.

# -*- coding: utf-8 -*-
__author__ = "MuT6 sch01ar"

import jieba

# "Fujian's geography is defined by mountains and sea; 90% of its land area is
# mountainous or hilly, known as 'eight parts mountain, one part water, one part farmland'."
sentence = "福建的地理特点是依山傍海，九成陆地面积为山地丘陵地带，被称为八山一水一分田"

word_list = jieba.cut(sentence)
print("Default mode: " + "/".join(word_list))  # cut_all=False is the default option

word_list_1 = jieba.cut(sentence, cut_all=True)
print("Full mode: " + "/".join(word_list_1))  # cut_all=True

The jieba.cut_for_search method takes two parameters: the first is the string to segment, and the second, HMM, controls whether the HMM model is used.

This method is suited to segmenting text when building the inverted index of a search engine.
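For readers unfamiliar with inverted indexes, the following is a minimal sketch of what one looks like once documents have been tokenized. It is plain Python and does not call jieba; the function name and the hand-written token lists (in the style of search mode output) are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(tokenized_docs):
    """Map each token to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in tokenized_docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


# Hypothetical token lists, written by hand for illustration.
docs = {
    1: ["中国", "科学", "学院", "科学院", "中国科学院"],
    2: ["科学", "计算", "计算所"],
}
index = build_inverted_index(docs)
print(index["科学"])  # ids of the documents that contain this token
```

Because search mode also emits the shorter words inside a long word, a query for a short word can still find documents that only contain the long one.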

# -*- coding: utf-8 -*-
__author__ = "MuT6 sch01ar"

import jieba

# "Fujian's geography is defined by mountains and sea; 90% of its land area is
# mountainous or hilly, known as 'eight parts mountain, one part water, one part farmland'."
word_list_2 = jieba.cut_for_search("福建的地理特点是依山傍海，九成陆地面积为山地丘陵地带，被称为八山一水一分田")
print("Search mode: " + "/".join(word_list_2))

jieba.cut and jieba.cut_for_search both return an iterable generator.

jieba.lcut and jieba.lcut_for_search return a list.
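The practical difference between a generator and a list is that a generator can only be consumed once. The sketch below shows this with a stand-in tokenizer written in plain Python; toy_cut and toy_lcut are hypothetical helpers that only imitate the generator/list split, they are not jieba functions.

```python
def toy_cut(text):
    """Stand-in tokenizer: lazily yields whitespace-separated tokens."""
    for token in text.split():
        yield token


def toy_lcut(text):
    """List variant: materializes the generator so it can be reused."""
    return list(toy_cut(text))


tokens = toy_cut("mountains sea hills")
first = "/".join(tokens)
second = "/".join(tokens)  # the generator is already exhausted here

token_list = toy_lcut("mountains sea hills")
third = "/".join(token_list)
fourth = "/".join(token_list)  # a list can be iterated again

print(repr(first), repr(second))
print(repr(third), repr(fourth))
```

If the segmentation result is needed more than once, the list-returning variants (or wrapping the generator in list()) avoid this pitfall.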

Keyword extraction

Based on TF-IDF

The jieba.analyse.extract_tags method takes four parameters: the first is the string to analyze; the second, topK, is the number of keywords with the highest TF-IDF weights to return (default 20); the third, withWeight, controls whether the keyword weights are returned as well (default False); and the fourth, allowPOS, restricts the result to words with the specified parts of speech (default empty, i.e. no filtering).
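As a rough idea of what a TF-IDF ranking computes, here is a minimal sketch in plain Python. It is not jieba's implementation: the function name, the toy IDF table, the default IDF value, and the choice to skip one-character tokens are all assumptions made for illustration.

```python
from collections import Counter


def extract_tags_sketch(tokens, idf, default_idf, top_k=20, with_weight=False):
    """Rank tokens by term frequency multiplied by inverse document frequency."""
    # Skip one-character tokens (an illustrative choice of this sketch).
    counts = Counter(t for t in tokens if len(t) >= 2)
    total = sum(counts.values())
    if total == 0:
        return []
    scores = {w: (c / total) * idf.get(w, default_idf) for w, c in counts.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]
    return ranked if with_weight else [w for w, _ in ranked]


# Hypothetical tokens and IDF values, written by hand.
tokens = ["福建", "地理", "山地", "的", "丘陵", "山地"]
toy_idf = {"福建": 3.0, "地理": 2.5, "山地": 2.0, "丘陵": 0.5}

print(extract_tags_sketch(tokens, toy_idf, default_idf=1.0, top_k=3))
print(extract_tags_sketch(tokens, toy_idf, default_idf=1.0, top_k=3, with_weight=True))
```

A word scores highly when it is frequent in the input text but rare in the reference corpus that the IDF table was built from.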

# -*- coding: utf-8 -*-
__author__ = "MuT6 sch01ar"

import jieba.analyse

# "Fujian's geography is defined by mountains and sea; 90% of its land area is
# mountainous or hilly, known as 'eight parts mountain, one part water, one part farmland'."
string = "福建的地理特点是依山傍海，九成陆地面积为山地丘陵地带，被称为八山一水一分田"

# allowPOS=() means no part-of-speech filtering
a = jieba.analyse.extract_tags(string, topK=20, withWeight=False, allowPOS=())
print(a)

Based on TextRank

jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))
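The idea behind TextRank is to link words that appear close to each other and then rank them with a PageRank-style iteration. The following is a simplified sketch in plain Python, not jieba's implementation: the function name, the window size, the damping factor, the iteration count, and the sample tokens are assumptions chosen for illustration.

```python
from collections import defaultdict


def textrank_sketch(tokens, window=5, damping=0.85, iterations=10, top_k=20):
    """Rank tokens with a PageRank-style update on a co-occurrence graph."""
    # Undirected edge weights: how often two tokens co-occur within the window.
    weights = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if other != word:
                weights[word][other] += 1.0
                weights[other][word] += 1.0
    if not weights:
        return []

    scores = {word: 1.0 / len(weights) for word in weights}
    out_sum = {word: sum(nbrs.values()) for word, nbrs in weights.items()}
    for _ in range(iterations):
        new_scores = {}
        for word in weights:
            # Each neighbor passes on a share of its score proportional to the edge weight.
            rank = sum(w / out_sum[nbr] * scores[nbr] for nbr, w in weights[word].items())
            new_scores[word] = (1 - damping) + damping * rank
        scores = new_scores

    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical tokens, written by hand: one word co-occurs with every other word.
sample_tokens = ["山地", "福建", "丘陵", "福建", "地理", "福建", "海岸"]
ranking = textrank_sketch(sample_tokens, window=2)
print(ranking[0])
```

Unlike TF-IDF, this ranking needs no reference corpus; it relies only on how words co-occur inside the input text.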

# -*- coding: utf-8 -*-
__author__ = "MuT6 sch01ar"

import jieba.analyse

# "Fujian's geography is defined by mountains and sea; 90% of its land area is
# mountainous or hilly, known as 'eight parts mountain, one part water, one part farmland'."
string = "福建的地理特点是依山傍海，九成陆地面积为山地丘陵地带，被称为八山一水一分田"

a = jieba.analyse.textrank(string, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))
print(a)

POS tagging

# -*- coding: utf-8 -*-
__author__ = "MuT6 sch01ar"

import jieba.posseg

# "Fujian's geography is defined by mountains and sea; 90% of its land area is
# mountainous or hilly, known as 'eight parts mountain, one part water, one part farmland'."
strings = jieba.posseg.cut("福建的地理特点是依山傍海，九成陆地面积为山地丘陵地带，被称为八山一水一分田")

# Each item unpacks into the word and its part-of-speech flag
for word, flag in strings:
    print("%s %s" % (word, flag))

List of part-of-speech tags

n      noun
nr     personal name
nr1    Chinese surname
nr2    Chinese given name
nrj    Japanese personal name
nrf    transliterated personal name
ns     place name
nsf    transliterated place name
nt     organization or group name
nz     other proper noun
nl     nominal idiom
ng     nominal morpheme
t      time word
tg     time-word morpheme
s      place word (at home, outside, inside, the West ...)
f      direction or position word
v      verb
vd     adverbial verb
vn     nominal verb
vshi   the verb 是 ("is")
vyou   the verb 有 ("has")
vf     directional verb
vx     formal verb
vi     intransitive verb
vl     verbal idiom
vg     verbal morpheme
a      adjective
ad     adverbial adjective
an     nominal adjective
ag     adjectival morpheme
al     adjectival idiom
b      distinguishing word (main, whole, all ...)
bl     distinguishing-word idiom
z      state word
r      pronoun
rr     personal pronoun
rz     demonstrative pronoun
rzt    temporal demonstrative pronoun
rzs    locative demonstrative pronoun
rzv    predicative demonstrative pronoun
ry     interrogative pronoun
ryt    temporal interrogative pronoun
rys    locative interrogative pronoun
ryv    predicative interrogative pronoun
rg     pronominal morpheme
m      numeral
mq     numeral-classifier compound
q      classifier (measure word)
qv     verbal classifier
qt     temporal classifier
d      adverb
p      preposition
pba    the preposition 把
pbei   the preposition 被 ("by")
c      conjunction
cc     coordinating conjunction
u      particle
uzhe   着
ule    了, 喽
uguo   过
ude1   的, 底
ude2   地
ude3   得
usuo   所
udeng  等, 等等, 云云 ("and so on")
uyy    一样, 一般, 似的, 般 ("the same as", "like")
udh    的话
uls    来讲, 来说, 而言, 说来
uzhi   之
ulian  连 (as in "even elementary school students ...")
e      interjection
y      modal particle (delete yg)
o      onomatopoeia
h      prefix
k      suffix
x      string
xx     non-morpheme character
xu     URL
w      punctuation
wkz    opening bracket, full-width: （ 〔 ［ ｛ 《 【 〖 〈  half-width: ( [ { <
wky    closing bracket, full-width: ） 〕 ］ ｝ 》 】 〗 〉  half-width: ) ] } >
wyz    opening quotation mark, full-width: “ ‘ 『
wyy    closing quotation mark, full-width: ” ’ 』
wj     full stop, full-width: 。
ww     question mark, full-width: ？  half-width: ?
wt     exclamation mark, full-width: ！  half-width: !
wd     comma, full-width: ，  half-width: ,
wf     semicolon, full-width: ；  half-width: ;
wn     enumeration comma, full-width: 、
wm     colon, full-width: ：  half-width: :
ws     ellipsis, full-width: …… …
wp     dash, full-width: —— －－ ——－  half-width: --- ----
wb     percent and per-mille signs, full-width: ％ ‰  half-width: %
wh     unit symbol, full-width: ￥ ＄ ￡ ° ℃  half-width: $
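Because the tags above are hierarchical (for example, nr, ns and nt all start with n), tagged output can be filtered by tag prefix. The sketch below is plain Python on hand-written (word, flag) pairs; the helper name and the sample pairs are hypothetical, and prefix matching is a choice of this sketch rather than a description of how jieba filters.

```python
def filter_by_pos(tagged_words, allowed_prefixes):
    """Keep the words whose tag starts with one of the allowed prefixes."""
    prefixes = tuple(allowed_prefixes)
    return [word for word, flag in tagged_words if flag.startswith(prefixes)]


# Hypothetical tagged output, written by hand for illustration.
tagged = [("福建", "ns"), ("的", "ude1"), ("地理", "n"), ("特点", "n"), ("是", "vshi")]

nouns = filter_by_pos(tagged, ["n"])
verbs = filter_by_pos(tagged, ["v"])
print(nouns)
print(verbs)
```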

Loading dictionaries

jieba.load_userdict(file_name): file_name is a file-like object or the path to a custom dictionary.

If file_name is a path or a file opened in binary mode, the file must be UTF-8 encoded.

Each word occupies one line. A line has three parts: the word, the word frequency (optional), and the part of speech (optional), separated by spaces; the order cannot be changed.

Example:

创新办 3 i
云计算 5
凯特琳 nz
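The line format described above (word, optional frequency, optional part of speech) can be checked with a few lines of plain Python. This is only a sketch: the regular expression and the helper name are assumptions for illustration, not necessarily how jieba parses the file.

```python
import re

# Word, then an optional frequency, then an optional part-of-speech tag.
LINE_PATTERN = re.compile(r"^(.+?)( [0-9]+)?( [a-z]+)?$")


def parse_userdict_line(line):
    """Split one dictionary line into (word, frequency or None, tag or None)."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("invalid dictionary line: %r" % line)
    word, freq, tag = match.groups()
    return word, int(freq) if freq else None, tag.strip() if tag else None


for entry in ["创新办 3 i", "云计算 5", "凯特琳 nz"]:
    print(parse_userdict_line(entry))
```

Since both the frequency and the tag are optional, the parser has to rely on their fixed order and on the fact that one is numeric and the other alphabetic.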
