Python Module: jieba

Source: Internet
Author: User

Installing Jieba

pip3 install jieba

jieba supports three segmentation modes:

Precise mode: segments the sentence as accurately as possible; suitable for text analysis

Full mode: scans out every word that can be formed from the sentence; very fast, but cannot resolve ambiguity

Search engine mode: builds on precise mode by splitting long words again to improve recall; suitable for search engine tokenization
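The difference between full mode and precise mode can be illustrated with a toy dictionary-based segmenter. This is only a sketch, not jieba's actual code: the FREQ table and its counts are made up for illustration, and the function names are hypothetical. It follows the general idea described in jieba's documentation (build a graph of candidate words, then pick the most probable path with dynamic programming).

```python
import math

# Hypothetical word -> frequency table (illustrative numbers, not jieba's data).
FREQ = {"我": 50, "来到": 30, "北京": 40, "清华": 10,
        "华大": 2, "大学": 35, "清华大学": 8}
TOTAL = sum(FREQ.values())


def build_dag(sentence):
    """Map each start index to the end indices of dictionary words starting there."""
    dag = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i + 1, len(sentence) + 1) if sentence[i:j] in FREQ]
        dag[i] = ends or [i + 1]  # unknown character: fall back to a single char
    return dag


def full_mode(sentence):
    """Emit every multi-character dictionary word, plus uncovered single characters."""
    dag = build_dag(sentence)
    words, covered = [], 0
    for i in range(len(sentence)):
        multi = [j for j in dag[i] if j - i > 1]
        if multi:
            for j in multi:
                words.append(sentence[i:j])
                covered = max(covered, j)
        elif i >= covered:
            words.append(sentence[i:i + 1])
            covered = i + 1
    return words


def precise_mode(sentence):
    """Pick the single most probable path through the DAG (right-to-left DP)."""
    dag = build_dag(sentence)
    n = len(sentence)
    best = {n: (0.0, n)}  # index -> (best log-probability from here, next index)
    for i in range(n - 1, -1, -1):
        best[i] = max(
            (math.log(FREQ.get(sentence[i:j], 1) / TOTAL) + best[j][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words


SENTENCE = "我来到北京清华大学"
print("/".join(full_mode(SENTENCE)))     # 我/来到/北京/清华/清华大学/华大/大学
print("/".join(precise_mode(SENTENCE)))  # 我/来到/北京/清华大学
```

Full mode keeps overlapping candidates such as 清华, 华大 and 大学, which is why it cannot resolve ambiguity; precise mode commits to one non-overlapping path.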

The jieba.cut method takes three parameters: the first is the string to segment, the second, cut_all, controls whether full mode is used, and the third, HMM, controls whether the HMM model is used

# -*- coding: utf-8 -*-
__author__ = "MuT6 sch01ar"

import jieba

# English rendering of the sample sentence; jieba is designed for Chinese text
sentence = "The geography of Fujian is defined by mountains and sea: 90% of its land area is mountainous and hilly, known as 'eight parts mountain, one part water, one part farmland'"

word_list = jieba.cut(sentence)
print("Default mode: " + "/".join(word_list))  # cut_all=False is the default

word_list_1 = jieba.cut(sentence, cut_all=True)
print("Full mode: " + "/".join(word_list_1))  # cut_all=True


The jieba.cut_for_search method takes two parameters: the first is the string to segment, and the second, HMM, controls whether the HMM model is used

This method is suited to segmenting text when a search engine builds its inverted index

# -*- coding: utf-8 -*-
__author__ = "MuT6 sch01ar"

import jieba

# English rendering of the sample sentence; jieba is designed for Chinese text
sentence = "The geography of Fujian is defined by mountains and sea: 90% of its land area is mountainous and hilly, known as 'eight parts mountain, one part water, one part farmland'"

word_list_2 = jieba.cut_for_search(sentence)
print("Search mode: " + "/".join(word_list_2))
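To see why the finer-grained tokens help when building an inverted index, here is a minimal sketch. The token lists are written by hand and are assumed to resemble search-mode output; build_inverted_index is a hypothetical helper, not part of jieba.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


# Hand-written token lists, assumed to look like search-mode output.
docs = {
    1: ["我", "来到", "北京", "清华", "华大", "大学", "清华大学"],
    2: ["北京", "大学", "北京大学"],
}

index = build_inverted_index(docs)
print(sorted(index["大学"]))      # [1, 2]
print(sorted(index["清华大学"]))  # [1]
```

Because the long word 清华大学 is also indexed under its shorter parts, a query for 大学 still finds document 1, which is the recall improvement that search engine mode aims for.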


jieba.cut and jieba.cut_for_search both return an iterable generator

jieba.lcut and jieba.lcut_for_search return lists
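The practical difference is that a generator can only be consumed once. The sketch below uses cut_sketch and lcut_sketch, hypothetical stand-ins that split on "/" rather than calling jieba, purely to show the behaviour.

```python
def cut_sketch(sentence):
    """Hypothetical stand-in for a tokenizer that returns a generator."""
    for token in sentence.split("/"):
        yield token


def lcut_sketch(sentence):
    """List-returning counterpart: materialise the generator once."""
    return list(cut_sketch(sentence))


tokens = cut_sketch("福建/的/地理/特点")
first_pass = list(tokens)
second_pass = list(tokens)  # the generator is already exhausted

print(first_pass)   # ['福建', '的', '地理', '特点']
print(second_pass)  # []

words = lcut_sketch("福建/的/地理/特点")
print(len(words), words[0])  # 4 福建
```

If the tokens need to be indexed, counted, or iterated more than once, the list-returning variants are the more convenient choice.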

Keyword extraction

Based on TF-IDF

The jieba.analyse.extract_tags method takes four parameters: the first is the string to analyse; the second, topK, is the number of keywords with the highest TF-IDF weight to return (default 20); the third, withWeight, controls whether the keyword weights are returned as well (default False); and the fourth, allowPOS, restricts the result to words with the specified parts of speech (default empty, i.e. no filtering)

# -*- coding: utf-8 -*-
__author__ = "MuT6 sch01ar"

import jieba.analyse

# English rendering of the sample sentence; jieba is designed for Chinese text
string = "The geography of Fujian is defined by mountains and sea: 90% of its land area is mountainous and hilly, known as 'eight parts mountain, one part water, one part farmland'"

# allowPOS=() means no part-of-speech filtering
a = jieba.analyse.extract_tags(string, topK=20, withWeight=False, allowPOS=())
print(a)
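The TF-IDF weighting behind this kind of keyword extraction can be sketched in a few lines. This is a simplified illustration, not jieba's implementation: extract_tags_sketch, the token list, and the IDF values are all made up, and the input is assumed to be already segmented.

```python
from collections import Counter


def extract_tags_sketch(tokens, idf, top_k=20, with_weight=False, default_idf=1.0):
    """Rank tokens by term frequency times inverse document frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    weights = {
        word: count / total * idf.get(word, default_idf)
        for word, count in counts.items()
    }
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return ranked if with_weight else [word for word, _ in ranked]


# Hypothetical pre-segmented text and IDF table.
tokens = ["福建", "山地", "山地", "的", "的", "的"]
idf = {"福建": 8.0, "山地": 6.0, "的": 0.1}

print(extract_tags_sketch(tokens, idf, top_k=2))  # ['山地', '福建']
```

The very frequent word 的 ends up last because its low IDF outweighs its high term frequency, which is the point of the weighting.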


Based on TextRank

jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))

# -*- coding: utf-8 -*-
__author__ = "MuT6 sch01ar"

import jieba.analyse

# English rendering of the sample sentence; jieba is designed for Chinese text
string = "The geography of Fujian is defined by mountains and sea: 90% of its land area is mountainous and hilly, known as 'eight parts mountain, one part water, one part farmland'"

a = jieba.analyse.textrank(string, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))
print(a)
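As a rough illustration of the TextRank idea (words that co-occur within a window are linked, and scores are propagated along the links in PageRank fashion), here is a small sketch. The function name, window size, damping factor, and iteration count are assumptions chosen for the example and are not taken from jieba's source.

```python
from collections import defaultdict


def textrank_sketch(tokens, window=1, damping=0.85, iterations=20, top_k=3):
    """Rank tokens on an undirected co-occurrence graph with PageRank-style updates."""
    # Edge weight = number of times two different tokens fall within the window.
    graph = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            other = tokens[j]
            if other != word:
                graph[word][other] += 1
                graph[other][word] += 1

    scores = {word: 1.0 for word in graph}
    for _ in range(iterations):
        updated = {}
        for word in graph:
            # Each neighbour passes on its score in proportion to the edge weight.
            rank = sum(
                weight / sum(graph[nbr].values()) * scores[nbr]
                for nbr, weight in graph[word].items()
            )
            updated[word] = (1 - damping) + damping * rank
        scores = updated
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical token sequence in which "hub" co-occurs with every other token.
tokens = ["a", "hub", "b", "hub", "c", "hub", "d"]
ranking = textrank_sketch(tokens, top_k=1)
print(ranking)  # ['hub']
```

Unlike TF-IDF, this ranking needs no external IDF table; it relies only on how the words are connected within the text itself.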


POS tagging

# -*- coding: utf-8 -*-
__author__ = "MuT6 sch01ar"

import jieba.posseg

# English rendering of the sample sentence; jieba is designed for Chinese text
strings = jieba.posseg.cut("The geography of Fujian is defined by mountains and sea: 90% of its land area is mountainous and hilly, known as 'eight parts mountain, one part water, one part farmland'")

# Each item unpacks into the word and its part-of-speech flag
for word, flag in strings:
    print("%s %s" % (word, flag))


List of parts of speech

n       noun
nr      personal name
nr1     Chinese surname
nr2     Chinese given name
nrj     Japanese personal name
nrf     transliterated personal name
ns      place name
nsf     transliterated place name
nt      organization or group name
nz      other proper noun
nl      noun idiom
ng      noun morpheme
t       time word
tg      time-word morpheme
s       location word (home, outside, inside, the West ...)
f       direction word
v       verb
vd      adverbial verb
vn      nominal verb
vshi    the verb "是" (to be)
vyou    the verb "有" (to have)
vf      directional verb
vx      formal verb
vi      intransitive verb
vl      verb idiom
vg      verb morpheme
a       adjective
ad      adverbial adjective
an      nominal adjective
ag      adjective morpheme
al      adjective idiom
b       distinguishing word (main, whole, all ...)
bl      distinguishing-word idiom
z       state word
r       pronoun
rr      personal pronoun
rz      demonstrative pronoun
rzt     time demonstrative pronoun
rzs     location demonstrative pronoun
rzv     predicate demonstrative pronoun
ry      interrogative pronoun
ryt     time interrogative pronoun
rys     location interrogative pronoun
ryv     predicate interrogative pronoun
rg      pronoun morpheme
m       numeral
mq      numeral-quantifier compound
q       quantifier (measure word)
qv      verbal quantifier
qt      time quantifier
d       adverb
p       preposition
pba     the preposition "把"
pbei    the preposition "被"
c       conjunction
cc      coordinating conjunction
u       particle
uzhe    着
ule     了, 喽
uguo    过
ude1    的, 底
ude2    地
ude3    得
usuo    所
udeng   等, 等等, 云云
uyy     一样, 一般, 似的, 般
udh     的话
uls     来讲, 来说, 而言, 说来
uzhi    之
ulian   连 (as in "even elementary school students ...")
e       interjection
y       modal particle (yg removed)
o       onomatopoeia
h       prefix
k       suffix
x       string
xx      non-morpheme character
xu      URL
w       punctuation
wkz     opening bracket, full-width: （ 〔 ［ ｛ 《 【 〖 〈  half-width: ( [ { <
wky     closing bracket, full-width: ） 〕 ］ ｝ 》 】 〗 〉  half-width: ) ] } >
wyz     opening quote, full-width: “ ‘ 『
wyy     closing quote, full-width: ” ’ 』
wj      full stop, full-width: 。
ww      question mark, full-width: ？  half-width: ?
wt      exclamation mark, full-width: ！  half-width: !
wd      comma, full-width: ，  half-width: ,
wf      semicolon, full-width: ；  half-width: ;
wn      enumeration comma, full-width: 、
wm      colon, full-width: ：  half-width: :
ws      ellipsis, full-width: …… …
wp      dash, full-width: —— －－ ——－  half-width: --- ----
wb      percent and per-mille sign, full-width: ％ ‰  half-width: %
wh      unit symbol, full-width: ￥ ＄ ￡ ° ℃  half-width: $

Loading a custom dictionary

jieba.load_userdict(file_name): file_name is a file-like object or the path to a custom dictionary

If file_name is a path or a file opened in binary mode, the file must be UTF-8 encoded

Each word takes one line; each line has three parts: the word, the word frequency (optional), and the POS tag (optional), separated by spaces, in that fixed order

Example (entries for "Innovation Office", "cloud computing", and "Catherine"):

创新办 3 i
云计算 5
凯特琳 nz
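To make the line format concrete, here is a small parser for dictionary lines of this shape. It is only a sketch: parse_userdict_line and the regular expression are assumptions written for illustration and are not jieba's loader.

```python
import re

# word, then optional " <frequency>", then optional " <pos tag>"
LINE_RE = re.compile(r"^(.+?)( [0-9]+)?( [a-z]+)?$")


def parse_userdict_line(line):
    """Return (word, frequency or None, tag or None), or None for a blank line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    word, freq, tag = match.groups()
    return (
        word,
        int(freq) if freq else None,
        tag.strip() if tag else None,
    )


for entry in ["创新办 3 i", "云计算 5", "凯特琳 nz"]:
    print(parse_userdict_line(entry))
# ('创新办', 3, 'i')
# ('云计算', 5, None)
# ('凯特琳', None, 'nz')
```

Because the frequency must be numeric and the tag alphabetic in this sketch, either field can be left out without making the line ambiguous, as long as the order is kept.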
