English text separates words with spaces, but Chinese text does not, so Chinese requires an explicit word segmentation step. This post covers some of the principles behind segmentation, starting with jieba, the most commonly used Python segmentation tool.
First of all, when practicing, be careful not to name your own file jieba.py, otherwise you will get errors such as "module 'jieba' has no attribute 'cut'". If the error persists after deleting your jieba.py, it is because the cached jieba.pyc file was not deleted as well.
(1) Basic segmentation functions and usage
jieba supports three segmentation modes:
- Precise mode: cuts the sentence as accurately as possible; suitable for text analysis.
- Full mode: scans out every word in the sentence that is in the dictionary; very fast, but cannot resolve ambiguity.
- Search engine mode: on top of precise mode, long words are cut again to improve recall; suitable for search-engine indexing.
jieba.cut and jieba.cut_for_search return a generator, which can be iterated with a for loop to obtain each word.
The jieba.cut method accepts three parameters:
- the string to be segmented
- cut_all: controls whether full mode is used
- HMM: controls whether the HMM model is used
The jieba.cut_for_search method accepts two parameters:
- the string to be segmented
- HMM: whether to use the HMM model
```python
import jieba

seg_list = jieba.cut("我爱自然语言处理", cut_all=True)   # "I love natural language processing"
print("Full Mode: " + "/".join(seg_list))

seg_list = jieba.cut("我爱自然语言处理", cut_all=False)
print("Precise Mode: " + "/".join(seg_list))

# The default is precise mode
seg_list = jieba.cut("他毕业于上海交通大学，之后在百度深度学习研究院进行研究")  # "He graduated from Shanghai Jiao Tong University..."
print(", ".join(seg_list))

# Search engine mode
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在哈佛大学深造")  # "Xiao Ming graduated from the CAS Institute of Computing Technology..."
print(", ".join(seg_list))
```
jieba.lcut and jieba.lcut_for_search return a list directly:
```python
import jieba

result_lcut = jieba.lcut("小明硕士毕业于中国科学院计算所，后在哈佛大学深造")
result_lcut_for_search = jieba.lcut_for_search("小明硕士毕业于中国科学院计算所，后在哈佛大学深造")
print('result_lcut:', result_lcut)
print('result_lcut_for_search:', result_lcut_for_search)

print(" ".join(result_lcut))
print(" ".join(result_lcut_for_search))
```
Adding a user-defined dictionary:
In many cases we need to segment text for our own scenario, and the domain will contain proprietary vocabulary that is not in the default dictionary.
- 1. A user dictionary can be loaded with jieba.load_userdict(file_name)
- 2. A small number of words can be added manually in the following ways:
  - dynamically modify the dictionary in the program with add_word(word, freq=None, tag=None) and del_word(word)
  - use suggest_freq(segment, tune=True) to adjust the frequency of a single word so that it can (or cannot) be cut out.
```python
import jieba

# "如果放到旧字典中将出错。" -- without adjustment, "中将" is wrongly kept as one word.
result_cut = jieba.cut('如果放到旧字典中将出错。', HMM=False)
print('/'.join(result_cut))

# Tell jieba that "中" and "将" should be cut apart here.
jieba.suggest_freq(('中', '将'), True)
result_cut = jieba.cut('如果放到旧字典中将出错。', HMM=False)
print('/'.join(result_cut))
```
(2) Keyword extraction
Keyword extraction based on TF-IDF:
import jieba.analyse
- jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
- sentence: the text to extract keywords from
- topK: return the K keywords with the highest TF-IDF weights; the default is 20
- withWeight: whether to return the weight value along with each keyword; the default is False
- allowPOS: only include words with the specified parts of speech; the default is empty, i.e. no filtering