Chinese word segmentation in natural language processing with jieba

Source: Internet
Author: User
Tags: idf

English text can be segmented simply by splitting on spaces; Chinese is different. Before getting into the principles behind word segmentation, let's first look at jieba, the most commonly used Python tool for the job.

First of all, be careful not to name your practice script jieba.py, otherwise you will get errors such as "module 'jieba' has no attribute 'cut'". If the error persists after you delete your jieba.py, it is because the generated jieba.pyc file has not been deleted yet.

(1) Basic word segmentation functions and usage

First, the three segmentation modes:

Precise mode: tries to cut the sentence into the most accurate segmentation; suitable for text analysis.

Full mode: scans out every word in the sentence that can form a word; very fast, but it cannot resolve ambiguity.

Search engine mode: on top of precise mode, long words are segmented again to improve recall; suitable for building search engine indexes.

jieba.cut and jieba.cut_for_search return a generator; you can iterate over it with a for loop to get each word.
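For example, here is a minimal sketch of iterating over the generator; the Chinese sentence (meaning "I love natural language processing") is chosen only for illustration and is not from the original code:

import jieba

# jieba.cut returns a generator; iterate it to get one word at a time.
for word in jieba.cut("我爱自然语言处理"):
    print(word)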

The jieba.cut method accepts three parameters:

    • the string to be segmented
    • cut_all, which controls whether full mode is used
    • HMM, which controls whether the HMM model is used

The jieba.cut_for_search method accepts two parameters:

    • the string to be segmented
    • whether to use the HMM model

  

import jieba

seg_list = jieba.cut("I love learning natural language processing", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("I love natural language processing", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # precise mode

seg_list = jieba.cut("He graduated from Shanghai Jiaotong University and did research at Baidu's Deep Learning Research Institute")  # precise mode by default
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("Xiao Ming graduated from the Institute of the Chinese Academy of Sciences and later studied at Harvard University")  # search engine mode
print(", ".join(seg_list))

jieba.lcut and jieba.lcut_for_search return a list directly.

import jieba

result_lcut = jieba.lcut("Xiao Ming graduated from the Institute of the Chinese Academy of Sciences and later studied at Harvard University")
result_lcut_for_search = jieba.lcut_for_search("Xiao Ming graduated from the Institute of the Chinese Academy of Sciences and later studied at Harvard University")
print('result_lcut:', result_lcut)
print('result_lcut_for_search:', result_lcut_for_search)

print(" ".join(result_lcut))
print(" ".join(result_lcut_for_search))

To add a user-defined dictionary:

In many cases we need to segment text for our own scenario, and that text often contains domain-specific vocabulary.

    • 1. A user dictionary can be loaded with jieba.load_userdict(file_name); a sketch follows the code example below.
    • 2. A small number of words can also be added manually in the following ways:
      • Dynamically modify the dictionary in your program with add_word(word, freq=None, tag=None) and del_word(word).
      • Use suggest_freq(segment, tune=True) to adjust the frequency of a single word so that it can (or cannot) be split out.
import jieba

result_cut = jieba.cut("If you put it in the old dictionary, you will get an error.", HMM=False)
print('/'.join(result_cut))

jieba.suggest_freq(('in', 'will be'), True)

result_cut = jieba.cut("If you put it in the old dictionary, you will get an error.", HMM=False)
print('/'.join(result_cut))
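As a complement, here is a minimal sketch of loading a user dictionary from a file and editing the dictionary at runtime; the file name userdict.txt, its entries, and the sample words are assumptions made for illustration only:

import jieba

# Hypothetical dictionary file "userdict.txt", one entry per line in the
# format: word [frequency] [POS tag], e.g.
#   自然语言处理 10 n
#   深度学习 5 n
jieba.load_userdict("userdict.txt")

# A small number of words can also be adjusted directly in code:
jieba.add_word("云计算", freq=10, tag="n")  # add a word dynamically
jieba.del_word("自定义词")                  # remove a word from the dictionary

print("/".join(jieba.cut("自然语言处理和深度学习是热门方向")))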

(2) Keyword extraction

Keyword extraction based on TF-IDF

import jieba.analyse

    • jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
      • sentence: the text from which keywords are extracted
      • topK: return the top-K keywords ranked by TF-IDF weight; the default is 20
      • withWeight: whether to return the weight value together with each keyword; the default is False
      • allowPOS: include only words with the specified parts of speech; the default is empty, i.e. no filtering
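A minimal sketch of calling extract_tags; the sample text and the topK value are chosen only for illustration:

import jieba.analyse

# Hypothetical sample text (Chinese, since jieba's built-in TF-IDF dictionary is for Chinese).
text = "自然语言处理是人工智能的重要方向，中文分词是自然语言处理的基础任务"

# Return the top 5 keywords ranked by TF-IDF weight, together with their weights.
for keyword, weight in jieba.analyse.extract_tags(text, topK=5, withWeight=True):
    print(keyword, weight)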
