English text separates words with spaces, but Chinese text does not, so Chinese requires an explicit word segmentation step. This post covers some of the principles behind segmentation, starting with jieba, the most commonly used Python segmentation tool.
First of all, when practicing, be careful not to name your own file jieba.py, otherwise you will get errors such as "module 'jieba' has no attribute 'cut'". If the error persists after deleting your jieba.py, it is because the cached jieba.pyc file was not deleted as well.
(1) Basic segmentation functions and usage
jieba supports three segmentation modes:
- Precise mode: cuts the sentence as accurately as possible; suitable for text analysis.
- Full mode: scans out every word in the sentence that is in the dictionary; very fast, but cannot resolve ambiguity.
- Search engine mode: on top of precise mode, long words are cut again to improve recall; suitable for search-engine indexing.
jieba.cut and jieba.cut_for_search return a generator, which can be iterated with a for loop to obtain each word.
The jieba.cut method accepts three parameters:
- the string to be segmented
- cut_all: controls whether full mode is used
- HMM: controls whether the HMM model is used
The jieba.cut_for_search method accepts two parameters:
- the string to be segmented
- HMM: whether to use the HMM model
```python
import jieba

seg_list = jieba.cut("我爱自然语言处理", cut_all=True)   # "I love natural language processing"
print("Full Mode: " + "/".join(seg_list))

seg_list = jieba.cut("我爱自然语言处理", cut_all=False)
print("Precise Mode: " + "/".join(seg_list))

# The default is precise mode
seg_list = jieba.cut("他毕业于上海交通大学，之后在百度深度学习研究院进行研究")  # "He graduated from Shanghai Jiao Tong University..."
print(", ".join(seg_list))

# Search engine mode
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在哈佛大学深造")  # "Xiao Ming graduated from the CAS Institute of Computing Technology..."
print(", ".join(seg_list))
```
jieba.lcut and jieba.lcut_for_search return a list directly:
```python
import jieba

result_lcut = jieba.lcut("小明硕士毕业于中国科学院计算所，后在哈佛大学深造")
result_lcut_for_search = jieba.lcut_for_search("小明硕士毕业于中国科学院计算所，后在哈佛大学深造")
print('result_lcut:', result_lcut)
print('result_lcut_for_search:', result_lcut_for_search)

print(" ".join(result_lcut))
print(" ".join(result_lcut_for_search))
```
Adding a user-defined dictionary:
In many cases we need to segment text for our own scenario, and the domain will contain proprietary vocabulary that is not in the default dictionary.
- 1. A user dictionary can be loaded with jieba.load_userdict(file_name)
- 2. A small number of words can be added manually in the following ways:
  - dynamically modify the dictionary in the program with add_word(word, freq=None, tag=None) and del_word(word)
  - use suggest_freq(segment, tune=True) to adjust the frequency of a single word so that it can (or cannot) be cut out.
```python
import jieba

# "如果放到旧字典中将出错。" -- without adjustment, "中将" is wrongly kept as one word.
result_cut = jieba.cut('如果放到旧字典中将出错。', HMM=False)
print('/'.join(result_cut))

# Tell jieba that "中" and "将" should be cut apart here.
jieba.suggest_freq(('中', '将'), True)
result_cut = jieba.cut('如果放到旧字典中将出错。', HMM=False)
print('/'.join(result_cut))
```
(2) Keyword extraction
Keyword extraction based on TF-IDF:
import jieba.analyse
- jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
- sentence: the text to extract keywords from
- topK: return the K keywords with the highest TF-IDF weights; the default is 20
- withWeight: whether to return the weight value along with each keyword; the default is False
- allowPOS: only include words with the specified parts of speech; the default is empty, i.e. no filtering