Chinese word segmentation refers to splitting a sequence of Chinese characters into separate words. The jieba module is a useful word segmentation tool for Python. The string to be segmented can be a Unicode string, a UTF-8 encoded string, or a GBK encoded string. Note: passing a GBK encoded string directly is not recommended, because it may be incorrectly decoded as UTF-8. Three segmentation modes are supported:
1. Accurate mode: tries to cut the sentence as precisely as possible; suitable for text analysis.
2. Full mode: scans out all the words in the sentence that can form words; very fast, but it cannot resolve ambiguity.
3. Search engine mode: based on accurate mode, it further splits long words to improve recall; suitable for word segmentation in search engines.

import jieba

# Accurate mode
seg_list = jieba.cut("I have been to Tsinghua University and Peking University.")
# Full mode
seg_list = jieba.cut("I have been to Tsinghua University and Peking University.", cut_all=True)
# Search engine mode
seg_list = jieba.cut_for_search("I have been to Tsinghua University and Peking University.")
# Accurate mode: I/have been to/Tsinghua University/and/Peking University/.
# Full mode: I/have been to/Tsinghua University/Huada/and/Beijing/Peking University
# Search engine mode: I/have been to/Tsinghua/Huada/and/Beijing/Peking University
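Note that jieba.cut and jieba.cut_for_search return generators rather than lists, so the tokens are usually joined with "/" for display, which is how the outputs above were produced. Below is a minimal sketch; the English sentence stands in for the original Chinese input, so the actual tokens will differ when segmenting real Chinese text.

import jieba

sentence = "I have been to Tsinghua University and Peking University."

# Accurate mode (default): jieba.cut yields one token at a time
print("Accurate mode: " + "/".join(jieba.cut(sentence)))

# Full mode: enumerate every possible word in the sentence
print("Full mode: " + "/".join(jieba.cut(sentence, cut_all=True)))

# Search engine mode: further split long words to improve recall
print("Search engine mode: " + "/".join(jieba.cut_for_search(sentence)))

# jieba.lcut / jieba.lcut_for_search return plain lists instead of generators
tokens = jieba.lcut(sentence)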