At present, the Chinese word segmentation tools I use regularly include jieba (the "stuttering" segmenter), NLPIR, and a few others.
Recently I have been using jieba, and I can recommend it; it works quite well.
First, an introduction to jieba word segmentation
jieba segments Chinese text based on three main ideas:
- Efficient word-graph scanning based on a trie (prefix tree) structure, generating a directed acyclic graph (DAG) of all the possible words that the Chinese characters in a sentence can form
- Dynamic programming to find the maximum-probability path, i.e., the segmentation whose combined word frequency is highest (see the toy sketch after this list)
- For out-of-vocabulary words, an HMM based on the word-forming ability of Chinese characters, decoded with the Viterbi algorithm
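To make the second point concrete, here is a toy sketch of the maximum-probability-path idea. It is not jieba's actual code: the FREQ dictionary, its frequencies, and the helper names (build_dag, best_route, toy_cut) are made up for illustration; jieba's real routine works over its full dict.txt.

# -*- coding: utf-8 -*-
# Toy illustration of "maximum-probability path over a DAG of candidate words".
# FREQ is a tiny hand-made dictionary; jieba's real dictionary is dict.txt.
import math

FREQ = {u"我": 5, u"来到": 3, u"北京": 8, u"清华": 4, u"清华大学": 6, u"大学": 7}
TOTAL = float(sum(FREQ.values()))

def build_dag(sentence):
    # For each start index i, list every end index j such that sentence[i:j] is a dictionary word.
    # A single character is always allowed so that the graph stays connected.
    dag = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i + 1, len(sentence) + 1) if sentence[i:j] in FREQ]
        dag[i] = ends or [i + 1]
    return dag

def best_route(sentence, dag):
    # Dynamic programming from right to left:
    # route[i] = (best log-probability of segmenting sentence[i:], end index of the first word).
    n = len(sentence)
    route = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        route[i] = max((math.log(FREQ.get(sentence[i:j], 1) / TOTAL) + route[j][0], j)
                       for j in dag[i])
    return route

def toy_cut(sentence):
    route = best_route(sentence, build_dag(sentence))
    i, words = 0, []
    while i < len(sentence):
        j = route[i][1]
        words.append(sentence[i:j])
        i = j
    return words

print("/ ".join(toy_cut(u"我来到北京清华大学")))   # -> 我/ 来到/ 北京/ 清华大学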
Second, installation and use (Linux)
1. Download the toolkit, unzip it, enter the directory, and run: python setup.py install
Hints: a. A good habit after downloading software is to read the README first and only then start operating. (Skipping the README and just trying things while searching Baidu leads to a lot of detours.)
b. If the install command fails with a permission error (some people hit this because they lack sufficient privileges), run: sudo !! (here "!!" stands for the previous command, i.e., the install command above); with sudo it runs normally.
2. When segmenting with jieba, the function you will always use is jieba.cut(arg1, arg2). This is the segmentation function, and you only need to understand the following three points to use it:
a. The cut method accepts two parameters: the first (arg1) is the string to be segmented, and the second (arg2) controls the segmentation mode.
There are two segmentation modes: the default (accurate) mode, which tries to cut the sentence as precisely as possible and is suitable for text analysis, and full mode, which scans out every word contained in the sentence and is suitable for search engines.
b. The string to be segmented can be a GBK string, a UTF-8 string, or a Unicode string.
Python users should pay attention to encoding here. Python 2 processes characters as ASCII by default, so when non-ASCII characters appear (for example, Chinese characters written in the source code), it reports the error "ASCII codec can't encode character". The solution is to add a declaration at the top of the file: # -*- coding: utf-8 -*-, which tells the Python interpreter "this file is encoded in UTF-8; when you decode it, please use UTF-8". (Remember, this declaration must be at the very top of the file; if it is not at the top, the encoding problem remains unsolved.) For converting between encodings you can refer to other blog posts. (PS: my personal understanding is that "import sys; reload(sys); sys.setdefaultencoding('utf-8')" is roughly equivalent to "# -*- coding: utf-8 -*-".)
c. jieba.cut returns an iterable generator. You can use a for loop to obtain each word (a Unicode string) produced by the segmenter, or convert the result to a list with list(jieba.cut(...)).
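A minimal sketch of point c, assuming jieba is already installed (the sample sentence is the same one used in the example below):

# -*- coding: utf-8 -*-
# Point c: jieba.cut returns a generator of unicode words.
import jieba

gen = jieba.cut(u"我来到北京清华大学")          # a generator; nothing is segmented yet
for word in gen:                                # iterate to pull out each word (unicode)
    print(word)

words = list(jieba.cut(u"我来到北京清华大学"))  # or materialize the generator as a list
print("/ ".join(words))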
3. The following example illustrates the usage described above (the sentence "我来到北京清华大学" means "I came to Beijing Tsinghua University"):

# -*- coding: utf-8 -*-
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print "Full Mode:", "/ ".join(seg_list)

seg_list = jieba.cut("我来到北京清华大学")
print "Default Mode:", "/ ".join(seg_list)
The output is:
Full Mode: 我/ 来/ 来到/ 到/ 北/ 北京/ 京/ 清/ 清华/ 清华大学/ 华/ 华大/ 大/ 大学/ 学
Default Mode: 我/ 来到/ 北京/ 清华大学 (i.e., "I / came to / Beijing / Tsinghua University")
Third, other features of jieba Chinese word segmentation
1. Adding and managing a custom dictionary
jieba stores all of its built-in dictionary content in dict.txt, and you can keep improving the contents of dict.txt yourself.
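A hedged sketch of supplementing the built-in dictionary with your own entries through jieba.load_userdict (the file name userdict.txt, its sample entries, and the test sentence are made up for illustration):

# -*- coding: utf-8 -*-
# Sketch: load a user dictionary on top of the built-in dict.txt.
# "userdict.txt" is a hypothetical file; each line is "word frequency", e.g.
#   云计算 5
#   自然语言处理 10
import jieba

jieba.load_userdict("userdict.txt")   # merge the custom entries into the dictionary

# words from the user dictionary are now kept whole during segmentation
print("/ ".join(jieba.cut(u"云计算和自然语言处理都离不开分词")))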
2. Keyword Extraction
Keywords are extracted by computing TF-IDF weights for the words obtained after segmentation.
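A minimal sketch using jieba's analyse submodule, which provides TF-IDF based keyword extraction (the sample text and the topK value are just placeholders):

# -*- coding: utf-8 -*-
# Sketch: TF-IDF keyword extraction with jieba.analyse.
import jieba.analyse

text = u"结巴分词是一个 Python 中文分词组件，支持精确模式、全模式和搜索引擎模式。"
keywords = jieba.analyse.extract_tags(text, topK=5)   # the 5 words with the highest TF-IDF weight
print("/ ".join(keywords))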