jiebaR is the R version of the "Jieba" ("stuttering") Chinese word segmentation library. It supports four segmentation engines: the maximum probability model (mpsegment), the Hidden Markov model (hmmsegment), the query/index model (querysegment), and the mixed model (mixsegment). It also provides POS tagging, keyword extraction, and Simhash text similarity comparison. The project is built with Rcpp and CppJieba.
Characteristics
- Supports Windows and Linux (not yet tested on Mac).
- Loads multiple segmentation workers at the same time via Rcpp modules, each with its own segmentation mode and dictionary.
- Supports multiple segmentation modes, Chinese name recognition, keyword extraction, POS tagging, and Simhash similarity comparison of texts.
- Supports loading custom user dictionaries, including word frequency and part of speech.
- Detects the input encoding automatically.
- Installs easily, with no complex configuration.
- Can be called from other languages, for example via rpy2 or jvmr.
- Released under the MIT license.
Installation
The package has not yet been released to CRAN, but it can be installed from GitHub. On Windows you need to install Rtools first, or you can download a binary package instead:
```r
library(devtools)
install_github("qinwf/jiebaR")
```
Usage

Word segmentation

jiebaR provides four types of segmentation engines. A segmenter is initialized with worker() (or equivalently jiebar()), and text is segmented with segment() or the <= operator.
```r
library(jiebaR)
## Build a segmentation engine with the default parameters
mixseg = worker()
## Equivalent to:
## jiebar( type = "mix", dict = "inst/dict/jieba.dict.utf8",
##         hmm  = "inst/dict/hmm_model.utf8",   ### HMM model data
##         user = "inst/dict/user.dict.utf8")   ### user-defined dictionary
mixseg <= "江州市长江大桥参加了长江大桥的通车仪式"  ### <= is the segmentation operator
## Equivalent to: segment("江州市长江大桥参加了长江大桥的通车仪式", mixseg)
```
```
[1] "江州"     "市长"     "江大桥"   "参加"     "了"       "长江大桥"
[7] "的"       "通车"     "仪式"
```
Segmenting files is also supported:
```r
mixseg <= "./temp.dat"  ### The input file encoding is detected automatically;
                        ### output goes to the same directory by default.
## segment("./temp.dat", mixseg)
```
When initializing a segmentation engine, you can customize the dictionary paths and choose among the four engines:
- The maximum probability model (mpsegment) builds a directed acyclic graph of candidate words from the trie dictionary and selects the most probable path by dynamic programming; it is the core of the segmentation algorithm.
- The Hidden Markov model (hmmsegment) uses an HMM trained on the People's Daily and other corpora. Each character is assigned one of four hidden states, (B, E, M, S), marking the beginning, end, or middle of a word, or a single-character word. The model data are shipped in dict/hmm_model.utf8, and decoding uses the Viterbi algorithm.
- The mixed model (mixsegment) is the default of the four engines; it combines the maximum probability model with the Hidden Markov model.
- The query/index model (querysegment) first segments with the mixed model, then, for the longer words in the result, enumerates all possible substrings and checks whether they exist in the dictionary.
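To make the maximum probability step concrete, here is a minimal Python sketch of the algorithm (not jiebaR's actual implementation): a toy dictionary with invented frequencies, a DAG of candidate words, and right-to-left dynamic programming over log probabilities.

```python
import math

# Toy dictionary with invented frequencies -- an assumption for illustration;
# jiebaR ships real frequencies in jieba.dict.utf8.
FREQ = {"江州": 5, "市长": 10, "江大桥": 4, "长江大桥": 8, "长江": 6,
        "大桥": 4, "市": 3, "长": 3, "江": 2, "州": 1}
TOTAL = sum(FREQ.values())

def segment_mp(sentence):
    """Maximum-probability segmentation: build a DAG of dictionary words,
    then pick the highest log-probability path by dynamic programming."""
    n = len(sentence)
    # dag[i] lists every j such that sentence[i:j] is a dictionary word
    # (single characters are always allowed as a fallback).
    dag = {i: [j for j in range(i + 1, n + 1)
               if j == i + 1 or sentence[i:j] in FREQ]
           for i in range(n)}
    logtotal = math.log(TOTAL)
    # route[i] = (best log prob of sentence[i:], end index of the first word)
    route = [(0.0, n)] * (n + 1)
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j], 1)) - logtotal + route[j][0], j)
            for j in dag[i])
    words, i = [], 0
    while i < n:
        words.append(sentence[i:route[i][1]])
        i = route[i][1]
    return words

print(segment_mp("江州市长江大桥"))  # ['江州', '市长', '江大桥']
```

With these toy frequencies the best path matches the engine's output for the same prefix: the joint probability of 市长 + 江大桥 beats 市 + 长江大桥, even though 长江大桥 alone is a frequent word.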
```r
mixseg2 = worker( type = "mix", dict = "dict/jieba.dict.utf8",
                  hmm  = "dict/hmm_model.utf8",
                  user = "dict/test.dict.utf8",
                  detect = T, symbol = F,
                  lines = 1e+05, output = NULL )
mixseg2   ### print the worker's settings
```
```
Worker Type:  Mix Segment

Detect Encoding :  TRUE
Default Encoding:  UTF-8
Keep Symbols    :  FALSE
Output Path     :
Write File      :  TRUE
Max Read Lines  :  1e+05

Fixed Model Components:

$dict
[1] "dict/jieba.dict.utf8"

$hmm
[1] "dict/hmm_model.utf8"

$user
[1] "dict/test.dict.utf8"

$detect $encoding $symbol $output $write $lines can be reset.
```
Some settings can be changed after initialization with the usual R $ operator; for example, WorkerName$symbol = T keeps punctuation symbols in the output. Other parameters are fixed when the worker is initialized and cannot be modified, but WorkerName$PrivateVarible can still be read to inspect them.
```r
mixseg$encoding
mixseg$detect = F
```
You can also customize the user dictionary. We recommend building it with the Deep Blue dictionary converter (深蓝词库转换), which can quickly convert Sogou cell dictionaries and other input-method dictionaries into the jiebaR dictionary format.
```r
ShowDictPath()  ### show the dictionary path
EditDict()      ### edit the user dictionary
?EditDict()     ### open the help page
```
POS tagging
Use the <=.tagger operator or the tag function to segment and tag parts of speech in one pass. POS tagging uses the mixed-model segmenter, and the tag set is compatible with ICTCLAS.
```r
words = "我爱北京天安门"
tagger = worker( "tag" )
tagger <= words
```
```
       r        v       ns       ns
    "我"     "爱"   "北京" "天安门"
```
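Under the hood, HMM-based tagging decodes the most likely tag sequence with the Viterbi algorithm. Below is a minimal Python sketch using a toy model whose states are the ICTCLAS tags seen above (r = pronoun, v = verb, ns = place name); all probabilities are invented for illustration and are not jiebaR's actual parameters.

```python
def viterbi(obs, states, start, trans, emit):
    """Return the most probable hidden-state sequence for obs."""
    # V[t][s] = (best path probability ending in state s at step t, predecessor)
    V = [{s: (start[s] * emit[s].get(obs[0], 1e-8), None) for s in states}]
    for t in range(1, len(obs)):
        V.append({s: max((V[t - 1][p][0] * trans[p][s] * emit[s].get(obs[t], 1e-8), p)
                         for p in states)
                  for s in states})
    # Backtrack from the best final state.
    state = max(states, key=lambda s: V[-1][s][0])
    path = [state]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return path[::-1]

# Toy model: invented probabilities over the three tags.
states = ["r", "v", "ns"]
start = {"r": 0.5, "v": 0.2, "ns": 0.3}
trans = {"r":  {"r": 0.1, "v": 0.6, "ns": 0.3},
         "v":  {"r": 0.2, "v": 0.1, "ns": 0.7},
         "ns": {"r": 0.2, "v": 0.3, "ns": 0.5}}
emit = {"r": {"我": 0.8}, "v": {"爱": 0.7}, "ns": {"北京": 0.5, "天安门": 0.4}}

print(viterbi(["我", "爱", "北京", "天安门"], states, start, trans, emit))
# ['r', 'v', 'ns', 'ns']
```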
Keyword extraction
Keyword extraction weights words by inverse document frequency (IDF); the IDF corpus can be switched to a custom path. Usage is the same as for a segmenter, and the topn parameter sets the number of keywords to return.
```r
keys = worker( "keywords" , topn = 1 )
keys <= "我爱北京天安门"
keys <= "一个文件路径.txt"
```
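The scoring idea behind IDF-based keyword extraction can be sketched in a few lines of Python. The IDF values below are invented for illustration (jiebaR ships a real IDF table learned from a large corpus), and real extractors also filter stopwords such as 的 and 了.

```python
from collections import Counter

# Invented IDF values -- for illustration only.
IDF = {"长江大桥": 11.2, "江州": 8.7, "通车": 8.5, "仪式": 7.1,
       "参加": 4.3, "的": 0.1, "了": 0.1}
DEFAULT_IDF = 6.0  # fallback weight for words missing from the table

def keywords(words, topn=2):
    """Score each segmented word by tf * idf and return the topn best."""
    tf = Counter(words)
    score = {w: c * IDF.get(w, DEFAULT_IDF) for w, c in tf.items()}
    return sorted(score, key=score.get, reverse=True)[:topn]

print(keywords(["江州", "市长", "江大桥", "参加", "了",
                "长江大桥", "的", "通车", "仪式"], topn=2))
# ['长江大桥', '江州']
```

Common words like 的 get near-zero IDF and therefore never rank as keywords, which is exactly the point of the IDF weighting.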
Simhash and Hamming distance
Computes the Simhash fingerprint of a Chinese document. Simhash is a Google algorithm for near-duplicate detection that is now widely used in text processing. The Simhash engine first segments the text and extracts keywords, then computes the Simhash value and the Hamming distance.
```r
words = "hello world!"
simhasher = worker( "simhash" , topn = 2 )
simhasher <= "江州市长江大桥参加了长江大桥的通车仪式"
```
```
$simhash
[1] "12882166450308878002"

$keyword
   22.3853    8.69667
"长江大桥"     "江州"
```
Computing the Hamming distance between the two texts gives:

```
$distance
[1] "23"

$lhs
   22.3853    8.69667
"长江大桥"     "江州"

$rhs
   11.7392    11.7392
   "hello"    "world"
```
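The fingerprinting idea can be sketched in Python: each keyword's hash casts a weighted vote, +weight or -weight, at each of 64 bit positions, and the sign of each total gives one fingerprint bit; similarity is then the Hamming distance between fingerprints. This is an illustration of the algorithm only; CppJieba's hash function, and therefore its fingerprints, differ (MD5 is used here for convenience).

```python
import hashlib

def simhash64(weighted_words):
    """64-bit simhash over (word, weight) pairs."""
    votes = [0.0] * 64
    for word, weight in weighted_words:
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) & ((1 << 64) - 1)
        for i in range(64):
            votes[i] += weight if (h >> i) & 1 else -weight
    # Positive vote total -> bit 1, otherwise bit 0.
    return sum(1 << i for i in range(64) if votes[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

fp_zh = simhash64([("长江大桥", 22.3853), ("江州", 8.69667)])
fp_en = simhash64([("hello", 11.7392), ("world", 11.7392)])
print(hamming(fp_zh, fp_zh))   # 0: identical texts have distance 0
print(hamming(fp_zh, fp_en))   # large for unrelated texts
```

Because small weight changes rarely flip a bit's vote total across zero, near-duplicate documents end up with fingerprints only a few bits apart, which is why a Hamming-distance threshold works for deduplication.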
Planned support

- Parallel segmentation on Windows, Linux, and Mac.
- Simple statistical natural-language analysis functions.
Project home: http://www.open-open.com/lib/view/home/1415086153728