Brief introduction
jiebaR is the R version of the "Jieba" ("stuttering") Chinese word segmenter. It supports four segmentation modes: the maximum probability method (Maximum Probability), the hidden Markov model (Hidden Markov Model), the index model (QuerySegment), and the mixed model (MixSegment). It also provides POS tagging, keyword extraction, and Simhash text similarity comparison. The project is built with Rcpp and CppJieba.
Characteristics
Supports Windows and Linux (Mac is untested).
Uses Rcpp modules to load several segmenters at the same time, each with its own segmentation mode and dictionary.
Supports several segmentation modes, Chinese name recognition, keyword extraction, POS tagging, and Simhash text similarity comparison.
Supports loading a custom user dictionary and setting word frequencies and part-of-speech tags.
Supports both Simplified and Traditional Chinese segmentation.
Supports automatic detection of the input encoding.
Faster than the original "Jieba" Chinese segmenter, and 5 to 20 times faster than other R word segmentation packages.
Easy installation, no complex setup required.
Can be called from other languages, for example through rpy2 or jvmr.
Released under the MIT license.
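As a rough illustration of the non-segmentation features listed above, the sketch below builds one worker per task. It assumes the `tagging()`, `keywords()`, `simhash()`, and `distance()` functions and the `topn` argument as exported by recent jiebaR releases; the sample sentences are made up for this example, so check the calls against the installed version.

```r
library(jiebaR)

# Hypothetical sample sentences, chosen only for illustration.
text_a <- "我昨天参加了同学的婚礼"
text_b <- "我昨天参加了朋友的婚礼"

# POS tagging: a worker of type "tag" labels each word with its part of speech.
tagger <- worker(type = "tag")
tags <- tagging(text_a, tagger)
print(tags)

# Keyword extraction: topn sets how many keywords are returned.
key_worker <- worker(type = "keywords", topn = 3)
keys <- keywords(text_a, key_worker)
print(keys)

# Simhash: fingerprint one text, then compare two texts.
sim_worker <- worker(type = "simhash", topn = 3)
print(simhash(text_a, sim_worker))
print(distance(text_a, text_b, sim_worker))
```

Each worker loads its own dictionaries, so it is cheaper to create a worker once and reuse it across calls.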
Installation
The package is not yet released on CRAN, but it can be installed from GitHub.
* Note: the installation environment here is Ubuntu.
    install.packages("devtools")
    library(devtools)
    install_github("qinwf/jiebaR")
    library(jiebaR)
Use
jiebaR provides four segmentation modes. Initialize a segmenter with worker(), then segment text with segment().
    library(jiebaR)

    # Accept the default parameters and set up the segmenter
    mixseg = worker()

    # Equivalent to:
    #   worker(type = "mix", dict = "inst/dict/jieba.dict.utf8",
    #          hmm  = "inst/dict/hmm_model.utf8",   # HMM model data
    #          user = "inst/dict/user.dict.utf8")   # user-defined dictionary

    # The strings below are English renderings of the Chinese input and output.
    mixseg <= "Guangdong province Shenzhen Unicom"   # <= is the segmentation operator
    # Equivalent to segment("Guangdong province Shenzhen Unicom", mixseg)

    # Segmentation results
    # [1] "Guangdong province" "Shenzhen" "Unicom"

    mixseg <= "You know I don't know."
    # [1] "You" "Know" "I" "No" "Know"

    mixseg <= "I attended a classmate's wedding yesterday."
    # [1] "I" "Yesterday" "Participation" "a" "Classmate" "Wedding"

Hehe: the segmentation results are quite good.
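The four segmentation modes can be compared side by side with a minimal sketch like the one below. It assumes that the `type` argument of `worker()` accepts "mp", "hmm", "query", and "mix" for the four modes named in the introduction; the sample sentence is a made-up illustration, and the output is left unprinted here because it depends on the bundled dictionaries.

```r
library(jiebaR)

# Hypothetical sample sentence; any UTF-8 Chinese text works here.
text <- "我昨天参加了同学的婚礼"

# One worker per segmentation mode; all other arguments keep their defaults.
mp_seg    <- worker(type = "mp")     # maximum probability
hmm_seg   <- worker(type = "hmm")    # hidden Markov model
query_seg <- worker(type = "query")  # index (query) model
mix_seg   <- worker(type = "mix")    # mixed model, the default

# segment() returns a character vector of words for each worker.
results <- list(
  mp    = segment(text, mp_seg),
  hmm   = segment(text, hmm_seg),
  query = segment(text, query_seg),
  mix   = segment(text, mix_seg)
)
print(results)
```

Keeping the workers in separate variables mirrors the feature described earlier: several segmenters can be loaded at once, each with its own mode and dictionary.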