jiebaR: the R version of the "Jieba" ("stuttering") Chinese word segmenter

Last Update:2014-12-22 Source: Internet

Author: User

jiebaR is the R version of the "Jieba" ("stuttering") Chinese word segmenter. It supports four segmentation modes: the maximum probability method (MPSegment), the hidden Markov model (HMMSegment), the index model (QuerySegment) and the mixed model (MixSegment). It also provides POS tagging, keyword extraction and Simhash text similarity comparison. The project is built with Rcpp and CppJieba.

Features

Supports Windows and Linux (Mac not tested).
Multiple segmentation engines can be loaded at the same time through Rcpp Modules, each with its own segmentation mode and dictionary.
Supports several segmentation modes, Chinese name recognition, keyword extraction, POS tagging and Simhash text similarity comparison.
Supports loading a custom user dictionary and setting word frequency and part of speech.
Detects the input encoding automatically.
Simple to install; no complex setup is required.
Can be called from other languages through tools such as rpy2 and jvmr.
Released under the MIT license.

Installation

The package has not yet been released on CRAN, but it can be installed from GitHub. Windows users either need Rtools for this or can install a binary package instead. To install from GitHub:

    library(devtools)
    install_github("qinwf/jiebaR")

Usage examples

Word segmentation

jiebaR provides four segmentation engines. Initialize an engine with worker(), then segment text with segment() or with the <= operator.

    library(jiebaR)

    ## Build a segmentation engine with the default parameters
    mixseg = worker()

    ## Equivalent to:
    ## worker(type = "mix", dict = "inst/dict/jieba.dict.utf8",
    ##        hmm  = "inst/dict/hmm_model.utf8",   ### HMM model data
    ##        user = "inst/dict/user.dict.utf8")   ### user-defined dictionary

    mixseg <= "江州市长江大桥参加了长江大桥的通车仪式"   ### <= is the segmentation operator

    ## Equivalent to: segment("江州市长江大桥参加了长江大桥的通车仪式", mixseg)

    [1] "江州"     "市长"     "江大桥"   "参加"     "了"       "长江大桥"
    [7] "的"       "通车"     "仪式"

Files can be segmented as well:

    ### The encoding of the input file is detected automatically;
    ### by default the output file is written to the same directory.
    mixseg <= "./temp.dat"

    ## Equivalent to: segment("./temp.dat", mixseg)

When loading an engine you can specify custom dictionary paths, and several different engines can be started at the same time:

The maximum probability method (MPSegment) builds a directed acyclic graph of candidate words from a trie of the dictionary and finds the most probable path through it with dynamic programming. It is the core of the segmentation algorithm.
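As a rough illustration of the maximum probability idea, the base R sketch below scores every way of cutting a sentence into dictionary words and keeps the most probable one using dynamic programming. It is not jiebaR's or CppJieba's implementation: the helper mp_segment(), the toy dictionary and its frequencies are all invented for this example, and a plain named vector stands in for the trie.

```r
# Toy dictionary with made-up frequencies (NOT taken from jieba.dict.utf8).
freq <- c("长江" = 40, "大桥" = 30, "长江大桥" = 20,
          "江" = 4, "大" = 3, "桥" = 2, "长" = 1)
logp <- log(freq / sum(freq))   # log-probability of each dictionary word

# mp_segment() is a hypothetical helper, not a jiebaR function.
mp_segment <- function(sentence, logp, unknown = log(1e-8)) {
  chars <- strsplit(sentence, "")[[1]]
  n <- length(chars)
  if (n == 0) return(character(0))
  best <- c(rep(-Inf, n), 0)   # best[i]: best score for chars[i..n]; best[n + 1] = 0
  cut <- integer(n)            # cut[i]: last character of the word that starts at i
  for (i in n:1) {             # fill the table from the end of the sentence
    for (j in i:n) {
      w <- paste(chars[i:j], collapse = "")
      if (w %in% names(logp)) {
        score <- logp[[w]]     # edge i -> j exists: w is a dictionary word
      } else if (j == i) {
        score <- unknown       # unknown single character keeps the graph connected
      } else {
        next                   # not a word: no edge
      }
      if (score + best[j + 1] > best[i]) {
        best[i] <- score + best[j + 1]
        cut[i] <- j
      }
    }
  }
  # Follow the recorded cuts from the start of the sentence.
  words <- character(0)
  i <- 1
  while (i <= n) {
    words <- c(words, paste(chars[i:cut[i]], collapse = ""))
    i <- cut[i] + 1
  }
  words
}

seg1 <- mp_segment("长江大桥", logp)
seg2 <- mp_segment("大桥长", logp)
print(seg1)
print(seg2)
```

With these toy frequencies the single word "长江大桥" scores higher than the split "长江" + "大桥", because one word of probability 0.2 beats the product 0.4 × 0.3.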

The hidden Markov model (HMMSegment) segments text with an HMM trained on the People's Daily and other corpora. Each character is assigned one of four hidden states, (B, E, M, S): the beginning, end or middle of a word, or a single-character word. The model data is provided by dict/hmm_model.utf8, and the segmentation is decoded with the Viterbi algorithm.
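The following base R sketch shows how Viterbi decoding over the four (B, E, M, S) states turns a character sequence into words. All probabilities are invented for the example and are not taken from hmm_model.utf8; viterbi_bems() and tags_to_words() are hypothetical helper names, not part of jiebaR.

```r
states <- c("B", "E", "M", "S")
chars <- strsplit("我爱北京", "")[[1]]

# Toy parameters (assumptions). 1e-10 stands in for "impossible".
start <- log(c(B = 0.6, E = 1e-10, M = 1e-10, S = 0.4))
trans <- log(matrix(c(
  1e-10, 0.7,   0.3,   1e-10,   # from B to B, E, M, S
  0.5,   1e-10, 1e-10, 0.5,     # from E
  1e-10, 0.6,   0.4,   1e-10,   # from M
  0.5,   1e-10, 1e-10, 0.5      # from S
), nrow = 4, byrow = TRUE, dimnames = list(states, states)))
emis <- log(matrix(c(
  0.10, 0.10, 0.70, 0.10,       # P(char | B) for each character of the sentence
  0.10, 0.10, 0.10, 0.70,       # P(char | E)
  0.25, 0.25, 0.25, 0.25,       # P(char | M)
  0.45, 0.45, 0.05, 0.05        # P(char | S)
), nrow = 4, byrow = TRUE, dimnames = list(states, chars)))

viterbi_bems <- function(chars, start, trans, emis) {
  st <- rownames(trans)
  n <- length(chars)
  score <- matrix(-Inf, nrow = length(st), ncol = n, dimnames = list(st, NULL))
  back <- matrix(NA_integer_, nrow = length(st), ncol = n)
  score[, 1] <- start[st] + emis[st, chars[1]]
  if (n > 1) for (k in 2:n) {
    for (s in seq_along(st)) {
      cand <- score[, k - 1] + trans[, s]   # best way to arrive in state s
      back[s, k] <- which.max(cand)
      score[s, k] <- cand[[back[s, k]]] + emis[s, chars[k]]
    }
  }
  last <- score[, n]
  last[c("B", "M")] <- -Inf                 # a sentence must end on E or S
  path <- integer(n)
  path[n] <- which.max(last)
  if (n > 1) for (k in n:2) path[k - 1] <- back[path[k], k]
  st[path]
}

tags_to_words <- function(chars, tags) {
  ends <- tags %in% c("E", "S")             # a word ends at E or S
  group <- cumsum(c(TRUE, head(ends, -1)))  # a new word starts after each end
  unname(vapply(split(chars, group), paste, character(1), collapse = ""))
}

tags <- viterbi_bems(chars, start, trans, emis)
hmm_words <- tags_to_words(chars, tags)
print(tags)
print(hmm_words)
```

Working in log space avoids numeric underflow on long sentences, which is the usual reason for summing log-probabilities instead of multiplying probabilities.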

The mixed model (MixSegment) is the one of the four engines that combines the maximum probability method with the hidden Markov model. It is the default engine type (type = "mix").

The index model (QuerySegment) first segments the text with the mixed model. Then, for each longer word in the result, it enumerates all possible sub-words and keeps those that exist in the dictionary.
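The second step of the index model can be sketched in base R as follows. This is only an illustration under stated assumptions: query_expand() is a hypothetical helper, the dictionary is a toy character vector, and the length threshold min_len is an invented parameter rather than a jiebaR setting. The input is assumed to be the output of a first segmentation pass.

```r
# Toy dictionary (assumption), standing in for the real dictionary file.
dict <- c("长江", "大桥", "长江大桥", "通车")

# query_expand() is a hypothetical helper, not a jiebaR function.
query_expand <- function(words, dict, min_len = 3) {
  out <- character(0)
  for (w in words) {
    chars <- strsplit(w, "")[[1]]
    n <- length(chars)
    if (n >= min_len) {
      # Enumerate every proper substring of at least two characters
      # and keep the ones that are dictionary words.
      for (i in 1:(n - 1)) {
        for (j in (i + 1):n) {
          if (j - i + 1 == n) next          # skip the word itself
          sub <- paste(chars[i:j], collapse = "")
          if (sub %in% dict) out <- c(out, sub)
        }
      }
    }
    out <- c(out, w)                        # the original word is always kept
  }
  out
}

expanded <- query_expand(c("长江大桥", "的", "通车"), dict)
print(expanded)
```

Emitting the shorter dictionary words alongside the long one is what makes this mode useful for building search indexes: a query for "大桥" can then match text that was segmented as "长江大桥".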

    mixseg2 = worker(type = "mix", dict = "dict/jieba.dict.utf8",
                     hmm  = "dict/hmm_model.utf8",
                     user = "dict/test.dict.utf8",
                     detect = T, symbol = F,
                     lines = 1e+05, output = NULL)
    mixseg2   ### print the worker's settings

    Worker Type:  Mix Segment

    Detect Encoding :  TRUE
    Default Encoding:  UTF-8
    Keep Symbols    :  FALSE
    Output Path     :
    Write File      :  TRUE
    Max Read Lines  :  1e+05

    Fixed Model Components:

    $dict
    [1] "dict/jieba.dict.utf8"

    $hmm
    [1] "dict/hmm_model.utf8"

    $user
    [1] "dict/test.dict.utf8"

    $detect $encoding $symbol $output $write $lines can be reset.

Some of a worker's settings can be changed with R's usual $ operator. For example, WorkerName$symbol = T keeps punctuation in the output. Other parameters are fixed at initialization and cannot be modified; their values can be read with WorkerName$PrivateVariable.

    mixseg$encoding
    mixseg$detect = F

You can supply a custom user dictionary. We recommend building it with the Shenlan dictionary converter (深蓝词库转换), which quickly converts Sogou cell dictionaries and other input-method dictionaries into the jiebaR dictionary format.

    ShowDictPath()   ### show the dictionary path
    EditDict()       ### edit the user dictionary
    ?EditDict()      ### open the help system

POS tagging

Use <=.tagger or tag to segment text and tag parts of speech. Tagging uses the mixed-model segmenter, and the tag set is compatible with ICTCLAS.

    words = "我爱北京天安门"
    tagger = worker("tag")
    tagger <= words

         r        v       ns       ns
      "我"     "爱"   "北京" "天安门"

Keyword extraction

Keyword extraction uses an inverse document frequency (IDF) corpus, which can be replaced by the path to a custom corpus. Usage is the same as for a segmentation engine. The topn parameter sets the number of keywords returned.

    keys = worker("keywords", topn = 1)
    keys <= "我爱北京天安门"
    keys <= "一个文件路径.txt"   ### a file path also works

      8.9954
    "天安门"
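The way IDF weights turn a segmented sentence into ranked keywords can be sketched in base R with a simple TF-IDF score. This is an illustration only, not jiebaR's implementation: top_keywords() is a hypothetical helper, the IDF values are invented and do not come from jiebaR's IDF file, and the fallback weight for unknown words (the median IDF) is an assumption.

```r
# Hypothetical IDF values (NOT from jiebaR's IDF file).
idf <- c("天安门" = 9.0, "北京" = 4.5, "爱" = 3.0, "我" = 1.5)

# top_keywords() is a hypothetical helper, not a jiebaR function.
top_keywords <- function(words, idf, topn = 1, default_idf = median(idf)) {
  tf <- table(words)                       # term frequency of each distinct word
  w <- names(tf)
  # Weight = term frequency * IDF; words without an IDF entry get the fallback.
  weight <- as.numeric(tf) * ifelse(w %in% names(idf), idf[w], default_idf)
  names(weight) <- w
  head(sort(weight, decreasing = TRUE), topn)
}

kw <- top_keywords(c("我", "爱", "北京", "天安门"), idf, topn = 1)
print(kw)
```

Rare words receive a high IDF, so they outrank common words such as "我" even when each occurs only once in the text.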

Simhash and Hamming distance

Computes the Simhash value of a Chinese document. Simhash is an algorithm used by Google for text deduplication and is now widely used in text processing. The Simhash engine first segments the text and extracts keywords, then computes the Simhash value and the Hamming distance.

    words = "hello world!"
    simhasher = worker("simhash", topn = 2)
    simhasher <= "江州市长江大桥参加了长江大桥的通车仪式"

    $simhash
    [1] "12882166450308878002"

    $keyword
       22.3853    8.69667
    "长江大桥"     "江州"

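The Hamming distance between two texts is computed with distance():

    distance("江州市长江大桥参加了长江大桥的通车仪式", "hello world!", simhasher)

    $distance
    [1] "23"

    $lhs
       22.3853    8.69667
    "长江大桥"     "江州"

    $rhs
    11.7392 11.7392
    "hello" "world"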
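The two steps described in this section, building a fingerprint from weighted keywords and comparing fingerprints by Hamming distance, can be sketched in base R as below. This is a simplified illustration, not jiebaR's implementation: toy_hash_bits(), simhash_bits() and hamming() are hypothetical helpers, and a toy 32-bit polynomial hash replaces the 64-bit hash a real Simhash would use, so the distances it prints are not expected to match the output above. The keyword weights are the ones shown in that output.

```r
# Toy 32-bit hash of a word (assumption; real Simhash uses a 64-bit hash).
toy_hash_bits <- function(word, nbits = 32) {
  h <- 0
  for (code in utf8ToInt(word)) h <- (h * 31 + code) %% 2^nbits
  (h %/% 2^(0:(nbits - 1))) %% 2            # vector of 0/1 bits, lowest bit first
}

# Fingerprint: each keyword votes on every bit with its weight.
simhash_bits <- function(weights, nbits = 32) {
  acc <- numeric(nbits)
  for (w in names(weights)) {
    bits <- toy_hash_bits(w, nbits)
    acc <- acc + weights[[w]] * (2 * bits - 1)   # +weight for 1 bits, -weight for 0 bits
  }
  as.integer(acc > 0)                       # keep only the sign of each vote total
}

# Hamming distance: number of bit positions in which two fingerprints differ.
hamming <- function(a, b) sum(a != b)

lhs <- c("长江大桥" = 22.3853, "江州" = 8.69667)
rhs <- c("hello" = 11.7392, "world" = 11.7392)

fp_lhs <- simhash_bits(lhs)
fp_rhs <- simhash_bits(rhs)
print(hamming(fp_lhs, fp_rhs))
print(hamming(fp_lhs, fp_lhs))              # identical texts have distance 0
```

Because each bit is decided by a weighted vote, texts that share their heaviest keywords tend to agree on most bits, which is why a small Hamming distance is read as high similarity.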

Planned support

Parallel word segmentation on Windows, Linux and Mac.
Simple statistical analysis functions for natural language.

Project home: http://www.open-open.com/lib/view/home/1415086153728

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
