Chinese word segmentation in R with jiebaR

Source: Internet
Author: User

Brief introduction

jiebaR is the R version of the jieba ("stutter") Chinese word segmenter. It supports four segmentation modes: the maximum probability method (Maximum Probability), the hidden Markov model (Hidden Markov Model), the index model (QuerySegment), and the mixed model (MixSegment). It also provides part-of-speech (POS) tagging, keyword extraction, and simhash-based text similarity comparison. The project is built with Rcpp and CppJieba.
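To make the maximum probability method concrete, here is a minimal base-R sketch: dynamic programming picks the split whose words have the highest product of unigram probabilities. This is not the CppJieba implementation. The function name mp_segment, the toy dictionary, its frequencies, and the romanized input (a stand-in for Chinese text) are all assumptions made for illustration.

```r
# Toy unigram dictionary with hypothetical frequencies (not jiebaR's data).
freq <- c(guangdong = 50, guangdongsheng = 30, sheng = 20,
          shenzhen = 40, shen = 5, zhen = 5,
          liantong = 25, lian = 5, tong = 5)
logp <- log(freq / sum(freq))

# best[i + 1] holds the highest log-probability of any segmentation of the
# first i characters; back[i] holds the start of the last word ending at i.
mp_segment <- function(text, logp, unknown = log(1e-8)) {
  n <- nchar(text)
  if (n == 0) return(character(0))
  best <- c(0, rep(-Inf, n))
  back <- integer(n)
  max_len <- max(nchar(names(logp)))
  for (i in seq_len(n)) {
    for (j in max(1, i - max_len + 1):i) {
      w <- substr(text, j, i)
      # Dictionary words get their log-probability; an unknown single
      # character gets a small fallback score; anything else is ruled out.
      score <- if (w %in% names(logp)) logp[[w]] else if (j == i) unknown else -Inf
      cand <- best[j] + score
      if (cand > best[i + 1]) {
        best[i + 1] <- cand
        back[i] <- j
      }
    }
  }
  # Follow the back-pointers from the end of the text to recover the words.
  words <- character(0)
  i <- n
  while (i > 0) {
    j <- back[i]
    words <- c(substr(text, j, i), words)
    i <- j - 1
  }
  words
}

print(mp_segment("guangdongshengshenzhenliantong", logp))
```

With these toy frequencies the single entry "guangdongsheng" scores higher than "guangdong" followed by "sheng", so the longer word is chosen. A real segmenter works on a full frequency dictionary and typically hands unknown spans to a model such as the HMM.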

Features

Supports Windows and Linux (macOS is untested).
Uses Rcpp modules so that several segmentation engines can be loaded at the same time, each with its own segmentation mode and dictionary.
Supports multiple segmentation modes, Chinese name recognition, keyword extraction, POS tagging, and simhash-based text similarity comparison.
Supports loading a custom user dictionary and setting word frequencies and parts of speech.
Supports both Simplified and Traditional Chinese.
Detects the input encoding automatically.
Runs faster than the original jieba segmenter and is stated to be 5 to 20 times faster than other R word segmentation packages.
Easy to install, with no complex setup required.
Can be called from other languages, for example through rpy2 or jvmr.
Released under the MIT license.
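The tagging, keyword extraction, and simhash features listed above might be used as in the sketch below. It assumes jiebaR is installed with its bundled dictionaries; the sample sentences and the topn values are my own choices, not taken from the original post.

```r
library(jiebaR)

text <- "我爱北京天安门"   # sample sentence: "I love Beijing Tiananmen"

# POS tagging: a worker of type "tag" segments the text and labels each word.
tagger <- worker("tag")
tags <- tagging(text, tagger)
print(tags)

# Keyword extraction: topn limits how many keywords are returned.
keyworder <- worker("keywords", topn = 2)
keys <- keywords(text, keyworder)
print(keys)

# Simhash: fingerprint one text, then compare two texts by the distance
# between their fingerprints.
hasher <- worker("simhash", topn = 2)
print(simhash(text, hasher))
print(distance(text, "我爱北京", hasher))
```

Each feature gets its own worker, which matches the point above about loading several engines side by side.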

Installation

The package has not yet been released to CRAN; it can be installed from GitHub.
* Note: the commands below were run on Ubuntu.

    install.packages("devtools")
    library(devtools)
    install_github("qinwf/jiebaR")
    library(jiebaR)

Usage

jiebaR provides four segmentation engines. Initialize an engine with worker(), then call segment() to segment text.

    library(jiebaR)

    # Accept the default parameters to set up the segmentation engine
    mixseg = worker()

    # Equivalent to:
    # worker(type = "mix", dict = "inst/dict/jieba.dict.utf8",
    #        hmm  = "inst/dict/hmm_model.utf8",   # HMM model data
    #        user = "inst/dict/user.dict.utf8")   # user-defined dictionary

    # The inputs are Chinese sentences; inputs and results are shown here
    # in English translation.
    mixseg <= "Guangdong province Shenzhen Unicom"   # <= is the segmentation operator
    # Equivalent to segment("Guangdong province Shenzhen Unicom", mixseg)
    # Segmentation result:
    # [1] "Guangdong province" "Shenzhen" "Unicom"

    mixseg <= "You know I don't know."
    # [1] "You" "Know" "I" "No" "Know"

    mixseg <= "I attended a classmate's wedding yesterday."
    # [1] "I" "Yesterday" "Participation" "a" "Classmate" "Wedding"

The segmentation results are quite good.
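To compare the four engines on the same input, a minimal sketch could build one worker per type and collect the results. It assumes jiebaR is installed and that the type strings "mix", "mp", "hmm", and "query" select the mixed, maximum probability, hidden Markov, and index models; the sample sentence is my own.

```r
library(jiebaR)

text <- "我爱北京天安门"   # sample sentence: "I love Beijing Tiananmen"

# One engine per segmentation mode, each run on the same text.
engines <- c("mix", "mp", "hmm", "query")
results <- lapply(engines, function(type) segment(text, worker(type)))
names(results) <- engines

print(results)
```

Printing the list side by side makes it easy to see where the modes disagree on a given sentence.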
