R Language--jiebaR Basics

Source: Internet
Author: User
Tags: idf

I. Functions in jiebaR (largely based on the official jiebaR documentation: qinwenfeng.com/jiebar/)
**no.1**
worker(type = "mix", dict = DICTPATH, hmm = HMMPATH, user = USERPATH, idf = IDFPATH,
       stop_word = STOPPATH, write = T, qmax = 20, topn = 5, encoding = "UTF-8", detect = T,
       symbol = F, lines = 1e+05, output = NULL, bylines = F, user_weight = "max")
The worker() function builds a segmenter; when parsing text, you usually need to build a segmenter first.

# Build a segmenter as follows; with no arguments, the default parameters are used
> wk = worker()

The parameters of worker() are described below:
(1) type ("mix"): the segmentation model; several options are available:
[1] mp: maximum-probability model based on the dictionary
[2] hmm: Hidden Markov Model; can discover words that are not in the dictionary
[3] mix: mixed model; segments with mp first, then calls hmm on the remaining characters to recover out-of-dictionary words
[4] query: index model; further splits words longer than a certain length
[5] tag: tagging model; POS tagging based on the user dictionary
[6] keywords: keyword model; extracts keywords by tf-idf
[7] simhash: simhash model; computes a simhash on top of the extracted keywords
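As a quick sketch of how these types are selected (assuming the jiebaR package is installed; output omitted), each model is chosen by passing its name as the type argument:

```r
library(jiebaR)  # CRAN package providing worker(), segment(), etc.

mix_seg <- worker(type = "mix")                 # default mixed-model segmenter
qry_seg <- worker(type = "query", qmax = 4)     # index model, re-splitting long words
tag_seg <- worker(type = "tag")                 # POS-tagging segmenter
kw_seg  <- worker(type = "keywords", topn = 5)  # tf-idf keyword extractor
```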

(2) dict (DICTPATH): the system dictionary; the default path is jiebaR::DICTPATH, file name jieba.dict.utf8.
The system dictionary has three columns: word, word frequency, and part of speech.
> readLines(jiebaR::DICTPATH, 5, encoding = "UTF-8")
[1] "1号店 3 n" "1號店 3 n" "4S店 3 n" "4s店 3 n" "AA制 3 n"

(3) hmm (HMMPATH): the HMM dictionary; default jiebaR::HMMPATH

(4) user (USERPATH): the user dictionary; default jiebaR::USERPATH

(5) idf (IDFPATH): the IDF dictionary; default jiebaR::IDFPATH

(6) stop_word (STOPPATH): the stop-word dictionary; default jiebaR::STOPPATH

(7) write (T): whether to write the result to a file; default T.
This parameter is only used when the input is a file path, and is valid only for segmentation and POS tagging.

(8) qmax (20): the maximum length, in characters, of candidate words in the query (index) model; default 20

(9) topn (5): the number of keywords to extract

(10) encoding ("UTF-8"): the encoding of the input files; default UTF-8

(11) detect (T): whether to detect the encoding of input files; default T

(12) symbol (F): whether to keep symbols (punctuation); default F

(13) lines (1e+05): the maximum number of lines read from a file at a time, to limit how much is read at once; large files are thus read in chunks

(14) output (NULL): the output path, a string. This parameter is only used when the input is a file path

(15) bylines (F): whether results are output line by line; if TRUE, the file or character vector read in is segmented line by line

(16) user_weight ("max"): the word frequency assigned to words in the user dictionary; default "max", the maximum frequency in the system dictionary.

"min" (the minimum) or "median" (the median) may also be chosen.
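Combining several of these parameters, a customized segmenter might look like the sketch below (the user-dictionary file name my_words.utf8 is a made-up example):

```r
library(jiebaR)

# A segmenter that loads a custom user dictionary (hypothetical file),
# keeps punctuation, and returns one result per input line
wk_custom <- worker(type = "mix",
                    user = "my_words.utf8",  # hypothetical user dictionary
                    symbol = TRUE,           # keep punctuation symbols
                    bylines = TRUE,          # segment line by line
                    user_weight = "median")  # user words get the median system frequency
```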

**no.2**
segment(code, jiebar, mod = NULL)
segment() is the main segmentation function.
code: a Chinese string or a file path
jiebar: a jiebaR worker, i.e., the segmenter built above
mod: overrides the default segmentation type; any one of "mix", "hmm", "query", "full", "level", or "mp"

> text = "Today Changchun held a marathon, but I did not go, what a pity"
> segment(text, wk)
[1] "Today"     "Changchun" "held"      "Marathon"  "competition" "but"  "I"
[8] "did not"   "go"        "what a"    "pity"

**no.3**
new_user_word(worker, words, tags = rep("n", length(words))): adds words to the user dictionary
worker: a segmenter
words: the new words to add
tags: the POS tags to attach; default "n", i.e., noun
> segment("I like to listen to A Person's Loneliness, Two People's Fault", wk)
 [1] "I"      "like"   "to"     "listen" "a"      "person" "loneliness" "two" "people"
[10] "fault"
> new_user_word(wk, "A Person's Loneliness, Two People's Fault")
[1] TRUE
> segment("I like to listen to A Person's Loneliness, Two People's Fault", wk)
[1] "I" "like" "to" "listen"
[5] "A Person's Loneliness, Two People's Fault"

**no.4**
Use readLines() and writeLines() to segment a file
Example: my working directory contains comments on a trending news story, stored line by line in the file girl_comm.txt
> girl_comm <- readLines('girl_comm.txt', encoding = 'UTF-8')
> head(girl_comm)
[1] ": Thank you, this review is very fair //"
[2] ": Isn't it rather misleading that you only quote part of the netizens' replies? Most netizens are saying that the truth is not clear yet and one should not comment one-sidedly, and the children's upbringing is not as bad as you say ~"
[3] ": Did you see anyone kick? No video, no truth"
[4] ": Kicking people goes too far, but the unruly children were also unbearable."
[5] ": If just making some noise makes a child unruly, then everyone in the world but the mute was an unruly child, you included."
[6] ": It is a child's nature to be lively; if you call it unruly when the parents do not teach them, you will understand once you have a child."

# Punctuation need not be considered during segmentation, so the leading colons are not handled here
> result <- segment(girl_comm, wk)
> head(result)
[[1]]
[1] "Thank you" "this" "review" "very" "fair"

[[2]]
 [1] "you"       "like this" "only"      "quote"     "part"      "netizens"
 [7] "replies"   "to"        "mislead"   "public"    "practice"  "also"
[13] "not necessarily" "very" "good"     "most"      "netizens"  "all"       "are"
[20] "not clear" "things"    "truth"     "not"       "should"    "one-sided" "comment"
[27] "and"       "not"       "you"       "say"       "education" "unruly"
[33] "children"  "voice"     "very high"
> result_merge <- sapply(result, function(x) {paste(x, collapse = " ")})
> writeLines(result_merge, "./some.txt")  # creates a file named some.txt in the working directory
> file.remove("./some.txt")  # remove the file from the working directory
> head(result_merge)
[1] "Thank you this review is very fair"
[2] "You only quote part of the netizens' replies to mislead the public which is not necessarily good most netizens are saying the truth is not clear yet and one should not comment one-sidedly and the children's upbringing is not as bad as you say"
[3] "Did you see anyone kick no video no truth"
[4] "Kicking people goes too far but the unruly children were also unbearable"
[5] "If just making some noise makes a child unruly then everyone but the mute was an unruly child you included"
[6] "It is a child's nature to be lively if you call it unruly when the parents do not teach them you will understand once you have a child"
**no.5**
Segment a file directly:
> output_file <- segment("girl_comm.txt", wk)
> readLines(output_file, 4, encoding = "UTF-8")
[1] "Thank you this review is very fair"
[2] "You only quote part of the netizens' replies to mislead the public which is not necessarily good most netizens are saying the truth is not clear yet and one should not comment one-sidedly and the children's upbringing is not as bad as you say"
[3] "Did you see anyone kick no video no truth"
[4] "Kicking people goes too far but the unruly children were also unbearable"

Specify the output path:
> wk$output = "some"
> segment("girl_comm.txt", wk)
[1] "some"
> readLines("some", 4, encoding = "UTF-8")
[1] "Thank you this review is very fair"
[2] "You only quote part of the netizens' replies to mislead the public which is not necessarily good most netizens are saying the truth is not clear yet and one should not comment one-sidedly and the children's upbringing is not as bad as you say"
[3] "Did you see anyone kick no video no truth"
[4] "Kicking people goes too far but the unruly children were also unbearable"

To turn off the automatic detection of input file paths, use the following statement:
wk$write = "NOFILE"

**no.6**
Use the following two settings to output results line by line and to keep punctuation:
(1) wk$bylines = TRUE
> wk$bylines = TRUE
> segment(c("This is the first sentence", "This is the second sentence"), wk)
[[1]]
[1] "This is" "the first" "sentence"

[[2]]
[1] "This is" "the second" "sentence"

(2) wk$symbol = TRUE
> wk$symbol = TRUE
> segment("Like you, no reason, just this wayward", wk)
[[1]]
[1] "Like" "you" "," "no reason" "," "just" "this" "wayward"

II. Tagging and keywords
(1) Tagging: attach part-of-speech tags to the segmented words
> text = "Changchun's winter lasts six months; the other half of the year is spring, summer, and fall"
> tagger = worker("tag")
> segment(text, tagger)
         ns         uj          t         ns          m          r          m          v         tg          t
"Changchun"       "'s"   "winter"     "long" "six months"  "other" "half year"      "is"  "spring" "summer fall"

Already-segmented words can also be tagged:
> wk = worker()
> result <- segment(text, wk)
> vector_tag(result, tagger)
         ns         uj          t         ns          m          r          m          v         tg          t
"Changchun"       "'s"   "winter"     "long" "six months"  "other" "half year"      "is"  "spring" "summer fall"

(2) Keyword extraction: extract representative words from the text according to certain rules
> extract_kw = worker("keywords", topn = 2)
> text = "The power of love is strong; love is indestructible; love is great"
> keywords(text, extract_kw)
         20.5491          10.9562
          "love" "indestructible"

Text that has already been segmented can also be used for keyword extraction, via vector_keywords(segmented vector, keyword worker):
> result = segment(text, wk)
> vector_keywords(result, extract_kw)
         20.5491          10.9562
          "love" "indestructible"


III. Computing Hamming distance
distance(codel, coder, jiebar)
vector_distance(codel, coder, jiebar)

codel: for distance(), a Chinese string or the path to a text file; for vector_distance(), a segmented character vector
coder: for distance(), a Chinese string or the path to a text file; for vector_distance(), a segmented character vector
jiebar: a jiebaR worker (simhash type)

> summa <- worker('simhash', topn = 2)
> simhash("Jiangzhou mayor River Bridge attended the opening ceremony of the Yangtze River Bridge", summa)
$simhash
[1] "12882166450308878002"

$keyword
               22.3853     8.69667
"Yangtze River Bridge" "Jiangzhou"

For a vector that has already been segmented, simhash can also be computed:
> text = "Jiangzhou mayor River Bridge attended the opening ceremony of the Yangtze River Bridge"
> result <- segment(text, wk)
> vector_simhash(result, summa)  # a bit cumbersome, though
$simhash
[1] "9659751748513812269"

$keyword
11.7392 11.1926
"Yangtze River Bridge" "Bridge"
> distance("Hello world!", "Jiangzhou mayor River Bridge attended the opening ceremony of the Yangtze River Bridge", summa)  # two different texts: the distance is not 0
$distance
[1] 23

$lhs
11.7392 11.7392
"Hello" "World"

$rhs
22.3853 8.69667
"Yangtze River Bridge" "Jiangzhou"

> distance("Hello World", "Hello World", summa)  # two identical texts: the distance is 0
$distance
[1] 0

$lhs
11.7392 11.7392
"Hello" "World"

$rhs
11.7392 11.7392
"Hello" "World"

> vector_distance(c("Today", "Weather", "true", "very", "good", "", "Feeling"), c("Today", "Weather", "true", "very", "good", "", "Feeling"), summa)  # identical segmentation results: the distance is 0
$distance
[1] 0

$lhs
6.45994 6.18823
"Weather" "good"

$rhs
6.45994 6.18823
"Weather" "good"

> vector_distance(c("Today", "weather", "nice"), c("Today", "Weather", "good"), summa)
$distance
[1] 0

$lhs
9.11319 6.45994
"Very good" "Weather"

$rhs
9.11319 6.45994
"Very good" "Weather"

IV. Counting word frequency
freq(x)
x: a character vector, i.e., a segmentation result vector
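A minimal sketch of freq() on a hand-made vector (assuming the jiebaR package is installed); it returns a data frame with one row per distinct word and its count:

```r
library(jiebaR)

# Any character vector works, e.g. the output of segment()
words <- c("apple", "pear", "apple", "apple", "pear", "plum")
freq(words)  # data frame with columns char and freq: "apple" 3, "pear" 2, "plum" 1
```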

V. Generating an IDF file: get_idf()
get_idf(x, stop_word = STOPPATH, path = NULL)
x: a list (of segmented documents)
stop_word: the stop-word file
path: the output file path

temp_path = tempfile()
a_big_list = list(c("test", "bit"), c("test"))
get_idf(a_big_list, stop_word = jiebaR::STOPPATH, path = temp_path)
readLines(temp_path)
#> [1] "bit 0.693147180559945" "test 0"

VI. Packages frequently used together with jiebaR
(1) wordcloud2: for drawing word clouds
(2) cidian: (it seems I was unable to install it)
(3) ropencc: conversion between simplified and traditional Chinese characters
(4) text2vec: see https://cran.r-project.org/web/packages/text2vec/vignettes/text-vectorization.html

