Part 2: Word Processing
After installing the relevant packages in RStudio (see Part 1 for installation instructions), we can start on the word processing. Reference document: "Play Text Mining", a very detailed article on doing text mining with R that also offers downloads of related material; it is worth reading.
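If you skipped Part 1, a minimal setup sketch follows; the R-Forge repository URL and the rJava dependency are assumptions based on how Rwordseg was historically distributed, so adjust them to whatever worked for you in Part 1.

# Setup sketch (assumptions: Rwordseg is on R-Forge and needs rJava)
# install.packages("rJava")
# install.packages("Rwordseg", repos = "http://R-Forge.R-project.org")
library(Rwordseg)   # Chinese word segmentation
library(wordcloud)  # word clouds, used in section 4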
1. Rwordseg functions
The package's documentation can be downloaded at http://download.csdn.net/detail/cl1143015961/8436741; only a brief overview is given here.
Word segmentation
> segmentCN(c("If you shed tears when you miss the sun", "You will also miss the stars"))
[[1]]
[1] "If"      "you"     "because" "wrong"   "over"    "sun"     "and"
[8] "tears"
[[2]]
[1] "you"   "also"  "will"  "wrong" "over"  "stars"
You can see that the result is not very good: "miss", which should be one word, has been split into "wrong" + "over", which means the word is not in the lexicon. In such cases we need to add the missing words to the lexicon ourselves.
Adding and deleting words
> insertWords("miss")
> segmentCN(c("If you shed tears when you miss the sun", "You will also miss the stars"))
[[1]]
[1] "If"      "you"     "because" "miss"    "the sun" "and"     "tears"
[[2]]
[1] "you"  "also" "will" "miss" "stars"
Sometimes the opposite is needed: you do not want a string treated as one word. Take "miss" again: in some sentences its characters actually mean "wrong" + "past" and should not be joined into one word. In that case you can delete the word from the dictionary, add the words you actually need, and segment again; the result is much better.
> Segmentcn ("This is wrong, you can make it, but you shouldn't do it again.")
[1] "This" "Miss" "Go" "You" "Can" "Commit" "but"
[8] "Now" "then" "Commit" "no" "should" "
>deletewords ("Miss")
> Insertwords ("Past")
> Segmentcn ("This is wrong, you can make it, but you shouldn't do it again.")
[1] "This" "Wrong" "Past" "You" "Can" "Commit" "but"
[8] "Now" "then" "Commit" "no" "should" "
Installing and uninstalling dictionaries
When processing more refined and specialized articles, you may run into professional vocabulary that is not in the lexicon. In that case you need to find a relevant dictionary and install it into R. For example, in news analysis, entertainment news mentions many star and singer names, and these names will not be recognized as words during segmentation. You may then need to add a dictionary of names, which you can build yourself or find online. I recommend the Sogou input method's thesaurus at http://pinyin.sogou.com/dict/, where you can download dictionaries by category.
Here I use a dictionary of celebrity names: http://pinyin.sogou.com/dict/cate/index/429.
> Segmentcn ("2015 years of several years of play have appeared RMB figure")
[1] "2015" "" several "" open "" Year "
[6] "Play" "All" "appeared" "The" "Tang"
[11] "Yan" "" Figure "
> installDict("D:\\r\\sources\\dictionaries\\singers.scel", dictname = "names")
3732 words were loaded! ... New dictionary 'names' was installed!
> segmentCN("Several plays opening 2015 have all featured Tang Yan")
 [1] "2015"     "several"  "open"     "year"     "play"
 [6] "all"      "appeared" "the"      "Tang Yan"
[10] "figure"
> listDict()
   Name Type                                                 Des
1 names star official recommendation, thesaurus from user upload
                                                        Path
1 e:/programfiles/r/r-3.1.2/library/rwordseg/dict/names.dic
You can also remove a dictionary that you added but no longer need.
> uninstallDict()
3732 words were removed! ... The dictionary 'names' was uninstalled!
> listDict()
[1] Name Type Des  Path
<0 rows> (or 0-length row.names)
That covers the basics; Rwordseg has more features, for which please see its Chinese documentation.
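A quick way to browse what else the package exports, using only base R:

ls("package:Rwordseg")  # list exported functions, e.g. segmentCN, installDict, insertWords, ...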
2. Segmenting a brand's official Weibo
The data source is the official Weibo account of a clothing brand, covering posts from 2012 to the end of 2014. (The screenshot of the raw data is not reproduced here.) From the content you can probably guess which brand it is.
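Since the screenshot is missing, you can peek at the raw file yourself; a sketch using the same file path that appears below:

head(readLines("d:\\r\\rworkspace\\orgdata.txt", encoding = "UTF-8"), 3)  # first 3 posts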
First install clothing-related dictionaries, again downloaded from the Sogou input method's thesaurus; I used the first two dictionaries listed at http://pinyin.sogou.com/dict/cate/index/397.
> installDict("D:\\r\\sources\\dictionaries\\fushi.scel", dictname = "fushi")
> installDict("D:\\r\\sources\\dictionaries\\ali_fushi.scel", dictname = "alifushi")
> listDict()
      Name    Type
1    names    star
2    pangu    text
3    fushi apparel
4 alifushi apparel
The next step is to read the data into R; you can see there are 1640 Weibo posts in total. Pay attention to the file's encoding: readLines assumes the native encoding (GBK on a Chinese Windows system) by default, so a UTF-8 file must be read with encoding = "UTF-8" or the text will be garbled.
> hlzj <- readLines("d:\\r\\rworkspace\\orgdata.txt", encoding = "UTF-8")
> length(hlzj)
[1] 1640
Next comes the segmentation itself. First remove the digits and special symbols that may appear in the data, then segment.
> hlzjTemp <- gsub("[0-9０１２３４５６７８９<>~]", "", hlzj)  # strip ASCII and full-width digits plus a few symbols
> hlzjTemp <- segmentCN(hlzjTemp)
> hlzjTemp[1:2]
[[1]]
 [1] "new"            "recommended"    "fashion"      "camouflage" "fabric"  "design"
 [7] "for"            "simple"         "single"       "West"       "inject"  "extraordinary"
[13] "wild"           "charm"          "good"         "waterproof" "effect"  "make"
[19] "practical"      "sex"            "more"         "high"       "pole"    "with"
[25] "spring"         "suction"        "eye"          "highlights" "spring"  "new"
[31] "Hailan House"   "men"            "leisure"      "suit"       "Korean version" "camouflage"
[37] "suit"           "jacket"         "HWXAJAA"
[[2]]
 [1] "Small series"   "recommended"    "slim"         "thin"       "hooded"  "warm heart"
 [7] "design"         "wind"           "warm"         "contrast"   "line"    "design"
[13] "young"          "fashion"        "go to relatives and friends" "leisure" "travel" "of"
[19] "fashion"        "select"         "vitality"     "winter"     "warm"    "easy"
[25] "winter"         "hot"            "Hailan House" "genuine"    "men"     "warm"
[31] "hooded"         "down jacket"    "coat"         "HWRAJGA"
You can see that the Weibo content has been segmented. The process itself is very simple, but in practice you should inspect the results: some words missing from the lexicon get cut apart and need to be added, so that the segmentation is as good as possible.
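For example, if inspection showed brand or product terms being split apart, they could be added in one call; a sketch, assuming insertWords accepts a character vector as in the examples above (the terms below are hypothetical stand-ins for whatever your inspection turns up):

insertWords(c("Hailan House", "down jacket", "Korean version"))  # hypothetical additions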
3. Removing stop words
The segmentation produced results, but among them are many meaningless modal particles, transition words such as "even" and "but", and assorted symbols. Such words are called stop words, and for further analysis they generally need to be removed.
First I put together a stop-word list of my own. It contains common stop words that I collected myself, plus words that have no analytical value for this particular content; ready-made stop-word lists compiled by others can also be found online.
> stopwords <- unlist(read.table("D:\\r\\rworkspace\\stopwords.txt", stringsAsFactors = F))
> stopwords[50:100]
      V150       V151       V152       V153       V154       V155       V156
    "Ouch"       "Oh"        "I"       "We"    "Press"         ""         ""
      V157       V158       V159       V160       V161       V162       V163
  "Bar da"       "to"     "just"       "is"     "This" "The Spirit"  "ratio"
      V164       V165       V166       V167       V168       V169       V170
"For example"     "I"        "I"       "he" "each other"    "side"      "No"
      V171       V172       V173       V174       V175       V176       V177
   "Other" "Don't say"    "and"      "and" "No more than"    "not"     "not"
      V178       V179       V180       V181       V182       V183       V184
"Not only"      "not" "no matter"    "not"      "but" "not only"      "not"
      V185       V186       V187       V188       V189       V190       V191
"No matter" "Not afraid" "otherwise" "not" "not special" "inflexible" "Don't ask"
      V192       V193       V194       V195       V196       V197       V198
"Not only"   "toward"         ""   "while" "Take advantage"  "Dash"       ""
      V199      V1100
 "Besides"  "besides"
removeStopWords <- function(x, stopwords) {
  temp <- character(0)
  index <- 1
  xLen <- length(x)
  while (index <= xLen) {
    # keep the token only if it does not appear in the stop-word list
    if (length(stopwords[stopwords == x[index]]) < 1)
      temp <- c(temp, x[index])
    index <- index + 1
  }
  temp
}
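The loop above works, but base R can express the same filter in one vectorized line; a sketch that returns the same result:

removeStopWords2 <- function(x, stopwords) {
  x[!(x %in% stopwords)]  # drop every token that appears in the stop-word list
}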
> hlzjTemp2 <- lapply(hlzjTemp, removeStopWords, stopwords)
> hlzjTemp2[1:2]
[[1]]
 [1] "new"            "recommended"  "fashion"      "camouflage" "fabric"  "design"
 [7] "simple"         "single"       "West"         "inject"     "extraordinary" "wild"
[13] "charm"          "waterproof"   "effect"       "practical"  "sex"     "high"
[19] "pole"           "with"         "spring"       "suction"    "eye"     "highlights"
[25] "spring"         "new"          "Hailan House" "men"        "casual"  "suits"
[31] "Korean version" "camouflage"   "suit"         "jacket"     "HWXAJAA"
[[2]]
 [1] "Small series"   "recommended"  "slim"         "thin"       "hooded"  "warm heart"
 [7] "design"         "wind"         "warm"         "contrast"   "line"    "design"
[13] "young"          "fashion"      "go to relatives and friends" "leisure" "travel" "fashion"
[19] "select"         "vitality"     "winter"       "warm"       "easy"    "winter"
[25] "hot sale"       "Hailan House" "genuine"      "men"        "warm"    "hooded"
[31] "down jacket"    "jacket"       "HWRAJGA"
Comparing this with hlzjTemp[1:2], you can clearly see that function words such as "for" and "of" have been removed.
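A quick way to list exactly which tokens were dropped from a post, using base R:

setdiff(hlzjTemp[[1]], hlzjTemp2[[1]])  # tokens present before but not after stop-word removal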
4. Word Cloud
The word cloud is a very common visualization nowadays: the words are drawn together in one picture, with each word's size showing its frequency, so you can see at a glance which words appear most often. It is widely used in public-opinion analysis.
The following code tallies the segmentation results, counts the occurrences of each word and sorts them, then takes the 150 most frequent words and draws the word cloud with the wordcloud() function.
> words <- lapply(hlzjTemp2, strsplit, " ")
> wordsNum <- table(unlist(words))
> wordsNum <- sort(wordsNum)            # sort by frequency (ascending)
> wordsData <- data.frame(words = names(wordsNum), freq = wordsNum)
> library(wordcloud)                    # load the word-cloud package
> weibo.top150 <- tail(wordsData, 150)  # take the 150 most frequent words
> colors <- brewer.pal(8, "Dark2")
> wordcloud(weibo.top150$words, weibo.top150$freq, scale = c(8, 0.5), colors = colors, random.order = F)
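To save the cloud to an image file instead of the screen device, the standard base R graphics-device pattern works; a sketch (the file name and size are arbitrary):

png("weibo_wordcloud.png", width = 800, height = 800)  # open a PNG device
wordcloud(weibo.top150$words, weibo.top150$freq, scale = c(8, 0.5), colors = colors, random.order = F)
dev.off()                                              # close the device to write the file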
The content of this brand's Weibo has obvious characteristics. The brand name "Hailan House" appears far more often than any other word; the next most frequent words are "link", "flagship store", "fashion", "new", "slim", and "menswear". From this you can roughly tell that the brand focuses on menswear, that the account frequently posts new-product recommendations, and that it probably provides links to its flagship store. You can also see TV shows such as "All Star War" and "Run Brothers"; anyone slightly familiar with the brand knows these are two programs Hailan House has sponsored in the past couple of years, so it is entirely normal that they appear many times in its Weibo.
The original data cannot be shared; you can find other data to try for yourself.