Text Mining in R, Part 2: Word Processing


With the required packages installed in RStudio (see Part 1 for the installation steps), we can start processing text. A useful reference is the article "Playing with Text Mining", which describes text mining in R in great detail and comes with downloadable material; it is worth reading.
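For orientation, here is a minimal sketch of the packages this part relies on: Rwordseg provides the segmentation functions used below, and wordcloud (with RColorBrewer for palettes) is used in section 4.

> library(Rwordseg)      # segmentCN(), insertWords(), installDict(), ...
> library(wordcloud)     # wordcloud(), used in section 4
> library(RColorBrewer)  # brewer.pal() color palettes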

1. Rwordseg functions

The package documentation can be downloaded from http://download.csdn.net/detail/cl1143015961/8436741; only a brief overview is given here.

Word segmentation

The example sentences below are Chinese in the original post (a line from Tagore); the tokens are shown here as English glosses.

> Segmentcn (C ("If you shed tears when you miss the Sun", "You also miss the Stars")

[[1]]

[1] "If" "You" "because" "wrong" "Over" "Sun" "and"

[8] "Tears"

[[2]]

[1] "You" "also" "Will" "wrong" "Over" "Star"

You can see the result is not ideal: "miss" (错过), which should stay one word, was split into "wrong" (错) and "over" (过), which means the word is not in the lexicon. In such cases we need to add the missing words to the lexicon ourselves.

Adding and deleting words

> Insertwords ("Miss")

> Segmentcn (C ("If you shed tears when you miss the Sun", "You also miss the Stars")

[[1]]

[1] "If" "You" "because" "Miss" "The Sun" "and" "Tears"

[[2]]

[1] "You" "also" "Will" "Miss" "Star"

In some cases you do not want two characters to be merged into one word. Take "miss" (错过) again: in the sentence below, 错 ("wrong") and 过去 ("past") happen to be adjacent, but semantically they should not form the word "miss". So we delete "miss" from the dictionary, add the word we actually need ("past"), and segment again; the result is much better.

> Segmentcn ("This is wrong, you can make it, but you shouldn't do it again.")

[1] "This" "Miss" "Go" "You" "Can" "Commit" "but"

[8] "Now" "then" "Commit" "no" "should" "

>deletewords ("Miss")

> Insertwords ("Past")

> Segmentcn ("This is wrong, you can make it, but you shouldn't do it again.")

[1] "This" "Wrong" "Past" "You" "Can" "Commit" "but"

[8] "Now" "then" "Commit" "no" "should" "

Installing and uninstalling dictionaries

When segmenting, you may encounter specialized articles whose professional vocabulary is not in the lexicon. In that case you need to find a relevant dictionary and install it into R. For example, in news analysis, entertainment news mentions many stars' and singers' names, and these names will not be recognized as words during segmentation. You may then need to add a dictionary of names, either built by yourself or found online. The thesaurus of the Sogou input method, http://pinyin.sogou.com/dict/, is recommended; you can download dictionaries there by category.

Here I use a dictionary of celebrity names: http://pinyin.sogou.com/dict/cate/index/429.

> Segmentcn ("2015 years of several years of play have appeared RMB figure")

[1] "2015" "" several "" open "" Year "

[6] "Play" "All" "appeared" "The" "Tang"

[11] "Yan" "" Figure "

>installdict ("D:\\r\\sources\\dictionaries\\singers.scel", Dictname = "names")

3732 words were loaded! ... New dictionary ' names ' was installed!

>SEGMENTCN ("2015 years of several years of play have appeared RMB figure")

[1] "2015" "" several "" open "" Year "

[6] "Play" "All" "appeared" "" RMB "

[11] "the figure"

After installation, the actress's name is recognized as one word. listDict() shows the installed dictionaries:

> listDict()
   Name Type                                                Des
1 names star official recommendation, lexicon from user uploads
                                                        Path
1 e:/programfiles/r/r-3.1.2/library/Rwordseg/dict/names.dic

You can also remove a dictionary that you added but no longer need.

> uninstallDict()

3732 words were removed! ... The dictionary 'names' was uninstalled!

> listDict()
[1] Name Type Des  Path
<0 rows> (or 0-length row.names)

This is just a basic introduction; Rwordseg has many more features, which are described in its Chinese documentation.

2. Segmenting a brand's official Weibo

The data source is the official Weibo account of a clothing brand, covering posts from 2012 to the end of 2014. From the content you can probably guess which brand it is.

First install clothing-related dictionaries. Again these come from the Sogou thesaurus: the first two dictionaries at http://pinyin.sogou.com/dict/cate/index/397.

> installdict ("D:\\r\\sources\\dictionaries\\fushi.scel", Dictname = "Fushi")

> installdict ("D:\\r\\sources\\dictionaries\\ali_fushi.scel", Dictname = "Alifushi")

> listdict ()

Name Type

1 names stars

2 Pangu Text

3 Fushi Apparel

4 Ali Costumes

The next step is to read the data into R; there are 1640 Weibo posts in total. Pay attention to the encoding of the data file: readLines reads in the native encoding (GBK on a Chinese Windows system) by default, so if the file uses another encoding you must say so, otherwise the text will be garbled.

> hlzj <- readLines("d:\\r\\rworkspace\\orgdata.txt", encoding = "UTF-8")
> length(hlzj)
[1] 1640
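If you are not sure of the file's encoding, one defensive approach (my own sketch, not from the original post) is to read the raw lines and normalize them to UTF-8 with iconv; here "GBK" is an assumption about the source encoding:

> rawLines <- readLines("d:\\r\\rworkspace\\orgdata.txt")  # read in the native encoding
> hlzjUtf8 <- iconv(rawLines, from = "GBK", to = "UTF-8")  # convert, assuming the file is GBK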

Next comes segmentation. First remove digits and some special symbols that may be present in the data, then segment.

> hlzjTemp <- gsub("[0-9０１２３４５６７８９<>~]", "", hlzj)  # strip ASCII and full-width digits plus a few symbols
> hlzjTemp <- segmentCN(hlzjTemp)
> hlzjTemp[1:2]

[[1]]
 [1] "new"            "recommended"  "fashion"     "camouflage"  "fabric"      "design"
 [7] "for"            "simple"       "single"      "west"        "inject"      "extraordinary"
[13] "wild"           "charm"        "good"        "waterproof"  "effect"      "make"
[19] "practical"      "sex"          "more"        "high"        "pole"        "with"
[25] "spring"         "suction"      "eye"         "highlights"  "spring"      "new"
[31] "Hailan Home"    "men"          "leisure"     "suit"        "Korean version"
[36] "camouflage"     "suit"         "jacket"      "Hwxajaa"

[[2]]
 [1] "small series"   "recommended"  "slim"        "thin"        "hooded"      "warm heart"
 [7] "design"         "wind"         "warm"        "contrast"    "line"        "design"
[13] "young"          "fashion"      "visit relatives and friends"  "leisure"   "travel"
[18] "fashion"        "select"       "vitality"    "winter"      "warm"        "easy"
[24] "winter"         "hot"          "Hailan Home" "genuine"     "men"         "warm"
[30] "hooded"         "down jacket"  "coat"        "Hwrajga"

You can see the Weibo posts have been segmented. The process itself is simple, but in practice you should inspect the results: words missing from the lexicon get cut apart and need to be added, so that the segmentation is as good as possible.
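One quick way to audit the result (my own sketch, not from the post) is to look at the most frequent single-character tokens, since real words that were wrongly split tend to surface there:

> tokens <- unlist(hlzjTemp)
> singles <- tokens[nchar(tokens) == 1]              # single-character tokens
> head(sort(table(singles), decreasing = TRUE), 20)  # likely split-word fragments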

3. Removing stop words

The segmentation results still contain many meaningless words: modal particles such as 吧 ("bar" in the gloss), the particle 的 ("of"), transition words such as "even" and "but", and stray symbols. Such words are called stop words, and for further analysis they usually need to be removed.

First prepare a stop word list. The one used here I compiled myself: it contains common stop words, plus words that, given the actual content, carry no analytical meaning. You can also find ready-made stop word lists that others have compiled online.

> stopwords <- unlist(read.table("D:\\r\\rworkspace\\stopwords.txt", stringsAsFactors = F))
> stopwords[50:100]

     V150      V151      V152      V153      V154
   "ouch"      "oh"       "I"      "we"   "press"
(the remaining entries, V155 to V1100, are similar glossed stop words:
 "just", "is", "this", "ratio", "for example", "he", "each other",
 "not only", "no matter", "otherwise", "besides", ...)

removeStopWords <- function(x, stopwords) {
  temp <- character(0)
  index <- 1
  xLen <- length(x)
  while (index <= xLen) {
    # keep the token only if it is not in the stop word list
    if (length(stopwords[stopwords == x[index]]) < 1)
      temp <- c(temp, x[index])
    index <- index + 1
  }
  temp
}
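The while loop works, but the same filter can be written as one vectorized subsetting expression. A sketch equivalent to removeStopWords():

# Vectorized equivalent: keep only the tokens not in the stop list.
removeStopWordsFast <- function(x, stopwords) {
  x[!(x %in% stopwords)]
}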

> hlzjTemp2 <- lapply(hlzjTemp, removeStopWords, stopwords)
> hlzjTemp2[1:2]

[[1]]
 [1] "new"            "recommended"  "fashion"     "camouflage"  "fabric"      "design"
 [7] "simple"         "single"       "west"        "inject"      "extraordinary" "wild"
[13] "charm"          "waterproof"   "effect"      "practical"   "sex"         "high"
[19] "pole"           "with"         "spring"      "suction"     "eye"         "highlights"
[25] "spring"         "new"          "Hailan Home" "men"         "casual"      "suits"
[31] "Korean version" "camouflage"   "suit"        "jacket"      "Hwxajaa"

[[2]]
 [1] "small series"   "recommended"  "slim"        "thin"        "hooded"      "warm heart"
 [7] "design"         "wind"         "warm"        "contrast"    "line"        "design"
[13] "young"          "fashion"      "visit friends" "leisure"    "travel"     "fashion"
[19] "select"         "vitality"     "winter"      "warm"        "easy"        "winter"
[25] "hot sale"       "Hailan Home"  "genuine"     "men"         "warm"        "hooded"
[31] "down jacket"    "jacket"       "Hwrajga"

Comparing this with hlzjTemp[1:2] above, you can clearly see that words like "of" have been removed.

4. Word Cloud

The word cloud is now a very common kind of chart: the words are drawn in one picture, with font size indicating frequency, so you can see at a glance which words occur most often. It is widely used in public opinion analysis.

The following code tallies the segmentation results, counting and sorting the occurrences of each word, then takes the 150 most frequent words and draws the word cloud with wordcloud().

> words <- lapply(hlzjTemp2, strsplit, " ")
> wordsNum <- table(unlist(words))
> wordsNum <- sort(wordsNum)                  # sort by frequency
> wordsData <- data.frame(words = names(wordsNum), freq = wordsNum)
> library(wordcloud)                          # load the word cloud package
> weibo.top150 <- tail(wordsData, 150)        # take the top 150 words
> colors <- brewer.pal(8, "Dark2")
> wordcloud(weibo.top150$words, weibo.top150$freq, scale = c(8, 0.5),
+           colors = colors, random.order = F)
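To save the figure instead of drawing to the screen, you can wrap the call in a graphics device. A sketch; the file name and size are placeholders:

> png("weibo_wordcloud.png", width = 800, height = 800)  # open a PNG device
> wordcloud(weibo.top150$words, weibo.top150$freq, scale = c(8, 0.5),
+           colors = colors, random.order = F)
> dev.off()                                              # write the file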

The brand's Weibo content has obvious characteristics. The brand name "Hailan Home" appears far more often than any other word; the next most frequent words are "link", "flagship store", "fashion", "new", "slim" and "menswear". You can roughly see that this brand focuses on menswear, that the account frequently posts new product recommendations, and that posts often provide links to its flagship store. TV programs such as "All-Star War" and "Running Man" also show up; these are two shows Hailan Home sponsored during those two years, so their frequent appearance in the brand's Weibo is quite normal.
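Because the brand name dwarfs every other word, it can help to treat it as a domain-specific stop word and redraw the cloud so the rest of the vocabulary becomes visible. A sketch; the token to drop is whatever the brand name segments to in your data:

> top150.rest <- subset(weibo.top150, words != "Hailan Home")  # assumed brand token
> wordcloud(top150.rest$words, top150.rest$freq, scale = c(6, 0.5),
+           colors = colors, random.order = F)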

The original data cannot be shared, but you can find another data set and try this yourself.
