Part 2: Word Processing
After installing the relevant packages in RStudio (see Part 1 for installation instructions), we can start on the word processing. Reference document: "Play Text Mining", a very detailed article on doing text mining with R that also offers downloads of related material; it is worth reading.
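If you skipped Part 1, a minimal setup sketch follows; the R-Forge repository URL and the rJava dependency are assumptions based on how Rwordseg was historically distributed, so adjust them to whatever worked for you in Part 1.

# Setup sketch (assumptions: Rwordseg is on R-Forge and needs rJava)
# install.packages("rJava")
# install.packages("Rwordseg", repos = "http://R-Forge.R-project.org")
library(Rwordseg)   # Chinese word segmentation
library(wordcloud)  # word clouds, used in section 4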
1. Rwordseg functions
The package's documentation can be downloaded at http://download.csdn.net/detail/cl1143015961/8436741; only a brief overview is given here.
Word segmentation
> segmentCN(c("If you shed tears when you miss the sun", "You will also miss the stars"))
[[1]]
[1] "If"      "you"     "because" "wrong"   "over"    "sun"     "and"
[8] "tears"
[[2]]
[1] "you"   "also"  "will"  "wrong" "over"  "stars"
You can see that the result is not very good: "miss", which should be one word, has been split into "wrong" + "over", which means the word is not in the lexicon. In such cases we need to add the missing words to the lexicon ourselves.
Adding and deleting words
> insertWords("miss")
> segmentCN(c("If you shed tears when you miss the sun", "You will also miss the stars"))
[[1]]
[1] "If"      "you"     "because" "miss"    "the sun" "and"     "tears"
[[2]]
[1] "you"  "also" "will" "miss" "stars"
Sometimes the opposite is needed: you do not want a string treated as one word. Take "miss" again: in some sentences its characters actually mean "wrong" + "past" and should not be joined into one word. In that case you can delete the word from the dictionary, add the words you actually need, and segment again; the result is much better.
> Segmentcn ("This is wrong, you can make it, but you shouldn't do it again.")
[1] "This" "Miss" "Go" "You" "Can" "Commit" "but"
[8] "Now" "then" "Commit" "no" "should" "
>deletewords ("Miss")
> Insertwords ("Past")
> Segmentcn ("This is wrong, you can make it, but you shouldn't do it again.")
[1] "This" "Wrong" "Past" "You" "Can" "Commit" "but"
[8] "Now" "then" "Commit" "no" "should" "
Installing and uninstalling dictionaries
When processing more refined and specialized articles, you may run into professional vocabulary that is not in the lexicon. In that case you need to find a relevant dictionary and install it into R. For example, in news analysis, entertainment news mentions many star and singer names, and these names will not be recognized as words during segmentation. You may then need to add a dictionary of names, which you can build yourself or find online. I recommend the Sogou input method's thesaurus at http://pinyin.sogou.com/dict/, where you can download dictionaries by category.
Here I use a dictionary of celebrity names: http://pinyin.sogou.com/dict/cate/index/429.
> Segmentcn ("2015 years of several years of play have appeared RMB figure")
[1] "2015" "" several "" open "" Year "
[6] "Play" "All" "appeared" "The" "Tang"
[11] "Yan" "" Figure "
> installDict("D:\\r\\sources\\dictionaries\\singers.scel", dictname = "names")
3732 words were loaded! ... New dictionary 'names' was installed!
> segmentCN("Several plays opening 2015 have all featured Tang Yan")
 [1] "2015"     "several"  "open"     "year"     "play"
 [6] "all"      "appeared" "the"      "Tang Yan"
[10] "figure"
> listDict()
   Name Type                                                 Des
1 names star official recommendation, thesaurus from user upload
                                                        Path
1 e:/programfiles/r/r-3.1.2/library/rwordseg/dict/names.dic
You can also remove a dictionary that you added but no longer need.
> uninstallDict()
3732 words were removed! ... The dictionary 'names' was uninstalled!
> listDict()
[1] Name Type Des  Path
<0 rows> (or 0-length row.names)
That covers the basics; Rwordseg has more features, for which please see its Chinese documentation.
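A quick way to browse what else the package exports, using only base R:

ls("package:Rwordseg")  # list exported functions, e.g. segmentCN, installDict, insertWords, ...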
2. Segmenting a brand's official Weibo
The data source is the official Weibo account of a clothing brand, covering posts from 2012 to the end of 2014. (The screenshot of the raw data is not reproduced here.) From the content you can probably guess which brand it is.
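Since the screenshot is missing, you can peek at the raw file yourself; a sketch using the same file path that appears below:

head(readLines("d:\\r\\rworkspace\\orgdata.txt", encoding = "UTF-8"), 3)  # first 3 posts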
First install clothing-related dictionaries, again downloaded from the Sogou input method's thesaurus; I used the first two dictionaries listed at http://pinyin.sogou.com/dict/cate/index/397.
> installDict("D:\\r\\sources\\dictionaries\\fushi.scel", dictname = "fushi")
> installDict("D:\\r\\sources\\dictionaries\\ali_fushi.scel", dictname = "alifushi")
> listDict()
      Name    Type
1    names    star
2    pangu    text
3    fushi apparel
4 alifushi apparel
The next step is to read the data into R; you can see there are 1640 Weibo posts in total. Pay attention to the file's encoding: readLines assumes the native encoding (GBK on a Chinese Windows system) by default, so a UTF-8 file must be read with encoding = "UTF-8" or the text will be garbled.
> hlzj <- readLines("d:\\r\\rworkspace\\orgdata.txt", encoding = "UTF-8")
> length(hlzj)
[1] 1640
Next comes the segmentation itself. First remove the digits and special symbols that may appear in the data, then segment.
> hlzjTemp <- gsub("[0-9０１２３４５６７８９<>~]", "", hlzj)  # strip ASCII and full-width digits plus a few symbols
> hlzjTemp <- segmentCN(hlzjTemp)
> hlzjTemp[1:2]
[[1]]
 [1] "new"            "recommended"    "fashion"      "camouflage" "fabric"  "design"
 [7] "for"            "simple"         "single"       "West"       "inject"  "extraordinary"
[13] "wild"           "charm"          "good"         "waterproof" "effect"  "make"
[19] "practical"      "sex"            "more"         "high"       "pole"    "with"
[25] "spring"         "suction"        "eye"          "highlights" "spring"  "new"
[31] "Hailan House"   "men"            "leisure"      "suit"       "Korean version" "camouflage"
[37] "suit"           "jacket"         "HWXAJAA"
[[2]]
 [1] "Small series"   "recommended"    "slim"         "thin"       "hooded"  "warm heart"
 [7] "design"         "wind"           "warm"         "contrast"   "line"    "design"
[13] "young"          "fashion"        "go to relatives and friends" "leisure" "travel" "of"
[19] "fashion"        "select"         "vitality"     "winter"     "warm"    "easy"
[25] "winter"         "hot"            "Hailan House" "genuine"    "men"     "warm"
[31] "hooded"         "down jacket"    "coat"         "HWRAJGA"
You can see that the Weibo content has been segmented. The process itself is very simple, but in practice you should inspect the results: some words missing from the lexicon get cut apart and need to be added, so that the segmentation is as good as possible.
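For example, if inspection showed brand or product terms being split apart, they could be added in one call; a sketch, assuming insertWords accepts a character vector as in the examples above (the terms below are hypothetical stand-ins for whatever your inspection turns up):

insertWords(c("Hailan House", "down jacket", "Korean version"))  # hypothetical additions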
3. Removing stop words
The segmentation produced results, but among them are many meaningless modal particles, transition words such as "even" and "but", and assorted symbols. Such words are called stop words, and for further analysis they generally need to be removed.
First I put together a stop-word list of my own. It contains common stop words that I collected myself, plus words that have no analytical value for this particular content; ready-made stop-word lists compiled by others can also be found online.
> stopwords <- unlist(read.table("D:\\r\\rworkspace\\stopwords.txt", stringsAsFactors = F))
> stopwords[50:100]
      V150       V151       V152       V153       V154       V155       V156
    "Ouch"       "Oh"        "I"       "We"    "Press"         ""         ""
      V157       V158       V159       V160       V161       V162       V163
  "Bar da"       "to"     "just"       "is"     "This" "The Spirit"  "ratio"
      V164       V165       V166       V167       V168       V169       V170
"For example"     "I"        "I"       "he" "each other"    "side"      "No"
      V171       V172       V173       V174       V175       V176       V177
   "Other" "Don't say"    "and"      "and" "No more than"    "not"     "not"
      V178       V179       V180       V181       V182       V183       V184
"Not only"      "not" "no matter"    "not"      "but" "not only"      "not"
      V185       V186       V187       V188       V189       V190       V191
"No matter" "Not afraid" "otherwise" "not" "not special" "inflexible" "Don't ask"
      V192       V193       V194       V195       V196       V197       V198
"Not only"   "toward"         ""   "while" "Take advantage"  "Dash"       ""
      V199      V1100
 "Besides"  "besides"
removeStopWords <- function(x, stopwords) {
  temp <- character(0)
  index <- 1
  xLen <- length(x)
  while (index <= xLen) {
    # keep the token only if it does not appear in the stop-word list
    if (length(stopwords[stopwords == x[index]]) < 1)
      temp <- c(temp, x[index])
    index <- index + 1
  }
  temp
}
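The loop above works, but base R can express the same filter in one vectorized line; a sketch that returns the same result:

removeStopWords2 <- function(x, stopwords) {
  x[!(x %in% stopwords)]  # drop every token that appears in the stop-word list
}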
> hlzjTemp2 <- lapply(hlzjTemp, removeStopWords, stopwords)
> hlzjTemp2[1:2]
[[1]]
 [1] "new"            "recommended"  "fashion"      "camouflage" "fabric"  "design"
 [7] "simple"         "single"       "West"         "inject"     "extraordinary" "wild"
[13] "charm"          "waterproof"   "effect"       "practical"  "sex"     "high"
[19] "pole"           "with"         "spring"       "suction"    "eye"     "highlights"
[25] "spring"         "new"          "Hailan House" "men"        "casual"  "suits"
[31] "Korean version" "camouflage"   "suit"         "jacket"     "HWXAJAA"
[[2]]
 [1] "Small series"   "recommended"  "slim"         "thin"       "hooded"  "warm heart"
 [7] "design"         "wind"         "warm"         "contrast"   "line"    "design"
[13] "young"          "fashion"      "go to relatives and friends" "leisure" "travel" "fashion"
[19] "select"         "vitality"     "winter"       "warm"       "easy"    "winter"
[25] "hot sale"       "Hailan House" "genuine"      "men"        "warm"    "hooded"
[31] "down jacket"    "jacket"       "HWRAJGA"
Comparing this with hlzjTemp[1:2], you can clearly see that function words such as "for" and "of" have been removed.
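A quick way to list exactly which tokens were dropped from a post, using base R:

setdiff(hlzjTemp[[1]], hlzjTemp2[[1]])  # tokens present before but not after stop-word removal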
4. Word Cloud
The word cloud is a very common visualization nowadays: the words are drawn together in one picture, with each word's size showing its frequency, so you can see at a glance which words appear most often. It is widely used in public-opinion analysis.
The following code tallies the segmentation results, counts the occurrences of each word and sorts them, then takes the 150 most frequent words and draws the word cloud with the wordcloud() function.
> words <- lapply(hlzjTemp2, strsplit, " ")
> wordsNum <- table(unlist(words))
> wordsNum <- sort(wordsNum)            # sort by frequency (ascending)
> wordsData <- data.frame(words = names(wordsNum), freq = wordsNum)
> library(wordcloud)                    # load the word-cloud package
> weibo.top150 <- tail(wordsData, 150)  # take the 150 most frequent words
> colors <- brewer.pal(8, "Dark2")
> wordcloud(weibo.top150$words, weibo.top150$freq, scale = c(8, 0.5), colors = colors, random.order = F)
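To save the cloud to an image file instead of the screen device, the standard base R graphics-device pattern works; a sketch (the file name and size are arbitrary):

png("weibo_wordcloud.png", width = 800, height = 800)  # open a PNG device
wordcloud(weibo.top150$words, weibo.top150$freq, scale = c(8, 0.5), colors = colors, random.order = F)
dev.off()                                              # close the device to write the file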
The content of this brand's Weibo has obvious characteristics. The brand name "Hailan House" appears far more often than any other word; the next most frequent words are "link", "flagship store", "fashion", "new", "slim", and "menswear". From this you can roughly tell that the brand focuses on menswear, that the account frequently posts new-product recommendations, and that it probably provides links to its flagship store. You can also see TV shows such as "All Star War" and "Run Brothers"; anyone slightly familiar with the brand knows these are two programs Hailan House has sponsored in the past couple of years, so it is entirely normal that they appear many times in its Weibo.
The original data cannot be shared; you can find other data to try for yourself.