Text Mining with R, Part 5

Last Update:2015-03-19 Source: Internet

Author: User

Tags knowledge base


Part 5: Sentiment analysis

This is the last article in this series. Every part of text mining is worth singling out and studying carefully in its own right. I am still at a beginner's stage: I use R's ready-made algorithms to meet my own needs, and I borrow from the wisdom of many people online. I want to summarize what I have learned and share it, and I hope to be inspired in turn by what others share.

I read through a number of online articles on sentiment analysis of Chinese text and then compared them with my own method; my approach is really quite simple and direct. One paper that introduces sentiment orientation analysis of Chinese text is Http://wenku.baidu.com/link?url=TVf5LgNS6esnunpgubvM14z24m0f4lTyD483gw_ HENP2RYEL6XZANSLZ8OCCZCFLWKLQD0PDBHVUCV4-0LOTDGP3HL-KQETTWJ3L91HFTA3. It describes three main approaches. The first builds a sentiment orientation dictionary from existing electronic dictionaries or lexical knowledge bases. The second uses unsupervised machine learning. The third uses learning methods based on a manually annotated corpus.

I will not explain the three approaches in detail. They share one feature: each needs a corpus of words with known sentiment orientation. My implementation in R resembles the first approach. I compile a thesaurus of positive (commendatory) words and a thesaurus of negative (derogatory) words; both can be found on the internet and only need a little tidying. I then segment each text into words and pick out the sentiment words. Each text starts with a sentiment score of 0, which is the initial value the code uses. Every word that matches the positive thesaurus adds 1 and every word that matches the negative thesaurus subtracts 1. A final score above zero counts as a positive evaluation and a score below zero as a negative one.

This method basically works as a sentiment orientation classifier, but it can be improved. As the paper cited above points out, words can be weighted by the strength of the sentiment they express instead of scoring only +1 or -1. Some words also carry different sentiment in different contexts, such as "pride" in the paper, which probably calls for a separate list of special-case words. Negation is another problem: a double negative such as "It is impossible not to like it!" comes out as a negative evaluation under my scoring rule, and a rhetorical question such as "How is that cheap?" comes out as positive. I put "cheap" in the positive thesaurus. "Cheap and affordable" is certainly positive, but in "cheap goods are never good" the word is still counted as positive, which is wrong. This is the second problem again: sentiment orientation differs with context.
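The two improvements mentioned above (strength weights and negation) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the method used in this article: the `lexicon` weights, the `negators` list, and the function name `score_tokens` are all hypothetical, and the tokens are English toy words rather than segmented Chinese.

```r
# Hypothetical toy lexicon: word -> sentiment strength (not the author's thesaurus).
lexicon <- c(like = 1, love = 2, affordable = 1, torn = -1, awful = -2)
# Hypothetical negation words; a negator flips the sign of the next sentiment word.
negators <- c("not", "never")

score_tokens <- function(tokens, lexicon, negators) {
  score <- 0
  flip <- FALSE
  for (tok in tokens) {
    if (tok %in% negators) {
      flip <- !flip                      # two negators before a word cancel out
    } else if (tok %in% names(lexicon)) {
      w <- lexicon[[tok]]
      score <- score + if (flip) -w else w
      flip <- FALSE                      # negation applies to one sentiment word only
    }
  }
  score
}

s1 <- score_tokens(c("i", "love", "it"), lexicon, negators)
s2 <- score_tokens(c("i", "do", "not", "like", "it"), lexicon, negators)
s3 <- score_tokens(c("never", "not", "like"), lexicon, negators)
print(c(s1, s2, s3))
```

The sketch only handles a negator that comes before the sentiment word. It does not resolve rhetorical questions, context-dependent words such as "cheap", or a negation that follows the word it modifies.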

The implementation process in R:

1. Data input and processing

The data again comes from a brand's official Weibo account: 1376 comments on its Weibo posts, plus a positive word library and a negative word library, all read into R. The thesaurus comes from http://www.datatang.com/data/44317/. It may not be complete and needs to be enriched. Looking at clothing-related text, I found that words such as "faded", "Involute", "Show Thin" and "fat" are not in it; these have to be added by hand.

> hlzj.comment <- readLines("Hlzj_commenttest.txt")

> negative <- readLines("D:\\r\\rworkspace\\hlzjworkfiles\\negative.txt")

> positive <- readLines("D:\\r\\rworkspace\\hlzjworkfiles\\positive.txt")

> length(hlzj.comment)

[1] 1376

> length(negative)

[1] 4477

> length(positive)

[1] 5588
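Adding the missing domain words to the loaded word lists could look like the sketch below. The toy vectors stand in for the `positive` and `negative` vectors read above, and the extra words and their polarity are assumptions chosen for illustration.

```r
# Toy stand-ins for the vectors read with readLines() (assumed contents).
positive <- c("good", "affordable", "cheap", "")
negative <- c("bad", "cheap", "awful")

# Hypothetical clothing-related words to add; the polarity is an assumption.
extra_positive <- c("slimming", "good")
extra_negative <- c("faded")

# Append the new words, then drop duplicates and empty lines.
positive <- unique(c(positive, extra_positive))
negative <- unique(c(negative, extra_negative))
positive <- positive[nzchar(positive)]
negative <- negative[nzchar(negative)]

# Words present in both lists are ambiguous and worth reviewing by hand.
ambiguous <- intersect(positive, negative)
print(ambiguous)
```

Checking the overlap between the two lists is a cheap way to spot context-dependent words before scoring.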

2. Word segmentation and scoring of comments

Segmentation follows the process described in Part 2. I then wrote a function, getemotionaltype() (listed at the end of this section), which compares the segmentation results against the negative and positive word lists and computes a score for each comment.

> commenttemp <- gsub("[0-90123456789 < > ~]", "", hlzj.comment)

> commenttemp <- segmentCN(commenttemp)

> commenttemp[1:2]

[[1]]

[1] "Congratulations" "everyone" "and" "no" "Find" "I'm"

[[2]]

[1] "no" "private messages to" "i small i" "give" "drain" " "

> emotionrank <- getemotionaltype(commenttemp, positive, negative)

[1] 0.073

[1] 0.145

[1] 0.218

[1] 0.291

[1] 0.363

[1] 0.436

[1] 0.509

[1] 0.581

[1] 0.654

[1] 0.727

[1] 0.799

[1] 0.872

[1] 0.945

> emotionrank[1:10]

[1] 1 0 2 1 1 2 3 1 0 0

> commentemotionalrank <- list(rank = emotionrank, comment = hlzj.comment)

> commentemotionalrank <- as.data.frame(commentemotionalrank)

> fix(commentemotionalrank)
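Besides inspecting the data frame by hand, the scores can be turned into labels and counted. The sketch below uses the ten scores printed above as toy input; the variable names and the "neutral" label for a score of 0 are assumptions, since the article only defines positive and negative.

```r
# The first ten scores shown above, used as toy input.
scores <- c(1, 0, 2, 1, 1, 2, 3, 1, 0, 0)

# Map each score to a label; treating 0 as "neutral" is an assumption.
label <- factor(
  ifelse(scores > 0, "positive", ifelse(scores < 0, "negative", "neutral")),
  levels = c("negative", "neutral", "positive")
)

# Count how many comments fall into each class.
counts <- table(label)
print(counts)
```

Fixing the factor levels keeps a class in the table even when no comment falls into it.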

getemotionaltype <- function(x, pwords, nwords) {
    # x: list of character vectors, one segmented comment per element
    # pwords, nwords: positive and negative word lists
    emotiontype <- numeric(0)
    xlen <- length(x)
    emotiontype[1:xlen] <- 0
    index <- 1
    while (index <= xlen) {
        ylen <- length(x[[index]])
        index2 <- 1
        while (index2 <= ylen) {
            if (length(pwords[pwords == x[[index]][index2]]) >= 1) {
                emotiontype[index] <- emotiontype[index] + 1
            } else if (length(nwords[nwords == x[[index]][index2]]) >= 1) {
                emotiontype[index] <- emotiontype[index] - 1
            }
            index2 <- index2 + 1
        }
        # report progress every 100 comments
        if (index %% 100 == 0) {
            print(round(index / xlen, 3))
        }
        index <- index + 1
    }
    emotiontype
}
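The same scoring rule can also be written without explicit loops. The following is a sketch of an equivalent vectorized version, not the code used in this article; the name `score_comments` and the toy data are assumptions.

```r
# Assumed name; scores each comment with vectorized membership tests.
score_comments <- function(x, pwords, nwords) {
  vapply(x, function(words) {
    is_pos <- words %in% pwords
    # As in the loop version, a word found in both lists counts as positive only.
    is_neg <- !is_pos & words %in% nwords
    as.numeric(sum(is_pos) - sum(is_neg))
  }, numeric(1))
}

# Toy segmented comments (hypothetical, in English for readability).
comments <- list(c("good", "cheap", "fabric"), c("bad", "torn"), character(0))
result <- score_comments(comments,
                         pwords = c("good", "cheap"),
                         nwords = c("bad", "torn", "cheap"))
print(result)
```

Using `%in%` on a whole comment at once avoids the inner loop, which should matter more as the number of comments and the size of the word lists grow.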

Looking at the results, the first figure looks quite normal, and the second figure appears to show comments posted when the clothes tore on RM, which HLZJ sponsors. I am not trying to smear the brand; I only wanted an example of how negative comments are handled, and the result is not ideal. Rhetorical questions cannot be identified, and colloquial expressions such as "drunk" and "too times" are not in the sentiment thesaurus, so the sentiment orientation of such comments is recognized poorly.
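One way to find colloquial words that are missing from the thesaurus is to count the most frequent tokens that match neither word list. This is a sketch with hypothetical toy data; on real comments, stop words would also need to be filtered out before the counts are useful.

```r
# Toy segmented comments and word lists (hypothetical).
comments <- list(c("drunk", "good"), c("drunk", "torn"), c("meh", "drunk", "torn"))
positive <- c("good")
negative <- c("torn")

# Keep only tokens that appear in neither word list.
tokens <- unlist(comments)
unknown <- tokens[!(tokens %in% c(positive, negative))]

# Most frequent unknown tokens first; these are candidates for manual labeling.
freq <- sort(table(unknown), decreasing = TRUE)
print(head(freq, 20))
```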

As I said, the method needs improvement. It is only the most basic implementation of sentiment analysis. If you find any problems, corrections are welcome.

Please credit the source when reprinting. Thank you!


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
