Hotel Review Sentiment Analysis System (I) -- A Survey of Text Orientation Analysis
Question: Analyze the sentiment orientation of hotel reviews, that is, determine whether a review of a hotel is positive or negative, both in its overall evaluation and in its detailed evaluation of price, hygiene, service, and environment.
While I was taking the course "Search Engines", the teacher assigned a small project. I never touched this field as an undergraduate, so I can only explore it step by step as a beginner. Along the way I hope to gain a preliminary understanding of search engines, text orientation analysis, and online public opinion.
Because I am still in the exploratory stage, this article will inevitably contain some incorrect statements, wrong formulas, and misunderstandings, so please do not assume that everything in it is correct. If you find a problem, you are welcome to discuss it.
This part gives a survey of text orientation analysis.
I would like to thank the authors of the following papers for the theoretical background:
A. Li Xiaojun, Dai Lin, et al., A Survey of Text Orientation Analysis, Journal of Zhejiang University, 2011.07
B. Dan Dafu, Research and Implementation of Text Orientation Classification Technology Based on Network Comments, National University of Defense Technology, 2010.10
I. Definition and Main Tasks of Text Orientation Analysis
Definition: Sentiment classification mines the opinion or comment text that users write about something (such as a product) in order to determine whether that opinion is positive or negative toward it. Text sentiment is generally divided into two classes (positive and negative) or three classes (positive, negative, and neutral). The positive class covers text that expresses an affirmative (supportive, healthy) attitude and position; the negative class covers text that expresses a negating (opposed, unhealthy) attitude and position; the neutral class covers text whose attitude and position are neutral. Most current research considers only the two-class setting.
Main tasks: (1) find the words or phrases in the document that express sentiment;
(2) determine the polarity and strength of the orientation of those words or phrases;
(3) find the relationship between the extracted words or phrases and the topic.
II. Difference Between Text Orientation Analysis and Topic Mining
Text orientation analysis is a form of opinion mining. Compared with topic mining, it requires some degree of intelligent understanding of the text, and on that basis it extracts the author's opinions, emotions, attitudes, and similar information.
III. Main Process of Text Orientation Analysis
1) Collect and organize the raw material. Crawler tools are generally used to collect material at regular intervals. Open-source Java crawlers include, for example, Heritrix and Nutch.
2) Text preprocessing. Remove noise, filter out tags, and segment the collected material into words, so that the later steps work on cleaner text. For example, HTMLParser is a web page parsing tool with good fault tolerance, and ICTCLAS, developed by the Institute of Computing Technology of the Chinese Academy of Sciences, can be used for word segmentation.
3) Subjective text recognition. Use a pre-built corpus or a classifier to separate subjective text from objective text and remove the text that carries no sentiment, which improves accuracy.
4) Orientation determination for subjective text. The orientation of the subjective text is judged by simple statistical methods, by machine learning, or by relevance analysis based on a corpus.
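Steps 2) and 3) above can be sketched roughly as follows. This is only a minimal illustration under several assumptions: Python's standard `html.parser` stands in for a tag filter such as HTMLParser, a regular-expression tokenizer stands in for a real word segmenter such as ICTCLAS, and `SUBJECTIVE_WORDS` is a made-up toy word list rather than a real corpus or classifier.

```python
import re
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Tag filtering: keep only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()
    return " ".join(stripper.parts)


def tokenize(text):
    # Stand-in for a real word segmenter (assumption): lowercase English words only.
    return re.findall(r"[a-z']+", text.lower())


# Hypothetical toy list of sentiment-bearing words (not a real lexicon).
SUBJECTIVE_WORDS = {"clean", "dirty", "friendly", "rude", "great", "terrible"}


def is_subjective(tokens):
    # Crude subjective/objective test: does the text contain any sentiment word?
    return any(tok in SUBJECTIVE_WORDS for tok in tokens)


def preprocess(raw_reviews):
    """Strip tags, tokenize, and keep only the reviews judged subjective."""
    kept = []
    for raw in raw_reviews:
        tokens = tokenize(strip_tags(raw))
        if is_subjective(tokens):
            kept.append(tokens)
    return kept


raw = [
    "<p>The room was <b>clean</b> and the staff friendly.</p>",
    "<div>Check-in starts at 2 pm.</div>",
]
subjective_reviews = preprocess(raw)
```

Here the second review states only a fact, so it is dropped before the orientation step; in a real system the word-list test would be replaced by a trained subjectivity classifier.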
IV. Main Methods of Text Orientation Analysis
4.1 Semantic-based text orientation analysis
Current methods of this kind extract suitable words, compute an orientation value for each, and obtain the overall orientation of the text through simple statistics over those values.
A. Extract the adjectives or phrases that carry subjective coloring from the text, determine the orientation of each extracted adjective or phrase and assign it an orientation value, and finally sum all the values to obtain the overall orientation of the text.
B. Build a library of orientation semantic patterns, sometimes accompanied by an orientation dictionary. The document to be evaluated is then matched against the pattern library, and the orientation values of all matching patterns are accumulated to obtain the orientation of the whole document.
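Approach A can be sketched in a few lines. The dictionary `ORIENTATION`, its values, and the one-word negation rule are all illustrative assumptions, not taken from the papers above; a real system would use a much larger orientation dictionary and richer patterns.

```python
# Hypothetical orientation dictionary: word -> orientation value (sign = polarity,
# magnitude = strength). The entries and values are made up for illustration.
ORIENTATION = {
    "clean": 1.0,
    "friendly": 1.0,
    "excellent": 2.0,
    "dirty": -1.0,
    "noisy": -1.0,
    "rude": -1.5,
}

# Assumed simplification: a negation word flips the value of the next sentiment word.
NEGATIONS = {"not", "never", "no"}


def text_orientation(tokens):
    """Sum the orientation values of all sentiment words in a tokenized text."""
    total = 0.0
    for i, tok in enumerate(tokens):
        value = ORIENTATION.get(tok)
        if value is None:
            continue  # not a sentiment word
        if i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value
        total += value
    return total


def label(score):
    """Map the summed value to a three-class orientation label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


mixed = "the room was clean but the staff were rude".split()
negated = "the room was not dirty".split()
print(label(text_orientation(mixed)))    # 1.0 + (-1.5) = -0.5
print(label(text_orientation(negated)))  # -(-1.0) = 1.0
```

The first review sums to a negative value because "rude" is weighted more strongly than "clean", which shows why the strength of a word matters as well as its polarity.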
4.2 Machine learning-based text orientation analysis
Machine learning-based orientation classification works roughly as follows: manually label the orientation of a set of texts, extract a feature representation of each text, use the result as a training set, and build a classifier with a machine learning method. The text to be tested is then passed through the classifier to obtain its orientation class. Common feature representations include n-gram features, appraisal-phrase features, and single-word features. Common feature selection methods include mutual information (MI), information gain (IG), the chi-square statistic (CHI), and document frequency (DF). Common classification methods include centroid vector classification, KNN classification, perceptron classification, Bayesian classification, maximum entropy classification, and support vector machine classification.
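As one possible instance of this process, the sketch below trains a multinomial naive Bayes classifier on single-word features using only the Python standard library. The four training reviews are invented, add-one smoothing is an assumed choice, and the feature selection step (MI, IG, CHI, or DF) is left out to keep the example short.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over single-word features with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # number of documents per class
        self.word_counts = defaultdict(Counter)  # class -> word -> count
        for tokens, y in zip(docs, labels):
            self.word_counts[y].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for y, n_y in self.class_counts.items():
            total = sum(self.word_counts[y].values())
            score = math.log(n_y / n_docs)  # log prior
            for w in tokens:
                if w not in self.vocab:
                    continue  # ignore words never seen in training
                # Add-one smoothed log likelihood of word w given class y.
                score += math.log((self.word_counts[y][w] + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = y, score
        return best_label


# Tiny hand-labeled training set (invented for illustration).
train_docs = [
    "clean room friendly staff".split(),
    "great service clean bathroom".split(),
    "dirty room rude staff".split(),
    "terrible service noisy room".split(),
]
train_labels = ["positive", "positive", "negative", "negative"]

clf = NaiveBayes().fit(train_docs, train_labels)
print(clf.predict("friendly staff clean".split()))
print(clf.predict("rude and noisy".split()))
```

With a realistic corpus, a feature selection step would normally shrink the vocabulary before training, and any of the other classifiers listed above could be swapped in for naive Bayes.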
4.3 Relevance-based text orientation analysis
The basic idea of the relevance-based (similarity-based) method is close to the K-nearest neighbor method: a new sample is labeled according to the K labeled samples that are most similar to it. The similarity between two sentences is computed from the number of words and phrases they have in common and from the similarity of their words in a semantic dictionary.
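A minimal sketch of this idea follows. It measures sentence similarity only by word overlap (the Jaccard coefficient, an assumed choice) and leaves out the semantic-dictionary word similarity mentioned above; the labeled sentences are invented for illustration.

```python
from collections import Counter


def jaccard(a, b):
    """Word-overlap similarity: shared words divided by all distinct words."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def knn_orientation(tokens, labeled, k=3):
    """Label a new sentence by majority vote of its k most similar labeled sentences."""
    ranked = sorted(labeled, key=lambda item: jaccard(tokens, item[0]), reverse=True)
    votes = Counter(lab for _, lab in ranked[:k])
    return votes.most_common(1)[0][0]


# Tiny labeled sample set (invented for illustration).
labeled = [
    ("the room was clean and quiet".split(), "positive"),
    ("staff were friendly and helpful".split(), "positive"),
    ("breakfast was great".split(), "positive"),
    ("the room was dirty and noisy".split(), "negative"),
    ("staff were rude".split(), "negative"),
    ("the bathroom was dirty".split(), "negative"),
]

query = "the bathroom was dirty and noisy".split()
print(knn_orientation(query, labeled, k=3))
```

For this query the two closest neighbors are both negative sentences, so the vote is negative even though the third neighbor is positive. Adding dictionary-based word similarity would let near-synonyms such as "filthy" and "dirty" count as partial matches.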