Classic opinion mining algorithms (Text Mining Series)

I recently read a classic paper on opinion mining, "Mining and Summarizing Customer Reviews" (KDD '04), and am recording its algorithm here. The problem the paper solves is identifying the orientation of users' comments (positive or negative) toward each feature of an item. The following is an example summary for a digital camera:

Digital Camera:
feature: photo quality
positive: 253 <user comment sentences>
negative: 8 <user comment sentences>
feature:
positive: 135 <user comment sentences>
negative: 12 <user comment sentences>
......
Algorithm process
1. Main steps: <1> mine item features from users' comments; <2> recognize the sentiment of sentences containing opinions; <3> summarize the results. Compared with previous algorithms, the highlight of this one is that it determines sentiment per feature (feature-based), and its sentiment analysis works at the sentence level rather than the document level.
2. Mining process (step by step, easy to understand):
(1) Crawl reviews: the first step is a crawler that captures user comments from websites and stores them.
(2) POS tagging: use NLProcessor or the Stanford Parser (recommended) to POS-tag each user comment. The POS tagger marks each word with its part of speech, like this: i/FW recently/RB purchased/VBD the/DT canon/JJ powershot/NN g3/NN and/CC am/VBP extremely/RB satisfied/VBN with/IN the/DT purchase/NN ./. (The original sentence is: "I recently purchased the Canon PowerShot G3 and am extremely satisfied with the purchase.")
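The tagger's `word/TAG` output can be split into (word, tag) pairs for the later steps. A minimal sketch in Python (the `parse_tagged` helper is my own illustration, not part of the paper):

```python
def parse_tagged(tagged: str) -> list[tuple[str, str]]:
    """Split 'word/TAG' tokens (NLProcessor-style output) into pairs.

    rpartition splits at the LAST slash, which also handles the
    sentence-final token './.' correctly."""
    pairs = []
    for token in tagged.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

tagged = ("i/FW recently/RB purchased/VBD the/DT canon/JJ powershot/NN "
          "g3/NN and/CC am/VBP extremely/RB satisfied/VBN with/IN "
          "the/DT purchase/NN ./.")
pairs = parse_tagged(tagged)
# Nouns (tags starting with NN) are the candidate feature words for step (3).
nouns = [w for w, t in pairs if t.startswith("NN")]
print(nouns)  # ['powershot', 'g3', 'purchase']
```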
(3) Frequent feature identification: to mine the features of an item, find the feature words in the sentences; feature words are all nouns (in the example of step 2, NN stands for noun). First extract all the nouns and noun phrases tagged NN, then find the frequent feature set (a frequent feature is one whose proportion of occurrences is no less than the minimum support).
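As a sketch of the idea (the paper actually runs association-rule mining over sentence "transactions"; here I just count sentence frequency, and the example sentences are invented):

```python
from collections import Counter

def frequent_features(sentences, min_support=0.01):
    """sentences: list of noun lists, one per comment sentence.
    Returns the nouns whose sentence frequency is at least min_support
    of the total number of sentences (simple counting in place of the
    paper's association-rule miner)."""
    counts = Counter()
    for nouns in sentences:
        for n in set(nouns):          # count at most once per sentence
            counts[n] += 1
    total = len(sentences)
    return {f for f, c in counts.items() if c / total >= min_support}

sentences = [["picture", "quality"], ["picture"], ["battery"],
             ["picture", "battery"], ["lens"]]
print(sorted(frequent_features(sentences, min_support=0.4)))
# ['battery', 'picture']
```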
(4) Feature pruning: pruning is one of the most important steps for improving the algorithm's effectiveness, and this algorithm prunes twice.
A. Compactness pruning: the frequent feature set includes feature phrases. Because feature phrases are not extracted manually — two or more adjacent nouns are simply joined into a phrase — we must decide whether each phrase should be pruned. The rule is: when the nouns of a phrase all appear in a sentence, compute their distances in order; if the distance between every two adjacent nouns is no more than three words, the phrase is considered compact in that sentence. If fewer than two sentences contain a compact occurrence of a feature phrase, the phrase is removed, i.e. pruned. Personally, I think this pruning is not very useful: phrases in the frequent feature set already account for a relatively large share of all nouns, and with a large number of user comments there are generally more than two sentences containing each feature phrase, so this pruning has little effect on the frequent feature set.
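The compactness check can be sketched as follows (this is my reading of the rule — phrase words must occur in order with at most three words between adjacent pairs — and the example data is invented):

```python
def is_compact(phrase_words, sentence_words):
    """True if the phrase words occur in the sentence in order, with a
    distance of no more than three words between adjacent phrase words."""
    pos = -1
    for w in phrase_words:
        try:
            nxt = sentence_words.index(w, pos + 1)
        except ValueError:
            return False          # a phrase word is missing
        if pos >= 0 and nxt - pos > 3:
            return False          # adjacent phrase words too far apart
        pos = nxt
    return True

def compactness_prune(phrases, sentences, min_compact=2):
    """Keep a feature phrase only if it is compact in >= min_compact sentences."""
    return [p for p in phrases
            if sum(is_compact(p.split(), s.split())
                   for s in sentences) >= min_compact]

sentences = ["the picture quality is amazing",
             "picture quality could be better",
             "this battery never shows its quality"]
print(compactness_prune(["picture quality", "battery quality"], sentences))
# ['picture quality']
```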
B. Redundancy pruning: consider one situation. The camera feature phrase "picture quality" is very meaningful, but does it make sense to keep the two words "picture" and "quality" as separate features? Apparently not. Does "picture" alone mean the camera has high resolution, or good color? Without more context, "picture" says nothing definite about the camera's quality. Likewise, does "quality" mean the pictures are good, or the battery quality is good? It still does not say what is good or bad about the camera. Therefore, remove the meaningless single features contained in a feature phrase. The pruning rule: for a feature a that is contained in some feature phrase, count the number of sentences in which a appears on its own (sentences containing a phrase that contains a do not count); if this count is less than a threshold (set to 3 in the paper), feature a is removed. The effect of this pruning is quite noticeable.
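This "pure support" rule might be sketched like this (function names and example data are mine; the threshold of 3 follows the paper):

```python
def redundancy_prune(single_features, phrases, sentences, min_psupport=3):
    """Remove a single-word feature that mostly occurs only inside a
    feature phrase: its pure support counts the sentences containing
    the word but none of the phrases that contain it."""
    kept = []
    for ftr in single_features:
        supersets = [p for p in phrases if ftr in p.split()]
        if not supersets:
            kept.append(ftr)          # not part of any phrase: keep
            continue
        psupport = sum(1 for s in sentences
                       if ftr in s.split()
                       and not any(p in s for p in supersets))
        if psupport >= min_psupport:
            kept.append(ftr)
    return kept

sentences = ["picture quality is superb",
             "love the picture quality",
             "nice picture",
             "battery lasts long"]
print(redundancy_prune(["picture", "battery"], ["picture quality"], sentences))
# ['battery']
```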
(5) Opinion word extraction: in the sentences that contain frequent features, extract the adjective nearest to the feature as the opinion word. In POS tags, JJ denotes an adjective.
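A sketch of the nearest-adjective extraction over (word, tag) pairs (the helper name and example are my own):

```python
def nearest_adjective(tagged, feature):
    """tagged: list of (word, tag) pairs. Returns the adjective
    (tag starting with JJ) closest to the feature word, or None."""
    words = [w for w, _ in tagged]
    if feature not in words:
        return None
    fi = words.index(feature)
    adjs = [(abs(i - fi), w) for i, (w, t) in enumerate(tagged)
            if t.startswith("JJ")]
    return min(adjs)[1] if adjs else None

tagged = [("the", "DT"), ("picture", "NN"), ("quality", "NN"),
          ("is", "VBZ"), ("amazing", "JJ")]
print(nearest_adjective(tagged, "picture"))  # amazing
```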
(6) Opinion orientation identification: this step is one of the algorithm's biggest highlights. It recognizes the positive or negative semantics of opinion words — in effect, the user's sentiment toward a certain feature of the item. The method: preset 30 adjectives with obvious polarity as a seed list of opinion words with known orientation, e.g. positive: great, fantastic, nice, cool; negative: bad, disappointing, dull, and so on. For each opinion word o, first look up its synonyms. If a synonym exists in the seed list, its orientation is known, so give o the same orientation and add o to the seed list (expanding the seed list). If no synonym is in the seed list, look up o's antonyms; if an antonym exists in the seed list, give o the opposite orientation and add it to the seed list. If neither is found, move on to the next opinion word. The search iterates until the seed list no longer grows. Words whose orientation still cannot be determined are considered invalid.
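The seed-list expansion can be sketched as follows. In the paper, synonyms and antonyms come from WordNet; here I pass them in as plain dictionaries so the sketch stays self-contained, and the example words are invented:

```python
def predict_orientations(opinion_words, seed, synonyms, antonyms):
    """Iteratively assign 'positive'/'negative' orientation to opinion
    words using a seed list of known-polarity adjectives. Repeats until
    the seed list stops growing; unresolved words stay out of `seed`."""
    pending = set(opinion_words) - set(seed)
    grew = True
    while grew:
        grew = False
        for w in list(pending):
            for s in synonyms.get(w, ()):
                if s in seed:
                    seed[w] = seed[s]              # same orientation
                    break
            else:
                for a in antonyms.get(w, ()):
                    if a in seed:                  # opposite orientation
                        seed[w] = ("negative" if seed[a] == "positive"
                                   else "positive")
                        break
            if w in seed:
                pending.discard(w)
                grew = True
    return seed

seed = {"great": "positive", "bad": "negative"}
words = ["fantastic", "awesome", "awful", "weird"]
synonyms = {"fantastic": {"great"}, "awesome": {"fantastic"}}
antonyms = {"awful": {"great"}}
result = predict_orientations(words, seed, synonyms, antonyms)
print(result["awesome"], result["awful"])  # positive negative
```

Note that "awesome" only resolves through "fantastic", which itself must be resolved first — this is why the loop keeps iterating until the seed list stops growing.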
(7) Infrequent feature identification: consider one situation — some items have special characteristics that only some users pay special attention to. For example, some users care a lot about the user experience of the camera's software (software is one of the camera's features). These features should be identified to meet actual needs. Since the same opinion word can modify multiple features — for example, "amazing" can modify the camera's photo quality or its software — a simple and effective way to identify infrequent features is: in a sentence containing an opinion word, take the noun nearest to the opinion word as an infrequent feature.
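This rule mirrors step (5) with the roles swapped: nearest noun to an opinion word instead of nearest adjective to a feature. A sketch (names and example mine):

```python
def infrequent_feature(tagged, opinion_words):
    """tagged: list of (word, tag) pairs. If the sentence contains an
    opinion word, return the noun (NN*) nearest to it, else None."""
    for oi, (w, _) in enumerate(tagged):
        if w in opinion_words:
            nouns = [(abs(i - oi), nw) for i, (nw, nt) in enumerate(tagged)
                     if nt.startswith("NN")]
            return min(nouns)[1] if nouns else None
    return None

tagged = [("the", "DT"), ("software", "NN"), ("is", "VBZ"),
          ("amazing", "JJ")]
print(infrequent_feature(tagged, {"amazing"}))  # software
```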
(8) Opinion sentence orientation identification: the last important step is to determine the sentiment orientation of each sentence (note that only sentences containing opinion words are judged). The rules:
A. Extract all adjectives in the sentence and look up their orientation: add 1 for each adjective with positive orientation and subtract 1 for each with negative orientation. If the sentence's final score is greater than 0, its sentiment is positive; if less than 0, negative; if equal to 0, go to the next rule.
B. For a sentence whose score is exactly 0, judge only the adjectives that belong to the set of opinion words, rather than all adjectives, because opinion words better reflect the user's sentiment. As before, add or subtract 1: greater than 0 is positive, less than 0 is negative. What if the score is still 0?
C. If the score is still 0, the sentence takes the same orientation as the previous sentence, because users tend to either praise or criticize an item consistently within the same paragraph (which contains multiple sentences).
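Rules A-C can be sketched as one scoring function (a simplification — I represent the orientation lookup as a single word-to-score dict, and the example data is invented):

```python
def sentence_orientation(all_adjs, opinion_adjs, orientation, prev=1):
    """all_adjs: every adjective in the sentence; opinion_adjs: the
    subset that are opinion words; orientation: word -> +1/-1.
    Returns +1 (positive) or -1 (negative)."""
    score = sum(orientation.get(a, 0) for a in all_adjs)       # rule A
    if score == 0:                                             # rule B
        score = sum(orientation.get(a, 0) for a in opinion_adjs)
    if score == 0:                                             # rule C
        return prev       # same orientation as the previous sentence
    return 1 if score > 0 else -1

orientation = {"great": 1, "heavy": -1}
# "great picture but heavy body": rule A ties at 0,
# rule B falls back to the opinion word "great" alone.
print(sentence_orientation(["great", "heavy"], ["great"], orientation))
# 1
```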
(9) Summary generation: the final step is to count, for each feature, the number of positive and negative sentences it appears in, and then produce the summary in the form shown at the beginning of this article.
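Putting it together, the summary is just a per-feature tally of opinion sentences (sketch with invented data):

```python
from collections import defaultdict

def summarize(opinion_sentences):
    """opinion_sentences: (feature, orientation, sentence) triples with
    orientation +1/-1. Returns features with their positive/negative
    sentence lists, most-mentioned feature first."""
    buckets = defaultdict(lambda: {"positive": [], "negative": []})
    for feature, orient, sent in opinion_sentences:
        key = "positive" if orient > 0 else "negative"
        buckets[feature][key].append(sent)
    return sorted(buckets.items(),
                  key=lambda kv: -(len(kv[1]["positive"])
                                   + len(kv[1]["negative"])))

data = [("picture quality", 1, "really good picture quality"),
        ("picture quality", 1, "amazing picture clarity"),
        ("picture quality", -1, "the pictures come out hazy"),
        ("size", 1, "it fits in a pocket")]
for feature, groups in summarize(data):
    print(f"feature: {feature} "
          f"positive: {len(groups['positive'])} "
          f"negative: {len(groups['negative'])}")
```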
