Text Classification for Sentiment Analysis: Eliminating Low-Information Features

Source: Internet
Author: User
Tags: nltk

When your classification model has hundreds or thousands of features, as is usually the case with text classification, chances are that many (if not most) of those features carry little information. Low-information features are common across all classes, so they contribute little to the classification decision. Individually they are harmless, but in aggregate they drag down performance, which makes removing them a good choice.

Removing low-information features gives your model clarity by eliminating noisy data, and it can save you from overfitting and the curse of dimensionality. Using only the higher-information features improves performance while also shrinking the model, which means faster training, faster classification, and less memory. Deleting features may seem counter-intuitive, but wait until you see the results.

Selecting high-information features

Using the same evaluate_classifier method used for binary (pos/neg) classification in the previous articles, I evaluated the 10,000 most informative words and got the following results:

evaluating best word features
accuracy: 0.93
pos precision: 0.890909090909
pos recall: 0.98
neg precision: 0.977777777778
neg recall: 0.88
Most Informative Features
             magnificent = True              pos : neg    =     15.0 : 1.0
             outstanding = True              pos : neg    =     13.6 : 1.0
               insulting = True              neg : pos    =     13.0 : 1.0
              vulnerable = True              pos : neg    =     12.3 : 1.0
               ludicrous = True              neg : pos    =     11.8 : 1.0
                  avoids = True              pos : neg    =     11.7 : 1.0
             uninvolving = True              neg : pos    =     11.7 : 1.0
              astounding = True              pos : neg    =     10.3 : 1.0
             fascination = True              pos : neg    =     10.3 : 1.0
                 idiotic = True              neg : pos    =      9.8 : 1.0
Compare this with the sentiment classification from the first article, which used all words as features:

evaluating single word features
accuracy: 0.728
pos precision: 0.651595744681
pos recall: 0.98
neg precision: 0.959677419355
neg recall: 0.476
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
          astounding = True              pos : neg    =     10.3 : 1.0
         fascination = True              pos : neg    =     10.3 : 1.0
             idiotic = True              neg : pos    =      9.8 : 1.0
With only the 10,000 best words, accuracy is up by more than 20%, pos precision by nearly 24%, and neg recall by more than 40%. These are huge improvements, with no reduction in pos recall and even a slight increase in neg precision. The complete code behind these results, followed by an explanation, is below.

import collections, itertools
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews, stopwords
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist

def evaluate_classifier(featx):
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

    negcutoff = len(negfeats)*3/4
    poscutoff = len(posfeats)*3/4

    trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
    testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

    classifier = NaiveBayesClassifier.train(trainfeats)
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)

    for i, (feats, label) in enumerate(testfeats):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)

    print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
    print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
    print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
    print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
    print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
    classifier.show_most_informative_features()

def word_feats(words):
    return dict([(word, True) for word in words])

print 'evaluating single word features'
evaluate_classifier(word_feats)

word_fd = FreqDist()
label_word_fd = ConditionalFreqDist()

for word in movie_reviews.words(categories=['pos']):
    word_fd.inc(word.lower())
    label_word_fd['pos'].inc(word.lower())

for word in movie_reviews.words(categories=['neg']):
    word_fd.inc(word.lower())
    label_word_fd['neg'].inc(word.lower())

# n_ii = label_word_fd[label][word]
# n_ix = word_fd[word]
# n_xi = label_word_fd[label].N()
# n_xx = label_word_fd.N()

pos_word_count = label_word_fd['pos'].N()
neg_word_count = label_word_fd['neg'].N()
total_word_count = pos_word_count + neg_word_count

word_scores = {}

for word, freq in word_fd.iteritems():
    pos_score = BigramAssocMeasures.chi_sq(label_word_fd['pos'][word],
        (freq, pos_word_count), total_word_count)
    neg_score = BigramAssocMeasures.chi_sq(label_word_fd['neg'][word],
        (freq, neg_word_count), total_word_count)
    word_scores[word] = pos_score + neg_score

best = sorted(word_scores.iteritems(), key=lambda (w,s): s, reverse=True)[:10000]
bestwords = set([w for w, s in best])

def best_word_feats(words):
    return dict([(word, True) for word in words if word in bestwords])

print 'evaluating best word features'
evaluate_classifier(best_word_feats)

def best_bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    d = dict([(bigram, True) for bigram in bigrams])
    d.update(best_word_feats(words))
    return d

print 'evaluating best words + bigram chi_sq word features'
evaluate_classifier(best_bigram_word_feats)

Computing information gain

To find the most informative features, we need to calculate the information gain of each word. Information gain for classification measures how common a feature is in one particular class compared to how common it is in all the other classes. A word that appears mostly in positive movie reviews and rarely in negative ones carries a lot of information. For example, the presence of the word "magnificent" in a movie review is a strong indicator that the review is positive, which makes "magnificent" a high-information word. Note that the most informative features have not changed from the earlier run; this makes sense, because this approach simply keeps the most informative features and ignores the rest.
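To make this concrete, here is a minimal sketch (with made-up counts that are purely illustrative, not taken from the corpus) of how a single word's association with the pos class can be scored with NLTK's chi-square function. The four counts correspond to the n_ii / n_ix / n_xi / n_xx comments in the code above.

from nltk.metrics import BigramAssocMeasures

# Illustrative (made-up) counts for a word such as "magnificent":
n_ii = 93         # times the word occurs in pos reviews
n_ix = 100        # times the word occurs in the whole corpus
n_xi = 600000     # total number of words in pos reviews
n_xx = 1200000    # total number of words in the whole corpus

pos_score = BigramAssocMeasures.chi_sq(n_ii, (n_ix, n_xi), n_xx)
print(pos_score)  # a high score means the word is strongly associated with pos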

One of the best metrics for information gain is chi-square. NLTK includes it in the BigramAssocMeasures class of the metrics package. To use it, we need two frequencies for each word: its overall frequency and its frequency within each class. A FreqDist holds the overall frequency of each word, and a ConditionalFreqDist holds the frequencies conditioned on the class label. Once we have these numbers, we can score each word with the BigramAssocMeasures.chi_sq function, sort the words by score, and take the top 10,000. We then put those words into a set and use a set-membership test in our feature-selection function to keep only the words that appear in the set. Every file is now classified based on these high-information words alone.
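For reference, on current NLTK (3.x) and Python 3, where FreqDist.inc() and dict.iteritems() no longer exist, the scoring-and-selection step described above might look roughly like the following sketch. It follows the same procedure but is not the original article's code.

from nltk.corpus import movie_reviews
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist

word_fd = FreqDist()                    # overall word frequencies
label_word_fd = ConditionalFreqDist()   # word frequencies per class label

for word in movie_reviews.words(categories=['pos']):
    word_fd[word.lower()] += 1
    label_word_fd['pos'][word.lower()] += 1

for word in movie_reviews.words(categories=['neg']):
    word_fd[word.lower()] += 1
    label_word_fd['neg'][word.lower()] += 1

pos_word_count = label_word_fd['pos'].N()
neg_word_count = label_word_fd['neg'].N()
total_word_count = pos_word_count + neg_word_count

# Score every word by summing its chi-square association with each class
word_scores = {}
for word, freq in word_fd.items():
    pos_score = BigramAssocMeasures.chi_sq(label_word_fd['pos'][word],
        (freq, pos_word_count), total_word_count)
    neg_score = BigramAssocMeasures.chi_sq(label_word_fd['neg'][word],
        (freq, neg_word_count), total_word_count)
    word_scores[word] = pos_score + neg_score

# Keep only the 10,000 highest-scoring words
bestwords = set(sorted(word_scores, key=word_scores.get, reverse=True)[:10000])

def best_word_feats(words):
    return {word: True for word in words if word in bestwords}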

The code above also evaluates the best words combined with the 200 most significant bigrams. The results are as follows:

evaluating best words + bigram chi_sq word features
accuracy: 0.92
pos precision: 0.913385826772
pos recall: 0.928
neg precision: 0.926829268293
neg recall: 0.912
Most Informative Features
             magnificent = True              pos : neg    =     15.0 : 1.0
             outstanding = True              pos : neg    =     13.6 : 1.0
               insulting = True              neg : pos    =     13.0 : 1.0
              vulnerable = True              pos : neg    =     12.3 : 1.0
       ('matt', 'damon') = True              pos : neg    =     12.3 : 1.0
          ('give', 'us') = True              neg : pos    =     12.3 : 1.0
               ludicrous = True              neg : pos    =     11.8 : 1.0
             uninvolving = True              neg : pos    =     11.7 : 1.0
                  avoids = True              pos : neg    =     11.7 : 1.0
    ('absolutely', 'no') = True              neg : pos    =     10.6 : 1.0
This shows that bigrams don't matter much when you are already using only high-information words. In this case, the best way to judge the difference between including bigrams or not is to look at precision and recall. With bigrams, performance is more even across the two classes; without them, precision and recall are less balanced. However, the difference may depend on your particular data, so don't assume these observations always hold.
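As a quick sanity check, you can run the combined feature extractor on a single review and inspect what it produces. This sketch assumes best_bigram_word_feats and bestwords from the code above are already defined in your session.

from nltk.corpus import movie_reviews

# Build the feature dictionary for the first positive review
fileid = movie_reviews.fileids('pos')[0]
feats = best_bigram_word_feats(movie_reviews.words(fileids=[fileid]))

# Unigram features are plain strings, bigram features are (word, word) tuples
print(len(feats))
print([f for f in feats if isinstance(f, tuple)][:10])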

The biggest lesson here is that improved feature selection improves your classifier. Reducing dimensionality is one of the best things you can do for classifier performance. More data isn't always better, especially when that data doesn't add value, and sometimes extra data actually makes your model worse.


Original article: http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/
