Accuracy is not the only metric for evaluating the effectiveness of a classifier. Two other useful metrics are precision and recall. They give far more insight into the performance characteristics of a binary classifier.
The precision of a classifier measures its exactness. Higher precision means fewer false positives, while lower precision means more false positives. This is often at odds with recall, because an easy way to improve precision is to decrease recall.
The recall of a classifier measures its completeness, or sensitivity. Higher recall means fewer false negatives, while lower recall means more false negatives. Improving recall often decreases precision, because it becomes increasingly hard to be precise as the sample space grows.
Precision and recall can be combined into a single metric called the F-measure, which is the weighted harmonic mean of precision and recall. I find the F-measure to be about as useful as accuracy. In other words, compared with precision and recall, the F-measure is mostly useless, as you will see below.
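To make the two definitions concrete, here is a minimal sketch that computes both from raw counts of true positives (tp), false positives (fp), and false negatives (fn). The function names and all the counts are illustrative assumptions, not part of NLTK or of the experiment in this article.

```python
def precision(tp, fp):
    """Fraction of predicted positives that really are positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that the classifier found."""
    return tp / (tp + fn)

# Hypothetical test set with 100 actual positives.
# A cautious classifier labels few items positive.
cautious = {"tp": 60, "fp": 5, "fn": 40}
# A liberal classifier labels almost everything positive.
liberal = {"tp": 98, "fp": 70, "fn": 2}

for name, c in (("cautious", cautious), ("liberal", liberal)):
    p = precision(c["tp"], c["fp"])
    r = recall(c["tp"], c["fn"])
    print(name, "precision:", round(p, 3), "recall:", round(r, 3))
```

In this made-up case the cautious classifier is precise but misses many positives, while the liberal one finds nearly all positives at the cost of many false positives, which is the trade-off described above.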
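The weighted harmonic mean can be written out directly. The sketch below assumes the same weighting convention as NLTK's f_measure, where alpha defaults to 0.5 and the result reduces to 2PR / (P + R). The helper itself is a stand-in written for illustration, and the input values are made up.

```python
def f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean of precision and recall.

    alpha=0.5 weights both equally, giving 2PR / (P + R).
    """
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (alpha / precision + (1 - alpha) / recall)

# Equal precision and recall give back the same value.
print(f_measure(0.8, 0.8))
# A large gap between the two collapses into one middling number.
print(f_measure(0.65, 0.98))
```

Because the two inputs are folded into one number, the F value alone cannot tell you whether a classifier errs toward false positives or false negatives.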
The nltk.metrics module provides functions for calculating precision, recall, and F-measure. To use them, you need to build two sets for each classification label: a reference set of correct values and a test set of observed values. Below is a modified version of the code from the previous article, where we trained a Naive Bayes classifier. This time, instead of measuring accuracy, we will collect reference values and observed values for each label (pos or neg), and then use those sets to calculate the precision, recall, and F-measure of the Naive Bayes classifier. The values actually collected are simply the index of each feature set, obtained with enumerate.
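Before the full script, a toy version may help show how the reference and test sets are built from indexes. The labels below are invented, and set_precision and set_recall are stand-ins written for this example: they use the same set-intersection idea as nltk.metrics.precision and nltk.metrics.recall, but they are not the library code.

```python
import collections

# Hypothetical gold labels and classifier outputs for six test items.
gold = ["pos", "pos", "pos", "neg", "neg", "neg"]
predicted = ["pos", "pos", "neg", "pos", "neg", "neg"]

refsets = collections.defaultdict(set)   # label -> indexes that truly have it
testsets = collections.defaultdict(set)  # label -> indexes predicted to have it

for i, (label, observed) in enumerate(zip(gold, predicted)):
    refsets[label].add(i)
    testsets[observed].add(i)

def set_precision(reference, test):
    """Share of the predicted indexes that are also in the reference set."""
    return len(reference & test) / len(test)

def set_recall(reference, test):
    """Share of the reference indexes that were also predicted."""
    return len(reference & test) / len(reference)

print("pos reference:", sorted(refsets["pos"]))
print("pos predicted:", sorted(testsets["pos"]))
print("pos precision:", set_precision(refsets["pos"], testsets["pos"]))
print("pos recall:", set_recall(refsets["pos"], testsets["pos"]))
```

Storing indexes rather than the feature sets themselves keeps the sets small and makes the intersection unambiguous, since every test item has a unique position.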
import collections
import nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # Bag of words: every word that occurs in the review maps to True.
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# Use the first 3/4 of each class for training and the rest for testing.
# Integer division keeps the cutoffs usable as slice indexes in Python 3.
negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

classifier = NaiveBayesClassifier.train(trainfeats)
refsets = collections.defaultdict(set)   # label -> indexes with that correct label
testsets = collections.defaultdict(set)  # label -> indexes classified with that label

for i, (feats, label) in enumerate(testfeats):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)

print('pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos']))
print('pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos']))
print('pos F-measure:', nltk.metrics.f_measure(refsets['pos'], testsets['pos']))
print('neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg']))
print('neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg']))
print('neg F-measure:', nltk.metrics.f_measure(refsets['neg'], testsets['neg']))
Precision and Recall for Positive and Negative Reviews
I found the results quite interesting:
pos precision: 0.651595744681
pos recall: 0.98
pos F-measure: 0.782747603834
neg precision: 0.959677419355
neg recall: 0.476
neg F-measure: 0.636363636364
So what does this mean?
- Nearly every file that is pos is correctly identified as such, with 98% recall. This means there are very few false negatives in the pos class.
- However, a file given a pos classification is only 65% likely to be correct. This mediocre precision means 35% of the pos classifications are false positives.
- Any file identified as neg is 96% likely to be correct (high precision). This means there are very few false positives for the neg class.
- However, many files that are neg are incorrectly classified. The low recall means 52% of the neg files are false negatives for the neg label.
- The F-measure provides no useful information. There is no insight to be gained from having it, and we would not lose any knowledge if it were taken away.
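The percentages in the list above can be turned back into file counts. The sketch below assumes the test set holds 250 pos and 250 neg files, that is, one quarter of the 1,000 reviews per class in the movie_reviews corpus. The counts in the comments are derived from the reported recall figures under that assumption; they are not printed by the original script.

```python
# Assumed test-set sizes: one quarter of 1,000 reviews per class.
n_pos, n_neg = 250, 250

pos_recall = 0.98   # reported in the results
neg_recall = 0.476  # reported in the results

# Correctly classified files per class.
pos_as_pos = round(pos_recall * n_pos)   # 245
neg_as_neg = round(neg_recall * n_neg)   # 119

# Misclassified files per class.
pos_as_neg = n_pos - pos_as_pos          # 5 pos files labelled neg
neg_as_pos = n_neg - neg_as_neg          # 131 neg files labelled pos

# Precision divides by everything that received the label.
pos_precision = pos_as_pos / (pos_as_pos + neg_as_pos)
neg_precision = neg_as_neg / (neg_as_neg + pos_as_neg)
accuracy = (pos_as_pos + neg_as_neg) / (n_pos + n_neg)

print("pos precision:", pos_precision)
print("neg precision:", neg_precision)
print("accuracy:", accuracy)
```

Under these assumptions the derived precisions agree with the reported 0.6516 and 0.9597, and 131 of the 136 mistakes are neg files labelled pos. A single accuracy figure would hide how lopsided those errors are.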
Better feature selection may improve these results. One possible explanation for the results above is that people use normally positive words in negative reviews, but precede them with "not" (or some other negative word), as in "not great". Because the classifier uses the bag-of-words model, which assumes that every word is independent, it cannot learn that "not great" is negative. If this is the case, these metrics should improve if we also train on multiple words, a topic I will explore in a future article.
Another possibility is the abundance of naturally neutral words, the kind of words that carry no sentiment. The classifier treats all words the same, however, and has to assign each word to either pos or neg. So neutral or meaningless words may be ending up in the pos class because the classifier does not know what else to do with them. If this is the case, the metrics should improve if we eliminate neutral or meaningless words from the feature sets and classify using only sentiment-rich words. This is usually done with the concept of information gain, also known as mutual information, to improve feature selection, which I will also explore in a future article.
If you have your own theories to explain the results, or ideas on how to improve precision and recall, please share them in the comments.
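As a rough illustration of what training on multiple words could look like, the sketch below extends the bag-of-words feature extractor with adjacent word pairs. The helper bigram_word_feats is hypothetical and assumes words is a plain list; it is not necessarily the approach taken in the follow-up article.

```python
def bigram_word_feats(words):
    """Bag-of-words features plus adjacent word pairs (hypothetical helper)."""
    feats = {word: True for word in words}
    # Pair each word with its successor so a phrase such as
    # ("not", "great") becomes a feature of its own.
    for pair in zip(words, words[1:]):
        feats[pair] = True
    return feats

feats = bigram_word_feats(["the", "plot", "was", "not", "great"])
print(("not", "great") in feats)
print(sorted(k for k in feats if isinstance(k, tuple)))
```

With the pair present as its own feature, a classifier has a chance to learn that the phrase leans negative even though one of its words leans positive on its own.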
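To show how information gain separates sentiment-rich words from neutral ones, here is a small self-contained sketch. It treats each document as a set of words, measures how much knowing a word's presence reduces the entropy of the label, and keeps only the high-scoring words. The toy documents, the 0.5 threshold, and the function names are all assumptions made for illustration; NLTK is not used here.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of a collection of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(word, docs):
    """Reduction in label entropy from knowing whether `word` is present.

    docs is a list of (set_of_words, label) pairs.
    """
    with_word = Counter(label for words, label in docs if word in words)
    without_word = Counter(label for words, label in docs if word not in words)
    base = entropy(Counter(label for _, label in docs).values())
    conditional = sum(
        sum(part.values()) / len(docs) * entropy(part.values())
        for part in (with_word, without_word)
        if part  # skip an empty split
    )
    return base - conditional

# Toy corpus: "the" and "film" appear equally in both classes.
docs = [
    ({"the", "great"}, "pos"),
    ({"the", "great", "film"}, "pos"),
    ({"the", "awful"}, "neg"),
    ({"the", "awful", "film"}, "neg"),
]

vocab = set().union(*(words for words, _ in docs))
scores = {word: information_gain(word, docs) for word in vocab}
informative = {word for word, score in scores.items() if score > 0.5}
print(sorted(informative))
```

In this toy corpus the neutral words score zero and drop out, leaving only the words that actually distinguish the classes. On a real corpus the scores fall on a continuum, so the cutoff becomes a tuning choice.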
Original article: http://streamhacker.com/2010/05/17/text-classification-sentiment-analysis-precision-recall/
Text Classification for Sentiment Analysis: Precision and Recall