Classify the sentiment of sentences from the Rotten Tomatoes dataset
Competition link: https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews
I like IPython Notebook more and more. All of the following work can be done on one page, and it runs better in Firefox than in Chrome.
The data is split into train.tsv and test.tsv. Fields are delimited by \t, and each row of the training file has four fields: PhraseId, SentenceId, Phrase, Sentiment.
Sentiment labels:
0 - negative
1 - somewhat negative
2 - neutral
3 - somewhat positive
4 - positive
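The label scheme above can be kept in a small lookup table so that numeric predictions are easier to read. This is only a minimal sketch; the names SENTIMENT_LABELS and label_name are illustrative and do not come from the competition or from any library.

```python
# Mapping from the competition's numeric sentiment codes to their names.
# SENTIMENT_LABELS and label_name are hypothetical helper names.
SENTIMENT_LABELS = {
    0: "negative",
    1: "somewhat negative",
    2: "neutral",
    3: "somewhat positive",
    4: "positive",
}


def label_name(code):
    """Return the readable name for a numeric sentiment code."""
    return SENTIMENT_LABELS[int(code)]


print(label_name(3))  # somewhat positive
```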
import pandas as pd
df = pd.read_csv('train.tsv', header=0, delimiter='\t')
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 156060 entries, 0 to 156059
Data columns (total 4 columns):
PhraseId      156060 non-null int64
SentenceId    156060 non-null int64
Phrase        156060 non-null object
Sentiment     156060 non-null int64
dtypes: int64(3), object(1)
df.head()
Out[6]:
   PhraseId  SentenceId                                             Phrase  Sentiment
0         1           1  A series of escapades demonstrating the adage ...          1
1         2           1  A series of escapades demonstrating the adage ...          2
2         3           1                                           A series          2
3         4           1                                                  A          2
4         5           1                                             series          2
In []: df.Sentiment.value_counts() / df.Sentiment.count()
Out[13]:
2    0.509945
3    0.210989
1    0.174760
4    0.058990
0    0.045316
dtype: float64
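The distribution above implies a simple baseline: always predicting the most frequent class (2, neutral) would already be right about 51% of the time, so any classifier should be judged against that figure. The sketch below computes class shares and the majority-class baseline with the standard library only. The toy label list and the function names are assumptions made up for illustration; they are not the real data.

```python
from collections import Counter


def class_shares(labels):
    """Return {label: fraction of rows}, like value_counts() / count()."""
    counts = Counter(labels)
    total = float(len(labels))
    return {label: n / total for label, n in counts.items()}


def majority_baseline(labels):
    """Return (most common label, accuracy of always predicting it)."""
    label, n = Counter(labels).most_common(1)[0]
    return label, n / float(len(labels))


# Toy labels (made up for illustration, not the Kaggle data).
toy_labels = [2, 2, 2, 3, 1, 2, 4, 0, 3, 2]
shares = class_shares(toy_labels)
baseline_label, baseline_acc = majority_baseline(toy_labels)
print(shares)
print(baseline_label, baseline_acc)
```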
Check the classification accuracy directly on the first 5 rows of the training set:
X_train = df['Phrase']
y_train = df['Sentiment']
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
# bag-of-words counts -> tf-idf weighting -> logistic regression
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression()),
])
text_clf = text_clf.fit(X_train, y_train)
# evaluate on the first five training rows
X_test = df.head()['Phrase']
predicted = text_clf.predict(X_test)
print(np.mean(predicted == df.head()['Sentiment']))
for phrase, sentiment in zip(X_test, predicted):
    print('%r => %s' % (phrase, sentiment))
Classification accuracy and results:
0.8
'A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .' => 3
'A series of escapades demonstrating the adage that what is good for the goose' => 2
'A series' => 2
'A' => 2
'series' => 2
df.head()['Sentiment']
0    2
Only the first phrase is misclassified, which is why the accuracy is 0.8 (4 of 5 correct).
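The accuracy figure is just the fraction of positions where the prediction equals the label. A minimal sketch of that computation follows, using the five predictions printed above and the labels listed in the df.head() table. The function name accuracy is an illustrative assumption.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the prediction matches the label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / float(len(actual))


predicted = [3, 2, 2, 2, 2]  # model output for the first five phrases
actual = [1, 2, 2, 2, 2]     # Sentiment column of the df.head() table
print(accuracy(predicted, actual))  # 0.8
```

Because these five phrases were also part of the training data, this number says little about how the model performs on unseen text.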
Test data set:
test_df = pd.read_csv('test.tsv', header=0, delimiter='\t')
test_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 66292 entries, 0 to 66291
Data columns (total 3 columns):
PhraseId      66292 non-null int64
SentenceId    66292 non-null int64
Phrase        66292 non-null object
dtypes: int64(2), object(1)
Use the trained model to classify the test data set:
from numpy import savetxt
X_test = test_df['Phrase']
phraseIds = test_df['PhraseId']
predicted = text_clf.predict(X_test)
# PhraseIds are rebuilt from a fixed offset: 156061 is the first id after the 156060 training rows
pred = [[index + 156061, x] for index, x in enumerate(predicted)]
savetxt('../submissions/lr_benchmark.csv', pred, delimiter=',', fmt='%d,%d', header='PhraseId,Sentiment', comments='')
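The submission code above derives each PhraseId from a fixed offset (156061). A slightly more defensive variant, sketched below, pairs every prediction with the PhraseId read from the test file, so the output stays correct even if the ids turn out not to be contiguous. The function name write_submission, the sample ids, and the sample predictions are assumptions for illustration only.

```python
import csv
import io


def write_submission(handle, phrase_ids, predictions):
    """Write PhraseId,Sentiment rows to an open text file handle."""
    if len(phrase_ids) != len(predictions):
        raise ValueError("need exactly one prediction per PhraseId")
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["PhraseId", "Sentiment"])
    for phrase_id, sentiment in zip(phrase_ids, predictions):
        writer.writerow([int(phrase_id), int(sentiment)])


# Made-up ids and predictions, written to an in-memory buffer for the demo.
buffer = io.StringIO()
write_submission(buffer, [156061, 156062, 156063], [2, 3, 1])
submission_text = buffer.getvalue()
print(submission_text)
```

With the pipeline above, the call would look something like write_submission(f, test_df['PhraseId'], predicted), where f is a file opened for writing.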
Submission result:
Reference: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
-- Kaggle competition: Sentiment Analysis on Movie Reviews