Movie Review Text Sentiment Classification
GitHub Address
Kaggle Address
The task is to classify the sentiment of movie review text. Each review is either positive or negative, so this is a binary classification problem, and common models such as naive Bayes or logistic regression are reasonable choices. One of the challenges is turning the text into vectors, so we first try a TF-IDF based vectorization and then try Word2vec.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import re
from bs4 import BeautifulSoup

def review_to_wordlist(review):
    '''
    Turn an IMDB review into a sequence of words.
    Reference: http://blog.csdn.net/longxinchen_ml/article/details/50629613
    '''
    # Strip the HTML tags and keep only the text content
    review_text = BeautifulSoup(review, "html.parser").get_text()
    # Replace every non-letter character with a space
    review_text = re.sub("[^a-zA-Z]", " ", review_text)
    # Lowercase everything and split into a list of words
    words = review_text.lower().split()
    # Return the word list
    return words
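To see why the cleaned reviews further down contain fragments such as "mj s" and lose all numbers, here is a minimal sketch of just the regular-expression and lowercase/split steps. The helper name `letters_only_words` and the sample sentence are made up for illustration, and the HTML stripping done by BeautifulSoup is left out so the snippet only needs the standard library.

```python
import re


def letters_only_words(text):
    # Hypothetical helper: mirrors the regex and lowercase/split steps
    # of review_to_wordlist, without the HTML stripping.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    return letters_only.lower().split()


# Made-up sample sentence with an apostrophe, a number and punctuation.
sample = "MJ's film is only on for 20 minutes, m'kay!"
words = letters_only_words(sample)
print(words)
print(" ".join(words))
```

Apostrophes, digits and punctuation all become spaces, so "MJ's" turns into the two tokens "mj" and "s", and "20" disappears entirely.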
Load Data Set
# Load the data sets
train = pd.read_csv('/Users/frank/Documents/workspace/kaggle/dataset/Bag_of_Words_Meets_Bags_of_Popcorn/labeledTrainData.tsv', header=0, delimiter="\t", quoting=3)
test = pd.read_csv('/Users/frank/Documents/workspace/kaggle/dataset/Bag_of_Words_Meets_Bags_of_Popcorn/testData.tsv', header=0, delimiter="\t", quoting=3)
print(train.head())
print(test.head())
         id  sentiment                                             review
0  "5814_8"          1  "With all this stuff going down at the moment ...
1  "2381_9"          1  "\"The Classic War of the Worlds\" by Timothy ...
2  "7759_3"          0  "The film starts with a manager (Nicholas Bell...
3  "3630_4"          0  "It must be assumed that those who praised thi...
4  "9495_8"          1  "Superbly trashy and wondrously unpretentious ...
           id                                             review
0  "12311_10"  "Naturally in a film who's main themes are of ...
1    "8348_2"  "This movie is a disaster within a disaster fi...
2    "5828_4"  "All in all, this is a movie for kids. We saw ...
3    "7186_2"  "Afraid of the Dark left me with the impressio...
4   "12128_7"  "A very accurate depiction of small time mob l...
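The `quoting=3` argument passed to `read_csv` above is the numeric value of `csv.QUOTE_NONE`: pandas treats double quotes as ordinary characters, which is why the ids and reviews in the preview still carry their quotes. The sketch below shows the difference on two made-up rows laid out like the Kaggle file; the in-memory string stands in for the real TSV and is purely an assumption for illustration.

```python
import csv
import io

import pandas as pd

# Two invented rows in the same tab-separated, quoted layout.
sample_tsv = (
    'id\tsentiment\treview\n'
    '"1_8"\t1\t"Great film"\n'
    '"2_3"\t0\t"Dull and slow"\n'
)

# quoting=3 (csv.QUOTE_NONE): quote characters are kept as part of the value.
raw = pd.read_csv(io.StringIO(sample_tsv), header=0, delimiter="\t",
                  quoting=csv.QUOTE_NONE)

# Default quoting: surrounding quotes are stripped while parsing.
default = pd.read_csv(io.StringIO(sample_tsv), header=0, delimiter="\t")

print(raw["review"][0])
print(default["review"][0])
```

Keeping the quotes is harmless here because the later cleaning step replaces every non-letter character with a space anyway.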
Preprocess the Data
# Preprocess the data
label = train['sentiment']
train_data = []
for i in range(len(train['review'])):
    train_data.append(' '.join(review_to_wordlist(train['review'][i])))
test_data = []
for i in range(len(test['review'])):
    test_data.append(' '.join(review_to_wordlist(test['review'][i])))

# Preview the data
print(train_data[0] + '\n')
print(test_data[0])
with all this stuff going down at the moment with mj i ve started listening to his music watching the odd documentary here and there watched the wiz and watched moonwalker again maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent moonwalker is part biography part feature film which i remember going to see at the cinema when it was originally released some of it has subtle messages about mj s feeling towards the press and also the obvious message of drugs are bad m kay visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring some may call mj an egotist for consenting to the making of this movie but mj and most of his fans would say that he made it for the fans which if true is really nice of him the actual feature film bit when it finally starts is only on for minutes or so excluding the smooth criminal sequence and joe pesci is convincing as a psychopathic all powerful drug lord why he wants mj dead so bad is beyond me because mj overheard his plans nah joe pesci s character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno maybe he just hates mj s music lots of cool things in this like mj turning into a car and a robot and the whole speed demon sequence also the director must have had the patience of a saint when it came to filming the kiddy bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene bottom line this movie is for people who like mj on one level or another which i think is most people if not then stay away it does try and give off a wholesome message and ironically mj s bestest buddy in this movie is a girl michael jackson is truly one of the most talented people ever to grace this planet but is he guilty well with all the attention i ve gave this subject hmmm well i don t know because people can be different behind closed doors i know this for a fact he is either an extremely nice but stupid guy or one of the most sickest liars i hope he is not the latter

naturally in a film who s main themes are of mortality nostalgia and loss of innocence it is perhaps not surprising that it is rated more highly by older viewers than younger ones however there is a craftsmanship and completeness to the film which anyone can enjoy the pace is steady and constant the characters full and engaging the relationships and interactions natural showing that you do not need floods of tears to show emotion screams to show fear shouting to show dispute or violence to show anger naturally joyce s short story lends the film a ready made structure as perfect as a polished diamond but the small changes huston makes such as the inclusion of the poem fit in neatly it is truly a masterpiece of tact subtlety and overwhelming beauty
Feature Processing
A computer cannot compute on raw word text directly, so we need to convert the text into vectors. There are several common ways to vectorize text, for example:
Word count
TF-IDF vector
Word2vec vector
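To make the first two options concrete, here is a small sketch that vectorizes three toy sentences with scikit-learn's `CountVectorizer` and `TfidfVectorizer`. The sentences are invented, and both vectorizers run with their default settings, which are not necessarily the parameters used in this post.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three invented mini "reviews".
docs = ["good movie", "bad movie", "good good fun"]

# Word count: each column is a vocabulary word, each cell a raw count.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()
vocab = sorted(count_vec.vocabulary_)  # columns are in alphabetical order
print(vocab)   # ['bad', 'fun', 'good', 'movie']
print(counts)

# TF-IDF: counts are reweighted so that words appearing in fewer
# documents get more weight; each row is L2-normalized by default.
tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(docs).toarray()
print(weights.round(2))
```

In the second sentence, "bad" receives a higher weight than "movie" because "movie" also appears in another sentence and is therefore less distinctive.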
Let's try it first with TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF
# Reference: http://blog.csdn.net/longxinchen_ml/article/details/50629613
tfidf = TFIDF(min_df=