Goodreads: Machine Learning (Part 3)

Source: Internet
Author: User
Tags: xgboost

In the first installment of this series, we scraped reviews from Goodreads. In the second one, we performed exploratory data analysis and created new variables. We are now ready for the main dish: machine learning!

Setup and general data prep

Let's start by loading the libraries and our dataset.

library(data.table)
library(dplyr)
library(caret)
library(RTextTools)
library(xgboost)
library(ROCR)

setwd("C:/users/florent/desktop/data_analysis_applications/goodreads_textmining")
data <- read.csv("GoodReadsCleanData.csv", stringsAsFactors = FALSE)

To recap, at this point, we had the following features in our dataset:

  • review.id
  • book
  • rating
  • review
  • review.length
  • mean.sentiment
  • median.sentiment
  • count.afinn.positive
  • count.afinn.negative
  • count.bing.negative
  • count.bing.positive

For this example, we'll simplify the analysis by collapsing the 1 to 5 stars rating into a binary variable: whether the book is rated a "good read" (4 or 5 stars) or not (1 to 3 stars). This will allow us to use classification algorithms, and to have less unbalanced categories.

set.seed(1234)

# Creating the outcome variable
data$good.read <- 0
data$good.read[data$rating == 4 | data$rating == 5] <- 1

The "good reads", or positive reviews, represent about 85% of the dataset, and the "bad reads", or negative reviews, with good.read == 0, about 15%. We then create the train and test subsets. The dataset is still fairly unbalanced, so we don't just randomly assign data points to the train and test datasets; we make sure to preserve the percentage of good reads in each subset by using the caret function createDataPartition for stratified sampling.

trainIdx <- createDataPartition(data$good.read,
                                p = .75,
                                list = FALSE,
                                times = 1)
train <- data[trainIdx, ]
test <- data[-trainIdx, ]
Creating the document-term matrices (DTM)

Our goal is to use the frequency of individual words in the reviews as features in our machine learning algorithms. In order to do so, we need to start by counting the number of occurrences of each word in each review. Fortunately, there are tools to do just that, which return a convenient "document-term matrix" (DTM), with the reviews in rows and the words in columns; each entry in the matrix indicates the number of occurrences of a particular word in a particular review.

A typical DTM would look like this:

             about   across   ado   adult
Review 1       0        2       1       0
Review 2       1        0       0       1

We don't want to catch every single word that appears in at least one review, because very rare words would increase the size of the DTM while having little predictive power. So we'll only keep in our DTM the words that appear in at least a certain percentage of all reviews, say 1%. This is controlled by the sparsity parameter in the following code, with sparsity = 1 - 0.01 = 0.99.

There is a challenge though. The premise of our analysis is that some words appear in negative reviews and not in positive reviews, and conversely (or at least with different frequencies). But if we only keep words that appear in 1% of our overall training dataset, because negative reviews represent only 15% of our dataset, we are effectively requiring that a negative word appears in 1%/15% = 6.67% of the negative reviews; that is a much higher threshold, and it won't do.
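
As a quick sanity check on that arithmetic, here is the calculation spelled out in R (purely illustrative, not part of the pipeline):

# If a word occurs only in negative reviews, a frequency threshold computed over
# the whole training set translates into a much stricter threshold within the
# negative reviews alone (numbers taken from the text above)
overall.threshold <- 0.01            # word must appear in 1% of all reviews
negative.share <- 0.15               # negative reviews are ~15% of the dataset
overall.threshold / negative.share   # ~0.0667, i.e. 6.67% of negative reviews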

The solution is to create two different DTMs for our training dataset, one for positive reviews and one for negative reviews, and then to merge them together. This way, the effective threshold for negative words is to appear in only 1% of the negative reviews.

# Creating a DTM for the negative reviews
sparsity <- .99
bad.dtm <- create_matrix(train$review[train$good.read == 0],
                         language = "english",
                         removeStopwords = FALSE,
                         removeNumbers = TRUE,
                         stemWords = FALSE,
                         removeSparseTerms = sparsity)
# Converting the DTM in a data frame
bad.dtm.df <- as.data.frame(as.matrix(bad.dtm),
                            row.names = train$review.id[train$good.read == 0])

# Creating a DTM for the positive reviews
good.dtm <- create_matrix(train$review[train$good.read == 1],
                          language = "english",
                          removeStopwords = FALSE,
                          removeNumbers = TRUE,
                          stemWords = FALSE,
                          removeSparseTerms = sparsity)
good.dtm.df <- data.table(as.matrix(good.dtm),
                          row.names = train$review.id[train$good.read == 1])

# Joining the two DTM together
train.dtm.df <- bind_rows(bad.dtm.df, good.dtm.df)
train.dtm.df$review.id <- c(train$review.id[train$good.read == 0],
                            train$review.id[train$good.read == 1])
train.dtm.df <- arrange(train.dtm.df, review.id)
train.dtm.df$good.read <- train$good.read

We also want to use in the analyses our aggregate variables (review length, mean and median sentiment, count of positive and negative words according to the two lexicons), so we join the DTM to the train dataset, by review ID. We also convert all NA values in our data frames to 0 (these NAs were generated where words were absent from reviews, so that's the correct way of dealing with them here; but kids, don't convert NA to 0 at home without thinking about it first).

train.dtm.df <- train %>%
  select(-c(book, rating, review, good.read)) %>%
  inner_join(train.dtm.df, by = "review.id") %>%
  select(-review.id)

train.dtm.df[is.na(train.dtm.df)] <- 0
# Creating the test DTM
test.dtm <- create_matrix(test$review,
                          language = "english",
                          removeStopwords = FALSE,
                          removeNumbers = TRUE,
                          stemWords = FALSE,
                          removeSparseTerms = sparsity)
test.dtm.df <- data.table(as.matrix(test.dtm))
test.dtm.df$review.id <- test$review.id
test.dtm.df$good.read <- test$good.read

test.dtm.df <- test %>%
  select(-c(book, rating, review, good.read)) %>%
  inner_join(test.dtm.df, by = "review.id") %>%
  select(-review.id)
   

A challenge here is to ensure that the test DTM has the same columns as the train dataset. Obviously, some words may appear in the test dataset while being absent from the train dataset, but there's nothing we can do about them, as our algorithms won't have anything to say about them. The trick we're going to use relies on the flexibility of data.tables: when you join by rows two data.tables with different columns, the resulting data.table automatically has all the columns of the two initial data.tables, with the missing values set as NA. So we are going to add a row of our training data.table to our test data.table and immediately remove it after the missing columns have been created; then we'll keep only the columns that appear in the training dataset (i.e. discard all columns that appear only in the test dataset).

test.dtm.df <- head(bind_rows(test.dtm.df, train.dtm.df[1, ]), -1)
test.dtm.df <- test.dtm.df %>%
  select(one_of(colnames(train.dtm.df)))
test.dtm.df[is.na(test.dtm.df)] <- 0

With this, we have our training and test datasets, and we can start crunching numbers!

Machine learning

We'll be using XGBoost here, as it yields the best results (I tried random forests and support vector machines too, but the resulting accuracy was too unstable with these to be reliable).

We start by calculating our baseline accuracy, what we would get by always predicting the most frequent category, and then we calibrate our model.

baseline.acc <- sum(test$good.read == "1") / nrow(test)

XGB.train <- as.matrix(select(train.dtm.df, -good.read),
                       dimnames = dimnames(train.dtm.df))
XGB.test <- as.matrix(select(test.dtm.df, -good.read),
                      dimnames = dimnames(test.dtm.df))

XGB.model <- xgboost(data = XGB.train,
                     label = train.dtm.df$good.read,
                     nrounds = 400,
                     objective = "binary:logistic")

XGB.predict <- predict(XGB.model, XGB.test)

XGB.results <- data.frame(good.read = test$good.read,
                          pred = XGB.predict)

The XGBoost algorithm yields a probabilistic prediction, so we need to determine a threshold over which we'll classify a review as good. In order to do that, we'll plot the ROC (Receiver Operating Characteristic) curve for the true negative rate against the false negative rate.

ROCR.pred <- prediction(XGB.results$pred, XGB.results$good.read)
ROCR.perf <- performance(ROCR.pred, "tnr", "fnr")
plot(ROCR.perf, colorize = TRUE)

Things are looking pretty good. It seems that by using a threshold of about 0.8 (where the curve becomes red), we can correctly classify more than 50% of the negative reviews (the true negative rate) while misclassifying as negative reviews less than 10% of the positive reviews (the false negative rate).

XGB.table <- table(true = XGB.results$good.read,
                   pred = as.integer(XGB.results$pred >= 0.80))
XGB.table
XGB.acc <- sum(diag(XGB.table)) / nrow(test)

Our overall accuracy is 87%, so we beat the benchmark of always predicting that a review is positive (which would yield 83.4% accuracy here, to be precise), while catching 61.5% of the negative reviews. Not bad for a "black box" algorithm, without any parameter optimization or feature engineering!
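
For reference, the two figures quoted above can be read directly off the confusion matrix built in the previous chunk. A small sketch (it assumes the table layout produced above, with true labels in rows and predictions in columns):

# Overall accuracy: share of reviews on the diagonal of the confusion matrix
XGB.acc

# True negative rate: share of negative reviews (good.read == 0)
# that were correctly flagged as negative
XGB.table["0", "0"] / sum(XGB.table["0", ])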

Directions for further analyses

If we wanted to go deeper into the analysis, a good starting point would be to look at the relative importance of features in the XGBoost algorithm:

## Feature analysis with XGBoost
names <- colnames(test.dtm.df)
importance.matrix <- xgb.importance(names, model = XGB.model)
xgb.plot.importance(importance.matrix[1:20, ])

As we can see, there are a few words, such as "Colleen" or "you", that are unlikely to be useful in a more general setting, but overall we find that the most predictive words are negative ones, which is to be expected. We also see that two of our aggregate variables, review.length and count.bing.negative, made the top 10.

There are several ways we could improve on the analysis at this point, such as:

  • Using n-grams (i.e. sequences of words, such as "did not like") in addition to single words, to better qualify negative terms. "Was very disappointed" would obviously have a different impact compared to "was not disappointed", even though on a word-by-word basis they could not be distinguished (a sketch follows this list).
  • Fine-tuning the parameters of the XGBoost algorithm.
  • Looking at the negative reviews that have been misclassified, in order to determine what features to add to the analysis.
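
To make the first point concrete, here is a minimal sketch of how a bigram DTM could be built for the negative reviews. It is not part of the original pipeline: it uses the tm and RWeka packages (RWeka needs a working Java installation) rather than RTextTools, and simply mirrors the sparsity threshold and subsetting used for the unigram DTM above.

library(tm)
library(RWeka)  # provides NGramTokenizer; requires Java

# Tokenizer returning 2-grams ("did not", "not like", ...) instead of single words
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# Document-term matrix of bigrams for the negative training reviews
bad.corpus <- VCorpus(VectorSource(train$review[train$good.read == 0]))
bad.bigram.dtm <- DocumentTermMatrix(bad.corpus,
                                     control = list(tokenize = BigramTokenizer))

# Keep only bigrams appearing in at least 1% of the negative reviews,
# mirroring the sparsity value used earlier
bad.bigram.dtm <- removeSparseTerms(bad.bigram.dtm, sparsity)
bad.bigram.dtm.df <- as.data.frame(as.matrix(bad.bigram.dtm),
                                   row.names = train$review.id[train$good.read == 0])

The resulting bigram counts could then be joined to the unigram DTM before training, exactly as the single-word counts were.
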
Conclusion

We have covered a lot of ground in this series: from web scraping to sentiment analysis to predictive analytics with machine learning. The main conclusion I would draw from this exercise is that we now have at our disposal a large number of powerful tools that can be used "off-the-shelf" to build, fairly quickly, a complete and meaningful analytical pipeline.

As for the first two installments of the series, the complete R code for this part is available on my GitHub.

Transferred from: https://www.r-bloggers.com/goodreads-machine-learning-part-3/
