R language: SMOTE - super-sampling rare events in R: how to treat unbalanced data with R


In this example, the following three packages will be used:

{DMwR} - functions and data for the book "Data Mining with R", including the SMOTE algorithm
{caret} - model wrappers, utility functions and commands
{pROC} - area under the curve (AUC) functions
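
If these packages are not already installed, a minimal setup sketch could look like the following (the install line is only a suggestion; note that DMwR has since been archived on CRAN, so installing it may require the CRAN archive or a similar source):

# minimal setup sketch; the install line is commented out and assumes CRAN availability
# install.packages(c("caret", "pROC", "DMwR"))
library(DMwR)   # SMOTE()
library(caret)  # createDataPartition(), trainControl(), train()
library(pROC)   # roc(), auc()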


The SMOTE algorithm is designed to deal with imbalanced classification problems. It can produce a new, "SMOTEd" data set in which the class imbalance has been corrected. Alternatively, it can also run a classification algorithm on this re-balanced data set and return the resulting model.
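
To illustrate these two ways of calling it, here is a minimal sketch loosely based on the example in the DMwR documentation, using an artificially imbalanced version of the iris data; the toy data set and the chosen parameter values are assumptions made purely for the illustration.

library(DMwR)
# toy imbalanced problem: "setosa" becomes the rare class (50 rare vs. 100 common)
data(iris)
toy <- iris[, c(1, 2, 5)]
toy$Species <- factor(ifelse(toy$Species == "setosa", "rare", "common"))
table(toy$Species)

# mode 1: return a new, re-balanced ("SMOTEd") data set
newData <- SMOTE(Species ~ ., toy, perc.over = 200, perc.under = 150)
table(newData$Species)

# mode 2: additionally fit a classifier on the SMOTEd data and return that model
treeModel <- SMOTE(Species ~ ., toy, perc.over = 200, perc.under = 150,
                   learner = "rpartXse", se = 0.5)
treeModel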

We use a thyroid disease data set for this exercise.
First, let's clean up the data.
# load the data, delete colons and periods, and append the column names
hyper <- read.csv('http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/hypothyroid.data', header = FALSE)
names <- read.csv('http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/hypothyroid.names', header = FALSE, sep = '\t')[[1]]
names <- gsub(pattern = ":|[.]", replacement = "", x = names)
colnames(hyper) <- names
# rename the first column from hypothyroid, negative to target, then recode negative as 0 and every other value as 1
colnames(hyper)[1] <- "target"
colnames(hyper)
# # [1] "target" "age"
# # [3] "sex" "On_thyroxine"
# # [5] "Query_on_thyroxine" "On_antithyroid_medication"
# # [7] "Thyroid_surgery" "Query_hypothyroid"
# # [9] "query_hyperthyroid" "Pregnant"
# # [One] "sick" "tumor"
# # [] "Lithium" "goitre"
# # [[]] "tsh_measured" "TSH"
# # [+] "t3_measured" "T3"
# # [+] "tt4_measured" "TT4"
# # [+] "t4u_measured" "t4u"
# [+] "fti_measured" "FTI"
# [+] "tbg_measured" "TBG"
hyper$target <- ifelse(hyper$target == "negative", 0, 1)
# check the number of negative and positive cases
table(hyper$target)
##
##    0    1
## 3012  151
prop.table(table(hyper$target))
##
##       0       1
## 0.95226 0.04774
# As you can see, class 1 makes up only about 5% of the data. This is clearly a skewed data set, i.e. a rare event.
head(hyper, 2)
##   target age sex on_thyroxine query_on_thyroxine on_antithyroid_medication
## 1      1       M            f                  f                         f
## 2      1       F            t                  f                         f
##   thyroid_surgery query_hypothyroid query_hyperthyroid pregnant sick tumor
## 1               f                 f                  f        f    f     f
## 2               f                 f                  f        f    f     f
##   lithium goitre tsh_measured TSH t3_measured   T3 tt4_measured TT4
## 1       f      f            y               y 0.60            y  15
## 2       f      f            y 145           y 1.70            y  19
##   t4u_measured  t4u fti_measured FTI tbg_measured TBG
## 1            y 1.48            y                n   ?
## 2            y 1.13            y                n   ?
# These columns are factor variables (character values) and need to be converted to binary numeric values to facilitate modeling:
ind <- sapply(hyper, is.factor)
hyper[ind] <- lapply(hyper[ind], as.character)

hyper[hyper == "?"] <- NA
hyper[hyper == "f"] <- 0
hyper[hyper == "t"] <- 1
hyper[hyper == "n"] <- 0
hyper[hyper == "y"] <- 1
hyper[hyper == "M"] <- 0
hyper[hyper == "F"] <- 1

hyper[ind] <- lapply(hyper[ind], as.numeric)

# replace any remaining NAs with the mean of the non-missing values
replaceNAsWithMean <- function(x) { replace(x, is.na(x), mean(x[!is.na(x)])) }

hyper <- replaceNAsWithMean(hyper)
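
As a quick sanity check, this is what the mean-imputation helper does on a small made-up vector (the values are invented for illustration). Note that, applied to the whole data frame as above, it fills NAs with the overall mean of all non-missing cells; per-column imputation would apply it column by column, e.g. via lapply().

# hypothetical check of the helper on a toy vector
x <- c(1, NA, 3, NA, 5)
replaceNAsWithMean(x)
## [1] 1 3 3 3 5   (the NAs are replaced by mean(c(1, 3, 5)) = 3)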


Modeling
We use the createDataPartition() function (a data-partitioning function) from the caret package to split the data randomly into two halves.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(1234)
splitIndex <- createDataPartition(hyper$target, times = 1, p = 0.5, list = FALSE)
trainSplit <- hyper[splitIndex, ]
testSplit <- hyper[-splitIndex, ]

prop.table(table(trainSplit$target))
##
##       0       1
## 0.95006 0.04994
prop.table(table(testSplit$target))
##
##       0       1
## 0.95446 0.04554
The class proportions in the two splits are comparable: each still contains roughly 5% positive cases, so the partition is at a good level.

We use the treebag algorithm from the caret package to build a model on the training set and then predict on the test set.

ctrl <- trainControl(method = "cv", number = 5)
tbmodel <- train(target ~ ., data = trainSplit, method = "treebag",
                 trControl = ctrl)
## Loading required package: ipred
## Loading required package: plyr
predictors <- names(trainSplit)[names(trainSplit) != 'target']
pred <- predict(tbmodel$finalModel, testSplit[, predictors])
To evaluate the model, we use the roc() function from the pROC package to compute the AUC score and plot the curve.
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
##     cov, smooth, var
auc <- roc(testSplit$target, pred)
print(auc)
##
## Call:
## roc.default(response = testSplit$target, predictor = pred)
##
## Data: pred in 1509 controls (testSplit$target 0) < cases (testSplit$target 1).
## Area under the curve: 0.985
plot(auc, ylim = c(0, 1), print.thres = TRUE, main = paste('AUC', round(auc$auc[[1]], 2)))
##
## Call:
## roc.default(response = testSplit$target, predictor = pred)
##
## Data: pred in 1509 controls (testSplit$target 0) < cases (testSplit$target 1).
## Area under the curve: 0.985
abline(h = 1, col = "blue", lwd = 2)
abline(h = 0, col = "red", lwd = 2)




The AUC score is 0.985, which is a very good result (an AUC of 0.5 corresponds to a random classifier, and 1 to a perfect one).
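
As a quick illustration of why 0.5 is the baseline: a completely uninformative score, such as random noise, yields an AUC of about 0.5. A minimal sketch, where the use of runif() is purely an assumption for the illustration:

# hypothetical baseline: random scores should give an AUC of roughly 0.5
set.seed(42)
randomScores <- runif(length(testSplit$target))
roc(testSplit$target, randomScores)$auc
## expected to be close to 0.5, up to sampling noise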

It is hard to imagine that SMOTE can improve on this, but let's now apply SMOTE to the training data, model it again, and see how the AUC changes.

In R, the SMOTE algorithm is part of the DMwR package, and it has three main parameters. perc.over controls the amount of over-sampling of the minority class; k is the number of nearest neighbours used to generate new minority-class cases (default 5); and perc.under controls the amount of under-sampling of the majority class. For example, perc.over = 500 means that five new synthetic minority cases are generated for each minority case in the original data, while perc.under = 80 means that the number of majority cases drawn from the original data is 80% of the number of newly generated minority cases.
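
To make this arithmetic concrete, here is a minimal sketch on a made-up imbalanced data frame (the data, column names and class counts are invented for illustration), using the same perc.over = 100 and perc.under = 200 settings that are applied to the training set below.

library(DMwR)
# hypothetical data: 20 minority cases (target = 1) and 400 majority cases (target = 0)
set.seed(1)
toy <- data.frame(target = factor(c(rep(1, 20), rep(0, 400))),
                  x1 = rnorm(420), x2 = rnorm(420))
# perc.over = 100: one synthetic minority case per original minority case -> 20 + 20 = 40
# perc.under = 200: two majority cases kept per synthetic minority case -> 2 * 20 = 40
balanced <- SMOTE(target ~ ., toy, perc.over = 100, perc.under = 200)
table(balanced$target)
## expected: about 40 cases of each class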

library(DMwR)
## Loading required package: grid
##
## Attaching package: 'DMwR'
##
## The following object is masked from 'package:plyr':
##
##     join
trainSplit$target <- as.factor(trainSplit$target)
trainSplit <- SMOTE(target ~ ., trainSplit, perc.over = 100, perc.under = 200)
trainSplit$target <- as.numeric(trainSplit$target)
# Use prop.table() again to check the balance of the result and make sure the negative and positive cases are now evenly represented.
prop.table(table(trainSplit$target))
##
##   1   2
## 0.5 0.5
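
Note that the classes now show up as 1 and 2 rather than 0 and 1: as.numeric() applied to a factor returns the internal level codes, not the original labels. A minimal sketch of this behaviour, and of the usual way to recover the original coding via as.character():

# as.numeric() on a factor returns level codes (1, 2), not the original labels
f <- factor(c(0, 1, 1, 0))
as.numeric(f)                # 1 2 2 1
as.numeric(as.character(f))  # 0 1 1 0, i.e. the original 0/1 coding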
# build the treebag model again
tbmodel <- train(target ~ ., data = trainSplit, method = "treebag",
                 trControl = ctrl)
predictors <- names(trainSplit)[names(trainSplit) != 'target']
pred <- predict(tbmodel$finalModel, testSplit[, predictors])
auc <- roc(testSplit$target, pred)
print(auc)
##
## Call:
## roc.default(response = testSplit$target, predictor = pred)
##
## Data: pred in 1509 controls (testSplit$target 0) < cases (testSplit$target 1).
## Area under the curve: 0.99
Wow, the AUC is up to 0.99, an improvement over the previous 0.985.
plot(auc, ylim = c(0, 1), print.thres = TRUE, main = paste('AUC', round(auc$auc[[1]], 2)))
##
## Call:
## roc.default(response = testSplit$target, predictor = pred)
##
## Data: pred in 1509 controls (testSplit$target 0) < cases (testSplit$target 1).
## Area under the curve: 0.99
abline(h = 1, col = "blue", lwd = 2)
abline(h = 0, col = "red", lwd = 2)

