R language: SMOTE - super-sampling rare events in R: how to treat unbalanced data with R


In this example, the following three packages will be used:

{DMwR} - functions and data for the book "Data Mining with R", including the SMOTE algorithm
{caret} - model wrappers, utility functions and commands
{pROC} - area under the curve (AUC) functions
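
If these packages are not already installed, a minimal setup sketch could look like the following (the install line is only a suggestion; note that DMwR has since been archived on CRAN, so installing it may require the CRAN archive or a similar source):

# minimal setup sketch; the install line is commented out and assumes CRAN availability
# install.packages(c("caret", "pROC", "DMwR"))
library(DMwR)   # SMOTE()
library(caret)  # createDataPartition(), trainControl(), train()
library(pROC)   # roc(), auc()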


The SMOTE algorithm is designed to deal with imbalanced classification problems. It can produce a new, "SMOTEd" data set in which the class imbalance has been corrected. Alternatively, it can also run a classification algorithm on this re-balanced data set and return the resulting model.
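
To illustrate these two ways of calling it, here is a minimal sketch loosely based on the example in the DMwR documentation, using an artificially imbalanced version of the iris data; the toy data set and the chosen parameter values are assumptions made purely for the illustration.

library(DMwR)
# toy imbalanced problem: "setosa" becomes the rare class (50 rare vs. 100 common)
data(iris)
toy <- iris[, c(1, 2, 5)]
toy$Species <- factor(ifelse(toy$Species == "setosa", "rare", "common"))
table(toy$Species)

# mode 1: return a new, re-balanced ("SMOTEd") data set
newData <- SMOTE(Species ~ ., toy, perc.over = 200, perc.under = 150)
table(newData$Species)

# mode 2: additionally fit a classifier on the SMOTEd data and return that model
treeModel <- SMOTE(Species ~ ., toy, perc.over = 200, perc.under = 150,
                   learner = "rpartXse", se = 0.5)
treeModel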

We use a thyroid disease data set for this exercise.
First, let's clean up the data.
# load the data, delete colons and periods, and append the column names
hyper <- read.csv('http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/hypothyroid.data', header = FALSE)
names <- read.csv('http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/hypothyroid.names', header = FALSE, sep = '\t')[[1]]
names <- gsub(pattern = ":|[.]", replacement = "", x = names)
colnames(hyper) <- names
# rename the first column from hypothyroid, negative to target, then recode negative as 0 and every other value as 1
colnames(hyper)[1] <- "target"
colnames(hyper)
# # [1] "target" "age"
# # [3] "sex" "On_thyroxine"
# # [5] "Query_on_thyroxine" "On_antithyroid_medication"
# # [7] "Thyroid_surgery" "Query_hypothyroid"
# # [9] "query_hyperthyroid" "Pregnant"
# # [One] "sick" "tumor"
# # [] "Lithium" "goitre"
# # [[]] "tsh_measured" "TSH"
# # [+] "t3_measured" "T3"
# # [+] "tt4_measured" "TT4"
# # [+] "t4u_measured" "t4u"
# [+] "fti_measured" "FTI"
# [+] "tbg_measured" "TBG"
hyper$target <- ifelse(hyper$target == "negative", 0, 1)
# check the number of negative and positive cases
table(hyper$target)
##
##    0    1
## 3012  151
prop.table(table(hyper$target))
##
##       0       1
## 0.95226 0.04774
# As you can see, class 1 makes up only about 5% of the data. This is clearly a skewed data set, i.e. a rare event.
head(hyper, 2)
##   target age sex on_thyroxine query_on_thyroxine on_antithyroid_medication
## 1      1       M            f                  f                         f
## 2      1       F            t                  f                         f
##   thyroid_surgery query_hypothyroid query_hyperthyroid pregnant sick tumor
## 1               f                 f                  f        f    f     f
## 2               f                 f                  f        f    f     f
##   lithium goitre tsh_measured TSH t3_measured   T3 tt4_measured TT4
## 1       f      f            y               y 0.60            y  15
## 2       f      f            y 145           y 1.70            y  19
##   t4u_measured  t4u fti_measured FTI tbg_measured TBG
## 1            y 1.48            y                n   ?
## 2            y 1.13            y                n   ?
# These columns are factor variables (character values) and need to be converted to binary numeric values to facilitate modeling:
ind <- sapply(hyper, is.factor)
hyper[ind] <- lapply(hyper[ind], as.character)

hyper[hyper == "?"] <- NA
hyper[hyper == "f"] <- 0
hyper[hyper == "t"] <- 1
hyper[hyper == "n"] <- 0
hyper[hyper == "y"] <- 1
hyper[hyper == "M"] <- 0
hyper[hyper == "F"] <- 1

hyper[ind] <- lapply(hyper[ind], as.numeric)

# replace any remaining NAs with the mean of the non-missing values
replaceNAsWithMean <- function(x) { replace(x, is.na(x), mean(x[!is.na(x)])) }

hyper <- replaceNAsWithMean(hyper)
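
As a quick sanity check, this is what the mean-imputation helper does on a small made-up vector (the values are invented for illustration). Note that, applied to the whole data frame as above, it fills NAs with the overall mean of all non-missing cells; per-column imputation would apply it column by column, e.g. via lapply().

# hypothetical check of the helper on a toy vector
x <- c(1, NA, 3, NA, 5)
replaceNAsWithMean(x)
## [1] 1 3 3 3 5   (the NAs are replaced by mean(c(1, 3, 5)) = 3)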


Modeling
We use the createDataPartition() function (a data-partitioning function) from the caret package to split the data randomly into two halves.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(1234)
splitIndex <- createDataPartition(hyper$target, times = 1, p = 0.5, list = FALSE)
trainSplit <- hyper[splitIndex, ]
testSplit <- hyper[-splitIndex, ]

prop.table(table(trainSplit$target))
##
##       0       1
## 0.95006 0.04994
prop.table(table(testSplit$target))
##
##       0       1
## 0.95446 0.04554
The class proportions in the two splits are comparable: each still contains roughly 5% positive cases, so the partition is at a good level.

We use the treebag algorithm from the caret package to build a model on the training set and then predict on the test set.

ctrl <- trainControl(method = "cv", number = 5)
tbmodel <- train(target ~ ., data = trainSplit, method = "treebag",
                 trControl = ctrl)
## Loading required package: ipred
## Loading required package: plyr
predictors <- names(trainSplit)[names(trainSplit) != 'target']
pred <- predict(tbmodel$finalModel, testSplit[, predictors])
To evaluate the model, we use the roc() function from the pROC package to compute the AUC score and plot the curve.
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
##     cov, smooth, var
auc <- roc(testSplit$target, pred)
print(auc)
##
## Call:
## roc.default(response = testSplit$target, predictor = pred)
##
## Data: pred in 1509 controls (testSplit$target 0) < cases (testSplit$target 1).
## Area under the curve: 0.985
plot(auc, ylim = c(0, 1), print.thres = TRUE, main = paste('AUC', round(auc$auc[[1]], 2)))
##
## Call:
## roc.default(response = testSplit$target, predictor = pred)
##
## Data: pred in 1509 controls (testSplit$target 0) < cases (testSplit$target 1).
## Area under the curve: 0.985
abline(h = 1, col = "blue", lwd = 2)
abline(h = 0, col = "red", lwd = 2)




The AUC score is 0.985, which is a very good result (an AUC of 0.5 corresponds to a random classifier, and 1 to a perfect one).
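
As a quick illustration of why 0.5 is the baseline: a completely uninformative score, such as random noise, yields an AUC of about 0.5. A minimal sketch, where the use of runif() is purely an assumption for the illustration:

# hypothetical baseline: random scores should give an AUC of roughly 0.5
set.seed(42)
randomScores <- runif(length(testSplit$target))
roc(testSplit$target, randomScores)$auc
## expected to be close to 0.5, up to sampling noise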

It is hard to imagine that SMOTE can improve on this, but let's now apply SMOTE to the training data, model it again, and see how the AUC changes.

In R, the SMOTE algorithm is part of the DMwR package, and it has three main parameters. perc.over controls the amount of over-sampling of the minority class; k is the number of nearest neighbours used to generate new minority-class cases (default 5); and perc.under controls the amount of under-sampling of the majority class. For example, perc.over = 500 means that five new synthetic minority cases are generated for each minority case in the original data, while perc.under = 80 means that the number of majority cases drawn from the original data is 80% of the number of newly generated minority cases.
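
To make this arithmetic concrete, here is a minimal sketch on a made-up imbalanced data frame (the data, column names and class counts are invented for illustration), using the same perc.over = 100 and perc.under = 200 settings that are applied to the training set below.

library(DMwR)
# hypothetical data: 20 minority cases (target = 1) and 400 majority cases (target = 0)
set.seed(1)
toy <- data.frame(target = factor(c(rep(1, 20), rep(0, 400))),
                  x1 = rnorm(420), x2 = rnorm(420))
# perc.over = 100: one synthetic minority case per original minority case -> 20 + 20 = 40
# perc.under = 200: two majority cases kept per synthetic minority case -> 2 * 20 = 40
balanced <- SMOTE(target ~ ., toy, perc.over = 100, perc.under = 200)
table(balanced$target)
## expected: about 40 cases of each class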

library(DMwR)
## Loading required package: grid
##
## Attaching package: 'DMwR'
##
## The following object is masked from 'package:plyr':
##
##     join
trainSplit$target <- as.factor(trainSplit$target)
trainSplit <- SMOTE(target ~ ., trainSplit, perc.over = 100, perc.under = 200)
trainSplit$target <- as.numeric(trainSplit$target)
# Use prop.table() again to check the balance of the result and make sure the negative and positive cases are now evenly represented.
prop.table(table(trainSplit$target))
##
##   1   2
## 0.5 0.5
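
Note that the classes now show up as 1 and 2 rather than 0 and 1: as.numeric() applied to a factor returns the internal level codes, not the original labels. A minimal sketch of this behaviour, and of the usual way to recover the original coding via as.character():

# as.numeric() on a factor returns level codes (1, 2), not the original labels
f <- factor(c(0, 1, 1, 0))
as.numeric(f)                # 1 2 2 1
as.numeric(as.character(f))  # 0 1 1 0, i.e. the original 0/1 coding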
# build the treebag model again
tbmodel <- train(target ~ ., data = trainSplit, method = "treebag",
                 trControl = ctrl)
predictors <- names(trainSplit)[names(trainSplit) != 'target']
pred <- predict(tbmodel$finalModel, testSplit[, predictors])
auc <- roc(testSplit$target, pred)
print(auc)
##
## Call:
## roc.default(response = testSplit$target, predictor = pred)
##
## Data: pred in 1509 controls (testSplit$target 0) < cases (testSplit$target 1).
## Area under the curve: 0.99
Wow, the AUC is up to 0.99, an improvement over the previous 0.985.
plot(auc, ylim = c(0, 1), print.thres = TRUE, main = paste('AUC', round(auc$auc[[1]], 2)))
##
## Call:
## roc.default(response = testSplit$target, predictor = pred)
##
## Data: pred in 1509 controls (testSplit$target 0) < cases (testSplit$target 1).
## Area under the curve: 0.99
abline(h = 1, col = "blue", lwd = 2)
abline(h = 0, col = "red", lwd = 2)

