8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset


by Jason Brownlee

Has this happened to you?

You are working on your dataset. You create a classification model and get 90% accuracy immediately. "Fantastic," you think. You dive a little deeper and discover that 90% of the data belongs to one class. Damn!

This is an example of an imbalanced dataset and the frustrating results it can cause.

In this post, you will discover the tactics that you can use to deliver great results on machine learning datasets with imbalanced data.

Find some balance in your machine learning.
Photo by MichaEli, some rights reserved.

Coming to Grips with Imbalanced Data

I get emails about class imbalance all the time, for example:

I have a binary classification problem and one class is present with a 60:1 ratio in my training set. I used logistic regression and the result seems to just ignore one class.

And this:

I am working on a classification model. In my dataset I have three different labels to be classified, let them be A, B and C. But in the training dataset, I have A with 70% volume, B with 25% and C with 5%. Most of the time my results are overfit to A. Can you suggest how I can solve this problem?

I write long lists of techniques to try and think about the best ways to get past this problem. I finally took the advice of one of my students:

Perhaps one of your upcoming blog posts could address the problem of training a model to perform against highly imbalanced data, and outline some techniques and expectations.

Frustration!

Imbalanced data can cause you a lot of frustration.

You feel very frustrated when you discover that your data has imbalanced classes and that all of the great results you thought you were getting turn out to be a lie.

The next wave of frustration hits when the books, articles and blog posts don't seem to give you good advice about handling the imbalance in your data.

Relax, there are many options and we are going to go through them all. It is possible, you can build predictive models for imbalanced data.

What is Imbalanced Data?

Imbalanced data typically refers to a problem with classification problems where the classes are not represented equally.

For example, you may have a 2-class (binary) classification problem with 100 instances (rows). A total of 80 instances are labeled with Class-1 and the remaining 20 instances are labeled with Class-2.

This is an imbalanced dataset and the ratio of Class-1 to Class-2 instances is 80:20, or more concisely 4:1.

You can have a class imbalance problem on two-class classification problems as well as multi-class classification problems. Most techniques can be used on either.

The remaining discussion will assume a two-class classification problem because it is easier to think about and describe.

Imbalance is Common

Most classification data sets do not have an exactly equal number of instances in each class, but a small difference often does not matter.

There are problems where a class imbalance is not just common, it is expected. For example, datasets like those that characterize fraudulent transactions are imbalanced. The vast majority of transactions will be in the "Not-Fraud" class and a very small minority will be in the "Fraud" class.

Another example is customer churn datasets, where the vast majority of customers stay with the service (the "No-Churn" class) and a small minority cancel their subscription (the "Churn" class).

When there is a modest class imbalance like 4:1 in the example above, it can cause problems.

Accuracy Paradox

The accuracy paradox is the name for the exact situation in the introduction to this post.

It is the case where your accuracy measures tell the story that you have excellent accuracy (such as 90%), but the accuracy is only reflecting the underlying class distribution.

It is very common, because classification accuracy is often the first measure we use when evaluating models on our classification problems.

Put It All on Red!

What is going on in our models when we train on an imbalanced dataset?

As you might have guessed, the reason we get 90% accuracy on imbalanced data (with 90% of the instances in Class-1) is because our models look at the data and cleverly decide that the best thing to do is to always predict "Class-1" and achieve high accuracy.

This is best seen when using a simple rule-based algorithm. If you print out the rule in the final model, you will see that it is very likely predicting one class regardless of the data it is asked to predict.
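To make the paradox concrete, here is a tiny sketch on hypothetical 90:10 labels, showing that a model whose only rule is "always predict Class-1" scores 90% accuracy without learning anything:

```python
# Hypothetical 90:10 label distribution and the "always majority" rule.
y_true = ["Class-1"] * 90 + ["Class-2"] * 10
y_pred = ["Class-1"] * 100  # the model's one rule: always predict Class-1

# Accuracy only reflects the class distribution, not any real skill.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.9
```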

8 Tactics to Combat Imbalanced Training Data

We now understand what class imbalance is and why it provides misleading classification accuracy.

So what are our options?

1) Can You Collect More Data?

You might think it's silly, but collecting more data is almost always overlooked.

Can you collect more data? Take a second and think about whether you are able to gather more data on your problem.

A larger dataset might expose a different and perhaps more balanced perspective on the classes.

More examples of the minor classes may be useful later when we look at resampling your dataset.

2) Try Changing Your Performance Metric

Accuracy is not the metric to use when working with an imbalanced dataset. We have seen that it is misleading.

There are metrics that have been designed to tell you a more truthful story when working with imbalanced classes.

I give more advice on selecting different performance measures in my post "Classification Accuracy is not Enough: More Performance Measures You Can Use".

In that post I look at an imbalanced dataset that characterizes the recurrence of breast cancer in patients.

From that post, I recommend looking at the following performance measures that can give more insight into the accuracy of the model than traditional classification accuracy:

    • Confusion Matrix: A breakdown of predictions into a table showing correct predictions (the diagonal) and the types of incorrect predictions made (what classes incorrect predictions were assigned).
    • Precision: A measure of a classifier's exactness.
    • Recall: A measure of a classifier's completeness.
    • F1 Score (or F-score): A weighted average of precision and recall.

I would also advise you to take a look at the following:

    • Kappa (or Cohen's Kappa): Classification accuracy normalized by the imbalance of the classes in the data.
    • ROC Curves: Like precision and recall, accuracy is divided into sensitivity and specificity and models can be chosen based on the balance thresholds of these values.

You can learn a lot more about using ROC curves to compare classification accuracy in our post "Assessing and Comparing Classifier Performance with ROC Curves".

Still not sure? Start with kappa, it will give you a better idea of what is going on than classification accuracy.
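The measures above are all available in scikit-learn. A minimal sketch, using hypothetical labels (9 of class 0, 1 of class 1) and the "always majority" strategy that scores 90% accuracy: every one of these metrics exposes it as worthless.

```python
# Compare the suggested measures on a 90%-accurate but skill-free model.
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score, cohen_kappa_score)

y_true = [0] * 9 + [1]   # hypothetical imbalanced labels
y_pred = [0] * 10        # always predict the majority class

print(confusion_matrix(y_true, y_pred))                      # [[9 0], [1 0]]
print(precision_score(y_true, y_pred, zero_division=0))      # 0.0
print(recall_score(y_true, y_pred))                          # 0.0
print(f1_score(y_true, y_pred, zero_division=0))             # 0.0
print(cohen_kappa_score(y_true, y_pred))                     # 0.0
```

Despite 90% accuracy, precision, recall, F1 and kappa are all zero, which is exactly the more truthful story this tactic is after.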

3) Try Resampling Your Dataset

You can change the dataset that you use to build your predictive model to have more balanced data.

This change is called sampling your dataset and there are two main methods that you can use to even-up the classes:

    1. You can add copies of instances from the under-represented class, called over-sampling (or more formally, sampling with replacement), or
    2. You can delete instances from the over-represented class, called under-sampling.

These approaches are often very easy to implement and fast to run. They are an excellent starting point.

In fact, I would advise you to always try both approaches on all of your imbalanced datasets, just to see if it gives you a boost in your preferred accuracy measures.

You can learn a little more in the Wikipedia article titled "Oversampling and Undersampling in Data Analysis".

Some Rules of Thumb

    • Consider testing under-sampling when you have a lot of data (tens or hundreds of thousands of instances or more).
    • Consider testing over-sampling when you don't have a lot of data (tens of thousands of records or less).
    • Consider testing random and non-random (e.g. stratified) sampling schemes.
    • Consider testing different resampled ratios (e.g. you don't have to target a 1:1 ratio in a binary classification problem, try other ratios).
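Both methods really are easy to implement. A minimal NumPy sketch on a hypothetical 4:1 dataset (80 majority labels, 20 minority labels), targeting a 1:1 ratio:

```python
# Over-sampling and under-sampling by index manipulation with NumPy.
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 80 + [1] * 20)        # hypothetical 4:1 labels
majority = np.where(y == 0)[0]
minority = np.where(y == 1)[0]

# Over-sampling: draw minority indices WITH replacement up to 80 copies.
over = np.concatenate([majority, rng.choice(minority, size=80, replace=True)])

# Under-sampling: keep only 20 of the majority indices, no replacement.
under = np.concatenate([rng.choice(majority, size=20, replace=False), minority])

print(np.bincount(y[over]))   # [80 80]
print(np.bincount(y[under]))  # [20 20]
```

In practice you would apply the same index arrays to the feature matrix `X` so that rows and labels stay aligned.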

4) Try Generating Synthetic Samples

A simple way to generate synthetic samples is to randomly sample the attributes from instances in the minority class.

You could sample them empirically within your dataset or you could use a method like Naive Bayes that can sample each attribute independently when run in reverse. You will have more and different data, but the non-linear relationships between the attributes may not be preserved.

There are systematic algorithms that you can use to generate synthetic samples. The most popular of such algorithms is called SMOTE or the Synthetic Minority Over-sampling Technique.

As its name suggests, SMOTE is an oversampling method. It works by creating synthetic samples from the minor class instead of creating copies. The algorithm selects two or more similar instances (using a distance measure) and perturbs an instance one attribute at a time by a random amount within the difference to the neighboring instances.

To learn more about SMOTE, see the original 2002 paper titled "SMOTE: Synthetic Minority Over-sampling Technique".

There are a number of implementations of the SMOTE algorithm, for example:

    • In Python, take a look at the "UnbalancedDataset" module. It provides a number of implementations of SMOTE as well as various other resampling techniques that you could try.
    • In R, the DMwR package provides an implementation of SMOTE.
    • In Weka, you can use the SMOTE supervised filter.
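To show the core idea rather than any particular library's API, here is a minimal SMOTE-style sketch in plain NumPy (the function name and the toy minority points are my own invention, not part of any package): pick a minority instance, find one of its nearest minority neighbours, and interpolate a random amount along the difference.

```python
# A SMOTE-style sketch: synthesize minority points by interpolating
# between a minority instance and one of its k nearest minority neighbours.
import numpy as np

def smote_sketch(X_min, n_synthetic, k=2, seed=0):
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))
        # Euclidean distance from instance i to every other minority instance.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf                        # never pick the point itself
        j = rng.choice(np.argsort(d)[:k])    # one of the k nearest neighbours
        gap = rng.random()                   # random amount in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # toy minority class
X_new = smote_sketch(X_min, n_synthetic=5)
print(X_new.shape)  # (5, 2)
```

Because each synthetic point lies on a segment between two real minority points, it stays inside the region the minority class occupies, unlike sampling each attribute independently.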

5) Try Different Algorithms

As always, I strongly advise you to not use your favorite algorithm on every problem. You should at least be spot-checking a variety of different types of algorithms on a given problem.

For more on spot-checking algorithms, see my post "Why You Should be Spot-Checking Algorithms on Your Machine Learning Problems".

That being said, decision trees often perform well on imbalanced datasets. The splitting rules that look at the class variable used in the creation of the trees can force both classes to be addressed.

If in doubt, try a few popular decision tree algorithms like C4.5, C5.0, CART, and Random Forest.

For some example R code using decision trees, see my post titled "Non-Linear Classification in R with Decision Trees".

For an example of using CART in Python and scikit-learn, see my post titled "Get Your Hands Dirty With Scikit-Learn Now".
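As a quick sketch of spot-checking a CART-style tree, here is scikit-learn's `DecisionTreeClassifier` on a hypothetical synthetic dataset with roughly 9:1 classes, scored with kappa rather than accuracy (per tactic 2):

```python
# Spot-check a decision tree on an imbalanced synthetic problem.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=1)  # roughly 9:1 class balance
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

tree = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
kappa = cohen_kappa_score(y_te, tree.predict(X_te))
print(kappa)  # kappa, not raw accuracy, so the imbalance cannot flatter us
```

You would compare this kappa against a handful of other algorithm types fit the same way.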

6) Try Penalized Models

You can use the same algorithms but give them a different perspective on the problem.

Penalized classification imposes an additional cost on the model for making classification mistakes on the minority class during training. These penalties can bias the model to pay more attention to the minority class.

Often the handling of class penalties or weights is specialized to the learning algorithm. There are penalized versions of algorithms such as penalized-SVM and penalized-LDA.

It is also possible to have generic frameworks for penalized models. For example, Weka has a CostSensitiveClassifier that can wrap any classifier and apply a custom penalty matrix for misclassification.

Using penalization is desirable if you are locked into a specific algorithm and are unable to resample or you are getting poor results. It provides yet another way to "balance" the classes. Setting up the penalty matrix can be complex. You will very likely have to try a variety of penalty schemes and see what works best for your problem.
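In scikit-learn, one common way to express this kind of penalty is the `class_weight` parameter, shown here on a linear SVM as a rough stand-in for penalized-SVM (the synthetic data is hypothetical):

```python
# Penalize minority-class mistakes via class_weight on a linear SVM.
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, weights=[0.9, 0.1],
                           random_state=2)  # hypothetical 9:1 data

# "balanced" weights errors inversely to class frequency; a dict such as
# {0: 1, 1: 9} would express a hand-tuned penalty scheme instead.
svm = LinearSVC(class_weight="balanced", dual=False).fit(X, y)
score = svm.score(X, y)
print(score)
```

Trying a few different weight dicts here plays the same role as trying a variety of penalty matrices in Weka's CostSensitiveClassifier.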

7) Try a Different Perspective

There are fields of study dedicated to imbalanced datasets. They have their own algorithms, measures and terminology.

Taking a look and thinking about your problem from these perspectives can sometimes shake loose some ideas.

Two you might like to consider are anomaly detection and change detection.

Anomaly detection is the detection of rare events. This might be a machine malfunction indicated through its vibrations, or malicious activity by a program indicated by its sequence of system calls. The events are rare when compared to normal operation.

This shift in thinking considers the minor class as the outliers class, which might help you think of new ways to separate and classify samples.

Change detection is similar to anomaly detection, except rather than looking for an anomaly it is looking for a change or difference. This might be a change in the behavior of a user as observed by usage patterns or bank transactions.

Both of these shifts take a more real-time stance on the classification problem, which might give you some new ways of thinking about your problem and maybe some more techniques to try.
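As one sketch of the anomaly-detection framing, you can fit an unsupervised outlier detector and treat the rare class as the outliers; `IsolationForest` is an assumed choice here (a One-Class SVM would be another), and the Gaussian clusters are hypothetical data:

```python
# Reframe the minority class as outliers for an unsupervised detector.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
normal = rng.normal(0, 1, size=(95, 2))   # majority, "normal operation"
rare = rng.normal(6, 0.5, size=(5, 2))    # minority, the rare events
X = np.vstack([normal, rare])

detector = IsolationForest(contamination=0.05, random_state=3).fit(X)
pred = detector.predict(X)                # +1 = normal, -1 = outlier
n_flagged = int((pred == -1).sum())
print(n_flagged)                          # number of points flagged as outliers
```

No class labels were used at all, which is exactly the change of perspective: the detector only needs a notion of what "normal" looks like.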

8) Try Getting Creative

Really climb inside your problem and think about how to break it down into smaller problems that are more tractable.

For inspiration, take a look at the very creative answers on Quora in response to the question "In classification, how do you handle an unbalanced training set?"

For example:

Decompose your larger class into a smaller number of other classes ...

... use a one-class classifier ... (e.g. treat like outlier detection)

... resample the unbalanced training set into not one balanced set, but several. Running an ensemble of classifiers on these sets could produce a much better result than one classifier alone.

These are just a few of the interesting and creative ideas you could try.

For more ideas, check out these comments on the Reddit post "Classification when 80% of my training set is of one class".
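The resampling-ensemble idea quoted above is easy to sketch: build several balanced subsets by under-sampling the majority class, train one classifier per subset, then majority-vote. The synthetic data and the choice of five trees are hypothetical.

```python
# Ensemble of classifiers, each trained on a different balanced subset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=4)
minority = np.where(y == 1)[0]
majority = np.where(y == 0)[0]

models = []
for _ in range(5):  # five balanced subsets, five classifiers
    picked = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([picked, minority])
    models.append(DecisionTreeClassifier(random_state=4).fit(X[idx], y[idx]))

votes = np.mean([m.predict(X) for m in models], axis=0)
y_pred = (votes >= 0.5).astype(int)  # majority vote across the ensemble
print(y_pred.shape)  # (600,)
```

Each classifier sees every minority example but only a different slice of the majority, so no majority data is thrown away overall.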

Pick a Method and Take Action

You do not need to be an algorithm wizard or a statistician to build accurate and reliable models from imbalanced datasets.

We have covered a number of techniques that you can use to model an imbalanced dataset.

Hopefully there are one or two that you can take off the shelf and apply immediately, for example changing your accuracy metric and resampling your dataset. Both are fast and will have an impact straight away.

Which method are you going to try?

A Final Word, Start Small

Remember that we cannot know which approach is going to best serve you and the dataset you are working on.

You can use some expert heuristics to pick this method or that, but in the end, the best advice I can give you is to "become the scientist" and empirically test each method and select the one that gives you the best results.

Start small and build upon what you learn.

Want More? Further Reading ...

There are resources on class imbalance if you know where to look, but they are few and far between.

I've looked, and the following are what I think are the cream of the crop. If you'd like to dive deeper into some of the academic literature on dealing with class imbalance, check out some of the links below.

Books

    • Imbalanced Learning: Foundations, Algorithms, and Applications

Papers

    • Data Mining for Imbalanced Datasets: An Overview
    • Learning from Imbalanced Data
    • Addressing the Curse of Imbalanced Training Sets: One-Sided Selection (PDF)
    • A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data

Did you find this post useful? Still have questions?

Leave a comment and let me know about your problem and any questions you still have about handling imbalanced classes.

