How does "data processing" deal with unbalanced datasets in machine learning?

Source: Internet
Author: User
Tags: svm
In machine learning we often encounter unbalanced datasets. In a cancer dataset, for example, the number of cancer samples may be far smaller than the number of non-cancer samples; in a bank's credit dataset, the number of customers who repay on schedule may be much larger than the number who default. Take the well-known German credit dataset, where the positive and negative classes are quite unbalanced: if you simply train on it without any processing (with an SVM, say), most of the good customers (about 97%) are correctly identified as good customers, but most of the bad customers (about 95%) are also identified as good customers. If we evaluate the model by accuracy alone at this point, it looks fine, while the bank may have to bear huge losses caused by defaults. In the "Model Selection and Evaluation" chapter of Zhou Zhihua's (Nanjing University) book Machine Learning, a more comprehensive way of evaluating models with precision, recall, and the F1 score (the harmonic mean of precision and recall) is discussed; a brief sketch follows the list below. This article discusses how to deal with the class-imbalance problem in machine learning:
    • Over-sampling
    • Under-sampling
    • Combination of over- and under-sampling
    • Ensemble sampling
    • Cost-sensitive learning
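As a concrete illustration of the evaluation point above, here is a minimal sketch using scikit-learn on a synthetic imbalanced dataset (a stand-in for the credit data, not the actual German credit dataset; the sample sizes and class ratio are assumptions for illustration only). Accuracy looks high, while the per-class precision, recall and F1 reported by classification_report show that the minority class is barely detected.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for an imbalanced credit dataset:
# ~95% "good" customers (class 0), ~5% "bad" customers (class 1).
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
print(Counter(y))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = SVC().fit(X_tr, y_tr)            # the imbalance is not handled at all
y_pred = clf.predict(X_te)

print("accuracy:", accuracy_score(y_te, y_pred))   # looks deceptively high
print(classification_report(y_te, y_pred))         # recall on class 1 is poor
```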
Note: the GitHub open-source project scikit-learn-contrib/imbalanced-learn provides implementation code for most of the algorithms in this answer, with detailed documentation and comments.

Over-sampling

Over-sampling increases the number of samples in the class that is already small. Common methods at present include SMOTE, ADASYN, SVM-SMOTE and Borderline-SMOTE. Implementations of SMOTE and ADASYN can also be found in that GitHub project. I visualized the result (scatter plots in the original post): blue triangles represent the majority-class samples (say, positive examples), green triangles represent the original minority-class samples (negative examples), and the red dots are the negative examples generated by SMOTE; ADASYN produces a similar effect. SMOTE, however, is not particularly good or stable in some cases, which is related to the idea behind the algorithm, and the original post compares SMOTE and ADASYN on such cases. ADASYN is not perfect either: when the two classes are clearly separable and the data points are widely spaced, ADASYN can produce NaN values.

References:
  • SMOTE: Synthetic Minority Over-sampling Technique
  • ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning
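The over-sampling step itself is short with imbalanced-learn; a minimal sketch, assuming the X and y from the evaluation example above and the fit_resample interface of recent imbalanced-learn releases:

```python
from collections import Counter

from imblearn.over_sampling import ADASYN, SMOTE

# X, y as in the evaluation sketch above
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)
X_ada, y_ada = ADASYN(random_state=0).fit_resample(X, y)

print("original:", Counter(y))
print("SMOTE:   ", Counter(y_sm))     # minority class synthetically enlarged
print("ADASYN:  ", Counter(y_ada))
```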
Under-sampling

Under-sampling reduces the number of samples in the majority class. Random under-sampling needs little explanation and is very simple to implement, but its performance is not very good, so some newer methods have been proposed (a code sketch follows the list); the better-known ones are:
  • Tomek Links
  • One-Sided Selection: "Addressing the Curse of Imbalanced Training Sets: One-Sided Selection"
  • Neighborhood Cleaning Rule: "Improving Identification of Difficult Small Classes by Balancing Class Distribution"
  • NearMiss: "kNN Approach to Unbalanced Data Distributions: A Case Study Involving Information Extraction"
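A minimal under-sampling sketch with imbalanced-learn, again assuming the X and y from the earlier example; OneSidedSelection and NeighbourhoodCleaningRule live in the same module with the same fit_resample interface:

```python
from collections import Counter

from imblearn.under_sampling import NearMiss, TomekLinks

# X, y as above
X_tl, y_tl = TomekLinks().fit_resample(X, y)          # remove Tomek-link majority samples
X_nm, y_nm = NearMiss(version=1).fit_resample(X, y)   # NearMiss-1

print("original:  ", Counter(y))
print("TomekLinks:", Counter(y_tl))
print("NearMiss-1:", Counter(y_nm))
```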

Combination of over- and under-sampling

As the name implies, this reduces the number of samples in the originally large class while at the same time increasing the number of samples in the originally small class.

Ensemble sampling

As we know, some under-sampling methods may cause us to lose fairly important data points. In their paper "Exploratory Undersampling for Class-Imbalance Learning", Xu-Ying Liu, Jianxin Wu and Zhi-Hua Zhou propose the EasyEnsemble and BalanceCascade methods, which solve this problem to some extent. In the paper the authors mention that some of EasyEnsemble's ideas are similar to Balanced Random Forests, but EasyEnsemble trains its learners on randomly under-sampled subsets of the majority class.

Cost-sensitive learning

We all know that, compared with misjudging a normal client as a bad-loan customer, misjudging a bad-loan client as a normal customer may bring much greater losses to the bank; likewise, compared with misjudging a non-cancer patient as a cancer patient, misjudging a cancer patient as a non-cancer patient may delay treatment and lead to far more serious consequences. Hence the idea of cost-sensitive learning as another way to handle imbalanced classification. For this part, see Charles X. Ling and Victor S. Sheng's article "Cost-Sensitive Learning and the Class Imbalance Problem". A combined sketch of these three approaches follows.
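A minimal sketch tying the last three approaches together, assuming the X and y from the earlier examples: SMOTEENN for combined over- and under-sampling, EasyEnsembleClassifier as imbalanced-learn's implementation of the EasyEnsemble idea, and scikit-learn's class_weight parameter as a simple form of cost-sensitive learning (the explicit costs shown are hypothetical).

```python
from imblearn.combine import SMOTEENN
from imblearn.ensemble import EasyEnsembleClassifier
from sklearn.svm import SVC

# 1) Combination of over- and under-sampling:
#    SMOTE over-sampling followed by Edited Nearest Neighbours cleaning.
X_comb, y_comb = SMOTEENN(random_state=0).fit_resample(X, y)

# 2) Ensemble sampling: each base learner is trained on a balanced,
#    randomly under-sampled subset of the data.
ens = EasyEnsembleClassifier(n_estimators=10, random_state=0).fit(X, y)

# 3) Cost-sensitive learning: weight errors on the rare class more heavily.
svm = SVC(class_weight="balanced").fit(X, y)
# An explicit (hypothetical) cost could be expressed as class_weight={0: 1, 1: 20}.
```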
