In machine learning we often encounter imbalanced datasets. In cancer data sets, for example, the number of cancer samples may be far smaller than the number of non-cancer samples, and in a bank's credit data set, the number of customers who repay on schedule is usually much larger than the number of customers who default. Take the well-known German credit data set: the positive and negative classes are not balanced, and if you simply train on it without any special handling, then in the resulting model (an SVM, in this case) most of the good customers (about 97%) are correctly identified as good customers, but most of the bad customers (about 95%) are also identified as good customers. If we only use accuracy to evaluate such a model, the bank could suffer huge losses from the defaults it fails to catch. The "model selection and evaluation" chapter of Zhou Zhihua's *Machine Learning* describes a more complete way to evaluate models using precision, recall and the F1 score (the harmonic mean of precision and recall); a minimal evaluation sketch follows the list below. This article discusses several ways to deal with class imbalance in machine learning:
- Oversampling (over-sampling)
- Undersampling (under-sampling)
- Combined over- and under-sampling
- Ensemble sampling
- Cost-sensitive learning
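As a concrete illustration of why accuracy alone is misleading, here is a minimal sketch on a synthetic imbalanced data set (not the German credit data itself); the data, the 95:5 class ratio and the SVM parameters are assumptions for illustration only:

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Synthetic data with roughly a 95:5 class ratio, standing in for a credit data set
X, y = make_classification(n_samples=5000, n_features=10, n_informative=5,
                           weights=[0.95, 0.05], random_state=42)
print(Counter(y))  # class distribution, roughly 95:5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

clf = SVC(kernel="rbf").fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Accuracy looks good because the majority class dominates...
print("accuracy:", accuracy_score(y_test, y_pred))
# ...but precision/recall/F1 expose how poorly the minority class is handled
print(classification_report(y_test, y_pred, digits=3))
```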
Note: the GitHub open source project scikit-learn-contrib/imbalanced-learn provides implementations of most of the algorithms discussed in this post, with detailed documentation and comments.

## Oversampling (over-sampling)

Oversampling increases the number of samples in the class that has too few of them. The more common methods at present include SMOTE, ADASYN, SVM-SMOTE and Borderline-SMOTE. Implementations of the SMOTE and ADASYN algorithms can also be found in this GitHub project. I visualized the result: the blue triangles represent the majority-class samples (say, the positive examples), the green triangles represent the original minority-class samples (the counter-examples), and the red dots are the counter-examples generated by the SMOTE algorithm. ADASYN produces a similar effect. However, SMOTE is not particularly good, nor very stable, in some situations, which is related to the idea behind the algorithm itself, and it is worth comparing how SMOTE and ADASYN behave in such cases. ADASYN is not perfect either: when the two classes are cleanly separable and the data points are far apart, ADASYN can produce NaN values. A minimal oversampling sketch is given right after the undersampling list below. References:

- SMOTE: Synthetic Minority Over-sampling Technique
- ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning

## Undersampling (under-sampling)

Undersampling reduces the number of samples in the class that has too many of them. Random undersampling needs little explanation, since it is very simple to implement, but its performance is not very good, so a number of newer methods have been proposed; the better-known ones are:
- Tomek Links
- One-sided selection: Addressing the Curse of Imbalanced Training Sets: One-Sided Selection
- Neighbourhood cleaning rule: Improving Identification of Difficult Small Classes by Balancing Class Distribution
- NearMiss: kNN Approach to Unbalanced Data Distributions: A Case Study Involving Information Extraction
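Returning to the oversamplers discussed above: here is a minimal sketch using imbalanced-learn's SMOTE, ADASYN, Borderline-SMOTE and SVM-SMOTE classes; the synthetic data and the 9:1 ratio are assumptions for illustration only:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE, SVMSMOTE

# Toy data with roughly a 9:1 class ratio (assumption for illustration)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
print("original:", Counter(y))

# Each oversampler synthesises new minority samples until the classes are balanced.
# Note: on cleanly separable, widely spaced data ADASYN can fail (the NaN issue above).
for sampler in (SMOTE(random_state=0), ADASYN(random_state=0),
                BorderlineSMOTE(random_state=0), SVMSMOTE(random_state=0)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))
```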
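The undersampling methods just listed are likewise available in imbalanced-learn under imblearn.under_sampling; a minimal sketch on the same kind of assumed synthetic data:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import (RandomUnderSampler, TomekLinks,
                                     OneSidedSelection,
                                     NeighbourhoodCleaningRule, NearMiss)

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
print("original:", Counter(y))

samplers = (RandomUnderSampler(random_state=0),  # naive random removal
            TomekLinks(),                        # remove majority points in Tomek links
            OneSidedSelection(random_state=0),   # one-sided selection
            NeighbourhoodCleaningRule(),         # neighbourhood cleaning rule
            NearMiss(version=1))                 # NearMiss-1 heuristic

# Each undersampler removes majority-class samples according to its own rule
for sampler in samplers:
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))
```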
## Combined over- and under-sampling

As the name implies, this reduces the number of samples in the originally large class while at the same time increasing the number of samples in the originally small class (a minimal sketch is given at the end of this article).

## Ensemble sampling

As we know, some undersampling methods may cause us to lose rather important data points. In their paper Exploratory Undersampling for Class-Imbalance Learning, Xu-Ying Liu, Jianxin Wu and Zhi-Hua Zhou propose the EasyEnsemble and BalanceCascade methods, which solve this problem to some extent. In the paper the authors mention that some of EasyEnsemble's ideas are similar to balanced random forests, but EasyEnsemble trains its learners on independently, randomly sampled balanced subsets (a sketch is also given at the end of this article).

## Cost-sensitive learning

We all know that, compared with misjudging a normal customer as a bad-loan customer, misjudging a bad-loan customer as a normal customer may bring much greater losses to the bank; likewise, compared with misjudging a non-cancer patient as a cancer patient, misjudging a cancer patient as a non-cancer patient may delay treatment and lead to far more serious consequences. This is where the idea of cost-sensitive learning comes in as another way to deal with imbalanced classification: different types of error are assigned different costs. For this topic you can refer to Charles X. Ling and Victor S. Sheng's article Cost-Sensitive Learning and the Class Imbalance Problem.
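For the combined over-/under-sampling idea above, imbalanced-learn provides SMOTEENN and SMOTETomek (SMOTE followed by ENN or Tomek-link cleaning); a minimal sketch, again on assumed synthetic data:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN, SMOTETomek

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
print("original:", Counter(y))

# SMOTE first oversamples the minority class, then ENN / Tomek links clean the result
for sampler in (SMOTEENN(random_state=0), SMOTETomek(random_state=0)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))
```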
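For the ensemble idea, recent imbalanced-learn releases expose an EasyEnsembleClassifier, as well as a BalancedRandomForestClassifier corresponding to the balanced random forests mentioned above (BalanceCascade is no longer shipped in recent releases). A minimal sketch under the same synthetic-data assumption:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from imblearn.ensemble import EasyEnsembleClassifier, BalancedRandomForestClassifier

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# EasyEnsemble: several boosted learners, each trained on a balanced random subset
easy = EasyEnsembleClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)
print(classification_report(y_test, easy.predict(X_test), digits=3))

# Balanced random forest: each tree sees a balanced bootstrap sample
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(classification_report(y_test, brf.predict(X_test), digits=3))
```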
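Finally, one simple and common entry point to cost-sensitive learning is the class_weight parameter of scikit-learn classifiers, which raises the penalty for misclassifying the minority (defaulting / cancer) class. The 1:10 cost ratio below is an illustrative assumption, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Plain SVM vs. cost-sensitive SVMs that charge more for minority-class errors
for weights in (None, {0: 1, 1: 10}, "balanced"):
    clf = SVC(kernel="rbf", class_weight=weights).fit(X_train, y_train)
    print("class_weight =", weights)
    print(classification_report(y_test, clf.predict(X_test), digits=3))
```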