Data Imbalance in Machine Learning
Recently, I ran into a problem where the positive examples were far fewer than the negative ones. Such a dataset biases the learned model toward predicting the negative class. I read some of the relevant literature and collected a few methods and techniques for dealing with this problem.
First, what problems does an imbalanced dataset cause? A learner generally rests on two assumptions: that it should maximize accuracy, and that it will be applied to a test set drawn from the same distribution as the training set. When the classes are imbalanced, a learner can score well simply by siding with the majority class. For example, if positives make up 1% and negatives 99%, then even without any learning at all, predicting negative for every example already achieves 99% accuracy, while a genuinely trained learner may well fall short of 99%. This is the problem caused by an imbalanced class ratio: a model can report high accuracy yet be useless in actual application, and that is not the model we want.
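The accuracy paradox described above can be shown in a few lines. This is a minimal sketch with made-up data: 1% positives, 99% negatives, and a trivial "model" that always predicts negative.

```python
# Demonstrate the accuracy paradox on a 1%/99% imbalanced dataset:
# a "classifier" that always predicts the majority (negative) class
# reaches 99% accuracy without learning anything.

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# 10 positives and 990 negatives (1% positive rate)
y_true = [1] * 10 + [0] * 990

# Trivial majority-class predictor: negative for everything
y_pred = [0] * len(y_true)

print(accuracy(y_true, y_pred))  # 0.99
```

Despite the 99% accuracy, this predictor catches zero positives, which is exactly why plain accuracy is a misleading objective on imbalanced data.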
Given the problems that imbalance causes during learning, many corresponding solutions have been proposed. The following are two common families of methods.
1. Start from the dataset. Since the data are imbalanced, we can rebalance them by hand: either randomly undersample the majority class down to the size of the minority class, or repeat (oversample) the minority class up to the size of the majority class. The drawback of the former is possible information loss, since only part of the majority samples are used; the drawback of the latter is possible overfitting, since samples are duplicated. The former's problem can be mitigated with an ensemble method: each training set contains all the minority samples plus a random subset of the majority samples of the same size. Repeating this many times yields many training sets, a model is trained on each, and at test time the classification result is decided by voting.
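The ensemble-of-undersampled-sets idea above can be sketched with the standard library alone. The data, the number of models, and the function names here are all made up for illustration; in practice the per-set models would be real classifiers rather than a bare voting step.

```python
import random
from collections import Counter

def balanced_subsets(majority, minority, n_models, seed=0):
    """Build n_models balanced training sets: each keeps every
    minority sample and adds a fresh random draw of majority
    samples of the same size (undersampling the majority class)."""
    rng = random.Random(seed)
    sets = []
    for _ in range(n_models):
        sampled = rng.sample(majority, len(minority))
        sets.append(minority + sampled)
    return sets

def vote(predictions):
    """Majority vote over the per-model predictions for one example."""
    return Counter(predictions).most_common(1)[0][0]

# Toy data: 4 positives (label 1) and 20 negatives (label 0)
pos = [(x, 1) for x in range(4)]
neg = [(x, 0) for x in range(100, 120)]

subsets = balanced_subsets(neg, pos, n_models=5)
print([len(s) for s in subsets])  # [8, 8, 8, 8, 8] -- each set is balanced
print(vote([1, 0, 1, 1, 0]))     # 1 -- three of five models said positive
```

Because every training set is balanced, each individual model avoids the majority-class bias, and the vote recovers much of the information that a single undersampled set would have thrown away.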
Besides balancing the dataset, you can also select features separately for the majority and minority classes and then combine them when building the learner, which may also improve performance.
2. Start from the learner. The simplest approach is to change the cutoff used to decide the class, which effectively shifts the balance between the predicted classes. You can also train a one-class model that learns only a single category. In addition, you can assign different misclassification costs to different classes during learning, so that the learner is pushed toward predicting the minority class; this can also improve the model.
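The cutoff shift and the cost-sensitive rule mentioned above are closely related, as this minimal sketch shows. The probability values and cost numbers are made up for illustration; note that a false-negative/false-positive cost ratio of 9:1 is equivalent to lowering the cutoff to 0.1.

```python
# Two learner-side adjustments, applied to predicted probabilities:
# (a) move the decision cutoff so the rare positive class is favoured;
# (b) pick the class with the lower expected misclassification cost,
#     which amounts to a cutoff of cost_fp / (cost_fp + cost_fn).

def classify(p_pos, cutoff=0.5):
    """Predict positive when P(positive) reaches the cutoff."""
    return 1 if p_pos >= cutoff else 0

def classify_cost(p_pos, cost_fn, cost_fp):
    """Predict the class with the lower expected cost.

    Expected cost of predicting negative: p_pos * cost_fn
    Expected cost of predicting positive: (1 - p_pos) * cost_fp
    """
    return 1 if p_pos * cost_fn >= (1 - p_pos) * cost_fp else 0

probs = [0.10, 0.30, 0.55, 0.70]
print([classify(p) for p in probs])               # [0, 0, 1, 1]
print([classify(p, cutoff=0.25) for p in probs])  # [0, 1, 1, 1]

# Missing a positive costs 9x a false alarm -> implied cutoff is 0.1
print([classify_cost(p, cost_fn=9, cost_fp=1) for p in probs])  # [1, 1, 1, 1]
```

In other words, cost-sensitive prediction and cutoff tuning are two views of the same lever: raising the cost of missing the minority class lowers the probability threshold at which it is predicted.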
There is a substantial body of research on handling imbalanced datasets; for more information, see the references.
References
[1] Sotiris Kotsiantis, et al. Handling Imbalanced Datasets: A Review. 2006.
[2] Foster Provost. Machine Learning from Imbalanced Data Sets.
Reference address: http://blog.sciencenet.cn/blog-54276-377102.html