Data imbalance in Machine Learning

Source: Internet
Author: User

Recently, I encountered a problem in which the positive samples were far fewer than the negative samples. A model trained on such a dataset tends to be biased toward negative predictions. I looked through some of the relevant literature and learned several methods for dealing with this problem.

First, what problems does an imbalanced dataset cause? A learner generally rests on two assumptions: its objective is to maximize accuracy, and it will be applied to test data that has the same distribution as the training set. When the data is imbalanced, the easiest way for the learner to raise its accuracy is to favor the class that makes up the larger share of the data. For example, if 1% of the samples are positive and 99% are negative, then predicting every sample as negative, without learning anything, already reaches 99% accuracy, whereas a learner that is actually trained may well fall short of 99%. This is the problem caused by imbalanced class proportions: even when such a model scores high on accuracy, it may find few or none of the positive samples, so it is unlikely to work well in practice and is not the model we want.
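The 1% versus 99% example above can be checked with a few lines of arithmetic. The following is only a sketch: the dataset size of 1,000 and the helper names `accuracy` and `recall` are assumptions made for illustration, not something taken from the referenced papers.

```python
# Hypothetical dataset: 1,000 samples, 1% positive (label 1), 99% negative (label 0).
labels = [1] * 10 + [0] * 990

# A "model" that learns nothing and always predicts negative.
predictions = [0] * len(labels)

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of actual positive samples that were predicted positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return true_pos / actual_pos if actual_pos else 0.0

baseline_accuracy = accuracy(labels, predictions)
baseline_recall = recall(labels, predictions)

print(baseline_accuracy)  # 0.99
print(baseline_recall)    # 0.0
```

The accuracy looks excellent, yet the recall on the positive class is zero, which is why accuracy alone is a misleading measure on data like this.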

Now that the problems caused by an imbalanced dataset during learning are clear, there are many corresponding solutions. The following are two common approaches.
1. Start with the dataset. Since the data is imbalanced, we can balance it manually. Either randomly sample the majority class down to the size of the minority class (undersampling), or duplicate minority-class samples until they match the size of the majority class (oversampling). The drawback of the former is that information may be lost, because only some of the samples are used. The drawback of the latter is possible overfitting, because the training set contains duplicated samples. The information loss of undersampling can be mitigated with an ensemble method: build each training set from all of the minority-class samples plus a random sample of the majority class, repeat this many times to obtain many training sets, and train one model on each. At test time, the classification result is decided by voting among the models.
In addition to balancing the dataset, you can also select features separately for the majority class and the minority class, and then combine them to build a learner. This may also improve the results.

2. Start with the learner. The simplest way is to change the decision cutoff (threshold) used to assign a class, which changes the proportion of samples predicted as each class. You can also learn only one class (one-class learning). In addition, the different costs of misclassifying different samples can be taken into account during learning: giving a higher cost to misclassified minority-class samples makes the learner more inclined to predict the minority class. This can also improve the model.
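The undersampling ensemble described in method 1 might be sketched as follows. This is a minimal illustration under stated assumptions: the one-dimensional toy data, the nearest-centroid base learner, and the names `make_balanced_subsets`, `train_centroid_model`, and `vote` are all hypothetical stand-ins; in practice any real classifier could take the base learner's place.

```python
import random

def make_balanced_subsets(majority, minority, n_subsets, seed=0):
    """Build training sets that each contain all minority samples
    plus an equally sized random sample of the majority class."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_subsets):
        sampled_majority = rng.sample(majority, len(minority))
        subsets.append((sampled_majority, list(minority)))
    return subsets

def train_centroid_model(negatives, positives):
    """Toy base learner (an assumption): 1-D nearest-centroid classifier."""
    neg_mean = sum(negatives) / len(negatives)
    pos_mean = sum(positives) / len(positives)

    def predict(x):
        return 1 if abs(x - pos_mean) < abs(x - neg_mean) else 0

    return predict

def vote(models, x):
    """Majority vote over the ensemble; returns 1 (positive) or 0 (negative)."""
    positive_votes = sum(model(x) for model in models)
    return 1 if positive_votes * 2 > len(models) else 0

# Hypothetical toy data: 990 negatives in [0, 2], 10 positives in [4, 5].
rng = random.Random(42)
majority = [rng.uniform(0.0, 2.0) for _ in range(990)]
minority = [rng.uniform(4.0, 5.0) for _ in range(10)]

# An odd number of models avoids tied votes.
subsets = make_balanced_subsets(majority, minority, n_subsets=11)
models = [train_centroid_model(neg, pos) for neg, pos in subsets]

print(vote(models, 4.5))  # 1
print(vote(models, 1.0))  # 0
```

Each model sees only 10 of the 990 majority samples, but across the 11 subsets the ensemble draws on a much larger part of the majority class, which is the point of the method.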
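For method 2, moving the cutoff and accounting for misclassification costs can be illustrated together. The sketch below assumes a model that already outputs a reasonably calibrated probability of the positive class; the scores, the cost values, and the function names are made up for illustration. Under that assumption, predicting positive whenever the expected cost of a missed positive exceeds the expected cost of a false alarm gives the cutoff `cost_fp / (cost_fp + cost_fn)`.

```python
def classify(prob_positive, cutoff=0.5):
    """Predict positive (1) when the positive-class probability reaches the cutoff."""
    return 1 if prob_positive >= cutoff else 0

def cost_based_cutoff(cost_false_positive, cost_false_negative):
    """Cutoff that minimizes expected cost, assuming calibrated probabilities:
    predict positive when p * cost_fn >= (1 - p) * cost_fp."""
    return cost_false_positive / (cost_false_positive + cost_false_negative)

# Hypothetical positive-class probabilities produced by some trained model.
scores = [0.05, 0.12, 0.30, 0.45, 0.80]

# Default cutoff of 0.5: only the most confident sample is predicted positive.
default_preds = [classify(p) for p in scores]

# Assume missing a positive is 9 times as costly as a false alarm.
cutoff = cost_based_cutoff(cost_false_positive=1.0, cost_false_negative=9.0)
shifted_preds = [classify(p, cutoff) for p in scores]

print(default_preds)  # [0, 0, 0, 0, 1]
print(shifted_preds)  # [0, 1, 1, 1, 1]
```

Lowering the cutoff finds more minority-class samples at the price of more false alarms, so the cost ratio should reflect the actual application.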

There are many studies on how to deal with unbalanced datasets. For more information, see the references.

References
1 Sotiris Kotsiantis, et al. Handling Imbalanced Datasets: A Review. 2006.
2 Foster Provost. Machine Learning from Imbalanced Data Sets.


Reference address: Http://blog.sciencenet.cn/blog-54276-377102.html
