Original paper: "The Impact of Imbalanced Training Data for Convolutional Neural Networks"
This post is my reading notes on the paper, so some details are inevitably wrong.
Please bear with me; criticism and corrections are welcome.
For more related posts, see: http://blog.csdn.net/cyh_24
If you repost this article, please include a link to it: http://blog.csdn.net/cyh_24/article/details/49871387
Abstract
This paper studies how training a CNN on imbalanced data affects image classification. The data set is CIFAR-10, from which training sets with different class distributions are constructed by hand, for example by letting some classes make up a large share of the images and others a small share. A CNN is trained on each of these training sets and its accuracy is then measured.
The results show that imbalanced training sets have a significant negative impact on performance, while a balanced training set gives the best performance.
Furthermore, the paper concludes that oversampling is a simple and effective way to deal with imbalanced training sets.
Experimental process
Dataset
The data set is CIFAR-10, which has 10 classes with 6,000 images per class, 60,000 images in total.
CIFAR-10 is split so that, for each class, 5,000 images are used for training and 1,000 for testing.
Generate different data distributions
Explanation:
- Dist. 1 is the balanced data: each class accounts for 10% of the training set;
- In Dist. 2, airplane, automobile, bird and cat each account for 8%, while the other classes each account for 12% ... The remaining distributions are read the same way.
This gives 11 training sets. The same CNN is trained on each of them and then tested on the original test data.
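To make the construction concrete, here is a minimal sketch of how a training subset with a given class distribution could be drawn. It is my own illustration, not the paper's code. The function name `subsample_to_distribution`, the rule for choosing the subset size (as large as the available images allow), and the example shares (five classes at 8%, five at 12%, loosely modeled on the 8%/12% split described above) are all assumptions.

```python
import random
from collections import Counter, defaultdict

def subsample_to_distribution(labels, shares, seed=0):
    """Return sorted indices of a subset whose class proportions follow `shares`.

    labels: list of class ids, one per training image.
    shares: dict mapping class id -> target fraction (expected to sum to 1).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Assumption: make the subset as large as the available images allow.
    total = min(len(by_class[c]) / s for c, s in shares.items() if s > 0)

    chosen = []
    for c, s in shares.items():
        n_keep = min(int(round(total * s)), len(by_class[c]))
        chosen.extend(rng.sample(by_class[c], n_keep))  # no duplicates
    return sorted(chosen)

# Toy stand-in for the CIFAR-10 training labels: 10 classes, 5,000 images each.
labels = [c for c in range(10) for _ in range(5000)]

# Illustrative shares (assumed, not taken from the paper's table).
shares = {c: 0.08 for c in range(5)}
shares.update({c: 0.12 for c in range(5, 10)})

subset = subsample_to_distribution(labels, shares)
counts = Counter(labels[i] for i in subset)
```

With these shares, the 12% classes keep all of their images and the 8% classes are cut down in proportion. The returned indices can then be used to pick the matching images from the real training set.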
Oversampling
The oversampling method used in the paper is very simple:
For each class, randomly chosen images are duplicated until the class has as many images as the largest class.
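The procedure just described can be sketched in a few lines. This is an illustration under my own assumptions rather than the paper's implementation: the function name `oversample_indices` is made up, and I assume the duplicates are drawn with replacement from each class's original images.

```python
import random
from collections import Counter, defaultdict

def oversample_indices(labels, seed=0):
    """Return image indices in which every class is as large as the largest one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Size of the largest class: every other class is padded up to this.
    target = max(len(idxs) for idxs in by_class.values())

    result = []
    for idxs in by_class.values():
        result.extend(idxs)  # keep every original image once
        # Assumption: duplicates are drawn at random, with replacement.
        result.extend(rng.choices(idxs, k=target - len(idxs)))
    rng.shuffle(result)  # avoid long runs of a single class
    return result

# Toy imbalanced label list: class 0 has 8 images, class 1 has 3, class 2 has 5.
labels = [0] * 8 + [1] * 3 + [2] * 5
balanced = oversample_indices(labels)
balanced_counts = Counter(labels[i] for i in balanced)
```

Working on indices instead of copying pixel data keeps the sketch independent of how the images are stored. Only the training set is oversampled; the test set stays untouched.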
Results
Distribution Performance
Oversampling Performance
This is the CNN's performance after training on the oversampled data. Accuracy improves in almost every case, but Dist. 1 (the balanced training data) is still the highest.
Total Performance
Averaging the accuracy for each distribution gives the comparison chart: the dark bars are the accuracy with the imbalanced training data, the light bars are the accuracy after oversampling.
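For readers who want to reproduce this kind of summary, here is a small sketch of how a per-class accuracy and its average could be computed. I am assuming that the per-distribution number is the mean of the per-class accuracies; on a test set with the same number of images per class, this equals the ordinary overall accuracy. The function names and the toy labels are illustrative only.

```python
from collections import Counter

def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly classified test images for each true class."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: correct[c] / total[c] for c in sorted(total)}

def mean_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class accuracies."""
    acc = per_class_accuracy(y_true, y_pred)
    return sum(acc.values()) / len(acc)

# Toy example with two classes and four test images each.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

per_class = per_class_accuracy(y_true, y_pred)  # {0: 0.75, 1: 0.5}
avg = mean_accuracy(y_true, y_pred)             # 0.625
```

Computing this once for the model trained on the imbalanced set and once for the model trained on the oversampled set gives the two values compared for each distribution.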
The paper's goal is clear and its idea is simple, with no other tricks, so I will stop here.
To summarize what the paper does and what it concludes:
- The class distribution of the training data has a large impact on CNN results;
- A balanced training set is optimal, and the more imbalanced the data, the lower the accuracy;
- Oversampling improves the accuracy.