Paper notes: "The Impact of Imbalanced Training Data for CNN"

Last Update:2015-11-17 Source: Internet

Author: User


The original paper is "The Impact of Imbalanced Training Data for Convolutional Neural Networks".

This post is a set of reading notes on the paper, so it inevitably gets some details wrong.

Please bear with me; criticism and corrections are welcome.

For more related posts, see: http://blog.csdn.net/cyh_24

If you repost this article, please include a link to it: http://blog.csdn.net/cyh_24/article/details/49871387

Abstract

This paper studies how training a CNN on imbalanced data affects image classification. The data set is CIFAR-10, from which the authors build training sets with different class distributions, for example by making one class account for a large share of the images and the others for a small share. A CNN is trained on each of these training sets and its accuracy is measured.

The results show that imbalanced training sets have a significant negative impact on accuracy, and that the best performance is achieved when the training set is balanced.

Furthermore, the paper concludes that oversampling is an effective way to deal with imbalanced training sets.

Experimental process

Dataset

The data set used is CIFAR-10, which has 10 classes with 6,000 images per class, for a total of 60,000 images.

Each class is split into 5,000 training images and 1,000 test images, giving 50,000 training images and 10,000 test images in total.

Generate different data distributions

Explanation:

Dist.1 is the balanced data set, in which each class accounts for 10% of the images;
Dist.2 has airplane, automobile, bird and cat at 8% each, while the other classes are at 12% each. The remaining distributions can be read in the same way.

This gives 11 training sets. The same CNN is trained on each of them and then tested on the original test data.
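As a rough illustration of how one such training set could be built, here is a minimal sketch. It assumes the imbalanced sets are made by randomly subsampling the balanced training labels; the notes do not say exactly how the paper does it. The function name `subsample_by_distribution` and the example proportions are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def subsample_by_distribution(labels, proportions, seed=0):
    """Return indices of a training subset whose class mix follows `proportions`.

    labels: list of integer class ids, one per training image.
    proportions: dict mapping class id -> desired fraction of the subset.
    Assumption of this sketch: the class with the largest share keeps all of
    its images and every other class is randomly subsampled relative to it.
    """
    rng = random.Random(seed)

    # Group image indices by class.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # Largest subset size that no class runs out of images for.
    scale = min(len(by_class[c]) / p for c, p in proportions.items() if p > 0)

    chosen = []
    for c, p in proportions.items():
        n_keep = min(int(round(scale * p)), len(by_class[c]))
        chosen.extend(rng.sample(by_class[c], n_keep))
    rng.shuffle(chosen)
    return chosen


# Toy data: 10 classes with 100 images each (CIFAR-10 would have 5,000 each).
labels = [c for c in range(10) for _ in range(100)]
# Hypothetical distribution: class 0 at 28%, every other class at 8%.
proportions = {c: 0.08 for c in range(10)}
proportions[0] = 0.28

subset = subsample_by_distribution(labels, proportions)
counts = Counter(labels[i] for i in subset)
print(counts[0], counts[1])
```

The returned indices can then be used to select images and labels for training, while the test set is left untouched.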

Oversampling

The oversampling method used in the paper is very simple:

For each class, randomly selected images are duplicated until the class has as many images as the largest class.
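The rule above can be sketched in a few lines. This is only an illustration of random oversampling by duplication, not the authors' code; the helper name `oversample_to_largest` is made up for this example, and it works on label indices rather than on the images themselves.

```python
import random
from collections import Counter


def oversample_to_largest(labels, seed=0):
    """Return indices in which every class appears as often as the largest class.

    All original images are kept; smaller classes are padded with randomly
    chosen duplicates of their own images.
    """
    rng = random.Random(seed)

    # Group image indices by class.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    target = max(len(idxs) for idxs in by_class.values())

    balanced = []
    for idxs in by_class.values():
        balanced.extend(idxs)  # keep every original image
        n_extra = target - len(idxs)
        balanced.extend(rng.choice(idxs) for _ in range(n_extra))  # duplicates
    rng.shuffle(balanced)
    return balanced


# Toy imbalanced labels: 6 images of class 0, 2 of class 1, 3 of class 2.
labels = [0] * 6 + [1] * 2 + [2] * 3
balanced = oversample_to_largest(labels)
counts = Counter(labels[i] for i in balanced)
print(sorted(counts.items()))
```

Because the added images are exact copies, oversampling adds no new information; it only changes how often each class is seen during training.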

Results

Distribution Performance

Oversampling Performance

These are the results for the CNN trained after oversampling. Almost every class improves, but Dist.1 (balanced training data) is still the highest.

Total Performance

Averaging the accuracy for each distribution gives the comparison chart, in which the dark color is the accuracy with imbalanced data and the light color is the accuracy after oversampling.
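For reference, per-class and overall accuracy for a single trained model could be computed roughly as follows. This is a minimal sketch; the function names are made up for this example, and the notes do not say how the paper averages its results.

```python
def per_class_accuracy(y_true, y_pred):
    """Return a dict mapping each true class to the fraction predicted correctly."""
    total, correct = {}, {}
    for t, p in zip(y_true, y_pred):
        total[t] = total.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + (1 if t == p else 0)
    return {c: correct[c] / total[c] for c in total}


def overall_accuracy(y_true, y_pred):
    """Return the fraction of all test images predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy example: two classes with two test images each.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
per_class = per_class_accuracy(y_true, y_pred)
overall = overall_accuracy(y_true, y_pred)
print(per_class, overall)
```

Since the test set is balanced (1,000 images per class), the mean of the per-class accuracies equals the overall accuracy.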

The goal of the paper is clear and the idea is simple, with no other tricks, so I will stop here.

To summarize, the paper's findings and conclusions are:

The distribution of the training data has a large impact on CNN results;
A balanced training set is optimal, and the more imbalanced the data, the worse the accuracy;
Oversampling can improve the accuracy.

Copyright notice: If you repost this article, please include a link to it. Thank you very much! Author's homepage: http://blog.csdn.net/cyh_24


