Contents:
1. Introduction
2. Impact
3. Others' solutions (data level; algorithm level)
4. My own solution
5. References
1. Introduction
When I worked on sentiment classification before, I used classic corpora such as SST. But once you build a corpus yourself, you find things are not as simple as they seem: the corpus has to be cleaned and split (for 10-fold cross-validation), and now the balance of the corpus also has to be considered.
The imbalance problem: the number of samples differs greatly between categories.
Take a look at my corpus:
There are 6 categories in total, and their sizes differ greatly.

2. Impact
Imbalance in the number of samples across categories limits the accuracy of many classification algorithms: classifiers tend to assign samples to the large classes, which lowers overall accuracy. Yet imbalanced classification problems are real and pervasive, and the minority classes are often exactly the ones worth attention, e.g. network attacks or fraudulent credit-card transactions. Fraudulent transactions belong to a minority class; if they are classified less accurately, the illegal records are hard to find.
Why is classification accuracy low on the minority classes? Because the features of minority-class samples are not distinctive, they easily mix with noise, and most classification methods discriminate based on features. With indistinct features, minority-class samples are hard to tell apart.

3. Others' solutions
The usual practices are as follows.

Data level:
Oversampling: directly copy samples, i.e., duplicate the samples of the classes that have fewer of them. Interpolation: normalize and sample the data to estimate the sample distribution, extreme values, and mean, then generate new samples according to these statistics to enlarge the minority classes.
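As a minimal sketch of the direct-duplication variant, assuming the corpus is stored as (label, text) pairs (the function name and data here are illustrative, not a library API):

```python
import random
from collections import Counter

def random_oversample(samples, seed=0):
    """Duplicate minority-class samples until every class matches the largest one.

    `samples` is a list of (label, text) pairs; this is an illustrative
    sketch of random oversampling, not a library API.
    """
    rng = random.Random(seed)
    by_label = {}
    for label, text in samples:
        by_label.setdefault(label, []).append(text)
    target = max(len(texts) for texts in by_label.values())
    balanced = []
    for label, texts in by_label.items():
        # Randomly copy existing samples of this class until it reaches `target`.
        extra = [rng.choice(texts) for _ in range(target - len(texts))]
        balanced.extend((label, t) for t in texts + extra)
    return balanced

data = [("pos", "good")] * 5 + [("neg", "bad")] * 2
print(Counter(label for label, _ in random_oversample(data)))
# both classes now have 5 samples
```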
Undersampling: directly delete samples, randomly reducing the number of samples in the majority classes.

Algorithm level:
Weighted loss function: a common way to handle imbalanced data is to weight the loss function so that misclassifying a minority-class sample costs more than misclassifying a majority-class one. In Python's scikit-learn we can use the `class_weight` parameter to set the weights, e.g. increase the weight of a minority class 10-fold.
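A sketch of `class_weight` in scikit-learn on made-up toy data (the weights, data, and choice of `LogisticRegression` are only illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Imbalanced toy data: 90 samples of class 0, 10 samples of class 1.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, (90, 2)), rng.normal(1.5, 1.0, (10, 2))])
y = np.array([0] * 90 + [1] * 10)

# Weight errors on the minority class 10x more than on the majority class.
clf = LogisticRegression(class_weight={0: 1, 1: 10}).fit(X, y)

# 'balanced' computes weights as n_samples / (n_classes * class_count) automatically.
clf_auto = LogisticRegression(class_weight="balanced").fit(X, y)
```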
RBG (Ross Girshick) and Kaiming He give a pretty good method, which is not introduced in detail here.
For more information see this link: http://blog.csdn.net/u014380165/article/details/77019084
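If the method in question is the focal loss (the paper is by Lin et al., with Girshick and He among the authors), the idea is to down-weight easy, well-classified examples in the cross-entropy so training focuses on hard and minority-class samples. A minimal binary sketch, with the alpha and gamma values only illustrative:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    `p` is the predicted probability of the positive class, `y` is in {0, 1}.
    With alpha=1 and gamma=0 it reduces to ordinary cross-entropy.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently-correct ("easy") example is down-weighted far more than a
# hard one, because of the (1 - p_t)**gamma modulating factor.
easy = focal_loss(0.95, 1)
hard = focal_loss(0.30, 1)
```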
See also this blog:
It takes a very creative approach: a special kind of oversampling. It first analyzes the classes with fewer samples, examines the relevant attributes of words through syntactic dependency parsing of the text, and then generates new text by synonym substitution. The method is simple and effective.
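A minimal sketch of the synonym-substitution idea. The synonym table and sentences are made up, and the blog's dependency-parsing step for choosing which words to replace is omitted here:

```python
import random

# Hypothetical synonym table; a real system would use a thesaurus and
# dependency parsing to decide which words are safe to replace.
SYNONYMS = {
    "good": ["great", "fine"],
    "movie": ["film"],
}

def augment(sentence, synonyms, n_new=3, seed=0):
    """Generate new sentences by replacing known words with random synonyms."""
    rng = random.Random(seed)
    words = sentence.split()
    out = []
    for _ in range(n_new):
        out.append(" ".join(
            rng.choice(synonyms[w]) if w in synonyms else w for w in words
        ))
    return out

print(augment("a good movie", SYNONYMS))
```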
https://blog.csdn.net/u014535908/article/details/79035653

4. My own solution
I have not thought this through yet; I want to run some experiments, and will fill in the results here once they are out.

5. References
https://blog.csdn.net/jerryfy007/article/details/72904257
http://blog.sina.com.cn/s/blog_afa352bf0102vo57.html
https://blog.csdn.net/u014380165/article/details/77019084