Sentiment Classification: Solutions to Class Imbalance in the Corpus

Source: Internet
Author: User

Contents:
First, Introduction
Second, Influence
Third, Other People's Solutions (data level, algorithm level)
Fourth, My Own Solution
Fifth, References

First, Introduction

Until now I had always worked on sentiment classification with classic corpora such as SST. Once I tried to build my own corpus, I found that things are not as simple as I had imagined: the corpus has to be cleaned and split (for 10-fold cross-validation), and now its class balance has to be considered as well.

Imbalance problem: the number of samples varies greatly between categories.
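A quick way to see how unbalanced a corpus is would be to count the samples per category and compare the largest class with the smallest. The sketch below assumes the labels are available as a plain Python list; the label names and counts are made up for illustration and are not the corpus discussed in this post.

```python
from collections import Counter


def class_distribution(labels):
    """Return per-class counts and the ratio of the largest to the smallest class."""
    counts = Counter(labels)
    largest = max(counts.values())
    smallest = min(counts.values())
    return counts, largest / smallest


# Hypothetical labels for illustration only.
labels = ["joy"] * 500 + ["anger"] * 120 + ["fear"] * 20

counts, ratio = class_distribution(labels)
for label, n in counts.most_common():
    print(f"{label:<6} {n:>4}")
print(f"imbalance ratio (largest / smallest): {ratio:.1f}")
```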

Take a look at my corpus:

There are 6 categories in total, and their sample counts differ widely.

Second, Influence

The imbalance in the number of samples between categories limits the accuracy of many classification algorithms. Many classifiers tend to assign samples to the large classes, which lowers accuracy on the small ones. But imbalanced classification is real and pervasive, and very often the minority classes are exactly the ones we care about, for example cyber attacks or fraudulent credit card transactions. Fraudulent transactions belong to the minority class, and because that class is classified less accurately, the fraudulent records are hard to find.
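To make the credit card example concrete, here is a small sketch with made-up numbers (990 legal and 10 fraudulent transactions, both assumptions). A classifier that always predicts the large class looks fine on overall accuracy, yet it finds none of the fraudulent records.

```python
def accuracy(y_true, y_pred):
    """Fraction of all samples that were labeled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of the samples of class `positive` that were found."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in relevant) / len(relevant)


# Hypothetical data: 990 legal transactions and 10 fraudulent ones.
y_true = ["legal"] * 990 + ["fraud"] * 10
# A degenerate classifier that always picks the large class.
y_pred = ["legal"] * 1000

acc = accuracy(y_true, y_pred)
fraud_recall = recall(y_true, y_pred, "fraud")
print(f"accuracy: {acc:.2f}, recall on fraud: {fraud_recall:.2f}")
```

This is why per-class figures such as recall are usually more informative than overall accuracy on an unbalanced corpus.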
Why is accuracy low on the minority classes? Because the features of the minority samples are indistinct, they are easily confused with noise. Most classification methods rely on the features of each class, so when those features are indistinct, the minority samples are hard to tell apart.

Third, Other People's Solutions

General practice:

Data level:

Oversampling by direct copying: samples of the categories with fewer samples are duplicated.

Interpolation: normalize and sample the data to obtain the sample distribution, extreme values, mean, and so on, then generate new samples according to these statistics to expand the number of samples.
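The two ideas above might look roughly like this in plain Python. The function names `oversample_by_copy` and `interpolate` are hypothetical, and the toy data is invented. Note that `interpolate` is a simple SMOTE-style stand-in (a random point between two existing minority feature vectors), not the statistics-based generation described above, and it only applies once the text has been turned into numeric vectors.

```python
import random
from collections import Counter


def oversample_by_copy(samples, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes until
    every class has as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_samples.extend(group + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


def interpolate(a, b, rng):
    """Return a new feature vector at a random point on the segment a..b."""
    t = rng.random()
    return [x + t * (y - x) for x, y in zip(a, b)]


# Hypothetical toy corpus: 4 positive texts, 1 negative text.
texts = ["good", "great", "fine", "nice", "awful"]
labels = ["pos", "pos", "pos", "pos", "neg"]
new_texts, new_labels = oversample_by_copy(texts, labels)
print(Counter(new_labels))

# Hypothetical minority-class feature vectors.
new_vector = interpolate([0.0, 2.0], [1.0, 4.0], random.Random(0))
print(new_vector)
```

Duplicated samples add no new information, so a model can overfit them; that is one reason interpolation-style methods are often tried as well.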

Undersampling by direct deletion: randomly reduce the number of majority-class samples.

Algorithm level:

Weighted loss function: a common way to deal with imbalanced data is to weight the loss function so that misclassifying a minority-class sample costs more than misclassifying a majority-class sample. In Python's scikit-learn, the class_weight parameter sets these weights, for example raising the weight of a minority class to 10 times that of the majority classes.
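As a minimal sketch of the class_weight idea, the snippet below trains scikit-learn's LogisticRegression on invented one-dimensional data and gives the minority class 10 times the weight of the majority class. The data, the choice of LogisticRegression, and the factor of 10 are all assumptions for illustration; other scikit-learn classifiers that accept class_weight can be used the same way.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, class 1 is the minority class.
X = [[0.0], [0.2], [0.4], [0.6], [0.8], [1.0], [1.2], [1.4], [2.0], [2.2]]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# Errors on class 1 count 10 times as much as errors on class 0.
weights = {0: 1, 1: 10}
clf = LogisticRegression(class_weight=weights)
clf.fit(X, y)
print(clf.predict([[1.7]]))

# class_weight="balanced" derives the weights from the class frequencies
# instead of requiring them to be set by hand.
balanced = LogisticRegression(class_weight="balanced").fit(X, y)
print(balanced.predict([[1.7]]))
```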

RBG and Kaiming He proposed a very good method, which is not covered in detail here.
For more information, see: http://blog.csdn.net/u014380165/article/details/77019084

I also came across a blog post with a very imaginative idea, a special kind of sampling.

It first takes the categories with fewer samples, parses the syntactic dependencies of each text, analyzes the relevant attributes of the words, and then generates new text by synonym substitution. The method is simple and effective.
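A heavily simplified sketch of the synonym substitution step is shown below. It skips the dependency parsing and word-attribute analysis entirely and just replaces every word found in a small lookup table; the `SYNONYMS` table and the `augment` function are hypothetical, and a real system would draw on a proper thesaurus and decide which words are safe to replace.

```python
import random

# Hypothetical synonym table for illustration only.
SYNONYMS = {
    "happy": ["glad", "pleased"],
    "movie": ["film"],
    "bad": ["poor", "awful"],
}


def augment(sentence, rng, synonyms=SYNONYMS):
    """Replace every word that has an entry in the table with a random synonym."""
    words = sentence.split()
    return " ".join(
        rng.choice(synonyms[w]) if w in synonyms else w for w in words
    )


rng = random.Random(0)
original = "i was happy with this movie"
# Generate several variants of one minority-class sentence.
variants = {augment(original, rng) for _ in range(10)}
for v in sorted(variants):
    print(v)
```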

https://blog.csdn.net/u014535908/article/details/79035653

Fourth, My Own Solution

I have not settled on an approach yet. I want to try a few things first and will fill in this section once the results are out.

Fifth, References

https://blog.csdn.net/jerryfy007/article/details/72904257
http://blog.sina.com.cn/s/blog_afa352bf0102vo57.html
https://blog.csdn.net/u014380165/article/details/77019084
