Experience with handling imbalanced data in classification

Source: Internet
Author: User
Problem: Studies show that in some applications, a sufficiently skewed class ratio may invalidate some classification methods. The main reasons are:
(1) The information contained in the minority class is limited, making it difficult to determine the distribution of the minority-class data. That is, it is hard to find regular patterns within the minority class, which results in a low recognition rate for it.
(2) Data fragmentation. Many classification algorithms adopt a divide-and-conquer strategy. Gradually partitioning the sample space causes data fragmentation: rules can only be found within each independent subspace. For the minority class, each subspace contains very little data, and rules that span subspaces cannot be mined.
(3) Inappropriate inductive bias. When facing uncertainty, many inductive reasoning systems tend to assign samples to the majority class.
------------------------------------------
Experience:
I want to classify imbalanced data: class 0 makes up a large proportion and class 1 a small one. What should I do? I assume you have also run into data imbalance problems in your classification work.
You can manually increase the number of class-1 samples in the training set.
How much do you usually add?
Roughly enough to balance the classes, especially samples near the decision boundary (typical samples).
Pick some manually?
Yes. I used to work on sockpuppet-account detection, and I also added some samples manually; the results improved significantly afterwards.
------------------------------------------
Solution summary:
1. Oversampling
The most common way to handle imbalanced data is sampling: eliminate or reduce the imbalance by changing the distribution of the training data.
Oversampling improves minority-class performance by adding minority samples. The simplest way is to duplicate existing minority samples; the disadvantage is that this may lead to overfitting, since no new information is added to the minority class. Improved oversampling methods add random Gaussian noise to minority samples or generate new synthetic samples.
Note: this seems to help little for SVM.
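A minimal sketch of the simplest variant, random oversampling by duplication, using NumPy (the function name and toy data are illustrative, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X, y, minority_label=1):
    """Duplicate randomly chosen minority samples until classes are balanced."""
    X, y = np.asarray(X), np.asarray(y)
    minority = np.where(y == minority_label)[0]
    majority = np.where(y != minority_label)[0]
    n_extra = len(majority) - len(minority)
    if n_extra <= 0:
        return X, y
    picks = rng.choice(minority, size=n_extra, replace=True)
    X_new = np.vstack([X, X[picks]])
    y_new = np.concatenate([y, y[picks]])
    return X_new, y_new

# toy data: 10 majority (class 0) samples, 2 minority (class 1) samples
X = np.arange(24).reshape(12, 2)
y = np.array([0] * 10 + [1] * 2)
X_bal, y_bal = random_oversample(X, y)
print((y_bal == 0).sum(), (y_bal == 1).sum())  # 10 10
```

Since the duplicates are exact copies, adding small Gaussian noise to `X[picks]` before stacking would give the "improved" variant the text mentions.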

2. Undersampling
Undersampling improves minority-class performance by removing majority samples. The simplest method is to randomly remove majority samples to shrink the majority class; the disadvantage is that important majority-class information may be lost, so the available data is not fully used.
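The random variant described above can be sketched the same way (function name and toy data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_undersample(X, y, minority_label=1):
    """Randomly drop majority samples until classes are balanced."""
    X, y = np.asarray(X), np.asarray(y)
    minority = np.where(y == minority_label)[0]
    majority = np.where(y != minority_label)[0]
    keep_majority = rng.choice(majority, size=len(minority), replace=False)
    keep = np.concatenate([keep_majority, minority])
    return X[keep], y[keep]

# toy data: 10 majority (class 0) samples, 2 minority (class 1) samples
X = np.arange(24).reshape(12, 2)
y = np.array([0] * 10 + [1] * 2)
X_bal, y_bal = random_undersample(X, y)
print((y_bal == 0).sum(), (y_bal == 1).sum())  # 2 2
```

The information loss the text warns about is visible here: 8 of the 10 majority samples are simply thrown away.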

3. Cost-sensitive methods at the algorithm level.
(1) Reweight the training set. Instead of changing the existing algorithm, assign each sample in the training set a weight according to its misclassification cost, effectively reconstructing the original sample set by weight.
(2) Introduce cost-sensitive factors to design cost-sensitive classification algorithms. Generally, minority-class samples are given a higher misclassification cost and majority-class samples a lower one, which offsets the difference in sample counts.

4. Feature Selection
When the class sizes are unevenly distributed, the feature distribution is also imbalanced. In text classification especially, features that appear often in one class may not appear at all in a rare class. Selecting the most discriminative features for the imbalanced task therefore helps improve recognition of the rare class.
Based on an empirical ratio, take the positive and negative sample sets separately, select the feature set that best represents each class, and then merge these feature sets into the final feature set.
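One way to read that recipe is: score features within each class, keep the top-k per class, and take the union. A minimal sketch with NumPy (the scoring ratio, k, and document-term toy data are illustrative assumptions, not from the original post):

```python
import numpy as np

def per_class_top_features(X, y, k=2):
    """For each class, rank features by their mean within that class
    relative to the overall mean, keep the top-k, and return the union."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    overall = X.mean(axis=0) + 1e-9  # avoid division by zero
    selected = set()
    for c in np.unique(y):
        ratio = X[y == c].mean(axis=0) / overall
        top = np.argsort(ratio)[::-1][:k]
        selected.update(top.tolist())
    return sorted(selected)

# toy document-term counts: features 0-1 typical of class 0, 2-3 of class 1
X = np.array([[5, 4, 0, 0],
              [4, 5, 1, 0],
              [6, 3, 0, 1],
              [0, 1, 5, 4],
              [1, 0, 4, 5]])
y = np.array([0, 0, 0, 1, 1])
print(per_class_top_features(X, y, k=2))  # [0, 1, 2, 3]
```

Because each class nominates its own features, the rare class's characteristic features survive even when a global selection criterion would drown them out.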

Refer:
http://wenku.baidu.com/view/28946d32f111f18583d05a37.html
http://wenku.baidu.com/view/83ab4beb6294dd88d0d26b7c
