The influence of sample imbalance on SVM


Assume that the positive class has far more samples than the negative class.

1, the linearly separable case

Suppose the real dataset is as follows:

Because there are so few negative samples, the sampled data may instead look like this:

This shifts the separating hyperplane toward the negative class. Strictly speaking, the shift is caused not by the sample counts themselves but by the change in the boundary points.
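The point about boundary points can be illustrated with a minimal sketch in one dimension, where the hard-margin boundary for separable data is simply the midpoint between the closest negative and positive samples. The data values and the helper name `max_margin_threshold` are hypothetical, chosen only for illustration.

```python
def max_margin_threshold(negatives, positives):
    """Hard-margin boundary for separable 1-D data: midpoint of the closest pair."""
    rightmost_negative = max(negatives)
    leftmost_positive = min(positives)
    if rightmost_negative >= leftmost_positive:
        raise ValueError("classes are not linearly separable")
    return (rightmost_negative + leftmost_positive) / 2.0


# Hypothetical data: negatives lie below 0, positives above 0.
positives = [1.0, 1.5, 2.0, 3.0, 4.0, 5.0]
negatives_full = [-5.0, -4.0, -3.0, -2.0, -1.5, -1.0]

# Two equally small negative samples: one misses the boundary points, one keeps them.
negatives_few = [-4.0, -3.0]
negatives_few_boundary = [-1.5, -1.0]

threshold_full = max_margin_threshold(negatives_full, positives)                   # 0.0
threshold_few = max_margin_threshold(negatives_few, positives)                     # -1.0
threshold_few_boundary = max_margin_threshold(negatives_few_boundary, positives)   # 0.0
```

In this sketch the boundary moves toward the negative class only when the negative boundary points are missing from the sample; a sample of the same small size that still contains them leaves the boundary where it was.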

2, the linearly non-separable case

The source data and the ideal hyperplane are as follows:

Because the negative class is so small, the following situation is likely to occur, and the hyperplane shifts toward the negative class.

Solutions to imbalance:

"SVM itself is not very sensitive to imbalance."

"The SVM hyperplane depends only on the support vectors, so the amount of data lying away from the decision boundary does not matter."

1, oversampling (random sampling)

2, undersampling (applied to the majority class, with attention to its boundary samples): the sampling method should keep samples that both represent the distribution of the majority class and influence the classification boundary

3, improve the algorithm itself (cost-sensitive)
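The two sampling options above can be sketched with the Python standard library alone. This is a minimal illustration under assumptions: the function names and toy data are hypothetical, and the undersampling here is purely random, so it does not implement the boundary-aware selection described in item 2.

```python
import random


def random_oversample(minority, target_size, rng):
    """Duplicate randomly chosen minority samples until target_size is reached."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return list(minority) + extra


def random_undersample(majority, target_size, rng):
    """Keep a random subset of the majority class, drawn without replacement."""
    return rng.sample(list(majority), target_size)


# Hypothetical toy data: 100 majority points and 3 minority points.
rng = random.Random(0)
majority = [(float(i), 1.0) for i in range(100)]
minority = [(-1.0, -1.0), (-2.0, -1.5), (-1.5, -3.0)]

oversampled_minority = random_oversample(minority, len(majority), rng)
undersampled_majority = random_undersample(majority, len(minority), rng)
```

Either result can be combined with the other class to form a balanced training set. Oversampling only repeats existing points, while undersampling discards data, possibly including boundary points that matter to the SVM.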

Reference Blog

1, give the positive and negative examples different C values. For example, if the positive examples are far fewer than the negative examples, use a larger C for the positive examples. The disadvantage of this method is that the result may deviate from the probability distribution of the original data;

2, preprocess the training set by resampling according to some strategy, either increasing the number of minority samples or reducing the number of majority samples. A typical method is random insertion, whose disadvantage is that overfitting may occur. A better method is the Synthetic Minority Over-sampling Technique (SMOTE); its disadvantage is that it only works in a specific feature space and is not suitable for problems that cannot be represented by feature vectors. Of course, adding samples also means that the training time may increase;

3, handle the imbalanced data through the kernel function.
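The first two approaches in this list can be sketched as follows. This is a rough illustration under assumptions, not the referenced blog's implementation: `weighted_hinge_objective` evaluates a soft-margin objective with a separate C per class, and `smote_like_samples` interpolates between a minority point and one of its nearest minority neighbors, which is the core idea of SMOTE rather than a full implementation. All names and data values are hypothetical.

```python
import random


def weighted_hinge_objective(w, b, samples, labels, c_pos, c_neg):
    """0.5 * ||w||^2 + C_pos * (hinge loss on positives) + C_neg * (hinge loss on negatives)."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    penalty = 0.0
    for x, y in zip(samples, labels):  # each label y is +1 or -1
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        hinge = max(0.0, 1.0 - margin)
        penalty += (c_pos if y > 0 else c_neg) * hinge
    return regularizer + penalty


def smote_like_samples(minority, n_new, k, rng):
    """Create synthetic minority points on segments between near minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Minority points other than base, sorted by squared distance to base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - c) ** 2 for a, c in zip(base, p)))
        neighbor = rng.choice(others[:k])
        t = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(a + t * (c - a) for a, c in zip(base, neighbor)))
    return synthetic


# Hypothetical 1-D data: one positive (minority) and three negatives, with w = 1, b = 0.
samples = [(0.5,), (-2.0,), (-3.0,), (0.25,)]
labels = [1, -1, -1, -1]
objective_equal = weighted_hinge_objective((1.0,), 0.0, samples, labels, 1.0, 1.0)      # 2.25
objective_weighted = weighted_hinge_objective((1.0,), 0.0, samples, labels, 10.0, 1.0)  # 6.75

# Hypothetical 2-D minority class, extended with four synthetic points.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
synthetic = smote_like_samples(minority, n_new=4, k=2, rng=random.Random(0))
```

With the larger positive C, the same margin violation on the single positive example costs ten times as much, so an optimizer minimizing this objective is pushed to move the boundary away from the minority class. In practice a library would normally be used; for instance, scikit-learn's `SVC` accepts a `class_weight` argument that scales C per class.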
