The influence of class imbalance on SVM
Assume that the positive-class samples far outnumber the negative-class samples.
1. The linearly separable case
Suppose the real dataset is as follows:
Because there are so few negative samples, the following situation may occur:
The separating hyperplane becomes biased towards the negative class. Strictly speaking, this bias is not caused by the sample counts themselves, but by the change in the boundary points (the support vectors).
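The point about boundary points can be illustrated with a toy one-dimensional case. This is only a sketch under stated assumptions: the data are made up, the helper `max_margin_threshold` is hypothetical, and it relies on the fact that a 1-D hard-margin SVM places its boundary midway between the two closest points of opposite classes.

```python
def max_margin_threshold(neg, pos):
    """Decision threshold of a 1-D hard-margin SVM (hypothetical helper).

    Assumes the data are separable with every negative value below every
    positive value, so the boundary sits midway between the two closest
    opposite-class points (the support vectors).
    """
    closest_neg, closest_pos = max(neg), min(pos)
    if closest_neg >= closest_pos:
        raise ValueError("classes are not separable in this order")
    return (closest_neg + closest_pos) / 2.0


pos = [1.0, 2.0, 3.0]
neg_full = [-3.0, -2.0, -1.0]  # negative class sampled up to its boundary

baseline = max_margin_threshold(neg_full, pos)

# Many more positive samples, all far from the boundary: nothing changes.
pos_many = pos + [4.0 + 0.5 * i for i in range(1000)]
more_positives = max_margin_threshold(neg_full, pos_many)

# Too few negatives: the boundary point -1.0 was never sampled.
neg_sparse = [-3.0]
missing_boundary = max_margin_threshold(neg_sparse, pos)

print(baseline, more_positives, missing_boundary)
```

Adding a thousand positive samples leaves the threshold where it was, while losing the negative boundary point moves it towards the negative class.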
2. The linearly non-separable case
The source data and the ideal hyperplane are as follows:
Because the negative class is so small, a situation like the following is likely to appear, and the hyperplane is again biased towards the negative class.
Solutions to imbalances:
"SVM is not very sensitive to imbalance itself."
"The hyperplane of the SVM is only related to the support vector, so it is not important how much data is in the hyperplane of the decision."
1. Oversampling (random sampling)
2. Undersampling of the majority class, focused on its boundary samples: a sampling method that keeps samples which both represent the distribution of the majority class and have some effect on the classification boundary
3. Improving the algorithm itself (cost-sensitive learning)
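The two resampling options can be sketched in a few lines of plain Python. This is a minimal illustration, not the method of any particular library: the function names are hypothetical, and the undersampler below drops majority samples purely at random, which is simpler than the boundary-aware undersampling described in item 2.

```python
import random


def _group_by_class(X, y):
    """Collect the rows of X under their class label."""
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    return by_class


def random_oversample(X, y, seed=0):
    """Duplicate random samples of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    by_class = _group_by_class(X, y)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        X_out.extend(rows + extra)
        y_out.extend([label] * target)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of larger classes down to the smallest class size."""
    rng = random.Random(seed)
    by_class = _group_by_class(X, y)
    target = min(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        X_out.extend(rng.sample(rows, target))
        y_out.extend([label] * target)
    return X_out, y_out


# Toy data (made up): 10 positive samples, 2 negative samples.
X = [[float(i), 0.0] for i in range(12)]
y = [1] * 10 + [-1] * 2

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print(len(y_over), len(y_under))
```

Oversampling keeps every original sample but repeats minority points, which is where the overfitting risk comes from; undersampling discards majority data instead.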
Reference Blog
1. Give the positive and negative examples different C values. For example, if the positive examples are far fewer than the negative examples, use a larger C for the positive examples. The disadvantage of this method is that it may deviate from the probability distribution of the original data;
2. Preprocess the training set: resample it according to some strategy, either increasing the number of minority-class samples or reducing the number of majority-class samples. A typical method is random insertion, whose disadvantage is that overfitting may occur. A better one is the Synthetic Minority Over-sampling Technique (SMOTE), whose disadvantage is that it only works in a specific feature space and is not suitable for problems that cannot be represented by feature vectors. Of course, adding samples also means that the training time may increase;
3. Kernel-based methods for handling imbalanced data.
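The first item, giving each class its own C, amounts to weighting the slack penalty per class in the soft-margin objective. The sketch below only evaluates that objective for a fixed hyperplane; it does not train anything. The function name and the parameters `C_pos` and `C_neg` are hypothetical, and the data are made up, with the negative class as the minority to match the assumption at the top of this post.

```python
def weighted_hinge_objective(w, b, X, y, C_pos, C_neg):
    """Soft-margin SVM objective with a separate C per class.

    Computes 0.5 * ||w||^2 + sum_i C_{y_i} * max(0, 1 - y_i * (w . x_i + b)),
    where labels are +1 or -1.
    """
    reg = 0.5 * sum(wj * wj for wj in w)
    loss = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        C = C_pos if yi == 1 else C_neg
        loss += C * max(0.0, 1.0 - margin)
    return reg + loss


# One point of each class lies inside the margin of the hyperplane x1 = 0.
X = [[2.0, 0.0], [0.5, 0.0], [-0.5, 0.0]]
y = [1, 1, -1]
w, b = [1.0, 0.0], 0.0

equal_cost = weighted_hinge_objective(w, b, X, y, C_pos=1.0, C_neg=1.0)
minority_cost = weighted_hinge_objective(w, b, X, y, C_pos=1.0, C_neg=10.0)
print(equal_cost, minority_cost)
```

With the larger `C_neg`, the same margin violation by the minority point costs ten times as much, so an optimizer would move the hyperplane away from the minority class to reduce it. In scikit-learn, the `class_weight` parameter of `SVC` scales C per class in a similar way.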