1. Solving sample imbalance with oversampling and undersampling
1. Oversampling: Also called upsampling (over-sampling), this method balances the classes by increasing the number of minority-class samples. The most straightforward approach is to duplicate minority-class samples. For example, if the positive-to-negative ratio is 1:10, making 9 additional copies of each positive sample brings the ratio to 1:1. The disadvantage is that duplicates add no new information, so when the samples have few features this may lead to overfitting. Improved oversampling methods instead generate new synthetic minority-class samples by adding random noise or perturbations, or by following specific rules, as in the SMOTE algorithm.
2. Undersampling: Also called downsampling (under-sampling), this method balances the classes by reducing the number of majority-class samples. The most straightforward approach is to randomly remove some majority-class samples. The disadvantage is that important information contained in the discarded majority-class samples is lost.
In summary, oversampling and undersampling are better suited to large datasets with an imbalanced class distribution, and the first (oversampling) is the more widely used of the two.
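The two random resampling strategies described above can be sketched in plain Python. This is only a minimal illustration under stated assumptions: the function names `random_oversample` and `random_undersample` and the toy data are hypothetical, and a synthetic method such as SMOTE would replace the simple duplication step.

```python
import random
from collections import defaultdict


def _group_by_label(samples, labels):
    """Map each label to the list of samples carrying it."""
    groups = defaultdict(list)
    for sample, label in zip(samples, labels):
        groups[label].append(sample)
    return groups


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples (drawn with replacement) until
    every class is as large as the largest class."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = max(len(group) for group in groups.values())
    out_samples, out_labels = [], []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_samples.extend(group + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


def random_undersample(samples, labels, seed=0):
    """Randomly drop majority-class samples until every class is as
    small as the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = min(len(group) for group in groups.values())
    out_samples, out_labels = [], []
    for label, group in groups.items():
        kept = rng.sample(group, target)  # sampling without replacement
        out_samples.extend(kept)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Toy data with a 1:10 positive-to-negative ratio (hypothetical values).
X = [[float(i)] for i in range(22)]
y = [1] * 2 + [0] * 20

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
```

Resampling is normally applied to the training split only, so that duplicated or discarded samples do not distort the evaluation data.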
2. Solving sample imbalance with penalty weights for positive and negative samples
3. Solving sample imbalance with ensemble methods
4. Solving sample imbalance with feature selection
The last three methods are not covered in detail here; see the reference below for the specific steps.
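For a rough sense of method 2 (penalty weights), the sketch below gives each class a weight inversely proportional to its frequency, so that mistakes on the rare class cost more. The helper names `balanced_class_weights` and `weighted_error` are hypothetical; the weighting formula `n_samples / (n_classes * class_count)` is the same heuristic scikit-learn uses for `class_weight="balanced"`, but this is an illustration and not the procedure from the reference.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


def weighted_error(y_true, y_pred, weights):
    """Misclassification rate where each sample counts with the weight
    of its true class."""
    total = sum(weights[t] for t in y_true)
    wrong = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    return wrong / total


# Toy labels with a 1:10 positive-to-negative ratio (hypothetical values).
y = [1] * 10 + [0] * 100
weights = balanced_class_weights(y)

# A classifier that always predicts the majority class.
always_negative = [0] * len(y)
err = weighted_error(y, always_negative, weights)
```

With these weights the always-negative classifier has a weighted error of about 0.5, compared with an unweighted error of 10/110, which makes the cost of ignoring the minority class visible.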
Reference: https://www.zhihu.com/question/56662976