Original: http://blog.csdn.net/hero_fantao/article/details/35784773
Imbalanced learning methods
The problem of class imbalance in machine learning broadly falls into two cases:
(1) The class proportions are unbalanced, but each class still has a sufficient number of samples;
(2) The minority class has very few samples in absolute terms.
The second case is not our focus here. With too few samples, the covered region of the sample space is tiny; even if the features are adequate, such data offers little for a model to learn from. For this case, the best remedy is to collect as many additional minority samples as possible to cover the sample space.
This post focuses on the first case.
One: Sampling methods
1. Random oversampling:
When the classes are unbalanced, randomly resample the minority class (with replacement) until balance is reached. This method merely copies existing minority samples, so it overfits easily: in a decision tree, for example, a leaf may end up containing nothing but copies of a single sample. The resulting model lacks generality; training accuracy may improve, but predictions on unseen test samples can be very poor.
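A minimal sketch of the idea, assuming a binary label array (the function name and toy data are illustrative, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X, y, minority_label=1):
    """Duplicate randomly chosen minority samples until both classes are equal in size."""
    X, y = np.asarray(X), np.asarray(y)
    min_idx = np.where(y == minority_label)[0]
    maj_idx = np.where(y != minority_label)[0]
    n_extra = len(maj_idx) - len(min_idx)
    # sample minority indices with replacement -- plain copies, hence the overfitting risk
    extra = rng.choice(min_idx, size=n_extra, replace=True)
    return np.vstack([X, X[extra]]), np.hstack([y, y[extra]])

# toy data: 7 majority samples vs. 1 minority sample
X = np.array([[0.0], [0.1], [0.2], [0.9], [1.0], [1.1], [1.2], [5.0]])
y = np.array([0, 0, 0, 0, 0, 0, 0, 1])
Xb, yb = random_oversample(X, y)
```

After the call, both classes contain seven samples, but the six new minority rows are exact duplicates of the single original one.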
2. Random undersampling:
When the classes are unbalanced, randomly undersample the majority class, i.e. keep only a subset of the majority samples to reach balance. The drawback is that discarding samples may lose important regions of the sample space and reduce accuracy.
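The complementary sketch, again with illustrative names and data not taken from the original post:

```python
import numpy as np

def random_undersample(X, y, minority_label=1, seed=0):
    """Keep all minority samples and an equally sized random subset of the majority class."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    min_idx = np.where(y == minority_label)[0]
    maj_idx = np.where(y != minority_label)[0]
    # drop majority samples without replacement -- discarded rows may carry useful information
    keep = rng.choice(maj_idx, size=len(min_idx), replace=False)
    sel = np.sort(np.hstack([min_idx, keep]))
    return X[sel], y[sel]

# toy data: 6 majority samples vs. 2 minority samples
X = np.array([[0.0], [0.1], [0.2], [0.9], [1.0], [1.1], [5.0], [5.1]])
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])
Xu, yu = random_undersample(X, y)
```

The result is balanced, but four of the six majority rows are gone, which is exactly the information loss the text warns about.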
3. Synthetic sampling with data generation (SMOTE):
Generate approximate new samples for the minority class: compute the K nearest neighbours of each minority sample, and create a new sample along the line between the current sample and one of those neighbours.
This breaks away from the simple duplication of resampling: by creating genuinely new minority samples it enriches the minority region of the sample space and compensates for its sparsity. The drawback is that it applies the same KNN-based generation to every minority sample. For a minority sample that is already well separated from the majority class, generating extra samples around it adds little value.
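The generation step described above can be sketched as follows; this is a minimal SMOTE-style implementation (the function name and toy data are illustrative), interpolating each minority sample toward a random one of its k nearest minority neighbours:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by linear interpolation
    between each minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # k + 1 because each point's nearest neighbour is the point itself
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neigh = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))          # pick a minority sample
        j = rng.choice(neigh[i])              # pick one of its neighbours
        # new sample lies on the segment between x_i and its neighbour
        out.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.asarray(out)

# toy minority cluster inside the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
synth = smote(X_min, n_new=10, k=3)
```

Because each synthetic point is a convex combination of two minority samples, all generated points stay inside the convex hull of the minority cluster.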
4. Adaptive synthetic sampling (ADASYN)
Adaptive synthetic sampling corrects the above shortcoming: it generates more synthetic samples around the minority samples that lie close to, and are easily confused with, the majority class.
The procedure is summarized below; full source code is attached at the end of this post.
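The detailed steps did not survive in this copy of the post, but they can be reconstructed from the attached source code. With $m_l$, $m_s$ the majority and minority class sizes, $K$ the neighbourhood size, and $\beta$ the balance parameter:

```latex
G = (m_l - m_s)\,\beta, \qquad
r_i = \frac{\Delta_i}{K}, \qquad
\hat{r}_i = \frac{r_i}{\sum_j r_j}, \qquad
g_i = \mathrm{round}\!\left(\hat{r}_i \, G\right)
```

Here $\Delta_i$ is the number of majority-class points among the $K$ nearest neighbours of minority sample $x_i$, and $g_i$ is the number of synthetic samples generated around $x_i$, each of the form $s = x_i + \lambda\,(x_z - x_i)$ with $\lambda \sim U[0,1]$ and $x_z$ a random minority neighbour of $x_i$. Minority samples surrounded by majority points get a larger $\hat{r}_i$ and therefore more synthetic samples, which is exactly the "adaptive" part.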
Two: Cost-sensitive learning methods
The first family of approaches balances the samples themselves before training a model. The alternative is to assign different misclassification costs to different classes, for example a higher cost for misclassifying a minority sample. In my view this is roughly equivalent to resampling: it trades one kind of error for another. A better approach, I think, is this: when positive and negative samples are unbalanced, repeatedly select a subset of the majority class together with all of the minority samples, so that each subset is roughly balanced, and train one model on each. The resulting models then vote on the final prediction; following AdaBoost, each model can also be given a weight. In practice, even simple unweighted voting already achieves a good effect.
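The ensemble procedure described above can be sketched as follows; this is a minimal version with unweighted voting (logistic regression as the base model, and all names, are illustrative choices, not from the original post):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def balanced_vote_ensemble(X, y, n_models=5, minority_label=1, seed=0):
    """Train n_models classifiers, each on all minority samples plus an
    equally sized random subset of the majority class."""
    rng = np.random.default_rng(seed)
    min_idx = np.where(y == minority_label)[0]
    maj_idx = np.where(y != minority_label)[0]
    models = []
    for _ in range(n_models):
        sub = rng.choice(maj_idx, size=len(min_idx), replace=False)
        sel = np.hstack([min_idx, sub])          # balanced training subset
        models.append(LogisticRegression().fit(X[sel], y[sel]))
    return models

def vote_predict(models, X):
    """Simple unweighted majority vote; AdaBoost-style weights could replace the mean."""
    votes = np.stack([m.predict(X) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)

# toy imbalanced data: 20 majority points near 0, 4 minority points near 5
X = np.vstack([np.zeros((20, 1)), np.full((4, 1), 5.0)])
y = np.array([0] * 20 + [1] * 4)
models = balanced_vote_ensemble(X, y)
preds = vote_predict(models, np.array([[0.0], [5.0]]))
```

Every base model sees all of the scarce minority data but only a slice of the majority data, so across the ensemble most majority samples are still used, addressing the information-loss drawback of plain undersampling.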
References:
[1] He H, Garcia E A. Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering, 2009, 21(9): 1263-1284.
[2] https://github.com/fmfn/UnbalancedDataset (a module shared by @phunter_lau, 2014/12/07)
Appendix: adaptive synthetic sampling (ADASYN) source code:
```python
'''
Created on 2014/03/09
@author: Dylan
'''
from sklearn.neighbors import NearestNeighbors
import numpy as np
import random


def get_class_count(y, minority_class_label=1):
    minority_count = len(np.where(y == minority_class_label)[0])
    majority_count = len(np.where(y == (1 - minority_class_label))[0])
    return majority_count, minority_count


# @param ml: the number of samples in the majority class
# @param ms: the number of samples in the minority class
# @return: G, the total number of synthetic samples to generate; tunable via beta
def get_g(ml, ms, beta):
    return (ml - ms) * beta


# @param X: the data points, e.g. [f1, f2, ..., fn]
# @param y: the class labels, e.g. [0, 1, 1, 1, 0, ...]
# @param indices_minority: indices of the minority-class samples
# @param minority_class_label: the label of the minority class
# @param k: the number of neighbours for kNN
# @return: list of normalized r values, one per minority sample
def get_ris(X, y, indices_minority, minority_class_label, k):
    y_min = np.array(y)[indices_minority]
    x_min = np.array(X)[indices_minority]
    neigh = NearestNeighbors(n_neighbors=k)
    neigh.fit(X)
    r_list = [0] * len(y_min)
    normalized_r_list = [0] * len(y_min)
    for i in range(len(y_min)):
        # indices of the k nearest neighbours of x_min[i] within the full data set
        indices = neigh.kneighbors(x_min[i].reshape(1, -1), k, False)[0]
        # fraction of those neighbours belonging to the majority class
        r_list[i] = len(np.where(y[indices] == (1 - minority_class_label))[0]) / (k + 0.0)
    norm_const = sum(r_list)
    for j in range(len(r_list)):
        normalized_r_list[j] = r_list[j] / norm_const
    return normalized_r_list


def get_indices_minority(y, minority_class_label=1):
    y_new = np.asarray([1 if label == 1 else 0 for label in y])
    indices_minority = np.where(y_new == minority_class_label)[0]
    return indices_minority, y_new


def generate_samples(X, y, minority_class_label=1, k=5, beta=0.3):
    synthetic_x = []
    synthetic_y = []
    indices_minority, y_new = get_indices_minority(y)
    y_min = y[indices_minority]
    x_min = X[indices_minority]
    r_list = get_ris(X, y_new, indices_minority, minority_class_label, k)
    ml, ms = get_class_count(y_new)
    G = get_g(ml, ms, beta=beta)
    neigh = NearestNeighbors(n_neighbors=k)
    neigh.fit(x_min)
    for i in range(len(y_min)):
        # number of synthetic samples to generate around this minority sample
        g = int(np.round(r_list[i] * G))
        neighb_indx = neigh.kneighbors(x_min[i].reshape(1, -1), k, False)[0]
        for _ in range(g):
            ind = random.choice(neighb_indx)
            # interpolate between the sample and a random minority neighbour
            s = x_min[i] + (x_min[ind] - x_min[i]) * random.random()
            synthetic_x.append(s)
            synthetic_y.append(y_min[i])
    print('ADASYN, raw X size:', X.shape)
    X = np.vstack((X, np.asarray(synthetic_x)))
    y = np.hstack((y, synthetic_y))
    print('ADASYN, post X size:', X.shape)
    return X, y
```