Original: http://blog.csdn.net/hero_fantao/article/details/35784773
Imbalanced learning methods
The problem of class imbalance in machine learning broadly falls into two cases:
(1) The class proportions are unbalanced, but each class still has a sufficient number of samples;
(2) The minority class has very few samples in absolute terms.
The second case is not our focus here. With too few samples, the covered region of the sample space is tiny; even if the features are adequate, such data offers little for a model to learn from. For this case, the best remedy is to collect as many additional minority samples as possible to cover the sample space.
This post focuses on the first case.
One: Sampling methods
1. Random oversampling:
When the classes are unbalanced, randomly resample the minority class (with replacement) until balance is reached. This method merely copies existing minority samples, so it overfits easily: in a decision tree, for example, a leaf may end up containing nothing but copies of a single sample. The resulting model lacks generality; training accuracy may improve, but predictions on unseen test samples can be very poor.
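A minimal sketch of the idea, assuming a binary label array (the function name and toy data are illustrative, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X, y, minority_label=1):
    """Duplicate randomly chosen minority samples until both classes are equal in size."""
    X, y = np.asarray(X), np.asarray(y)
    min_idx = np.where(y == minority_label)[0]
    maj_idx = np.where(y != minority_label)[0]
    n_extra = len(maj_idx) - len(min_idx)
    # sample minority indices with replacement -- plain copies, hence the overfitting risk
    extra = rng.choice(min_idx, size=n_extra, replace=True)
    return np.vstack([X, X[extra]]), np.hstack([y, y[extra]])

# toy data: 7 majority samples vs. 1 minority sample
X = np.array([[0.0], [0.1], [0.2], [0.9], [1.0], [1.1], [1.2], [5.0]])
y = np.array([0, 0, 0, 0, 0, 0, 0, 1])
Xb, yb = random_oversample(X, y)
```

After the call, both classes contain seven samples, but the six new minority rows are exact duplicates of the single original one.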
2. Random undersampling:
When the classes are unbalanced, randomly undersample the majority class, i.e. keep only a subset of the majority samples to reach balance. The drawback is that discarding samples may lose important regions of the sample space and reduce accuracy.
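The complementary sketch, again with illustrative names and data not taken from the original post:

```python
import numpy as np

def random_undersample(X, y, minority_label=1, seed=0):
    """Keep all minority samples and an equally sized random subset of the majority class."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    min_idx = np.where(y == minority_label)[0]
    maj_idx = np.where(y != minority_label)[0]
    # drop majority samples without replacement -- discarded rows may carry useful information
    keep = rng.choice(maj_idx, size=len(min_idx), replace=False)
    sel = np.sort(np.hstack([min_idx, keep]))
    return X[sel], y[sel]

# toy data: 6 majority samples vs. 2 minority samples
X = np.array([[0.0], [0.1], [0.2], [0.9], [1.0], [1.1], [5.0], [5.1]])
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])
Xu, yu = random_undersample(X, y)
```

The result is balanced, but four of the six majority rows are gone, which is exactly the information loss the text warns about.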
3. Synthetic sampling with data generation (SMOTE):
Generate approximate new samples for the minority class: compute the K nearest neighbours of each minority sample, and create a new sample along the line between the current sample and one of those neighbours.
This breaks away from the simple duplication of resampling: by creating genuinely new minority samples it enriches the minority region of the sample space and compensates for its sparsity. The drawback is that it applies the same KNN-based generation to every minority sample. For a minority sample that is already well separated from the majority class, generating extra samples around it adds little value.
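The generation step described above can be sketched as follows; this is a minimal SMOTE-style implementation (the function name and toy data are illustrative), interpolating each minority sample toward a random one of its k nearest minority neighbours:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by linear interpolation
    between each minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # k + 1 because each point's nearest neighbour is the point itself
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neigh = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))          # pick a minority sample
        j = rng.choice(neigh[i])              # pick one of its neighbours
        # new sample lies on the segment between x_i and its neighbour
        out.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.asarray(out)

# toy minority cluster inside the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
synth = smote(X_min, n_new=10, k=3)
```

Because each synthetic point is a convex combination of two minority samples, all generated points stay inside the convex hull of the minority cluster.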
4. Adaptive synthetic sampling (ADASYN)
Adaptive synthetic sampling corrects the above shortcoming: it generates more synthetic samples around the minority samples that lie close to, and are easily confused with, the majority class.
The procedure is summarized below; full source code is attached at the end of this post.
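The detailed steps did not survive in this copy of the post, but they can be reconstructed from the attached source code. With $m_l$, $m_s$ the majority and minority class sizes, $K$ the neighbourhood size, and $\beta$ the balance parameter:

```latex
G = (m_l - m_s)\,\beta, \qquad
r_i = \frac{\Delta_i}{K}, \qquad
\hat{r}_i = \frac{r_i}{\sum_j r_j}, \qquad
g_i = \mathrm{round}\!\left(\hat{r}_i \, G\right)
```

Here $\Delta_i$ is the number of majority-class points among the $K$ nearest neighbours of minority sample $x_i$, and $g_i$ is the number of synthetic samples generated around $x_i$, each of the form $s = x_i + \lambda\,(x_z - x_i)$ with $\lambda \sim U[0,1]$ and $x_z$ a random minority neighbour of $x_i$. Minority samples surrounded by majority points get a larger $\hat{r}_i$ and therefore more synthetic samples, which is exactly the "adaptive" part.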
Two: Cost-sensitive learning methods
The first family of approaches balances the samples themselves before training a model. The alternative is to assign different misclassification costs to different classes, for example a higher cost for misclassifying a minority sample. In my view this is roughly equivalent to resampling: it trades one kind of error for another. A better approach, I think, is this: when positive and negative samples are unbalanced, repeatedly select a subset of the majority class together with all of the minority samples, so that each subset is roughly balanced, and train one model on each. The resulting models then vote on the final prediction; following AdaBoost, each model can also be given a weight. In practice, even simple unweighted voting already achieves a good effect.
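The ensemble procedure described above can be sketched as follows; this is a minimal version with unweighted voting (logistic regression as the base model, and all names, are illustrative choices, not from the original post):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def balanced_vote_ensemble(X, y, n_models=5, minority_label=1, seed=0):
    """Train n_models classifiers, each on all minority samples plus an
    equally sized random subset of the majority class."""
    rng = np.random.default_rng(seed)
    min_idx = np.where(y == minority_label)[0]
    maj_idx = np.where(y != minority_label)[0]
    models = []
    for _ in range(n_models):
        sub = rng.choice(maj_idx, size=len(min_idx), replace=False)
        sel = np.hstack([min_idx, sub])          # balanced training subset
        models.append(LogisticRegression().fit(X[sel], y[sel]))
    return models

def vote_predict(models, X):
    """Simple unweighted majority vote; AdaBoost-style weights could replace the mean."""
    votes = np.stack([m.predict(X) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)

# toy imbalanced data: 20 majority points near 0, 4 minority points near 5
X = np.vstack([np.zeros((20, 1)), np.full((4, 1), 5.0)])
y = np.array([0] * 20 + [1] * 4)
models = balanced_vote_ensemble(X, y)
preds = vote_predict(models, np.array([[0.0], [5.0]]))
```

Every base model sees all of the scarce minority data but only a slice of the majority data, so across the ensemble most majority samples are still used, addressing the information-loss drawback of plain undersampling.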
References:
[1] He H, Garcia E A. Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering, 2009, 21(9): 1263-1284.
[2] https://github.com/fmfn/UnbalancedDataset (a module shared by @phunter_lau, 2014/12/07)
Appendix: adaptive synthetic sampling (ADASYN) source code:
```python
'''
Created on 2014/03/09
@author: Dylan
'''
from sklearn.neighbors import NearestNeighbors
import numpy as np
import random


def get_class_count(y, minority_class_label=1):
    minority_count = len(np.where(y == minority_class_label)[0])
    majority_count = len(np.where(y == (1 - minority_class_label))[0])
    return majority_count, minority_count


# @param ml: the number of samples in the majority class
# @param ms: the number of samples in the minority class
# @return: G, the total number of synthetic samples to generate; tunable via beta
def get_g(ml, ms, beta):
    return (ml - ms) * beta


# @param X: the data points, e.g. [f1, f2, ..., fn]
# @param y: the class labels, e.g. [0, 1, 1, 1, 0, ...]
# @param indices_minority: indices of the minority-class samples
# @param minority_class_label: the label of the minority class
# @param k: the number of neighbours for kNN
# @return: list of normalized r values, one per minority sample
def get_ris(X, y, indices_minority, minority_class_label, k):
    y_min = np.array(y)[indices_minority]
    x_min = np.array(X)[indices_minority]
    neigh = NearestNeighbors(n_neighbors=k)
    neigh.fit(X)
    r_list = [0] * len(y_min)
    normalized_r_list = [0] * len(y_min)
    for i in range(len(y_min)):
        # indices of the k nearest neighbours of x_min[i] within the full data set
        indices = neigh.kneighbors(x_min[i].reshape(1, -1), k, False)[0]
        # fraction of those neighbours belonging to the majority class
        r_list[i] = len(np.where(y[indices] == (1 - minority_class_label))[0]) / (k + 0.0)
    norm_const = sum(r_list)
    for j in range(len(r_list)):
        normalized_r_list[j] = r_list[j] / norm_const
    return normalized_r_list


def get_indices_minority(y, minority_class_label=1):
    y_new = np.asarray([1 if label == 1 else 0 for label in y])
    indices_minority = np.where(y_new == minority_class_label)[0]
    return indices_minority, y_new


def generate_samples(X, y, minority_class_label=1, k=5, beta=0.3):
    synthetic_x = []
    synthetic_y = []
    indices_minority, y_new = get_indices_minority(y)
    y_min = y[indices_minority]
    x_min = X[indices_minority]
    r_list = get_ris(X, y_new, indices_minority, minority_class_label, k)
    ml, ms = get_class_count(y_new)
    G = get_g(ml, ms, beta=beta)
    neigh = NearestNeighbors(n_neighbors=k)
    neigh.fit(x_min)
    for i in range(len(y_min)):
        # number of synthetic samples to generate around this minority sample
        g = int(np.round(r_list[i] * G))
        neighb_indx = neigh.kneighbors(x_min[i].reshape(1, -1), k, False)[0]
        for _ in range(g):
            ind = random.choice(neighb_indx)
            # interpolate between the sample and a random minority neighbour
            s = x_min[i] + (x_min[ind] - x_min[i]) * random.random()
            synthetic_x.append(s)
            synthetic_y.append(y_min[i])
    print('ADASYN, raw X size:', X.shape)
    X = np.vstack((X, np.asarray(synthetic_x)))
    y = np.hstack((y, synthetic_y))
    print('ADASYN, post X size:', X.shape)
    return X, y
```