Theory and practical summary of unbalanced learning methods


Original: http://blog.csdn.net/hero_fantao/article/details/35784773

Unbalanced Learning methods

The problem of class imbalance in machine learning can be broadly divided into two cases:

(1) The class proportions are unbalanced, but every class still has a sufficient number of samples;

(2) Some classes simply have too few samples.

The second case is not really our focus: when a class has too few samples it covers only a tiny part of the sample space, and even with good features such data is of limited value for model learning. For that problem, the best remedy is to collect as many additional minority-class samples as possible so that they cover more of the sample space.

The first case is what this post mainly discusses.

One: Sampling methods

1. Random oversampling:

When the classes are unbalanced, randomly resample the minority class (with replacement) until the class counts are balanced. Because this method merely copies existing minority samples, it easily over-fits: in a decision tree, for example, a leaf node may end up holding nothing but copies of one sample. The extra copies add no new information and no generalization ability, so training accuracy may improve while prediction on unseen test samples can be very poor.
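As a rough illustration (my own sketch, not from the original post; the function name and arguments are assumptions), random oversampling of a minority class labeled 1 can be written with NumPy like this:

    import numpy as np

    def random_oversample(X, y, minority_label=1, random_state=0):
        # Duplicate minority samples (with replacement) until both classes have the same count.
        rng = np.random.RandomState(random_state)
        min_idx = np.where(y == minority_label)[0]
        maj_idx = np.where(y != minority_label)[0]
        extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
        idx = np.concatenate([maj_idx, min_idx, extra])
        return X[idx], y[idx]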

2. Random under-sampling:

When the classes are unbalanced, randomly under-sample the majority class, i.e. keep only a subset of the majority samples so that the class counts are balanced. The problem with under-sampling is that discarding samples may remove important regions of the sample space, which can reduce accuracy.
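A minimal sketch of random under-sampling, under the same assumptions (my own illustration; NumPy arrays, minority class labeled 1):

    import numpy as np

    def random_undersample(X, y, minority_label=1, random_state=0):
        # Keep all minority samples and an equally sized random subset of the majority class.
        rng = np.random.RandomState(random_state)
        min_idx = np.where(y == minority_label)[0]
        maj_idx = np.where(y != minority_label)[0]
        kept_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
        idx = np.concatenate([min_idx, kept_maj])
        return X[idx], y[idx]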

3. Synthetic sampling with Data Generation

Generate approximate new samples for the minority class: for each minority sample, compute its K nearest minority neighbors, and create a new sample by interpolating between the current sample and one of those neighbors (this is the idea behind SMOTE).

This method goes beyond simple repeated sampling: by creating genuinely new minority samples it enriches the minority sample space and compensates for its sparseness. The drawback is that it computes the KNN in the same way for every minority sample. Imagine a minority sample that is already clearly separated from the majority class; generating extra samples around it is of little value.
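A minimal sketch of the interpolation idea (my own illustration, not a full SMOTE implementation; the helper name, n_new, and the other parameters are assumptions):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote_like(X, y, minority_label=1, k=5, n_new=100, random_state=0):
        # For each new sample: pick a random minority point, pick one of its k nearest
        # minority neighbours, and interpolate at a random position between the two.
        rng = np.random.RandomState(random_state)
        X_min = X[y == minority_label]
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: a point is its own nearest neighbour
        neigh = nn.kneighbors(X_min, return_distance=False)[:, 1:]
        new_rows = []
        for _ in range(n_new):
            i = rng.randint(len(X_min))
            j = rng.choice(neigh[i])
            new_rows.append(X_min[i] + rng.rand() * (X_min[j] - X_min[i]))
        X_new = np.vstack([X, new_rows])
        y_new = np.concatenate([y, np.full(n_new, minority_label)])
        return X_new, y_new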

4. Adaptive Synthetic Sampling

Adaptive synthetic sampling (ADASYN) is a refinement of the previous method: it generates more synthetic samples around the minority samples that lie close to the majority class, i.e. the ones that are hard to learn.

Here is how it works: first decide the total number of samples to generate, G = (ml - ms) * beta, where ml and ms are the majority and minority counts and beta controls the final balance. For each minority sample i, compute r_i, the fraction of its K nearest neighbors that belong to the majority class, normalize the r_i so they sum to 1, and then generate round(r_i * G) new samples around sample i by interpolating with its minority-class neighbors. The full source code is attached at the end of this post.

Two: Cost-sensitive learning method

One approach, from the sampling point of view described above, is to balance the samples first and then train the model as usual. The other is cost-sensitive learning: assign different misclassification costs to different classes, e.g. make misclassifying a minority sample much more expensive than misclassifying a majority sample. Personally, I feel this achieves roughly the same effect as resampling; both sacrifice one thing for another. A better approach, in my opinion, is this: when the positive and negative classes are unbalanced, repeatedly take a portion of the majority class together with all of the minority samples so that each subset is roughly balanced, and train one model on each subset. These models then vote to produce the final prediction, and, in the spirit of AdaBoost, each model can also be weighted. In practice, plain unweighted voting can already achieve quite a good effect, as sketched below.
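A minimal sketch of this repeated-undersampling-plus-voting scheme (my own illustration, not the author's code; the decision-tree base learner and the 0/1 label convention are assumptions):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def balanced_voting_ensemble(X, y, n_models=10, minority_label=1, random_state=0):
        # Train one model per balanced subset: all minority samples plus a random majority subset.
        rng = np.random.RandomState(random_state)
        min_idx = np.where(y == minority_label)[0]
        maj_idx = np.where(y != minority_label)[0]
        models = []
        for _ in range(n_models):
            maj_sub = rng.choice(maj_idx, size=len(min_idx), replace=False)
            idx = np.concatenate([min_idx, maj_sub])
            models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
        return models

    def vote_predict(models, X):
        # Unweighted majority vote; per-model weights could be added in the spirit of AdaBoost.
        votes = np.array([m.predict(X) for m in models])
        return (votes.mean(axis=0) >= 0.5).astype(int)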

Reference documents:

[1] He H, Garcia E A. Learning from Imbalanced Data [J]. IEEE Transactions on Knowledge and Data Engineering, 2009, 21(9): 1263-1284.

[2] https://github.com/fmfn/UnbalancedDataset (a module shared on 2014/12/07 by @phunter_lau)

Attached below is the Adaptive Synthetic Sampling source code:

    1. "' "
    2. Created on 2014/03/09
    3. @author: Dylan
    4. ‘‘‘
    5. From sklearn.neighbors import nearestneighbors
    6. Import NumPy as NP
    7. Import Random
    8. def get_class_count (y, Minorityclasslabel = 1):
    9. Minorityclasslabel_count = Len (np.where (y = = minorityclasslabel) [0])
    10. Maxclasslabel_count = Len (np.where (y = = (1-minorityclasslabel)) [0])
    11. return Maxclasslabel_count, Minorityclasslabel_count
    12. # @param: X the datapoints e.g.: [F1, F2, ..., FN]
    13. # @param: Y the Classlabels e.g: [0,1,1,1,0,..., Cn]
    14. # @param ms:the amount of samples in the minority group
    15. # @param ml:the amount of samples in the majority group
    16. # @return: The G value, which indicates how many samples should is generated in total, this can is tuned with beta
    17. def GETG (ml, MS, beta):
    18. return (ml-ms) *beta
    19. # @param: X the datapoints e.g.: [F1, F2, ..., FN]
    20. # @param: Y the Classlabels e.g: [0,1,1,1,0,..., Cn]
    21. # @param: Minorityclass:the Minority class
    22. # @param: K: The amount of neighbours for Knn
    23. # @return: Rlist:list of R values
    24. def getris (x,y,indicesminority,minorityclasslabel,k):
    25. Ymin = Np.array (y) [indicesminority]
    26. Xmin = Np.array (X) [indicesminority]
    27. Neigh = nearestneighbors (n_neighbors= K)
    28. Neigh.fit (X)
    29. Rlist = [0]*len (ymin)
    30. Normalizedrlist = [0]*len (ymin)
    31. For i in xrange (len (ymin)):
    32. indices = Neigh.kneighbors (xmin[i],k,False) [0]
    33. # print ' y[indices] = = (1-minorityclasslabel): '
    34. # print Y[indices]
    35. # Print Len (np.where (y[indices] = = (1-minorityclasslabel)) [0])
    36. Rlist[i] = Len (Np.where (y[indices] = = ( 1-minorityclasslabel)) [0])/(K + 0.0)
    37. Normconst = SUM (rlist)
    38. For J in Xrange (len (rlist)):
    39. NORMALIZEDRLIST[J] = (rlist[j]/normconst)
    40. return normalizedrlist
    41. def get_indicesminority (y, Minorityclasslabel = 1):
    42. Y_new = []
    43. For I in range (len (y)):
    44. if y[i] = = 1:
    45. Y_new.append (1)
    46. Else:
    47. Y_new.append (0)
    48. Y_new = Np.asarray (y_new)
    49. indicesminority = Np.where (y_new = = minorityclasslabel) [0]
    50. return indicesminority, Y_new
    51. def generatesamples (X, y, Minorityclasslabel = 1, K =5,beta = 0.3):
    52. syntheticdata_x = []
    53. Syntheticdata_y = []
    54. indicesminority, y_new = get_indicesminority (y)
    55. Ymin = y[indicesminority]
    56. Xmin = x[indicesminority]
    57. Rlist = Getris (X, Y_new, indicesminority, Minorityclasslabel, K)
    58. ML, MS = Get_class_count (y_new)
    59. G = GETG (ml,ms, beta = beta)
    60. Neigh = Nearestneighbors (n_neighbors=k)
    61. Neigh.fit (Xmin)
    62. For K in xrange (len (ymin)):
    63. g = Int (Np.round (rlist[k]*g))
    64. Neighb_indx = Neigh.kneighbors (xmin[k],k,False) [0]
    65. For L in xrange (g):
    66. IND = Random.choice (NEIGHB_INDX)
    67. s = xmin[k] + (xmin[ind]-xmin[k]) * RANDOM.RANDOM ()
    68. Syntheticdata_x.append (s)
    69. Syntheticdata_y.append (Ymin[k])
    70. print ' Asyn, raw X size: ', X.shape
    71. X = Np.vstack ((X,np.asarray (syntheticdata_x)))
    72. y = Np.hstack ((y,syntheticdata_y))
    73. print ' Asyn, post X size: ', X.shape
    74. return X, y
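A small usage sketch for the attached code (my own addition; the toy dataset and the parameter values are assumptions):

    from sklearn.datasets import make_classification

    # Build a roughly 9:1 imbalanced binary dataset and rebalance it with generatesamples.
    X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0)
    X_res, y_res = generatesamples(X, y, minorityclasslabel=1, K=5, beta=0.9)
    print('minority count before:', (y == 1).sum(), 'after:', (y_res == 1).sum())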
