In real-world classification and data mining tasks, the class distribution is often highly imbalanced. Training directly on such data can hurt some models, such as logistic regression and SVM. One solution is to resample the data so that the classes are balanced. A reference implementation is provided here. Both the input and the output must use the format label + tab + features, as in LIBSVM format.
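Before resampling, it helps to measure how uneven the classes actually are. The following is a minimal sketch, assuming each line starts with the label followed by whitespace-separated features, as in LIBSVM format. The helper name class_distribution and the sample lines are illustrative and not part of the script.

```python
from collections import Counter


def class_distribution(lines):
    """Count how many samples each label has.

    Assumes every non-empty line starts with the label, followed by
    whitespace (tab or space) and the features, as in LIBSVM format.
    """
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        label = line.split(None, 1)[0]
        counts[label] += 1
    return counts


# Tiny made-up sample, not taken from heart_scale
sample = [
    "+1 1:0.5 2:-1\n",
    "-1 1:0.1 2:0.3\n",
    "-1 1:0.9 2:0.2\n",
    "-1 1:-0.4 2:0.7\n",
]
print(class_distribution(sample))
```

To inspect a real file, the same function can be called as class_distribution(open(path)), since iterating over a file object yields its lines.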
Usage:
Usage: {0} [options] dataset subclass_size [output]
options:
-s method : method of selection (default 0)
    0 -- over-sampling & under-sampling given subclass_size
    1 -- over-sampling (subclass_size: any value)
    2 -- under-sampling (subclass_size: any value)
Bash Example:
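python sampledataset.py -s 0 heart_scale subclass_size heart_scale.txt
Here subclass_size stands for the desired number of samples per class; the usage requires it between the dataset and the output file.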
Here the -s parameter selects the sampling method:
-s 0: over-sampling & under-sampling. Classes with more than subclass_size samples are down-sampled and classes with fewer are resampled up, so every class ends with subclass_size samples.
-s 1: over-sampling. Smaller classes are resampled until each has as many samples as the largest class.
-s 2: under-sampling. Larger classes are down-sampled until each has as many samples as the smallest class.
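The effect of the three methods on class sizes can be sketched as follows. This is an illustration of the rule described above, not code from the script; the function name target_class_sizes and the example counts are assumptions.

```python
def target_class_sizes(class_counts, subclass_size, method):
    """Return the number of samples each class has after resampling.

    class_counts: dict mapping label -> current number of samples
    subclass_size: requested per-class size (only used by method 0)
    method: 0 = over- and under-sampling, 1 = over-sampling, 2 = under-sampling
    """
    if method == 0:
        target = subclass_size
    elif method == 1:
        target = max(class_counts.values())
    elif method == 2:
        target = min(class_counts.values())
    else:
        raise ValueError("Unknown selection method {0}".format(method))
    return {label: target for label in class_counts}


counts = {"+1": 120, "-1": 150}  # made-up counts for illustration
for m in (0, 1, 2):
    print(m, target_class_sizes(counts, 100, m))
```

In every case all classes end up the same size; the methods differ only in which size is chosen.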
Input data file: heart_scale
Output data file: heart_scale.txt
Here is the code file, sampledataset.py:
#!/usr/bin/env python
from sklearn.datasets import load_svmlight_file
from sklearn.datasets import dump_svmlight_file
import numpy as np
from sklearn.utils import check_random_state
from scipy.sparse import hstack, vstack
import os, sys, math, random
from collections import defaultdict

if sys.version_info[0] >= 3:
    xrange = range


def exit_with_help(argv):
    print("""\
Usage: {0} [options] dataset subclass_size [output]
options:
-s method : method of selection (default 0)
    0 -- over-sampling & under-sampling given subclass_size
    1 -- over-sampling (subclass_size: any value)
    2 -- under-sampling (subclass_size: any value)

output : balance set file (optional)
If output is omitted, the subset will be printed on the screen.""".format(argv[0]))
    sys.exit(1)


def process_options(argv):
    argc = len(argv)
    if argc < 3:
        exit_with_help(argv)

    # default method is over-sampling & under-sampling
    method = 0
    balanceset_file = sys.stdout

    # parse leading options; stop at the first positional argument
    i = 1
    while i < argc:
        if argv[i][0] != "-":
            break
        if argv[i] == "-s":
            i = i + 1
            method = int(argv[i])
            if method not in [0, 1, 2]:
                print("Unknown selection method {0}".format(method))
                exit_with_help(argv)
        i = i + 1

    dataset = argv[i]
    balanceset_size = int(argv[i + 1])
    if i + 2 < argc:
        balanceset_file = open(argv[i + 2], 'w')

    return dataset, balanceset_size, method, balanceset_file


def stratified_selection(dataset, subset_size, method):
    # the label is the first whitespace-separated token of each line
    labels = [line.split(None, 1)[0] for line in open(dataset)]
    label_linenums = defaultdict(list)
    for i, label in enumerate(labels):
        label_linenums[label] += [i]

    l = len(labels)
    remaining = subset_size
    ret = []

    # classes with fewer data are sampled first
    label_list = sorted(label_linenums, key=lambda x: len(label_linenums[x]))
    min_class = label_list[0]
    maj_class = label_list[-1]
    min_class_num = len(label_linenums[min_class])
    maj_class_num = len(label_linenums[maj_class])
    # NOTE: the seed argument was lost in the source text; None falls back
    # to NumPy's global random state
    random_state = check_random_state(None)

    for label in label_list:
        linenums = label_linenums[label]
        label_size = len(linenums)
        if method == 0:
            if label_size < subset_size:
                # keep every original line, then draw the shortfall
                ret += linenums
                subnum = subset_size - label_size
            else:
                subnum = subset_size
            ret += [linenums[i] for i in random_state.randint(low=0, high=label_size, size=subnum)]
        elif method == 1:
            if label == maj_class:
                ret += linenums
                continue
            else:
                # keep every original line, then draw up to the largest class size
                ret += linenums
                subnum = maj_class_num - label_size
                ret += [linenums[i] for i in random_state.randint(low=0, high=label_size, size=subnum)]
        elif method == 2:
            if label == min_class:
                ret += linenums
                continue
            else:
                # draw as many lines as the smallest class has
                subnum = min_class_num
                ret += [linenums[i] for i in random_state.randint(low=0, high=label_size, size=subnum)]
    random.shuffle(ret)
    return ret


def main(argv=sys.argv):
    dataset, subset_size, method, subset_file = process_options(argv)
    selected_lines = []
    selected_lines = stratified_selection(dataset, subset_size, method)

    # select instances based on selected_lines
    dataset = open(dataset, 'r')
    datalist = dataset.readlines()
    for i in selected_lines:
        subset_file.write(datalist[i])
    subset_file.close()
    dataset.close()


if __name__ == '__main__':
    main(sys.argv)
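One detail worth noting: the script draws indices with randint, which samples with replacement, so an under-sampled class may contain repeated rows. If distinct rows are preferred, a variant could draw without replacement instead. The sketch below is one possible approach, not the script's method; the function undersample_without_replacement and its seed parameter are hypothetical names.

```python
import random
from collections import defaultdict


def undersample_without_replacement(labels, seed=0):
    """Return shuffled line indices with every class cut to the smallest class size.

    labels: list where labels[i] is the class label of line i
    seed: illustrative seed for reproducibility (an assumption, not from the script)
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    min_size = min(len(idx) for idx in by_label.values())
    selected = []
    for idx in by_label.values():
        # random.sample never picks the same index twice
        selected += rng.sample(idx, min_size)
    rng.shuffle(selected)
    return selected


labels = ["+1", "-1", "-1", "-1", "+1", "-1"]  # made-up labels
picked = undersample_without_replacement(labels)
print(sorted(picked))
```

The returned indices can be used in the same way as the result of stratified_selection, that is, to pick lines out of the list returned by readlines().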
In short, the script samples a balanced dataset from an imbalanced dataset and saves it.