Sample a balanced dataset from an imbalanced dataset and save it

Source: Internet
Author: User

In practical classification and data mining tasks, class sizes are often highly uneven. Training directly on such imbalanced data can hurt some models, such as logistic regression and SVM. One remedy is to resample the data so that the classes are balanced. The script below is one implementation. Input and output files hold one sample per line in the form label, tab, features (the script splits on any whitespace), as in the LIBSVM format.
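Before resampling, it can help to see how uneven the classes actually are. The following is a minimal sketch, not part of the original script: the helper name class_counts and the sample lines are made up for illustration, and it only assumes that the label is the first whitespace-separated token on each line.

```python
from collections import Counter


def class_counts(lines):
    """Count samples per label; the label is the first token on each line."""
    counts = Counter()
    for line in lines:
        parts = line.split(None, 1)
        if parts:  # skip blank lines
            counts[parts[0]] += 1
    return counts


# Hypothetical LIBSVM-style lines, used only for illustration.
sample = [
    "+1 1:0.7 2:1\n",
    "-1 1:0.5 3:-1\n",
    "-1 2:0.2\n",
    "-1 1:0.1 2:0.3\n",
]
counts = class_counts(sample)
print(dict(counts))
```

To inspect a real file, the same function can be given an open file object, since iterating over a file yields its lines.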

Usage:

Usage: {0} [options] dataset subclass_size [output]
options:
-s method : method of selection (default 0)
     0 -- over-sampling & under-sampling given subclass_size
     1 -- over-sampling (subclass_size: any value)
     2 -- under-sampling (subclass_size: any value)

Bash Example:

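python sampledataset.py -s 0 heart_scale SUBCLASS_SIZE heart_scale.txt

Replace SUBCLASS_SIZE with the target number of samples per class. The usage string requires this argument between the dataset and the output file, even for methods 1 and 2, where its value is ignored.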

Here the -s parameter selects the sampling method:

-s 0: over-sampling and under-sampling. Classes with more than subclass_size samples are sampled down, and classes with fewer are resampled up, so every class ends with subclass_size samples.

-s 1: over-sampling. Smaller classes are resampled with replacement until each has as many samples as the largest class.

-s 2: under-sampling. Larger classes are sampled down until each has as many samples as the smallest class.
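The three methods can be sketched on an in-memory list of labels. This is an illustrative sketch under stated assumptions, not the script itself: the function name balance_indices and its seed parameter are hypothetical, it uses only the standard library, and it draws without replacement when shrinking a class, whereas the script draws random indices with replacement in that case as well.

```python
import random
from collections import Counter, defaultdict


def balance_indices(labels, method, size=None, seed=0):
    """Return shuffled sample indices with the same count for every class."""
    if method not in (0, 1, 2):
        raise ValueError("method must be 0, 1 or 2")
    if method == 0 and size is None:
        raise ValueError("method 0 needs an explicit size")

    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    class_sizes = [len(idxs) for idxs in by_label.values()]
    if method == 0:
        target = size              # given size for every class
    elif method == 1:
        target = max(class_sizes)  # grow to the largest class
    else:
        target = min(class_sizes)  # shrink to the smallest class

    selected = []
    for idxs in by_label.values():
        if len(idxs) <= target:
            # Keep every sample, then draw the shortfall with replacement.
            selected += idxs
            selected += [rng.choice(idxs) for _ in range(target - len(idxs))]
        else:
            # Larger class: draw target samples without replacement
            # (an assumption of this sketch, see the note above).
            selected += rng.sample(idxs, target)
    rng.shuffle(selected)
    return selected


labels = ["+1"] * 2 + ["-1"] * 6
for method in (0, 1, 2):
    picked = balance_indices(labels, method, size=4)
    print(method, dict(Counter(labels[i] for i in picked)))
```

In every case the result is a list of line indices, which is also what the script uses to pick lines from the input file.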

Input data file: heart_scale

Output data file: heart_scale.txt

Here is the code file, sampledataset.py:

#!/usr/bin/env python
import random
import sys
from collections import defaultdict

from sklearn.utils import check_random_state


def exit_with_help(argv):
    print("""\
Usage: {0} [options] dataset subclass_size [output]
options:
-s method : method of selection (default 0)
     0 -- over-sampling & under-sampling given subclass_size
     1 -- over-sampling (subclass_size: any value)
     2 -- under-sampling (subclass_size: any value)
output : balance set file (optional)
If output is omitted, the subset will be printed on the screen.""".format(argv[0]))
    sys.exit(1)


def process_options(argv):
    argc = len(argv)
    if argc < 3:
        exit_with_help(argv)

    # default method is over-sampling & under-sampling
    method = 0
    balanceset_file = sys.stdout

    i = 1
    while i < argc:
        if argv[i][0] != "-":
            break
        if argv[i] == "-s":
            i = i + 1
            method = int(argv[i])
            if method not in [0, 1, 2]:
                print("Unknown selection method {0}".format(method))
                exit_with_help(argv)
        i = i + 1

    # both the dataset and subclass_size must follow the options
    if i + 1 >= argc:
        exit_with_help(argv)
    dataset = argv[i]
    balanceset_size = int(argv[i + 1])
    if i + 2 < argc:
        balanceset_file = open(argv[i + 2], 'w')

    return dataset, balanceset_size, method, balanceset_file


def stratified_selection(dataset, subset_size, method):
    # the label is the first whitespace-separated token of each line
    with open(dataset) as f:
        labels = [line.split(None, 1)[0] for line in f]
    label_linenums = defaultdict(list)
    for i, label in enumerate(labels):
        label_linenums[label] += [i]

    ret = []

    # classes with fewer data are sampled first
    label_list = sorted(label_linenums, key=lambda x: len(label_linenums[x]))
    min_class = label_list[0]
    maj_class = label_list[-1]
    min_class_num = len(label_linenums[min_class])
    maj_class_num = len(label_linenums[maj_class])
    # The seed argument is garbled in the source listing. None uses NumPy's
    # global random state; pass an int here for reproducible sampling.
    random_state = check_random_state(None)

    for label in label_list:
        linenums = label_linenums[label]
        label_size = len(linenums)
        if method == 0:
            if label_size < subset_size:
                # small class: keep everything, then draw the shortfall
                ret += linenums
                subnum = subset_size - label_size
            else:
                # large class: draw subset_size indices (with replacement)
                subnum = subset_size
            ret += [linenums[i] for i in random_state.randint(low=0, high=label_size, size=subnum)]
        elif method == 1:
            if label == maj_class:
                ret += linenums
                continue
            else:
                # keep everything, then resample up to the majority size
                ret += linenums
                subnum = maj_class_num - label_size
                ret += [linenums[i] for i in random_state.randint(low=0, high=label_size, size=subnum)]
        elif method == 2:
            if label == min_class:
                ret += linenums
                continue
            else:
                # draw as many indices as the minority class has samples
                subnum = min_class_num
                ret += [linenums[i] for i in random_state.randint(low=0, high=label_size, size=subnum)]
    random.shuffle(ret)
    return ret


def main(argv=sys.argv):
    dataset, subset_size, method, subset_file = process_options(argv)
    selected_lines = stratified_selection(dataset, subset_size, method)

    # select instances based on selected_lines
    with open(dataset, 'r') as f:
        datalist = f.readlines()
    for i in selected_lines:
        subset_file.write(datalist[i])
    # do not close sys.stdout when no output file was given
    if subset_file is not sys.stdout:
        subset_file.close()


if __name__ == '__main__':
    main(sys.argv)
