Sample a balanced dataset from an imbalanced dataset and save it

Last Update: 2016-12-11 Source: Internet

Author: User

In practical classification and data mining work, the classes are often very unevenly represented. Training directly on such imbalanced data can hurt some models, such as logistic regression and SVM. One solution is to balance the data by resampling, and this article provides a reference implementation. Both the input and the output file must start each line with the label, followed by a tab (or other whitespace) and the features, as in the LIBSVM format.
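To see how imbalanced a file is before resampling, the class labels can be counted. The following is a minimal sketch, assuming the label is the first whitespace-separated token on each line; the helper name class_counts and the three sample lines are hypothetical and only serve as an illustration.

```python
from collections import Counter


def class_counts(lines):
    """Count samples per label; the label is the first whitespace-separated token."""
    return Counter(line.split(None, 1)[0] for line in lines if line.strip())


# Hypothetical lines in label + tab + features form
sample = [
    "+1\t1:0.7 2:1 3:1",
    "-1\t1:0.5 2:-1 3:0.3",
    "-1\t1:0.1 2:1 3:-0.6",
]
print(class_counts(sample))
```

For a real file, the same function can be called on the open file object, for example class_counts(open("heart_scale")).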

Usage:

Usage: {0} [options] dataset subclass_size [output]
options:
-s method : method of selection (default 0)
     0 -- over-sampling & under-sampling given subclass_size
     1 -- over-sampling (subclass_size: any value)
     2 -- under-sampling (subclass_size: any value)

Bash Example:

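python sampledataset.py -s 0 heart_scale SUBCLASS_SIZE heart_scale.txt

Replace SUBCLASS_SIZE with the number of samples wanted per class. The script requires this argument for every method, although its value is ignored for -s 1 and -s 2.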

Here the -s parameter selects the sampling method:

-s 0: over-sampling & under-sampling. Classes with more than subclass_size samples are down-sampled and classes with fewer are over-sampled by resampling, so every class ends up with subclass_size samples.

-s 1: over-sampling. The smaller classes are resampled until every class has as many samples as the largest class.

-s 2: under-sampling. The larger classes are down-sampled until every class has as many samples as the smallest class.
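As a rough illustration of the three options, the sketch below computes how many samples each class would have after balancing. The function name target_sizes and the class counts are hypothetical and are not part of sampledataset.py.

```python
def target_sizes(class_counts, method, subclass_size=None):
    """Return the number of samples each class should have after balancing."""
    if method == 0:
        # every class is brought to the requested size
        if subclass_size is None:
            raise ValueError("method 0 needs subclass_size")
        target = subclass_size
    elif method == 1:
        # over-sampling: match the largest class
        target = max(class_counts.values())
    elif method == 2:
        # under-sampling: match the smallest class
        target = min(class_counts.values())
    else:
        raise ValueError("unknown method {0}".format(method))
    return {label: target for label in class_counts}


counts = {"+1": 40, "-1": 160}  # hypothetical class counts
print(target_sizes(counts, 0, 100))  # {'+1': 100, '-1': 100}
print(target_sizes(counts, 1))       # {'+1': 160, '-1': 160}
print(target_sizes(counts, 2))       # {'+1': 40, '-1': 40}
```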

Input data file: heart_scale

Output data file: heart_scale.txt

Here is the code file, sampledataset.py:

#!/usr/bin/env python
from sklearn.datasets import load_svmlight_file
from sklearn.datasets import dump_svmlight_file
import numpy as np
from sklearn.utils import check_random_state
from scipy.sparse import hstack, vstack
import os, sys, math, random
from collections import defaultdict

if sys.version_info[0] >= 3:
    xrange = range


def exit_with_help(argv):
    print("""\
Usage: {0} [options] dataset subclass_size [output]

options:
-s method : method of selection (default 0)
     0 -- over-sampling & under-sampling given subclass_size
     1 -- over-sampling (subclass_size: any value)
     2 -- under-sampling (subclass_size: any value)

output : balance set file (optional)
If output is omitted, the subset will be printed on the screen.""".format(argv[0]))
    sys.exit(1)


def process_options(argv):
    argc = len(argv)
    if argc < 3:
        exit_with_help(argv)

    # default method is over-sampling & under-sampling
    method = 0
    balanceset_file = sys.stdout

    i = 1
    while i < argc:
        if argv[i][0] != "-":
            break
        if argv[i] == "-s":
            i = i + 1
            method = int(argv[i])
            if method not in [0, 1, 2]:
                print("Unknown selection method {0}".format(method))
                exit_with_help(argv)
        i = i + 1

    # both the dataset and subclass_size are required positional arguments
    if i + 1 >= argc:
        exit_with_help(argv)
    dataset = argv[i]
    balanceset_size = int(argv[i + 1])
    if i + 2 < argc:
        balanceset_file = open(argv[i + 2], 'w')

    return dataset, balanceset_size, method, balanceset_file


def stratified_selection(dataset, subset_size, method):
    # the label is the first whitespace-separated token of each line
    with open(dataset) as f:
        labels = [line.split(None, 1)[0] for line in f]
    label_linenums = defaultdict(list)
    for i, label in enumerate(labels):
        label_linenums[label] += [i]

    ret = []

    # classes with fewer data are sampled first
    label_list = sorted(label_linenums, key=lambda x: len(label_linenums[x]))
    min_class = label_list[0]
    maj_class = label_list[-1]
    min_class_num = len(label_linenums[min_class])
    maj_class_num = len(label_linenums[maj_class])
    # The seed value is not recoverable from the source text;
    # None makes check_random_state return NumPy's global RandomState.
    random_state = check_random_state(None)

    for label in label_list:
        linenums = label_linenums[label]
        label_size = len(linenums)
        if method == 0:
            if label_size < subset_size:
                # keep every line, then draw the missing ones with replacement
                ret += linenums
                subnum = subset_size - label_size
            else:
                subnum = subset_size
            ret += [linenums[i] for i in random_state.randint(low=0, high=label_size, size=subnum)]
        elif method == 1:
            if label == maj_class:
                ret += linenums
                continue
            else:
                ret += linenums
                subnum = maj_class_num - label_size
                ret += [linenums[i] for i in random_state.randint(low=0, high=label_size, size=subnum)]
        elif method == 2:
            if label == min_class:
                ret += linenums
                continue
            else:
                subnum = min_class_num
                ret += [linenums[i] for i in random_state.randint(low=0, high=label_size, size=subnum)]
    random.shuffle(ret)
    return ret


def main(argv=sys.argv):
    dataset, subset_size, method, subset_file = process_options(argv)

    selected_lines = stratified_selection(dataset, subset_size, method)

    # select instances based on selected_lines
    dataset = open(dataset, 'r')
    datalist = dataset.readlines()
    for i in selected_lines:
        subset_file.write(datalist[i])
    subset_file.close()

    dataset.close()


if __name__ == '__main__':
    main(sys.argv)
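One detail worth noting: the script picks indices with random_state.randint, which draws with replacement, so even a class that is being shrunk can end up with duplicate lines. If duplicates are unwanted when under-sampling, one possible variant (not part of the original script) is to draw without replacement using the standard library. The function name and the line numbers below are hypothetical.

```python
import random


def undersample_without_replacement(linenums, k, seed=None):
    """Pick k distinct line numbers from linenums (requires k <= len(linenums))."""
    rng = random.Random(seed)
    return rng.sample(linenums, k)


majority_lines = list(range(100, 120))  # hypothetical line numbers of one class
picked = undersample_without_replacement(majority_lines, 5, seed=0)
print(len(picked), len(set(picked)))  # 5 5
```

Over-sampling still needs replacement, because it must return more lines than the class contains.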

This article is an English version of an article originally published in Chinese on aliyun.com.
