Use MapReduce to train the SVM model

Source: Internet
Author: User
Tags: svm

The SVM model has two important parameters: C and gamma. C is the penalty coefficient, which controls how much misclassification error is tolerated: the higher C is, the less error the model tolerates. If C is either too large or too small, the generalization ability deteriorates.
Gamma is a parameter of the RBF kernel once the RBF function is selected as the kernel. It implicitly determines the distribution of the data after it is mapped into the new feature space. The larger gamma is, the fewer support vectors there are; the smaller gamma is, the more support vectors there are. The number of support vectors affects both training and prediction speed.
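To make the roles of C and gamma concrete, here is a minimal sketch that trains one (C, gamma) pair with an RBF kernel. It is not part of the article's program; it assumes the standard Python bindings shipped with libsvm (svmutil) and a hypothetical training file train.txt in libsvm format:

from svmutil import svm_read_problem, svm_train

# train.txt is a hypothetical file in libsvm format: "<label> <index>:<value> ..."
y, x = svm_read_problem("train.txt")

# -t 2 selects the RBF kernel, -c sets C, -g sets gamma, -v 5 runs 5-fold cross-validation
accuracy = svm_train(y, x, "-t 2 -c 1 -g 0.5 -v 5")
print(accuracy)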

Grid Search

Grid search is the parameter search method used in libsvm. It is easy to understand: in the two-dimensional parameter grid formed by C and gamma, try each pair of parameters in turn and measure its effect.
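On a single machine, this amounts to a double loop over candidate values. The following sketch only illustrates the idea; the candidate ranges are commonly suggested exponential grids, not values from the article, and libsvm's svmutil is again assumed:

from svmutil import svm_read_problem, svm_train

y, x = svm_read_problem("train.txt")  # hypothetical training file in libsvm format

best_c, best_g, best_acc = None, None, 0.0
for c in [2 ** p for p in range(-5, 16, 2)]:      # candidate C values
    for g in [2 ** p for p in range(-15, 4, 2)]:  # candidate gamma values
        acc = svm_train(y, x, "-v 5 -t 2 -c %s -g %s" % (c, g))  # 5-fold CV accuracy
        if acc > best_acc:
            best_c, best_g, best_acc = c, g, acc
print("best C=%s gamma=%s accuracy=%s" % (best_c, best_g, best_acc))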

Although grid search is simple to use and looks rather naive, it has two advantages:

  1. It can find the global optimum over the grid.
  2. The (C, gamma) pairs are independent of each other, which makes parallel execution easy.
This makes it possible to use MapReduce to train the model. The basic idea of grid search in MapReduce is as follows. MapReduce is divided into two stages: map and reduce. In the map stage, the data is sliced and distributed to different machines; in the reduce stage, the <key, value> pairs with the same key are gathered together. For our current needs, the goal is to use MapReduce to distribute the SVM parameter experiments across different machines. Ideally, each machine handles one set of parameters, so that the total time spent on parameter optimization is roughly the time needed for a single set of parameters. There are two possible solutions:
  1. Upload both the parameter set and the training sample to Hadoop as local files, so that each mapper can read both files directly, select one set of parameters for training, and output the result.
  2. Upload only the parameter set to Hadoop as a local file, and use the training sample as the input. In this way, Hadoop automatically slices the training sample, and each mapper sees only part of the data. Therefore, we need to group the data by parameters in the map stage and train the SVM in the reduce stage.
Solution 1 is intuitive, but it gives no control over which parameters each mapper trains. Therefore, we chose the second solution. The procedure is as follows:
  1. Compute all the parameter pairs (C, gamma) and write them into a file, one set of parameters per row, then upload the file to Hadoop; the system automatically distributes the file to each machine (a minimal sketch of this step follows the list below).
  2. Use the training sample as the input, so that each <key, value> pair in the map stage represents one sample: the key is the system default value and the value is one row of the training sample. Read all the parameter pairs into a list.
  3. In the map stage, emit the input data as the value, with each set of parameters in the list as the key.
  4. In the reduce stage, Hadoop sorts by key and places the data with the same key on the same machine. In this stage, we therefore gather all values belonging to the same key to reconstruct the original training sample.
  5. Train with the parsed parameters and the gathered samples, and output the cross-validation result.
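For step 1, generating the parameter file can be as simple as the sketch below. The file name params matches the one used in the program later on; the candidate ranges are only illustrative assumptions:

# Write every (C, gamma) pair into the "params" file, one set of parameters per row.
with open("params", "w") as f:
    for c in [2 ** p for p in range(-5, 16, 2)]:
        for g in [2 ** p for p in range(-15, 4, 2)]:
            f.write("%s %s\n" % (c, g))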
The specific program is as follows:
#!/usr/bin/env python
# _*_ coding: utf-8 _*_
import sys
import os
sys.path.append('.')
from hstream import *
sys.path.append(os.path.join(os.path.dirname(__file__), "dependence"))
import tms_svm

'''First step: run the entire program flow through.
Second step: parameterize the input parameter file and svm_type.'''

def read_param(filename):
    '''Read all parameter sets, one per line, into a list.'''
    params = list()
    for line in open(filename):
        params.append(line.strip())
    return params

default_param_file = "./params"
svm_type = "libsvm"
params = read_param(default_param_file)

class svmtrain(hstream):
    '''The input is the data required for SVM training.'''
    # def __init__(self, param_file=default_param_file):
    #     self.parse_args()
    #     print self.default_param_file
    #     self.param_file = param_file
    #     self.params = self.read_param(self.param_file)

    def mapper(self, key, value):
        '''The mapper emits each parameter set as the key and the sample as the value.'''
        for param in params:
            self.write_output(param, value)

    def reducer(self, key, values):
        '''The reducer parses the parameters, rebuilds the training samples and runs cross-validation.'''
        prob_y = []
        prob_x = []
        line = key.split(None, 1)
        # Build the SVM training parameter string
        if sum([1 for i in line]) == 1:
            svm_param = "-v 5 -c " + str(line[0])
        elif sum([1 for i in line]) >= 2:
            svm_param = "-v 5 -c " + str(line[0]) + " -g " + str(line[1])
        # Gather all samples belonging to this key
        for value in values:
            value = value.split(None, 1)
            if len(value) == 1:
                value += ['']
            label, features = value
            xi = {}
            for e in features.split():
                ind, val = e.split(":")
                xi[int(ind)] = float(val)
            prob_y += [float(label)]
            prob_x += [xi]
        # Train with the parsed parameters and training samples
        tms_svm.set_svm_type(svm_type)
        ratio = tms_svm.train(prob_y, prob_x, svm_param)
        self.write_output(key, str(ratio))

    # def parse_args(self):
    #     parser = OptionParser(usage="")
    #     parser.add_option("-p", "--paramfile", help="param filename", dest="paramfile")
    #     options, args = parser.parse_args()
    #     if options.paramfile:
    #         self.default_param_file = options.paramfile

if __name__ == '__main__':
    svmtrain()

With the above program in place, we can get the result by running the following Hadoop streaming command.

$HADOOP jar $STREAMINGJAR -D mapred.job.name='Svm_Train' \
    -D mapred.reduce.tasks=100 \
    -files hstream.py,svm_train.py,params,${depen_path}/svm.py,${depen_path}/svmutil.py,${depen_path}/liblinear.py,${depen_path}/liblinearutil.py,${depen_path}/measure.py,${depen_path}/segment.py,${depen_path}/strnormalize.py,${depen_path}/tms_svm.py,${depen_path}/liblinear.so.64,${depen_path}/libsvm.so.64 \
    -mapper "svm_train.py -m" \
    -reducer "svm_train.py -r" \
    -input ${INPUT} \
    -output ${OUTPUT}

Summary

This article uses MapReduce to parallelize the SVM parameter optimization process, which can greatly shorten SVM training time. In addition, two points are worth noting:

  1. In the svmtrain program, the three variables default_param_file, svm_type, and svm_param should ideally be set by the user (through the parse_args function), but since the experiment failed several times with that approach, they are hard-coded here.
  2. In fact, there is another way to accelerate grid search: first perform a coarse-grained search on a subset of the training samples. If the optimum is found at (1, 0.5), then perform a fine-grained search with a smaller step size over the C interval [1 - step, 1 + step] and the gamma interval [0.5 - step, 0.5 + step]. This method speeds up the search and still finds the global optimum in most cases (a minimal sketch follows below).
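A minimal sketch of this coarse-to-fine idea follows. The file name, step size, and the cross_validate helper are hypothetical and only serve to illustrate the two passes; they are not part of the article's code:

from svmutil import svm_read_problem, svm_train

y, x = svm_read_problem("train_subset.txt")  # hypothetical subset of the training samples

def cross_validate(c, g):
    '''5-fold cross-validation accuracy for one (C, gamma) pair.'''
    return svm_train(y, x, "-v 5 -t 2 -c %s -g %s" % (c, g))

def grid_search(c_values, g_values):
    '''Return the (C, gamma) pair with the highest cross-validation accuracy.'''
    return max(((c, g) for c in c_values for g in g_values),
               key=lambda pair: cross_validate(*pair))

# Coarse pass: wide, exponentially spaced grid.
best_c, best_g = grid_search([2 ** p for p in range(-5, 16, 2)],
                             [2 ** p for p in range(-15, 4, 2)])

# Fine pass: small steps around the coarse optimum, e.g. around (1, 0.5).
step = 0.25
best_c, best_g = grid_search([best_c - step, best_c, best_c + step],
                             [best_g - step, best_g, best_g + step])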
