Use MapReduce to train the SVM model

Source: Internet
Author: User
Tags: svm

The SVM model has two important parameters: C and gamma. C is the penalty coefficient, which controls how much misclassification error is tolerated: the higher C is, the less error the model tolerates. If C is either too large or too small, the generalization ability deteriorates.
Gamma is a parameter of the RBF kernel once the RBF function is selected as the kernel. It implicitly determines the distribution of the data after it is mapped into the new feature space. The larger gamma is, the fewer support vectors there are; the smaller gamma is, the more support vectors there are. The number of support vectors affects both training and prediction speed.
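To make the roles of C and gamma concrete, here is a minimal sketch that trains one (C, gamma) pair with an RBF kernel. It is not part of the article's program; it assumes the standard Python bindings shipped with libsvm (svmutil) and a hypothetical training file train.txt in libsvm format:

from svmutil import svm_read_problem, svm_train

# train.txt is a hypothetical file in libsvm format: "<label> <index>:<value> ..."
y, x = svm_read_problem("train.txt")

# -t 2 selects the RBF kernel, -c sets C, -g sets gamma, -v 5 runs 5-fold cross-validation
accuracy = svm_train(y, x, "-t 2 -c 1 -g 0.5 -v 5")
print(accuracy)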

Grid Search

Grid search is the parameter search method used in libsvm. It is easy to understand: in the two-dimensional parameter grid formed by C and gamma, try each pair of parameters in turn and measure its effect.
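On a single machine, this amounts to a double loop over candidate values. The following sketch only illustrates the idea; the candidate ranges are commonly suggested exponential grids, not values from the article, and libsvm's svmutil is again assumed:

from svmutil import svm_read_problem, svm_train

y, x = svm_read_problem("train.txt")  # hypothetical training file in libsvm format

best_c, best_g, best_acc = None, None, 0.0
for c in [2 ** p for p in range(-5, 16, 2)]:      # candidate C values
    for g in [2 ** p for p in range(-15, 4, 2)]:  # candidate gamma values
        acc = svm_train(y, x, "-v 5 -t 2 -c %s -g %s" % (c, g))  # 5-fold CV accuracy
        if acc > best_acc:
            best_c, best_g, best_acc = c, g, acc
print("best C=%s gamma=%s accuracy=%s" % (best_c, best_g, best_acc))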

Although grid search is simple to use and looks rather naive, it has two advantages:

  1. It can find the global optimum over the grid.
  2. The (C, gamma) pairs are independent of each other, which makes parallel execution easy.
This makes it possible to use MapReduce to train the model. The basic idea of grid search in MapReduce is as follows. MapReduce is divided into two stages: map and reduce. In the map stage, the data is sliced and distributed to different machines; in the reduce stage, the <key, value> pairs with the same key are gathered together. For our current needs, the goal is to use MapReduce to distribute the SVM parameter experiments across different machines. Ideally, each machine handles one set of parameters, so that the total time spent on parameter optimization is roughly the time needed for a single set of parameters. There are two possible solutions:
  1. Upload both the parameter set and the training sample to Hadoop as local files, so that each mapper can read both files directly, select one set of parameters for training, and output the result.
  2. Upload only the parameter set to Hadoop as a local file, and use the training sample as the input. In this way, Hadoop automatically slices the training sample, and each mapper sees only part of the data. Therefore, we need to group the data by parameters in the map stage and train the SVM in the reduce stage.
Solution 1 is intuitive, but it gives no control over which parameters each mapper trains. Therefore, we chose the second solution. The procedure is as follows:
  1. Compute all the parameter pairs (C, gamma) and write them into a file, one set of parameters per row, then upload the file to Hadoop; the system automatically distributes the file to each machine (a minimal sketch of this step follows the list below).
  2. Use the training sample as the input, so that each <key, value> pair in the map stage represents one sample: the key is the system default value and the value is one row of the training sample. Read all the parameter pairs into a list.
  3. In the map stage, emit the input data as the value, with each set of parameters in the list as the key.
  4. In the reduce stage, Hadoop sorts by key and places the data with the same key on the same machine. In this stage, we therefore gather all values belonging to the same key to reconstruct the original training sample.
  5. Train with the parsed parameters and the gathered samples, and output the cross-validation result.
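For step 1, generating the parameter file can be as simple as the sketch below. The file name params matches the one used in the program later on; the candidate ranges are only illustrative assumptions:

# Write every (C, gamma) pair into the "params" file, one set of parameters per row.
with open("params", "w") as f:
    for c in [2 ** p for p in range(-5, 16, 2)]:
        for g in [2 ** p for p in range(-15, 4, 2)]:
            f.write("%s %s\n" % (c, g))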
The specific program is as follows:
#!/usr/bin/env python
# _*_ coding: utf-8 _*_
import sys
import os
sys.path.append('.')
from hstream import *
sys.path.append(os.path.join(os.path.dirname(__file__), "dependence"))
import tms_svm

'''First step: run the entire program flow through.
Second step: parameterize the input parameter file and svm_type.'''

def read_param(filename):
    '''Read all parameter sets, one per line, into a list.'''
    params = list()
    for line in open(filename):
        params.append(line.strip())
    return params

default_param_file = "./params"
svm_type = "libsvm"
params = read_param(default_param_file)

class svmtrain(hstream):
    '''The input is the data required for SVM training.'''
    # def __init__(self, param_file=default_param_file):
    #     self.parse_args()
    #     print self.default_param_file
    #     self.param_file = param_file
    #     self.params = self.read_param(self.param_file)

    def mapper(self, key, value):
        '''The mapper emits each parameter set as the key and the sample as the value.'''
        for param in params:
            self.write_output(param, value)

    def reducer(self, key, values):
        '''The reducer parses the parameters, rebuilds the training samples and runs cross-validation.'''
        prob_y = []
        prob_x = []
        line = key.split(None, 1)
        # Build the SVM training parameter string
        if sum([1 for i in line]) == 1:
            svm_param = "-v 5 -c " + str(line[0])
        elif sum([1 for i in line]) >= 2:
            svm_param = "-v 5 -c " + str(line[0]) + " -g " + str(line[1])
        # Gather all samples belonging to this key
        for value in values:
            value = value.split(None, 1)
            if len(value) == 1:
                value += ['']
            label, features = value
            xi = {}
            for e in features.split():
                ind, val = e.split(":")
                xi[int(ind)] = float(val)
            prob_y += [float(label)]
            prob_x += [xi]
        # Train with the parsed parameters and training samples
        tms_svm.set_svm_type(svm_type)
        ratio = tms_svm.train(prob_y, prob_x, svm_param)
        self.write_output(key, str(ratio))

    # def parse_args(self):
    #     parser = OptionParser(usage="")
    #     parser.add_option("-p", "--paramfile", help="param filename", dest="paramfile")
    #     options, args = parser.parse_args()
    #     if options.paramfile:
    #         self.default_param_file = options.paramfile

if __name__ == '__main__':
    svmtrain()

With the above program in place, we can get the result by running the following Hadoop streaming command.

$HADOOP jar $STREAMINGJAR -D mapred.job.name='Svm_Train' \
    -D mapred.reduce.tasks=100 \
    -files hstream.py,svm_train.py,params,${depen_path}/svm.py,${depen_path}/svmutil.py,${depen_path}/liblinear.py,${depen_path}/liblinearutil.py,${depen_path}/measure.py,${depen_path}/segment.py,${depen_path}/strnormalize.py,${depen_path}/tms_svm.py,${depen_path}/liblinear.so.64,${depen_path}/libsvm.so.64 \
    -mapper "svm_train.py -m" \
    -reducer "svm_train.py -r" \
    -input ${INPUT} \
    -output ${OUTPUT}

Summary

This article uses MapReduce to parallelize the SVM parameter optimization process, which can greatly shorten SVM training time. In addition, two points are worth noting:

  1. In the svmtrain program, the three variables default_param_file, svm_type, and svm_param should ideally be set by the user (through the parse_args function), but since the experiment failed several times with that approach, they are hard-coded here.
  2. In fact, there is another way to accelerate grid search: first perform a coarse-grained search on a subset of the training samples. If the optimum is found at (1, 0.5), then perform a fine-grained search with a smaller step size over the C interval [1 - step, 1 + step] and the gamma interval [0.5 - step, 0.5 + step]. This method speeds up the search and still finds the global optimum in most cases (a minimal sketch follows below).
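A minimal sketch of this coarse-to-fine idea follows. The file name, step size, and the cross_validate helper are hypothetical and only serve to illustrate the two passes; they are not part of the article's code:

from svmutil import svm_read_problem, svm_train

y, x = svm_read_problem("train_subset.txt")  # hypothetical subset of the training samples

def cross_validate(c, g):
    '''5-fold cross-validation accuracy for one (C, gamma) pair.'''
    return svm_train(y, x, "-v 5 -t 2 -c %s -g %s" % (c, g))

def grid_search(c_values, g_values):
    '''Return the (C, gamma) pair with the highest cross-validation accuracy.'''
    return max(((c, g) for c in c_values for g in g_values),
               key=lambda pair: cross_validate(*pair))

# Coarse pass: wide, exponentially spaced grid.
best_c, best_g = grid_search([2 ** p for p in range(-5, 16, 2)],
                             [2 ** p for p in range(-15, 4, 2)])

# Fine pass: small steps around the coarse optimum, e.g. around (1, 0.5).
step = 0.25
best_c, best_g = grid_search([best_c - step, best_c, best_c + step],
                             [best_g - step, best_g, best_g + step])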
