LIBSVM Usage Experience (revised for the new version)
(In the new LIBSVM the training program is svm-train; the old version called it svmtrain. Windows users need to install gnuplot rather than just unzip it; the executable is named gnuplot.exe, whereas the old version unzipped to pgnuplot.exe.)
(PS: this article is a revised reprint; some places may not yet have been updated for the new version, so treat it as a reference only.)
First download LIBSVM, Python, and gnuplot:
- LIBSVM homepage: http://www.csie.ntu.edu.tw/~cjlin/libsvm/ (download LIBSVM; I use version 3.21)
- Python homepage: http://www.python.org/ (download Python; I use version 3.5.2)
- gnuplot homepage: http://www.gnuplot.info/ (download gnuplot; I use version 5.0.4)
The general workflow with LIBSVM is:
1) Prepare the dataset in the format required by the LIBSVM package;
2) Perform simple scaling on the data;
3) Consider the RBF kernel function first;
4) Use cross-validation to select the best parameters C and g;
5) Train on the whole training set with the best parameters C and g to obtain the support vector machine model;
6) Use the obtained model to test and predict.
1) Data format used by LIBSVM
The training and validation data files used by the software have the following format:
[label] [index1]:[value1] [index2]:[value2] ...
[label] [index1]:[value1] [index2]:[value2] ...
Each line is one data point, for example:
+1 1:0.708 2:1 3:1 4:-0.320 5:-0.105 6:-1
Here (x, y) → ((0.708, 1, 1, -0.320, -0.105, -1), +1).
label (the class) is the category you want to classify into, usually an integer.
index is an ordered feature index, usually consecutive integers.
value is the feature data used for training, usually real numbers.
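As an illustration (not part of the LIBSVM package; the file and variable names here are made up), a small Python sketch that writes dense data in this sparse text format:

# write (label, features) pairs in LIBSVM's text format
X = [[0.708, 1, 1, -0.320, -0.105, -1]]   # example feature rows
y = [1]                                   # example labels

with open('feature.txt', 'w') as f:
    for label, row in zip(y, X):
        # indices are 1-based; zero-valued features may be omitted
        pairs = ' '.join('{0}:{1}'.format(i, v)
                         for i, v in enumerate(row, start=1) if v != 0)
        f.write('{0:+d} {1}\n'.format(label, pairs))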
2) Simple scaling of the data
Scan the data first. Because the raw values may be too large or too small, svm-scale can scale them to a suitable range, which makes training and prediction faster.
svm-scale usage: svm-scale feature.txt > feature.scaled (I did not test this myself; note that svm-scale writes the scaled data to standard output, hence the redirect)
The default scaling range is [-1, 1]; it can be adjusted with the -l and -u parameters, which set the lower and upper bounds respectively. feature.txt is the input feature file, and the scaled output goes to feature.scaled.
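To make the scaling concrete, here is a minimal Python sketch of the min-max scaling that svm-scale performs (illustrative only; the real tool can additionally save its scaling parameters with -s and re-apply them to the test set with -r, so that training and test data are scaled identically):

def scale_column(values, lower=-1.0, upper=1.0):
    # linearly scale one feature column into [lower, upper]
    vmin, vmax = min(values), max(values)
    if vmin == vmax:
        return [lower] * len(values)   # constant feature
    return [lower + (upper - lower) * (v - vmin) / (vmax - vmin)
            for v in values]

print(scale_column([3.0, 10.0, 7.0]))   # [-1.0, 1.0, 0.142857...]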
3) Consider using the RBF kernel function
Training the data into a model essentially computes the w and b of the separating hyperplane wx + b = 0.
svm-train usage: svm-train [options] training_set_file [model_file]
The options have the following meanings:
-s svm_type: set the type of SVM, default 0; the options are:
0 -- C-SVC
1 -- nu-SVC
2 -- one-class SVM
3 -- epsilon-SVR
4 -- nu-SVR
-t kernel_type: set the kernel function type, default 2; the options are:
0 -- linear kernel: u'*v
1 -- polynomial kernel: (gamma*u'*v + coef0)^degree
2 -- RBF kernel: exp(-gamma*||u-v||^2)
3 -- sigmoid kernel: tanh(gamma*u'*v + coef0)
-d degree: set the degree in the kernel function, default 3;
-g gamma: set the gamma in the kernel function, default 1/k;
-r coef0: set the coef0 in the kernel function, default 0;
-c cost: set the penalty parameter C of C-SVC, epsilon-SVR, and nu-SVR, default 1;
-n nu: set the parameter nu of nu-SVC, one-class SVM, and nu-SVR, default 0.5;
-p epsilon: set the epsilon in the loss function of epsilon-SVR, default 0.1;
-m cachesize: set the cache memory size in MB, default 100;
-e epsilon: set the tolerance of the termination criterion, default 0.001;
-h shrinking: whether to use the shrinking heuristics, 0 or 1, default 1;
-b probability_estimates: whether to compute probability estimates for the SVC or SVR model, 0 or 1, default 0;
-wi weight: set the penalty C of class i to weight*C, default 1;
-v n: n-fold cross-validation mode.
The k in the -g option refers to the number of attributes (features) in the input data. The -v option randomly splits the data into n parts and computes the cross-validation accuracy (or the mean squared error for regression). These options can be combined according to the SVM type and the parameters supported by the chosen kernel function; if an option does not apply to the chosen kernel or SVM type, the program ignores it, and if an applicable option is set incorrectly, its default value is used. training_set_file is the dataset to train on; model_file is the model file generated after training, and if it is not given, a default file name is used. Here is an example:
C:/libsvm-3.21/windows> svm-train heart_scale
*
Optimization finished, #iter = 162
Nu = 0.431029
obj = -100.877288, rho = 0.424462
NSV = 132, NBSV = 107
Total NSV = 132
A brief description of the screen output:
#iter is the number of iterations;
nu is the same as the -n nu parameter above;
obj is the minimum value of the quadratic programming problem that the SVM is converted into;
rho is the constant term b of the decision function;
nSV is the number of support vectors;
nBSV is the number of bounded support vectors (support vectors at the boundary);
Total nSV is the total number of support vectors.
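For reference, the same training can also be launched from Python through the interface shipped under libsvm/python; a minimal sketch, assuming that directory is on the import path:

from svmutil import *   # libsvm's python interface (libsvm/python)

y, x = svm_read_problem('heart_scale')
# -s 0: C-SVC, -t 2: RBF kernel; C and gamma are left at their defaults
model = svm_train(y, x, '-s 0 -t 2')
# the same screen output (#iter, nu, obj, rho, nSV, nBSV) is printed here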
The trained model is saved in the file *.model; opened in a text editor it looks like this:
svm_type c_svc        % type of SVM used in training, here C-SVC
kernel_type rbf       % kernel function type used in training, here the RBF kernel
gamma 0.0769231       % the gamma in the kernel function, default 1/k
nr_class 2            % number of classes, here a two-class problem
total_sv 132          % total number of support vectors
rho 0.424462          % constant term b of the decision function
label 1 -1            % class labels
nr_sv ...             % number of support vectors for each class label
SV                    % the support vectors follow
1 1:0.166667 2:1 3:-0.333333 4:-0.433962 5:-0.383562 6:-1 7:-1 8:0.0687023 9:-1 10:-0.903226 11:-1 12:-1 13:1
0.5104832128985164 1:0.125 2:1 3:0.333333 4:-0.320755 5:-0.406393 6:1 7:1 8:0.0839695 9:1 10:-0.806452 12:-0.333333 13:0.5
1 1:0.333333 2:1 3:-1 4:-0.245283 5:-0.506849 6:-1 7:-1 8:0.129771 9:-1 10:-0.16129 12:0.333333 13:-1
1 1:0.208333 2:1 3:0.333333 4:-0.660377 5:-0.525114 6:-1 7:1 8:0.435115 9:-1 10:-0.193548 12:-0.333333 13:1
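Instead of reading the .model file by hand, the python interface can load and inspect it; a small sketch, assuming the model file from the run above:

from svmutil import *   # libsvm's python interface

model = svm_load_model('heart_scale.model')
print(model.get_nr_class())   # 2, matches nr_class in the file
print(model.get_labels())     # [1, -1], matches the label line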
4) Cross-validation to select the best parameters C and g (this is the part mainly revised)
Generally speaking, the most important parameters are gamma (-g) and cost (-c), and the cross-validation fold (-v) is usually set to 5.
So how do we choose the best parameters C and g? (Windows users: gnuplot must be installed, not just unzipped.)
grid.py in LIBSVM's tools directory (the original article said the python directory, which is wrong) can help us here. You need to install Python 3.5 (by default it installs to c:/python35-32); after a successful installation, edit the environment variables and add the Python installation path to the computer's PATH. Also install gnuplot (I installed it to c:/gnuplot504).
After installation, go to the /libsvm/tools directory and open grid.py with a text editor (IDLE works well). Find the gnuplot path (its default is gnuplot_exe = r"C:/tmp/gnuplot/bin/pgnuplot.exe") and change it to the actual path; in this example, replace it with "C:/gnuplot504/bin/gnuplot.exe", then save.
(Note: check the name of the .exe file in the bin directory; it differs between newer and older versions.)
Then put grid.py and the sample file (heart_scale) in the same directory.
Open cmd and first change to the folder where grid.py resides. In the cmd window, type:
cd C:\libsvm-3.21\tools
> python grid.py heart_scale
When it finishes executing, the optimal parameters C and g are obtained.
(Note: the gnuplot path in the grid.py file must be modified to match the actual path.)
If you do not modify the path inside grid.py, you need to give the full path information when running the program:
> grid.py -log2c -5,5,1 -svmtrain "C:\libsvm-3.21\windows\svm-train.exe" -gnuplot C:\gnuplot504\bin\gnuplot.exe -v 10 heart_scale
If you can see the program's results, the interface between LIBSVM and Python is configured correctly, and you can then call LIBSVM functions directly from Python programs.
Result output: grid.py outputs two files:
1. dataset.png: the CV accuracy contour plot generated by gnuplot. Testing with heart_scale outputs heart_scale.png in the tools directory; the image shows the dataset name, the best (C, g) pair, and the best rate.
2. dataset.out: the CV accuracy at each (log2(C), log2(gamma)); it records every (C, g) pair tried together with its cross-validation result.
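If you prefer to stay inside Python, the search grid.py performs can also be sketched directly with the python interface, since svm_train returns the cross-validation accuracy when -v is given. This is only a rough illustration (no contour plot; the grid below mimics grid.py's default ranges):

from svmutil import *   # libsvm's python interface

y, x = svm_read_problem('heart_scale')

best = (None, None, 0.0)
# grid.py's defaults: log2c from -5 to 15, log2g from 3 to -15, step 2
for log2c in range(-5, 16, 2):
    for log2g in range(-15, 4, 2):
        # with -v, svm_train returns the CV accuracy instead of a model
        acc = svm_train(y, x, '-c {0} -g {1} -v 5 -q'.format(2**log2c, 2**log2g))
        if acc > best[2]:
            best = (2**log2c, 2**log2g, acc)

print('best C = {0}, g = {1}, CV accuracy = {2}%'.format(*best))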
In addition, regarding the LIBSVM and Python interface: in libsvm 2.86, Lin has already solved this for us. The /libsvm/windows/python directory contains the file svmc.pyd; copy that file to the libsvm/python directory, also copy python.exe into the same directory, and type the following command to verify the effect:
python svm_test.py (as in the original article, unmodified; I have not tested this yet)
(Note: to make it convenient to run the whole series of programs, you can move the *.py files under tools into the windows folder; then the training and testing below need no switching between folders. Note that the program and the sample files must sit under the same path.)
5) Use the best parameters C and g to train the whole training set and obtain the support vector machine model
$ svm-train -c x -g x training_set_file [model_file]
where x stands for the optimal parameter values C and g obtained above. (Do not pass -v at this step: with -v, svm-train only reports the cross-validation accuracy and does not write a model file.)
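Through the python interface, this final training step looks like the following sketch; best_c and best_g are placeholders for the values found in step 4:

from svmutil import *   # libsvm's python interface

best_c, best_g = 2.0, 0.5   # placeholders: substitute the grid.py results
y, x = svm_read_problem('heart_scale')
model = svm_train(y, x, '-c {0} -g {1}'.format(best_c, best_g))   # note: no -v
svm_save_model('heart_scale.model', model)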
6) Use the obtained model to test and predict
Use the model trained by svm-train for testing: input new x values, and it outputs the SVM-predicted y values.
$ svm-predict test_file model_file output_file
For example: ./svm-predict heart_scale heart_scale.model heart_scale.out
Accuracy = 86.6667% (234/270) (classification)
The prediction accuracy is shown here.
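The equivalent prediction through the python interface (a sketch; here the test set is simply the training file again, as in the example above):

from svmutil import *   # libsvm's python interface

y, x = svm_read_problem('heart_scale')
model = svm_load_model('heart_scale.model')
# svm_predict prints the "Accuracy = ..." line itself; p_acc[0] holds the value
p_label, p_acc, p_val = svm_predict(y, x, model)
print(p_acc[0])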
A concrete usage example.
Use heart_scale from the LIBSVM package as both the training data and the test data. Python has been installed to drive C, the default gnuplot path in the grid.py file has been changed to the actual installation path, and heart_scale, grid.py, and python.exe have been copied into the /libsvm/windows folder.
./svm-train heart_scale
Optimization finished, #iter = 162
Nu = 0.431029
obj = -100.877288, rho = 0.424462
NSV = 132, NBSV = 107
Total NSV = 132
At this point heart_scale.model has been obtained. Use it to predict:
./svm-predict heart_scale heart_scale.model heart_scale.out
Accuracy = 86.6667% (234/270) (classification)
The accuracy is 86.6667%.
./python grid.py heart_scale
The optimal parameters C = 2048, g = 0.0001220703125 are obtained.
./svm-train -c 2048 -g 0.0001220703125 heart_scale produces the model; then ./svm-predict heart_scale heart_scale.model heart_scale.out gives
Accuracy = 85.1852%. This part is still a bit confusing: why is the accuracy lower? A likely reason is that the 86.6667% above was measured on the training data itself, while grid.py selects (C, g) by cross-validation; parameters chosen to generalize well do not necessarily maximize accuracy on the training set.
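To see the two numbers side by side, both can be reproduced through the python interface (a sketch; 5-fold cross-validation is assumed, grid.py's default):

from svmutil import *   # libsvm's python interface

y, x = svm_read_problem('heart_scale')
params = '-c 2048 -g 0.0001220703125'

# cross-validation accuracy: the quantity grid.py optimizes
cv_acc = svm_train(y, x, params + ' -v 5 -q')

# accuracy on the training data itself: the 86.6667%-style figure
model = svm_train(y, x, params + ' -q')
p_label, p_acc, p_val = svm_predict(y, x, model)

print('CV accuracy: {0}%, training accuracy: {1}%'.format(cv_acc, p_acc[0]))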
Of course, we can also combine subset.py and easy.py to automate the whole process.
If you need to run the training many times, a batch script can save a lot of work. Here is an example:
@echo off
cls
:: split the data and output the results
FOR /L %%i IN (1,1,1000) DO python subset.py b59.txt 546 b59(%%i).in8 b59(%%i).out2
FOR /L %%i IN (1,1,1000) DO python easy.py b59(%%i).in8 b59(%%i).out2 >> result89.txt
This batch script first calls subset.py 1000 times to draw stratified random samples from b59.txt (splitting the data roughly 80%/20%, with 546 examples per subset), then calls easy.py 1000 times for parameter optimization, appending the results to result89.txt
(including the classification accuracy and the parameter pair for each of the 1000 training runs).
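The same loop can also be written in Python instead of a .bat file; a sketch, assuming the tools and the data file sit in the current directory (subprocess.run needs Python 3.5+):

import subprocess

for i in range(1, 1001):
    # split b59.txt: 546 examples into the subset, the rest into the second file
    subprocess.run(['python', 'subset.py', 'b59.txt', '546',
                    'b59({0}).in8'.format(i), 'b59({0}).out2'.format(i)])
    # run easy.py on the split and append its output to result89.txt
    with open('result89.txt', 'a') as out:
        subprocess.run(['python', 'easy.py',
                        'b59({0}).in8'.format(i), 'b59({0}).out2'.format(i)],
                       stdout=out)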
You can also call fselect.py for feature selection, and plotroc.py to draw ROC curves.