Libsvm Quick Start


Original author: Lin Hongde

Original document: http://www.docin.com/p-10558528.html

Why write this guide?

I have always thought that SVM is a very interesting thing, but I never managed to attend Prof. Chih-Jen Lin's data mining and SVM courses. I read some documents on the Internet, and later, after hearing about libsvm usage from kcwu, I decided to write this up. This is an introduction for people who want to use libsvm without needing to know the complete theory behind SVM. The README and FAQ that come with libsvm are also good documents, but you may need to understand SVM and the overall workflow first (at least that was my impression when reading them); this article is intended for people starting from scratch.

However, please remember that some of the statements below may not be strictly accurate, but for someone who just wants to use SVM I think this way of putting things is easier to understand. In principle, this article is written for programmers (and as a memo for myself), so it does not require much mathematical background or any prior knowledge of SVM.

SVM: What is SVM and what can it do for us?

SVM, Support Vector Machine, is, briefly, something whose origins are a bit like artificial neural networks, and it is now most often used for classification. That is, if I have a bunch of things that have already been divided into classes (but the basis of the classification is unknown), then when new data comes in, SVM can guess which pile the new data should go into. It sounds amazing (if you don't think it's amazing, re-read what that sentence means: the basis of the classification is unknown! If it still doesn't sound amazing, try writing a program to solve that problem), but SVM is based on statistical learning theory and can solve this problem in a reasonable amount of time.

For a graphical example, suppose I have a bunch of points in space that have already been divided into classes by color, and their positions are their data. Then SVM can find the dividing lines (equations) that separate these points, carving the space into regions. When a new point (new data) arrives, you only need to check which region it falls into to know which color (class) it should be. Of course, what SVM actually does is not as simple as just drawing dividing lines, but the example above gives an idea of what SVM is doing. To learn a little more about SVM, refer to cjlin's slides from the data mining course (PDF or PS). For now we can treat SVM as a black box: feed it data, let it process the data, and then use what it gives back.

Where can we get SVM?

libsvm, of course, is the best tool for this. Download links: libsvm.zip or libsvm.tar.gz

The .zip and the .tar.gz contents are basically the same; just pick whichever is more convenient for your operating system: .zip for Windows (because of WinZip and WinRAR) and .tar.gz for Unix.

Compile libsvm

After unpacking, assuming a Unix system, you can simply run make. If it does not compile, please read the instructions carefully and use common sense; since this is a quick-start guide, I won't spend much time on it, and the rare cases that fail to compile are usually a problem with your system. The other subdirectories can be ignored; all you need are the three executables svm-train, svm-scale, svm-predict. On Windows you can recompile them yourself, but pre-built executables are already included: check the windows subdirectory, which should contain svmtrain.exe, svmscale.exe, svmpredict.exe and svmtoy.exe.

Using SVM

libsvm can be used in many ways; this guide only covers the simplest part.

The programs

svm-train

Trains on data; this is the program that actually runs the SVM training. svm-train accepts input in a specific format and produces a "model" file. You can think of this model as SVM's internal data, because prediction needs the model and cannot work directly from the raw training data. This makes sense if you think about it: training itself is a very time-consuming operation, so if training can store its internal state in some form, then the next time you want to make predictions, loading that state will be much faster.

svm-predict

Based on a trained model and the given input (new values), it outputs the predicted class for each new value.

svm-scale

Scales data. Because the range of the original data may be too large or too small, svm-scale can first rescale the data to a suitable range, which makes both training and prediction faster.

For the file format, take a look at the file heart_scale included with libsvm. The input file format of libsvm is:

[label] [index1]:[value1] [index2]:[value2] ...

[label] [index1]:[value1] [index2]:[value2] ...

One data record per row, for example:

+1 1:0.708 4:-0.320 5:-0.105 6:-1

Label

Or "class": the category you want to classify into. It is usually an integer.

Index

The index of the attribute, in ascending order, usually consecutive integers.

Value

The data values used for training, usually a bunch of real numbers.

Each row has the structure above and means: I have one data record with values value1, value2, ..., valueN (their positions being given by the index fields), and the classification result of this record is label.

Maybe you wonder why it's value1, value2, ...? That touches on how SVM works, but you can think of it this way (I am not claiming this is exactly right): it is called a support "vector" machine, so the input training data is a "vector", that is, a row of values x1, x2, x3, ...; those values are the values, and the n in x[n] is given by the index. These are also called the "attributes" of a data point.

In most cases, the data will have many "features" or "attributes", so each input line is a whole group of values. In the earlier point-classification example, every point has an x and a y coordinate, so it has two attributes. Suppose I have two points, (x1, y1) and (x2, y2), belonging to labels (classes) 1 and 2 respectively; their input lines would look like:

1 1:x1 2:y1
2 1:x2 2:y2

Similarly, a point in three-dimensional space would have three attributes (its three coordinates). The biggest benefit of this file format is that it can represent sparse matrices; in other words, some attributes of a data point may be missing.
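To make the format concrete, here is a minimal sketch in Python (not part of libsvm; the numbers and the file name train.txt are made up purely for illustration) that writes a few labeled feature vectors into the file format described above, skipping zero-valued attributes so the file stays sparse:

# Made-up data: (label, list of attribute values).
samples = [
    (1, [0.708, 0.0, 0.0, -0.320, -0.105, -1.0]),
    (2, [0.583, 1.0, 0.0, -0.205, 0.0, -1.0]),
]

with open("train.txt", "w") as f:
    for label, features in samples:
        # Indices are 1-based and ascending; zero-valued attributes may be omitted.
        pairs = [f"{i}:{v}" for i, v in enumerate(features, start=1) if v != 0.0]
        f.write(f"{label} " + " ".join(pairs) + "\n")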

Run libsvm

Next, let's see how to run the libsvm programs. You can use the heart_scale file that comes with libsvm as the input; it is used as the example below.

You should first understand the overall process of using SVM:

1. Prepare the data in the format described above (using svm-scale if necessary)

2. Use svm-train to train a model

3. For new input, use svm-predict to predict the class of the new data
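If you would rather drive these three steps from a program than type the commands by hand, here is a minimal sketch. It assumes the svm-scale, svm-train and svm-predict executables built above are on your PATH (otherwise use ./svm-train and so on), and the file names my_data and my_new_data are purely hypothetical:

import subprocess

# Step 1: scale the data if necessary (svm-scale writes to stdout).
with open("my_data.scale", "w") as out:
    subprocess.run(["svm-scale", "my_data"], stdout=out, check=True)

# Step 2: train a model; by default this writes my_data.scale.model.
subprocess.run(["svm-train", "my_data.scale"], check=True)

# Step 3: predict classes for new data using the trained model.
subprocess.run(["svm-predict", "my_new_data", "my_data.scale.model", "my_new_data.out"],
               check=True)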

svm-train

The svm-train syntax is:

svm-train [options] training_set_file [model_file]

training_set_file is a file in the format described above. If model_file is not given, it defaults to [training_set_file].model. The options can be left empty for now.

Executing the following command produces the file heart_scale.model (the screen output is not very important; as long as there is no error, it is fine):

./svm-train heart_scale

optimization finished, #iter = 219

nu = 0.431030

obj = -100.877286, rho = 0.424632
nSV = 132, nBSV = 107

Total nSV = 132

svm-predict

The syntax of svm-predict is:

svm-predict test_file model_file output_file

test_file contains the data we want to predict. Its format is the same as the svm-train input (training_set_file), except that the labels at the start of each line are only used for checking: the label is exactly what we want to predict, so if the true labels are unknown you can put any placeholder value there. If test_file does contain real label values, then after prediction svm-predict compares its predictions with the labels written in test_file; in other words, the labels in test_file are treated as the true classification results, and comparing them with the predictions tells us how good the predictions are. Therefore we can feed the original training set back to svm-predict as test_file (the format is the same) to see how high the accuracy is, which makes it easy to tune parameters later.

The other arguments are easy to understand: model_file is the file generated by svm-train, and output_file is the file where the results are stored. The output format is very simple: each line contains one label, corresponding to one line of your test_file. Executing the following command produces heart_scale.out:

./svm-predict heart_scale heart_scale.model heart_scale.out

Accuracy = 86.6667% (234/270) (classification)

Mean squared error = 0.533333 (regression)

Squared correlation coefficient = 0.532639 (regression)

Here we fed the original training data back in for prediction, and the Accuracy in the first line is the resulting accuracy rate. If the input had no real labels, this would be a genuine prediction. Basically, that is all you need to use SVM: write a program that outputs the data in the correct format, hand it to svm-train to train, then run svm-predict and read back the results.
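As a sanity check on the Accuracy figure above, the comparison that svm-predict performs can be reproduced in a few lines of Python: take the true label (the first token of each heart_scale line) and compare it with the predicted label on the corresponding line of heart_scale.out. A minimal sketch:

# Recompute the accuracy by comparing true labels with predicted labels.
with open("heart_scale") as test_f, open("heart_scale.out") as pred_f:
    pairs = [(line.split()[0], pred.split()[0]) for line, pred in zip(test_f, pred_f)]

correct = sum(float(t) == float(p) for t, p in pairs)
print(f"Accuracy = {100.0 * correct / len(pairs):.4f}% ({int(correct)}/{len(pairs)})")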

Advanced topics

The sections that follow are somewhat more advanced, and I may not explain them very precisely; the goal is to convey the ideas and explain some of the terms you are likely to run into when reading the related documents.

Scaling

svm-scale is currently not terribly convenient to use, but it is necessary, because appropriate scaling helps both with parameter selection and with the speed of solving the SVM. svm-scale rescales each attribute into a range specified with -l and -u, usually [0, 1] or [-1, 1], and writes the result to stdout. Note that testing data and training data must be scaled in the same way. The most awkward part of svm-scale is that you cannot simply give it the testing data and training data (two different files) and have them scaled together.
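One common workaround in more recent libsvm releases is svm-scale's -s and -r options, which save the scaling parameters computed from the training file and re-apply them to the testing file (check your version's usage message to confirm they are available). A minimal sketch, with hypothetical file names:

import subprocess

# Learn the scaling parameters from the training file, save them to range.txt,
# and write the scaled training data to train.scale.
with open("train.scale", "w") as out:
    subprocess.run(["svm-scale", "-s", "range.txt", "train.txt"], stdout=out, check=True)

# Re-apply exactly the same scaling to the testing file.
with open("test.scale", "w") as out:
    subprocess.run(["svm-scale", "-r", "range.txt", "test.txt"], stdout=out, check=True)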

Arguments

As mentioned above, some options can be passed to svm-train. (Running svm-train without an input file or arguments lists all the options and their syntax.) These options correspond to parameters in the underlying SVM formulation, so they affect how accurate the predictions are.

For example, change c to 10:

./svm-train -c 10 heart_scale

and, predicting as before, the accuracy rate immediately becomes 92.2% (249/270).

Cross Validation

Generally, the parameters for SVM are determined by the following procedure:

1. Start with a pile of data that has already been divided into classes

2. Randomly split it into several groups

3. Train on some groups with one set of parameters, predict the other groups, and check the accuracy rate

4. If the accuracy rate is not good enough, change the parameters and repeat the train/predict cycle

Once a good set of parameters is found, you use those parameters to build the model and then use it to predict unknown data. This whole process is called cross validation. While searching for parameters, the built-in cross validation feature of svm-train can help:

-v n: n-fold cross validation

n is the number of groups to split the data into. For example, n = 3 splits the data into three groups; then we train on groups 1 and 2 and predict group 3 to get an accuracy rate, then train on 2 and 3 and predict 1, and finally train on 1 and 3 and predict 2, and so on. Without cross validation it is easy to find parameters that only work well on one particular input. We got 92.2% with c = 10 above, but let's look at -v 5:

./svm-train -v 5 -c 10 heart_scale

Cross Validation accuracy = 80.3704%

The average is only 80.37%, which is even worse than the original 86.6667%.
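To make the fold rotation concrete, here is a tiny illustration in Python of the bookkeeping that n-fold cross validation performs (made-up indices; this only sketches the idea, not how svm-train implements it internally):

import random

n = 3
indices = list(range(30))                   # pretend we have 30 labeled examples
random.shuffle(indices)
folds = [indices[i::n] for i in range(n)]   # split them into n groups

for i in range(n):
    test_fold = folds[i]
    train_examples = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
    # Train on the other n-1 groups, predict this group, and record the accuracy.
    print(f"round {i + 1}: train on {len(train_examples)} examples, predict {len(test_fold)}")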

Which parameters should we try?

Generally, the most important parameters are gamma (-g) and cost (-c). For cross validation (-v), 5 is a commonly used value. The default cost is 1, and the default gamma is 1/k, where k is the number of attributes in the input data. So how do we know what values to use?

The answer is to try different values and see which work well. The usual way to try parameter values is by exponential growth, i.e. powers of two (2^n). Since there are two parameters, trying n values of each means n * n = n^2 combinations. This is not a continuous search, so you can picture it as picking grid points within a specified range on an x-y plane (a "grid"; if that is hard to picture, think of all the integer-coordinate intersections inside a square on the plane). The x and y value of each grid point is exponentiated (e.g. 2^x and 2^y) to give a cost and a gamma, which are then evaluated with cross validation.

Now you can see what grid.py, found under libsvm's python subdirectory (tools in newer releases), is for: it automates the process above, calling svm-train over the range you specify to try all the parameter values. grid.py will also plot the results so you can easily pick good parameters. libsvm has many components built around python, which shows how convenient a tool python is; fancy features such as automatically logging into several machines and running the grid search in parallel are all done with python. SVM itself does not need python at all; it is just more convenient. To run the grid search, grid.py is of course the easiest way, but if you don't know python and find it hard to use, you can also generate the parameter combinations yourself and run them by hand. The usual range is [c, g] = [2^-10, 2^10] x [2^-10, 2^10]; in practice a grid over [-8, 8] (as exponents) is enough.
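If you do end up generating the parameter combinations yourself, the coarse search can be sketched in a few lines of Python; this assumes svm-train is on your PATH and simply parses the cross validation accuracy it prints (the exponent range mirrors the [-8, 8] suggestion above):

import re
import subprocess

best = (0.0, None, None)
for log2c in range(-8, 9, 2):           # coarse grid over cost ...
    for log2g in range(-8, 9, 2):       # ... and gamma, in powers of two
        cmd = ["svm-train", "-v", "5", "-c", str(2.0 ** log2c),
               "-g", str(2.0 ** log2g), "heart_scale"]
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        # svm-train -v prints a line like "Cross Validation Accuracy = 80.3704%".
        acc = float(re.search(r"Accuracy = ([\d.]+)%", out, re.IGNORECASE).group(1))
        if acc > best[0]:
            best = (acc, 2.0 ** log2c, 2.0 ** log2g)

print("best accuracy %.4f%% with -c %g -g %g" % best)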

Regression

Another thing worth mentioning is regression. In short, SVM as described so far does classification, so the label values are discrete and come from a known, fixed set. Regression instead predicts continuous or previously unseen values. Put another way, ordinary SVM handles (binary) classification problems, while regression guesses a real number.

For example, suppose I know the stock index is affected by certain factors and I want to predict the index. The index is our label, and those factors, once quantified, become the attributes. If we later collect those attributes and feed them to SVM, it will predict the index (possibly a number that has never occurred before); this requires regression. What about lottery numbers? Since those are drawn from a fixed set of known numbers, plain SVM classification is obviously the thing to use for predicting them. (This is a real example: llwang has written something along these lines.) For regression, the label also needs to be scaled, using svm-scale -y lower upper. However, grid.py does not support regression, and cross validation is often not very effective for regression either.
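To make this a bit more concrete: regression in libsvm is selected with svm-train's -s option (-s 3 is epsilon-SVR), the labels in the training file are then real numbers rather than class ids, and svm-predict writes real numbers to the output file. A minimal sketch, with hypothetical file names and otherwise default parameters:

import subprocess

# Training file for regression: the label is a real number, for example
#   8230.5 1:0.21 2:-0.73 3:1.0
# Train an epsilon-SVR model (-s 3) instead of the default C-SVC classifier.
subprocess.run(["svm-train", "-s", "3", "index_train.txt", "index.model"], check=True)

# Predictions are real numbers, one per line of index_pred.txt.
subprocess.run(["svm-predict", "index_new.txt", "index.model", "index_pred.txt"], check=True)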

All in all, regression is very interesting, but it is also a relatively advanced usage, so it is not covered here. If you are interested, please refer to the other SVM and libsvm documents.

Conclusion

That is a brief explanation of how to use libsvm. For more complete usage, see the libsvm documentation and cjlin's website. For SVM beginners there are many useful things in libsvmtools; for example, "SVM for dummies" makes it easy to see the whole libsvm process.
