Before this, the Pattern Analysis and Machine Intelligence Lab of Shanghai Jiao Tong University made a partial annotation of svm.cpp in version 2.6 (where to find it? Google, of course). However, that annotation only comments on the code itself; after reading the whole thing, you will find that beyond understanding the meaning of a few parameters, you will still be confused by LIBSVM. Of course, as auxiliary material for understanding the program it is very useful; in particular, its description of several data structures is quite clear. But to really know what the program does, you still have to trace through the procedure yourself.
Since SVM involves a great deal of mathematics, we only discuss some basic ideas here, starting from the most basic C-SVC type of SVM with the commonly used RBF kernel. Version 2.6 of LIBSVM is the better one to study, because the author added a lot of material in later versions, making the original design hard to follow. Since I work in pattern recognition, I analyze the function call process mainly from the angle of classification. Starting from svm-train.c, the function calls are as follows:
(Figure: the function call graph for C-SVC training, not reproduced here.) That is the entire C-SVC computation process; the following important points deserve detailed explanation:
1. svm_group_classes
This function does not exist in version 2.6; its work is done directly inside svm_train. To improve readability, version 2.89 factors it out as a separate function, but the work is the same. Note that when the samples are rearranged, perm stores only the index of each sample at its original position, not the data itself. This has two benefits:
1) The original data (the data read into x) is not destroyed;
2) Retrieval is easy: only the l-dimensional index array needs to be searched, and once the index is obtained, the corresponding position in the original data can be located directly.
perm records the order of the classes as they appear in the original samples, which is not necessarily the order of your original labels. If the original labels are {-1, 0, 1} and the first sample encountered has label 1, then in perm the samples with label 1 are ranked first, as class 0. The array start records the starting position of each class, and this position is an index into perm.
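The following is a minimal sketch of this grouping idea, not LIBSVM's exact code (the names group_classes_sketch, data_label and fill are illustrative): labels are numbered in order of first appearance, count/start are accumulated, and perm is filled with the original indices.

#include <vector>

/* Sketch: label[] collects distinct labels in first-appearance order,
   start[c] is where class c begins in perm[], and perm[k] holds the
   ORIGINAL index of the k-th grouped sample. */
void group_classes_sketch(const std::vector<int> &y,
                          std::vector<int> &label,
                          std::vector<int> &start,
                          std::vector<int> &perm)
{
    int l = (int)y.size();
    std::vector<int> data_label(l);            /* class number of each sample */
    label.clear();
    for (int i = 0; i < l; i++) {
        int c = 0;
        while (c < (int)label.size() && label[c] != y[i]) c++;
        if (c == (int)label.size()) label.push_back(y[i]);   /* new class found */
        data_label[i] = c;
    }
    int nr_class = (int)label.size();
    std::vector<int> count(nr_class, 0);
    for (int i = 0; i < l; i++) count[data_label[i]]++;
    start.assign(nr_class, 0);
    for (int c = 1; c < nr_class; c++) start[c] = start[c-1] + count[c-1];
    perm.assign(l, 0);
    std::vector<int> fill = start;             /* running write position per class */
    for (int i = 0; i < l; i++)
        perm[fill[data_label[i]]++] = i;       /* store the index, not the data */
}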
2. One-against-one multi-class discrimination
SVM finds a separating line (surface) between classes; between two classes there is only one boundary, so a single classifier suffices: a sample belongs either to class 1 or to class 2. For multiple classes, however, there are several ways to organize the classification. The main existing methods are:
1) The 1-v-r (one-versus-rest) approach
For a k-class problem, the training samples of one class are treated as one class and all other classes are lumped together as the second class, so there are k classifiers in total. At prediction time, discrimination uses a competitive (voting) approach: the sample is assigned to the class that receives the most votes.
2) The 1-v-1 (one-versus-one) approach
This is what we call the one-against-one approach. It constructs a classifier for every pair of classes, giving k(k-1)/2 classifiers in total. The final prediction also uses a voting approach.
3) Directed acyclic graph (DAG-SVM)
In the training stage, this method is the same as 1-v-1, while the discrimination stage uses a directed acyclic graph.
LIBSVM uses the 1-v-1 approach, because it is simple and much practical experience shows it performs better than the 1-v-r approach.
(Figure: the 1-v-1 pairings for a 5-class problem, not reproduced here.) In that figure, red marks the pairings of class 0 with every other class, purple the pairings of class 1 with the remaining classes, green those of class 2 with the classes to its right, and blue the single pairing of classes 3 and 4. The nr_class classes are therefore paired as follows:
for (i = 0; i < nr_class; i++)
{
    for (j = i+1; j < nr_class; j++)
    {
        /* train a classifier for class i versus class j */
    }
}
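At prediction time, the pairwise decisions are combined by voting. Here is a hedged sketch of that idea (vote_1v1_sketch and the dec layout are illustrative, not LIBSVM's svm_predict code):

#include <vector>

int vote_1v1_sketch(int nr_class,
                    const std::vector<std::vector<double> > &dec)  /* dec[i][j], j > i */
{
    std::vector<int> votes(nr_class, 0);
    for (int i = 0; i < nr_class; i++)
        for (int j = i + 1; j < nr_class; j++)
            votes[dec[i][j] > 0 ? i : j]++;   /* classifier (i,j) casts one vote */
    int best = 0;
    for (int c = 1; c < nr_class; c++)
        if (votes[c] > votes[best]) best = c; /* class with the most votes wins */
    return best;
}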
3. Memory management for the Hessian matrix
Because SVM is based on structural risk minimization, it has advantages in classification and recognition over traditional approaches based on empirical risk minimization. But SVM also has a fatal flaw: computing the Hessian matrix Qij consumes a huge amount of memory, which is a disadvantage in practical applications. How to reduce memory usage is still a research topic for SVM. LIBSVM's strategy for the Hessian matrix is to define a memory-management class, the Cache class, which allocates a fixed amount of memory in advance to store computed values of Qij and indexes them with a doubly linked list to speed up retrieval. Its most important function is:
int Cache::get_data(const int index, Qfloat **data, int len)
Here len is the requested length of the data, and data returns the first address of the memory holding row index of Q (the entries Qij of that row).
Each call searches the linked list for the row Qi whose row number is index. If it has already been computed, the function returns the address of the computed memory and moves that row's list node to the tail of the list. If it has not been computed, memory is allocated and the row is computed; when the remaining memory is insufficient, memory is reclaimed from the head of the list. Here some may ask: doesn't that discard earlier computations? In fact, during training, once a sample's alpha[i] no longer changes (that is, alpha[i] = 0 or alpha[i] = C), its row Qi will no longer be selected for training, so a previously computed Qi can indeed become useless. The order of the list thus reflects selection frequency: rows nearest the head are the least likely to be selected again (their alpha[i] = 0 or alpha[i] = C), while rows nearest the tail are the most likely to be selected.
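The following is a minimal sketch of the LRU idea behind the Cache class (LruSketch, Row and the bookkeeping here are illustrative; the real Cache in svm.cpp differs in detail, e.g. it can also shrink rows and swap indices): rows live in a doubly linked list, a hit moves the row to the tail, and when memory runs out rows are evicted from the head.

#include <cstring>
#include <vector>

struct Row { Row *prev, *next; float *data; int len; };

struct LruSketch {
    std::vector<Row> rows_;   /* one node per possible row index */
    Row lru_;                 /* sentinel: lru_.next is the head, lru_.prev the tail */
    long free_;               /* how many floats we may still allocate */

    LruSketch(int l, long size_in_floats) : rows_(l), free_(size_in_floats) {
        lru_.prev = lru_.next = &lru_;
    }
    void unlink(Row *r) { r->prev->next = r->next; r->next->prev = r->prev; }
    void insert_tail(Row *r) {
        r->next = &lru_; r->prev = lru_.prev; r->prev->next = r; lru_.prev = r;
    }

    /* Returns how many entries were already cached; the caller fills
       (*data)[cached..len) with freshly computed Qij values.
       Assumes the total budget is at least len floats. */
    int get_data(int index, float **data, int len) {
        Row *r = &rows_[index];
        int cached = r->len;
        if (cached) unlink(r);                 /* hit: detach, re-insert at tail below */
        if (cached < len) {
            /* evict least-recently-used rows from the head until we fit */
            while (free_ < len - cached && lru_.next != &lru_) {
                Row *old = lru_.next;
                unlink(old);
                free_ += old->len;
                delete [] old->data;
                old->data = 0;
                old->len = 0;
            }
            float *p = new float[len];         /* grow this row's buffer */
            if (r->data) std::memcpy(p, r->data, sizeof(float) * cached);
            delete [] r->data;
            r->data = p;
            free_ -= len - cached;
            r->len = len;
        }
        insert_tail(r);                        /* most recently used goes to the tail */
        *data = r->data;
        return cached;
    }
};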
4. Working set selection: select_working_set(i, j)
For a large number of samples (thousands), the memory required by SVM is more than a computer can afford. There are currently two main ways to solve this problem: chunking algorithms and decomposition algorithms. LIBSVM uses SMO (sequential minimal optimization), a decomposition method that selects only two samples per training step. We will not discuss SMO itself in detail; readers who want a closer look should consult the relevant literature. Here we only talk about the procedure.
The dual problem of the general SVM is:

    min f(α) = (1/2)·αᵀQα − eᵀα
    s.t. yᵀα = 0, 0 ≤ α_i ≤ C, i = 1, …, l    (4.1)

where Q_ij = y_i·y_j·K(x_i, x_j) and e is the vector of all ones.
The necessary and sufficient condition for the SVM to have converged is the KKT condition, which for problem (4.1) takes the form: there exists a scalar b such that, for every i,

    ∇f(α)_i + b·y_i ≥ 0 if α_i < C,
    ∇f(α)_i + b·y_i ≤ 0 if α_i > 0.    (4.2)
Taking the derivative of (4.1) gives:

    ∇f(α)_i = Σ_j Q_ij·α_j − 1    (4.3)
Further derivation shows that α is optimal if and only if

    max{ −y_t·∇f(α)_t : t ∈ I_up(α) } ≤ min{ −y_t·∇f(α)_t : t ∈ I_low(α) }    (4.4)

where I_up(α) = { t : α_t < C, y_t = 1 or α_t > 0, y_t = −1 } and I_low(α) = { t : α_t < C, y_t = −1 or α_t > 0, y_t = 1 }.
That is, as long as all samples satisfy (4.4), the solution is optimal. Therefore, in each round of training it suffices to select the two samples (with indices i and j) that most severely violate the KKT condition (4.4); once no such violating pair exists, the remaining samples are guaranteed to satisfy the KKT condition as well. The indices i and j are selected as follows:
    i = argmax{ −y_t·∇f(α)_t : t ∈ I_up(α) },  j = argmin{ −y_t·∇f(α)_t : t ∈ I_low(α) }    (4.5)
5. Stopping criterion
In the LIBSVM program, the stopping criterion is contained in the return value of select_working_set(i, j): when no sample pair satisfying (4.5) can be found, the optimal solution has theoretically been reached. In actual programming, however, the KKT condition is very strict and must be relaxed appropriately. Let:
    Gi = −y_i·∇f(α)_i,  Gj = y_j·∇f(α)_j    (4.6)

By (4.4), when all samples satisfy the KKT condition, Gi ≤ −Gj. Adding an appropriate slack ε, the EPS in the program (default 0.001), the final stopping criterion is:

    Gi ≤ −Gj + ε  ⟹  Gi + Gj ≤ ε    (4.7)
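Below is a simplified sketch combining the selection rule (4.5) with the stopping test (4.7). It is illustrative only: LIBSVM 2.89's real select_working_set additionally refines the choice of j with a second-order criterion, and the names here are not LIBSVM's.

#include <cfloat>

/* G[t] is the gradient (4.3), y[t] = +/-1, eps the tolerance (default 0.001). */
int select_working_set_sketch(int l, const signed char *y, const double *alpha,
                              const double *G, double C, double eps,
                              int &out_i, int &out_j)
{
    double Gmax  = -DBL_MAX;   /* max over I_up  of -y_t * G_t  (the Gi of (4.6)) */
    double Gmax2 = -DBL_MAX;   /* max over I_low of  y_t * G_t  (the Gj of (4.6)) */
    out_i = out_j = -1;
    for (int t = 0; t < l; t++) {
        bool in_up  = (y[t] == +1) ? (alpha[t] < C) : (alpha[t] > 0);
        bool in_low = (y[t] == +1) ? (alpha[t] > 0) : (alpha[t] < C);
        if (in_up  && -y[t] * G[t] > Gmax)  { Gmax  = -y[t] * G[t]; out_i = t; }
        if (in_low &&  y[t] * G[t] > Gmax2) { Gmax2 =  y[t] * G[t]; out_j = t; }
    }
    if (Gmax + Gmax2 < eps)    /* (4.7): Gi + Gj <= eps, no violating pair left */
        return 1;              /* tell the solver to stop */
    return 0;
}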
6. Updating the alpha factors
Since SMO selects only two samples at a time, the equality constraint of (4.1) can be reduced to a straight-line constraint:
    y1·α1 + y2·α2 = constant = y1·α1old + y2·α2old,  0 ≤ α1, α2 ≤ C    (4.8)
(Figure: the line constraint (4.8) inside the box 0 ≤ α1, α2 ≤ C, not reproduced here.)
Expressing α1 in terms of α2 through (4.8), analytic geometry gives the range of values that α2 can take:

    L ≤ α2new ≤ H, where
    if y1 ≠ y2: L = max(0, α2 − α1), H = min(C, C + α2 − α1)
    if y1 = y2: L = max(0, α1 + α2 − C), H = min(C, α1 + α2)    (4.9)
After a series of transformations, the unclipped update value α2new,unc of α2 is obtained:

    α2new,unc = α2 + y2·(E1 − E2)/η,  where η = K11 + K22 − 2·K12 and Ei = f(xi) − yi    (4.10)
Combining (4.9) and (4.10) gives the final expression for α2new:

    α2new = H if α2new,unc > H;  α2new,unc if L ≤ α2new,unc ≤ H;  L if α2new,unc < L    (4.11)
After obtaining α2new, α1new follows from (4.8): α1new = α1 + y1·y2·(α2 − α2new).
In the concrete implementation, simply substitute the selected indices i and j for the 1 and 2 used here. Of course, these formulas are still too abstract for programming, and (4.9) in particular must be broken down further. For example, when y1 ≠ y2, L = max(0, α2 − α1), so one must decide whether 0 or α2 − α1 is larger; altogether eight cases need to be distinguished.
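Here is a sketch of the two-variable update (4.8)-(4.11) in Platt's SMO formulation (smo_update_sketch is illustrative; LIBSVM's Solver organizes the same arithmetic differently, working with gradients rather than the errors E1, E2):

#include <algorithm>

/* K11, K22, K12 are kernel values, E1/E2 the errors f(x_i) - y_i,
   and a1/a2 are the two multipliers, updated in place. */
bool smo_update_sketch(double y1, double y2, double K11, double K22, double K12,
                       double E1, double E2, double C, double &a1, double &a2)
{
    double L, H;
    if (y1 != y2) {                       /* (4.9), case y1*y2 = -1 */
        L = std::max(0.0, a2 - a1);
        H = std::min(C, C + a2 - a1);
    } else {                              /* (4.9), case y1*y2 = +1 */
        L = std::max(0.0, a1 + a2 - C);
        H = std::min(C, a1 + a2);
    }
    if (L >= H) return false;             /* no room to move */

    double eta = K11 + K22 - 2.0 * K12;   /* curvature along the constraint line */
    if (eta <= 0.0) return false;         /* degenerate case, skipped in this sketch */

    double a2new = a2 + y2 * (E1 - E2) / eta;     /* (4.10), unconstrained step */
    a2new = std::min(H, std::max(L, a2new));      /* (4.11), clip to [L, H] */

    a1 += y1 * y2 * (a2 - a2new);         /* (4.8): keep y1*a1 + y2*a2 constant */
    a2  = a2new;
    return true;
}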
7. Shrinking the data: do_shrinking()
As noted above, SVM uses a huge amount of memory; another flaw is computation speed, since a large amount of data means a large amount of computation, which obviously slows things down. A good remedy is therefore to gradually remove data that no longer participates in the computation. Practice proves that during training, once alpha[i] reaches the boundary (alpha[i] = 0 or alpha[i] = C), its value will not change again; as training proceeds, fewer and fewer samples participate in the computation, and the support vectors in the final SVM result (0 < alpha[i] < C) tend to be only a very small fraction.
The strategy used by LIBSVM is to monitor the alpha[i] values within active_size during the computation; if alpha[i] reaches the boundary, the corresponding sample is removed (made inactive) by swapping it to the end of the active arrays, gradually shrinking active_size.
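A minimal sketch of this swap-to-the-end idea follows (shrink_sketch is illustrative; LIBSVM's do_shrinking also checks gradient conditions before declaring a bounded alpha safe to shrink, and can later reconstruct the gradient to un-shrink):

#include <utility>

void shrink_sketch(int &active_size, double *alpha, int *index, double C)
{
    for (int t = 0; t < active_size; ) {
        if (alpha[t] <= 0.0 || alpha[t] >= C) {       /* stuck at a bound */
            active_size--;
            std::swap(alpha[t], alpha[active_size]);  /* move to the inactive tail */
            std::swap(index[t], index[active_size]);  /* keep original ids aligned */
        } else {
            t++;                                       /* sample stays active */
        }
    }
}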
8. Computing the intercept b
The basic formula for computing b is:

    b = yi − Σ_j αj·yj·K(xj, xi),  for any i with 0 < alpha[i] < C    (4.12)
In theory the value of b may vary with the sample chosen. When the program has finished optimizing, b can be obtained by substituting any standard (free) support vector (0 < alpha[i] < C) into (4.12). There are many ways to compute b; in LIBSVM, b is evaluated separately from the free support vectors of the two classes y = +1 and y = −1 and the two results are averaged:
    b = (b+ + b−)/2,  where b+ and b− are the averages of (4.12) over the free support vectors with y = +1 and y = −1 respectively    (4.13)
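A hedged sketch of (4.12) and (4.13) follows (intercept_sketch and the dense kernel matrix K are illustrative; the real calculate_rho in svm.cpp works with −ρ and the cached gradient instead of recomputing kernel sums):

#include <vector>

double intercept_sketch(const std::vector<double> &alpha,
                        const std::vector<signed char> &y,
                        const std::vector<std::vector<double> > &K, /* K[i][j] = K(x_i, x_j) */
                        double C)
{
    double sum_pos = 0.0, sum_neg = 0.0;
    int n_pos = 0, n_neg = 0;
    int l = (int)alpha.size();
    for (int i = 0; i < l; i++) {
        if (alpha[i] <= 0.0 || alpha[i] >= C) continue;  /* only free SVs */
        double fi = 0.0;                                 /* sum_j alpha_j y_j K(x_j, x_i) */
        for (int j = 0; j < l; j++) fi += alpha[j] * y[j] * K[j][i];
        double bi = y[i] - fi;                           /* (4.12) */
        if (y[i] > 0) { sum_pos += bi; n_pos++; }
        else          { sum_neg += bi; n_neg++; }
    }
    double b_pos = n_pos ? sum_pos / n_pos : 0.0;
    double b_neg = n_neg ? sum_neg / n_neg : 0.0;
    return (b_pos + b_neg) / 2.0;                        /* (4.13) */
}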
At this point, we have quickly gone over the whole idea of LIBSVM. It involves a great deal of theoretical knowledge, and many details require consulting the relevant SVM books. To tell the truth, the author is also a novice and has not worked out some of the theory completely; I can only explain, as best I can, what I know. Hopefully some of it is useful to those who want to learn about SVM.