The SMO algorithm for implementing SVM
We finally arrive at the implementation of SVM. However magical and effective a method is, it only shows its real power once it is implemented. SVM is not only effective but also admits very efficient training algorithms, which is one reason industry is so fond of it. In other words, any set of αi that satisfies the conditions above is an optimal solution of the objective, so our goal is to find an optimal set αi*. Once the αi* are found, it is easy to compute the weight vector w* and the intercept b and obtain the separating hyperplane. This is a convex quadratic programming problem with a global optimum, and in principle it can be solved with off-the-shelf optimization tools. But when there are very many training samples, such general-purpose solvers become so slow as to be unusable. Since SVM was proposed, many faster training methods have appeared. One of the best known is the Sequential Minimal Optimization (SMO) algorithm, proposed in 1998 by John C. Platt of Microsoft Research in the paper "Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines". The idea of SMO is simple: it decomposes the large optimization problem into many small ones. These small problems are easy to solve, and solving them sequentially gives exactly the same result as solving the whole problem at once, while taking much less time. Before diving into SMO, let us first look at coordinate descent, the simple idea SMO is built on.
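For reference, here is the standard statement of the dual problem referred to above, restated in the notation used later in this post (kernel K(xi, xj), labels yi in {-1, +1}, slack penalty C); this is textbook material rather than anything new:

maximize over α:  W(α) = Σi αi − (1/2) Σi Σj αi·αj·yi·yj·K(xi, xj)
subject to:       0 ≤ αi ≤ C for all i,   and   Σi αi·yi = 0

Given an optimal α*, the weight vector and intercept are recovered as

w* = Σi αi*·yi·xi
b* = yj − Σi αi*·yi·K(xi, xj)   for any j with 0 < αj* < C (i.e. any margin support vector)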
8.1. The coordinate descent (ascent) method
Here we need to solve for the m variables αi. Methods such as gradient descent (here we are maximizing, so strictly speaking gradient ascent) update all m components of the vector α in every iteration, i.e. each iteration adjusts the value of every element of α. The coordinate ascent method (coordinate ascent and coordinate descent are a pair: coordinate ascent is for max problems, coordinate descent for min problems) instead adjusts only one variable αi per iteration, keeping the values of all other variables fixed during that iteration.
The meaning of the innermost statement is: fix all αj with j ≠ i, so that W can be regarded as a function of αi alone, and then optimize it directly by taking the derivative with respect to αi. Here the maximization is done in the order i from 1 to m, but the order can be changed so that W increases and converges faster. If the inner one-variable optimization can be carried out to optimality very quickly, the coordinate ascent method is a very efficient way to find the extremum.
Suppose our starting point is A (the figure referred to here is a contour plot of the function projected onto the x-y plane; the darker the color, the smaller the value), and we want to reach the optimum F*. The quickest way is the yellow path in the figure, which arrives in one step; this is essentially Newton's method, but in high dimensions it is not very efficient because it requires inverting a matrix (not discussed here). We can also follow the red path: starting at A, first fix x and move along the y axis to the point B where f(x, y) decreases, then fix y and move along the x axis to the point C where f(x, y) decreases, and keep cycling until F* is reached. In short, as long as each step moves to a place where f(x, y) gets smaller, we will reach F* sooner or later; it is only a matter of time. You may object that the red path is much worse than the yellow one, but that is because this is a simple two-dimensional example. In high dimensions, with a complicated objective function and a large sample set, computing the gradient of the objective with respect to all the αi (for gradient descent) or inverting the matrix (for Newton's method) is very time-consuming. In that setting coordinate descent can be more efficient, provided the optimization of W over a single αi is very fast.
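A minimal toy sketch of this idea in Python (my own illustration, not part of the svm.py code below): coordinate ascent on a simple concave quadratic W(a) = c·a − 0.5·aᵀQa, where each inner step maximizes W over one coordinate in closed form while the others are held fixed.

# toy coordinate ascent: maximize W(a) = c.a - 0.5 * a^T Q a  (Q positive definite)
import numpy as np

Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])      # positive definite, so W is strictly concave
c = np.array([1.0, 2.0])

a = np.zeros(2)                 # start at the origin (the point "A" in the text)
for sweep in range(50):         # outer sweeps over the coordinates
    for i in range(len(a)):     # inner loop: optimize one coordinate, fix the rest
        # dW/da_i = c_i - Q[i,:].a = 0  gives a closed-form update for a_i
        a[i] = (c[i] - Q[i].dot(a) + Q[i, i] * a[i]) / Q[i, i]

print(a)                        # converges to the unconstrained maximizer Q^{-1} c
print(np.linalg.solve(Q, c))    # reference solution for comparison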
8.2. SMO algorithm
The idea of SMO is almost the same as that of coordinate descent; the only difference is that SMO optimizes two α's per iteration instead of one. Why two? Because the dual problem carries the equality constraint Σαi·yi = 0: if all multipliers but one were fixed, that last one would be completely determined by the constraint and there would be nothing left to optimize. So we pick two parameters at a time, say αi and αj; αi can then be expressed in terms of αj and the other (fixed) parameters. Substituting this back into W makes W a function of αj alone, and we can optimize over αj only: take the derivative with respect to αj, set it to zero, and solve for the optimal αj, from which αi follows. That is one iteration, and one iteration adjusts only the two Lagrange multipliers αi and αj. SMO is efficient precisely because, with the other parameters fixed, optimizing a single parameter is very fast (it can be solved analytically rather than iteratively; one such minimal step cannot guarantee the final value of the multipliers being optimized, but it moves the objective one step toward its extremum, and repeating these minimal optimizations over all multipliers until the KKT conditions are all satisfied drives the objective to its optimum). To summarize:
Repeat the following process until convergence {
(1) Select two Lagrange multipliers αi and αj;
(2) Fix the other Lagrange multipliers αk (k ≠ i, j) and optimize W(α) with respect to αi and αj only;
(3) Update the value of the intercept b according to the optimized αi and αj;
}
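Concretely, step (2) uses the equality constraint to eliminate one of the two chosen variables (this is the standard SMO algebra, written in the same notation as above):

αi·yi + αj·yj = −Σ(k≠i,j) αk·yk = ζ   (a constant during this iteration)
⇒ αi = yi·(ζ − αj·yj)                 (using yi² = 1)

Substituting this back into W(α) turns it into a one-variable quadratic in αj, which is maximized analytically and then clipped back into the feasible box 0 ≤ αj ≤ C (see the update formulas under "(2) Optimize αi and αj" below).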
So how exactly are these steps implemented during training, and what do we need to consider? Let us analyze them in detail:
(1) Select αi and αj:
In each iteration we optimize two Lagrange multipliers αi and αj of the objective function while keeping all the others fixed. With N training samples we have N Lagrange multipliers, and picking two at a time gives N(N−1) possible pairs. Which pair should we choose? Think about what our goal is: we want to correct every sample that violates the KKT conditions, because once all samples satisfy the KKT conditions the optimization is complete. The intuitive choice, then, is to deal with the worst offender first, so the first variable αi is the one that violates the KKT conditions most severely. What about the second variable αj? We want to reach the optimal set of N multipliers, i.e. the maximum of the objective (cost) function, as fast as possible, so that training time stays short. It is like travelling from Guangzhou to Beijing: given the choice between a plane and a slow green train, which do you take? So in every iteration we choose the pair αi, αj that moves us fastest toward the maximum of the objective, in other words the pair whose update increases the objective value more than any other pair would. As another analogy, when walking from point A to point B, the blue route along direction c2 lets you take a big stride per step, while the red route along direction c1 only allows a small one; the blue route reaches the goal in two steps, while the red one seems to wander forever. Choosing well matters more than working hard! Long story short: we choose the best αi and αj in every iteration so that the algorithm converges faster. How is this done in practice? It goes by the nice name of heuristic selection: first select the αi that most needs optimizing (i.e. violates the KKT conditions most severely), then, for that αi, select the αj that is likely to yield the largest update step. Concretely there are two procedures:
1) The choice of the first variable αi:
SMO calls the process of choosing the first variable the outer loop. The outer loop looks among the training samples for the point that violates the KKT conditions most severely and takes its multiplier as the first variable. Concretely, it checks whether a training sample (xi, yi) satisfies the KKT conditions, and the check is performed within a tolerance ε. The outer loop first sweeps over all samples satisfying 0 < αi < C, i.e. the support vectors lying on the margin boundary, checks whether they satisfy the KKT conditions, and selects the αi with the worst violation. If all of these satisfy the KKT conditions, it then sweeps the whole training set and selects the worst violator there. Non-bound samples are examined first because they are the ones most likely to need adjustment; samples at the bound usually cannot be adjusted further and stay at the bound, and since most samples are clearly not support vectors, their multipliers need no further attention once they reach zero. The algorithm keeps sweeping the non-bound samples and picking out violators among them; when a sweep finds that no non-bound sample was adjusted, it sweeps all samples to check whether the whole set satisfies the KKT conditions. If that full sweep changes some samples again, it goes back to sweeping the non-bound samples. It thus keeps alternating between sweeping all samples and sweeping only the non-bound samples until the entire set satisfies the KKT conditions. All of these KKT checks stop as soon as the conditions hold within the precision ε; demanding a very precise output would often prevent the algorithm from converging quickly. Sweeping the whole data set is straightforward; to sweep the non-bound αi, the values of all non-bound multipliers (those with 0 < αi < C) are first saved in a separate list which is then traversed, and this step also skips those αi that are already known not to change.
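As a rough sketch (in the same spirit as the check inside innerLoop in the svm.py listing further down), the KKT-violation test for one sample, given its cached prediction error Ei = f(xi) − yi and the tolerance toler, can be written like this:

# sketch: does sample i violate the KKT conditions (within tolerance)?
# uses y_i*E_i = y_i*f(i) - 1, so y_i*E_i < 0 means the functional margin is below 1
def violates_kkt(alpha_i, y_i, E_i, C, toler):
    return ((y_i * E_i < -toler and alpha_i < C) or   # margin < 1 but alpha could still grow toward C
            (y_i * E_i >  toler and alpha_i > 0))     # margin > 1 but alpha has not shrunk to 0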
2) The choice of the second variable αj:
After the first αi is chosen, the algorithm selects the second value αj in an inner loop. Because the update step for the second multiplier is roughly proportional to |Ei − Ej|, we choose the second multiplier that maximizes |Ei − Ej| (i.e. the one that maximizes the iteration step). To save computation, a global cache of the error values of all samples is maintained, so the errors do not have to be recomputed at every selection; from this cache we pick the αj that gives the largest step, i.e. the largest |Ei − Ej|.
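A minimal sketch of that choice (the names error_cache and select_j are my own placeholders; the actual svm.py code below does the same job in selectAlpha_j using an errorCache matrix):

# sketch: pick the second multiplier j maximizing |E_i - E_j| over the cached errors
def select_j(i, E_i, error_cache):
    best_j, best_gap = -1, 0.0
    for j, E_j in error_cache.items():   # error_cache: {sample index: cached error}
        if j == i:
            continue
        gap = abs(E_i - E_j)
        if gap > best_gap:
            best_gap, best_j = gap, j
    return best_j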
(2) Optimize αi and αj:
After these two Lagrange multipliers have been selected, we first compute the constraint bounds on the parameters and then solve the constrained maximization problem.

(4) Termination condition of the convex optimization:

The basic idea of SMO is: if the solution values of all variables satisfy the KKT conditions of this optimization problem, then the solution of the problem has been obtained, because the KKT conditions are necessary and sufficient for this problem (see the references for a proof). So we can monitor the KKT conditions of the original problem, and once all samples satisfy them the iteration is finished. Since the KKT conditions themselves are rather strict, a tolerance is also needed: training is considered finished when all samples satisfy the KKT conditions within the tolerance. There are of course other termination conditions for the convex dual problem; see the references.
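For completeness, here are the standard SMO update formulas used in this constrained maximization (Platt's formulas, written in the notation of the code below, with Kij = K(xi, xj) and Ei the prediction error of sample i):

If yi ≠ yj:  L = max(0, αj − αi),        H = min(C, C + αj − αi)
If yi = yj:  L = max(0, αj + αi − C),    H = min(C, αj + αi)
η = 2·Kij − Kii − Kjj                    (a valid step requires η < 0)
αj(new, unclipped) = αj(old) − yj·(Ei − Ej)/η
αj(new) = αj(new, unclipped) clipped to the interval [L, H]
αi(new) = αi(old) + yi·yj·(αj(old) − αj(new))

The intercept b is then updated from the two candidates b1 and b2, exactly as in step 8 of the code listing below.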
8.3. Python implementation of SMO algorithm
8.3.1. Python prep work
I am using Python 2.7.5. The additional libraries are NumPy and Matplotlib; Matplotlib in turn depends on dateutil and pyparsing, so those need to be installed as well. The first two libraries are easy: just download the matching versions from their official sites. Finding the last two was less straightforward, until I discovered that Python packages can be downloaded and installed with the pip tool, which manages Python packages and feels a bit like Ubuntu's apt-get: whatever library you need, it downloads and installs it in one go. First, go to the pip page at https://pypi.python.org/pypi/pip and download the pip matching your Python version, e.g. pip-1.4.1.tar.gz in my case. Installing pip requires another tool, setuptools, so download the file ez_setup.py from https://pypi.python.org/pypi/setuptools/#windows. Then run the following on the CMD command line (mind the paths):
python ez_setup.py
At this point the .egg file is downloaded automatically and the installation completes.
Then unpack pip-1.4.1.tar.gz, enter that directory, and run:
python setup.py install
Pip is then installed automatically into the Scripts folder of your Python directory; mine is C:\Python27\Scripts.
Inside it you can see pip.exe. Enter that folder and run:
cd C:\Python27\Scripts
pip install dateutil
pip install pyparsing
This downloads and installs the additional libraries for you. Very convenient indeed!
8.3.2. Python implementation of the SMO algorithm
The code below already contains fairly detailed comments. I am not sure whether it is entirely error-free; if you find mistakes, please point them out (the results may differ from run to run, and some results do not look quite right to me, but I have not found where the error is, so I would sincerely appreciate any corrections). I also wrote a function to visualize the result, but it only works for two-dimensional data. Here is the code:
svm.py
#################################################
# svm: support vector machine
# author: zouxy
# date: 2013-12-12
# homepage: http://blog.csdn.net/zouxy09
# Email: [email protected]
#################################################

from numpy import *
import time
import matplotlib.pyplot as plt


# calculate kernel value
def calcKernelValue(matrix_x, sample_x, kernelOption):
    kernelType = kernelOption[0]
    numSamples = matrix_x.shape[0]
    kernelValue = mat(zeros((numSamples, 1)))

    if kernelType == 'linear':
        kernelValue = matrix_x * sample_x.T
    elif kernelType == 'rbf':
        sigma = kernelOption[1]
        if sigma == 0:
            sigma = 1.0
        for i in xrange(numSamples):
            diff = matrix_x[i, :] - sample_x
            kernelValue[i] = exp(diff * diff.T / (-2.0 * sigma**2))
    else:
        raise NameError('Not support kernel type! You can use linear or rbf!')
    return kernelValue


# calculate kernel matrix given train set and kernel type
def calcKernelMatrix(train_x, kernelOption):
    numSamples = train_x.shape[0]
    kernelMatrix = mat(zeros((numSamples, numSamples)))
    for i in xrange(numSamples):
        kernelMatrix[:, i] = calcKernelValue(train_x, train_x[i, :], kernelOption)
    return kernelMatrix


# define a struct just for storing variables and data
class SVMStruct:
    def __init__(self, dataSet, labels, C, toler, kernelOption):
        self.train_x = dataSet    # each row stands for a sample
        self.train_y = labels     # corresponding label
        self.C = C                # slack variable
        self.toler = toler        # termination condition for iteration
        self.numSamples = dataSet.shape[0]              # number of samples
        self.alphas = mat(zeros((self.numSamples, 1)))  # Lagrange factors for all samples
        self.b = 0
        self.errorCache = mat(zeros((self.numSamples, 2)))
        self.kernelOpt = kernelOption
        self.kernelMat = calcKernelMatrix(self.train_x, self.kernelOpt)


# calculate the error for alpha k
def calcError(svm, alpha_k):
    output_k = float(multiply(svm.alphas, svm.train_y).T * svm.kernelMat[:, alpha_k] + svm.b)
    error_k = output_k - float(svm.train_y[alpha_k])
    return error_k


# update the error cache for alpha k after optimizing alpha k
def updateError(svm, alpha_k):
    error = calcError(svm, alpha_k)
    svm.errorCache[alpha_k] = [1, error]


# select alpha j which has the biggest step
def selectAlpha_j(svm, alpha_i, error_i):
    svm.errorCache[alpha_i] = [1, error_i]  # mark as valid (has been optimized)
    candidateAlphaList = nonzero(svm.errorCache[:, 0].A)[0]  # mat.A returns an array
    maxStep = 0; alpha_j = 0; error_j = 0

    # find the alpha with max iterative step
    if len(candidateAlphaList) > 1:
        for alpha_k in candidateAlphaList:
            if alpha_k == alpha_i:
                continue
            error_k = calcError(svm, alpha_k)
            if abs(error_k - error_i) > maxStep:
                maxStep = abs(error_k - error_i)
                alpha_j = alpha_k
                error_j = error_k
    # if we come into this loop for the first time, select alpha j randomly
    else:
        alpha_j = alpha_i
        while alpha_j == alpha_i:
            alpha_j = int(random.uniform(0, svm.numSamples))
        error_j = calcError(svm, alpha_j)

    return alpha_j, error_j


# the inner loop for optimizing alpha i and alpha j
def innerLoop(svm, alpha_i):
    error_i = calcError(svm, alpha_i)

    ### check and pick up the alpha that violates the KKT condition
    ## satisfy KKT condition
    # 1) yi*f(i) >= 1 and alpha == 0 (outside the boundary)
    # 2) yi*f(i) == 1 and 0 < alpha < C (on the boundary)
    # 3) yi*f(i) <= 1 and alpha == C (between the boundary lines)
    ## violate KKT condition
    # because y[i]*E_i = y[i]*f(i) - y[i]^2 = y[i]*f(i) - 1, so
    # 1) if y[i]*E_i < 0, then yi*f(i) < 1: violated if alpha < C (alpha = C would be correct)
    # 2) if y[i]*E_i > 0, then yi*f(i) > 1: violated if alpha > 0 (alpha = 0 would be correct)
    # 3) if y[i]*E_i = 0, then yi*f(i) = 1: it is on the boundary, no optimization needed
    if (svm.train_y[alpha_i] * error_i < -svm.toler) and (svm.alphas[alpha_i] < svm.C) or \
       (svm.train_y[alpha_i] * error_i > svm.toler) and (svm.alphas[alpha_i] > 0):

        # step 1: select alpha j
        alpha_j, error_j = selectAlpha_j(svm, alpha_i, error_i)
        alpha_i_old = svm.alphas[alpha_i].copy()
        alpha_j_old = svm.alphas[alpha_j].copy()

        # step 2: calculate the boundary L and H for alpha j
        if svm.train_y[alpha_i] != svm.train_y[alpha_j]:
            L = max(0, svm.alphas[alpha_j] - svm.alphas[alpha_i])
            H = min(svm.C, svm.C + svm.alphas[alpha_j] - svm.alphas[alpha_i])
        else:
            L = max(0, svm.alphas[alpha_j] + svm.alphas[alpha_i] - svm.C)
            H = min(svm.C, svm.alphas[alpha_j] + svm.alphas[alpha_i])
        if L == H:
            return 0

        # step 3: calculate eta (the similarity of sample i and j)
        eta = 2.0 * svm.kernelMat[alpha_i, alpha_j] - svm.kernelMat[alpha_i, alpha_i] \
                  - svm.kernelMat[alpha_j, alpha_j]
        if eta >= 0:
            return 0

        # step 4: update alpha j
        svm.alphas[alpha_j] -= svm.train_y[alpha_j] * (error_i - error_j) / eta

        # step 5: clip alpha j
        if svm.alphas[alpha_j] > H:
            svm.alphas[alpha_j] = H
        if svm.alphas[alpha_j] < L:
            svm.alphas[alpha_j] = L

        # step 6: if alpha j does not move enough, just return
        if abs(alpha_j_old - svm.alphas[alpha_j]) < 0.00001:
            updateError(svm, alpha_j)
            return 0

        # step 7: update alpha i after optimizing alpha j
        svm.alphas[alpha_i] += svm.train_y[alpha_i] * svm.train_y[alpha_j] \
                               * (alpha_j_old - svm.alphas[alpha_j])

        # step 8: update threshold b
        b1 = svm.b - error_i - svm.train_y[alpha_i] * (svm.alphas[alpha_i] - alpha_i_old) \
                               * svm.kernelMat[alpha_i, alpha_i] \
                             - svm.train_y[alpha_j] * (svm.alphas[alpha_j] - alpha_j_old) \
                               * svm.kernelMat[alpha_i, alpha_j]
        b2 = svm.b - error_j - svm.train_y[alpha_i] * (svm.alphas[alpha_i] - alpha_i_old) \
                               * svm.kernelMat[alpha_i, alpha_j] \
                             - svm.train_y[alpha_j] * (svm.alphas[alpha_j] - alpha_j_old) \
                               * svm.kernelMat[alpha_j, alpha_j]
        if (0 < svm.alphas[alpha_i]) and (svm.alphas[alpha_i] < svm.C):
            svm.b = b1
        elif (0 < svm.alphas[alpha_j]) and (svm.alphas[alpha_j] < svm.C):
            svm.b = b2
        else:
            svm.b = (b1 + b2) / 2.0

        # step 9: update error cache for alpha i, j after optimizing alpha i, j and b
        updateError(svm, alpha_j)
        updateError(svm, alpha_i)

        return 1
    else:
        return 0


# the main training procedure
def trainSVM(train_x, train_y, C, toler, maxIter, kernelOption=('rbf', 1.0)):
    # calculate training time
    startTime = time.time()

    # init data struct for svm
    svm = SVMStruct(mat(train_x), mat(train_y), C, toler, kernelOption)

    # start training
    entireSet = True
    alphaPairsChanged = 0
    iterCount = 0
    # iteration termination condition:
    #   Condition 1: reach max iteration
    #   Condition 2: no alpha changed after going through all samples,
    #                in other words, all alphas (samples) fit the KKT condition
    while (iterCount < maxIter) and ((alphaPairsChanged > 0) or entireSet):
        alphaPairsChanged = 0

        # update alphas over all training examples
        if entireSet:
            for i in xrange(svm.numSamples):
                alphaPairsChanged += innerLoop(svm, i)
            print '---iter:%d entire set, alpha pairs changed:%d' % (iterCount, alphaPairsChanged)
            iterCount += 1
        # update alphas over examples where alpha is not 0 & not C (not on boundary)
        else:
            nonBoundAlphasList = nonzero((svm.alphas.A > 0) * (svm.alphas.A < svm.C))[0]
            for i in nonBoundAlphasList:
                alphaPairsChanged += innerLoop(svm, i)
            print '---iter:%d non boundary, alpha pairs changed:%d' % (iterCount, alphaPairsChanged)
            iterCount += 1

        # alternate loop over all examples and non-boundary examples
        if entireSet:
            entireSet = False
        elif alphaPairsChanged == 0:
            entireSet = True

    print 'Congratulations, training complete! Took %fs!' % (time.time() - startTime)
    return svm


# testing your trained svm model given a test set
def testSVM(svm, test_x, test_y):
    test_x = mat(test_x)
    test_y = mat(test_y)
    numTestSamples = test_x.shape[0]
    supportVectorsIndex = nonzero(svm.alphas.A > 0)[0]
    supportVectors = svm.train_x[supportVectorsIndex]
    supportVectorLabels = svm.train_y[supportVectorsIndex]
    supportVectorAlphas = svm.alphas[supportVectorsIndex]
    matchCount = 0
    for i in xrange(numTestSamples):
        kernelValue = calcKernelValue(supportVectors, test_x[i, :], svm.kernelOpt)
        predict = kernelValue.T * multiply(supportVectorLabels, supportVectorAlphas) + svm.b
        if sign(predict) == sign(test_y[i]):
            matchCount += 1
    accuracy = float(matchCount) / numTestSamples
    return accuracy


# show your trained svm model (only available for 2-D data)
def showSVM(svm):
    if svm.train_x.shape[1] != 2:
        print "Sorry! I can not draw because the dimension of your data is not 2!"
        return 1

    # draw all samples
    for i in xrange(svm.numSamples):
        if svm.train_y[i] == -1:
            plt.plot(svm.train_x[i, 0], svm.train_x[i, 1], 'or')
        elif svm.train_y[i] == 1:
            plt.plot(svm.train_x[i, 0], svm.train_x[i, 1], 'ob')

    # mark support vectors
    supportVectorsIndex = nonzero(svm.alphas.A > 0)[0]
    for i in supportVectorsIndex:
        plt.plot(svm.train_x[i, 0], svm.train_x[i, 1], 'oy')

    # draw the classify line
    w = zeros((2, 1))
    for i in supportVectorsIndex:
        w += multiply(svm.alphas[i] * svm.train_y[i], svm.train_x[i, :].T)
    min_x = min(svm.train_x[:, 0])[0, 0]
    max_x = max(svm.train_x[:, 0])[0, 0]
    y_min_x = float(-svm.b - w[0] * min_x) / w[1]
    y_max_x = float(-svm.b - w[0] * max_x) / w[1]
    plt.plot([min_x, max_x], [y_min_x, y_max_x], '-g')
    plt.show()
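The original post does not include a test script here, so the following is only a hypothetical usage sketch under my own assumptions: the listing above is saved as svm.py, the training data is a small synthetic 2-D set, labels are column matrices with values in {-1, +1}, and a linear kernel is used. Adapt it to your own data file or format.

# hypothetical usage sketch for svm.py (not from the original post)
from numpy import *
import svm

# build a small synthetic 2-D data set: two Gaussian blobs, labels +1 and -1
random.seed(0)
pos = random.randn(40, 2) + array([2.0, 2.0])
neg = random.randn(40, 2) + array([-2.0, -2.0])
train_x = mat(vstack((pos, neg)))
train_y = mat(vstack((ones((40, 1)), -ones((40, 1)))))   # column vector of labels

# train: C = 0.6, tolerance = 0.001, at most 50 iterations, linear kernel
svmClassifier = svm.trainSVM(train_x, train_y, 0.6, 0.001, 50, kernelOption=('linear', 0))

# evaluate on the training set itself and visualize (2-D data only)
accuracy = svm.testSVM(svmClassifier, train_x, train_y)
print 'training accuracy: %.2f%%' % (accuracy * 100)
svm.showSVM(svmClassifier)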