The solution of Perceptron, logistic regression and SVM

Source: Internet
Author: User
Tags svm

This article presents the solution of the Perceptron and logistic regression, and a partial solution of the SVM, including some proofs. The prerequisite material was covered in the earlier article on gradient descent, Newton's method, and Lagrange duality, and the problems to be solved come from the two articles "From Perceptron to SVM" and "From Linear Regression to Logistic Regression".

Perceptron:

As already mentioned in the previous article, the objective function of the Perceptron is as follows:

$\min \ L(w,b)$

where $L(w,b) = -\sum_{x_i \in M} y_i(w \cdot x_i + b)$ and $M$ denotes the set of misclassified points.

For the above unconstrained optimization problem, gradient descent is the natural choice; however, since batch gradient descent converges relatively slowly here, stochastic gradient descent is used instead.

Its basic idea is to pick one misclassified point at a time and take a gradient step on it, rather than using all misclassified points at once as batch gradient descent does. This greatly reduces the amount of computation and improves efficiency.

Note, however, that stochastic gradient descent may not converge exactly to a minimum; it may instead hover around one.

Below, the objective function is solved. First, take the partial derivatives with respect to $w$ and $b$:

$\nabla_w L(w,b) = -\sum \limits_{x_i \in M} y_i x_i$

$\nabla_b L(w,b) = -\sum \limits_{x_i \in M} y_i$

Randomly select a misclassified point and use it to update $w$ and $b$:

$w \gets w + \eta y_i x_i$

$b \gets b + \eta y_i$

The algorithm terminates when there are no misclassified points left in the training set.
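As an illustration of this update rule (not part of the original article), here is a minimal NumPy sketch. It assumes labels $y_i \in \{-1,+1\}$, rows of `X` as samples, and illustrative names and defaults:

```python
import numpy as np

def perceptron_sgd(X, y, eta=1.0, max_epochs=1000):
    """Stochastic-gradient perceptron: update on one misclassified point at a time."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        # misclassified points satisfy y_i * (w.x_i + b) <= 0
        wrong = np.where(y * (X @ w + b) <= 0)[0]
        if wrong.size == 0:          # no misclassified points left -> stop
            break
        i = np.random.choice(wrong)  # pick one misclassified point at random
        w += eta * y[i] * X[i]       # w <- w + eta * y_i * x_i
        b += eta * y[i]              # b <- b + eta * y_i
    return w, b
```

The `max_epochs` cap is only a safeguard: on linearly separable data the loop stops once no misclassified points remain.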

Logistic regression:

Objective function:

$\min \ L(w)$

$L(w) = -\sum_{i=1}^{n}[y_i(w \cdot x_i) - \ln(1 + \exp(w \cdot x_i))]$

Use gradient descent:

$\nabla_w L(w) = -\sum_{i=1}^{n}\left(y_i x_i - \frac{\exp(w \cdot x_i)}{1+\exp(w \cdot x_i)} x_i\right)$

$w \gets w - \eta \nabla_w L(w)$

Iterate until convergence (for example, until the change in $L(w)$ falls below a tolerance).
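A short NumPy sketch of this gradient-descent loop (again illustrative, with assumed names and defaults); it takes labels $y_i \in \{0,1\}$, which is the convention under which the loss above is the negative log-likelihood:

```python
import numpy as np

def logistic_gd(X, y, eta=0.1, tol=1e-6, max_iters=10000):
    """Minimize L(w) = -sum_i [ y_i (w.x_i) - ln(1 + exp(w.x_i)) ] by gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # exp(w.x_i) / (1 + exp(w.x_i))
        grad = -(X.T @ (y - p))            # gradient of L(w)
        w_new = w - eta * grad             # descent step: w <- w - eta * grad
        if np.linalg.norm(w_new - w) < tol:
            break
        w = w_new
    return w
```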

SVM:

The difference between the SVM and the perceptron or logistic regression is that the SVM is a constrained optimization problem: our goal is to optimize the objective function subject to constraints. The general approach is to simplify the objective function through Lagrange duality and then solve the resulting dual with the SMO (sequential minimal optimization) algorithm.

The first step is to simplify the objective function via Lagrange duality.

Before dealing with the different objective functions, let us explain why Lagrange duality can simplify them:

Assume the Lagrangian function satisfies the KKT conditions:

$L(x,a,b) = f(x) + \sum_{i=1}^{k}a_i c_i(x) + \sum_{j=1}^{l}b_j h_j(x)$

Denote: $\theta_P(x) = \max \limits_{a,b:a_i\geq 0} \ L(x,a,b)$

$\theta_D(a,b) = \min \limits_{x\in R^{n}} \ L(x,a,b)$

Proof:

$\because \theta_D(a,b) = \min \limits_{x\in R^{n}} \ L(x,a,b) \leq L(x,a,b) \leq \max \limits_{a,b:a_i\geq 0} \ L(x,a,b) = \theta_P(x)$

$\therefore \theta_D(a,b) \leq \theta_P(x)$

$\therefore \max \limits_{a,b:a_i\geq 0} \theta_D(a,b) \leq \min \limits_{x\in R^{n}} \theta_P(x)$

Because the "(X,A,B) $ satisfies the kkt condition, the existence of a solution $x^*,a^*,b^*$ makes the above equals sign set."

As to why the KKT conditions are exactly the conditions under which this equality holds, I have not found a proof yet. It has been proved before, however, so we simply take the conclusion as given.
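For reference, here is a brief recap (from the article on Lagrange duality) of the KKT conditions for the Lagrangian above, assuming the standard constraint form $c_i(x) \leq 0$ and $h_j(x) = 0$:

$\nabla_x L(x^*,a^*,b^*) = 0$

$a_i^* c_i(x^*) = 0, \ i=1,2,...,k$ (complementary slackness)

$c_i(x^*) \leq 0, \ i=1,2,...,k$

$a_i^* \geq 0, \ i=1,2,...,k$

$h_j(x^*) = 0, \ j=1,2,...,l$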

Therefore, once the KKT conditions are satisfied, the original problem can be transformed directly into the dual problem, and the problem can be simplified according to the KKT conditions. Each of the following cases elaborates on this:

We treat the different kinds of data the SVM handles one by one. The first is the linearly separable support vector machine:

$\min \ \frac{1}{2}||w||^2$

$s.t. \ \ 1 - y_i(w \cdot x_i + b) \leq 0, \ \ \ i=1,2,3,...,n$

The generalized Lagrangian function is first written:

$L(w,b,a) = \frac{1}{2}||w||^2 + \sum_{i=1}^{n}a_i(1 - y_i(w \cdot x_i + b))$

$a_i \geq 0, \ i=1,2,...,n$

The original problem is: $\min \limits_{w,b} \ \max \limits_{a:a_i\geq 0} \ L(w,b,a)$

The dual problem is: $\max \limits_{a:a_i\geq 0} \ \min \limits_{w,b} \ L(w,b,a)$

We now simplify the dual problem. According to the KKT conditions we have (the relevant details of the KKT conditions can be found in the article on Lagrange duality):

$\nabla_w L(w,b,a) = w - \sum_{i=1}^{n}a_i y_i x_i = 0$

$\nabla_b L(w,b,a) = -\sum_{i=1}^{n}a_i y_i = 0$

Therefore, we obtain:

$w =\sum_{i=1}^{n}a_iy_ix_i$

$\sum_{i=1}^{n}a_iy_i=0$

Substituting these two equations into the generalized Lagrangian function and simplifying, we obtain:

$\min \limits_{w,b} \ L(w,b,a) = -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}a_i a_j y_i y_j (x_i \centerdot x_j) + \sum_{i=1}^{n}a_i$
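For completeness, here is the intermediate step (not spelled out in the original): substituting $w = \sum_{i=1}^{n}a_i y_i x_i$ and using $\sum_{i=1}^{n}a_i y_i = 0$,

$L(w,b,a) = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}a_i a_j y_i y_j (x_i \centerdot x_j) + \sum_{i=1}^{n}a_i - \sum_{i=1}^{n}\sum_{j=1}^{n}a_i a_j y_i y_j (x_i \centerdot x_j) - b\sum_{i=1}^{n}a_i y_i = -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}a_i a_j y_i y_j (x_i \centerdot x_j) + \sum_{i=1}^{n}a_i$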

Then: $\max \limits_{a} \ \min \limits_{w,b} \ L(w,b,a) = \max \limits_{a} \ -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}a_i a_j y_i y_j (x_i \centerdot x_j) + \sum_{i=1}^{n}a_i$

Therefore, the original problem is transformed into the following dual problem:

$\max \limits_{a} \ -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}a_i a_j y_i y_j (x_i \centerdot x_j) + \sum_{i=1}^{n}a_i$

$s.t. \ \ \sum_{i=1}^{n}a_i y_i = 0$

$a_i \geq 0, \ i=1,2,...,n$
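As a side note (the article itself will use SMO later), this dual is just a quadratic program in $a$, so for small problems it can be solved with a generic solver. Below is a minimal sketch using SciPy's SLSQP method on illustrative toy data, with $w$ and $b$ recovered from the KKT conditions ($w = \sum_i a_i y_i x_i$, and $b = y_k - w \cdot x_k$ for a support vector $x_k$); the function and variable names are assumptions, not part of the original:

```python
import numpy as np
from scipy.optimize import minimize

def hard_margin_dual(X, y):
    """Solve max_a sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j (x_i . x_j)
       s.t. sum_i a_i y_i = 0, a_i >= 0  (assumes linearly separable data)."""
    n = X.shape[0]
    Q = np.outer(y, y) * (X @ X.T)          # Q[i, j] = y_i y_j (x_i . x_j)

    def neg_dual(a):                        # SciPy minimizes, so negate the dual objective
        return 0.5 * a @ Q @ a - np.sum(a)

    cons = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_i a_i y_i = 0
    res = minimize(neg_dual, np.zeros(n), bounds=[(0.0, None)] * n,
                   constraints=cons, method="SLSQP")
    a = res.x
    w = (a * y) @ X                         # w = sum_i a_i y_i x_i
    k = np.argmax(a)                        # any point with a_k > 0 is a support vector
    b = y[k] - X[k] @ w                     # b from complementary slackness
    return w, b, a

# Toy usage: two separable points
X = np.array([[2.0, 2.0], [-2.0, -2.0]])
y = np.array([1.0, -1.0])
w, b, a = hard_margin_dual(X, y)            # w = [0.25, 0.25], b = 0
```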

To keep the line of thought continuous, we will not solve the dual above right away; instead, we first analyze how to transform the problem in the remaining cases.

Objective function of the linear support vector machine:

$\min \ \frac{1}{2}||w||^2 + C\sum_{i=1}^{n}\zeta_i$

$s.t. \ \ y_i(w \cdot x_i + b) \geq 1 - \zeta_i, \ \ \ i=1,2,3,...,n$

$\zeta_i \geq 0, \ \ i=1,2,3,...,n$

The generalized Lagrangian function is written first:

$L(w,b,\zeta,a,\mu) = \frac{1}{2}||w||^2 + C\sum_{i=1}^{n}\zeta_i + \sum_{i=1}^{n}a_i(1 - \zeta_i - y_i(w \cdot x_i + b)) - \sum_{i=1}^{n}\mu_i\zeta_i$

where $a_i \geq 0, \ \mu_i \geq 0$ (the $\mu_i\zeta_i$ term enters with a minus sign so that the derivative with respect to $\zeta_i$ below comes out as $C - a_i - \mu_i$).

The original problem is: $\min \limits_{w,b,\zeta} \ \max \limits_{a,\mu} \ L(w,b,\zeta,a,\mu)$

The dual problem is: $\max \limits_{a,\mu} \ \min \limits_{w,b,\zeta} \ L(w,b,\zeta,a,\mu)$

According to the KKT conditions, the dual problem is simplified:

$\nabla_w L(w,b,\zeta,a,\mu) = w - \sum_{i=1}^{n}a_i y_i x_i = 0$

$\nabla_b L(w,b,\zeta,a,\mu) = -\sum_{i=1}^{n}a_i y_i = 0$

$\nabla_{\zeta_i} L(w,b,\zeta,a,\mu) = C - a_i - \mu_i = 0$

We obtain:

$w =\sum_{i=1}^{n}a_iy_ix_i$

$\sum_{i=1}^{n}a_iy_i=0$

$C-a_i-\mu_i=0$

Therefore, $\max \limits_{a} \ \min \limits_{w,b,\zeta} \ L(w,b,\zeta,a,\mu) = \max \limits_{a} \ -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}a_i a_j y_i y_j (x_i \centerdot x_j) + \sum_{i=1}^{n}a_i$

Here $\mu_i$ can be eliminated using the equation $C - a_i - \mu_i = 0$: since $\mu_i = C - a_i$ and $\mu_i \geq 0$, we get $a_i \leq C$.

Therefore, we obtain the following dual problem:

$\max \limits_{a} \ -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}a_i a_j y_i y_j (x_i \centerdot x_j) + \sum_{i=1}^{n}a_i$

$s.t. \ \ \sum_{i=1}^{n}a_i y_i = 0$

$0 \leq a_i \leq C, \ i=1,2,...,n$
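In practice this box-constrained dual is what SVM libraries solve (typically with SMO, as mentioned above). As a cross-check that is not part of the original article, scikit-learn's `SVC` exposes the dual solution: `dual_coef_` stores the products $a_i y_i$ for the support vectors, so the KKT relation $w = \sum_i a_i y_i x_i$ can be verified numerically for a linear kernel (the data and parameters below are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative toy data: two noisy clusters
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1.0)        # soft-margin SVM, 0 <= a_i <= C
clf.fit(X, y)

# dual_coef_ holds a_i * y_i for the support vectors only
w_from_dual = clf.dual_coef_ @ clf.support_vectors_   # = sum_i a_i y_i x_i
print(np.allclose(w_from_dual, clf.coef_))            # True: matches the primal w
```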

Nonlinear support vector machines:

We already know that the biggest difference between the nonlinear and the linear support vector machine is that the nonlinear one transforms a nonlinear problem into a linear one through a mapping function. Accordingly, the dual problem differs only by that mapping function:

$\max \limits_{a} \ -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}a_i a_j y_i y_j (\phi(x_i) \centerdot \phi(x_j)) + \sum_{i=1}^{n}a_i$

$s.t. \ \ \sum_{i=1}^{n}a_i y_i = 0$

$0 \leq a_i \leq C, \ i=1,2,...,n$

However, in actual computation it turns out that $\phi(x_i) \centerdot \phi(x_j)$ is difficult to compute directly, even though this step is essential. The way around this difficulty is the kernel trick:

The idea of the kernel trick is to define only the kernel function during learning and prediction, without explicitly defining the mapping function. In other words, we only define $K(x_i,x_j) = \phi(x_i) \centerdot \phi(x_j)$.

This gives the definition of the kernel function: $K(x,z) = \phi(x) \centerdot \phi(z)$

The main reason for this definition is that computing $K(x,z)$ directly is easier, while computing $\phi(x) \centerdot \phi(z)$ by first forming $\phi(x)$ and $\phi(z)$ is harder. With the kernel function, the simplified dual above becomes:

$\max \limits_{a} \ -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}a_i a_j y_i y_j K(x_i,x_j) + \sum_{i=1}^{n}a_i$

$s.t. \ \ \sum_{i=1}^{n}a_i y_i = 0$

$0 \leq a_i \leq C, \ i=1,2,...,n$
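As a small numerical illustration of the kernel trick (my own example, not from the original derivation): on $R^2$, the degree-2 polynomial kernel $K(x,z) = (x \centerdot z)^2$ corresponds to the explicit mapping $\phi(x) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)$, and the kernel computes the feature-space inner product without ever forming $\phi$:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on R^2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def K(x, z):
    """Kernel trick: K(x, z) = (x . z)^2 = phi(x) . phi(z), computed without phi."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
print(phi(x) @ phi(z), K(x, z))   # both give (x . z)^2 = 16, up to floating-point rounding
```

For kernels such as the Gaussian (RBF) kernel, the corresponding $\phi$ is infinite-dimensional, which is exactly why computing $K$ directly is so convenient.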

One Class SVM:

Objective function (here $a$ is the center of the enclosing sphere, $r$ its radius, and the Lagrange multipliers below are denoted $b_i$):

$\min \ r^2 + C\sum_{i}\zeta_i$

$s.t. \ \ (x_i - a)^T(x_i - a) \leq r^2 + \zeta_i$

$\zeta_i \geq 0, \ \ \ i=1,2,...,n$

The generalized Lagrangian function is written first:

$L(r,a,\zeta,b,\mu) = r^2 + C\sum_{i}\zeta_i + \sum_{i=1}^{n}b_i(x_i^2 - 2a \centerdot x_i + a^2 - r^2 - \zeta_i) - \sum_{i=1}^{n}\mu_i\zeta_i$

$b_i \geq 0, \ \mu_i \geq 0, \ \ i=1,2,3,...,n$

The original problem is: $\min \limits_{r,a,\zeta} \ \max \limits_{b,\mu} \ L(r,a,\zeta,b,\mu)$

The dual problem is: $\max \limits_{b,\mu} \ \min \limits_{r,a,\zeta} \ L(r,a,\zeta,b,\mu)$

From the KKT conditions we obtain:

$\nabla_r L(r,a,\zeta,b,\mu) = 2r - \sum_{i=1}^{n}2b_i r = 0 \ \ (1)$

$\nabla_a L(r,a,\zeta,b,\mu) = \sum_{i=1}^{n}(2ab_i - 2b_i x_i) = 0 \ \ (2)$

$\nabla_{\zeta_i} L(r,a,\zeta,b,\mu) = C - b_i - \mu_i = 0 \ \ (3)$

Equation (3) is the partial derivative with respect to each $\zeta_i$.

From equation (1), we get $\sum_{i=1}^{n}b_i = 1$.

From equation (2): $\sum_{i=1}^{n}(2ab_i - 2b_i x_i) = 0$. Since $2ab_i - 2b_i x_i = 0$ cannot be guaranteed for each term, we expand the sum:

$2a\sum_{i=1}^{n}b_i-2\sum_{i=1}^{n}b_ix_i=0$

Therefore, $a = \frac{\sum_{i=1}^{n}b_i x_i}{\sum_{i=1}^{n}b_i} = \sum_{i=1}^{n}b_i x_i$, where the denominator equals 1 by the result of equation (1).

From equation (3): $C - b_i - \mu_i = 0, \ \ i=1,2,...,n$

Eliminating $\mu_i$ from equation (3) (since $\mu_i = C - b_i \geq 0$) gives $0 \leq b_i \leq C$.

Substituting (1) and (3) into the Lagrangian function eliminates exactly the terms containing $r$ and $\zeta$:

$L(r,a,\zeta,b,\mu) = \sum_{i=1}^{n}b_i(x_i^2 - 2a \centerdot x_i + a^2)$

Substituting the result of equation (2), $a = \sum_{i=1}^{n}b_i x_i$ (so that $\sum_{i=1}^{n}b_i(x_i^2 - 2a \centerdot x_i + a^2) = \sum_{i=1}^{n}b_i(x_i \centerdot x_i) - a \centerdot a$), we obtain:

$\min \limits_{r,a,\zeta} \ L(r,a,\zeta,b,\mu) = \sum_{i=1}^{n}b_i(x_i \centerdot x_i) - \sum_{i=1}^{n}\sum_{j=1}^{n}b_i b_j(x_i \centerdot x_j)$

Using the kernel trick, this can also be written as:

$\min \limits_{r,a,\zeta} \ L(r,a,\zeta,b,\mu) = \sum_{i=1}^{n}b_i K(x_i,x_i) - \sum_{i=1}^{n}\sum_{j=1}^{n}b_i b_j K(x_i,x_j)$

Therefore, the dual of the original problem can be written in the following form:

$\max \limits_{b} \ \sum_{i=1}^{n}b_i K(x_i,x_i) - \sum_{i=1}^{n}\sum_{j=1}^{n}b_i b_j K(x_i,x_j)$

$s.t. \ \ \sum_{i=1}^{n}b_i = 1$

$0 \leq b_i \leq C, \ i=1,2,...,n$
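This dual is again a small quadratic program, so, as with the earlier cases, it can be solved with a generic solver for illustration. The sketch below uses SciPy's SLSQP and a Gaussian (RBF) kernel; the kernel choice, the names, and the parameters `C` and `gamma` are my own assumptions (note that feasibility requires $C \geq 1/n$, since $\sum_i b_i = 1$ with $b_i \leq C$):

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(X, gamma=0.5):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

def svdd_dual(X, C=0.5, gamma=0.5):
    """Maximize sum_i b_i K(x_i, x_i) - sum_ij b_i b_j K(x_i, x_j)
       s.t. sum_i b_i = 1, 0 <= b_i <= C."""
    n = X.shape[0]
    K = rbf_kernel(X, gamma)
    diagK = np.diag(K)

    def neg_obj(beta):                      # SciPy minimizes, so negate the objective
        return -(beta @ diagK - beta @ K @ beta)

    cons = [{"type": "eq", "fun": lambda beta: np.sum(beta) - 1.0}]
    res = minimize(neg_obj, np.full(n, 1.0 / n),   # uniform start is feasible
                   bounds=[(0.0, C)] * n, constraints=cons, method="SLSQP")
    return res.x
```

The center is then $a = \sum_i b_i \phi(x_i)$ in feature space, and any point with $0 < b_i < C$ lies on the boundary of the sphere, which gives the radius $r$.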

Summarizing the four cases above, the dual in every case eventually takes a form similar to the following (using the nonlinear support vector machine as an example):

$\max \limits_{a} \ L(a,x,y)$

$s.t. \ \ \sum_{i=1}^{n}a_i y_i = k$

$0 \leq a_i \leq C, \ i=1,2,...,n$

Here, the first line is the objective function, which is generally maximized or minimized, and there is only one controllable variable (such as $a$);

The second line is an equality constraint: a sum over $i = 1$ to $n$ that equals a constant.

The third line restricts the range of the variable (such as $a$).

Since the forms are similar, these problems can generally be solved in the same way.

If the solution of this problem were continued in this article, it would become too long and harder to read, so the final step of the solution will be described in the next article. Please look forward to it.
