This paper presents a novel strategy that formulates ultra-low-bit weight quantization as a discretely constrained non-convex optimization problem (a mixed integer program, MIP). Using the ADMM algorithm, the main idea is to decouple the continuous variables from the discrete constraints by introducing auxiliary variables that live in the discrete space. Unlike previous quantization methods, which modify the gradients of continuous parameters directly, this paper optimizes in both the continuous and the discrete space, and uses the augmented Lagrangian function to connect the solutions of the two spaces.
ADMM algorithm
Main reference: this blog post http://mullover.me/2016/01/19/admm-for-distributed-statistical-learning/
If the optimization problem can be expressed as
$$\min\quad f(x)+g(z) \quad s.t.\quad Ax+Bz=c \quad (1)$$
where $x\in \mathbb{R}^n$, $z\in \mathbb{R}^m$, $A\in \mathbb{R}^{p \times n}$, $B\in \mathbb{R}^{p \times m}$, and $c\in \mathbb{R}^p$. $x$ and $z$ are the optimization variables, $f(x)+g(z)$ is the objective function, and $Ax+Bz=c$ is an equality constraint.
The augmented Lagrangian of equation (1) can be expressed as:
$$L_\rho(x,z,y)=f(x)+g(z)+y^T(Ax+Bz-c)+(\rho/2)\Vert Ax+Bz-c \Vert^2_2 \quad (2)$$
where $y$ is the Lagrange multiplier and $\rho>0$ is the penalty parameter. It is called "augmented" because of the added quadratic penalty term.
ADMM consists of the following three-step iteration:
$$x^{k+1}:=\arg\min\limits_x L_\rho(x,z^k,y^k)$$
$$z^{k+1}:=\arg\min\limits_z L_\rho(x^{k+1},z,y^k)$$
$$y^{k+1}:=y^k+\rho(Ax^{k+1}+Bz^{k+1}-c)$$
As you can see, each iteration is divided into three steps (a code sketch follows the list):
1. Solve the minimization problem related to $x$, update the variable $x$
2. Solve the minimization problem related to $z$, update the variable $z$
3. Update the dual variable $y$
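As a concrete illustration of these three steps, here is a minimal numpy sketch on a toy problem of my own choosing (an L1-regularized least-squares split with $A=I$, $B=-I$, $c=0$, not from the paper), where both subproblems have closed-form solutions:

```python
import numpy as np

def admm_toy(a, lam=0.5, rho=1.0, iters=100):
    """ADMM for min_{x,z} (1/2)||x - a||^2 + lam*||z||_1  s.t.  x - z = 0."""
    x = np.zeros_like(a)
    z = np.zeros_like(a)
    y = np.zeros_like(a)                       # dual variable (Lagrange multiplier)
    for _ in range(iters):
        # 1) x-step: argmin_x (1/2)||x-a||^2 + y^T(x-z) + (rho/2)||x-z||^2
        x = (a - y + rho * z) / (1.0 + rho)
        # 2) z-step: soft-thresholding, the proximal operator of lam*||.||_1
        v = x + y / rho
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
        # 3) dual update: y <- y + rho*(x - z)
        y = y + rho * (x - z)
    return x, z

print(admm_toy(np.array([2.0, -0.3, 1.2])))
```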
Objective function
Let $f(W)$ denote the loss function of the neural network, where $W=\{W_1,W_2,\dots,W_L\}$ and $W_i$ is the weight of the $i$-th layer, viewed as a $d_i$-dimensional vector in $\mathbb{R}^{d_i}$, so that $W\in\mathbb{R}^d$ with $d=\sum_i d_i$. Because this paper trains networks with ultra-low-bit quantization, the weights are strictly restricted to zero or powers of two, so that floating-point multiplications can be replaced by faster bit-shift operations. Suppose we train a ternary network with weights in $\{-1,0,+1\}$; training such a network can then be written mathematically as the MIP:
$$\min\limits_W \; f(W) \quad s.t. \quad W\in \mathcal{C}=\{-1,0,+1\}^d$$
Since the weights are strictly restricted to zero or powers of two, $\mathcal{C}=\{-2^N,\dots,-2^1,-2^0,0,+2^0,+2^1,\dots,+2^N\}$, where $N$ is the number of bits. A scaling factor $\alpha>0$ is introduced into $\mathcal{C}$, giving $\mathcal{C}=\{\dots,-2\alpha,-\alpha,0,+\alpha,+2\alpha,\dots\}$. Note that different layers use different scaling factors; in other words, for an $L$-layer network, $L$ different scaling factors $\{\alpha_1,\dots,\alpha_L\}$ are actually introduced. The objective function of this low-bit quantized network can be expressed as:
$$\min\limits_W \; f(W) \quad s.t. \quad W\in \mathcal{C}=\mathcal{C}_1\times \mathcal{C}_2 \times \dots \times \mathcal{C}_L \quad (3)$$
$\mathcal{C}_i=\{0,\pm\alpha_i,\pm2\alpha_i,\dots,\pm2^N\alpha_i\}$ with $\alpha_i>0$. In practice, $\alpha_i$ adds essentially no extra computation, since the factor can simply be multiplied in at the end of each layer. What $\alpha_i$ does is enlarge the constraint space. For example, in two-dimensional space a ternary network restricted to $\{-1,0,+1\}$ has only nine possible discrete points as solutions; once the factor $\alpha$ is added, the constraint space expands to four lines through the origin. This large expansion of the constraint space makes the optimization much easier.
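For concreteness, a small numpy sketch that enumerates the per-layer constraint set $\mathcal{C}_i=\{0,\pm\alpha_i,\pm2\alpha_i,\dots,\pm2^N\alpha_i\}$; the bit width and the example $\alpha_i$ values are arbitrary assumptions:

```python
import numpy as np

def constraint_set(alpha, n_bits):
    """Enumerate C_i = {0, ±alpha, ±2*alpha, ..., ±2^N * alpha}."""
    pos = alpha * 2.0 ** np.arange(0, n_bits + 1)    # alpha, 2*alpha, ..., 2^N*alpha
    return np.sort(np.concatenate(([0.0], pos, -pos)))

# e.g. a 3-layer network, one scaling factor per layer (values made up)
for i, alpha_i in enumerate([0.05, 0.12, 0.30], start=1):
    print(f"C_{i} =", constraint_set(alpha_i, n_bits=2))
```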
Decouple with ADMM
Formula (3) is actually NP-hard because the weights are restricted to a discrete space. The idea is to introduce an auxiliary variable that is constrained to the discrete space and required to equal the original variable. First, define the indicator function $I_{\mathcal{C}}$ of the set $\mathcal{C}$; the objective (3) then becomes
$$\min\limits_W\quad f(W)+I_{\mathcal{C}}(W) \quad (4)$$
where $I_{\mathcal{C}}(W)=0$ if $W\in \mathcal{C}$, and $I_{\mathcal{C}}(W)=+\infty$ otherwise.
Introducing the auxiliary variable $G$, formula (4) is rewritten as
$$\min\limits_{W,G} \quad f(W)+I_{\mathcal{C}}(G) \quad s.t. \quad W=G \quad (5)$$
The augmented Lagrangian of formula (5) is:
$$L_\rho(W,G,\mu)=f(W)+I_{\mathcal{C}}(G)+(\rho/2)\Vert W-G \Vert^2+\langle\mu,W-G\rangle \quad (6)$$
Letting $\lambda=(1/\rho)\mu$, formula (6) can be converted to:
$$L_\rho(W,G,\lambda)=f(W)+I_{\mathcal{C}}(G)+(\rho/2)\Vert W-G+\lambda\Vert^2-(\rho/2)\Vert\lambda\Vert^2 \quad (7)$$
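The step from (6) to (7) is just completing the square in $W-G$, using $\mu=\rho\lambda$:
$$(\rho/2)\Vert W-G\Vert^2+\langle\mu,W-G\rangle=(\rho/2)\left(\Vert W-G\Vert^2+2\langle\lambda,W-G\rangle+\Vert\lambda\Vert^2\right)-(\rho/2)\Vert\lambda\Vert^2=(\rho/2)\Vert W-G+\lambda\Vert^2-(\rho/2)\Vert\lambda\Vert^2$$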
According to ADMM, the iterative steps for solving this problem are:
$$W^{k+1}:=\arg\min\limits_W L_{\rho}(W,G^k,\lambda^k)$$
$$G^{k+1}:=\arg\min\limits_G L_{\rho}(W^{k+1},G,\lambda^k)$$
$$\lambda^{k+1}:=\lambda^k+W^{k+1}-G^{k+1}$$
These three steps are the proximal step, the projection step, and the dual update. Unlike previous work, this is an optimization problem over both a continuous space and a discrete space, and the solutions in the two spaces are connected by the ADMM algorithm during learning.
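To see how the three steps fit together during training, here is a schematic numpy-style sketch under my own reading of the paper (not the authors' code); `loss_grad` and `project_to_C` are hypothetical stand-ins for the network gradient and the projection subroutine described below, and all step sizes and iteration counts are arbitrary:

```python
import numpy as np

def admm_quantize_train(W, loss_grad, project_to_C, rho=1e-3, outer_iters=50):
    """Schematic ADMM loop for ultra-low-bit quantization (illustrative only).
    W            : flat weight vector of the network
    loss_grad(W) : gradient of the network loss f at W (hypothetical helper)
    project_to_C : Euclidean projection onto the discrete set C (hypothetical helper)"""
    G = project_to_C(W)                # auxiliary variable living in C
    lam = np.zeros_like(W)             # scaled dual variable
    for _ in range(outer_iters):
        # proximal step: minimize f(W) + (rho/2)*||W - G + lam||^2 over W
        # (plain gradient steps here for brevity; the paper uses extragradient)
        for _ in range(10):
            W = W - 0.01 * (loss_grad(W) + rho * (W - G + lam))
        # projection step: G <- proj_C(W + lam)
        G = project_to_C(W + lam)
        # dual update: lam <- lam + W - G
        lam = lam + W - G
    return G                           # final quantized weights
```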
Algorithm subroutines
1. Proximal Step
For this step, the optimization is carried out in continuous space (?? why continuous space — aren't $W$ and $G$ in the discrete space? I can't quite figure this out), and the following needs to be minimized:
$$L_{\rho}(W,G^k,\lambda^k)=f(W)+(\rho/2)\Vert W-G^k+\lambda^k\Vert^2$$
Its gradient with respect to $W$ is obtained directly, so standard gradient descent can be applied:
$$\partial_W L=\partial_W f+\rho(W-G^k+\lambda^k)$$
However, vanilla gradient descent is found to converge very slowly here: since the quadratic term accounts for a large proportion of the overall loss, SGD quickly pulls the optimizer back to the currently quantized weights so that the quadratic term vanishes, and it then gets stuck at that point (I don't quite understand this part). This yields a suboptimal solution, because the loss function of the network itself is not fully optimized.
To overcome this difficulty, the paper adopts the extragradient method. A single iteration of the extragradient method consists of two simple steps, prediction and correction:
$$W^{(p)}:=W-\beta_p\partial_W L(W)$$
$$W^{(c)}:=W-\beta_c\partial_W L(W^{(p)})$$
where $\beta_p$ and $\beta_c$ are learning rates. The salient feature of the extragradient method is the additional gradient step, which serves as a guide during optimization. This extra step implicitly takes curvature information into account and yields better convergence than standard gradient descent. In the prediction step, the algorithm moves quickly to a point near $G^k-\lambda^k$ so that the quadratic term vanishes; in the correction step, the algorithm minimizes the loss function $f(W)$ itself. Together, these two steps avoid falling into the local minimum described above. In practice, this method greatly accelerates convergence.
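A minimal sketch of one extragradient iteration on $L_\rho(W,G^k,\lambda^k)$, assuming the same hypothetical `loss_grad` helper as above (the learning rates are arbitrary examples):

```python
import numpy as np

def extragradient_step(W, loss_grad, G, lam, rho, beta_p=0.01, beta_c=0.01):
    """One prediction/correction iteration on
    L_rho(W, G, lam) = f(W) + (rho/2)*||W - G + lam||^2 (illustrative sketch)."""
    def grad_L(w):
        # gradient of the proximal objective: d_w f + rho*(w - G + lam)
        return loss_grad(w) + rho * (w - G + lam)
    W_p = W - beta_p * grad_L(W)      # prediction: take an extra gradient step
    W_c = W - beta_c * grad_L(W_p)    # correction: use the gradient evaluated at W_p
    return W_c
```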
2. Projection Step
For the auxiliary variable $G$, all $G_i$ are decoupled, so the auxiliary variable $G_i$ of each layer can be optimized independently. We look for the Euclidean projection that maps $(W_i^{k+1}+\lambda_i^k)$ onto the discrete set $\mathcal{C}_i$. Writing $v_i$ for $(W_i^{k+1}+\lambda_i^k)$, the projection can be expressed as:
$$\min\limits_{G_i,\alpha_i}\quad \Vert v_i-G_i\Vert^2$$
$$s.t.\quad G_i\in\{0,\pm\alpha_i,\pm2\alpha_i,\dots,\pm2^N\alpha_i\}^{d_i}$$
Pulling the scaling factor out, this becomes:
$$\min\limits_{Q_i,\alpha_i}\quad \Vert v_i-\alpha_i \cdot Q_i\Vert^2$$
$$s.t.\quad Q_i\in\{0,\pm1,\pm2,\dots,\pm2^N\}^{d_i}$$
The paper proposes an iterative quantization method that alternately optimizes $\alpha_i$ and $Q_i$, fixing one while updating the other. With $Q_i$ fixed, the problem becomes a single-variable quadratic with the closed-form solution $\alpha_i={v_i^T Q_i \over Q_i^T Q_i}$. With $\alpha_i$ fixed, $Q_i$ is simply the projection of ${v_i \over \alpha_i}$ onto $\{0,\pm1,\pm2,\dots,\pm2^N\}$, i.e. $Q_i=\Pi_{\{0,\pm1,\pm2,\dots,\pm2^N\}}\left({v_i \over \alpha_i}\right)$.
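A numpy sketch of this alternating projection for one layer, following the two update formulas above (the initialization of $\alpha_i$ and the iteration count are my own assumptions):

```python
import numpy as np

def project_layer(v, n_bits=2, iters=10):
    """Alternately optimize alpha and Q to minimize ||v - alpha*Q||^2 with
    Q in {0, +/-1, +/-2, ..., +/-2^N}^d (illustrative sketch)."""
    pos = 2.0 ** np.arange(0, n_bits + 1)            # 1, 2, ..., 2^N
    levels = np.concatenate(([0.0], pos, -pos))      # {0, ±1, ±2, ..., ±2^N}
    alpha = np.abs(v).mean() + 1e-12                 # assumed initialization
    for _ in range(iters):
        # Q-step: project v/alpha elementwise onto the discrete levels
        Q = levels[np.argmin(np.abs(v[:, None] / alpha - levels[None, :]), axis=1)]
        # alpha-step: closed form alpha = v^T Q / Q^T Q
        if Q @ Q == 0:
            break
        alpha = (v @ Q) / (Q @ Q)
    return alpha * Q                                 # the projected g_i

# example: project a small random vector
g = project_layer(np.random.randn(8) * 0.1)
```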
Experiment
1. Image Classification
It can be seen that, for every network tested, the ultra-low-bit quantization of this paper gives the most prominent results.
What deserves attention is why GoogLeNet degrades more: the paper's speculation, supported by experiments, is that imposing such a strong regularizer on the $1\times1$ convolution kernels causes underfitting. It is more effective to quantize different parts of the network with different numbers of bits.
2. Object Detection