Extremely Low Bit Neural Network: Squeeze the Last Bit Out with ADMM


Abstract

With the development of deep learning, model compression has been attracting more and more attention. Following BWN and TWN, this paper is a new line of thought in the field of ultra-low bit quantization, published at AAAI 2018 by authors from Alibaba.
The main idea of the paper is to model ultra-low bit quantization as a discretely constrained optimization problem. With the idea of ADMM (Alternating Direction Method of Multipliers), the continuous parameters are decoupled from the discrete constraints of the network, and the original problem is split into several subproblems. For these subproblems the paper adopts the extragradient method and an iterative quantization algorithm, which converge faster than traditional algorithms. Experiments outperform state-of-the-art methods on both image classification and object detection (SSD).

What problem does it solve, and why was this paper written?
  • Previous work quantizes pretrained weights to 4-12 bits with very good results, but when only 1 or 2 bits are used to represent weights, performance is good only on small datasets (MNIST, CIFAR-10); on large datasets the accuracy loss is usually very large.
  • This paper presents a different strategy, treating ultra-low bit quantization as a discretely constrained non-convex optimization problem (a mixed integer program, MIP). With the help of the ADMM algorithm, the main idea is to decouple the continuous variables from the discrete constraints via auxiliary variables living in the discrete space. Unlike previous quantization methods, which modify gradients of specific continuous parameters, this paper optimizes in both the continuous and the discrete space, and uses an augmented Lagrangian to connect the solutions of the two spaces.

    ADMM algorithm

    This part mainly follows this blog: http://mullover.me/2016/01/19/admm-for-distributed-statistical-learning/
    If the optimization problem can be expressed as
    $$\min\quad f(x)+g(z) \quad s.t.\quad Ax+Bz=c \quad (1)$$
    where $x\in \mathbb{R}^n$, $z\in \mathbb{R}^m$, $A\in \mathbb{R}^{p \times n}$, $B\in \mathbb{R}^{p \times m}$ and $c\in \mathbb{R}^p$. Here $x$ and $z$ are the optimization variables, $f(x)+g(z)$ is the objective function, and $Ax+Bz=c$ is an equality constraint.
    The augmented Lagrangian of problem $(1)$ can be expressed as:
    $$L_\rho(x,z,y)=f(x)+g(z)+y^T(Ax+Bz-c)+(\rho/2)\Vert Ax+Bz-c\Vert^2_2 \quad (2)$$
    where $y$ is the Lagrange multiplier and $\rho>0$ is the penalty parameter. The name "augmented" refers to the added quadratic penalty term.
    The ADMM consists of three-step iterations:
    $$x^{k+1}:=\arg\min\limits_x L_\rho(x,z^k,y^k)$$
    $$z^{k+1}:=\arg\min\limits_z L_\rho(x^{k+1},z,y^k)$$
    $$y^{k+1}:=y^k+\rho(Ax^{k+1}+Bz^{k+1}-c)$$
    As you can see, each iteration consists of three steps (a minimal numerical sketch follows this list):
    1. Solve the minimization problem over $x$ and update $x$
    2. Solve the minimization problem over $z$ and update $z$
    3. Update the dual variable $y$
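    The following is a minimal NumPy sketch of this generic three-step iteration, applied to a toy problem of my own choosing (not from the paper): $f(x)=\frac{1}{2}\Vert x-a\Vert^2$, $g(z)=\lambda\Vert z\Vert_1$, with the constraint $x-z=0$ (i.e. $A=I$, $B=-I$, $c=0$), so that both subproblems have closed-form solutions.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1; closed-form z-update for this toy g(z)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_toy(a, lam=0.5, rho=1.0, iters=100):
    x = np.zeros_like(a)
    z = np.zeros_like(a)
    y = np.zeros_like(a)              # dual variable (Lagrange multiplier)
    for _ in range(iters):
        # 1) x-step: argmin_x 0.5*||x-a||^2 + y^T(x-z) + (rho/2)*||x-z||^2
        x = (a + rho * z - y) / (1.0 + rho)
        # 2) z-step: argmin_z lam*||z||_1 + y^T(x-z) + (rho/2)*||x-z||^2
        z = soft_threshold(x + y / rho, lam / rho)
        # 3) dual update: y <- y + rho*(Ax + Bz - c) = y + rho*(x - z)
        y = y + rho * (x - z)
    return x, z

x_opt, z_opt = admm_toy(np.array([3.0, -0.2, 1.5]))
print(z_opt)    # the l1 term drives small entries of z to exactly zero
```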

    Objective function

    Let $f(W)$ denote the loss function of a neural network, where $W=\{W_1,W_2,\ldots,W_L\}$ and $W_i$ represents the weights of the $i$-th layer. $W_i$ is a $d_i$-dimensional vector in $\mathbb{R}^{d_i}$, and $W\in \mathbb{R}^d$ with $d=\sum_i d_i$. Because this paper trains ultra-low bit quantized networks, the weights are strictly restricted to zero or powers of two, so that floating-point multiplications can be replaced by faster bit-shift operations. Suppose we train a ternary network with weights in $\{-1,0,+1\}$; training such a network can be written mathematically as a mixed integer program (MIP):
    $$\min\limits_W f(W) \quad s.t.\quad W\in \mathcal{C}=\{-1,0,+1\}^d$$
    Since the weights are strictly restricted to zero or powers of two, $\mathcal{C}=\{-2^N,\ldots,-2^1,-2^0,0,+2^0,+2^1,\ldots,+2^N\}$, where $N$ is determined by the number of bits. A scaling factor $\alpha>0$ is introduced into $\mathcal{C}$: $\mathcal{C}=\{\ldots,-2\alpha,-\alpha,0,+\alpha,+2\alpha,\ldots\}$. Note that $\alpha$ differs from layer to layer; in other words, for an $L$-layer network, $L$ different scaling factors $\{\alpha_1,\ldots,\alpha_L\}$ are actually introduced. The objective function of this low-bit quantized network can then be expressed as:
    $$\min\limits_W f(W) \quad s.t.\quad W\in \mathcal{C}=\mathcal{C}_1\times \mathcal{C}_2 \times \cdots \times \mathcal{C}_L \quad (3)$$
    where $\mathcal{C}_i=\{0,\pm\alpha_i,\pm2\alpha_i,\ldots,\pm2^N\alpha_i\}$ and $\alpha_i>0$. In fact the per-layer $\alpha_i$ adds no extra computational cost, since the factor can be multiplied in at the very end. What $\alpha_i$ does is enlarge the constraint space. For example, in two-dimensional space a ternary network restricted to $\{-1,0,+1\}$ has only nine possible discrete solutions; once the $\alpha$ factor is added, the constraint space expands to four lines through the origin. This large expansion of the constraint space makes the optimization much easier. (A small helper that builds the per-layer constraint set is sketched below.)
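    Below is a small helper, purely my own illustration (not code from the paper), that builds the per-layer constraint set $\mathcal{C}_i=\{0,\pm\alpha_i,\pm2\alpha_i,\ldots,\pm2^N\alpha_i\}$ from Eq. (3) for a given scaling factor $\alpha_i$ and bit level $N$.

```python
import numpy as np

def build_codebook(alpha_i, N):
    """Return the allowed weight values {0, ±alpha_i*2^k, k=0..N} as a sorted array."""
    powers = alpha_i * (2.0 ** np.arange(0, N + 1))   # alpha, 2*alpha, ..., 2^N*alpha
    return np.sort(np.concatenate([-powers, [0.0], powers]))

print(build_codebook(alpha_i=0.05, N=2))
# -> [-0.2 -0.1 -0.05 0. 0.05 0.1 0.2]
```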

    Decouple with ADMM

    Formula (3) is in fact an NP-hard problem, because the weights are restricted to a discrete space. The trick is to introduce an auxiliary variable that is constrained to the discrete space and forced to be equal to the original variable. First, define an indicator function $I_\mathcal{C}$ of the constraint $W\in \mathcal{C}$; the objective (3) then becomes
    $$\min\limits_W\quad f(W)+I_\mathcal{C}(W) \quad (4)$$
    where $I_\mathcal{C}(W)=0$ if $W\in \mathcal{C}$, and $I_\mathcal{C}(W)=+\infty$ otherwise.
    Introducing the auxiliary variable $G$, formula (4) is rewritten as
    $$\min\limits_{W,G} \quad f(W)+I_\mathcal{C}(G) \quad s.t.\quad W=G \quad (5)$$
    The augmented Lagrangian of formula (5) is:
    $$L_\rho(W,G,\mu)=f(W)+I_\mathcal{C}(G)+(\rho/2)\Vert W-G\Vert^2+\langle\mu,W-G\rangle \quad (6)$$
    With $\lambda=(1/\rho)\mu$, formula (6) can be rewritten as:
    $$L_\rho(W,G,\lambda)=f(W)+I_\mathcal{C}(G)+(\rho/2)\Vert W-G+\lambda\Vert^2-(\rho/2)\Vert\lambda\Vert^2 \quad (7)$$
    According to ADMM, the iteration steps for solving this problem are:
    $$W^{k+1}:=\arg\min\limits_W L_\rho(W,G^k,\lambda^k)$$
    $$G^{k+1}:=\arg\min\limits_G L_\rho(W^{k+1},G,\lambda^k)$$
    $$\lambda^{k+1}:=\lambda^k+W^{k+1}-G^{k+1}$$
    These three steps are the proximal step, the projection step, and the dual update. Unlike previous work, this is an optimization over both the continuous space and the discrete space, and the solutions in the two spaces are connected by the ADMM algorithm throughout learning. (A skeleton of the overall training loop is sketched below.)
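    The following is a minimal skeleton of that training loop, written as my own sketch rather than the authors' code; `loss_grad` and `project_to_codebook` are hypothetical callables standing in for the network's backward pass and for the projection step detailed below.

```python
import numpy as np

def admm_quantize_train(W, loss_grad, project_to_codebook, rho=1e-3,
                        outer_iters=50, inner_iters=10, lr=0.01):
    """W: list of per-layer weight arrays; loss_grad(W) -> list of dL/dW_i."""
    G = [w.copy() for w in W]                 # auxiliary (quantized) variables
    lam = [np.zeros_like(w) for w in W]       # scaled dual variables
    for _ in range(outer_iters):
        # Proximal step: minimize f(W) + (rho/2)*||W - G + lam||^2 over W
        # (plain gradient steps here; the paper uses extragradient, see below).
        for _ in range(inner_iters):
            grads = loss_grad(W)
            for i in range(len(W)):
                W[i] -= lr * (grads[i] + rho * (W[i] - G[i] + lam[i]))
        # Projection step: project (W + lam) layer-wise onto the discrete set C_i.
        for i in range(len(W)):
            G[i] = project_to_codebook(W[i] + lam[i])
        # Dual update: lam <- lam + W - G.
        for i in range(len(W)):
            lam[i] += W[i] - G[i]
    return W, G      # G holds the final quantized weights
```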

    Algorithm subroutines
  • 1. Proximal Step
    For this step, the optimization is carried out in the continuous space. (The original post asks why this is a continuous-space problem when $W$ and $G$ look discrete; the reason is that only the auxiliary variable $G$ is constrained to the discrete set, while $W$ itself remains continuous.) The following needs to be minimized:
    $$L_\rho(W,G^k,\lambda^k)=f(W)+(\rho/2)\Vert W-G^k+\lambda^k\Vert^2$$
    With standard gradient descent, the gradient with respect to $W$ is:
    $$\partial_W L=\partial_W f+\rho(W-G^k+\lambda^k)$$
    However, vanilla gradient descent is found to converge very slowly here: since the quadratic penalty term occupies a large proportion of the whole loss, SGD quickly pulls the optimizer back to the currently quantized weights so that this second term vanishes, and it gets stuck at that point. This leads to a suboptimal solution, because the loss function of the network itself is not sufficiently optimized.
    To overcome this difficulty, the paper adopts the extragradient method. A single iteration of the extragradient method consists of two simple steps, prediction and correction:
    $$W^{(p)}:=W-\beta_p\partial_W L(W)$$
    $$W^{(c)}:=W-\beta_c\partial_W L(W^{(p)})$$
    where $\beta_p$ and $\beta_c$ are learning rates. The salient feature of the extragradient method is the additional gradient step, which acts as a guide during optimization. This extra step takes curvature information into account and yields better convergence than standard gradient descent. Intuitively, in the prediction step the algorithm quickly moves to a point near $G^k-\lambda^k$ so that the quadratic term vanishes, and in the correction step it minimizes the loss function $f(W)$. Together, the two steps avoid getting trapped at the quantized point. In practice, this greatly accelerates convergence. (A sketch of one extragradient update follows.)
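    A minimal sketch of one extragradient update for the proximal step, assuming a callable `grad_f` that returns the gradient of the network loss $f$ at $W$ (my own illustration, for a single layer):

```python
import numpy as np

def extragradient_step(W, G_k, lam_k, grad_f, rho, beta_p=0.01, beta_c=0.01):
    """One prediction/correction update of the continuous weights W."""
    def grad_L(w):
        # dL/dW = df/dW + rho * (W - G^k + lambda^k)
        return grad_f(w) + rho * (w - G_k + lam_k)
    W_p = W - beta_p * grad_L(W)       # prediction step
    W_c = W - beta_c * grad_L(W_p)     # correction: gradient evaluated at W_p
    return W_c
```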
    2. Projection Step
    For the auxiliary variable $G$, all $G_i$ are decoupled, so the $G_i$ of each layer can be optimized independently. We look for the Euclidean projection that maps $(W_i^{k+1}+\lambda_i^k)$ onto the discrete set $\mathcal{C}_i$. Writing $V_i$ for $(W_i^{k+1}+\lambda_i^k)$, this projection can be expressed as:
    $$\min\limits_{G_i,\alpha_i}\quad \Vert V_i-G_i\Vert^2$$
    $$s.t.\quad G_i\in\{0,\pm\alpha_i,\pm2\alpha_i,\ldots,\pm2^N\alpha_i\}^{d_i}$$
    Pulling the scaling factor out, this becomes:
    $$\min\limits_{Q_i,\alpha_i}\quad \Vert V_i-\alpha_i \cdot Q_i\Vert^2$$
    $$s.t.\quad Q_i\in\{0,\pm1,\pm2,\ldots,\pm2^N\}^{d_i}$$
    The paper proposes an iterative quantization method that alternately optimizes $\alpha_i$ and $Q_i$, fixing one while updating the other. With $Q_i$ fixed, the problem becomes a single-variable optimization with the closed-form solution $\alpha_i={V_i^TQ_i \over Q_i^TQ_i}$. With $\alpha_i$ fixed, $Q_i$ is simply the projection of ${V_i \over \alpha_i}$ onto $\{0,\pm1,\pm2,\ldots,\pm2^N\}$, i.e. $Q_i=\Pi_{\{0,\pm1,\pm2,\ldots,\pm2^N\}}\left({V_i \over \alpha_i}\right)$. (A sketch of this alternating procedure follows.)
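    A minimal sketch of this alternating (iterative) quantization, again my own illustration rather than the authors' code; the initialization of $\alpha_i$ is an arbitrary choice of mine:

```python
import numpy as np

def project_to_levels(x, N):
    """Round each entry of x to the nearest value in {0, ±1, ±2, ..., ±2^N}."""
    pos = np.concatenate([[0.0], 2.0 ** np.arange(0, N + 1)])
    levels = np.concatenate([-pos[1:][::-1], pos])        # symmetric level set
    idx = np.argmin(np.abs(x[..., None] - levels), axis=-1)
    return levels[idx]

def iterative_quantization(V, N=2, iters=10):
    """Alternately solve for alpha (closed form) and Q (projection)."""
    alpha = np.abs(V).mean() + 1e-12          # simple initialization (my choice)
    Q = project_to_levels(V / alpha, N)
    for _ in range(iters):
        # alpha-step: alpha = V^T Q / Q^T Q (guard against an all-zero Q)
        denom = np.dot(Q.ravel(), Q.ravel())
        if denom > 0:
            alpha = np.dot(V.ravel(), Q.ravel()) / denom
        # Q-step: project V / alpha onto {0, ±1, ±2, ..., ±2^N}
        Q = project_to_levels(V / alpha, N)
    return alpha, Q                            # the quantized layer is alpha * Q

alpha, Q = iterative_quantization(np.random.randn(256) * 0.05)
```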

    Experiments

    1. Image Classification

    [The classification results tables from the paper are not reproduced here.]
    From the results, the ultra-low bit quantization of this paper stands out on every network tested.
    One point worth attention is why GoogLeNet degrades more. The paper's speculation, supported by experiments, is that quantizing the $1\times1$ convolution kernels with such a strong regularizer causes the network to underfit. Quantizing different parts of the network with different numbers of bits would be more effective.

    2. Object Detection

    [The detection results tables from the paper are not reproduced here.] As stated in the abstract, the quantized models also outperform previous methods on SSD-based object detection.