Stochastic Optimization Techniques
Neural networks are often trained stochastically, i.e. using a method where the objective function changes at each iteration. This stochastic variation is due to the model being trained on different data during each iteration. This is motivated by (at least) two factors: First, the dataset used as training data is often too large to fit in memory and/or be optimized over efficiently. Second, the objective function is typically nonconvex, so using different data at each iteration can help prevent the model from settling in a local minimum. Furthermore, training neural networks is usually done using only the first-order gradient of the loss function with respect to the parameters. This is due to the large number of parameters present in a neural network, which for practical purposes prevents the computation of the Hessian matrix. Because vanilla gradient descent can diverge or converge incredibly slowly if its learning rate hyperparameter is set inappropriately, many alternative methods have been proposed which are intended to produce desirable convergence with less dependence on hyperparameter settings. These methods often effectively compute and utilize a preconditioner on the gradient, adaptively change the learning rate over time, or approximate the Hessian matrix.
In the following, we'll use $\theta_t$ to denote some generic parameter of the model at iteration $t$, to be optimized according to some loss function $\mathcal{L}$ which is to be minimized.
Stochastic Gradient Descent
Stochastic gradient descent (SGD) simply updates each parameter by subtracting the gradient of the loss with respect to the parameter, scaled by the learning rate $\eta$, a hyperparameter. If $\eta$ is too large, SGD will diverge; if it's too small, it will converge slowly. The update rule is simply $$ \theta_{t + 1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t) $$
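To make the effect of the learning rate concrete, here is a minimal sketch of the SGD update in plain Python. The toy quadratic loss, the exact (non-stochastic) gradient standing in for a minibatch gradient, and the names `grad`, `sgd_step` and `run_sgd` are all assumptions made for illustration.

```python
def grad(theta):
    """Gradient of the toy loss 0.5 * (theta[0]**2 + 10 * theta[1]**2)."""
    return [theta[0], 10.0 * theta[1]]


def sgd_step(theta, g, eta):
    """theta_{t+1} = theta_t - eta * gradient, applied to every parameter."""
    return [p - eta * gi for p, gi in zip(theta, g)]


def run_sgd(eta, steps=100):
    """Run a fixed number of SGD steps from the same starting point."""
    theta = [1.0, 1.0]
    for _ in range(steps):
        theta = sgd_step(theta, grad(theta), eta)
    return theta


theta_good = run_sgd(eta=0.1)     # shrinks toward the minimum at the origin
theta_large = run_sgd(eta=0.25)   # overshoots along the steep coordinate and blows up
theta_small = run_sgd(eta=0.001)  # barely moves in 100 steps
print(theta_good, theta_large, theta_small)
```

On this toy problem the steep coordinate (curvature 10) is what limits the usable learning rate, while the shallow coordinate is what makes a small learning rate slow.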
Momentum
In SGD, the gradient $\nabla \mathcal{L}(\theta_t)$ often changes rapidly at each iteration $t$ due to the fact that the loss is being computed over different data. This is often partially mitigated by re-using the gradient value from the previous iteration, scaled by a momentum hyperparameter $\mu$, as follows:
\begin{align*}
v_{t + 1} &= \mu v_t - \eta \nabla \mathcal{L}(\theta_t) \\
\theta_{t + 1} &= \theta_t + v_{t + 1}
\end{align*}
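The momentum update can be sketched in a few lines of plain Python. The toy quadratic loss, its exact gradient, the hyperparameter values and the names `grad` and `momentum_step` are illustrative assumptions, not part of any particular library.

```python
def grad(theta):
    """Gradient of the toy loss 0.5 * (theta[0]**2 + 10 * theta[1]**2)."""
    return [theta[0], 10.0 * theta[1]]


def momentum_step(theta, v, g, eta, mu):
    """v_{t+1} = mu * v_t - eta * gradient; theta_{t+1} = theta_t + v_{t+1}."""
    v_new = [mu * vi - eta * gi for vi, gi in zip(v, g)]
    theta_new = [p + vi for p, vi in zip(theta, v_new)]
    return theta_new, v_new


theta, v = [1.0, 1.0], [0.0, 0.0]  # the velocity starts at zero
for _ in range(300):
    theta, v = momentum_step(theta, v, grad(theta), eta=0.01, mu=0.9)
print(theta)
```

The velocity `v` is the only extra state compared with plain SGD, and setting `mu=0` recovers the SGD update.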
It has been argued that including the previous gradient step has the effect of approximating some second-order information about the gradient.
Nesterov's Accelerated Gradient
In Nesterov's Accelerated Gradient (NAG), the gradient of the loss at each step is computed at $\theta_t + \mu v_t$ instead of $\theta_t$. In momentum, the parameter update can be written $\theta_{t + 1} = \theta_t + \mu v_t - \eta \nabla \mathcal{L}(\theta_t)$, so NAG effectively computes the gradient at the new parameter location but without considering the gradient term. In practice, this causes NAG to behave more stably than regular momentum in many situations. A more thorough analysis can be found in 1). The update rules are then as follows:
\begin{align*}
v_{t + 1} &= \mu v_t - \eta \nabla \mathcal{L}(\theta_t + \mu v_t) \\
\theta_{t + 1} &= \theta_t + v_{t + 1}
\end{align*}
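The only change relative to momentum is where the gradient is evaluated, which a short plain-Python sketch makes explicit. The toy quadratic loss, the hyperparameter values and the names `grad` and `nag_step` are assumptions for illustration.

```python
def grad(theta):
    """Gradient of the toy loss 0.5 * (theta[0]**2 + 10 * theta[1]**2)."""
    return [theta[0], 10.0 * theta[1]]


def nag_step(theta, v, grad_fn, eta, mu):
    """Evaluate the gradient at the look-ahead point theta_t + mu * v_t."""
    lookahead = [p + mu * vi for p, vi in zip(theta, v)]
    g = grad_fn(lookahead)
    v_new = [mu * vi - eta * gi for vi, gi in zip(v, g)]
    theta_new = [p + vi for p, vi in zip(theta, v_new)]
    return theta_new, v_new


theta, v = [1.0, 1.0], [0.0, 0.0]
for _ in range(300):
    theta, v = nag_step(theta, v, grad, eta=0.01, mu=0.9)
print(theta)
```

Because the step needs the gradient at a point other than `theta`, the sketch takes the gradient function itself as an argument rather than a precomputed gradient.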
Adagrad
Adagrad effectively rescales the learning rate for each parameter according to the history of the gradients for that parameter. This is done by dividing each term in $\nabla \mathcal{L}$ by the square root of the sum of squares of its historical gradient. Rescaling in this way effectively lowers the learning rate for parameters which consistently have large gradient values. It also effectively decreases the learning rate over time, because the sum of squares will continue to grow with the iteration. After setting the rescaling term $g_0 = 0$, the updates are as follows:
\begin{align*}
g_{t + 1} &= g_t + \nabla \mathcal{L}(\theta_t)^2 \\
\theta_{t + 1} &= \theta_t - \frac{\eta \nabla \mathcal{L}(\theta_t)}{\sqrt{g_{t + 1}} + \epsilon}
\end{align*}
where division is elementwise and $\epsilon$ is a small constant included for numerical stability. It has nice theoretical guarantees and empirical results 2) 3).
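A small plain-Python sketch shows the per-parameter rescaling at work. The toy quadratic loss, the value of `eta` and the names `grad` and `adagrad_step` are illustrative assumptions.

```python
import math


def grad(theta):
    """Gradient of the toy loss 0.5 * (theta[0]**2 + 10 * theta[1]**2)."""
    return [theta[0], 10.0 * theta[1]]


def adagrad_step(theta, g_acc, g, eta, eps=1e-8):
    """Accumulate squared gradients, then divide each gradient by their root."""
    g_acc = [a + gi * gi for a, gi in zip(g_acc, g)]
    theta = [p - eta * gi / (math.sqrt(a) + eps)
             for p, gi, a in zip(theta, g, g_acc)]
    return theta, g_acc


theta, g_acc = [1.0, 1.0], [0.0, 0.0]  # g_0 = 0
first_step = None
for t in range(100):
    theta, g_acc = adagrad_step(theta, g_acc, grad(theta), eta=0.5)
    if t == 0:
        first_step = list(theta)
print(first_step, theta)
```

Although the second coordinate's gradient is ten times larger than the first's, both coordinates move by about `eta` on the first step, since each gradient is divided by its own magnitude.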
RMSProp
In its originally proposed form 4), RMSProp is very similar to Adagrad. The only difference is that the $g_t$ term is computed as an exponentially decaying average instead of an accumulated sum. This makes $g_t$ an estimate of the second moment of $\nabla \mathcal{L}$ and avoids the fact that the learning rate effectively shrinks over time. The name "RMSProp" comes from the fact that the update step is normalized by a decaying RMS of recent gradients. The update is as follows:
\begin{align*}
g_{t + 1} &= \gamma g_t + (1 - \gamma) \nabla \mathcal{L}(\theta_t)^2 \\
\theta_{t + 1} &= \theta_t - \frac{\eta \nabla \mathcal{L}(\theta_t)}{\sqrt{g_{t + 1}} + \epsilon}
\end{align*}
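As a rough illustration, the update can be written in plain Python as follows. The toy quadratic loss, the values of `eta` and `eps`, the zero initialization of the running average and the names `grad` and `rmsprop_step` are assumptions made for this sketch.

```python
import math


def grad(theta):
    """Gradient of the toy loss 0.5 * (theta[0]**2 + 10 * theta[1]**2)."""
    return [theta[0], 10.0 * theta[1]]


def rmsprop_step(theta, g_avg, g, eta, gamma=0.9, eps=1e-8):
    """Decaying average of squared gradients, then an RMS-normalized step."""
    g_avg = [gamma * a + (1.0 - gamma) * gi * gi for a, gi in zip(g_avg, g)]
    theta = [p - eta * gi / (math.sqrt(a) + eps)
             for p, gi, a in zip(theta, g, g_avg)]
    return theta, g_avg


theta, g_avg = [1.0, 1.0], [0.0, 0.0]  # running average assumed to start at zero
first_step = None
for t in range(500):
    theta, g_avg = rmsprop_step(theta, g_avg, grad(theta), eta=0.01)
    if t == 0:
        first_step = list(theta)
print(first_step, theta)
```

With a constant `eta` the iterates do not settle exactly at the minimum; they end up hovering within a few multiples of `eta` around it, since the normalized step does not shrink along with the gradient.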
In the original lecture slides where it was proposed, $\gamma$ is set to $0.9$. In 5), it was shown that the $\sqrt{g_{t + 1}}$ term approximates (in expectation) the diagonal of the absolute value of the Hessian matrix (assuming the update steps are $\mathcal{N}(0, 1)$ distributed). It is also argued that the absolute value of the Hessian is better to use for non-convex problems, which may have many saddle points.
Alternatively, in 6), a first-order moment approximator $m_t$ is added. It is included in the denominator of the preconditioner so that the learning rate is effectively normalized by the standard deviation of $\nabla \mathcal{L}$. There is also a $v_t$ term included for momentum. This gives
\begin{align*}
m_{t + 1} &= \gamma m_t + (1 - \gamma) \nabla \mathcal{L}(\theta_t) \\
g_{t + 1} &= \gamma g_t + (1 - \gamma) \nabla \mathcal{L}(\theta_t)^2 \\
v_{t + 1} &= \mu v_t - \frac{\eta \nabla \mathcal{L}(\theta_t)}{\sqrt{g_{t + 1} - m_{t + 1}^2 + \epsilon}} \\
\theta_{t + 1} &= \theta_t + v_{t + 1}
\end{align*}
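A single step of this variant can be sketched in plain Python as below. The hyperparameter values, the zero initialization of all running quantities, the `max(..., 0.0)` rounding guard and the name `graves_rmsprop_step` are assumptions for illustration, not values taken from 6).

```python
import math


def graves_rmsprop_step(theta, m, g_avg, v, g, eta, gamma, mu, eps):
    """One step of the variance-normalized RMSProp variant with momentum."""
    m = [gamma * mi + (1.0 - gamma) * gi for mi, gi in zip(m, g)]
    g_avg = [gamma * a + (1.0 - gamma) * gi * gi for a, gi in zip(g_avg, g)]
    # g_avg - m**2 estimates the gradient variance; eps sits inside the root.
    # The max() is an added guard against tiny negative values from rounding.
    v = [mu * vi - eta * gi / math.sqrt(max(a - mi * mi, 0.0) + eps)
         for vi, gi, a, mi in zip(v, g, g_avg, m)]
    theta = [p + vi for p, vi in zip(theta, v)]
    return theta, m, g_avg, v


zeros = [0.0, 0.0]
theta, m, g_avg, v = graves_rmsprop_step(
    [1.0, 1.0], zeros, zeros, zeros, g=[1.0, 10.0],
    eta=0.01, gamma=0.9, mu=0.9, eps=1e-8)
print(theta, v)
```

Note that when successive gradients are nearly identical the variance estimate approaches zero, so the step is then limited mainly by $\epsilon$; this makes the choice of $\epsilon$ matter more here than in the original form.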
Adadelta
Adadelta 7) uses the same exponentially decaying moving average estimate of the gradient second moment $g_t$ as RMSProp. It also computes a moving average $x_t$ of the updates $v_t$ similar to momentum, but when updating this quantity it squares the current step, which I don't have any intuition for.
\begin{align*}
g_{t + 1} &= \gamma g_t + (1 - \gamma) \nabla \mathcal{L}(\theta_t)^2 \\
v_{t + 1} &= -\frac{\sqrt{x_t + \epsilon} \nabla \mathcal{L}(\theta_t)}{\sqrt{g_{t + 1} + \epsilon}} \\
x_{t + 1} &= \gamma x_t + (1 - \gamma) v_{t + 1}^2 \\
\theta_{t + 1} &= \theta_t + v_{t + 1}
\end{align*}
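The following plain-Python sketch walks through the same four updates. The toy quadratic loss, the values of `gamma` and `eps`, the zero initialization and the names `grad` and `adadelta_step` are illustrative assumptions.

```python
import math


def grad(theta):
    """Gradient of the toy loss 0.5 * (theta[0]**2 + 10 * theta[1]**2)."""
    return [theta[0], 10.0 * theta[1]]


def adadelta_step(theta, g_avg, x_avg, g, gamma=0.95, eps=1e-6):
    """Scale the gradient by RMS(previous steps) / RMS(recent gradients)."""
    g_avg = [gamma * a + (1.0 - gamma) * gi * gi for a, gi in zip(g_avg, g)]
    v = [-math.sqrt(x + eps) * gi / math.sqrt(a + eps)
         for x, gi, a in zip(x_avg, g, g_avg)]
    x_avg = [gamma * x + (1.0 - gamma) * vi * vi for x, vi in zip(x_avg, v)]
    theta = [p + vi for p, vi in zip(theta, v)]
    return theta, g_avg, x_avg, v


theta, g_avg, x_avg = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
first_v = None
for t in range(50):
    theta, g_avg, x_avg, v = adadelta_step(theta, g_avg, x_avg, grad(theta))
    if t == 0:
        first_v = list(v)
print(first_v, theta)
```

There is no learning rate in this rule: with $x_0 = 0$ the size of the very first step is set by $\epsilon$ alone, which is why the early steps in this sketch are small.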
Adam
Adam is somewhat similar to Adagrad/Adadelta/RMSProp in that it computes a decayed moving average of the gradient and squared gradient (first and second moment estimates) at each time step. It differs mainly in two ways: First, the first order moment moving average coefficient is decayed over time. Second, because the first and second order moment estimates are initialized to zero, some bias-correction is used to counteract the resulting bias towards zero. The use of the first and second order moments, in most cases, ensures that typically the gradient descent step size is $\approx \pm \eta$ and that in magnitude it is less than $\eta$. However, as $\theta_t$ approaches a true minimum, the uncertainty of the gradient will increase and the step size will decrease. It is also invariant to the scale of the gradients. Given hyperparameters $\gamma_1$, $\gamma_2$, $\lambda$, and $\eta$, and setting $m_0 = 0$ and $g_0 = 0$ (note that the paper denotes $\gamma_1$ as $\beta_1$, $\gamma_2$ as $\beta_2$, $\eta$ as $\alpha$ and $g_t$ as $v_t$), the update rule is as follows: 8)
\begin{align*}
m_{t + 1} &= \gamma_1 m_t + (1 - \gamma_1) \nabla \mathcal{L}(\theta_t) \\
g_{t + 1} &= \gamma_2 g_t + (1 - \gamma_2) \nabla \mathcal{L}(\theta_t)^2 \\
\hat{m}_{t + 1} &= \frac{m_{t + 1}}{1 - \gamma_1^{t + 1}} \\
\hat{g}_{t + 1} &= \frac{g_{t + 1}}{1 - \gamma_2^{t + 1}} \\
\theta_{t + 1} &= \theta_t - \frac{\eta \hat{m}_{t + 1}}{\sqrt{\hat{g}_{t + 1}} + \epsilon}
\end{align*}
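A plain-Python sketch of this update rule helps check the claim that the step size is roughly $\eta$ regardless of gradient scale. The toy quadratic loss, the hyperparameter values and the names `grad` and `adam_step` are assumptions for illustration. The $\lambda$ hyperparameter listed in the text does not appear in the update rule as written, so the sketch leaves it out.

```python
import math


def grad(theta):
    """Gradient of the toy loss 0.5 * (theta[0]**2 + 10 * theta[1]**2)."""
    return [theta[0], 10.0 * theta[1]]


def adam_step(theta, m, g_avg, g, t, eta=0.001, gamma1=0.9, gamma2=0.999,
              eps=1e-8):
    """One Adam step; t is the 0-based iteration index, so t + 1 is the exponent."""
    m = [gamma1 * mi + (1.0 - gamma1) * gi for mi, gi in zip(m, g)]
    g_avg = [gamma2 * a + (1.0 - gamma2) * gi * gi for a, gi in zip(g_avg, g)]
    # Bias correction undoes the pull toward the zero initialization.
    m_hat = [mi / (1.0 - gamma1 ** (t + 1)) for mi in m]
    g_hat = [a / (1.0 - gamma2 ** (t + 1)) for a in g_avg]
    theta = [p - eta * mh / (math.sqrt(gh) + eps)
             for p, mh, gh in zip(theta, m_hat, g_hat)]
    return theta, m, g_avg


theta, m, g_avg = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]  # m_0 = 0, g_0 = 0
first_step = None
for t in range(100):
    theta, m, g_avg = adam_step(theta, m, g_avg, grad(theta), t)
    if t == 0:
        first_step = list(theta)
print(first_step, theta)
```

On the first iteration the bias-corrected estimates equal the gradient and its square, so both coordinates move by almost exactly `eta` even though their gradients differ by a factor of ten.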
ESGD
9)
Adasecant
10)
vSGD
11)
Rprop
12)
1) Sutskever, Martens, Dahl, and Hinton, "On the importance of initialization and momentum in deep learning" (ICML 2013)
2) Dyer, "Notes on AdaGrad"
3) Duchi, Hazan, and Singer, "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization" (COLT)
4) Hinton, Srivastava, and Swersky, "rmsprop: Divide the gradient by a running average of its recent magnitude"
5), 9) Dauphin, de Vries, Chung, and Bengio, "RMSProp and equilibrated adaptive learning rates for non-convex optimization"
6) Graves, "Generating Sequences with Recurrent Neural Networks"
7) Zeiler, "Adadelta: An Adaptive Learning Rate Method"
8) Kingma and Ba, "Adam: A Method for Stochastic Optimization"
10) Gulcehre and Bengio, "Adasecant: Robust Adaptive Secant Method for Stochastic Gradient"
11) Schaul, Zhang, and LeCun, "No More Pesky Learning Rates"
12) Riedmiller and Braun, "A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm"