Hulu Machine Learning Questions and Answers Series | 17: Classical Variants of the Stochastic Gradient Descent Algorithm


This is the second machine learning post this week and the 17th article in the Hulu interview-question series. All previous installments can be found under the "Machine Learning" menu. May reviewing the old teach you something new.

Today's topic is

"Classical Variants of the Stochastic Gradient Descent Algorithm"

Scenario Description

When optimization methods for deep learning come up, people immediately think of stochastic gradient descent (SGD). But SGD is no cure-all; sometimes it is a pit to fall into. When you design a deep neural network, if the only training method you know is SGD, you may get poor results in many cases, give up, and stop investing effort in that deep model. The real reason, however, may be that SGD went astray during optimization, costing you the chance of a new discovery.

Problem description

The most commonly used optimization method in deep learning is SGD, but SGD sometimes fails to deliver satisfactory training results. Why is that? What improvements have researchers made to SGD, and what are the characteristics of the resulting SGD variants?

Prerequisite knowledge: gradient descent, stochastic gradient descent

Solutions and Analysis

(1) Why SGD fails: feeling for stones on the way down the mountain

To answer the first question, let us start with a visual metaphor. Imagine you are walking down a mountain with good eyesight: you can see the slope at your position, take a step in the downhill direction, and eventually reach the bottom. Now suppose you are blindfolded and can only judge the slope by feeling the stones underfoot; your accuracy drops sharply. Sometimes what you think is downhill may not be, and after walking for a while you find you have not descended at all, or you take many twists and detours before reaching the bottom.

Back to SGD. Traditional gradient descent (also called batch gradient descent, GD) is walking downhill with eyes open, while SGD is walking downhill blindfolded. To obtain an accurate gradient, each step of gradient descent loads the entire training set into the model; the time and memory costs are both very large, which rules it out for real-world large datasets and large models. SGD gives up on gradient accuracy: each step randomly samples only a small number of examples to compute the gradient, so computation is fast and memory cost is small. But because each step receives only limited information, the gradient estimate is often biased off course, making convergence of the objective unstable and prone to severe oscillation, and sometimes it fails to converge at all. Figure 1 shows the parameter trajectories of GD and SGD during optimization: GD approaches the minimum steadily, while SGD's path is as winding as the eighteen bends of the Yellow River.

Figure 1: Parameter trajectories of GD and SGD during optimization
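To make the metaphor concrete, here is a minimal sketch in NumPy, on a hypothetical least-squares problem (all names and sizes are illustrative), contrasting how batch GD computes the exact gradient over the whole training set with how SGD estimates it from a small random mini-batch:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 20))          # hypothetical training set
w_true = rng.normal(size=20)
y = X @ w_true + 0.1 * rng.normal(size=10000)

def full_gradient(w):
    """Batch GD: exact gradient of the mean squared error over ALL samples."""
    return 2 * X.T @ (X @ w - y) / len(X)

def stochastic_gradient(w, batch_size=32):
    """SGD: noisy gradient estimate from a small random mini-batch."""
    idx = rng.integers(0, len(X), size=batch_size)
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / batch_size

w = np.zeros(20)
eta = 0.1
for t in range(1000):
    w -= eta * stochastic_gradient(w)     # swap in full_gradient(w) for GD
```

Each SGD step touches only 32 samples instead of 10,000, which is exactly the speed/accuracy trade-off described above.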

Going further, some will say that the deep learning optimization problem is inherently hard, with too many local-minimum traps. True, but that trap is shared by SGD and GD alike. What troubles SGD most is not local minima but two other kinds of terrain: valleys and saddle points [1]. A valley, as the name suggests, is a narrow mountain path with cliff walls on both sides; a saddle point is shaped like a saddle, curving upward in one direction and downward in the other, with an almost flat plain in the central region.

Why is SGD most afraid of these two kinds of terrain? In a valley, the accurate gradient direction points down along the path, and a slight deviation hits the wall; SGD's rough gradient estimate makes it bounce back and forth between the two walls instead of descending rapidly along the path, so convergence is unstable and slow. At a saddle point, SGD walks onto flat ground (a flat region far from a minimum is also called a plateau). Imagine, blindfolded, judging the slope only by the feel of your feet: if the slope is pronounced, even an inaccurate estimate gives the rough downhill direction, but if the slope is barely noticeable, you are likely to head the wrong way. Likewise, in a region where the gradient is near zero, SGD cannot reliably detect the tiny variations in the gradient, circles in place, and optimization stagnates.

(2) Solutions: inertia and environmental awareness

SGD is essentially an iterative method for updating parameters: each iteration starts from the current position, takes a small step in some direction to reach the next position, and then repeats from there. The update formula for SGD is

θ_{t+1} = θ_t − η g_t,

where the currently estimated gradient g_t gives the direction of the step and the learning rate η controls the step size. The SGD variants that follow all build on this update form.
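As a minimal sketch (with `grad` standing for whatever mini-batch gradient estimate g_t is available; names are illustrative), the vanilla SGD step is a one-liner, and each variant below changes only how this step is formed:

```python
def sgd_step(theta, grad, eta=0.01):
    """Vanilla SGD update: theta_{t+1} = theta_t - eta * g_t."""
    return theta - eta * grad
```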

Variant 1: Momentum

To solve SGD's oscillation in valleys and stagnation at saddle points, let us run a simple thought experiment. Imagine the trajectory of a ball of paper in a valley and at a saddle point. In the valley, gravity rolls the paper ball down the path; the walls on both sides are irregular, so the ball inevitably hits them, and because its mass is small, it bounces elastically from one wall into the other and tumbles its way down. At a saddle point, once the paper ball reaches the flat ground, its small mass lets its speed drop to zero almost immediately. The paper ball's plight is exactly SGD's problem. Intuitively, if we replace it with an iron ball, it will not be easily deflected by lateral forces while rolling down the valley, tracing a steadier and straighter path, and on reaching the center of the saddle point it keeps moving forward under inertia, giving it a chance to break out of the flat trap. This is the momentum method [2], whose update formulas are

v_t = γ v_{t−1} + η g_t,
θ_{t+1} = θ_t − v_t.

The step v_t consists of two parts: (1) the learning rate η times the currently estimated gradient g_t, and (2) the previous step v_{t−1} attenuated by the factor γ. Inertia enters here as the reuse of information from previous steps. By analogy with high-school physics: the current gradient is like the acceleration produced by the force at this instant, the previous step is like the velocity at the previous instant, and the current step is like the velocity at this instant. To compute the current velocity, we combine the previous velocity with the effect of the current acceleration, so v_t depends directly on both v_{t−1} and g_t, not on g_t alone. The attenuation factor γ plays the role of resistance.

High-school physics also tells us that the physical quantity expressing inertia is momentum, which is where the algorithm's name comes from. The iron ball rolling down the valley is subject to forces in two directions: the force along the slope and the elastic forces from collisions with the left and right walls. The downhill force is steady, so the momentum it generates keeps accumulating and the speed keeps growing; the lateral elastic forces keep switching sides, so the momenta they accumulate cancel each other out, naturally damping the ball's back-and-forth oscillation. Compared with SGD, momentum therefore converges faster and with a more stable convergence curve.
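A minimal sketch of the momentum step under the same conventions (the velocity state `v` starts as zeros of the same shape as `theta`; names are illustrative):

```python
def momentum_step(theta, grad, v, eta=0.01, gamma=0.9):
    """Momentum update: v_t = gamma * v_{t-1} + eta * g_t,
    theta_{t+1} = theta_t - v_t.

    The decay factor gamma acts as resistance: consistent downhill
    gradients accumulate in v, while alternating lateral components
    cancel out across steps.
    """
    v = gamma * v + eta * grad
    return theta - v, v
```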

Variant 2: AdaGrad

Inertia is obtained from historical information. Besides a forward push, what else can we extract from our past steps? We would like a sense of the surrounding environment: even blindfolded, we should be able to infer something from the feel of previous steps, for instance that one direction is always bumpy while another may be flat.

For SGD, environmental awareness means forming an empirical judgment about each parameter direction in parameter space and assigning that parameter an adaptive learning rate; that is, different parameters should be updated with different step sizes. In some tasks, such as training word-embedding parameters in text processing, some words appear frequently and others rarely. The sparse data makes the corresponding parameters' gradients sparse: for infrequent words, the gradient is zero most of the time, so those parameters are updated very infrequently. We therefore want larger steps for rarely updated parameters and smaller steps for frequently updated ones. AdaGrad [3] uses the sum of squared past gradients,

∑_{k=0}^{t} g_{k,i}^2,

to measure the historical sparsity of each parameter's gradients: the smaller the sum, the sparser the gradients. The update formula for AdaGrad is

θ_{t+1,i} = θ_{t,i} − η g_{t,i} / √(∑_{k=0}^{t} g_{k,i}^2 + ε),

where θ_{t+1,i} denotes the i-th component of θ_{t+1} and ε is a small constant for numerical stability. In addition, the accumulating denominator implements an annealing process, a common strategy in many optimization techniques: over time, the effective learning rate

η / √(∑_{k=0}^{t} g_{k,i}^2 + ε)

decreases monotonically, helping to ensure the eventual convergence of the optimization.
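A minimal sketch of the AdaGrad step under the same conventions (the per-parameter accumulator `sq_sum` starts at zeros; `eps` is the usual small stability constant):

```python
import numpy as np

def adagrad_step(theta, grad, sq_sum, eta=0.01, eps=1e-8):
    """AdaGrad update: accumulate squared gradients per parameter and
    scale the step by 1/sqrt of that sum, so sparse (rarely updated)
    parameters keep large steps while frequent ones are annealed.
    """
    sq_sum = sq_sum + grad ** 2                        # sum of g_{k,i}^2
    theta = theta - eta * grad / np.sqrt(sq_sum + eps)
    return theta, sq_sum
```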

Variant 3: Adam

The Adam method [4] combines the two advantages of keeping inertia and sensing the environment. On the one hand, Adam records the first moment of the gradient, an average of past gradients together with the current gradient, which embodies inertia. On the other hand, Adam records the second moment of the gradient, an average of past squared gradients together with the current squared gradient; similar to AdaGrad, this embodies environmental awareness and produces adaptive learning rates for different parameters. The first and second moments are averaged in the spirit of a sliding window: the current gradient and gradients from the recent past dominate, while the contribution of long-past gradients decays exponentially. This is the exponential decay average technique, with formulas

m_t = β_1 m_{t−1} + (1 − β_1) g_t,
v_t = β_2 v_{t−1} + (1 − β_2) g_t^2,

where β_1 and β_2 are decay factors.

How should we understand the first moment and second moment? The first moment m_t amounts to an estimate of E[g_t]: because the current gradient g_t is the result of random sampling, we care more about its statistical expectation than about g_t itself. The second moment v_t amounts to an estimate of E[g_t^2]; unlike AdaGrad, it does not accumulate from the very beginning but estimates the expectation with exponential decay. Their physical meaning is as follows. When ‖m_t‖ is large and v_t is large, the gradient is large and stable, indicating a pronounced slope and a clear direction of travel. When ‖m_t‖ is near zero and v_t is large, the gradient is unstable; we may have entered a canyon and are prone to bouncing back and forth. The case where ‖m_t‖ is large but v_t is near zero cannot occur, since (E[g_t])^2 ≤ E[g_t^2]. When ‖m_t‖ is near zero and v_t is near zero, the gradient is nearly vanishing: we may have reached a local minimum, or a flat, slowly descending region, so care is needed to avoid getting stuck in a plateau. In addition, Adam applies bias corrections to m_t and v_t to compensate for their zero initialization:

m̂_t = m_t / (1 − β_1^t), v̂_t = v_t / (1 − β_2^t).

The update formula for Adam is

θ_{t+1} = θ_t − η m̂_t / (√v̂_t + ε).
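A minimal sketch of the Adam step under the same conventions (`m` and `v` start at zeros; `t` is the 1-based step count used for bias correction):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam update: exponential decay averages of the gradient (first
    moment m) and squared gradient (second moment v), bias-corrected
    for their zero initialization.
    """
    m = beta1 * m + (1 - beta1) * grad            # first moment
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```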

Extended Reading

In addition to the three types of SGD variants above, researchers have proposed other methods:

1. Nesterov accelerated gradient: extends the momentum method by computing the gradient at the estimated future position along the inertial direction, rather than at the current position. This look-ahead design gives the algorithm a degree of foresight about the terrain ahead (a minimal sketch follows this list).

2. Adadelta and RMSProp: these two methods are very similar and are both improvements on AdaGrad. AdaGrad uses the square root of the sum of all past squared gradients as the denominator; since this denominator grows monotonically with time, the adaptive learning rate decays too aggressively. Adadelta and RMSProp therefore use exponential decay averaging, replacing the sum with a decayed mean of past squared gradients.

3. AdaMax: a variant of Adam that replaces the exponentially decayed average of squared gradients with an exponentially decayed maximum.

4. Nadam: can be seen as the Nesterov-accelerated-gradient version of Adam.
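As promised above, a minimal sketch of the Nesterov accelerated gradient step (here `grad_fn` is an assumed callable mapping parameters to a gradient estimate, so the gradient can be evaluated at the look-ahead point):

```python
def nag_step(theta, grad_fn, v, eta=0.01, gamma=0.9):
    """Nesterov accelerated gradient: evaluate the gradient at the
    look-ahead position theta - gamma * v (where inertia is about to
    carry us) instead of at the current theta.
    """
    lookahead_grad = grad_fn(theta - gamma * v)
    v = gamma * v + eta * lookahead_grad
    return theta - v, v
```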

References:

[1] Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. arXiv, pages 1–14, 2014.

[2] Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks: The Official Journal of the International Neural Network Society, 12(1): 145–151, 1999.

[3] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12: 2121–2159, 2011.

[4] Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, pages 1–13, 2015.

Next Topic Preview

"svm– kernel function and relaxation variable"

Scenario Description

When SVM deals with linearly inseparable data, a kernel function can map the data so that the original problem becomes more separable under some measure, while introducing slack variables lets us give up on a few outliers so that the separating hyperplane is not affected too much. Combining these two techniques with SVM is one reason the SVM classifier is simple yet powerful.

Problem description

    1. When training an SVM (support vector machine) with a Gaussian kernel, prove that if no two points in the given training set occupy the same position, then there exist a set of parameters {α1, ..., αm, b} and a kernel parameter γ such that the SVM achieves a training error of 0.

    2. If we use the parameter γ from question 1 to train an SVM without slack variables, can we guarantee that the resulting SVM still achieves a training error of 0? Explain your view.

    3. If we use the SMO (sequential minimal optimization) algorithm to train an SVM with slack variables, where the penalty factor C is an arbitrary constant fixed in advance, can we still obtain a training error of 0? Explain your view.
