Andrew Ng Deep Learning Course Note 7: Optimization Algorithms (Week 2)

Source: Internet
Author: User

1 Mini-batch Gradient Descent

In mini-batch gradient descent, we do not use the whole training set to compute the cost for every gradient step. Instead, the training set is cut into a number of equal parts, each called a mini-batch; we compute the cost on one mini-batch, take one gradient step, then move on to the next mini-batch. For example, with 5,000,000 examples and a mini-batch size of 1,000, one pass over the data performs 5,000 gradient steps (5,000 mini-batches of 1,000 examples each, 5,000,000 examples in total).

For batch gradient descent (all samples computed at each step), the cost decreases steadily as the number of iterations increases. For mini-batch gradient descent, the cost measured per mini-batch oscillates as it decreases (sometimes rising, sometimes falling), because each step only considers part of the data, so the descent direction may be somewhat off.

A mini-batch size equal to m is exactly batch gradient descent; its drawback is that a single iteration takes a long time when m is very large.

A mini-batch size of 1 is stochastic gradient descent (one sample per step). Each individual step may move toward or away from the optimum; on average the path heads toward the optimum, but individual step directions can be wrong. Stochastic gradient descent never truly converges; it ends up fluctuating in a neighborhood of the optimum. Its drawback is that it loses the speedup of vectorization, since it processes one example at a time and needs a loop of m iterations per pass.

In practice, therefore, the mini-batch size is chosen in between: large enough to keep the advantages of vectorization, small enough to avoid the overly long iterations that a very large m brings. The path still oscillates toward the optimum, but less than stochastic gradient descent; it may end up fluctuating near the optimum, and decaying the learning rate (discussed later) helps with that.

When the data set is small, say m < 2000, just use batch gradient descent directly. Otherwise, typical mini-batch sizes are 64, 128, 256, or 512; powers of 2 tend to run faster given how computer memory is laid out. Also make sure one mini-batch fits in your CPU/GPU memory.
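As a minimal sketch of the splitting step described above (the function name and the column-per-example convention are illustrative, following the course's usual layout of X as (features, m)):

```python
import numpy as np

def make_minibatches(X, Y, batch_size=64, seed=0):
    """Shuffle the training set and cut it into mini-batches.

    X: (n_features, m) data matrix, Y: (1, m) labels -- one column per
    example. The last mini-batch may be smaller if m is not a multiple
    of batch_size.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)           # shuffle so each epoch sees a new order
    X, Y = X[:, perm], Y[:, perm]
    batches = []
    for start in range(0, m, batch_size):
        end = start + batch_size
        batches.append((X[:, start:end], Y[:, start:end]))
    return batches
```

With 5,000,000 examples and batch_size=1000, this yields the 5,000 mini-batches per epoch from the example above.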

2 Exponentially Weighted Averages

Consider a scatter plot of temperature against time, with the temperature on day t denoted θt. Define a sequence v by v0 = 0 and vt = 0.9 * v(t-1) + 0.1 * θt, i.e. 0.9 times yesterday's v plus 0.1 times today's temperature.

Plotting v (the red line in the course figure) gives a curve that roughly tracks the average temperature over the past 10 days.

More generally, write 0.9 as β and 0.1 as (1 - β); then vt is roughly an average over 1/(1 - β) days of temperature. In the course figure, the red line is β = 0.9 (about 10 days), the green line is β = 0.98 (about 50 days), and the yellow line is β = 0.5 (about 2 days).

Why is that? Expanding, say, v100 shows that it is a weighted sum of the daily temperatures, with the weight decaying exponentially the further back a day is. So vt is roughly an average of the last x days of temperature. To find x, note that the weight has shrunk by a factor of 1/e after x days, i.e. β^x ≈ 1/e, which gives x ≈ 1/(1 - β).

In addition, Ng points out that the exponentially weighted average is neither the best nor a precise way to compute an average, but it does not need to store all the recent data and consumes little memory, which makes it a good, efficient approximation.
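The update above can be sketched in a few lines (the function name is illustrative); note that only the single running value v is stored, which is exactly the memory advantage Ng describes:

```python
def ewa(thetas, beta=0.9):
    """Exponentially weighted average: v_t = beta * v_{t-1} + (1-beta) * theta_t.

    Returns the sequence of v values; only the scalar v is kept between
    steps, never the full history.
    """
    v = 0.0
    vs = []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        vs.append(v)
    return vs
```

With beta=0.9 this behaves like a rough 10-day average, per the 1/(1 - β) rule above.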

3 Bias Correction in Exponentially Weighted Averages

As shown in the course figure, with β = 0.98, implementing the update above literally yields the purple curve rather than the green one. The reason is that v0 = 0, so the first few values come out too low and give a poor estimate. The fix is to divide vt by (1 - β^t): this corrects the early values, and as t grows, the denominator (1 - β^t) approaches 1, so the corrected value approaches the original vt.

In machine learning practice, people often do not bother with bias correction and simply live through the warm-up period; but if you care about the early estimates, bias correction gives you better ones.
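A small sketch of the corrected version (function name illustrative); for a constant input the corrected estimate is exact from the very first step, which is the point of the fix:

```python
def ewa_corrected(thetas, beta=0.98):
    """Bias-corrected EWA: v_t / (1 - beta**t) fixes the low start from v_0 = 0."""
    v = 0.0
    out = []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t))   # denominator -> 1 as t grows
    return out
```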

4 Gradient Descent with Momentum

We can improve on plain gradient descent by applying the exponentially weighted average idea above to the gradients.

Consider the contours of the cost function, as shown in the course figure. With mini-batches, the step direction is not always the optimal one, so the path oscillates as it nears the optimum (the blue line). We would like small steps in the vertical, oscillating direction and large steps in the horizontal, forward direction. Taking an exponentially weighted average of the gradients achieves this: the vertical components alternate between positive and negative and average out to roughly zero, while the horizontal components consistently point toward the optimum, so their average still does. This damps the oscillation and lets us approach the optimum with a more sensible step size (the red line).

How is it done specifically? See the formula on the left of the slide: at each gradient step, update the exponentially weighted average of the gradient, v_dW = β * v_dW + (1 - β) * dW, and then step using that average, W = W - α * v_dW (and likewise for b).

This is gradient descent with momentum. Unlike plain gradient descent, where each step is independent of the previous ones, there are now two hyperparameters, α and β; β is generally set to 0.9, which corresponds to averaging roughly the last 10 gradients.

An intuition for momentum: think of a ball rolling down a hill. The gradient dW acts as acceleration and the momentum term v as velocity; the ball speeds up under the acceleration, while β < 1 acts like friction, so it does not accelerate without bound.

Bias correction is generally not applied here: after about 10 iterations the estimate is no longer biased, and training virtually never runs for fewer than 10 iterations.

There is another version of the formula, shown on the right of the slide, which drops the (1 - β) factor, scaling v up by 1/(1 - β). Ng prefers the formula on the left: the right-hand version is less natural, because α then has to be rescaled whenever β changes.
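A minimal sketch of one momentum step, using the left-hand ("with (1 - β)") version Ng prefers; names are illustrative, and b updates identically:

```python
import numpy as np

def momentum_step(W, dW, v, alpha=0.01, beta=0.9):
    """One momentum update: v = beta*v + (1-beta)*dW, then W = W - alpha*v."""
    v = beta * v + (1 - beta) * dW   # EWA of the gradient (the "velocity")
    W = W - alpha * v                # step along the averaged direction
    return W, v
```

The caller keeps v between iterations (initialized to zeros, same shape as W), which is how each step comes to depend on the history of gradients.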

5 RMSprop (Root Mean Square Propagation)

There is another algorithm, RMSprop, that can also accelerate mini-batch gradient descent. It modifies the momentum idea: instead of averaging dW, it averages the square of dW, s_dW = β * s_dW + (1 - β) * dW², and the update divides by the square root, W = W - α * dW / sqrt(s_dW). Intuitively, the derivative is large in the oscillating (vertical) direction, so we divide by a large number there, and small in the forward (horizontal) direction, so we divide by a small number there; this damps the vertical swing and allows a larger learning rate for faster learning. To make sure the denominator is never zero, a tiny ε is added in practice: W = W - α * dW / (sqrt(s_dW) + ε).
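One RMSprop step can be sketched as follows (names illustrative; β = 0.9 here is an assumed typical value, and as above s is kept between iterations by the caller):

```python
import numpy as np

def rmsprop_step(W, dW, s, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSprop update: s = beta*s + (1-beta)*dW**2, then
    W = W - alpha * dW / (sqrt(s) + eps)."""
    s = beta * s + (1 - beta) * dW ** 2          # EWA of the squared gradient
    W = W - alpha * dW / (np.sqrt(s) + eps)      # large s -> small effective step
    return W, s
```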

6 Adam

In the field of deep learning, many new optimization algorithms have been proposed and then questioned. Adam and RMSprop are among the few that have stood up to wide testing; they have proven suitable for many different deep learning architectures and are used to solve a wide range of problems well.

The Adam algorithm is essentially momentum combined with RMSprop. As shown in the course slide, v is computed as in momentum and s as in RMSprop, each with bias correction, and the gradient step subtracts a term combining the two: W = W - α * v_corrected / (sqrt(s_corrected) + ε).

This method has several hyperparameters: the learning rate α needs to be tuned; β1 is generally 0.9; for β2 the Adam paper's authors recommend 0.999, and for ε they recommend 10^-8. When using Adam, the betas and ε are usually left at their default values.

Adam stands for adaptive moment estimation: β1 governs the average of dW, called the first moment, and β2 governs the average of dW², called the second moment; hence the name Adam.
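Putting the pieces together, one Adam step can be sketched as below (names illustrative; defaults follow the recommendations above, and t is the 1-based iteration counter needed for bias correction):

```python
import numpy as np

def adam_step(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (first moment v) plus RMSprop (second
    moment s), each bias-corrected by its own (1 - beta**t) factor."""
    v = beta1 * v + (1 - beta1) * dW           # first moment, as in momentum
    s = beta2 * s + (1 - beta2) * dW ** 2      # second moment, as in RMSprop
    v_hat = v / (1 - beta1 ** t)               # bias correction
    s_hat = s / (1 - beta2 ** t)
    W = W - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return W, v, s
```

At t = 1 the bias correction makes v_hat = dW and s_hat = dW², so the first step has magnitude close to α regardless of the gradient's scale.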

7 Learning Rate Decay

The reason for learning rate decay is simple: early in gradient descent, far from the optimum, larger steps are fine; later, very close to the optimum, you must be careful, because a big step may overshoot and move away from it. So the steps should shrink over time, that is, the learning rate should decay.

One decay formula, as shown: multiply the initial learning rate α0 by 1 / (1 + decay_rate * epoch_num), where decay_rate is a new hyperparameter and epoch_num is the index of the current pass over the data.

There are other decay schedules as well, e.g. α = k / sqrt(t) * α0, where k is a constant and t indexes the mini-batch. Some people also choose to decay the learning rate manually.
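The schedules above can be sketched as small functions (names and the 0.95 base of the exponential variant are illustrative assumptions; the exponential form is a common variant beyond the two formulas spelled out in the notes):

```python
import math

def lr_inverse(alpha0, decay_rate, epoch_num):
    """alpha0 / (1 + decay_rate * epoch_num), the formula from the notes."""
    return alpha0 / (1 + decay_rate * epoch_num)

def lr_exponential(alpha0, epoch, base=0.95):
    """Exponential decay: base**epoch * alpha0 (a common variant)."""
    return (base ** epoch) * alpha0

def lr_inv_sqrt(alpha0, k, t):
    """k / sqrt(t) * alpha0, where t indexes the mini-batch."""
    return k / math.sqrt(t) * alpha0
```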

8 The Problem of Local Optima

In the early days of deep learning, people worried that optimization would get stuck in local optima. As the theory of deep learning has developed, that understanding has changed, and it is still developing.

Looking at the left-hand figure, it is natural to imagine an optimization problem with multiple local optima and to worry that the algorithm will get trapped in one of them and fail to find the right solution.

But that picture turns out to be misleading: in a typical neural network, a point where the gradient is zero is usually not a local optimum like the one on the left, but a saddle point like the one on the right. A saddle point is a maximum in one direction and a minimum in another; picture the saddle on a horse's back.

At a point with zero gradient in a high-dimensional space, the function can curve upward or downward independently in each direction. For such a point to be a local optimum in, say, a 20,000-dimensional space, all 20,000 directions would have to curve the same way, which is extremely unlikely. Far more likely, some directions curve up while others curve down, as in the right-hand figure, so in high dimensions we are much more likely to encounter saddle points.

The history of deep learning teaches us that our intuition about low-dimensional spaces, like the left-hand figure, does not transfer to high-dimensional ones, like the right-hand figure.

So in deep learning (assuming you have a large number of parameters and a cost function defined over a high-dimensional space), you are unlikely to get stuck at a local optimum; but on a plateau such as the region around a saddle point, learning may be very slow, which is why momentum, RMSprop, and Adam are used to speed things up.

