Learning Rate: the effect of the learning rate in the gradient descent algorithm, and how to adjust it


In machine learning, supervised learning works by defining a model and estimating its optimal parameters from the data in the training set. Gradient descent is a parameter-optimization algorithm widely used to minimize model error. It estimates the model's parameters (weights) through multiple iterations, reducing the cost function at each step.

The pseudo-code for gradient descent is as follows:

Repeat until convergence {

    ωj = ωj − λ · ∂F(ωj)/∂ωj

}

Notes: (1) ωj is a model parameter (weight), F() is the cost function, ∂F(ωj)/∂ωj is the first derivative of F with respect to ωj, and λ is the learning rate.

(2) If F() is monotonic, the minimum of the cost function is reached after multiple iterations. If F() is not monotonic, we are likely to get stuck in a local optimum; a simple remedy is to experiment with different initial values of ωj and compare the resulting values of the cost function to detect whether we have landed in a local optimum.

(3) Gradient descent is not necessarily the best method for computing the weight parameters, but as a simple and fast method it is widely used. For reference, see Andrew Ng's Stanford open course.
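To make the update rule concrete, here is a minimal sketch in Python. The one-dimensional quadratic cost F(ω) = (ω − 3)² and the function names are illustrative assumptions, not part of the original article.

def gradient_descent(dF, w0, lr=0.1, tol=1e-8, max_iters=10000):
    """Repeat w = w - lr * dF(w) until the update barely moves w."""
    w = w0
    for _ in range(max_iters):
        step = lr * dF(w)
        w -= step
        if abs(step) < tol:  # convergence: the step size is negligible
            break
    return w

# Example: F(w) = (w - 3)^2 has derivative dF(w) = 2 * (w - 3),
# so gradient descent should converge to the minimum at w = 3.
w_opt = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_opt)  # ~3.0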

[Figure: illustration of the gradient descent process]

Adjustment of the learning rate

For gradient descent to perform well, the learning rate must be set within an appropriate range. The learning rate determines how quickly the parameters move toward their optimal values. If the learning rate is too large, the parameters are likely to overshoot the optimum; if it is too small, optimization becomes inefficient and the algorithm may take a very long time to converge. The learning rate is therefore critical to the performance of the algorithm.

Adjust the learning rate for data sets of different sizes

The problem differs depending on the cost function F() we choose. When the sum of squared errors (SSE) is used as the cost function, ∂F(ωj)/∂ωj grows as the amount of training data increases, so the learning rate needs to be set to a correspondingly smaller value.

One way to solve this type of problem is to multiply the learning rate λ by 1/n, where n is the number of samples in the training set. With this change, each update takes the following form:

ωj = ωj − (λ/n) · ∂F(ωj)/∂ωj

Reference: Wilson et al., "The general inefficiency of batch training for gradient descent learning".

Another solution is to choose a cost function that is not affected by the number of training samples, such as the mean squared error (MSE).
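The sketch below (Python with NumPy) illustrates both points at once: the SSE gradient grows with the number of training samples n, while the MSE gradient, which amounts to applying the λ/n correction, stays stable. The single-weight model y ≈ w·x and the random data are illustrative assumptions.

import numpy as np

def sse_gradient(w, x, y):
    # F(w) = sum_i (w*x_i - y_i)^2  =>  dF/dw = 2 * sum_i (w*x_i - y_i) * x_i
    return 2 * np.sum((w * x - y) * x)

def mse_gradient(w, x, y):
    # F(w) = mean_i (w*x_i - y_i)^2  =>  the sum becomes a mean,
    # so the gradient no longer grows with n
    return 2 * np.mean((w * x - y) * x)

rng = np.random.default_rng(0)
for n in (100, 10000):
    x = rng.normal(size=n)
    y = 2.0 * x  # true weight is 2.0
    print(n, sse_gradient(0.0, x, y), mse_gradient(0.0, x, y))

# The SSE gradient grows roughly 100x when n does; the MSE gradient does not,
# which is why lr/n with SSE behaves like a fixed lr with MSE.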

Adjust the learning rate in each iteration

Another good adaptive method is to adjust the value of the learning rate in each iteration. The basic idea is that the farther you are from the optimal value, the faster you should move toward it, i.e. the larger the learning rate should be, and vice versa.

But here is the problem: we do not actually know where the optimal value is, and we do not know how far we are from it at each iteration.

The workaround is to evaluate the error function with the estimated model parameters at the end of each iteration. If the error decreased relative to the previous iteration, increase the learning rate by 5%; if the error increased (meaning the optimum was skipped), reset ωj to the values of the previous iteration and reduce the learning rate by 50%. This technique is called Bold Driver.
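Here is a minimal sketch of the Bold Driver heuristic as described above, again assuming an illustrative one-dimensional quadratic cost.

def bold_driver(F, dF, w0, lr=0.5, max_iters=1000, min_lr=1e-10):
    w, prev_err = w0, F(w0)
    for _ in range(max_iters):
        w_new = w - lr * dF(w)
        err = F(w_new)
        if err < prev_err:       # error decreased: accept the step, speed up 5%
            w, prev_err = w_new, err
            lr *= 1.05
        else:                    # error increased: keep the previous w, halve lr
            lr *= 0.5
        if lr < min_lr:
            break
    return w

# Example: minimize F(w) = (w - 3)^2 starting far from the optimum.
w_opt = bold_driver(lambda w: (w - 3) ** 2, lambda w: 2 * (w - 3), w0=-10.0)
print(w_opt)  # ~3.0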

Recommendation: Normalization of input vectors

Normalizing the input vectors is a common practice in machine learning. In some applications normalization is required because distances or feature variances are used, and without it the results would be dominated by the features with the largest scales. Normalized inputs also help numerical optimization methods (such as gradient descent) converge faster and more accurately.

Although there are several ways to normalize variables, [0,1] normalization (also called min-max normalization) and z-score normalization are the two most widely used:

x_minmax = (x − min(x)) / (max(x) − min(x))

x_zscore = (x − mean(x)) / std(x)
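Both formulas translate directly into NumPy; the sample data below are an illustrative assumption (note that np.std computes the population standard deviation by default).

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

x_minmax = (x - x.min()) / (x.max() - x.min())  # rescaled into [0, 1]
x_zscore = (x - x.mean()) / x.std()             # zero mean, unit variance

print(x_minmax)  # [0.   0.25 0.5  0.75 1.  ]
print(x_zscore)  # symmetric around 0 with std 1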

Note: this article is a translation of http://blog.datumbox.com/tuning-the-learning-rate-in-gradient-descent/, originally written by Vasilis Vryniotis.

Hope this helps you understand and apply these techniques!
