In machine learning, supervised learning works by defining a model and estimating its optimal parameters from the training data. Gradient descent is a parameter-optimization algorithm widely used to minimize a model's error. It runs many iterations, reducing the cost function at each step, to estimate the model's parameters (weights).
The pseudo-code for gradient descent is as follows:
Repeat until convergence {
    ωj := ωj − λ·∂F(ωj)/∂ωj
}
Notes: (1) ωj is a model parameter, F() is the cost function, ∂F(ωj)/∂ωj is the partial derivative of F with respect to ωj, and λ is the learning rate.
(2) If F() is convex, repeated iterations reach the global minimum of the cost function. If F() is not convex, we are likely to get stuck in a local optimum; a simple remedy is to experiment with different initial values of ωj and compare the resulting cost values to detect whether only a local optimum was found.
(3) Gradient descent is not necessarily the best way to compute the weight parameters, but because it is simple and fast, it is used often. For more detail, refer to Andrew Ng's Stanford open course.
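The update rule above can be sketched as a few lines of Python. This is a minimal illustration for a one-dimensional parameter, not a production implementation; the function names and the example cost f(w) = (w − 3)² are chosen here for demonstration only.

```python
def gradient_descent(grad_f, w0, lr=0.1, tol=1e-8, max_iters=10000):
    """Repeat w := w - lr * dF/dw until the step size falls below tol."""
    w = w0
    for _ in range(max_iters):
        step = lr * grad_f(w)
        w = w - step
        if abs(step) < tol:
            break
    return w

# Minimize f(w) = (w - 3)^2, whose gradient is 2*(w - 3).
w_opt = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(round(w_opt, 4))  # converges close to 3.0
```

Because f here is convex, the iteration reaches the global minimum regardless of the starting point w0.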
Adjustment of the learning rate
For gradient descent to perform well, the learning rate must be set within an appropriate range. The learning rate determines how quickly the parameters move toward the optimal value. If the learning rate is too large, the updates are likely to overshoot the optimum; if it is too small, optimization becomes inefficient and the algorithm may take a very long time to converge. The learning rate is therefore critical to the algorithm's performance.
Adjusting the learning rate for data sets of different sizes
The behavior also depends on the cost function F() we choose. When the sum of squared errors (SSE) is used as the cost function, ∂F(ωj)/∂ωj grows as the training set grows, so the learning rate must be set to a correspondingly smaller value.
One way to solve this is to multiply the learning rate λ by 1/n, where n is the number of samples in the training set. Each update then takes the following form:
ωj := ωj − (λ/n)·∂F(ωj)/∂ωj
Refer to Wilson et al., "The general inefficiency of batch training for gradient descent learning".
Another solution is to choose a cost function that is not affected by the number of training samples, such as the mean squared error (MSE).
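The scaling effect described above is easy to verify numerically. The sketch below, using synthetic data for a hypothetical linear model y = w·x, shows that the SSE gradient grows with the sample count n while the MSE gradient (SSE divided by n) stays roughly constant:

```python
import numpy as np

rng = np.random.default_rng(0)

def sse_grad(w, x, y):
    # d/dw sum_i (w*x_i - y_i)^2 = 2 * sum_i (w*x_i - y_i) * x_i
    return 2.0 * np.sum((w * x - y) * x)

def mse_grad(w, x, y):
    # Dividing by n makes the gradient independent of the sample count.
    return sse_grad(w, x, y) / len(x)

for n in (100, 10000):
    x = rng.normal(size=n)
    y = 3.0 * x + rng.normal(scale=0.1, size=n)
    print(n, abs(sse_grad(0.0, x, y)), abs(mse_grad(0.0, x, y)))
```

With SSE, a learning rate tuned on 100 samples would produce steps roughly 100 times too large on 10,000 samples; with MSE the same learning rate remains usable.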
Adjusting the learning rate in each iteration
Adapting the learning rate at every iteration is another good approach. The basic idea is that the farther you are from the optimal value, the faster you should move toward it, i.e., the larger the learning rate should be, and vice versa.
The problem is that we do not actually know where the optimal value is, nor how far from it we are at each iteration.
The workaround is to evaluate the error function with the estimated model parameters at the end of each iteration. If the error decreased relative to the previous iteration, the learning rate can be increased by 5%; if the error increased (meaning the optimum was skipped), reset ωj to the value from the previous iteration and cut the learning rate to 50% of its previous value. This technique is called Bold Driver.
Recommendation: Normalization of input vectors
Normalizing input vectors is common practice in machine learning. In some applications, because distances or feature variances are used, normalization is required: without it, the results are seriously distorted by features that live on very different scales. Normalized inputs also help numerical optimization methods (for example, gradient descent) converge faster and more accurately.
Although several normalization methods exist, [0,1] normalization (also called min-max) and Z-score normalization are the two most widely used:
x_minmax = (x − min(x)) / (max(x) − min(x))
x_zscore = (x − mean(x)) / std(x)
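Both formulas translate directly into NumPy; the small sample array below is made up for illustration:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# [0,1] (min-max) normalization: squeeze the feature into [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: zero mean, unit standard deviation.
x_zscore = (x - x.mean()) / x.std()

print(x_minmax)                 # smallest value maps to 0, largest to 1
print(round(x_zscore.mean(), 10), round(x_zscore.std(), 10))  # 0.0 1.0
```

Min-max is sensitive to outliers (a single extreme value compresses everything else), while Z-score keeps the relative spread, which is one common reason to prefer it for gradient-based optimization.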
Note: this article is a translation of http://blog.datumbox.com/tuning-the-learning-rate-in-gradient-descent/ by Vasilis Vryniotis.
I hope it helps you understand and apply these techniques!