Gradient ascent and gradient descent of the gradient algorithm

    • Directional derivative

In the definitions of the derivative and the partial derivative, the rate of change of a function is discussed along the positive directions of the coordinate axes. When we ask how fast the function changes in an arbitrary direction, this leads to the definition of the directional derivative: the rate of change of the value of the function at a point along a given direction.

Put more plainly: we want to know not only the rate of change of the function along the coordinate axes (the partial derivatives), but also its rate of change in any other specific direction. The directional derivative is precisely that rate of change.
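As a quick reference (this formula is added here and is not part of the original text), for a differentiable function f and a unit vector u, the directional derivative at a point x can be written as

$$ D_u f(x) = \lim_{h \to 0} \frac{f(x + h\,u) - f(x)}{h} = \nabla f(x) \cdot u. $$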

    • Gradient

The gradient of a function at a point is a vector: its direction is the direction in which the directional derivative is largest, and its modulus equals that maximum directional derivative.
Note the following points:
1) The gradient is a vector.
2) The direction of the gradient is the direction of the maximum directional derivative.
3) The magnitude of the gradient is the value of that maximum directional derivative.
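In symbols (added here for reference), for f(x_1, \dots, x_n):

$$ \nabla f = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right), \qquad D_u f = \nabla f \cdot u = \lVert \nabla f \rVert \cos\theta, $$

which is largest when u points along \nabla f, with maximum value \lVert \nabla f \rVert, matching the three points above.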

    • Gradient descent and gradient ascent

In machine learning algorithms, when a loss function has to be minimized, gradient descent can be used to iteratively find the minimum of the loss function and the corresponding parameter values. Conversely, when a function has to be maximized, the same idea applies with gradient ascent.
Gradient Descent

Several concepts about gradient descent

Algebraic description of gradient descent

Matrix (vectorized) description of gradient descent

Gradient Ascent

Gradient ascent is analyzed in exactly the same way as gradient descent; the only difference is that the minus sign in the θ update is changed to a plus sign.
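In symbols (a sketch added here, with J the objective and \alpha the step size), gradient descent updates

$$ \theta := \theta - \alpha \nabla J(\theta), $$

while gradient ascent updates

$$ \theta := \theta + \alpha \nabla J(\theta). $$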

Algorithm optimization for gradient descent

    1. Choosing the step size (learning rate). In the algorithm description above the step size was taken to be 1, but the value that works in practice depends on the data. You can try several values, from large to small, run the algorithm with each, and watch how the iterations behave: if the loss function keeps decreasing, the value is workable; otherwise the step size needs to be adjusted. As noted earlier, a step size that is too large makes the iterations move too fast and may even skip over the optimal solution, while a step size that is too small makes the iterations very slow and the algorithm may take a long time to finish. So a good step size is usually found only after running the algorithm several times.

    2. Choosing the initial values of the parameters. Different initial values may lead to different minima, so gradient descent in general finds only a local minimum; of course, if the loss function is convex, the result is guaranteed to be the global optimum. Because of this risk of local optima, run the algorithm several times with different initial values, compare the smallest loss reached in each run, and keep the initial values that minimize the loss function.

    3. Normalization. Because different features of the samples have different value ranges, the iterations can become slow. To reduce the influence of feature scale, the feature data can be standardized: for each feature x, compute its mean x̄ and standard deviation std(x), and then transform it to (x − x̄) / std(x), as in the sketch after this list.

The transformed feature has expectation 0 and variance 1, which can greatly reduce the number of iterations needed.
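A minimal sketch of this standardization with NumPy (the function and array names here are illustrative, not from the original article):

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0              # avoid division by zero for constant features
    return (X - mean) / std, mean, std

# Example: features on very different scales
X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 1000.0]])
X_scaled, mu, sigma = standardize(X)
print(X_scaled.mean(axis=0))         # ~[0, 0]
print(X_scaled.std(axis=0))          # ~[1, 1]
```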

Batch gradient descent, stochastic gradient descent and mini-batch gradient descent

In the machine learning field, gradient descent algorithms are commonly divided into three kinds:

    • Batch gradient descent algorithm (BGD, batch gradient descent)
    • Stochastic gradient descent algorithm (SGD, stochastic gradient descent)
    • Mini-batch gradient descent algorithm (MBGD, mini-batch gradient descent)

Batch gradient descent algorithm

BGD is the most basic form of gradient descent: every iteration uses the entire training set. Its weight-update formula (written with θ rather than θi) is shown below.
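The original rendering of the formula appears to have been lost; assuming the usual linear-regression setting with hypothesis h_θ(x), it would typically read

$$ \theta_j := \theta_j - \alpha \sum_{i=1}^{M} \left( h_\theta\big(x^{(i)}\big) - y^{(i)} \right) x_j^{(i)} \qquad (1) $$

where \alpha is the step size (a 1/M averaging factor is sometimes folded into \alpha). This is presumably the formula (1) referred to later in the article.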

Here M denotes the total number of samples: the sum runs from the first sample to the last.

Characteristics:

    • Can obtain the global optimal solution (for a convex loss) and is easy to parallelize
    • Training is slow when the number of samples is large
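A minimal BGD sketch for linear regression with a squared-error loss (the function and variable names are illustrative, not from the original article):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Full-batch gradient descent for linear regression (squared-error loss)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        error = X @ theta - y        # residuals over ALL m samples each iteration
        grad = X.T @ error / m       # gradient of the mean squared error
        theta -= alpha * grad        # minus sign; a plus sign here gives gradient ascent
    return theta

# Tiny usage example; the first column of ones models the intercept
X = np.c_[np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])]
y = np.array([3.0, 5.0, 7.0, 9.0])   # generated by y = 1 + 2x
print(batch_gradient_descent(X, y, alpha=0.05, n_iters=5000))  # approximately [1, 2]
```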

Stochastic gradient descent algorithm

The idea of SGD is to update the parameters using a single sample at a time; that is, M in formula (1) is 1. Each update uses only one sample, and many such updates are performed. When the sample size is very large, a good solution may be reached after using only a subset of the samples.
However, one problem with SGD is that it is noisier than BGD, so not every iteration moves toward the overall optimum.

Characteristics:

    • Fast training speed
    • Lower accuracy: the result is not necessarily the optimal solution, and it is not easy to parallelize
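A corresponding SGD sketch under the same assumptions (one sample per update; the names are again illustrative):

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=50, seed=0):
    """SGD for linear regression: each parameter update uses a single sample."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in rng.permutation(m):       # visit the samples in random order
            error = X[i] @ theta - y[i]    # residual of ONE sample
            theta -= alpha * error * X[i]  # noisy single-sample gradient step
    return theta
```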

Mini-batch gradient descent algorithm

The idea of MBGD is to update the parameters using a subset of the samples at a time; that is, the value of M in formula (1) is greater than 1 and less than the total number of samples.

Compared with stochastic gradient descent, mini-batch gradient descent reduces the variance of the parameter updates, which makes the updates more stable. Compared with batch gradient descent, it speeds up each learning step, and it avoids memory bottlenecks, so the computation can still be carried out efficiently with matrix operations. In general, each update randomly selects between 50 and 256 samples, but the batch size should be chosen for the specific problem; in practice you can run several experiments and pick a sample count for which both the update speed and the number of updates are acceptable. Mini-batch gradient descent can be guaranteed to converge and is commonly used in neural networks.
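A mini-batch sketch under the same assumptions (batch_size plays the role of M in formula (1); batch_size=1 recovers SGD and batch_size=m recovers BGD):

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.01, batch_size=32, n_epochs=100, seed=0):
    """Mini-batch gradient descent for linear regression."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        idx = rng.permutation(m)                    # shuffle once per epoch
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]   # a subset of the samples
            error = X[batch] @ theta - y[batch]
            grad = X[batch].T @ error / len(batch)  # gradient over the mini-batch
            theta -= alpha * grad
    return theta
```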

Additional notes

When the sample size is small, batch gradient descent can be used. When the sample size is large, or the samples arrive online, stochastic gradient descent or mini-batch gradient descent is the better choice.

Besides gradient descent, the unconstrained optimization algorithms used in machine learning include the least squares method mentioned earlier, as well as Newton's method and quasi-Newton methods.

Compared with the least squares method, gradient descent requires choosing a step size, while least squares does not. Gradient descent is an iterative solution, while least squares gives an analytic (closed-form) solution. If the sample size is not very large and an analytic solution exists, least squares has the advantage over gradient descent and is very fast to compute. However, when the sample size is large, solving analytically with least squares is difficult or very slow, because a very large matrix has to be inverted, and the iterative gradient descent algorithm has the advantage.
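For reference (added here, assuming the same linear-regression setting as above), the least-squares analytic solution is the normal equation

$$ \theta = (X^{\mathsf T} X)^{-1} X^{\mathsf T} y, $$

whose cost is dominated by forming and inverting X^{\mathsf T} X, which is why it becomes slow for very large problems.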

Compared with Newton's method and quasi-Newton methods, gradient descent is likewise an iterative solution, but gradient descent uses only the gradient, while Newton and quasi-Newton methods solve using the inverse (or pseudo-inverse) of the second-order Hessian matrix. As a result, Newton and quasi-Newton methods converge in fewer iterations, but each iteration takes longer than a gradient descent step.
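In symbols (added for reference), where gradient descent steps along -\nabla J(\theta), Newton's method updates

$$ \theta := \theta - H^{-1} \nabla J(\theta), $$

with H the Hessian (the matrix of second derivatives) of J; quasi-Newton methods approximate H^{-1} rather than computing it exactly.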

