The derivative of a function at a point describes the rate of change of the function near that point (how much dy changes in response to a tiny change dx).
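In symbols this is just the standard definition, nothing beyond the sentence above: for a small change dx,

dy ≈ f'(x)·dx, where f'(x) = lim (dx→0) [f(x + dx) - f(x)] / dx.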
For a multivariate function f(X) (X = {x0, x1, ..., xn}), the rate and direction of change differ along each direction of X; for example, f may change linearly along x0. If we want to know how fast, and in which direction, f(X) changes near X along, say, the x1 direction, how should we describe it?
The answer is the gradient. The gradient is the vector whose components are the partial derivatives of f(X) with respect to each variable, and each partial derivative gives the rate of change of f(X) along that direction. The gradient vector as a whole, which is also the normal vector to the level surfaces of f, is the superposition of these per-direction rates of change. You can think of the unit vectors i, j, k, ... as replacing dx1, dx2, dx3, ... in the total differential.
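Written out, the total differential referred to in the next paragraph, and the gradient obtained from it, are (standard calculus, using the components introduced above):

df = (∂f/∂x1)·dx1 + (∂f/∂x2)·dx2 + ... + (∂f/∂xn)·dxn

∇f = (∂f/∂x1)·i + (∂f/∂x2)·j + (∂f/∂x3)·k + ...

That is, swapping dx1, dx2, dx3, ... for the unit vectors i, j, k, ... turns the total differential into the gradient.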
With the total differential formula above, we can better understand extreme values: why do we often say that the derivative of a function is 0 where it attains an extreme value?
Suppose, in one dimension, we want the minimum of f(x) = x². Differentiating both sides gives df = 2x·dx. When x = 0, the derivative 2x is 0 and the extreme value is attained. Otherwise, if x is positive, we only need to nudge x to the left (dx < 0) to make f(x) smaller; if x is negative, we only need to nudge x to the right (dx > 0) to make f(x) smaller. The adjustment therefore ends at x = 0.
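The same argument in compact form (only restating the paragraph above):

df = 2x·dx

x > 0: choose dx < 0, so df < 0 and f(x) can still be decreased;
x < 0: choose dx > 0, so df < 0 and f(x) can still be decreased;
x = 0: df = 0 for every tiny dx, so no further decrease is possible; this is the minimum.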
For the two-dimensional case, the total differential is df = (∂f/∂x)·dx + (∂f/∂y)·dy. After computing them, the partial derivatives may come out positive or negative, but note that dx can also be positive or negative, and so can dy. As long as at least one partial derivative is not 0, we can adjust the signs of dx and dy (that is, decide how to move x and y) to make the value larger or smaller. Only when both partial derivatives are 0 does df stay 0 no matter how dx and dy are adjusted, and the extreme value is attained.
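One way to make this sign argument precise, and the step that motivates gradient descent below: move against the partial derivatives. For a small step size ε > 0 (a number we are free to choose), set dx = -ε·(∂f/∂x) and dy = -ε·(∂f/∂y). Then

df = (∂f/∂x)·dx + (∂f/∂y)·dy = -ε·[(∂f/∂x)² + (∂f/∂y)²] ≤ 0,

which is strictly negative unless both partial derivatives are 0. So f can always be pushed lower, except at a point where the whole gradient vanishes.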
Since the extreme value of a function can be found by setting its derivative to 0, why do we need the gradient descent algorithm to compute extreme values instead of simply setting the derivative to 0? The reasons are as follows:
Not all functions allow us to solve for the point where the derivative is 0. In practice:
1. The value of the derivative at each point can be computed, but the equation "derivative = 0" cannot be solved directly in closed form.
2. The derivative has no analytic expression. Like a black box, given an input value it returns an output value, but we cannot write down what the mapping looks like. This is the case for neural networks (see the sketch after this list).
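A minimal Python sketch of point 2: treat f as a black box that only returns values, estimate the derivative numerically, and descend anyway. The example function, step size, and starting point are illustrative assumptions, not something from the text.

```python
def f(x):
    # pretend this is a black box: we can evaluate it, but we have no formula for f'
    return (x - 1.3) ** 2 + 0.5

def numeric_derivative(f, x, h=1e-6):
    # central difference approximation: (f(x+h) - f(x-h)) / (2h)
    return (f(x + h) - f(x - h)) / (2 * h)

x = 5.0      # arbitrary starting point
lr = 0.1     # learning rate (how far to move per step)
for _ in range(200):
    x -= lr * numeric_derivative(f, x)

print(x)     # approaches 1.3, the minimiser, without ever solving f'(x) = 0 analytically
```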
Both Newton iteration and gradient descent can compute extreme values. The difference is that each gradient descent iteration is cheaper but more iterations are needed; Newton iteration converges in fewer iterations (the initial value must be chosen reasonably), but it involves division (inverting a matrix, in the matrix case), so each iteration requires much more computation. Which method to use is generally chosen according to the specific situation.
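A one-dimensional sketch of the two update rules; in one dimension the "inverse matrix" of Newton iteration degenerates into dividing by the second derivative. The target function f(x) = e^x - 2x and the constants are illustrative assumptions.

```python
import math

def df(x):                     # first derivative of f(x) = e^x - 2x
    return math.exp(x) - 2.0

def d2f(x):                    # second derivative
    return math.exp(x)

# Gradient descent: each step is cheap, but many steps are needed
x = 3.0
for _ in range(1000):
    x -= 0.05 * df(x)
print("gradient descent:", x)  # approaches ln 2 (about 0.693)

# Newton iteration: far fewer steps, but each one divides by f''(x)
x = 3.0
for _ in range(10):
    x -= df(x) / d2f(x)
print("newton iteration:", x)  # also approaches ln 2, in only a handful of steps
```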
In short, for most functions the equation "derivative = 0" cannot be solved analytically (try exp(x) + (ln(x))² + x⁵ ?!). The gradient descent method is a numerical algorithm: in general, it lets a computer find a good approximate solution.
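For example, a short gradient descent run on the very function mentioned above, f(x) = exp(x) + (ln(x))² + x⁵, whose derivative f'(x) = exp(x) + 2·ln(x)/x + 5x⁴ has no closed-form root. The starting point and learning rate below are illustrative choices.

```python
import math

def df(x):
    # derivative of f(x) = exp(x) + (ln(x))**2 + x**5
    return math.exp(x) + 2 * math.log(x) / x + 5 * x ** 4

x = 0.5                 # must stay positive because of ln(x)
lr = 0.01               # learning rate
for _ in range(2000):
    x -= lr * df(x)

print(x, df(x))         # x is roughly 0.55 and the derivative is close to 0: a good approximate solution
```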
Starting from the Derivative