1. Disadvantages of the gradient descent method
Because the processed data has different dimensions and units, the scales of different features can differ greatly, and the contours of the objective function become elongated ellipses, as shown in the figure below (left). As a result, when minimizing the objective function to find the optimal solution, the gradient descent path zigzags and the number of iterations required becomes very large, which seriously hurts the efficiency of the algorithm.
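To make the zigzag behavior concrete, here is a minimal sketch (not from the original article) that runs plain gradient descent on a deliberately badly scaled quadratic; the objective, starting point, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

# Hypothetical badly scaled objective: J(theta) = 0.5 * (theta1^2 + 100 * theta2^2).
# Its contours are elongated ellipses, so plain gradient descent zigzags.
def grad(theta):
    return np.array([theta[0], 100.0 * theta[1]])

theta = np.array([10.0, 1.0])   # arbitrary starting point
lr = 0.018                      # step size chosen to expose the oscillation
for _ in range(50):
    theta = theta - lr * grad(theta)

print(theta)  # crawls along the shallow axis while oscillating along the steep one
```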
To solve this problem, the data can be normalized, for example with min-max normalization, which maps the input data to the range [0, 1]:
$$X^{*} = \frac{x - \min}{\max - \min}$$
As shown in the figure above (right), the minimum of the objective function can then be reached in just a few iterations, which greatly improves the efficiency of the algorithm.
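As a minimal sketch of the formula above, the following applies column-wise min-max scaling with NumPy; the sample data is made up for illustration.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each feature (column) of X to [0, 1]: (x - min) / (max - min)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # Assumes max > min for every column; a constant column would divide by zero.
    return (X - col_min) / (col_max - col_min)

# Made-up features with very different scales, e.g. age and income.
X = np.array([[25, 30000],
              [40, 90000],
              [55, 60000]])
print(min_max_normalize(X))
```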
2. Newton's Method
Data normalization tackles the excessive number of gradient descent iterations from the angle of data preprocessing. If we instead think from the angle of optimizing the objective function itself, we can replace gradient descent with Newton's method to speed up the search for the optimal parameter values.
The first derivative of a function $J(\theta)$ of $n$ variables is:

$$\frac{\partial J}{\partial \theta} = \left[ \frac{\partial J}{\partial \theta_1}, \frac{\partial J}{\partial \theta_2}, \ldots, \frac{\partial J}{\partial \theta_n} \right]$$
The second derivative (also known as the Hessian matrix) is:

$$\frac{\partial^{2} J}{\partial \theta^{2}} = \left[ \frac{\partial^{2} J}{\partial \theta_i \, \partial \theta_j} \right]_{n \times n}$$
Expanding the objective function $J(\theta)$ in a Taylor series up to the second derivative gives:
$$J(\theta + \Delta\theta) = J(\theta) + \Delta\theta^{T} \frac{\partial J(\theta)}{\partial \theta} + \frac{1}{2} \Delta\theta^{T} \frac{\partial^{2} J(\theta)}{\partial \theta^{2}} \Delta\theta$$
Treating $J(\theta + \Delta\theta)$ as a function of $\Delta\theta$, its minimum is obtained where its partial derivative with respect to $\Delta\theta$ equals zero:
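To make the resulting update concrete, below is a minimal, illustrative sketch (not the article's code) of a Newton iteration: at each step the Hessian system is solved for the step $\Delta\theta$ that minimizes the quadratic approximation. The objective, starting point, and iteration count are assumptions, reusing the badly scaled quadratic from the first section.

```python
import numpy as np

# Illustrative quadratic objective: J(theta) = 0.5 * (theta1^2 + 100 * theta2^2).
def grad(theta):
    return np.array([theta[0], 100.0 * theta[1]])

def hessian(theta):
    return np.array([[1.0, 0.0],
                     [0.0, 100.0]])

theta = np.array([10.0, 1.0])
for _ in range(5):
    # Newton step: solve H * delta = -grad rather than inverting H explicitly.
    delta = np.linalg.solve(hessian(theta), -grad(theta))
    theta = theta + delta

print(theta)  # for a quadratic objective, a single step already reaches the minimum
```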