One, Stochastic gradient descent (Stochastic Gradient Descent)
When the training set is large, ordinary gradient descent (Batch Gradient Descent) is slow, because every update of \(\theta\) requires computing the gradient term by iterating over all m examples in the training set.
Batch gradient descent computes the gradient over all m examples at once and makes one update of \(\theta\) per pass; every update uses the full dataset, and for a convex cost it converges to the global minimum.
Stochastic gradient descent computes the gradient of one example at a time and makes m updates of \(\theta\) per pass; each update uses only the current example, and the parameters end up very close to, but generally not exactly at, the global minimum (see the sketch below).
Typically 1-10 passes over the training set are enough.
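A minimal sketch of the two update rules for linear regression with the squared-error cost, assuming NumPy; X, y, alpha and the pass counts are placeholder inputs, not values from the notes:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, n_passes=100):
    """One update of theta per pass; the gradient sums over all m examples."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_passes):
        grad = X.T @ (X @ theta - y) / m      # uses the entire training set
        theta -= alpha * grad
    return theta

def stochastic_gradient_descent(X, y, alpha=0.01, n_passes=10):
    """m updates of theta per pass; each gradient uses a single example."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_passes):                 # typically 1-10 passes suffice
        for i in np.random.permutation(m):    # shuffle, then sweep the examples
            grad = (X[i] @ theta - y[i]) * X[i]
            theta -= alpha * grad
    return theta
```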
Two, Mini-batch gradient descent (Mini-batch Gradient Descent)
Comparison of the three gradient descent methods:
Mini-batch gradient descent updates \(\theta\) using b examples at a time (b is typically 10, usually in the range 2-100), so one pass over the training set makes \(\lceil \frac{m}{b} \rceil\) updates; it sits between stochastic gradient descent and batch gradient descent.
Mini-batch gradient descent is faster than batch gradient descent because it updates \(\theta\) more frequently, and it can be faster than stochastic gradient descent because the gradient over b examples can be computed with vectorized operations (i.e., matrix multiplication), which a good linear algebra library executes very efficiently.
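A sketch of the corresponding mini-batch update under the same assumptions (NumPy, placeholder X, y); note that the gradient for each batch is a single vectorized matrix product:

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.01, b=10, n_passes=10):
    """ceil(m / b) updates of theta per pass; each gradient is a vectorized
    computation (matrix multiplication) over b examples."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_passes):
        idx = np.random.permutation(m)
        for start in range(0, m, b):          # ceil(m/b) batches per pass
            batch = idx[start:start + b]
            Xb, yb = X[batch], y[batch]
            grad = Xb.T @ (Xb @ theta - yb) / len(batch)
            theta -= alpha * grad
    return theta
```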
Three, Checking convergence of the cost function
Compute \(cost(\theta, (x^{(i)}, y^{(i)})) = \frac{1}{2}(h_\theta(x^{(i)}) - y^{(i)})^2\) just before each update of \(\theta\).
Because stochastic gradient descent updates \(\theta\) on every single example, there is no guarantee that \(cost(\theta, (x^{(i)}, y^{(i)}))\) decreases at each step; it only trends downward while oscillating. So instead of plotting every value, we plot the average of \(cost(\theta, (x^{(i)}, y^{(i)}))\) over, say, the last 1000 examples.
In the figure, the two plots in the top row show fairly normal stochastic gradient descent behaviour; for the lower-left plot, increase the number of examples averaged (1000 -> 5000) and check whether the curve then shows convergence; the lower-right plot is clearly increasing, so choose a smaller learning rate \(\alpha\) or change the features/algorithm.
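A sketch of this monitoring scheme for the linear-regression case above; the window of 1000 examples matches the notes, while alpha and the pass count are placeholders:

```python
import numpy as np

def sgd_with_cost_monitor(X, y, alpha=0.01, n_passes=5, window=1000):
    """Record cost(theta, (x_i, y_i)) just before each update and average it
    over the last `window` examples; plot `averaged_costs` against the number
    of updates to check for convergence."""
    m, n = X.shape
    theta = np.zeros(n)
    recent, averaged_costs = [], []
    for _ in range(n_passes):
        for i in np.random.permutation(m):
            err = X[i] @ theta - y[i]
            recent.append(0.5 * err ** 2)     # cost on this single example
            theta -= alpha * err * X[i]
            if len(recent) == window:
                averaged_costs.append(np.mean(recent))
                recent = []
    return theta, averaged_costs
```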
We can also help the cost converge by decreasing the learning rate dynamically as the number of iterations increases:
\(\alpha = \frac{\text{const1}}{\text{iterationNumber} + \text{const2}}\)
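As a small sketch of this schedule (const1 and const2 are hand-tuned constants; the values below are arbitrary placeholders):

```python
def decaying_alpha(iteration_number, const1=5.0, const2=50.0):
    """Learning rate that shrinks as the iteration number grows."""
    return const1 / (iteration_number + const2)
```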
Four, Online learning
Online learning applies when there is no fixed, pre-collected dataset; a continuous stream of examples feeds the learning model, and \(\theta\) is updated in real time. Its advantages:
1, no need to store a large amount of data locally (each example can be discarded after the update)
2, \(\theta\) changes in real time as the characteristics of the data change
The update rule is essentially the same as in stochastic gradient descent, applied to each incoming example (see the sketch below).
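A minimal sketch of one online-learning step for a logistic-regression model, assuming NumPy; the `stream` generator and the feature layout are hypothetical:

```python
import numpy as np

def online_logistic_update(theta, x, y, alpha=0.1):
    """Update theta from a single streamed example (x, y) and then discard it;
    this is exactly one stochastic-gradient-descent step."""
    h = 1.0 / (1.0 + np.exp(-(x @ theta)))    # predicted probability of y = 1
    return theta - alpha * (h - y) * x

# usage sketch: `stream` yields (features, label) pairs arriving in real time
# theta = np.zeros(n_features)
# for x, y in stream:
#     theta = online_logistic_update(theta, x, y)
```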
Another example of online learning: build features from the user's search keywords and each candidate result, learn the probability that the user clicks each result (predicted click-through rate), and update \(\theta\) in real time according to which results the user actually clicks.
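A sketch of how such a system might score results, reusing the online update above; the feature construction and variable names here are hypothetical, not from the notes:

```python
import numpy as np

def rank_results(theta, candidate_features):
    """Score each candidate result by its predicted click-through rate (CTR)
    under the current theta and return the indices best-first.
    `candidate_features` is an (n_results, n_features) array built from the
    user's search keywords and each result."""
    ctr = 1.0 / (1.0 + np.exp(-(candidate_features @ theta)))
    order = np.argsort(-ctr)
    return order, ctr

# after the user clicks result j (label 1) and skips the others (label 0),
# run one online_logistic_update per shown result to refresh theta.
```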
Machine learning open course notes, week 9: gradient descent algorithms for big data