I. How to learn from a large-scale data set?
When the training set is very large, we can first take a small subsample, say m = 1000, train the model on it, and plot the corresponding learning curve. If the learning curve indicates high bias, keep tuning the model on the existing subsample, following the high-bias strategies from Section 6; if it indicates high variance, adding more training examples will help.
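As an illustration of this diagnostic, here is a minimal sketch that trains a linear regression on growing subsets of a 1,000-example subsample and plots training versus cross-validation error. The synthetic data, the use of scikit-learn's LinearRegression, and the 2,000-example validation split are assumptions made for this example, not part of the course material.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for a large training set (an assumption for this sketch).
rng = np.random.default_rng(0)
X_full = rng.normal(size=(100_000, 5))
y_full = X_full @ rng.normal(size=5) + rng.normal(scale=0.5, size=100_000)
X_cv, y_cv = X_full[-2000:], y_full[-2000:]      # hold-out split for validation

# Work with a small subsample (m = 1000) instead of the full data set.
X_sub, y_sub = X_full[:1000], y_full[:1000]

sizes = list(range(100, 1001, 100))
train_err, cv_err = [], []
for m in sizes:
    model = LinearRegression().fit(X_sub[:m], y_sub[:m])
    train_err.append(mean_squared_error(y_sub[:m], model.predict(X_sub[:m])))
    cv_err.append(mean_squared_error(y_cv, model.predict(X_cv)))

# A large, persistent gap between the curves suggests high variance;
# two high, flat curves close together suggest high bias.
plt.plot(sizes, train_err, label="training error")
plt.plot(sizes, cv_err, label="cross-validation error")
plt.xlabel("training set size m")
plt.ylabel("mean squared error")
plt.legend()
plt.show()
```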
II. Stochastic Gradient Descent (SGD)
When we discussed optimizing the cost function earlier, we used batch gradient descent, which computes the gradient over all training samples at every iteration. For a large-scale set with billions of samples, the cost per iteration is far too high, and since many iterations are needed, the total amount of computation is huge and convergence is slow.
Stochastic gradient descent first shuffles the sample order and then traverses the sample set: each sample corresponds to one iteration and one parameter update, so the computation per update is small. One to ten passes over the whole sample set are usually sufficient, which makes the algorithm much faster.
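As a concrete sketch, the following stochastic gradient descent loop for linear regression (squared-error cost) shuffles the samples and updates the parameters once per example. The learning rate, number of epochs, and synthetic data are illustrative assumptions, not values from the course.

```python
import numpy as np

def sgd_linear_regression(X, y, alpha=0.01, epochs=5):
    """Stochastic gradient descent for linear regression (squared-error cost)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):               # 1-10 passes over the data are usually enough
        order = np.random.permutation(m)  # shuffle the sample order first
        for i in order:
            error = X[i] @ theta - y[i]   # prediction error on a single example
            theta -= alpha * error * X[i] # update the parameters using just this example
    return theta

# Illustrative usage on synthetic data
rng = np.random.default_rng(1)
X = np.c_[np.ones(10_000), rng.normal(size=(10_000, 2))]  # add intercept column
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta + rng.normal(scale=0.1, size=10_000)
print(sgd_linear_regression(X, y))        # should end up close to true_theta
```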
III. Mini-Batch Gradient Descent
Mini-batch gradient descent sits between batch gradient descent and stochastic gradient descent: each update uses b samples, where b is often 10 and typically somewhere between 2 and 100. If the implementation is vectorized, the b samples can be processed simultaneously, so mini-batch can be more efficient than stochastic gradient descent, which handles only one sample at a time and cannot exploit this parallelism.
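Under the same assumptions as the sketch above (linear regression, squared-error cost, illustrative learning rate), a mini-batch version with b = 10 vectorizes the gradient over each batch:

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.01, b=10, epochs=5):
    """Mini-batch gradient descent for linear regression with batch size b."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        order = np.random.permutation(m)
        for start in range(0, m, b):
            idx = order[start:start + b]
            Xb, yb = X[idx], y[idx]
            # Vectorized gradient over the b samples in this batch
            grad = Xb.T @ (Xb @ theta - yb) / len(idx)
            theta -= alpha * grad
    return theta
```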
IV. Convergence of Stochastic Gradient Descent
Unlike batch gradient descent, stochastic gradient descent does not necessarily converge to the global minimum; instead it oscillates in a neighborhood around it, which is acceptable as long as it stays close to the global minimum. In principle the learning rate can be decayed dynamically, for example α = constant1 / (number of iterations + constant2), so that α shrinks as the iterations proceed and the algorithm settles closer to the global minimum. In practice, because constant1 and constant2 are hard to choose, α is often simply kept fixed.
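For completeness, a decaying schedule of the form α = constant1 / (iteration number + constant2) might look like the toy function below; the constants are arbitrary placeholders, not recommended values.

```python
def decayed_alpha(iteration, const1=5.0, const2=50.0):
    """Learning rate that shrinks as iterations grow, so SGD settles nearer the minimum."""
    return const1 / (iteration + const2)

# Example: alpha starts near 0.1 and decreases toward 0 as iterations accumulate
for t in (0, 100, 1000, 10_000):
    print(t, decayed_alpha(t))
```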
How do we judge whether the model is converging as the iterations proceed? Every 1,000 (or 5,000) examples, compute the cost on those examples just before their updates, average it, and plot the result. The trend of this plot shows whether the model is converging during the iterations; if, as in the fourth plot from the lecture, the trend is rising, the learning rate α should be reduced.
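One possible sketch of this monitoring step: assume the per-example costs were recorded (just before each update) into a list `cost_history` inside the SGD loop, then average them over windows of 1,000 examples and plot the trend.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_sgd_convergence(cost_history, window=1000):
    """Average the per-example cost over consecutive windows and plot the trend."""
    costs = np.asarray(cost_history)
    n_windows = len(costs) // window
    averaged = costs[:n_windows * window].reshape(n_windows, window).mean(axis=1)
    plt.plot(np.arange(1, n_windows + 1) * window, averaged)
    plt.xlabel("number of examples processed")
    plt.ylabel(f"average cost over last {window} examples")
    plt.show()
    # If the curve trends upward, the algorithm is diverging: decrease alpha.
```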
Coursera Online Learning --- Section 10: Large Scale Machine Learning