Hello everyone, I am mac Jiang. Today I would like to share my solutions to the quiz for Coursera - Stanford University - Machine Learning - Week 10: Large Scale Machine Learning. Although my answers passed the system check, my analysis is not necessarily correct; if you find a mistake or have a better idea, please leave a comment. I hope this post is helpful to your study!
In this unit, Professor Andrew Ng mainly covers five topics:
1. Stochastic gradient descent (SGD), and how it differs from batch gradient descent. Stochastic gradient descent uses the partial derivative of only one sample per iteration, while batch gradient descent computes the partial derivatives over all samples in every iteration; when the sample size m is very large, batch gradient descent becomes very slow. (A small code sketch comparing the variants follows this list.)
2. Mini-batch gradient descent. Each iteration updates the parameter theta using b sample points, so each step is cheaper than a batch gradient descent step, and it converges faster than stochastic gradient descent.
3. How to judge whether stochastic gradient descent is converging. The method is to plot the cost function against the number of iterations. However, because stochastic gradient descent updates theta with the partial derivative of only one sample per iteration, the process is not stable, so we compute the average of cost(theta, (x(i), y(i))) over every 1000 iterations and plot that instead.
4. Online learning. Every time a user visits the page, they leave some information; we use this information to correct the existing parameters theta, and after the correction we discard the information (it does not need to be saved, since new data keeps arriving). One benefit of online learning is that if a user's preferences change, correcting theta with the new user information gradually adjusts theta in the new direction.
5. Map-reduce. When one machine cannot handle the current amount of data, multiple machines can be run in parallel: split the data into equal parts, let each machine learn on one part, and then combine the results from all the machines into the final result. Map-reduce also has to take network latency, data transmission, and other overhead into account.
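To make the difference between these gradient descent variants concrete, here is a minimal sketch of the three update rules for linear regression in Python/NumPy. This is my own illustration under the usual squared-error cost, not code from the course, and the function names are mine:

```python
import numpy as np

def batch_gradient_step(theta, X, y, alpha):
    # Batch gradient descent: every step uses all m samples.
    m = len(y)
    grad = X.T @ (X @ theta - y) / m
    return theta - alpha * grad

def stochastic_gradient_step(theta, x_i, y_i, alpha):
    # Stochastic gradient descent: every step uses a single sample (x_i, y_i).
    grad = (x_i @ theta - y_i) * x_i
    return theta - alpha * grad

def minibatch_gradient_step(theta, X_b, y_b, alpha):
    # Mini-batch gradient descent: every step uses b samples (1 < b < m).
    b = len(y_b)
    grad = X_b.T @ (X_b @ theta - y_b) / b
    return theta - alpha * grad
```

A batch step touches all m samples, a stochastic step touches one, and a mini-batch step touches b of them, which is why the cost per iteration differs so much when m is huge.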
OK, that is the summary; now let's go through the exercises!
1. The first question
(1) Question: Suppose you train a logistic regression classifier with stochastic gradient descent. Every 500 iterations you plot the average of cost(theta, (x(i), y(i))) over the last 500 examples (vertical axis) against the number of iterations (horizontal axis), and you find that the plot gradually increases with the number of iterations. What should you do?
1. Use a smaller learning rate alpha
2. Try averaging over 1000 examples instead of 500
3. Try a larger learning rate alpha
4. This is not a problem; this is exactly what we expect when using stochastic gradient descent
(2) Analysis: The plot increases with the number of iterations, which means the iterations are diverging!
1. Correct. Divergence may be caused by alpha being too large, so it is worth trying a smaller alpha.
2. Incorrect. No matter how many examples we average over, the curve is still rising, i.e., still diverging; a larger window only smooths it.
3. Incorrect. It is already diverging; increasing alpha would make it diverge even more.
4. Incorrect. We want the curve to converge, not to increase.
(3) Answer: 1
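The monitoring trick this question relies on is easy to sketch in code: run stochastic gradient descent, keep the per-example cost, and average it over the last 500 (or 1000) examples before plotting. A minimal sketch, assuming a linear regression cost and NumPy; the function name and defaults are my own:

```python
import numpy as np

def sgd_with_cost_monitoring(X, y, alpha=0.01, window=500):
    # One pass of stochastic gradient descent over (X, y) for linear
    # regression, recording the average per-example cost over every
    # `window` examples so it can be plotted afterwards.
    m, n = X.shape
    theta = np.zeros(n)
    recent_costs, averaged_costs = [], []
    for i in np.random.permutation(m):           # shuffle, then sweep once
        err = X[i] @ theta - y[i]
        recent_costs.append(0.5 * err ** 2)       # cost(theta, (x(i), y(i)))
        theta = theta - alpha * err * X[i]        # update with this one sample
        if len(recent_costs) == window:
            averaged_costs.append(np.mean(recent_costs))
            recent_costs = []
    # Plot averaged_costs against the iteration count to check convergence.
    return theta, averaged_costs
```

If the plotted averages trend upward, the fix is a smaller alpha; a larger averaging window (1000 instead of 500) only smooths the curve and does not cure divergence.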
2. The second question
(1) Question: Which of the following statements about stochastic gradient descent are correct? Check all that apply.
1. If you train a linear regression classifier with stochastic gradient descent using the cost function shown in the question, then the value of the cost function J is guaranteed to decrease on every iteration.
2. Stochastic gradient descent is suited to problems with a small number of samples; on this kind of problem it usually works better than batch gradient descent.
3. One advantage of stochastic gradient descent is that each iteration modifies theta using only a single sample, whereas batch gradient descent has to compute over all the samples.
4. Stochastic gradient descent only needs to process one training example at a time.
(2) Analysis: 1. Incorrect. Stochastic gradient descent does not guarantee that J decreases after every iteration; it is quite likely to increase on some iterations.
2. Incorrect. Stochastic gradient descent was proposed for the case where the sample size m is so large that batch gradient descent is too slow; it is suited to large-sample problems.
3. Correct.
4. Correct.
(3) Answer: 3,4
3. The third question
(1) Question: Which of the following statements about online learning are correct? Check all that apply.
1. In the online learning algorithm discussed in class, we repeatedly receive training examples; for each example we take a single stochastic gradient descent step and then move on to the next example.
2. One disadvantage of online learning is that it requires a large amount of storage space to keep the training data.
3. In online learning, whenever we get a new example (x, y), we perform one learning step on it and then discard it before training on the next example.
4. One advantage of online learning is that there is no need to choose a learning rate alpha.
(2) Analysis: 1. Correct. Each example only needs one learning step and can then be discarded before the next one arrives.
2. Incorrect. The main feature of online learning is that once an example has been used for training it can be discarded without being saved, because more data keeps coming in later.
3. Correct.
4. Incorrect. A gradient-descent-based method still requires choosing the learning rate alpha.
(3) Answer: 1,3
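A minimal sketch of the online learning loop this question describes, assuming a logistic regression model; the simulated stream of user interactions and the function names are my own illustration, not from the course:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_logistic_update(theta, x, y, alpha):
    # One gradient step on a single example (x, y); the example is then
    # discarded -- online learning stores no training data.
    grad = (sigmoid(x @ theta) - y) * x
    return theta - alpha * grad

# Simulated stream of user interactions (purely illustrative data).
rng = np.random.default_rng(0)
theta = np.zeros(3)
for _ in range(10000):
    x = rng.normal(size=3)              # features left by one page visit
    y = float(x[0] + 0.5 * x[1] > 0)    # whether the user clicked
    theta = online_logistic_update(theta, x, y, alpha=0.1)
    # (x, y) is not saved; if user preferences drift, theta follows them.
print(theta)
```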
4. The fourth question
(1) Question: You have a large dataset. Which of the following algorithms can be parallelized across multiple machines with map-reduce?
1. Training a neural network with batch gradient descent
2. Online learning, where you repeatedly get a single training example (x, y) and learn from that example
3. Linear regression with batch gradient descent
4. Logistic regression with stochastic gradient descent
(2) Analysis: 1. Correct. Computing the cost function and gradients of a neural network is quite expensive, and it can be accelerated with map-reduce.
2. Incorrect. Online learning processes only one example at a time, so there is nothing to parallelize with map-reduce.
3. Correct. Each iteration of batch gradient descent has to be computed over all the data, which is expensive; the samples can be split into N parts, each computed on a separate machine, to speed things up.
4. Incorrect. Stochastic gradient descent uses only one sample per iteration, so map-reduce parallelization is not needed.
(3) Answer: 1,3
5. The fifth question
(1) Question: Which of the following statements about map-reduce are correct? Check all that apply.
1. If you run map-reduce on N machines, we will always be at least N times faster than on a single machine.
2. If you have only one single-core computer, then map-reduce brings no benefit.
3. When using map-reduce with gradient descent, we usually use a single "master" machine to combine the gradient values computed by the other machines and obtain the parameters for that iteration.
4. Because of network latency and other overhead, if you run map-reduce on N computers, the speedup is usually less than N times.
(2) Analysis: 1. Incorrect. The network has latency, establishing connections takes time, and transferring data takes time, so no matter how you compute it the speedup cannot reach a full N times.
2. Correct. Map-reduce relies on parallelism across multiple computers or across the multiple cores of a single computer, so a single single-core machine gains nothing.
3. Correct.
4. Correct.
(3) Answer: 2,3,4
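To illustrate the split-and-combine pattern behind these answers, here is a sketch of one batch gradient descent step organised map-reduce style: the data is split into equal parts, each part's gradient sum is computed separately (on a real cluster, by a separate machine), and a master combines the partial sums. The function names and the run-on-one-machine simulation are my own:

```python
import numpy as np

def partial_gradient(theta, X_part, y_part):
    # "Map" step: each machine sums the linear regression gradient
    # over its own slice of the data.
    return X_part.T @ (X_part @ theta - y_part)

def mapreduce_batch_step(theta, X, y, alpha, n_machines=4):
    # Split the m samples into equal parts, one part per machine.
    X_parts = np.array_split(X, n_machines)
    y_parts = np.array_split(y, n_machines)
    # On a real cluster each of these calls would run on a different
    # machine in parallel; here they run sequentially for illustration.
    partial_sums = [partial_gradient(theta, Xp, yp)
                    for Xp, yp in zip(X_parts, y_parts)]
    # "Reduce" step: the master machine combines the partial sums.
    grad = sum(partial_sums) / len(y)
    return theta - alpha * grad
```

In a real deployment the partial gradients travel over the network, which is exactly the latency and data-transmission overhead that keeps the speedup below N.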