Coursera Machine Learning notes (eight)

Topics covered this week: large-scale machine learning, an application example, and a summary.

(i) Stochastic gradient descent

With a very large training set, ordinary batch gradient descent must compute the sum of squared errors over the entire training set on every step; if the method needs, say, 20 iterations, the computational cost is enormous.

First, determine whether a large training set is really necessary. When we do need one, we can try replacing batch gradient descent with stochastic gradient descent.

In stochastic gradient descent, the cost function is defined for a single training example: cost(θ, (x(i), y(i))) = (1/2)(hθ(x(i)) − y(i))².

The stochastic gradient descent algorithm is as follows: first randomly shuffle the training set; then, repeatedly, for i = 1 to m, update every parameter θj := θj − α(hθ(x(i)) − y(i))·xj(i).

Stochastic gradient descent updates the parameters θ after every single example, without summing over the whole training set first, so it starts making progress before batch gradient descent has even finished one iteration. The problem with such an algorithm is that not every step moves in the "right" direction. So although it gradually drifts toward the global minimum, it may never settle exactly at the minimum, instead hovering around it.
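As a rough illustration (not the course's own code), a minimal NumPy sketch of stochastic gradient descent for linear regression might look like this; the learning rate, epoch count, and the assumption that X already contains a bias column are all illustrative choices:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, epochs=10):
    """Linear regression via SGD: theta is updated after every single example.
    X is assumed to already include a column of ones for the intercept."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        # Randomly shuffle the training set before each pass.
        for i in np.random.permutation(m):
            error = X[i] @ theta - y[i]      # h_theta(x(i)) - y(i)
            theta -= alpha * error * X[i]    # one update per training example
    return theta
```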

Mini-batch gradient descent is an algorithm between batch gradient descent and stochastic gradient descent: each update computes the gradient over a fixed, small number b of training examples and then changes θ once.
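A corresponding sketch of mini-batch gradient descent, under the same assumptions (b, alpha, and epochs are illustrative):

```python
import numpy as np

def minibatch_gradient_descent(X, y, b=10, alpha=0.01, epochs=10):
    """Mini-batch gradient descent: each update averages the gradient over b examples."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        order = np.random.permutation(m)
        for start in range(0, m, b):
            idx = order[start:start + b]               # the next b examples
            errors = X[idx] @ theta - y[idx]
            theta -= alpha * (X[idx].T @ errors) / len(idx)
    return theta
```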

In batch gradient descent, we can plot the cost function J against the number of iterations and use the chart to judge whether gradient descent has converged. With a large training set this is unrealistic, because computing J is too expensive. In stochastic gradient descent, we instead compute the cost of the current example just before each update of θ; then, every x iterations, we average the cost over those last x training examples and plot these averages against the number of x-iteration windows.

When we draw such a chart, we may get a curve that is noisy but not clearly decreasing (like the blue line in the lecture's upper-left figure). We can increase x to make the curve smoother, and perhaps then see a downward trend (the red line); or the curve may remain noisy and never fall (the magenta line), in which case the model itself may have a problem. If instead the curve keeps rising (like the lecture's lower-right figure), we probably need to choose a smaller learning rate α.
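A sketch of this monitoring idea, with an assumed window size x = 1000 and everything else as in the earlier snippets:

```python
import numpy as np

def sgd_with_cost_monitoring(X, y, alpha=0.01, window=1000):
    """Record cost(theta, (x(i), y(i))) just before each update and average it
    over every `window` updates; the averages are what gets plotted."""
    m, n = X.shape
    theta = np.zeros(n)
    recent, averages = [], []
    for i in np.random.permutation(m):
        error = X[i] @ theta - y[i]
        recent.append(0.5 * error ** 2)      # cost of this single example
        theta -= alpha * error * X[i]
        if len(recent) == window:
            averages.append(np.mean(recent))
            recent = []
    return theta, averages                   # plot `averages` against the window index
```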

We can also make the learning rate decrease as the number of iterations grows, for example α = const1 / (iterationNumber + const2).

But usually we do not need to do this to get good results; the extra effort spent tuning these constants for α is often not worthwhile.
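For completeness, such a decaying schedule is a one-liner; the two constants below are purely illustrative:

```python
def decaying_alpha(iteration, const1=1.0, const2=50.0):
    """Learning rate that shrinks over time: alpha = const1 / (iteration + const2)."""
    return const1 / (iteration + const2)
```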

(ii) Online learning

An online learning algorithm learns from a stream of data rather than from a static offline dataset. Many websites have a continuous flow of users, and for each user the site wants the algorithm to keep learning smoothly without having to store the data in a database.

Once the learning algorithm has processed an example, we discard it and no longer store it.

The advantage of online learning is that the algorithm adapts well to users' changing preferences: the model is updated continually to match users' current behavior.
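A sketch of one online-learning step, here using logistic regression (the feature vector x, label y, and learning rate are placeholders; `event_stream` is a hypothetical source of user data):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_update(theta, x, y, alpha=0.1):
    """Consume one (features, label) pair from the stream and update theta."""
    error = sigmoid(theta @ x) - y       # logistic-regression prediction error
    return theta - alpha * error * x

# Hypothetical usage: learn continuously from user events, never storing them.
# for x, y in event_stream():
#     theta = online_update(theta, x, y)
```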

MapReduce and data parallelism

Solving for the optimal parameters on a large dataset with batch gradient descent requires looping over the entire training set, computing the partial derivatives and the cost for every example, and summing them, which is very expensive. Instead, we can split the dataset across several computers, have each computer work on its portion, and then combine the partial results. This is MapReduce.
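A toy sketch of the idea on a single machine, using worker processes in place of separate computers (the worker count and the use of `multiprocessing` are illustrative):

```python
import numpy as np
from multiprocessing import Pool

def partial_gradient(args):
    """'Map' step: one worker sums the linear-regression gradient over its slice."""
    X_part, y_part, theta = args
    errors = X_part @ theta - y_part
    return X_part.T @ errors

def mapreduce_gradient(X, y, theta, n_workers=4):
    """'Reduce' step: combine the partial sums from all workers into one gradient."""
    chunks = list(zip(np.array_split(X, n_workers), np.array_split(y, n_workers)))
    with Pool(n_workers) as pool:  # call this from under `if __name__ == "__main__":`
        partials = pool.map(partial_gradient, [(Xc, yc, theta) for Xc, yc in chunks])
    return sum(partials) / len(y)  # same result as the full-batch gradient
```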

For more detail on MapReduce, see Hadoop and Spark.

(iii) Ceiling analysis

In a machine learning application there are usually several stages in the pipeline leading to the final prediction. Which stage is most worth spending time and effort on? This is where ceiling analysis comes in.

In a word recognition (photo OCR) application, the pipeline is divided into the following steps: text detection, character segmentation, and character recognition.

In ceiling analysis, for each step in turn we manually provide 100% correct output for that step and for the steps before it, and see how much the final accuracy of the application improves.

From the resulting numbers we can see which step is worth devoting time and effort to; in the course's example, the text detection step gives the largest improvement.
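A small sketch of the bookkeeping behind ceiling analysis; the accuracy numbers below are placeholders in the spirit of the course's photo OCR example, not measured results:

```python
# Overall accuracy after manually supplying perfect output for each stage in turn.
accuracies = [
    ("baseline (no stage perfected)",     0.72),
    ("perfect text detection",            0.89),
    ("+ perfect character segmentation",  0.90),
    ("+ perfect character recognition",   1.00),
]

previous = None
for stage, acc in accuracies:
    gain = 0.0 if previous is None else acc - previous
    print(f"{stage:40s} accuracy={acc:.2f}  gain={gain:+.2f}")
    previous = acc
# The stage whose perfect output yields the largest gain is the one worth improving.
```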
