Stanford Machine Learning, Lecture 6: How to Choose a Machine Learning Method

Source: Internet
Author: User

Original: http://blog.csdn.net/abcjennifer/article/details/7797502

This column (Machine Learning) covers linear regression with one variable, linear regression with multiple variables, the Octave tutorial, logistic regression, regularization, neural networks, machine learning system design, SVMs (support vector machines), clustering, dimensionality reduction, anomaly detection, large-scale machine learning, and other chapters. All of the content comes from Professor Andrew Ng's Stanford public course on machine learning (https://class.coursera.org/ml/class/index).

Lecture 6: How to Choose a Machine Learning Method (Advice for Applying Machine Learning)

===============================

☆ Candidate machine learning methods

☆ Evaluating a hypothesis

☆ Model selection and the training/validation/test sets

☆ Diagnosing bias vs. variance

☆ Regularization and bias/variance

☆ Learning curves: when does adding training data help?

===============================

Candidate machine learning methods (deciding what to try next)

Again using the house-price prediction example, suppose we have implemented regularized linear regression to predict prices:

However, when you find that this prediction produces large errors on new data, there are several remedies you can try:

- Get more training examples
- Try smaller sets of features
- Try getting additional features
- Try adding polynomial features (e.g. x1², x2², x1·x2, ...)
- Try decreasing λ
- Try increasing λ

Diagnostics for machine learning methods:

- What is a diagnostic? A diagnostic is a test you can run to determine whether a learning algorithm is working and to get guidance on improving its performance.

Diagnostic: a test that you can run to gain insight into what is or isn't working with a learning algorithm, and to gain guidance as to how best to improve its performance.

- The cost of diagnostics: diagnostics can take time to implement, but doing so can be a very good use of your time.

===============================

Evaluating a hypothesis

First, we divide all the data into two sets, a training set and a testing set; we learn the parameter vector from the training set, and then evaluate it on the testing set (e.g., for classification).

At this point the test-set error is defined differently for linear regression and for logistic regression:

- Error for linear regression: J_test(θ) = (1 / 2m_test) · Σᵢ (h_θ(x_test⁽ⁱ⁾) − y_test⁽ⁱ⁾)²

- Error for logistic regression: either the cost J_test(θ), or the misclassification (0/1) error, where err(h_θ(x), y) = 1 if the prediction is wrong and 0 otherwise, averaged over the test set.
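As an illustration of the split-and-evaluate procedure (a Python/NumPy sketch on toy data; the lecture itself uses Octave, and the data, split ratio, and seed here are all illustrative assumptions), the linear regression test error J_test defined above can be computed like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 2x + 1 plus Gaussian noise.
X = rng.uniform(0, 10, size=100)
y = 2 * X + 1 + rng.normal(0, 1, size=100)

# 70/30 train/test split.
idx = rng.permutation(100)
train, test = idx[:70], idx[70:]

# Fit theta = (theta0, theta1) on the training set only (least squares).
A = np.column_stack([np.ones_like(X[train]), X[train]])
theta, *_ = np.linalg.lstsq(A, y[train], rcond=None)

# Test error for linear regression: J_test = 1/(2 m_test) * sum_i (h(x_i) - y_i)^2
A_test = np.column_stack([np.ones_like(X[test]), X[test]])
J_test = np.mean((A_test @ theta - y[test]) ** 2) / 2
print(theta, J_test)
```

The key point is that theta is never allowed to see the test set, so J_test estimates generalization error rather than training fit.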

===============================

Model selection and the training/validation/test sets

Facing the model selection problem, how can we get a model that just fits, without underfitting or overfitting? We introduce a third kind of data set, the cross validation set. Divide all the data in a 60/20/20 ratio into a training set, a cross validation set, and a testing set, as shown:

The error formulas are as follows; in fact the three are computed the same way, with the same formula simply applied to different data:

The model selection procedure is actually very simple:

- First, set up d model hypotheses (10 in the figure; d is the index of each). On the training set, train each one to the θ vector with the least error, obtaining d values of θ.

- Then, for each of the d hypotheses, plug in its θ and compute J(CV) on the cross validation set; take the model with the smallest CV-set error as the chosen hypothesis. For example, if the 4th model has the smallest J(CV), take the d = 4 hypothesis.

PS: Here d denotes the degree, i.e., the hypothesis is a polynomial of maximum degree d.

PS': In general, J(CV) is greater than or equal to J(train).
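The degree-selection loop above can be sketched in a few lines of Python/NumPy (the cubic toy data, the 60/20/20 split, and the candidate degrees 1–10 are illustrative assumptions, not taken from the lecture figure):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data from a cubic; split 60/20/20 into train / cross-validation / test.
x = rng.uniform(-2, 2, size=200)
y = x**3 - x + rng.normal(0, 0.3, size=200)
idx = rng.permutation(200)
tr, cv, te = idx[:120], idx[120:160], idx[160:]

def fit(deg, xs, ys):
    """Train a degree-`deg` polynomial hypothesis by least squares."""
    return np.polyfit(xs, ys, deg)

def err(theta, xs, ys):
    """Squared error J = 1/(2m) * sum (h(x) - y)^2 on the given data."""
    return np.mean((np.polyval(theta, xs) - ys) ** 2) / 2

# One model per candidate degree d, each trained on the training set only;
# pick the degree whose cross-validation error J(CV) is smallest.
degrees = range(1, 11)
models = {d: fit(d, x[tr], y[tr]) for d in degrees}
best = min(degrees, key=lambda d: err(models[d], x[cv], y[cv]))

# Report generalization error on the untouched test set.
print(best, err(models[best], x[te], y[te]))
```

Because the CV set chose the winner, the final error estimate must come from the test set, which took no part in either training or selection.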

===============================

Diagnosing bias vs. variance

In the previous lecture we looked at how the same data is fitted under different regression models:

In this section, we distinguish two concepts: bias vs. variance.

As shown in the plot of error against model degree d, you can picture increasing d as a progression from underfit to overfit: along the way, the training-set error decreases steadily, while the CV-set error first falls and then rises.

This creates the concept of bias and variance:

Bias: J(train) large, J(CV) large, J(train) ≈ J(CV); bias arises when d is small, in the underfit regime.

Variance: J(train) small, J(CV) large, J(train) << J(CV); variance arises when d is large, in the overfit regime.

As shown in the following:

Now, try the quiz from the lecture:

-------------------------------------------------------------

Well, with the intuition in place, let's look at the origin and the precise definitions of bias and variance.

Given datasets D (point sets drawn from the same distribution), for each input x we can define the target as the conditional expectation t(x) = E[y|x]. The mean squared error is then:

MSE = (1/n) · Σ (f(x) − t(x))², and in expectation it decomposes as: expected loss = bias² + variance + noise
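The decomposition can be checked in a few lines. Writing f for f(x; D), suppressing x, and using E_D for the average over datasets (this is the standard derivation, spelled out here though the text only states the result):

```latex
\begin{aligned}
\mathbb{E}_D\!\left[(f - t)^2\right]
  &= \mathbb{E}_D\!\left[\big((f - \mathbb{E}_D[f]) + (\mathbb{E}_D[f] - t)\big)^2\right] \\
  &= \underbrace{\mathbb{E}_D\!\left[(f - \mathbb{E}_D[f])^2\right]}_{\text{variance}}
   + \underbrace{\big(\mathbb{E}_D[f] - t\big)^2}_{\text{bias}^2}
\end{aligned}
```

The cross term vanishes because E_D[f − E_D[f]] = 0, and the remaining term E[(y − t(x))²] is the irreducible noise, giving expected loss = bias² + variance + noise.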

That's what the definition says:

Variance: measures the extent to which the solutions for individual data sets vary around their average; hence it measures the extent to which the function f(x) is sensitive to the particular choice of data set.

Bias: represents the extent to which the average prediction over all data sets differs from the desired regression function.

Our goal is to minimize the expected loss, which we have decomposed into the sum of a (squared) bias, a variance, and a constant noise term. As we shall see, there is a trade-off between bias and variance, with very flexible models (overfit) having low bias and high variance, and relatively rigid models (underfit) having high bias and low variance.

To summarize:

Variance: the spread of the individual estimates around their own mean.

Bias: the gap between the expected (average) estimate and the true regression function underlying the sample data.

Concretely, suppose we collect a statistic D (say, heights of university students binned into [0.5,1], [1,1.1], [1.1,1.2] ... [1.9,2]), forming a set of discrete points. The school has 20 classes, and from each class's data we can fit an estimated curve f(x). These 20 curves have an average, the expected (mean) estimated curve E_D[f(x; D)].

Variance is the spread of the 20 estimated curves around that mean curve, i.e., the variance of the estimates themselves, which is unlikely to be 0.

Bias is the distance between the mean of the 20 estimated curves and the actual best-fit curve.

In this picture, consider λ in the regularization term:

λ small → effectively large d → overfit (flexible):

- the fits obtained from different training datasets (data from different classes) jitter a lot → variance large;
- the deviation between the estimated mean and the true expectation is small → bias small.

In the corresponding figure, the left panel shows the 20 fitted curves; the red line on the right is the mean of the 20 curves, and the green line is the true regression curve underlying the data.

λ large → effectively small d → underfit (stable):

- the fits obtained from different training datasets (data from different classes) jitter little → variance small;
- the deviation between the estimated mean and the true values is large, and the regression is poor → bias large.

Again, the left panel shows the 20 fitted curves; the red line on the right is the mean of the 20 curves, and the green line is the true regression curve underlying the data.

Shown next is the relationship between λ and bias, variance, and error:

We want a model in which neither variance nor bias is large:

Hence there is a trade-off between variance and bias.

===============================

Regularization and bias/variance

The previous section described where bias and variance come from; in this section we apply them to regularization.

Remember what regularization is? Regularization is a term introduced into the cost function to prevent overfitting.

Still not sure? Look: the regularization term is the last term in the cost function J(θ). If λ is too large, the model goes from overfit all the way to underfit; if λ is too small, it overfits.

Try λ starting from 0, then 0.01, doubling each time up to 10.24; in total, 12 values of λ can be tried.

These 12 values of λ give the cost functions of 12 models, each with its own J(θ) and J(CV)(θ).

In the same way as model selection: for each cost function, first find the θ that minimizes J(θ); then pick as the final λ the one whose J(CV)(θ) is smallest.
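The λ-selection loop above can be sketched in Python/NumPy using the regularized normal equation for a polynomial model (the degree-8 feature map, the toy data, and the split are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data with degree-8 polynomial features; 60/40 train/CV split.
x = rng.uniform(-1, 1, 100)
y = np.sin(2 * x) + rng.normal(0, 0.2, 100)
Phi = np.vander(x, 9, increasing=True)   # features [1, x, x^2, ..., x^8]
tr, cv = np.arange(60), np.arange(60, 100)

def ridge_fit(lam):
    """Regularized normal equation: theta = (Phi'Phi + lam*I)^-1 Phi'y
    (by convention, theta_0 is not regularized)."""
    n = Phi.shape[1]
    R = lam * np.eye(n)
    R[0, 0] = 0
    return np.linalg.solve(Phi[tr].T @ Phi[tr] + R, Phi[tr].T @ y[tr])

def J(theta, idx):
    """Unregularized squared error on the given index set."""
    return np.mean((Phi[idx] @ theta - y[idx]) ** 2) / 2

# The 12 candidate values: 0, 0.01, 0.02, 0.04, ..., 10.24.
lams = [0.0] + [0.01 * 2**k for k in range(11)]
best_lam = min(lams, key=lambda l: J(ridge_fit(l), cv))
print(best_lam)
```

Note that J(CV) is computed without the regularization term: λ shapes training, but the validation error we compare is the plain squared error.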

Now try the quiz:

The picture looks like this:

λ too small leads to overfit, producing variance: J(train) << J(CV).

λ too large leads to underfit, producing bias: J(train) ≈ J(CV).

Make sense?

===============================

Learning curves: when does adding training data help?

This section discusses the relationship between the amount of training data m and the error. Thinking in extremes: with very little training data (say a single example), J(train) is tiny while J(CV) is large; as m grows, J(train) grows (a perfect fit becomes harder) and J(CV) shrinks (the model generalizes better).

Now, for the high bias and high variance cases, let's see whether increasing the size of the training set, i.e., m, is worthwhile:

Underfit (high bias): adding more data m is useless!

Overfit (high variance): increasing m shrinks the gap between J(train) and J(CV) and helps improve performance!

Now try the quiz:

As the graphs show, increasing the amount of training data helps in the overfitting case and is futile in the underfitting case!
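A minimal sketch of computing a learning curve for the high-bias case in Python/NumPy (quadratic data fitted by a straight line; all of the choices here are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(4)

# Data from a quadratic; a straight-line fit is guaranteed to underfit.
x = rng.uniform(-3, 3, 200)
y = x**2 + rng.normal(0, 0.5, 200)
x_cv, y_cv = x[100:], y[100:]          # held-out CV set

def errs(m):
    """Train on the first m examples; return (J_train, J_cv)."""
    theta = np.polyfit(x[:m], y[:m], 1)  # degree-1 (underfit) model
    j_tr = np.mean((np.polyval(theta, x[:m]) - y[:m]) ** 2) / 2
    j_cv = np.mean((np.polyval(theta, x_cv) - y_cv) ** 2) / 2
    return j_tr, j_cv

# Learning curve: as m grows, J(train) rises and J(CV) falls, but in the
# high-bias case both plateau at a high error, so more data will not help.
for m in (5, 20, 100):
    print(m, errs(m))
```

Plotting J(train) and J(CV) against m for a high-variance model instead would show a large gap that keeps narrowing, which is exactly the case where collecting more data pays off.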

Here's a recap of the initial list of remedies, with the condition each one addresses:

- Get more training examples → fixes high variance
- Try smaller sets of features → fixes high variance
- Try getting additional features → fixes high bias
- Try adding polynomial features → fixes high bias
- Try decreasing λ → fixes high bias
- Try increasing λ → fixes high variance

This lecture is very practical: choosing a model that fits well is a problem common to all of machine learning, and I hope it helps you choose your ML model.
