Stanford Machine Learning, Lecture 6: How to Choose a Machine Learning Method (Advice for Applying Machine Learning)


This column (Machine Learning) includes chapters on single-variable linear regression, multivariate linear regression, an Octave tutorial, logistic regression, regularization, neural networks, machine learning system design, SVMs (support vector machines), clustering, dimensionality reduction, anomaly detection, large-scale machine learning, and more. All of the content comes from Stanford's public machine learning class taught by Andrew Ng. (https://class.coursera.org/ml/class/index)


This lecture: how to choose a machine learning method -- Advice for Applying Machine Learning

===============================

Candidate machine learning methods (deciding what to try next)

Evaluating a hypothesis

☆ Model selection and training/validation/test sets

☆ Diagnosing bias vs. variance

☆ Regularization and bias/variance

Learning curves: when adding more training data helps


===============================

Candidate machine learning methods -- Deciding What to Try Next


Using the house-price prediction example, suppose we have implemented regularized linear regression to predict house prices:


However, when the hypothesis is applied to new data it produces large errors. There are several things we could try:

- Get more training examples
- Try smaller sets of features
- Try getting additional features
- Try adding polynomial features (e.g. x1^2, x2^2, x1*x2, ...)
- Try decreasing λ
- Try increasing λ

Diagnosing machine learning methods:

-What is a diagnostic? A diagnostic is a test you can run to determine whether a learning algorithm is working, and to find out how to improve its performance.

Diagnostic: a test that you can run to gain insight into what is or isn't working with a learning algorithm, and to gain guidance as to how best to improve its performance.

-The cost of diagnostics: diagnostics can take time to implement, but doing so can be a very good use of your time.





===============================

Evaluating a Hypothesis

First, we split all the data into two sets: a training set and a test set. We learn the parameter vector θ on the training set, then evaluate it on the test set (e.g., for classification).

The test-set error is defined differently for linear regression and logistic regression:

-Error for linear regression:

J_test(θ) = (1/(2*m_test)) * Σ_{i=1..m_test} (h_θ(x_test^(i)) - y_test^(i))^2

-Error for logistic regression (misclassification, or 0/1, error):

err(h_θ(x), y) = 1 if the prediction is wrong (h_θ(x) ≥ 0.5 but y = 0, or h_θ(x) < 0.5 but y = 1), else 0

Test error = (1/m_test) * Σ_{i=1..m_test} err(h_θ(x_test^(i)), y_test^(i))
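As a concrete sketch, the two error measures above might be computed like this in Python with NumPy (the data and θ below are made-up toy values, not from the lecture):

```python
import numpy as np

def linear_test_error(theta, X_test, y_test):
    """J_test(theta) = (1/(2*m_test)) * sum((h(x) - y)^2)."""
    m = len(y_test)
    return np.sum((X_test @ theta - y_test) ** 2) / (2 * m)

def logistic_test_error(theta, X_test, y_test):
    """Misclassification (0/1) test error for logistic regression."""
    probs = 1.0 / (1.0 + np.exp(-(X_test @ theta)))  # sigmoid h_theta(x)
    return np.mean((probs >= 0.5).astype(int) != y_test)

# Tiny made-up example; the first column of X is the intercept term.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([0.0, 1.0])            # h_theta(x) = x, a perfect fit here
print(linear_test_error(theta, X, y))   # -> 0.0
```

Note that the logistic test error here counts wrong class labels rather than using the log-loss cost; the lecture allows either, but the 0/1 error is easier to interpret.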




===============================

Model selection and training/validation/test sets


Faced with the problem of model selection, how do we obtain a model that fits just right, neither underfitting nor overfitting? We introduce a third data set called the cross-validation set. Split all the data 60/20/20 into a training set, a cross-validation set, and a test set, as shown in the figure below:


The error formulas are as follows; all three are computed the same way, only over different data: J_train(θ) = (1/(2*m)) * Σ (h_θ(x^(i)) - y^(i))^2, with J_cv(θ) and J_test(θ) being the same sum taken over the cross-validation and test examples respectively.



The model selection procedure is simple, and corresponds to the figure below:

-First, set up d model hypotheses (10 in the figure; d indexes them). Train each on the training set to find the θ vector that minimizes the training error, giving d θ vectors.

-Then, for each of the d hypotheses, take its θ and compute J_cv on the cross-validation set. The model with the lowest CV-set error becomes the final hypothesis; in the figure below J_cv is smallest for the 4th model, so we take the d = 4 hypothesis.

PS: d actually denotes degree, i.e. the highest-order polynomial term in the hypothesis.

PS': In general, J_cv is greater than or equal to J_train.
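A minimal sketch of this selection procedure, using NumPy's polyfit on synthetic data (the quadratic ground truth, the split sizes, and the degree range are all illustrative assumptions, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a quadratic ground truth (a made-up example).
x = rng.uniform(-2, 2, size=120)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(0.0, 0.5, size=x.shape)

# 60/20/20 split into training / cross-validation / test sets.
idx = rng.permutation(len(x))
train, cv, test = idx[:72], idx[72:96], idx[96:]

def cost(coeffs, xs, ys):
    """Squared-error cost J = (1/(2m)) * sum((h(x) - y)^2)."""
    return np.sum((np.polyval(coeffs, xs) - ys) ** 2) / (2 * len(xs))

# Train one hypothesis per degree d on the training set, then pick the
# d whose fitted parameters give the lowest cross-validation error.
cv_errors = {}
models = {}
for d in range(1, 11):
    models[d] = np.polyfit(x[train], y[train], d)
    cv_errors[d] = cost(models[d], x[cv], y[cv])

best_d = min(cv_errors, key=cv_errors.get)
# Report the generalization error of the chosen model on the test set.
print("chosen degree:", best_d, "test error:", cost(models[best_d], x[test], y[test]))
```

Because d was chosen to minimize the CV error, J_cv for the winning model is an optimistic estimate; that is why the final generalization error is reported on the untouched test set.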


===============================

Diagnosing bias vs. variance

In an earlier lecture we saw different regression fits to the same data:




In this section, we distinguish between two concepts: bias vs. variance.

The figure below shows how the errors change as the model's degree d varies. As d rises, the model moves from underfitting to overfitting: the training-set error falls steadily, while the CV-set error first falls and then rises.




Here are the two concepts:

Bias: J_train large, J_cv large, J_train ≈ J_cv; bias arises when d is small, in the underfitting stage.

Variance: J_train small, J_cv large, J_train << J_cv; variance arises when d is large, in the overfitting stage.

As shown in the following illustration:



Let's check with a quick exercise:



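The two diagnosis rules above can be turned into a rough helper function; the target_error and gap_ratio thresholds below are arbitrary illustrative choices, not part of the lecture:

```python
def diagnose(j_train, j_cv, target_error, gap_ratio=2.0):
    """Rough bias/variance diagnosis from training and CV errors.

    target_error and gap_ratio are arbitrary illustrative thresholds:
    target_error is the error level we would consider acceptable, and
    gap_ratio defines how much larger J_cv must be than J_train before
    we call the gap 'large'.
    """
    if j_train > target_error and j_cv >= j_train and j_cv < gap_ratio * j_train:
        return "high bias (underfit): J_train and J_cv both large, J_train ~ J_cv"
    if j_train <= target_error and j_cv > gap_ratio * j_train:
        return "high variance (overfit): J_train small, J_cv >> J_train"
    return "no clear diagnosis: inspect the learning curves"

print(diagnose(j_train=2.1, j_cv=2.3, target_error=0.5))  # underfit pattern
print(diagnose(j_train=0.1, j_cv=1.9, target_error=0.5))  # overfit pattern
```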

-------------------------------------------------------------


Now that we have the intuition, let's look at where bias and variance come from and how they are defined precisely:

Given data sets D (e.g., sets of sample points), for each input x we can compute the average over targets (i.e., the expectation) t(x) = E[y|x]; the mean squared error of an estimated function f(x) is then:

MSE = (1/N) * Σ (f(x_n) - t(x_n))^2

which in expectation decomposes as:

MSE (mean squared error) = Bias^2 + Variance + Noise


The textbook definitions read:

Variance: measures the extent to which the solutions for individual data sets vary around their average, and hence measures the extent to which the learned function f(x) is sensitive to the particular choice of data set.

Bias: represents the extent to which the average prediction over all data sets differs from the desired regression function.

Our goal is to minimize the expected loss, which we have decomposed into the sum of a (squared) bias, a variance, and a constant noise term. As we shall see, there is a trade-off between bias and variance, with very flexible models (overfit) having low bias and high variance, and relatively rigid models (underfit) having high bias and low variance.



To sum up:

Variance: the variance of the estimate itself, i.e. how much the estimate varies across data sets.

Bias: the difference between the expectation of the estimate and the true regression function underlying the sample data.


Concretely: suppose I collect a data set D, e.g. the heights of a university's graduating students counted in the bins [0.5,1], [1,1.1], [1.1,1.2], ..., [1.9,2], which gives a set of discrete points. The university has 20 classes, and each class's data can be fitted into an estimated curve f(x). Averaging the 20 curves gives the expected (mean) estimated curve E[f(x; D)].

Variance refers to the spread of the 20 estimated curves around this expected (mean) curve, i.e. the variance of the estimator itself; it cannot be 0.

Bias refers to the distance between the mean of the 20 curves and the actual best-fit curve.
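The 20-classes thought experiment can be simulated directly. The sketch below uses a made-up ground-truth function and compares a rigid (degree-1) and a flexible (degree-9) polynomial model; all the constants are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

def true_f(x):
    """Ground-truth regression function (a made-up stand-in for the
    'true' curve in the height example above)."""
    return np.sin(2 * np.pi * x)

n_datasets, n_points = 20, 25          # 20 'classes', 25 samples each
grid = np.linspace(0.05, 0.95, 50)     # points where the curves are compared

def fit_curves(degree):
    """Fit one degree-d polynomial per dataset; return a (20, 50) array."""
    curves = []
    for _ in range(n_datasets):
        x = rng.uniform(0, 1, n_points)
        y = true_f(x) + rng.normal(0, 0.3, n_points)
        curves.append(np.polyval(np.polyfit(x, y, degree), grid))
    return np.asarray(curves)

for degree in (1, 9):                  # rigid vs. flexible model
    curves = fit_curves(degree)
    mean_curve = curves.mean(axis=0)   # E[f(x; D)], the average estimate
    bias_sq = np.mean((mean_curve - true_f(grid)) ** 2)
    variance = np.mean(curves.var(axis=0))
    print(f"degree {degree}: bias^2 = {bias_sq:.3f}, variance = {variance:.3f}")
```

Typically the degree-1 model shows large bias^2 and small variance, and the degree-9 model the reverse, mirroring the discussion above; the exact numbers depend on the random draws.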


So, for the λ in the regularization term:

λ small, d large -> overfitting (flexible) ->

the curves fitted to different training sets (data from different classes) vary a lot -> variance is large

the deviation between the mean of the estimates and the true value is small -> bias is small

In the figure below, the left panel shows the 20 fitted curves; in the right panel the red line is the mean of the 20 curves, and the green line is the curve fitted from the true expected data.




λ large, d small -> underfitting (stable) ->

the curves fitted to different training sets (data from different classes) jitter very little -> variance is small

the mean of the estimates deviates from the true expectation and cannot track it well -> bias is large

In the figure below, the left panel shows the 20 fitted curves; in the right panel the red line is the mean of the 20 curves, and the green line is the curve fitted from the true expected data.



The following diagram shows the relationship between Lambda and bias, variance, and error:




We want both the variance and the bias to be small:



Then there is a tradeoff between variance and bias.







===============================

Regularization and Bias/Variance


The previous section described where bias and variance come from; in this section we apply them to regularization.

Remember regularization? Regularization is a term introduced into the cost function to prevent overfitting.

If that isn't clear, look at the figure: the regularization term is the last term in the cost function J(θ). A λ that is too large causes underfitting, and a λ that is too small causes overfitting.




Start with λ = 0 and 0.01, then double each step (0.02, 0.04, ...) up to 10.24; that gives 12 values of λ to try.

These 12 values of λ give 12 regularized cost functions; for each we obtain a θ and the errors J_train(θ) and J_cv(θ).

Just as in model selection, for each λ first find the θ that minimizes the regularized J(θ) on the training set, then pick the λ whose θ gives the smallest J_cv(θ).
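A sketch of this λ-selection loop, using a regularized normal equation on synthetic data (the data, the polynomial features, and the choice to also penalize the intercept are simplifying assumptions, not the lecture's exact recipe):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with degree-8 polynomial features (a made-up example).
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + rng.normal(0, 0.2, 60)
X = np.vander(x, 9, increasing=True)        # columns: 1, x, x^2, ..., x^8
train, cv = np.arange(40), np.arange(40, 60)

def fit_ridge(Xm, ym, lam):
    """theta minimizing the regularized cost, via the normal equation
    theta = (X^T X + lam * I)^-1 X^T y.  (For brevity this also
    penalizes the intercept, which the course's formulation skips.)"""
    n = Xm.shape[1]
    return np.linalg.solve(Xm.T @ Xm + lam * np.eye(n), Xm.T @ ym)

def j_unreg(theta, Xm, ym):
    """Unregularized squared error, used for both J_train and J_cv."""
    return np.sum((Xm @ theta - ym) ** 2) / (2 * len(ym))

# lambda = 0, 0.01, 0.02, 0.04, ..., 10.24: the 12 values from the text.
lambdas = [0.0] + [0.01 * 2**k for k in range(11)]
cv_err = {lam: j_unreg(fit_ridge(X[train], y[train], lam), X[cv], y[cv])
          for lam in lambdas}
best_lam = min(cv_err, key=cv_err.get)
print("chosen lambda:", best_lam)
```

Note that the CV error is computed without the regularization term: λ shapes the fit during training, but the quantity we actually care about is the plain squared error on held-out data.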



Now let's see what this gives us.



Plotting J_train and J_cv against λ gives something like this:



A λ that is too small leads to overfitting, which produces variance: J_train << J_cv.

A λ that is too large leads to underfitting, which produces bias: J_train ≈ J_cv, and both are large.

Does that make sense?



===============================

Learning curves: when is adding more training data effective?

