Notes on Machine Learning (Stanford), Week 6: Advice for Applying Machine Learning


This post uses a regularized linear regression model to predict the amount of water flowing out of a dam from the water level of the reservoir, then walks through debugging the learning algorithm and discusses the influence of bias and variance on the linear regression model.

① Visualizing the dataset

The dataset for this exercise is divided into three parts:

Training set, used to train the model: sample matrix X and result-label vector y

Cross-validation set, used to determine the regularization parameter: Xval and yval

Test set, used to evaluate performance: the data in the test set is never seen during training

Load the data into MATLAB as follows. There are 12 training examples in the training set, and each training example has only one feature. The hypothesis function is assumed to be hθ(x) = θ₀x₀ + θ₁x₁, expressed as a vector: hθ(x) = θᵀx.

In general, x₀ is the bias unit, with x₀ == 1 by default.
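To make this concrete, here is a minimal loading sketch (the file name ex5data1.mat follows the course exercise and is an assumption here; the variable names match the dataset description above):

load('ex5data1.mat');            % assumed filename; provides X, y, Xval, yval, Xtest, ytest
m = size(X, 1);                  % number of training examples (12 here)
X_with_bias = [ones(m, 1), X];   % prepend the bias column x0 == 1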

The training data is plotted as follows:
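A sketch of how this plot can be produced (the axis labels follow the dam scenario described above and are assumptions):

plot(X, y, 'rx', 'MarkerSize', 10, 'LineWidth', 1.5);  % training data as red crosses
xlabel('Change in water level (x)');
ylabel('Water flowing out of the dam (y)');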

② The cost function of the regularized linear regression model

The cost function formula is as follows:

J(θ) = (1/(2m)) · Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² + (λ/(2m)) · Σⱼ₌₁ⁿ θⱼ²

The MATLAB code is implemented as follows; the cost function here is computed with vector (matrix) multiplication.

The concrete derivation can be found in: Linear Regression --- implementing linear regression.

reg = (lambda/(2*m)) * (theta(2:length(theta))' * theta(2:length(theta)));
J = sum((X*theta - y).^2) / (2*m) + reg;

Note: Since θ₀ does not participate in regularization, the MATLAB array subscripts above start from 2 (MATLAB array subscripts are 1-based, and θ₀ is the first element of the array).

③ The gradient of regularized linear regression

The gradient is calculated as follows:

∂J/∂θ₀ = (1/m) · Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x₀⁽ⁱ⁾                    for j = 0
∂J/∂θⱼ = (1/m) · Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · xⱼ⁽ⁱ⁾ + (λ/m) · θⱼ      for j ≥ 1

The vector form of the unregularized part is Xᵀ(Xθ − y)/m, which in MATLAB is: X' * (X*theta - y) / m

The MATLAB code of the gradient is implemented as follows:

grad_tmp = X' * (X*theta - y) / m;
grad = [grad_tmp(1); grad_tmp(2:end) + (lambda/m) * theta(2:end)];
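For reference, the cost and the gradient can live in one function. Here is a minimal sketch of a complete linearRegCostFunction.m, assuming the signature used by the calls later in this post (the original file is not shown in full here, so treat the exact layout as an assumption):

function [J, grad] = linearRegCostFunction(X, y, theta, lambda)
%LINEARREGCOSTFUNCTION Computes the regularized linear regression cost
%and gradient; theta(1) (i.e. theta_0) is excluded from the penalty.
m = length(y);                                  % number of training examples
h = X * theta;                                  % predictions h_theta(x)
reg = (lambda / (2*m)) * (theta(2:end)' * theta(2:end));
J = sum((h - y) .^ 2) / (2*m) + reg;            % regularized cost
grad = X' * (h - y) / m;                        % gradient of the unregularized term
grad(2:end) = grad(2:end) + (lambda/m) * theta(2:end);  % add the penalty for j >= 1
end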

④ Use MATLAB's fmincg function to train the linear regression model and obtain the model parameters; trainLinearReg.m is as follows:

function [theta] = trainLinearReg(X, y, lambda)
%TRAINLINEARREG Trains linear regression given a dataset (X, y) and a
%regularization parameter lambda
%   [theta] = TRAINLINEARREG(X, y, lambda) trains linear regression using
%   the dataset (X, y) and regularization parameter lambda. Returns the
%   trained parameters theta.

% Initialize Theta
initial_theta = zeros(size(X, 2), 1);

% Create "short hand" for the cost function to be minimized
costFunction = @(t) linearRegCostFunction(X, y, t, lambda);

% Minimize using fmincg
options = optimset('MaxIter', 200, 'GradObj', 'on');
theta = fmincg(costFunction, initial_theta, options);

end

However, in exercise one, we did not use MATLAB's fmincg function to obtain the model parameters; instead, θ was obtained with the following update formula inside a for loop (α is the learning rate): θ := θ − (α/m) · Xᵀ(Xθ − y)

Its MATLAB implementation is as follows:

for iter = 1:num_iters
    theta = theta - (alpha/m) * X' * (X*theta - y);  % theta is updated using the vector notation above
end
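Note that this loop performs the unregularized update. A sketch of what the regularized update would look like, keeping θ₀ out of the penalty as in the gradient of ③ (an assumption; the exercise itself uses fmincg for the regularized case):

for iter = 1:num_iters
    grad = X' * (X*theta - y) / m;                          % unregularized gradient
    grad(2:end) = grad(2:end) + (lambda/m) * theta(2:end);  % penalize j >= 1 only
    theta = theta - alpha * grad;                           % gradient descent step
end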

⑤ Graphical representation of the linear regression model

The model parameters have been obtained with fmincg, so how well does the model fit the data? See the plot below:
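A sketch of producing this plot (training with lambda == 0 and overlaying the fitted line; variable names are assumptions):

theta = trainLinearReg([ones(m, 1), X], y, 0);           % lambda == 0
plot(X, y, 'rx', 'MarkerSize', 10);                      % training data
hold on;
plot(X, [ones(m, 1), X] * theta, '--', 'LineWidth', 2);  % fitted straight line
hold off;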

As can be seen, the data has a clearly nonlinear pattern, but we fit it with a linear model, so there is an obvious underfitting problem.

Here it is easy to visualize the model graphically because the training data has very few features (just one). When the training data has many feature variables, drawing is difficult (more than three dimensions is hard to represent graphically ...); at that point, we need the "learning curve" to check whether the trained model fits the data well.

The best-fit line tells us that the model is not a good fit to the data, because the data has a nonlinear pattern. While visualizing the best fit as shown is one possible way to debug your learning algorithm, it is not always easy to visualize the data and the model (for example, when there are more than 3 features ...).

⑥ The trade-off between bias and variance

High bias --- underfitting (underfit)

High variance --- overfitting (overfit)

The bias-variance problem can be diagnosed with the learning curve. The x-axis of the learning curve is the training set size, and the y-axis shows the cross-validation error and the training error.

The training error is defined as follows:

Jtrain(θ) = (1/(2m)) · Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

Note: The training error Jtrain(θ) has no regularization term, so when calling linearRegCostFunction, lambda == 0. The MATLAB implementation is as follows (learningCurve.m):

function [error_train, error_val] = learningCurve(X, y, Xval, yval, lambda)
%LEARNINGCURVE Generates the train and cross validation set errors needed
%to plot a learning curve
%   [error_train, error_val] = LEARNINGCURVE(X, y, Xval, yval, lambda)
%   returns the train and cross validation set errors for a learning
%   curve. In particular, it returns two vectors of the same length,
%   error_train and error_val: error_train(i) contains the training error
%   for i examples (and similarly for error_val(i)).
%
%   Note: the training error is evaluated on the first i training examples
%   (X(1:i, :) and y(1:i)), while the cross-validation error is evaluated
%   on the _entire_ cross validation set (Xval and yval). When computing
%   the errors, linearRegCostFunction is called with lambda set to 0;
%   lambda is still used when training to obtain the theta parameters.
%   In practice, for larger datasets, you might want to do this in larger
%   intervals instead of every i from 1 to m.

% Number of training examples
m = size(X, 1);

error_train = zeros(m, 1);
error_val   = zeros(m, 1);

for i = 1:m
    theta = trainLinearReg(X(1:i, :), y(1:i), lambda);
    error_train(i) = linearRegCostFunction(X(1:i, :), y(1:i), theta, 0);
    error_val(i)   = linearRegCostFunction(Xval, yval, theta, 0);
end

end
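A sketch of calling learningCurve and plotting both error curves (the bias column is prepended to both sets; the labels are assumptions):

lambda = 0;
[error_train, error_val] = learningCurve([ones(m, 1), X], y, ...
    [ones(size(Xval, 1), 1), Xval], yval, lambda);
plot(1:m, error_train, 1:m, error_val);
xlabel('Number of training examples');
ylabel('Error');
legend('Train', 'Cross Validation');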

The learning curve is plotted as follows. It shows underfitting: when the number of training examples is very small, the trained model can fit that small amount of data, so the training error is relatively small. The cross-validation error, however, is computed on unseen data; since the model underfits, it can hardly fit unseen data, so the cross-validation error is very large.

As the number of training examples increases, the underfitting model becomes less and less able to fit all of the data, so the training error increases. The cross-validation error eventually converges toward the training error, and both curves flatten out. At this point, adding more training examples has little impact on the model: in the underfitting case, increasing the training set size can no longer reduce the error.

⑦ Polynomial regression

The learning curve above shows an underfitting problem. It can be solved by adding more features: use higher powers of the input in a polynomial hypothesis function to fit the data.

The hypothesis function of the polynomial regression model is as follows:

hθ(x) = θ₀ + θ₁·(waterLevel) + θ₂·(waterLevel)² + ... + θₚ·(waterLevel)ᵖ
      = θ₀ + θ₁x₁ + θ₂x₂ + ... + θₚxₚ

By "augmenting" the feature to add more features, the code is implemented as follows: POLYFEATURES.M

function [X_poly] = polyFeatures(X, p)
%POLYFEATURES Maps X (1D vector) into the p-th power
%   [X_poly] = POLYFEATURES(X, p) takes a data matrix X (size m x 1) and
%   maps each example to its polynomial features, where
%   X_poly(i, :) = [X(i) X(i).^2 X(i).^3 ... X(i).^p];

X_poly = zeros(numel(X), p);

% The i-th column of X_poly contains the values of X to the i-th power
for i = 1:p
    X_poly(:, i) = X .^ i;
end

end

After the "expansion" of the feature, it becomes a polynomial regression, but because the characteristics of the polynomial regression range is too large (for example, some characteristics of the value is very small, and some characteristics of the value is very large), it is necessary to use the normalization (normalized), normalized code as follows:

function [X_norm, mu, sigma] = featureNormalize(X)
%FEATURENORMALIZE Normalizes the features in X
%   FEATURENORMALIZE(X) returns a normalized version of X where the mean
%   value of each feature is 0 and the standard deviation is 1. This is
%   often a good preprocessing step to do when working with learning
%   algorithms.

mu = mean(X);
X_norm = bsxfun(@minus, X, mu);

sigma = std(X_norm);
X_norm = bsxfun(@rdivide, X_norm, sigma);

end
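Putting polyFeatures and featureNormalize together, here is a sketch of preparing the polynomial design matrices (p == 8 follows the course exercise and is an assumption here; note that the validation set must be normalized with the training set's mu and sigma):

p = 8;                                             % assumed polynomial degree
X_poly = polyFeatures(X, p);                       % expand the training features
[X_poly, mu, sigma] = featureNormalize(X_poly);    % normalize, keep training stats
X_poly = [ones(m, 1), X_poly];                     % add the bias column

X_poly_val = polyFeatures(Xval, p);                % expand the validation features
X_poly_val = bsxfun(@minus, X_poly_val, mu);       % use the TRAINING mean ...
X_poly_val = bsxfun(@rdivide, X_poly_val, sigma);  % ... and TRAINING std
X_poly_val = [ones(size(X_poly_val, 1), 1), X_poly_val];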

Continue to use the original linearRegCostFunction.m to compute the cost function and gradient of the polynomial regression. The hypothesis function of the polynomial regression model is plotted as follows (note: lambda == 0, i.e., no regularization is used):

The plot of the polynomial regression model shows that it fits almost all of the training examples perfectly. Therefore, it can be concluded that there is an overfitting problem (overfit problem) --- high variance.

The learning curve graph of the polynomial regression model is as follows:

The learning curve of polynomial regression shows that the training error is almost 0 (the curve is very close to the x-axis), precisely because of overfitting: the model passes almost perfectly through every data point in the training set, resulting in a very small training error.

The cross-validation error is very large when the number of training examples is 2. As the number of training examples grows (from 2 to 5), the validation error becomes smaller. Then, as the number of training examples keeps increasing (more than 11 training examples ...), the cross-validation error becomes larger again (overfitting reduces the generalization ability).

⑧ Using regularization to solve the overfitting problem of the polynomial regression model

When the regularization parameter is set to lambda == 1 (λ==1), the resulting model's hypothesis function is plotted as follows:

It can be seen that the fitted curve is no longer like the one for lambda == 0: it does not pass precisely through every point, but becomes relatively smooth instead. This is exactly the effect of regularization.

The learning curve for lambda == 1 (λ==1) is as follows:

The learning curve for lambda == 1 shows that the model generalizes well and can predict unseen data well. Because of this, the cross-validation error is very close to the training error, and both are very small. (A small training error shows that the model fits the training data well, but it might still overfit and predict unseen data poorly; here the cross-validation error is also small, showing that the model predicts unseen data well too.)

Finally, consider the polynomial regression model with regularization parameter lambda == 100 (λ==100): there is an underfitting problem --- high bias.

The model "hypothetical function" curve is as follows:

The learning curve graph is as follows:

⑨ How to automatically select the appropriate regularization parameter lambda (λ)?

As seen in ⑧: when the regularization parameter lambda (λ) equals 0, there is overfitting; when lambda (λ) equals 100, there is underfitting; when lambda (λ) equals 1, the model is just right.

How do we automatically select an appropriate lambda parameter during training?

We can use the cross-validation set (select the appropriate lambda parameter based on the cross-validation error).

Use a cross validation set to evaluate how good each lambda value is.

We can then evaluate the model on the test set to estimate how well the model would perform on actual unseen data.

The specific steps are as follows:

First, there is a series of candidate lambda (λ) values; in this exercise they are stored in a vector lambda_vec (10 values in total):

lambda_vec = [0 0.001 0.003 0.01 0.03 0.1 0.3 1 3 10]';

Then, for each of the 10 lambda values, train a regularized model on the training dataset. For each trained model, compute the cross-validation error, and choose the λ of the model with the smallest cross-validation error as the most suitable λ. (Note: there is no regularization when computing the training and cross-validation errors, which is equivalent to lambda == 0.)

for i = 1:length(lambda_vec)
    theta = trainLinearReg(X, y, lambda_vec(i));  % for each lambda, train the model parameters theta
    % Compute the errors without regularization, because the last argument (lambda) is zero
    error_train(i) = linearRegCostFunction(X, y, theta, 0);        % training error
    error_val(i)   = linearRegCostFunction(Xval, yval, theta, 0);  % cross-validation error
end
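After the loop, the best λ can be picked automatically, and the final model can then be evaluated on the test set as mentioned above. A sketch, assuming the polynomial matrices from ⑦ and an analogously prepared X_poly_test (a hypothetical name here):

[~, idx] = min(error_val);          % index of the smallest cross-validation error
best_lambda = lambda_vec(idx);      % lambda == 3 for the table below
theta = trainLinearReg(X_poly, y, best_lambda);
test_error = linearRegCostFunction(X_poly_test, ytest, theta, 0);  % no regularization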

For these 10 different lambda values, the computed training and cross-validation errors are as follows:

Lambda      Train error   Validation error
0.000000    0.173616      22.066602
0.001000    0.156653      18.597638
0.003000    0.190298      19.981503
0.010000    0.221975      16.969087
0.030000    0.281852      12.829003
0.100000    0.459318      7.587013
0.300000    0.921760
1.000000    2.076188      4.260625
3.000000    4.901351      3.822907
10.000000   16.092213     9.945508

The relationship between the training error, the cross-validation error, and lambda is plotted as follows:

When lambda >= 3, the cross-validation error starts to rise; if lambda keeps increasing, the model may become underfitted ...

As seen above: when lambda == 3, the cross-validation error is smallest. The fitted curve for lambda == 3 is as follows (compare it with the lambda == 1 curve and learning curve to see the difference):

The learning curve is as follows:

Original: http://www.cnblogs.com/hapjin/p/6114466.html

