Stanford Machine Learning Week 1 - Single-Variable Linear Regression


This article covers the following topics:

    • Single-Variable linear regression
    • Cost function
    • Gradient Descent

Single-Variable Linear Regression

Recall from the previous section: in a regression problem, we are given an input variable and try to map it, through a function, to a continuous expected result to obtain the output. Single-variable linear regression predicts one output value from one input value, and the input/output correspondence is a linear function.

Here is an example of predicting house prices based on the size of the house.

Suppose you have a dataset, called a training set, that contains house areas and the corresponding house prices.

x: represents the input variable, also known as the feature variable.

y: represents the output variable, also known as the target variable.

(x(i), y(i)): represents a training sample, i.e., one row of the training set, where i indexes the i-th training sample.

m: indicates the number of training samples.

Before going on, it helps to recall the general workflow of supervised learning: the learning algorithm is run on the training set and produces a hypothesis h, which is represented as a function.

The input of this function is the house size and the output is the predicted price of the house. Through this function we can predict house prices; this is the regression problem in machine learning.

In single-variable linear regression, our hypothesis function is hθ(x) = θ0 + θ1·x, where θ0 and θ1 are the model parameters.

At this point there may be some confusion: why choose this function? A real application may not follow a linear function and may need something more complex. This is simply a starting example with a simple linear equation, under the assumption that the area and price of a house have a linear relationship.

Moving on: we have a hypothesis function, and the next thing to do is to choose the parameters θ0 and θ1. Different parameter values give different hypothesis functions, as sketched below.
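For instance (a minimal sketch with made-up house-size and price numbers, just to show the effect of the parameters):

% Hypothetical example: predict price (in $1000s) from house size (in square feet)
theta0 = 50;  theta1 = 0.1;          % one choice of parameters
h = @(x) theta0 + theta1 * x;        % hypothesis h_theta(x) = theta0 + theta1*x
h(2000)                              % predicted price for a 2000 sq ft house

theta0 = 0;   theta1 = 0.15;         % a different choice of parameters
h = @(x) theta0 + theta1 * x;
h(2000)                              % a different prediction for the same house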

We want to select θ0 and θ1 so that the resulting line fits the training data as closely as possible, so that we can make more accurate predictions.

So how do we make the line fit the data well? The idea is that hθ(x) should be as close as possible to the corresponding y value; that is, the problem becomes minimizing (hθ(x) − y), which turns the fitting task into a mathematical optimization problem.

Cost Function

Next, a cost function, called the squared error function, is introduced.

The function is defined as follows: for each training sample, take hθ(x(i)) − y(i), square it, sum over all samples, and then average. In other words, J(θ0, θ1) = (1/2m) · Σ from i = 1 to m of (hθ(x(i)) − y(i))². Multiplying by 1/2 is for convenience in the later derivative calculation.
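To make the definition concrete, here is a minimal sketch (with made-up data points) that evaluates J directly from this formula for two parameter choices; the better fit yields the smaller cost:

% Toy training set (made up for illustration)
x = [1; 2; 3];          % input variable (feature)
y = [1; 2; 3];          % output variable (target)
m = length(y);          % number of training samples

% Cost J(theta0, theta1) = (1/(2m)) * sum((h(x) - y).^2)
J = @(t0, t1) sum((t0 + t1 * x - y) .^ 2) / (2 * m);

J(0, 1)                 % perfect fit: cost is 0
J(0, 0.5)               % worse fit: cost is larger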

What we want to do is find the values of θ0 and θ1 that minimize the function J(θ0, θ1). Next, we introduce an algorithm, gradient descent, which can automatically find the values of the parameters θ0 and θ1 that minimize J.

Gradient Descent

Before we study gradient descent, let's look at what the graph of J(θ0, θ1) looks like.

When θ0 is fixed at 0, changing θ1 changes the value of J; in the lecture's example, J reaches its minimum when θ1 = 1.
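Since the lecture figures are not reproduced here, the following sketch (using made-up data that lies on the line y = x, as in the lecture's example) sweeps θ1 with θ0 fixed at 0 and traces out the bowl-shaped curve of J(θ1), whose minimum sits at θ1 = 1:

x = [1; 2; 3];  y = [1; 2; 3];  m = length(y);   % toy data on the line y = x

theta1_range = -1:0.1:3;
J_vals = zeros(size(theta1_range));
for k = 1:length(theta1_range)
    J_vals(k) = sum((theta1_range(k) * x - y) .^ 2) / (2 * m);   % theta0 fixed at 0
end

plot(theta1_range, J_vals);
xlabel('theta_1'); ylabel('J(0, theta_1)');
[~, idx] = min(J_vals);
theta1_range(idx)          % the minimizing value of theta_1, i.e. 1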

When both θ0 and θ1 vary, J(θ0, θ1) becomes a surface over the (θ0, θ1) plane, which the course visualizes with a 3D surface plot and a contour plot.

Now, back to our earlier problem statement: we have a cost function J(θ0, θ1), and we want an algorithm to minimize J.

The idea of gradient descent is:

Starting from given initial values of (θ0, θ1), we repeatedly change θ0 and θ1, each time in a way that makes J decrease, until J is reduced to a minimum.

Let's take a look at how gradient descent works in the following diagram.

First, imagine assigning initial values to θ0 and θ1 (typically both can be initialized to 0). Assume that after initialization, J corresponds to the point marked in red.

Imagine this surface is a mountain and you are standing at that point on the hill. In gradient descent, what you do is look all around and ask yourself: if I want to take a small step downhill, in which direction can I descend the fastest? Then you move in that direction.

You adjust to the quickest downhill direction at every step. Eventually you trace out a path and arrive at a local optimum, the foot of the mountain. Of course, if you start from a different point, you may follow a different path and end up at a different local optimum.

That is the intuition from the diagram. In fact, this direction of steepest descent is the negative direction of the gradient. The gradient is essentially the derivative of J, which for a function of one variable is simply the slope of the tangent line.

If we define the size of each step as the learning rate α, then each step of this process updates θ0 and θ1 in the negative direction of the gradient, and α controls the amplitude of each update.

The mathematical definition is: repeatedly update θj := θj − α · ∂J(θ0, θ1)/∂θj (for j = 0 and j = 1, updated simultaneously), according to the learning rate α, until J converges to a local minimum.

To get an intuitive feel for this on a one-variable function: a point on the curve takes one step in the downhill direction, i.e. opposite to the slope, and moves to another point. Concretely, θ1 is updated to θ1 − α · (slope at that point).

Starting on the other side of the minimum has the same effect: the slope there is negative, so θ1 increases instead, again moving toward the minimum, as the sketch below shows.
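As a small numerical illustration (a made-up one-parameter example, not from the course materials), one update step moves θ1 against the sign of the slope, so it heads toward the minimum from either side:

% Made-up one-parameter example: J(theta1) = (theta1 - 1)^2, minimum at theta1 = 1
dJ = @(t) 2 * (t - 1);   % derivative (slope) of J
alpha = 0.3;             % learning rate

theta1 = 3;              % to the right of the minimum: slope is positive...
theta1 = theta1 - alpha * dJ(theta1)   % ...so theta1 decreases (3 -> 1.8)

theta1 = -2;             % to the left of the minimum: slope is negative...
theta1 = theta1 - alpha * dJ(theta1)   % ...so theta1 increases (-2 -> -0.2)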

The choice of the learning rate affects the algorithm: if the rate is too small, the descent is slow; if the rate is too large, the updates overshoot and fail to converge to the minimum.
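A minimal sketch on the same kind of made-up quadratic shows both failure modes: a tiny α makes progress very slowly, while an overly large α overshoots further on each step and diverges.

% Same made-up objective: J(theta1) = (theta1 - 1)^2, minimum at theta1 = 1
dJ = @(t) 2 * (t - 1);

theta1 = 3;  alpha = 0.01;           % learning rate too small
for step = 1:20
    theta1 = theta1 - alpha * dJ(theta1);
end
theta1                               % still far from 1 after 20 steps (slow progress)

theta1 = 3;  alpha = 1.5;            % learning rate too large
for step = 1:20
    theta1 = theta1 - alpha * dJ(theta1);
end
theta1                               % overshoots more each step and grows without bound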

If θ1 happens to start at the optimal value, then the derivative at that point is 0, and θ1 will not be updated any further.

As gradient descent proceeds, the magnitude of the derivative decreases, so the size of each update naturally decreases as well. Therefore there is no need to reduce the learning rate during the descent.

Implementing Gradient Descent for Linear Regression

We have discussed the mathematical characteristics of gradient descent; now, how do we apply it in the training algorithm for linear regression?

On the left is the gradient descent algorithm, and the right side is the linear regression model.

To implement this algorithm, the key remaining piece is the partial derivative term: we need the derivatives of J with respect to θ0 and θ1, which can be worked out with basic calculus.

Taking the partial derivative with respect to θ0, we get ∂J/∂θ0 = (1/m) · Σ from i = 1 to m of (hθ(x(i)) − y(i)).

Taking the partial derivative with respect to θ1, we get ∂J/∂θ1 = (1/m) · Σ from i = 1 to m of (hθ(x(i)) − y(i)) · x(i).

Substituting these into the gradient descent formula, each iteration computes the update values for θ0 and θ1 and then applies both updates simultaneously.
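As a quick sanity check of these formulas (a minimal sketch on made-up data, not part of the course exercise), iterating the two updates on points that lie exactly on the line y = x drives θ0 toward 0 and θ1 toward 1:

x = [1; 2; 3];  y = [1; 2; 3];  m = length(y);   % toy data on the line y = x
theta0 = 0;  theta1 = 0;  alpha = 0.1;

for iter = 1:1000
    h = theta0 + theta1 * x;                       % current predictions
    temp0 = theta0 - alpha * sum(h - y) / m;       % compute both updates first,
    temp1 = theta1 - alpha * sum((h - y) .* x) / m;% then apply them simultaneously
    theta0 = temp0;  theta1 = temp1;
end
[theta0, theta1]          % approximately [0, 1]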

Below is the after-class exercise, implemented in Octave.

There is a restaurant chain with data on city populations and the profits earned in those cities. We want a function that predicts profit from city population.

First, plot the data to get a visual sense of it.

%% ======================= Part 2: Plotting =======================
fprintf('Plotting Data ...\n')
data = load('ex1data1.txt');
X = data(:, 1); y = data(:, 2);
m = length(y);   % number of training examples

% Plot Data
% Note: you have to complete the code in plotData.m
plotData(X, y);

fprintf('Program paused. Press enter to continue.\n');
pause;

function plotData(x, y)
%PLOTDATA Plots the data points x and y into a new figure
%   PLOTDATA(x, y) plots the data points and gives the figure axes labels
%   of population and profit.
%   Hint: use the 'rx' plot option so the markers appear as red crosses, and
%   'MarkerSize', 10 to make them larger.

figure;                                   % open a new figure window
plot(x, y, 'rx', 'MarkerSize', 10);       % plot the data
ylabel('Profit in $10,000s');             % set the y-axis label
xlabel('Population of City in 10,000s');  % set the x-axis label

end

A best-fit line is then obtained using the gradient descent method.

X = [ones(m, 1), data(:, 1)];   % add a column of ones to X for the intercept term
theta = zeros(2, 1);            % initialize the fitting parameters

iterations = 1500;              % defines the number of cycles (iterations)
alpha = 0.01;                   % defines the learning rate

% compute and display initial cost
computeCost(X, y, theta)

% run gradient descent
theta = gradientDescent(X, y, theta, alpha, iterations);

Implementation of the computeCost cost function:

function J = computeCost(X, y, theta)
%COMPUTECOST Compute cost for linear regression
%   J = COMPUTECOST(X, y, theta) computes the cost of using theta as the
%   parameter for linear regression to fit the data points in X and y

m = length(y);   % number of training examples

% Matrix (vectorized) operation mode
J = sum((X * theta - y) .^ 2) / (2 * m);

end

Gradient Descent implementation

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
%GRADIENTDESCENT Performs gradient descent to learn theta
%   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
%   taking num_iters gradient steps with learning rate alpha

m = length(y);                   % number of training examples
J_history = zeros(num_iters, 1);

for iter = 1:num_iters
    % Element-wise implementation: compute both updates, then apply them
    % simultaneously
    temp1 = theta(1) - (alpha / m) * sum((X * theta - y) .* X(:, 1));
    temp2 = theta(2) - (alpha / m) * sum((X * theta - y) .* X(:, 2));
    theta(1) = temp1;
    theta(2) = temp2;

    % Matrix (vectorized) operation mode:
    % theta = theta - alpha * (X' * (X * theta - y)) / m;

    % Save the cost J in every iteration
    J_history(iter) = computeCost(X, y, theta);
end

end

Print results

% print theta to screen
fprintf('Theta found by gradient descent: ');
fprintf('%f %f \n', theta(1), theta(2));

% Plot the linear fit
hold on;   % keep previous plot visible
plot(X(:, 2), X * theta, '-')
legend('Training data', 'Linear regression')
hold off   % don't overlay any more plots on this figure

% Predict values for population sizes of 35,000 and 70,000
predict1 = [1, 3.5] * theta;
fprintf('For population = 35,000, we predict a profit of %f\n', predict1 * 10000);
predict2 = [1, 7] * theta;
fprintf('For population = 70,000, we predict a profit of %f\n', predict2 * 10000);

fprintf('Program paused. Press enter to continue.\n');
pause;
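As an optional follow-up (a sketch, not part of the exercise script above), the J_history vector returned by gradientDescent can be plotted to confirm that the cost keeps decreasing and levels off:

% Re-run gradient descent from theta = [0; 0], keep the cost history, and plot it
[~, J_history] = gradientDescent(X, y, zeros(2, 1), alpha, iterations);
figure;
plot(1:numel(J_history), J_history, '-b');
xlabel('Number of iterations'); ylabel('Cost J');   % J should decrease and level off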

Note: The above learning materials reference: https://www.coursera.org/learn/machine-learning
