Machine Learning Notes (2): Univariate Linear Regression


Note: This material is based on Andrew Ng's machine learning course on Coursera; all credit to Andrew Ng.

I. Model Representation

This article focuses on how to solve the house price problem introduced in Note (1).

Suppose we now have more housing price data and want a straight line that approximates the trend of house prices, as shown in the figure:

Review the concepts of supervised learning, unsupervised learning, regression, and classification as described in Note (1):

1. Supervised learning (supervised learning)

For each sample of the data, there is a definite output value corresponding to it.

2. Unsupervised learning (unsupervised learning)

For each sample of the data, there is no definite output value; the data are unlabeled.

3. Regression (Regression)

For an input sample, the predicted output is a continuous real value.

4. Classification (classification)

For input samples, the predicted output values are discrete.

Therefore, the housing price problem is a supervised learning problem that should be solved with regression.

The notation used for this problem is given below:

    • m: Number of training samples

    • x: Input variable/feature

    • y: Output variable/target variable

    • (x^(i), y^(i)): The i-th training sample

For a given training set (Training set), we want a learning algorithm (Learning Algorithm) to find a line that approximates all the data as closely as possible, and then infer the output value y of a new input x using the function h represented by that line. The model is represented as follows:

h is often referred to as the hypothesis function; it represents the mapping from x to y. In this case h is a straight line, given by the formula hθ(x) = θ0 + θ1·x.

Because the hypothesis is a linear function and the training samples have only one input feature (the size of the house), this type of problem is called univariate linear regression (Linear Regression with One Variable).
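
As a minimal sketch (the function and variable names are mine, and the input value 1250 is purely illustrative), the hypothesis can be written in Python as follows; the parameters θ0 = 50 and θ1 = 0.06 are the example values used later in Intuition II:

```python
# A minimal sketch of the hypothesis function (names are my own, not the course's).
def h(theta0, theta1, x):
    """Hypothesis for univariate linear regression: a straight line in x."""
    return theta0 + theta1 * x

# Using the example parameters theta0 = 50, theta1 = 0.06 from Intuition II,
# predict the output for an illustrative input x = 1250.
print(h(50.0, 0.06, 1250.0))  # 50 + 0.06 * 1250 = 125.0
```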

II. Cost Function

We continue the discussion of the problem above with the hypothesis function h:

As shown above, hθ(x) represents a straight line in x, and θ0 and θ1 are its two parameters; determining hθ(x) requires fixing both of them.

So, how do you choose these two parameters?

Given only the data, we do not know the parameter values of hθ(x). The simplest approach is to guess the two θ values; different guesses give different lines hθ(x), as illustrated below:

So, how should the parameters θ be chosen so that hθ(x) fits the data as closely as possible?

A good idea: choose θ0 and θ1 so that, for all training samples (x, y), hθ(x) is as close to y as possible.

How can "closest" be described precisely? We can adjust the parameters θ to minimize the sum of squared distances between all training sample points (x, y) and the corresponding predicted points (x, hθ(x)).

The specific description is as follows: find θ0 and θ1 that minimize the cost function

J(θ0, θ1) = (1 / (2m)) · Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))^2

Note: m denotes the number of training samples. Dividing the sum of squared distances by 2m is for convenience in the later derivation and does not affect the final result.

Therefore, the goal of training is to adjust the parameters θ0 and θ1 to minimize the cost function J(θ0, θ1).
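
As a minimal sketch of this definition (function and variable names are mine, not from the course), the cost function can be computed like this:

```python
def compute_cost(theta0, theta1, xs, ys):
    """Squared-error cost J(theta0, theta1) = (1 / (2m)) * sum of (h(x) - y)^2."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        prediction = theta0 + theta1 * x  # h_theta(x)
        total += (prediction - y) ** 2
    return total / (2 * m)
```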

III. Visual Representation of the Cost Function I (Intuition I)

To simplify the explanation of the solution process, assume θ0 = 0; the cost function J then depends only on θ1.

When θ1 = 1, the calculation gives J(1) = 0.

When θ1 = 0.5, the same calculation gives J(0.5) ≈ 0.58.

By trying different θ1 values and computing the corresponding J(θ1), we can see that J(θ1) is a quadratic function of θ1. For the three data points in this example, J(θ1) attains its minimum, and its only minimum, at θ1 = 1.
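
As a quick check, assuming the three training points in the course example are (1, 1), (2, 2), and (3, 3) (they appear only in the original figures, so this is an assumption), a short computation reproduces the values above:

```python
# Assumed training points from the course figure: (1, 1), (2, 2), (3, 3).
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

def J(theta1):
    """Simplified cost with theta0 fixed to 0."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

print(J(1.0))  # 0.0
print(J(0.5))  # (0.25 + 1.0 + 2.25) / 6 = 0.5833...
```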

So we might guess that the quickest way to find the minimum of J(θ1) is to take its derivative with respect to θ1.

IV. Visual Representation of the Cost Function II (Intuition II)

In Intuition I, we simply dropped θ0 for convenience, but θ0 and θ1 are equally important. Now we bring θ0 back and consider both parameters.

Now suppose θ0 = 50 and θ1 = 0.06, which gives the line shown in the figure:

In Intuition I, J(θ1) was a quadratic function of θ1. Now that there are two parameters θ0 and θ1, plotting J(θ0, θ1) with Matlab/Octave gives the surface below:

From this surface we can roughly see where the minimum of J lies, but it is not very obvious. Displaying J(θ0, θ1) with contour lines makes it much more intuitive.

Contours: all points on the same curve have equal values of J, and the values become smaller toward the inside.

For example, for the hθ(x) below, the corresponding value of J(θ0, θ1) is marked on the contour plot with a red ×:

Choosing different θ values brings J(θ0, θ1) closer to the minimum:

When appropriate θ values are chosen, J(θ0, θ1) is essentially at the minimum:

V. Gradient Descent

We now know that we need to select appropriate θ values to minimize J(θ0, θ1), as shown below:

In the plane, θ can take infinitely many values, so hθ(x) could be any of infinitely many lines; we cannot obtain the desired result just by guessing. It is therefore important to have a method for finding the minimum of J(θ0, θ1).

The typical method is gradient descent, illustrated below:

Suppose you are standing on a hillside and want to reach the bottom of the mountain as quickly as possible. Usually you walk one step at a time and reach the bottom after many steps. However, one situation needs to be considered: there may be multiple valleys, and starting from different points you may end up in different ones:

What we call the bottom of the mountain is a local minimum of J(θ0, θ1), and a local minimum is not always the global minimum. Fortunately, in our case the hypothesis hθ(x) is a straight line, so J(θ0, θ1) is a quadratic function with a single minimum, and the gradient descent method handles it well.

So how is each step taken, that is, how is the gradient descent algorithm performed? Repeat the following update:

θj := θj − α · ∂J(θ0, θ1)/∂θj   (for j = 0 and j = 1, updated simultaneously)

That is, for each θj, take the partial derivative of J(θ0, θ1) with respect to θj, multiply it by α (called the learning rate), and subtract the result from θj. This moves each θj toward the minimum, which in the figure corresponds to one step down the surface of J(θ0, θ1). Repeat this step until θ0 and θ1 essentially stop changing.

Here, it is important to note the following points:

1. The := symbol denotes assignment, while the = symbol asserts equality.
2. α denotes the learning rate, which determines how large each gradient descent step is; α > 0.
3. All θ values must be updated synchronously (simultaneously).
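
A minimal sketch of one update step with synchronous assignment (the function names and the idea of passing the partial derivatives as callables are my own; the concrete derivatives for linear regression appear in Section VII):

```python
def gradient_descent_step(theta0, theta1, alpha, grad_theta0, grad_theta1):
    """One simultaneous-update step of gradient descent.

    grad_theta0 and grad_theta1 are functions that return the partial
    derivatives of J with respect to theta0 and theta1 at the current point.
    """
    # Compute both updates from the *current* parameter values first ...
    temp0 = theta0 - alpha * grad_theta0(theta0, theta1)
    temp1 = theta1 - alpha * grad_theta1(theta0, theta1)
    # ... and only then assign them, so the update is synchronous.
    return temp0, temp1
```
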
VI. Visual Representation of Gradient Descent (Gradient Descent Intuition)

Now that we know the concept of gradient descent, let us explain the process with an intuitive picture. The gradient descent algorithm is as given above.

For a more convenient explanation, let θ0 := 0, so that J simplifies to a quadratic function of θ1. When the derivative of J(θ1) with respect to θ1 is greater than 0, θ1 moves to the left and J(θ1) decreases. When the derivative is less than 0, θ1 moves to the right, and J(θ1) still decreases.

So here is the problem: if the step size is too small, J(θ1) decreases only a little at a time; when will it ever reach the bottom? And if the step size is too large, what happens when θ1 crosses over the minimum to the other side?

Both problems are closely related to the learning rate α. If α is too small, gradient descent will be quite slow. If α is too large, gradient descent may overshoot the minimum, failing to converge or even diverging.

One comforting fact is that, in most problems, as θ approaches the optimal solution, the absolute value of the partial derivative of J(θ) with respect to θ becomes smaller, meaning the curve flattens out. The changes in θ therefore become gradually smaller, and the decrease in J(θ) slows down on its own. Consequently, we do not need to reduce the value of α at every step.
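
As an illustrative sketch (the toy cost J(θ) = θ² and the α values are my own choices, not from the course), the effect of a small versus an overly large learning rate can be seen numerically:

```python
def run(alpha, steps=10):
    """Gradient descent on the toy cost J(theta) = theta^2 (derivative 2 * theta)."""
    theta = 1.0
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta

print(run(alpha=0.1))  # shrinks toward 0: converges
print(run(alpha=1.1))  # magnitude grows every step: diverges
```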

VII. Gradient Descent for Linear Regression

Now that we understand both gradient descent and linear regression, we combine them to solve the univariate linear regression model for the housing price problem in this article.

For the linear regression model, we can work out the partial derivatives of J(θ) with respect to θ:

∂J(θ0, θ1)/∂θ0 = (1/m) · Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))
∂J(θ0, θ1)/∂θ1 = (1/m) · Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) · x^(i)

Thus, the gradient descent method takes the following form (θ0 and θ1 must be updated synchronously):

θ0 := θ0 − α · (1/m) · Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))
θ1 := θ1 − α · (1/m) · Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) · x^(i)

Moreover, since J(θ0, θ1) is bowl-shaped (convex), gradient descent will eventually converge to the minimum:
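
Putting the pieces together, here is a minimal sketch of batch gradient descent for univariate linear regression (the data, learning rate, and iteration count are illustrative choices of mine, not values from the course):

```python
def gradient_descent(xs, ys, alpha=0.01, iterations=1000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of J with respect to theta0 and theta1.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Synchronous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Illustrative data roughly following y = 2x + 1 (not the course's housing data).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 5.0, 6.9, 9.2]
print(gradient_descent(xs, ys, alpha=0.05, iterations=5000))  # roughly (1.0, 2.0)
```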

