Linear Regression with One Variable in Machine Learning

Source: Internet
Author: User

1. Model Representation

Our first learning algorithm is linear regression. Let's start with an example: predicting housing prices. We use a dataset containing housing prices in Portland, Oregon. Here, we plot the dataset as sale price against house size:

Let's take a look at this dataset. Suppose a friend is trying to sell their house, the house is 1250 square feet, and they want to know how much it might sell for. One thing you can do is build a model, which might be a straight line fitted to the data. From this model, you might tell your friend that the house could sell for about $220,000. Clearly, this is an example of a supervised learning algorithm.

In the previous blog post introducing machine learning, we learned that this is called supervised learning because we provide the "correct answer" for each data point. More specifically, this is a regression problem: the word "regression" refers to predicting a continuous output value from the input data. In this example, that output is the price.

Another common supervised learning task is classification. Furthermore, in supervised learning we have a dataset called a training set. For example, given a training set containing house sizes and their prices, our task is to learn from this training set and predict house prices.

Now we define some frequently used notation:

m represents the number of training samples. Therefore, if there are 47 rows in the dataset, we have 47 training samples and m = 47.


The input variables are usually referred to as features. x represents the input feature, and y represents the output or target variable, that is, the value we want to predict (the second column in the table above).

We use (x, y) to denote a single training sample, so each row in the table corresponds to one training sample.

To refer to a specific training sample, we write x superscript (i) and y superscript (i) for the i-th training sample. Note that the superscript is not exponentiation: the superscript (i) in (x(i), y(i)) is just an index, indicating row i of the training set.
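To make this notation concrete, here is a tiny sketch in Python. The house sizes and prices below are made-up illustrative values, not the actual Portland dataset:

```python
# A toy training set: each (x, y) pair is one training sample.
# x = house size in square feet, y = price in thousands of dollars.
# These numbers are illustrative only, not real data.
training_set = [(2104, 460), (1416, 232), (1534, 315), (852, 178)]

m = len(training_set)       # m = number of training samples
x_1, y_1 = training_set[0]  # the first training sample (x(1), y(1))

print(m)         # 4
print(x_1, y_1)  # 2104 460
```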

Our house price prediction works the way supervised learning algorithms typically do:

Our training set of house prices is fed to the learning algorithm, which outputs a function. By convention, this function is written as a lowercase h, where h stands for hypothesis. h is a function whose input is the size of a house: given an input x value, h produces a y value, which corresponds to the price of the house. Therefore, h is a function mapping from x to y.

People often ask: why is this function called a hypothesis? Some of you may know what the word "hypothesis" means from the dictionary or elsewhere. In fact, this is a name adopted in the early days of machine learning, and for some functions it may not be the most natural term. For example, for the mapping between house size and price, "hypothesis" may not be the best name. However, it is the standard term in machine learning, so we needn't worry about why people call it that.

Summary: to solve the housing price prediction problem, we "feed" the training set to our learning algorithm, which learns a hypothesis h. We then input the size of the house we want to price into h as the input variable, and h outputs the predicted sale price of the house.

When designing a learning algorithm, we need to decide how to represent the hypothesis h. For example, how do we express h in our house price forecast? One possible representation is:

h(x) = θ0 + θ1x

Because there is only one feature/input variable, such a problem is called single-variable (univariate) linear regression.
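As a minimal sketch, such a hypothesis can be written as a one-line Python function. The parameter values below (theta0 = 50, theta1 = 0.1) are arbitrary illustrations, not fitted values:

```python
def h(x, theta0, theta1):
    """Univariate linear hypothesis: h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# Predicted price (in thousands) for a 1250-square-foot house
# using the illustrative parameters theta0 = 50, theta1 = 0.1:
print(h(1250, 50.0, 0.1))  # 175.0
```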

So why a linear function? Sometimes we need more complex, possibly non-linear functions, but since the linear equation is the simplest form, we start with the linear case. Eventually, of course, we will build more complex models and more complex learning algorithms.

2. Cost Function

Next we will define the cost function, which will help us figure out how to fit the best possible straight line to our data.

In linear regression, we have a training set like the one above. Remember that m represents the number of training samples; for example, m = 47 here.

Our hypothesis function (that is, the function used for prediction) is also expressed as a linear function.

Next we introduce the terms θ0 and θ1, which we usually call the model parameters. How do we select the values of θ0 and θ1? Different choices of θ0 and θ1 give different hypothesis functions. For example:

In linear regression, we have a training set, and what we need to do is choose values of θ0 and θ1 so that the straight line represented by the hypothesis function fits the data points as well as possible. So how do we obtain values of θ0 and θ1 that fit the data well?

Our idea is to choose the parameters θ0 and θ1 so that, for each input x, the prediction h(x) is as close as possible to the y value of the corresponding sample. Our training set contains a certain number of samples: for each house we know its size x and the actual price it sold for. We therefore want parameter values that make the x values in the training set predict the corresponding y values accurately.

Let's give a standard definition: in linear regression, what we need to solve is a minimization problem.

The chosen parameters θ0 and θ1 determine how accurately the resulting line fits our training set; the gap between the model's predicted values and the actual values in the training set is the modeling error.

Our goal is to select model parameters that minimize the sum of squared modeling errors. That is, we minimize the cost function:

J(θ0, θ1) = (1/2m) · Σ (h(x(i)) - y(i))^2, where the sum runs over i = 1 to m

We can plot J with three coordinates θ0, θ1, and J(θ0, θ1):

We can see that there is a point in this three-dimensional space that minimizes J(θ0, θ1).
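The cost function above can be computed directly. Here is a sketch on a toy dataset (the values are illustrative only; any data following y = x works as a sanity check, since the line h(x) = x then has zero cost):

```python
def cost(theta0, theta1, xs, ys):
    """Squared-error cost J(theta0, theta1) = (1/2m) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    total = sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)

xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]  # toy data lying exactly on the line y = x

print(cost(0.0, 1.0, xs, ys))  # 0.0: h(x) = x fits perfectly, so J is zero
print(cost(0.0, 0.5, xs, ys))  # a worse fit gives a larger cost
```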

3. Gradient Descent

We have defined the cost function J. Now we introduce the gradient descent algorithm, which can minimize it. Gradient descent is a very common algorithm: it is used not only in linear regression but widely throughout machine learning. Later, we will also use gradient descent to minimize other functions, not just the cost function of linear regression.

Here we have a function J(θ0, θ1). It may be the cost function of linear regression, or some other function that needs to be minimized; either way, we need an algorithm to minimize J(θ0, θ1). Gradient descent in fact applies to more general functions: if you have a function J(θ0, θ1, θ2, ..., θn), you can minimize it over θ0 through θn. We mention the n-parameter case only to show that gradient descent solves the more general problem; for conciseness, we will use just two parameters in the following discussion to simplify the notation.

The idea of gradient descent is as follows. First, we initialize θ0 and θ1. The initial values are actually not important, but the usual choice is to set θ0 to 0 and θ1 to 0, that is, to initialize both to 0.

What gradient descent does is repeatedly change θ0 and θ1, trying to make J(θ0, θ1) smaller with each change, until we settle at a minimum of J(θ0, θ1), which may be a local minimum.

Let's look at some pictures of how gradient descent works. We are trying to minimize J(θ0, θ1). Note that θ0 and θ1 lie on the horizontal axes, while J(θ0, θ1) is on the vertical axis: the height of the surface is the value of J(θ0, θ1). We want to minimize this function, so we start from some values of θ0 and θ1. Imagine assigning initial values to θ0 and θ1, which corresponds to a starting point on the surface of this function. Whatever the values, here we initialize them to 0, though sometimes other initial values are used.

Now think of this image as a landscape: imagine a park with two hills, and imagine you are standing on one of them, the red hill. In the gradient descent algorithm, what we do is turn 360 degrees, look all around us, and ask: if I want to get down the hill as quickly as possible, in which direction should I step?

Standing on the hillside and looking around, you will find that the best downhill direction is roughly a particular direction. You take a small step in that direction and arrive at a new point on the hillside. From there you look around again, decide which direction takes you downhill fastest, take another small step in that direction, and so on, step by step, until you approach the lowest point.

In addition, this descent has an interesting property: starting from different points, you may arrive at very different local optima. This is a characteristic of the gradient descent algorithm.

Here is the definition of the gradient descent algorithm:

θj := θj - α · ∂/∂θj J(θ0, θ1)   (for j = 0 and j = 1)

where α is the learning rate.


Note: in the gradient descent algorithm, we must update θ0 and θ1 simultaneously.
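A single simultaneous-update step can be sketched like this, using the partial derivatives of the squared-error cost defined earlier (the toy data and the learning rate alpha = 0.1 are illustrative choices):

```python
def gradient_step(theta0, theta1, xs, ys, alpha):
    """One gradient-descent step with SIMULTANEOUS update of theta0 and theta1."""
    m = len(xs)
    # Prediction errors h(x_i) - y_i for the current parameters.
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    # Partial derivatives of J(theta0, theta1) for the squared-error cost.
    d_theta0 = sum(errors) / m
    d_theta1 = sum(e * x for e, x in zip(errors, xs)) / m
    # Both new values are computed from the OLD parameters before either
    # is assigned, which is what "simultaneous update" means.
    return theta0 - alpha * d_theta0, theta1 - alpha * d_theta1

theta0, theta1 = 0.0, 0.0
theta0, theta1 = gradient_step(theta0, theta1, [1.0, 2.0], [1.0, 2.0], alpha=0.1)
print(theta0, theta1)  # 0.15 0.25
```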

If the learning rate is too small, you can only move a little bit at a time, like a baby taking steps toward the lowest point, so it takes many steps to get there.

If the learning rate is too large, gradient descent may overshoot the lowest point and even fail to converge; in fact, it may move farther and farther away from the lowest point. Therefore, if α is too large, it can lead to failure to converge, or even divergence.
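Both behaviors can be seen on a toy one-parameter cost J(θ) = θ², whose derivative is 2θ. This simple function stands in for the linear-regression cost purely for illustration:

```python
def descend(theta, alpha, steps):
    """Run gradient descent on J(theta) = theta**2, whose derivative is 2*theta."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta

print(abs(descend(1.0, 0.1, 50)))  # small alpha: shrinks toward the minimum at 0
print(abs(descend(1.0, 1.1, 50)))  # too-large alpha: overshoots and diverges
```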

If a parameter is already at a local minimum, the gradient descent update does nothing (the derivative evaluates to 0) and the parameter value does not change. This is exactly what we want, because it keeps the solution at the local minimum. It also explains why gradient descent can converge to a local minimum even if the learning rate α remains fixed.

After each gradient descent step, the new derivative becomes a little smaller. As gradient descent runs, the size of each move automatically shrinks, until the moves are very small and you find that the algorithm has converged to a local minimum.

In gradient descent, as we approach a local minimum, the algorithm automatically takes smaller steps. This is because near the local minimum (where the derivative is zero) the derivative value automatically becomes smaller and smaller, so the update automatically becomes smaller. This is how gradient descent works, so there is no need to decrease α over time.


In fact, the cost function used for linear regression is always bowl-shaped; the technical term for such a function is convex. A convex function has no local optima other than the single global optimum.

The gradient descent we just introduced is "batch" gradient descent, meaning that every step of gradient descent uses all the training samples. When computing the derivative in gradient descent, we compute a sum; therefore, every single gradient descent step ends up computing a term that sums over all m training samples.
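Putting the pieces together, a batch gradient-descent loop sums over all m samples at every step. In this sketch the toy data follow y = 2x, so the parameters should approach θ0 = 0 and θ1 = 2 (the data, learning rate, and iteration count are all illustrative choices):

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # y = 2x, so the optimum is theta0 = 0, theta1 = 2
theta0, theta1 = 0.0, 0.0
alpha = 0.05
m = len(xs)

for _ in range(2000):
    # Each step sums over ALL m training samples: this is "batch" gradient descent.
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    grad0 = sum(errors) / m
    grad1 = sum(e * x for e, x in zip(errors, xs)) / m
    # Simultaneous update of both parameters.
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(theta0, theta1)  # close to 0.0 and 2.0
```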

There are other variants of gradient descent that are not of this "batch" type; instead of using the entire training set, they look at only small subsets of the training set at a time.
