Chapter Two: Univariate Linear Regression (Linear Regression with One Variable)

**< Model Representation >**
**< Cost Function >**
**< Gradient Descent >**
**< Gradient Descent for Linear Regression >**
1. Model Representation

Let us return to the housing price problem, with the training set (Training Set) shown in the following table:

The notation we will use to describe this regression problem is as follows:

m represents the number of instances in the training set

x represents the feature / input variable

y represents the target variable / output variable

(x, y) represents one instance of the training set

(x(i), y(i)) represents the i-th training example

h represents the solution or function output by the learning algorithm, also known as the hypothesis (hypothesis)

Based on the known data and the above analysis, we can get:

Following the usual convention, I will use h to denote the hypothesis. Thus, to solve the housing price prediction problem, we feed our training set to the learning algorithm, which in turn learns a hypothesis h: h(x) = θ0 + θ1x. Because the model contains only one feature / input variable x, this problem is called **univariate linear regression**.
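As a minimal sketch, the hypothesis h(x) = θ0 + θ1x can be written directly in Python. The parameter values in the example are made up purely for illustration, not learned from any data:

```python
def hypothesis(theta0, theta1, x):
    """Univariate linear hypothesis: predict y for a single input feature x."""
    return theta0 + theta1 * x

# Example with made-up parameters: theta0 = 1.0, theta1 = 2.0
# h(3) = 1 + 2 * 3 = 7
prediction = hypothesis(1.0, 2.0, 3.0)
```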

2. Cost Function (Cost Function)

The parameters we choose determine the accuracy of the straight line we obtain. The difference between the model's predicted values and the actual values in the training set (indicated by the blue lines) is the **modeling error** (modeling error).

Here we give the hypothetical function and the cost function model:

**The cost function is also called the squared error function, or sometimes the squared error cost function.** We sum the squares of the errors because **the squared error cost function is a reasonable choice for most problems, especially regression problems.** Other cost functions also work, but the squared error cost function is probably the most commonly used for regression problems.
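The squared error cost function J(θ0, θ1) = (1/2m) Σ (h(x⁽ⁱ⁾) − y⁽ⁱ⁾)² can be sketched as follows; the toy data set in the example is invented for illustration only:

```python
def compute_cost(theta0, theta1, xs, ys):
    """Squared error cost: J = (1/2m) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        error = (theta0 + theta1 * x) - y   # modeling error for one example
        total += error ** 2
    return total / (2 * m)

# Toy data lying exactly on y = 2x: with theta0 = 0, theta1 = 2 the cost is 0.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
cost = compute_cost(0.0, 2.0, xs, ys)  # → 0.0
```

Any other choice of parameters gives a strictly larger cost, which is exactly what gradient descent will exploit in the next section.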

Now let's look at what the cost function actually represents.

We can draw a three-dimensional plot whose coordinates are θ0, θ1, and J(θ0, θ1), which shows this more clearly:

We can see that there is a point in the three-dimensional space that makes J (θ0,θ1) the smallest.

These graphs help us understand the values expressed by the cost function J, the hypotheses they correspond to, and which hypotheses correspond to points closer to the minimum of J. What we really need, of course, is an efficient algorithm that automatically finds the parameters θ0 and θ1 that minimize the cost function J.

3. Gradient descent (Gradient descent)

Gradient descent is an algorithm for finding the minimum of a function; we will use the gradient descent algorithm to find the minimum of the cost function J(θ0, θ1).

The idea behind gradient descent is that we first pick a starting combination of parameters (θ0, θ1, ..., θn) and compute the cost function, then look for the next parameter combination that lowers the cost function the most. We repeat this until we reach a local minimum (local minimum). Because we have not tried all parameter combinations, we **cannot be sure that the local minimum we find is the global minimum** (global minimum); different initial parameter combinations may lead to different local minima.

The formula for the gradient descent algorithm is as follows, where:

**α is the learning rate (learning rate); it determines how big a step we take downhill in the direction in which the cost function decreases the most.**
In the gradient descent algorithm, this is the correct way to implement a simultaneous update. **Simultaneous updating is the standard practice in gradient descent.**
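The simultaneous update θj := θj − α·∂J/∂θj can be sketched as below. Here `grad0` and `grad1` stand for the partial derivatives ∂J/∂θ0 and ∂J/∂θ1 evaluated at the current parameters; the values used in the example are placeholders for illustration:

```python
def simultaneous_update(theta0, theta1, grad0, grad1, alpha):
    """Update BOTH parameters using gradients evaluated at the OLD values."""
    temp0 = theta0 - alpha * grad0   # compute both temporaries first...
    temp1 = theta1 - alpha * grad1
    return temp0, temp1              # ...then assign them together

# The incorrect version would overwrite theta0 before computing theta1's
# update, letting the new theta0 leak into theta1's step.
new_t0, new_t1 = simultaneous_update(1.0, 1.0, 0.5, 0.5, 0.1)
```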

Let's take a look at what happens if α is too small or too large:

If α is too small, i.e. the learning rate is too small, we inch toward the lowest point, so reaching it takes many steps: **if α is too small, gradient descent can be slow, because it moves only a tiny amount per step and needs many steps to reach the global minimum.**

If α is too large, gradient descent may overshoot the lowest point and may even fail to converge: each iteration takes a big step, crossing the lowest point again and again, until it is actually getting farther and farther from the lowest point. **So if α is too large, gradient descent may fail to converge, or even diverge.**
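Both behaviors are easy to see on a toy one-dimensional cost J(θ) = θ², whose derivative is 2θ. This function and the particular α values are chosen only to illustrate the point, not taken from the lecture:

```python
def descend(theta, alpha, steps):
    """Gradient descent on J(theta) = theta**2 (derivative: 2 * theta)."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta

# Small alpha: each step shrinks theta by a factor of 0.98, so after 100
# steps we are closer to the minimum at 0, but still not there.
small = descend(1.0, 0.01, 100)

# Large alpha: each step multiplies theta by (1 - 3) = -2, so |theta|
# doubles every iteration -- the iterates diverge.
large = descend(1.0, 1.5, 10)
```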

One thing worth noting:

**As we approach a local minimum, the derivative automatically becomes smaller, so gradient descent automatically takes smaller steps. This is simply how gradient descent works, so there is actually no need to decrease α over time; we can keep the learning rate α fixed (constant).**

4. Gradient Descent for Linear Regression (Gradient Descent for Linear Regression)

This section applies gradient descent to the linear regression problem. Here is a comparison of the gradient descent algorithm and the linear regression model:

The key to applying the gradient descent method to our previous linear regression problem is to find out the derivative of the cost function, namely:

Based on the above analysis, the gradient descent method can be rewritten as:

The algorithm we have just used is sometimes called batch gradient descent. In fact, in machine learning algorithms are often not given names, but the name "batch gradient descent" refers to the fact that in each step of gradient descent we use all of the training samples: when we compute the derivative term, we need to compute a sum, and that sum runs over all m training examples.
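Putting the pieces together, batch gradient descent for univariate linear regression can be sketched as below. The toy data set (points on y = 2x + 1), the learning rate, and the iteration count are all invented for illustration:

```python
def batch_gradient_descent(xs, ys, alpha=0.1, iters=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x.
    'Batch' means every step sums the error over ALL m training examples."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        # Derivative terms of J: each is a sum over all m samples.
        grad0 = sum((theta0 + theta1 * x) - y for x, y in zip(xs, ys)) / m
        grad1 = sum(((theta0 + theta1 * x) - y) * x for x, y in zip(xs, ys)) / m
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Toy data lying on y = 2x + 1; the fit should recover theta0 ≈ 1, theta1 ≈ 2.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
t0, t1 = batch_gradient_descent(xs, ys)
```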

In a later lesson, we will also discuss a method that can find the minimum of the cost function J without multi-step gradient descent: the normal equation (normal equations) method. In practice, however, gradient descent is better suited than the normal equation to problems with large amounts of data.
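For the one-feature case, the normal equation reduces to the familiar closed-form least-squares solution θ1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², θ0 = ȳ − θ1·x̄, with no iteration and no learning rate. This is a sketch of that special case, not the general matrix form covered later:

```python
def normal_equation_fit(xs, ys):
    """Closed-form least-squares fit for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    # Slope: covariance of x and y divided by variance of x.
    theta1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
              / sum((x - mean_x) ** 2 for x in xs))
    # Intercept: the fitted line passes through the point of means.
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1

# Exact fit for toy points on y = 2x + 1 (data invented for illustration).
t0, t1 = normal_equation_fit([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
```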

Machine Learning (by Andrew Ng) ---- Chapter Two: Univariate Linear Regression (Linear Regression with One Variable)