Linear Regression and Gradient Descent

Notes on Stanford's machine learning course. Source: http://blog.csdn.net/xiazdong/article/details/7950084. This article covers:

(1) Definition of linear regression

(2) Single-variable linear regression

(3) Cost function: a method for evaluating how well a linear regression fits the training set

(4) Gradient descent: one way to solve linear regression

(5) Feature scaling: a method for accelerating gradient descent

(6) Multi-variable linear regression


Linear Regression

Note: feature scaling is required before multi-variable linear regression!
Method: linear regression is supervised learning, so it follows the same procedure as any supervised learning: first a training set is given; a linear function is learned from that training set; then we test how well the learned function fits (that is, whether the function is sufficient to fit the training set data) and select the best function, the one with the minimum cost function.

Note: (1) because this is linear regression, the learned function is a linear function; (2) because it is single-variable, there is only one input x.

The single-variable linear regression model is:

h(x) = theta0 + theta1 * x

We usually call x the feature and h(x) the hypothesis.

From the method above, one question is unavoidable: how can we tell whether a linear function fits well? We use the cost function: the smaller the cost function, the better the linear regression fits the training set. The minimum value 0 means a perfect fit.

For example:

We want to predict the price of a house based on the house size. The following data set is given:

[figure: the dataset of house sizes and prices]

Plotting the points in this dataset gives a scatter of house prices against sizes [figure omitted].

We need to fit a straight line through these points that minimizes the cost function.

Although we do not yet know the cost function's internal form, our goal is clear: given the input vector X, the output vector y, and the theta vector, produce a cost value. That is the general process of single-variable linear regression.

Cost Function

Purpose of the cost function: to evaluate the hypothesis function. The smaller the cost function, the better the hypothesis fits the training data. But what does the cost function look like inside? It is given by the following formula:

J(theta0, theta1) = 1/(2m) * sum from i = 1 to m of (h(x^(i)) - y^(i))^2

where x^(i) is the i-th element of vector X, y^(i) is the i-th element of vector y, h is the hypothesis function, and m is the number of training examples.

For example, given the dataset (1, 1), (2, 2), (3, 3),
then X = [1; 2; 3], y = [1; 2; 3] (this is Octave syntax for a 3*1 matrix).
If we choose theta0 = 0, theta1 = 1, then h(x) = x, and the cost function is:
J(0, 1) = 1/(2*3) * [(h(1)-1)^2 + (h(2)-2)^2 + (h(3)-3)^2] = 0;
If we choose theta0 = 0, theta1 = 0.5, then h(x) = 0.5x, and the cost function is:
J(0, 0.5) = 1/(2*3) * [(h(1)-1)^2 + (h(2)-2)^2 + (h(3)-3)^2] = 3.5/6 ≈ 0.58;
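To make the numbers above reproducible, here is a minimal Python/NumPy sketch of this cost computation (the function name compute_cost and the use of Python rather than the course's Octave are my own choices):

```python
import numpy as np

def compute_cost(X, y, theta0, theta1):
    # J(theta0, theta1) = 1/(2m) * sum((h(x) - y)^2) with h(x) = theta0 + theta1*x
    m = len(X)
    predictions = theta0 + theta1 * X        # h(x) for every training example
    return np.sum((predictions - y) ** 2) / (2 * m)

X = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
print(compute_cost(X, y, 0.0, 1.0))   # 0.0 -- a perfect fit
print(compute_cost(X, y, 0.0, 0.5))   # 0.5833..., which rounds to 0.58
```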

If theta0 is fixed at 0, J becomes a function of theta1 alone [figure omitted]. If neither theta0 nor theta1 is fixed, J is a function of both, shown in the original as a 3-D surface; the same surface can also be drawn as a two-dimensional contour map.

Note: for linear regression, the cost function J must be bowl-shaped (convex), that is, it has only one minimum point.

That covers the definition and formula of the cost function.

Gradient Descent

A new problem now arises: given a function, the cost function tells us how well it fits, but there are infinitely many candidate functions, so we cannot try them one by one. This leads us to gradient descent, which finds the minimum of the cost function directly. (There are several ways to solve the problem; gradient descent is only one of them, and another method is the normal equation.)

Principle of gradient descent: picture the function as a mountain. We stand on a hillside, look around, and ask: in which direction does a small step take us downhill fastest?

Method:
(1) Choose the size of each step, which we call the learning rate;
(2) Pick an arbitrary initial value for the parameters;
(3) Determine a downhill direction and update the parameters by the chosen step, repeatedly;
(4) When the descent per step is smaller than some defined threshold, stop.

Algorithm: repeat until convergence, updating every parameter simultaneously:

theta_j := theta_j - alpha * (partial derivative of J with respect to theta_j)

Features:
(1) Different initial points can lead to different minima, so gradient descent only finds a local minimum;
(2) The closer we get to the minimum, the slower the descent, because the derivative shrinks.

Question: if the initial value already sits at a local minimum, how does its position change?
Answer: since it is already at a local minimum, the derivative there must be 0, so the parameters will not change.

If a correct learning rate is chosen, the cost function should keep getting smaller. Question: how do we set the learning rate? Answer: observe the cost value as you go; if the cost function keeps shrinking, the value is fine, otherwise take a smaller one.

The figures in the original post illustrate the descent process in detail: the minimum reached varies with the initial value, which again shows that gradient descent only finds a local minimum. Note that the learning rate matters a great deal: if it is too small, finding the minimum of the function is very slow; if it is too large, we may overshoot the minimum. If the J function increases after a step, the learning rate needs to be reduced.
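As a concrete sketch of the method above, here is batch gradient descent for single-variable linear regression in Python/NumPy (the learning rate alpha = 0.1, the fixed iteration count standing in for the stopping threshold of step (4), and all names are my own assumptions):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    # Minimize J(theta0, theta1) for h(x) = theta0 + theta1*x.
    m = len(X)
    theta0, theta1 = 0.0, 0.0               # step (2): arbitrary initial values
    for _ in range(num_iters):              # fixed iterations instead of step (4)'s threshold
        errors = theta0 + theta1 * X - y    # h(x) - y for every example
        # step (3): move downhill; both parameters are updated simultaneously
        theta0 -= alpha * np.sum(errors) / m
        theta1 -= alpha * np.sum(errors * X) / m
    return theta0, theta1

X = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
print(gradient_descent(X, y))   # converges toward (0.0, 1.0), i.e. h(x) = x
```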
Integrating Gradient Descent with Linear Regression

Gradient descent can obtain the minimum value of a function, and linear regression needs the minimum of its cost function. Therefore, we can apply gradient descent to the cost function and so integrate gradient descent with linear regression. Since gradient descent works through continuous iteration, we pay close attention to the number of iterations, because it determines how fast gradient descent runs. To reduce the number of iterations, feature scaling is introduced.

Feature Scaling

This method is applied to gradient descent to accelerate its execution. Idea: standardize the value of each feature so that its range is roughly -1 <= x <= 1. The common method is mean normalization, that is:

x := (x - mean(x)) / (max(x) - min(x)), or: x := (x - mean(x)) / std(x)

For example, suppose there are two features: (1) size, with range 0 to 2000; (2) number of bedrooms, with range 0 to 5. After feature scaling, both features lie roughly in the range -1 <= x <= 1, and gradient descent needs far fewer iterations.
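Here is a small NumPy sketch of mean normalization for these two features (the three example rows are invented for illustration):

```python
import numpy as np

# Hypothetical training examples: columns are [size, number of bedrooms]
features = np.array([[2000.0, 5.0],
                     [1200.0, 3.0],
                     [ 850.0, 2.0]])

# Mean normalization, column by column: (x - mean(x)) / (max(x) - min(x))
scaled = (features - features.mean(axis=0)) / (features.max(axis=0) - features.min(axis=0))
print(scaled)   # every entry now lies roughly in [-1, 1]
```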
Exercise

We want to predict the final exam score from the midterm exam score. The hypothesis is of the form h(x) = theta0 + theta1*x1 + theta2*x2, where x1 is the midterm score and x2 is (midterm exam)^2, which is why the training set lists both columns:
Midterm exam   (Midterm exam)^2   Final exam
89             7921               96
72             5184               74
94             8836               87
69             4761               78
If we apply feature scaling to the (midterm exam)^2 feature using mean normalization, what is the scaled value for the fourth example (4761)? Here max = 8836, min = 4761, and mean = 6675.5, so x = (4761 - 6675.5) / (8836 - 4761) ≈ -0.47.
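The same computation, checked in a couple of lines (continuing the NumPy sketches above):

```python
import numpy as np

x = np.array([7921.0, 5184.0, 8836.0, 4761.0])   # the (midterm exam)^2 column
scaled = (x - x.mean()) / (x.max() - x.min())    # mean normalization
print(round(scaled[3], 2))                       # -0.47, the fourth example
```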

Multi-Variable Linear Regression

In the previous sections we only covered linear regression with a single variable, i.e., one input variable. The real world is rarely that simple, so we now introduce linear regression with multiple variables. For example, the price of a house is determined by many factors, such as its size, number of bedrooms, number of floors, and age; here we assume the price is determined by those four factors.

We previously defined the single-variable model h(x) = theta0 + theta1 * x. The multi-variable model generalizes it:

h(x) = theta0 + theta1*x1 + theta2*x2 + ... + thetan*xn

The cost function keeps the same form, J(theta) = 1/(2m) * sum from i = 1 to m of (h(x^(i)) - y^(i))^2, and if we want to solve multi-variable linear regression with gradient descent, we can still use the traditional gradient descent algorithm, updating every theta_j simultaneously on each iteration.
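A vectorized sketch of gradient descent for the multi-variable case (the design matrix A with a leading column of ones, the learning rate, and the iteration count are my own illustrative choices; the features are assumed to be already scaled):

```python
import numpy as np

def gradient_descent_multi(A, y, alpha=0.1, num_iters=5000):
    # A is the m x (n+1) design matrix whose first column is all ones.
    m, n = A.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        gradient = A.T @ (A @ theta - y) / m   # all partial derivatives of J at once
        theta -= alpha * gradient              # simultaneous update of every theta_j
    return theta

# Three examples with two already-scaled features plus the intercept column
A = np.array([[1.0, -0.5,  0.2],
              [1.0,  0.1, -0.3],
              [1.0,  0.4,  0.1]])
y = np.array([1.0, 2.0, 3.0])
print(gradient_descent_multi(A, y))   # approaches the theta that fits these points
```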

 

Exercises

1. We want to predict a student's second-year results from their first-year results. Let x be the number of A grades the student earns in the first year, and y the number of A grades in the second year. The following dataset is given:
X Y
3 4
2 1
4 3
0 1
(1) How many training examples are there? m = 4. (2) What is J(0, 1)? With theta0 = 0 and theta1 = 1, h(x) = x, so J(0, 1) = 1/(2*4) * [(3-4)^2 + (2-1)^2 + (4-3)^2 + (0-1)^2] = (1/8) * (1+1+1+1) = 0.5. We can also use vectorization to compute J(0, 1) quickly, as in the sketch below:
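A minimal vectorized version in NumPy (the design-matrix construction is my own phrasing of the vectorization idea):

```python
import numpy as np

X = np.array([3.0, 2.0, 4.0, 0.0])
y = np.array([4.0, 1.0, 3.0, 1.0])
theta = np.array([0.0, 1.0])                 # [theta0, theta1]

# Design matrix: a leading column of ones for the intercept term
A = np.column_stack([np.ones_like(X), X])
errors = A @ theta - y                        # h(x) - y for all examples at once
J = errors @ errors / (2 * len(y))            # vectorized sum of squared errors
print(J)   # 0.5
```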
