Machine Learning (vi): Linear Regression and Gradient Descent


This is a reprinted article. Some basics that were not covered in the logistic regression post are explained in detail here, so it is recommended to read this article first.

This article is reproduced from http://blog.csdn.net/xiazdong/article/details/7950084.

=======================================


This article will cover:

(1) Definition of linear regression

(2) Single-variable linear regression

(3) Cost function: a way to evaluate how well the linear regression fits the training set

(4) Gradient descent: one of the methods for solving linear regression

(5) Feature scaling: a method to speed up the execution of gradient descent

(6) Multivariable linear regression

Note: multivariable linear regression must be preceded by feature scaling.

Method: Linear regression belongs to supervised learning, so the procedure is the same as for other supervised learning problems: first we are given a training set; we learn a linear function from that training set; then we test how good the learned function is (that is, whether it fits the training data well enough) and select the best one (the one with the minimum cost function). Note: (1) because this is linear regression, the learned function is a linear function; (2) because it is single-variable, there is only one feature x.
The model of univariate linear regression is h(x) = θ0 + θ1 * x. We often call x the feature and h(x) the hypothesis.
From the "method" above, a natural question arises: how do we tell whether the linear function fits well or badly? For this we need the cost function: the smaller the cost function, the better the linear regression (and the better the fit to the training set); the smallest possible value is 0, which means a perfect fit.

As a practical example:

We want to predict the price of a house based on the size of the house, given the following dataset:

Plotting the dataset above gives the points shown in the following illustration:

We need to fit a line to these points so that the cost function is minimized;


Although we do not yet know what is inside the cost function, our goal is clear: given the input vector x, the output vector y, and the vector θ, the cost function outputs a cost value;
This describes the general process of univariate linear regression.

Cost Function

Purpose of the cost function: to evaluate the hypothesis function; the smaller the cost function, the better the hypothesis fits the training data. The figure below illustrates what the cost function does when it is treated as a black box;
But of course we want to know the internal structure of the cost function, so here is the formula:

J(θ0, θ1) = 1/(2m) * Σ_{i=1..m} (h(x(i)) - y(i))^2

where x(i) is the i-th element of the vector x, y(i) is the i-th element of the vector y, h is the known hypothesis, and m is the number of training examples;

For example, given the dataset (1,1), (2,2), (3,3),
then x = [1;2;3], y = [1;2;3] (the syntax here is Octave syntax, representing a 3*1 column vector).
If we choose θ0 = 0, θ1 = 1, then h(x) = x, and the cost function is:
J(0,1) = 1/(2*3) * [(h(1)-1)^2 + (h(2)-2)^2 + (h(3)-3)^2] = 0;
If we choose θ0 = 0, θ1 = 0.5, then h(x) = 0.5x, and the cost function is:
J(0,0.5) = 1/(2*3) * [(h(1)-1)^2 + (h(2)-2)^2 + (h(3)-3)^2] = 0.58;
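
As a quick check, here is a minimal Python sketch (not from the original article) that computes J for this dataset and for the two choices of θ above:

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Squared-error cost J(theta0, theta1) = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(x)
    h = theta0 + theta1 * x          # hypothesis h(x) = theta0 + theta1 * x
    return np.sum((h - y) ** 2) / (2 * m)

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

print(cost(0.0, 1.0, x, y))   # 0.0    -> perfect fit
print(cost(0.0, 0.5, x, y))   # ~0.583 -> worse fit
```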
If θ0 is fixed at 0, then J is a function of θ1 alone and can be drawn as a curve; if both θ0 and θ1 are allowed to vary, then J is a surface over (θ0, θ1):
Of course, we can also use a two-dimensional contour plot to express this surface;

Note: for linear regression, the cost function J is always bowl-shaped (convex), that is, it has only a single minimum point;
Above we have explained the definition and formula of the cost function;

Gradient Descent

But another problem arises: although for a given function we can measure how well it fits using the cost function, there are infinitely many candidate functions, and it is impossible to try them one at a time. So we introduce gradient descent, which can find the minimum of the cost function. The intuition behind gradient descent: think of the function as a mountain; we stand on a hillside, look around, and ask in which direction one small step downhill would take us down fastest;
Of course there are many ways to solve this problem; gradient descent is only one of them. Another method is called the normal equation;
Method:
(1) First decide the step size for each downhill move; we call it the learning rate α.
(2) Start from an arbitrary initial value of θ.
(3) Determine a downhill direction, go down by the predetermined step, and update θ.
(4) Stop descending when the decrease in height is smaller than a defined threshold.
Algorithm: repeat until convergence,

θj := θj - α * ∂J(θ0, θ1)/∂θj    (updating θ0 and θ1 simultaneously)


Characteristics: (1) different initial points can lead to different minima, so gradient descent only finds a local minimum; (2) the closer we get to the minimum, the slower the descent becomes;
Question: what happens if the initial value is already at a local minimum? Answer: since we are already at a local minimum, the derivative there is certainly 0, so θ will not change;
If a suitable learning rate α is chosen, the cost function should keep getting smaller. Question: how do we choose α? Answer: observe the cost function as the iterations proceed; if it keeps decreasing, the current value is fine; otherwise, take a smaller value;
The following diagram gives a detailed illustration of the gradient descent process:
As can be seen from the diagram, different initial points lead to different minima, so gradient descent only finds a local minimum;
Note: the descending step size is very important, because if it is too small, finding the minimum of the function is very slow, and if it is too large, we may overshoot the minimum;
The following figure shows the overshooting phenomenon:
If the cost function J increases with the chosen learning rate, the learning rate needs to be reduced;
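
To make the role of the learning rate concrete, here is a small illustrative Python sketch (not part of the original article) that runs the update θ := θ - α * J'(θ) on the one-variable function J(θ) = θ^2; a small α converges slowly, while an α that is too large overshoots and diverges:

```python
def gradient_descent_1d(alpha, theta=5.0, steps=20):
    """Minimize J(theta) = theta^2 with the update theta := theta - alpha * J'(theta)."""
    for _ in range(steps):
        grad = 2 * theta             # derivative of theta^2
        theta = theta - alpha * grad
    return theta

print(gradient_descent_1d(alpha=0.1))    # close to 0: converges
print(gradient_descent_1d(alpha=0.01))   # still far from 0: converges, but slowly
print(gradient_descent_1d(alpha=1.1))    # |theta| grows every step: overshoots the minimum
```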

Integrating Gradient Descent with Linear Regression

Gradient descent can find the minimum of a function, and linear regression needs the cost function to be minimized.
So we can apply gradient descent to the cost function, which integrates gradient descent with linear regression, as shown in the following figure. Written out, the update rules become:

θ0 := θ0 - α * (1/m) * Σ_{i=1..m} (h(x(i)) - y(i))
θ1 := θ1 - α * (1/m) * Σ_{i=1..m} (h(x(i)) - y(i)) * x(i)

(both parameters are updated simultaneously on every iteration)
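
As an illustration, here is a small Python sketch of these two update rules (my own code, run on a made-up toy dataset rather than the article's house-price data):

```python
import numpy as np

def linear_regression_gd(x, y, alpha=0.1, iterations=1000):
    """Fit h(x) = theta0 + theta1 * x by gradient descent on the squared-error cost."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        error = theta0 + theta1 * x - y              # h(x(i)) - y(i) for every example
        new_theta0 = theta0 - alpha * error.sum() / m
        new_theta1 = theta1 - alpha * (error * x).sum() / m
        theta0, theta1 = new_theta0, new_theta1      # simultaneous update
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])                   # generated from y = 1 + 2x
print(linear_regression_gd(x, y))                     # approximately (1.0, 2.0)
```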


Gradient descent works by iteration, and we care about the number of iterations, because it determines how fast gradient descent runs; to reduce the number of iterations, feature scaling is introduced;

Feature Scaling

This method is applied to gradient descent in order to speed up its execution. The idea: standardize the value of each feature so that its range is roughly -1 <= x <= 1;
A commonly used method is mean normalization: x := (x - mean(x)) / std(x);

As a practical example, suppose there are two features: (1) size, with range 0~2000, and (2) #bedrooms, with range 0~5. Then after feature scaling we can use x1 = size/2000 and x2 = #bedrooms/5, so both features fall roughly within [0, 1].
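
A minimal Python sketch of this rescaling (the division by each feature's maximum is my own illustration of the idea above, and the numbers are made up):

```python
import numpy as np

# toy feature matrix: one row per house, columns = [size, #bedrooms]
X = np.array([[2104.0, 3.0],
              [1416.0, 2.0],
              [1534.0, 3.0],
              [852.0,  2.0]])

# divide each column by its maximum so every feature ends up roughly in [0, 1]
X_scaled = X / X.max(axis=0)
print(X_scaled)
```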

Exercise
We want to predict final exam results from midterm results, and we want to learn an equation of the form y = θ0 + θ1 * (midterm exam) + θ2 * (midterm exam)^2.
Given the following training set:

midterm exam    (midterm exam)^2    final exam
89              7921                96
72              5184                74
94              8836                87
69              4761                78

We want to apply feature scaling (mean normalization over the range max - min) to the (midterm exam)^2 feature. What is its value after feature scaling for the last example?
max = 8836, min = 4761, mean = 6675.5, so x = (4761 - 6675.5) / (8836 - 4761) = -0.47;
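
A quick Python check of this computation (a sketch; the column of values comes from the table above):

```python
import numpy as np

x = np.array([7921.0, 5184.0, 8836.0, 4761.0])   # the (midterm exam)^2 column

# mean normalization: (value - mean) / (max - min)
scaled = (x - x.mean()) / (x.max() - x.min())
print(scaled[-1])   # about -0.47 for the example with (midterm exam)^2 = 4761
```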

Multivariable Linear Regression

So far we have only introduced univariate linear regression, i.e. with a single input variable; the real world is not that simple, so here we introduce multivariable linear regression;
For example, house prices are determined by many factors, such as size, number of bedrooms, number of floors, and age of the home. Here we assume that the price is determined by 4 factors, as shown in the following figure:


We previously defined the model for univariate linear regression:

h(x) = θ0 + θ1 * x

Here we can define the model for multivariable linear regression:

h(x) = θ0 + θ1 * x1 + θ2 * x2 + θ3 * x3 + θ4 * x4

The cost function is as follows:

J(θ0, ..., θ4) = 1/(2m) * Σ_{i=1..m} (h(x(i)) - y(i))^2
If we want to use gradient descent to solve multivariable linear regression, we can still apply the usual gradient descent update, now for every parameter θj:

θj := θj - α * (1/m) * Σ_{i=1..m} (h(x(i)) - y(i)) * xj(i)    (with x0(i) = 1)
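
A small vectorized Python sketch of that update (my own illustration; the data below is made up, not the article's housing table):

```python
import numpy as np

def multivariate_gd(X, y, alpha=0.5, iterations=5000):
    """Gradient descent for h(x) = theta^T x, where X already contains a column of ones."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        error = X @ theta - y                     # h(x(i)) - y(i) for every example
        theta -= alpha * (X.T @ error) / m        # update every theta_j at once
    return theta

# toy data: y = 1 + 2*x1 + 3*x2, with features already scaled to a small range
X = np.array([[1.0, 0.2, 0.7],
              [1.0, 0.8, 0.1],
              [1.0, 0.5, 0.9],
              [1.0, 0.9, 0.4],
              [1.0, 0.1, 0.3]])
y = X @ np.array([1.0, 2.0, 3.0])
print(multivariate_gd(X, y))                      # approximately [1. 2. 3.]
```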


 
General exercises:

 
1. We want to predict a student's second-year results from their first-year results: x is the number of A grades in the first year and y is the number of A grades in the second year. Given the following dataset:

x    y
3    4
2    1
4    3
0    1

(1) How many training examples are there? 4.
(2) What is J(0,1)? J(0,1) = 1/(2*4) * [(3-4)^2 + (2-1)^2 + (4-3)^2 + (0-1)^2] = 1/8 * (1+1+1+1) = 1/2 = 0.5;
We can also work out J(0,1) quickly by vectorization:
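
A minimal vectorized version in Python (a sketch, using the same data as the table above):

```python
import numpy as np

x = np.array([3.0, 2.0, 4.0, 0.0])
y = np.array([4.0, 1.0, 3.0, 1.0])

theta0, theta1 = 0.0, 1.0
errors = theta0 + theta1 * x - y          # h(x) - y for all examples at once
J = (errors @ errors) / (2 * len(x))      # 1/(2m) * sum of squared errors
print(J)                                   # 0.5
```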

