"Machine Learning": Linear Regression

Source: Internet
Author: User

I. Curve Fitting

1. Problem Introduction

① Suppose we have a data set recording the living areas of houses in a city and the corresponding house prices.


Table 1 The relationship between living area and house price


Fig. 1 The relationship between living area and house price

Given such a data set, how do we learn a function that predicts house prices in this city, taking the living area as the independent variable?


The problem can be formalized as follows.

Given a training set of size m,

{(x^(i), y^(i)) : i = 1, ..., m},

the objective function we want to learn is

h: X → Y, such that h(x) is a good predictor of the corresponding y.


House price forecasting is essentially a regression problem:

a. Regression analysis mines the relationship between the independent variable and the dependent variable.

b. It is a supervised learning problem: every sample point is labeled with the target variable.

c. The output variable is continuous and can take any real value.

② Suppose now that we have a more detailed data set that also records the number of bedrooms.


Input: x = (x1, x2)

Assume each independent variable is linearly related to the dependent variable y.

The goal is to learn the hypothesis function

h(x) = θ0 + θ1·x1 + θ2·x2
2. How to Model

① Basic Concepts


Relationship between the variables:

- Linearly correlated?
- Nonlinearly correlated?

Mining the relationship:

- Correlation coefficient:

  r = cov(x, y) / (σx · σy)

  When |r| = 1, x and y are said to be fully correlated, and there is a linear function relating x and y.

- Special case: e.g., if we guess that y has an exponential relationship with x, we can observe the linear correlation of ln y with x.

- General case: polynomial curve fitting. Find an appropriate order k to set up the equation (a model-selection step that also appears in, for example, logistic regression).
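The correlation-coefficient check above can be sketched in a few lines. This is a minimal illustration on made-up data; `pearson_r` is a hypothetical helper, not a library function:

```python
import numpy as np

# Made-up data: y grows exactly exponentially with x (y = 2 * e^(0.5 x))
x = np.linspace(1.0, 10.0, 50)
y = 2.0 * np.exp(0.5 * x)

def pearson_r(a, b):
    """Pearson correlation coefficient r = cov(a, b) / (std_a * std_b)."""
    a, b = a - a.mean(), b - b.mean()
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

r_raw = pearson_r(x, y)          # linear correlation of x and y: well below 1
r_log = pearson_r(x, np.log(y))  # ln(y) = ln(2) + 0.5 x is linear in x, so r = 1

print(round(r_raw, 3))
print(round(r_log, 3))
```

Taking the logarithm first recovers the hidden linear relationship, exactly as the special case above suggests.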



② Multivariable Linear Regression

The hypothesis function mentioned above is:

h(x) = θ0 + θ1·x1 + θ2·x2

The θ are the parameters, or weights (reflecting the effect of each independent variable on the output); they parameterize the space of linear functions (the form of h is known, and it is characterized by the parameters).

For convenience, introduce x0 = 1 (corresponding to the intercept term), so the above can be written as

h(x) = Σ_{j=0}^{k} θj·xj = θᵀx

Note: k equals the number of independent variables; here k = 2.
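As a minimal sketch, the hypothesis with an intercept term is just a dot product. The θ values below are made-up numbers for illustration, not fitted parameters:

```python
import numpy as np

# Illustrative parameters: [theta0 (intercept), theta1 (area), theta2 (bedrooms)]
theta = np.array([50.0, 0.1, 20.0])

def h(x, theta):
    """Linear hypothesis: prepend x0 = 1, then take the dot product with theta."""
    x = np.concatenate(([1.0], x))
    return theta @ x

price = h(np.array([2104.0, 3.0]), theta)  # living area = 2104, bedrooms = 3
print(price)
```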

3. How to Obtain the Parameters

A reasonable selection strategy: for each sample in the data set, choose the parameters so that the prediction h(x) is as close as possible to y. In practice, "as close as possible" is expressed by a cost function.

Cost function

A cost function describes the difference between the predicted value and the true value, and the parameters of the objective function are optimized against it. Possible choices include the 0-1 loss, absolute loss, squared loss, and logarithmic loss.

For the linear regression problem, we use the objective function

J(θ) = (1/2) Σ_{i=1}^{m} (h(x^(i)) − y^(i))²

This is the ordinary least-squares regression model from statistics. It can also be explained with probability theory, as follows.
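A minimal sketch of this squared-error cost J(θ) = (1/2)·Σ(h(x^(i)) − y^(i))², on made-up data where X already includes the intercept column x0 = 1:

```python
import numpy as np

# Illustrative training data: 3 samples, intercept column plus one feature
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])

def cost(theta, X, y):
    """J(theta) = 1/2 * sum of squared residuals."""
    residual = X @ theta - y
    return 0.5 * residual @ residual

print(cost(np.array([1.0, 1.0]), X, y))  # these parameters fit exactly: J = 0.0
print(cost(np.array([0.0, 0.0]), X, y))  # all-zero parameters: J = 14.5
```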

II. Probabilistic Interpretation

1. Why choose the least-squares (squared loss) cost function:

We make the following assumption:

y^(i) = θᵀx^(i) + ε^(i)

ε^(i): the error term (effects the model does not capture, e.g., omitted factors) or random noise.

Further assume:

ε^(i) ~ N(0, σ²)

That is,

p(ε^(i)) = (1 / (√(2π)·σ)) · exp(−(ε^(i))² / (2σ²))

Note that the following equation is equivalent to this one:

p(y^(i) | x^(i); θ) = (1 / (√(2π)·σ)) · exp(−(y^(i) − θᵀx^(i))² / (2σ²))
We can then use the likelihood function to explain the least-squares cost function.

Definition: given the random variable X and the parameters θ, the likelihood measures how plausible the observed results y are:

L(θ) = p(y | X; θ)

Assuming the ε^(i) are independent of one another,

L(θ) = Π_{i=1}^{m} p(y^(i) | x^(i); θ) = Π_{i=1}^{m} (1 / (√(2π)·σ)) · exp(−(y^(i) − θᵀx^(i))² / (2σ²))
Briefly: for the m input samples, the goal is to maximize the product of the probabilities of producing the m observed outputs y; the larger this product, the more plausibly the model is constructed. This is the maximum likelihood estimate.

Definition: maximum likelihood estimation

Given a likelihood function (the probability model relating y and x), a reasonable parameter-estimation method is to choose the parameters that make the observed data as probable as possible, i.e., to maximize the likelihood function.

In practice, the log-likelihood is commonly used:

ℓ(θ) = log L(θ) = m·log(1 / (√(2π)·σ)) − (1/σ²) · (1/2) Σ_{i=1}^{m} (y^(i) − θᵀx^(i))²

Thus maximizing the likelihood is equivalent to minimizing the squared-loss function J(θ).


III. Solving the Model

1. Gradient Descent (steepest descent)

The negative gradient direction is the direction in which the function value decreases. Using it as the search direction of each iteration makes the objective function being optimized decrease step by step.

The update rule is

θj := θj − α · ∂J(θ)/∂θj

α: the learning rate.

Among them, the key derivative (for a single training sample) is:

∂J(θ)/∂θj = (h(x) − y) · xj

LMS update rule:

θj := θj + α · (y^(i) − h(x^(i))) · xj^(i)

Note: each parameter update here uses only one training sample, and one such update is made for each component j of θ.

2. Batch gradient descent

θj := θj + α · Σ_{i=1}^{m} (y^(i) − h(x^(i))) · xj^(i)   (for every j)

Each parameter update depends on all samples of the training set.

For the linear regression problem, the cost function is a convex quadratic function, so gradient descent reaches the global optimal solution.


Fig. 2 Iterative process of the gradient descent method
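The batch update can be sketched as follows; the data, learning rate, and iteration count are illustrative choices, not tuned values:

```python
import numpy as np

# Illustrative training data: first column is the intercept term x0 = 1
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])

theta = np.zeros(2)
alpha = 0.05
for _ in range(5000):
    # Each update uses ALL m samples: theta_j += alpha * sum_i (y_i - h(x_i)) * x_ij
    gradient = X.T @ (y - X @ theta)
    theta = theta + alpha * gradient

print(np.round(theta, 3))  # converges to [1.0, 1.0] for this data
```

Because J(θ) is convex and quadratic here, this iteration converges to the unique global optimum.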

3. Stochastic gradient descent

θj := θj + α · (y^(i) − h(x^(i))) · xj^(i),  using a single sample i per update

Features:

1. Select one sample point at random each time and update the parameters immediately.
2. The drop in the cost of a single sample point approximates the drop in the total cost.
3. It is sensitive to the step-size choice and may overshoot the minimum.
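The procedure above can be sketched as follows; the data, step size, and iteration count are illustrative, and the random seed is fixed only for reproducibility:

```python
import numpy as np

# Stochastic gradient descent (the LMS rule): each update uses one sample.
rng = np.random.default_rng(0)
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])          # first column is the intercept term x0 = 1
y = np.array([2.0, 3.0, 4.0])

theta = np.zeros(2)
alpha = 0.01
for _ in range(20000):
    i = rng.integers(len(y))             # pick one sample point at random
    error = y[i] - X[i] @ theta          # y(i) - h(x(i))
    theta = theta + alpha * error * X[i] # update immediately

print(np.round(theta, 2))
```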


4. Comparison of the methods

1. Batch gradient descent is a batch-update algorithm; stochastic gradient descent is an online algorithm.
2. The batch gradient method optimizes the empirical risk; the stochastic gradient method optimizes the generalization risk.
3. The batch gradient method may fall into a local optimum; the stochastic gradient method has a chance of finding the global optimum.
4. The batch gradient method is insensitive to the step size; the stochastic gradient method is sensitive to the step-size choice.
5. The gradient method is sensitive to the choice of the initial point (initial parameters).


5. Input preprocessing

A. Normalization

Normalize the input features to ensure they are on a similar scale; not all data needs to be normalized.

Reason: otherwise the gradient descent method may descend along a zigzag path, which hurts the convergence speed of the algorithm.

General practice:

Subtract the mean, then divide by the range (maximum minus minimum) or by the standard deviation.
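A minimal sketch of mean/standard-deviation normalization; the rows are made-up [living area, bedrooms] samples (the range max − min would be an alternative divisor):

```python
import numpy as np

# Illustrative raw features on very different scales
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])

mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_norm = (X - mu) / sigma   # subtract the mean, divide by the standard deviation

print(X_norm.mean(axis=0))  # approximately [0, 0]
print(X_norm.std(axis=0))   # approximately [1, 1]
```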

B. Selection of the step size

For gradient descent methods, note two questions:

1. "Debugging": how to make sure the gradient descent algorithm is executing correctly;
2. How to choose the correct step size (learning rate) α.

How to choose α — an empirical method: try values such as

..., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ...

In particular, for the stochastic gradient descent method, the step-size choice needs to meet two conditions:

① it guarantees that the algorithm converges;
② it preserves a chance of searching for the global optimal solution.
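The empirical method can be sketched as below: run a fixed number of batch-gradient iterations for each candidate α and compare the resulting cost. The data and the `final_cost` helper are made up for illustration:

```python
import numpy as np

# Illustrative data: intercept column plus one feature
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])

def final_cost(alpha, iters=100):
    """Cost J(theta) left after a fixed budget of batch updates."""
    theta = np.zeros(2)
    for _ in range(iters):
        theta = theta + alpha * (X.T @ (y - X @ theta))  # batch update
    residual = X @ theta - y
    return 0.5 * residual @ residual

for alpha in (0.001, 0.003, 0.01, 0.03, 0.1):
    # too small an alpha converges slowly; too large an alpha may diverge
    print(alpha, final_cost(alpha))
```

Scanning the printed costs shows which α makes J(θ) drop fastest within the budget.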

6. Normal equation

Apply the hypothesis function to every sample:

h(x^(i)) = θᵀx^(i), i = 1, ..., m

Stack the samples into the design matrix X (one row per x^(i)ᵀ) and the targets into the vector y = (y^(1), ..., y^(m))ᵀ.

The cost function can then be rewritten as:

J(θ) = (1/2)·(Xθ − y)ᵀ(Xθ − y)

This problem is equivalent to:

min_θ ‖Xθ − y‖²

That is, minimizing the Euclidean distance between the two vectors Xθ and y.

Geometric meaning: Xθ is the projection of y onto the column space of X.

Setting the gradient to zero yields the normal equation XᵀXθ = Xᵀy, whose solution is θ = (XᵀX)⁻¹Xᵀy.

We need to ensure that XᵀX is invertible (a sufficient condition for invertibility: the columns of X are linearly independent).

In retrospect, our earlier approach used iterative methods to drive down the value of the cost function, while the normal equation finds the minimizer in closed form. Either way, by iteration or by other means, the optimal solution can be obtained as long as the above conditions hold.

But real data are not always so ideal.

If XᵀX is not invertible, how do we solve the problem?

1. Use the pseudo-inverse (the statistical solution).
2. Remove redundant features (linearly correlated ones).
3. When there are too many features, e.g., m <= n (m is the number of samples, n is the number of features), remove some features.
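A minimal sketch of both routes in NumPy; the numbers are illustrative (an intercept column plus living area in thousands of square feet), and `np.linalg.pinv` provides the pseudo-inverse:

```python
import numpy as np

# Illustrative design matrix and targets
X = np.array([[1.0, 2.104],
              [1.0, 1.600],
              [1.0, 2.400]])
y = np.array([400.0, 330.0, 369.0])

theta = np.linalg.solve(X.T @ X, X.T @ y)  # normal equation: needs X^T X invertible
theta_pinv = np.linalg.pinv(X) @ y         # pseudo-inverse: also handles singular X^T X

print(np.round(theta, 3))
```

When XᵀX is invertible the two routes agree; the pseudo-inverse is the safe fallback when it is not.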

IV. Summary

1. Gradient descent

Needs an appropriately selected learning rate α;

Requires many rounds of iteration;

Works well even when n is very large (n is the number of features, i.e., the dimension).

2. Normal equation

No need to select α;

No iteration; solved in one step;

Needs to compute (XᵀX)⁻¹, whose time complexity is O(n³);

When n is very large it is very slow; consider dimensionality reduction.





