"Machine Learning": Linear Regression

Source: Internet
Author: User

I. Curve Fitting

1. Problem Introduction

① Suppose we have a data set recording the living areas of houses in a city and the corresponding house prices.


Table 1 The relationship between living area and house price


Fig. 1 The relationship between living area and house price

Given such a data set, how do we learn a function that predicts house prices in this city, taking the living area as the independent variable?


The problem can be formalized as follows.

Given a training set of size m,

{(x^(i), y^(i)) : i = 1, ..., m},

the objective function we want to learn is

h: X → Y, such that h(x) is a good predictor of the corresponding y.


House price forecasting is essentially a regression problem:

a. Regression analysis mines the relationship between the independent variable and the dependent variable.

b. It is a supervised learning problem: every sample point is labeled with the target variable.

c. The output variable is continuous and can take any real value.

② Suppose now that we have a more detailed data set that also records the number of bedrooms.


Input: x = (x1, x2)

Assume each independent variable is linearly related to the dependent variable y.

The goal is to learn the hypothesis function

h(x) = θ0 + θ1·x1 + θ2·x2
2. How to Model

① Basic Concepts


Relationship between the variables:

- Linearly correlated?
- Nonlinearly correlated?

Mining the relationship:

- Correlation coefficient:

  r = cov(x, y) / (σx · σy)

  When |r| = 1, x and y are said to be fully correlated, and there is a linear function relating x and y.

- Special case: e.g., if we guess that y has an exponential relationship with x, we can observe the linear correlation of ln y with x.

- General case: polynomial curve fitting. Find an appropriate order k to set up the equation (a model-selection step that also appears in, for example, logistic regression).
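The correlation-coefficient check above can be sketched in a few lines. This is a minimal illustration on made-up data; `pearson_r` is a hypothetical helper, not a library function:

```python
import numpy as np

# Made-up data: y grows exactly exponentially with x (y = 2 * e^(0.5 x))
x = np.linspace(1.0, 10.0, 50)
y = 2.0 * np.exp(0.5 * x)

def pearson_r(a, b):
    """Pearson correlation coefficient r = cov(a, b) / (std_a * std_b)."""
    a, b = a - a.mean(), b - b.mean()
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

r_raw = pearson_r(x, y)          # linear correlation of x and y: well below 1
r_log = pearson_r(x, np.log(y))  # ln(y) = ln(2) + 0.5 x is linear in x, so r = 1

print(round(r_raw, 3))
print(round(r_log, 3))
```

Taking the logarithm first recovers the hidden linear relationship, exactly as the special case above suggests.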



② Multivariable Linear Regression

The hypothesis function mentioned above is:

h(x) = θ0 + θ1·x1 + θ2·x2

The θ are the parameters, or weights (reflecting the effect of each independent variable on the output); they parameterize the space of linear functions (the form of h is known, and it is characterized by the parameters).

For convenience, introduce x0 = 1 (corresponding to the intercept term), so the above can be written as

h(x) = Σ_{j=0}^{k} θj·xj = θᵀx

Note: k equals the number of independent variables; here k = 2.
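As a minimal sketch, the hypothesis with an intercept term is just a dot product. The θ values below are made-up numbers for illustration, not fitted parameters:

```python
import numpy as np

# Illustrative parameters: [theta0 (intercept), theta1 (area), theta2 (bedrooms)]
theta = np.array([50.0, 0.1, 20.0])

def h(x, theta):
    """Linear hypothesis: prepend x0 = 1, then take the dot product with theta."""
    x = np.concatenate(([1.0], x))
    return theta @ x

price = h(np.array([2104.0, 3.0]), theta)  # living area = 2104, bedrooms = 3
print(price)
```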

3. How to Obtain the Parameters

A reasonable selection strategy: for each sample in the data set, choose the parameters so that the prediction h(x) is as close as possible to y. In practice, "as close as possible" is expressed by a cost function.

Cost function

A cost function describes the difference between the predicted value and the true value, and the parameters of the objective function are optimized against it. Possible choices include the 0-1 loss, absolute loss, squared loss, and logarithmic loss.

For the linear regression problem, we use the objective function

J(θ) = (1/2) Σ_{i=1}^{m} (h(x^(i)) − y^(i))²

This is the ordinary least-squares regression model from statistics. It can also be explained with probability theory, as follows.
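A minimal sketch of this squared-error cost J(θ) = (1/2)·Σ(h(x^(i)) − y^(i))², on made-up data where X already includes the intercept column x0 = 1:

```python
import numpy as np

# Illustrative training data: 3 samples, intercept column plus one feature
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])

def cost(theta, X, y):
    """J(theta) = 1/2 * sum of squared residuals."""
    residual = X @ theta - y
    return 0.5 * residual @ residual

print(cost(np.array([1.0, 1.0]), X, y))  # these parameters fit exactly: J = 0.0
print(cost(np.array([0.0, 0.0]), X, y))  # all-zero parameters: J = 14.5
```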

II. Probabilistic Interpretation

1. Why choose the least-squares (squared loss) cost function:

We make the following assumption:

y^(i) = θᵀx^(i) + ε^(i)

ε^(i): the error term (effects the model does not capture, e.g., omitted factors) or random noise.

Further assume:

ε^(i) ~ N(0, σ²)

That is,

p(ε^(i)) = (1 / (√(2π)·σ)) · exp(−(ε^(i))² / (2σ²))

Note that the following equation is equivalent to this one:

p(y^(i) | x^(i); θ) = (1 / (√(2π)·σ)) · exp(−(y^(i) − θᵀx^(i))² / (2σ²))
We can then use the likelihood function to explain the least-squares cost function.

Definition: given the random variable X and the parameters θ, the likelihood measures how plausible the observed results y are:

L(θ) = p(y | X; θ)

Assuming the ε^(i) are independent of one another,

L(θ) = Π_{i=1}^{m} p(y^(i) | x^(i); θ) = Π_{i=1}^{m} (1 / (√(2π)·σ)) · exp(−(y^(i) − θᵀx^(i))² / (2σ²))
Briefly: for the m input samples, the goal is to maximize the product of the probabilities of producing the m observed outputs y; the larger this product, the more plausibly the model is constructed. This is the maximum likelihood estimate.

Definition: maximum likelihood estimation

Given a likelihood function (the probability model relating y and x), a reasonable parameter-estimation method is to choose the parameters that make the observed data as probable as possible, i.e., to maximize the likelihood function.

In practice, the log-likelihood is commonly used:

ℓ(θ) = log L(θ) = m·log(1 / (√(2π)·σ)) − (1/σ²) · (1/2) Σ_{i=1}^{m} (y^(i) − θᵀx^(i))²

Thus maximizing the likelihood is equivalent to minimizing the squared-loss function J(θ).


III. Solving the Model

1. Gradient Descent (steepest descent)

The negative gradient direction is the direction in which the function value decreases. Using it as the search direction of each iteration makes the objective function being optimized decrease step by step.

The update rule is

θj := θj − α · ∂J(θ)/∂θj

α: the learning rate.

Among them, the key derivative (for a single training sample) is:

∂J(θ)/∂θj = (h(x) − y) · xj

LMS update rule:

θj := θj + α · (y^(i) − h(x^(i))) · xj^(i)

Note: each parameter update here uses only one training sample, and one such update is made for each component j of θ.

2. Batch gradient descent

θj := θj + α · Σ_{i=1}^{m} (y^(i) − h(x^(i))) · xj^(i)   (for every j)

Each parameter update depends on all samples of the training set.

For the linear regression problem, the cost function is a convex quadratic function, so gradient descent reaches the global optimal solution.


Fig. 2 Iterative process of the gradient descent method
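The batch update can be sketched as follows; the data, learning rate, and iteration count are illustrative choices, not tuned values:

```python
import numpy as np

# Illustrative training data: first column is the intercept term x0 = 1
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])

theta = np.zeros(2)
alpha = 0.05
for _ in range(5000):
    # Each update uses ALL m samples: theta_j += alpha * sum_i (y_i - h(x_i)) * x_ij
    gradient = X.T @ (y - X @ theta)
    theta = theta + alpha * gradient

print(np.round(theta, 3))  # converges to [1.0, 1.0] for this data
```

Because J(θ) is convex and quadratic here, this iteration converges to the unique global optimum.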

3. Stochastic gradient descent

θj := θj + α · (y^(i) − h(x^(i))) · xj^(i),  using a single sample i per update

Features:

1. Select one sample point at random each time and update the parameters immediately.
2. The drop in the cost of a single sample point approximates the drop in the total cost.
3. It is sensitive to the step-size choice and may overshoot the minimum.
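The procedure above can be sketched as follows; the data, step size, and iteration count are illustrative, and the random seed is fixed only for reproducibility:

```python
import numpy as np

# Stochastic gradient descent (the LMS rule): each update uses one sample.
rng = np.random.default_rng(0)
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])          # first column is the intercept term x0 = 1
y = np.array([2.0, 3.0, 4.0])

theta = np.zeros(2)
alpha = 0.01
for _ in range(20000):
    i = rng.integers(len(y))             # pick one sample point at random
    error = y[i] - X[i] @ theta          # y(i) - h(x(i))
    theta = theta + alpha * error * X[i] # update immediately

print(np.round(theta, 2))
```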


4. Comparison of the methods

1. Batch gradient descent is a batch-update algorithm; stochastic gradient descent is an online algorithm.
2. The batch gradient method optimizes the empirical risk; the stochastic gradient method optimizes the generalization risk.
3. The batch gradient method may fall into a local optimum; the stochastic gradient method has a chance of finding the global optimum.
4. The batch gradient method is insensitive to the step size; the stochastic gradient method is sensitive to the step-size choice.
5. The gradient method is sensitive to the choice of the initial point (initial parameters).


5. Input preprocessing

A. Normalization

Normalize the input features to ensure they are on a similar scale; not all data needs to be normalized.

Reason: otherwise the gradient descent method may descend along a zigzag path, which hurts the convergence speed of the algorithm.

General practice:

Subtract the mean, then divide by the range (maximum minus minimum) or by the standard deviation.
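A minimal sketch of mean/standard-deviation normalization; the rows are made-up [living area, bedrooms] samples (the range max − min would be an alternative divisor):

```python
import numpy as np

# Illustrative raw features on very different scales
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])

mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_norm = (X - mu) / sigma   # subtract the mean, divide by the standard deviation

print(X_norm.mean(axis=0))  # approximately [0, 0]
print(X_norm.std(axis=0))   # approximately [1, 1]
```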

B. Selection of the step size

For gradient descent methods, note two questions:

1. "Debugging": how to make sure the gradient descent algorithm is executing correctly;
2. How to choose the correct step size (learning rate) α.

How to choose α — an empirical method: try values such as

..., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ...

In particular, for the stochastic gradient descent method, the step-size choice needs to meet two conditions:

① it guarantees that the algorithm converges;
② it preserves a chance of searching for the global optimal solution.
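The empirical method can be sketched as below: run a fixed number of batch-gradient iterations for each candidate α and compare the resulting cost. The data and the `final_cost` helper are made up for illustration:

```python
import numpy as np

# Illustrative data: intercept column plus one feature
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])

def final_cost(alpha, iters=100):
    """Cost J(theta) left after a fixed budget of batch updates."""
    theta = np.zeros(2)
    for _ in range(iters):
        theta = theta + alpha * (X.T @ (y - X @ theta))  # batch update
    residual = X @ theta - y
    return 0.5 * residual @ residual

for alpha in (0.001, 0.003, 0.01, 0.03, 0.1):
    # too small an alpha converges slowly; too large an alpha may diverge
    print(alpha, final_cost(alpha))
```

Scanning the printed costs shows which α makes J(θ) drop fastest within the budget.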

6. Normal equation

Apply the hypothesis function to every sample:

h(x^(i)) = θᵀx^(i), i = 1, ..., m

Stack the samples into the design matrix X (one row per x^(i)ᵀ) and the targets into the vector y = (y^(1), ..., y^(m))ᵀ.

The cost function can then be rewritten as:

J(θ) = (1/2)·(Xθ − y)ᵀ(Xθ − y)

This problem is equivalent to:

min_θ ‖Xθ − y‖²

That is, minimizing the Euclidean distance between the two vectors Xθ and y.

Geometric meaning: Xθ is the projection of y onto the column space of X.

Setting the gradient to zero yields the normal equation XᵀXθ = Xᵀy, whose solution is θ = (XᵀX)⁻¹Xᵀy.

We need to ensure that XᵀX is invertible (a sufficient condition for invertibility: the columns of X are linearly independent).

In retrospect, our earlier approach used iterative methods to drive down the value of the cost function, while the normal equation finds the minimizer in closed form. Either way, by iteration or by other means, the optimal solution can be obtained as long as the above conditions hold.

But real data are not always so ideal.

If XᵀX is not invertible, how do we solve the problem?

1. Use the pseudo-inverse (the statistical solution).
2. Remove redundant features (linearly correlated ones).
3. When there are too many features, e.g., m <= n (m is the number of samples, n is the number of features), remove some features.
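A minimal sketch of both routes in NumPy; the numbers are illustrative (an intercept column plus living area in thousands of square feet), and `np.linalg.pinv` provides the pseudo-inverse:

```python
import numpy as np

# Illustrative design matrix and targets
X = np.array([[1.0, 2.104],
              [1.0, 1.600],
              [1.0, 2.400]])
y = np.array([400.0, 330.0, 369.0])

theta = np.linalg.solve(X.T @ X, X.T @ y)  # normal equation: needs X^T X invertible
theta_pinv = np.linalg.pinv(X) @ y         # pseudo-inverse: also handles singular X^T X

print(np.round(theta, 3))
```

When XᵀX is invertible the two routes agree; the pseudo-inverse is the safe fallback when it is not.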

IV. Summary

1. Gradient descent

Needs an appropriately selected learning rate α;

Requires many rounds of iteration;

Works well even when n is very large (n is the number of features, i.e., the dimension).

2. Normal equation

No need to select α;

No iteration; solved in one step;

Needs to compute (XᵀX)⁻¹, whose time complexity is O(n³);

When n is very large it is very slow; consider dimensionality reduction.





