Learning with the Watermelon Book (2): More about Linear Regression

Source: Internet
Author: User

Previously, one method of solving LR (linear regression), the normal equation, was discussed. That is the method covered in the watermelon book; Andrew Ng's public course also presents another way of solving for the parameters: gradient descent, which includes batch gradient descent and stochastic gradient descent.

In addition, the open course covers the probabilistic interpretation of the cost function and locally weighted regression (LWR). In this blog post, I'll go through these three topics in more detail.

Note: the symbol θ used in the open course corresponds to the w used in the watermelon book.

1. Gradient Descent

Continuing from last time, assume we already have the cost function, which is a convex quadratic function:

J(θ) = (1/2) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²    (writing J(θ) in vector form is how the normal equation is derived and solved in closed form)
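As a minimal sketch of what this cost looks like in code (the toy data, parameter values, and the helper name cost are illustrative assumptions, not from the original post), with the hypothesis h_θ(x) = θᵀx from the previous post:

import numpy as np

# Illustrative toy data: m = 4 samples, each with a bias feature of 1 and one input feature.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([2.0, 3.9, 6.1, 8.0])
theta = np.array([0.5, 1.5])           # some current parameter guess

def cost(theta, X, y):
    """J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2."""
    residuals = X @ theta - y          # h_theta(x_i) - y_i for every sample
    return 0.5 * np.sum(residuals ** 2)

print(cost(theta, X, y))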

1.1. Gradient descent

To minimize the cost function, a search algorithm is needed: the gradient descent algorithm:

θ_j := θ_j − α · ∂J(θ)/∂θ_j,    where α is the learning rate

Substituting J(θ) for a single sample gives:

θ_j := θ_j − α · (h_θ(x^(i)) − y^(i)) · x_j^(i)

Note: 1. x^(i) denotes the input of the i-th sample; (x^(i), y^(i)) denotes one sample of the training set; θ_j denotes the j-th parameter (or weight); x_j^(i) denotes the j-th feature of the i-th sample.

2. The above is the LMS (least mean squares) update rule. The size of each update is proportional to the error h_θ(x^(i)) − y^(i).
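To make the rule concrete, here is a worked example with made-up numbers: take θ = (0, 0), one sample x^(i) = (1, 2) with y^(i) = 3, and α = 0.1. Then h_θ(x^(i)) = 0, the error is h_θ(x^(i)) − y^(i) = −3, and the update gives θ_0 := 0 − 0.1 · (−3) · 1 = 0.3 and θ_1 := 0 − 0.1 · (−3) · 2 = 0.6. A sample with a smaller error would move θ proportionally less.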

1.2. Batch gradient descent

What is batch gradient descent?

The update above is the gradient step for a single sample i; for the whole training set we should of course accumulate the gradient over all m samples. That is:

Repeat until convergence {
    θ_j := θ_j − (α/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x_j^(i)    (for every j)
}

Clearly, every single iteration sweeps over the entire training set (all m samples), and inside that there is a loop over every θ_j. When the sample count m is very large, this method is impractical. (Details of the complexity calculation to be filled in later.)
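A minimal sketch of this batch update in NumPy (the toy data, learning rate, and iteration count are illustrative assumptions):

import numpy as np

def batch_gradient_descent(X, y, alpha=0.05, n_iters=2000):
    """One update per pass over the WHOLE training set:
    theta_j := theta_j - (alpha/m) * sum_i (h_theta(x_i) - y_i) * x_ij
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        errors = X @ theta - y                  # shape (m,): h_theta(x_i) - y_i for every sample
        gradient = X.T @ errors                 # shape (n,): sum over all m samples
        theta -= (alpha / m) * gradient         # simultaneous update of every theta_j
    return theta

# Same illustrative data as before: y is roughly 2*x
X = np.column_stack([np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])])
y = np.array([2.0, 3.9, 6.1, 8.0])
print(batch_gradient_descent(X, y))             # approaches the least-squares fit, about [-0.05, 2.02]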

1.3. Stochastic gradient descent

To deal with the excessive computation, the alternative is stochastic gradient descent: draw a sample at random, update the parameters once by a gradient step, then draw another and update again. When the sample size is very large, we may obtain a model whose loss is within an acceptable range without ever training on all of the samples.

Repeat until convergence {
    for i = 1 to m {
        θ_j := θ_j − α · (h_θ(x^(i)) − y^(i)) · x_j^(i)    (for every j)
    }
}
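A corresponding sketch of stochastic gradient descent on the same illustrative data (the random seed, learning rate, and epoch count are assumptions for the example):

import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.05, n_epochs=300, seed=0):
    """Update theta after EACH sample instead of after a full pass:
    theta_j := theta_j - alpha * (h_theta(x_i) - y_i) * x_ij
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in rng.permutation(m):            # visit samples in random order
            error = X[i] @ theta - y[i]         # scalar: h_theta(x_i) - y_i
            theta -= alpha * error * X[i]       # update every theta_j at once
    return theta

X = np.column_stack([np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])])
y = np.array([2.0, 3.9, 6.1, 8.0])
print(stochastic_gradient_descent(X, y))        # should end up close to the batch/least-squares solution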

1.4. Locally weighted regression (LWR)

When an ordinary linear fit performs poorly, the problem is usually underfitting or overfitting. To get a better fit, we can use locally weighted regression (LWR).

The basic goal is still to minimize:

J(θ) = (1/2) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²

Suppose that continuing to use the normal equation, batch gradient descent, or stochastic gradient descent above gives a poor fit; then we add a weight factor:

Weight factor: w^(i) = exp(−(x^(i) − x)² / 2), where x is the query point

The reconstructed objective is: J(θ) = (1/2) Σ_{i=1}^{m} w^(i) (h_θ(x^(i)) − y^(i))²

Here exp is the exponential with base e. From the formula, when x^(i) is far from the query point x, w^(i) ≈ 0; when it is close, w^(i) ≈ 1. Whenever we want to predict at a new point, we must recompute the parameters θ for that point, build the local regression, and then compute the prediction.
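A minimal sketch of LWR on made-up data. The bandwidth parameter tau below is an addition for illustration (tau = 1 reproduces the unparameterized weight formula above), and a fresh θ is solved per query point via the weighted normal equation θ = (XᵀWX)⁻¹XᵀWy:

import numpy as np

def lwr_predict(x_query, X_raw, y, tau=1.0):
    """Locally weighted regression prediction at a single query point.

    Weights follow w_i = exp(-(x_i - x)^2 / (2 * tau^2)); tau = 1 matches the text.
    A new theta is solved for EVERY query point via the weighted normal equation.
    """
    w = np.exp(-(X_raw - x_query) ** 2 / (2.0 * tau ** 2))    # one weight per sample
    X = np.column_stack([np.ones_like(X_raw), X_raw])         # add a bias column
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)          # weighted least squares
    return np.array([1.0, x_query]) @ theta                    # prediction at x_query

# Illustrative non-linear data: y = sin(x) plus a little noise
rng = np.random.default_rng(0)
X_raw = np.linspace(0.0, 10.0, 50)
y = np.sin(X_raw) + 0.1 * rng.standard_normal(50)
print(lwr_predict(3.0, X_raw, y, tau=0.5))     # should be close to sin(3.0), about 0.14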

2. Probabilistic interpretation of the cost function J(θ)

Why is the least-squares cost function J in particular the one we choose for LR, and for what reason? The probabilistic interpretation answers this.

First, assume that for each sample (x^(i), y^(i)) of the training set we have

y^(i) = θᵀx^(i) + ε^(i)

The error term ε^(i) captures both the factors the model does not take into account and random noise.

We further assume that the error terms ε^(i) are independently and identically distributed (i.i.d.) random variables following a normal distribution:

ε^(i) ~ N(0, σ²)

(y^(i) | x^(i); θ) ~ N(θᵀx^(i), σ²): given the input variable x^(i) and the parameters θ, the output variable y^(i) follows a normal distribution.

Note: why do y^(i) and ε^(i) use Gaussian distributions here? Because it is mathematically convenient, and because, by the central limit theorem, the sum of many independent effects tends toward a normal distribution.

We derive the value of the parameter θ by maximum likelihood estimation:

L(θ) = Π_{i=1}^{m} p(y^(i) | x^(i); θ)    ⟹    ℓ(θ) = log L(θ) = −m · log(√(2π) σ) − (1/σ²) · (1/2) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²

Since the training set is fixed with m samples and the variance σ² is fixed, the first term of ℓ(θ), −m · log(√(2π) σ), is a constant. To maximize ℓ(θ) we therefore only need to minimize the second term, namely:

J(θ) = (1/2) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²; the value of θ that minimizes this is exactly the maximum-likelihood estimate.
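As a sanity-check sketch (the synthetic data, true parameters, and noise level are chosen purely for illustration), we can generate data exactly under the assumed model y^(i) = θᵀx^(i) + ε^(i) with Gaussian ε and confirm that minimizing J(θ), here via the normal equation from the previous post, recovers parameters close to the true ones; that minimizer is the maximum-likelihood estimate under the Gaussian-noise assumption:

import numpy as np

# Synthetic data generated exactly under the model assumption:
# y_i = theta^T x_i + eps_i, with eps_i ~ N(0, sigma^2)
rng = np.random.default_rng(42)
m = 200
true_theta = np.array([1.0, 2.0])                  # [intercept, slope], illustrative
sigma = 0.5

X = np.column_stack([np.ones(m), rng.uniform(0.0, 5.0, m)])
eps = rng.normal(0.0, sigma, m)
y = X @ true_theta + eps

# Minimizing J(theta) via the normal equation gives the maximum-likelihood estimate
# under the Gaussian-noise assumption.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)                                   # close to [1.0, 2.0]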

Note: 1. Central limit theorem: if random variables are independent and identically distributed, then in the limit their (suitably normalized) sum follows a normal distribution.

2. Maximum likelihood estimation: using the known sample results, infer the parameter value that makes those results most likely (of maximum probability).
