Machine Learning Notes (III): Multivariable Linear Regression


Note: This material comes from Andrew Ng's machine learning course on Coursera; credit and thanks go to Andrew Ng.

I. Multiple Features

The housing price problem discussed in Notes (II) considered only a single feature: the size of the house.

Data with a single feature is often not enough to predict the price trend accurately, so prediction can usually be improved by taking several features into account. For example, select the following four features as inputs:
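In Ng's course, these four features are the size of the house (in square feet), the number of bedrooms, the number of floors, and the age of the home, and the output y is the price of the house.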

Explanation of some of the concepts:

    • n: number of features

    • x^(i): the input (all the features) of the i-th training sample

    • y: the output variable / target variable

    • xj^(i): the value of feature j in the i-th training sample

With multiple features, the hypothesis function has to be updated to include all of the inputs. For the four features above, the hypothesis function is as follows:
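Written out in the notation above, the four-feature hypothesis is the standard linear form:

hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4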

Extending this pattern gives the hypothesis function for n features. For convenience, we define x0 = 1 and use column vectors to represent the parameters θ and the input x.

The hypothesis function hθ(x) can then be written as:
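With x0 = 1, θ = [θ0, θ1, …, θn]ᵀ and x = [x0, x1, …, xn]ᵀ, the hypothesis reduces to a single inner product:

hθ(x) = θ0x0 + θ1x1 + … + θnxn = θᵀx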

A linear regression problem with multiple features is called a multivariable (multivariate) linear regression problem.

II. Gradient Descent for Multiple Variables

Gradient descent for multivariable linear regression is similar to the univariate case; since the number of features grows from 1 to n, each update simply requires more computation. The comparison is as follows:
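In the multivariate case the update rule from the course is, repeated until convergence and applied simultaneously for every j = 0, 1, …, n (with x0^(i) = 1, and m the number of training samples):

θj := θj − α · (1/m) · Σi=1..m ( hθ(x^(i)) − y^(i) ) · xj^(i)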

III. Feature Scaling

With multiple features, the value ranges of the features can differ greatly. For example, the size of a house is usually in the thousands (of square feet), while the number of bedrooms is a single digit. Plotted as-is, most of the data is squeezed into a narrow band; this extreme unevenness makes gradient descent converge more slowly and the features hard to compare. Feature scaling is therefore used to bring all of the features into a similar range.

On the left, without feature scaling, the contours of the cost function are elongated and the convergence rate is slow. On the right, the house size is divided by 2000 and the number of bedrooms by 5, so both features fall into the range 0 to 1; the contours become much more even and gradient descent converges faster.

Feature Scaling

Convert each feature value into the same fixed range (typically −1 <= xi <= 1).

Another common scaling method is mean normalization. The conversion for mean normalization is as follows:
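Using the quantities defined just below, the conversion is:

xi := (xi − μi) / si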

Where:

    • μi: the mean value of feature xi

    • si: the range of feature xi (max − min)

For example, in the housing problem x1 (size) and x2 (number of bedrooms) are converted as follows:
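Assuming, as in the course example, an average size of roughly 1000 square feet and an average of roughly 2 bedrooms (these means are taken from the lecture and are only illustrative):

x1 := (size − 1000) / 2000
x2 := (#bedrooms − 2) / 5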

Mean Normalization

Using each feature's mean and range, the features are normalized to roughly the range −0.5 to 0.5.
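A minimal NumPy sketch of mean normalization, assuming the inputs are stored as an (m, n) matrix with one training sample per row; the sample values below are made up for illustration:

```python
import numpy as np

def mean_normalize(X):
    """Scale each feature (column) of X to roughly the range -0.5 ~ 0.5."""
    mu = X.mean(axis=0)                  # mean value of each feature (mu_i)
    s = X.max(axis=0) - X.min(axis=0)    # range of each feature (s_i = max - min)
    return (X - mu) / s, mu, s           # keep mu and s to scale new inputs later

# Illustrative data: house size (sq. ft.) and number of bedrooms.
X = np.array([[2104, 3],
              [1600, 3],
              [2400, 4],
              [1416, 2]], dtype=float)
X_norm, mu, s = mean_normalize(X)
print(X_norm)  # each column now lies roughly within -0.5 ~ 0.5
```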
IV. Learning Rate

This section describes how to confirm that gradient descent is working properly and how to choose the learning rate α.

First, how to confirm that gradient descent is working properly. Our goal is to minimize J(θ), so we expect it to decrease on every iteration until it finally converges; plotting J(θ) against the number of iterations makes this easy to check.

A simple automatic convergence test is to declare convergence when the decrease in J(θ) over one iteration is smaller than some small value ε (for example, 10⁻³).

How to choose α was explained in the earlier notes. If α is too small, convergence is slow; if α is too large, J(θ) may not decrease on every iteration and may fail to converge at all. In practice, α is usually chosen from smaller values such as 0.001, 0.01, 0.1, and 1.
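A minimal NumPy sketch of the whole loop, combining the multivariate update rule, the learning rate α, and the ε convergence test described above; the function name and default values are illustrative, and X is assumed to already contain the x0 = 1 column:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, epsilon=1e-3, max_iters=10000):
    """Batch gradient descent for multivariate linear regression.

    X: (m, n+1) design matrix whose first column is all ones (x0 = 1).
    y: (m,) vector of target values.
    Returns theta and the history of J(theta), useful for plotting
    the cost against the iteration number.
    """
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    J_history = []
    for _ in range(max_iters):
        errors = X @ theta - y                   # h_theta(x^(i)) - y^(i) for all samples
        J = (errors @ errors) / (2 * m)          # current cost J(theta)
        if J_history and J_history[-1] - J < epsilon:
            break                                # decrease smaller than epsilon: converged
        J_history.append(J)
        theta -= alpha * (X.T @ errors) / m      # simultaneous update of every theta_j
    return theta, J_history
```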

V. Features and Polynomial Regression

We have now covered multivariable linear regression. This section discusses the choice of features and how good features lead to effective learning algorithms, and then introduces polynomial regression, which makes it possible to fit very complex, even nonlinear, functions with the machinery of linear regression.

Take predicting house prices as an example. Suppose there are two features, the frontage width of the lot (frontage) and its depth (depth); the hypothesis function is then:
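hθ(x) = θ0 + θ1 · frontage + θ2 · depth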

Of course, we can also define features differently. For example, we can create an area feature equal to the product of frontage and depth, and the hypothesis function then simplifies as shown below.
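Defining the area feature x = frontage × depth, the hypothesis becomes:

hθ(x) = θ0 + θ1 · x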

Looking carefully at the training samples of the housing price problem, a curve fits the data better than a straight line. This motivates polynomial regression, which replaces the original linear hypothesis with a polynomial hypothesis function:
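For example, with the single size feature x, a quadratic or cubic hypothesis is built from powers of x:

hθ(x) = θ0 + θ1·x + θ2·x²        (quadratic)
hθ(x) = θ0 + θ1·x + θ2·x² + θ3·x³   (cubic)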

If a quadratic (second-order) polynomial is chosen, it matches the sample data better, but as the house size keeps growing the predicted price eventually turns downward, which is clearly inconsistent with reality. A cubic (third-order) polynomial may therefore be a better choice. Note that in this problem we used only one feature, the size of the house; the better fit comes from the more complex curve rather than from additional features.

Of course, there are many possible choices of function. We can also use a square-root term in the hypothesis function, which behaves more realistically (the price keeps rising but levels off):
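hθ(x) = θ0 + θ1·x + θ2·√x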

VI. The Normal Equation

For some linear regression problems, the normal equation is a better way to solve for the optimal value of the parameter θ.

With the gradient descent method we have been using, J(θ) needs many iterations to converge to its minimum.

The normal equation method instead provides an analytic solution for θ: it solves for the optimal value directly, in a single step.

The key idea of the normal equation method is to differentiate J(θ) and set the derivative with respect to each parameter to zero; the point where all of these derivatives are zero is the minimum, which gives the optimal θ, as shown below:
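∂J(θ)/∂θj = 0,   for every j = 0, 1, …, n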

The resulting θ can be written compactly using matrix notation:
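θ = (XᵀX)⁻¹ Xᵀy

where X is the m × (n+1) design matrix whose rows are the training inputs (with x0 = 1) and y is the m-dimensional vector of target values.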

In Matlab, the optimal θ can be computed quickly in a single expression; a NumPy sketch of the same computation is given below.
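In the course this is the one-line Matlab/Octave expression pinv(X'*X)*X'*y. A NumPy equivalent, sketched here under the same assumptions about X and y (the function name is illustrative):

```python
import numpy as np

def normal_equation(X, y):
    """Solve theta = (X^T X)^(-1) X^T y directly.

    X: (m, n+1) design matrix with a leading column of ones (x0 = 1).
    y: (m,) vector of targets.
    pinv (the pseudo-inverse) is used so the computation still works
    even when X^T X happens to be non-invertible.
    """
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```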

Comparing gradient descent with the normal equation, each has its own advantages and disadvantages.

Gradient descent requires manually choosing the learning rate α and needs many iterations to reach the optimal solution, whereas the normal equation needs neither a learning rate nor any iteration and solves for θ directly. However, although the matrix expression for θ is simple, evaluating it is expensive: inverting XᵀX takes on the order of n³ operations. When the number of features n is relatively small, the normal equation is the more convenient choice; when n is large, the inversion becomes very slow and gradient descent is the better option.

