Introduction to Machine Learning (11) -- Gradient Descent with Multiple Features



Definition

Multivariate hypothesis: the output is determined by a multidimensional input; that is, the input is a multidimensional feature vector. For the earlier housing-price prediction example, suppose the input grows from a single variable (house area) by three more variables: the number of bedrooms, the number of floors, and the age of the house. The dataset is shown below, with price as the output and the first four columns as the input:


Multi-feature vectors

Notation:

n = number of features

x^(i) = the input features of the i-th training example, which can be viewed as a feature vector

x_j^(i) = the value of feature j in the i-th training example, i.e., the j-th entry of that feature vector

Suppose h(x) = θ0 + θ1x1 + ... In so-called multivariate linear regression, each input x is an (n+1)-dimensional vector [x0, ..., xn]. The linear hypothesis model with multiple features is:

hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4

For convenience, define x0 = 1; then multivariate linear regression can be written as:

hθ(x) = θ^T x

where θ = [θ0, θ1, ..., θn]^T and x = [x0, x1, ..., xn]^T are both (n+1)-dimensional vectors.
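As a quick sketch of the vectorized hypothesis θ^T x, using made-up values for θ and one training example's features (none of these numbers come from the original dataset):

```python
import numpy as np

# Hypothetical parameter vector theta_0 .. theta_4 and one example's features x_1 .. x_4
theta = np.array([50.0, 0.1, 20.0, 5.0, -2.0])
features = np.array([2104.0, 5.0, 1.0, 45.0])

x = np.concatenate(([1.0], features))  # prepend x_0 = 1
h = theta @ x                          # h_theta(x) = theta^T x, a single scalar prediction
```

Prepending the constant 1 is what lets the intercept θ0 be folded into the same dot product as the other parameters.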


Gradient descent for multiple features

The gradient descent algorithm extends to multiple variables as follows.

For the hypothesis: hθ(x) = θ^T x = θ0 + θ1x1 + θ2x2 + ... + θnxn

with parameters θ0, θ1, ..., θn, which can be represented as an (n+1)-dimensional vector θ.

The cost function is:

J(θ) = (1/(2m)) * Σ_{i=1..m} ( hθ(x^(i)) − y^(i) )²
The gradient descent update rule then becomes (repeat until convergence, updating all θj simultaneously):

θj := θj − α * (1/m) * Σ_{i=1..m} ( hθ(x^(i)) − y^(i) ) * x_j^(i),   for j = 0, 1, ..., n
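The batch update rule can be sketched in Python with NumPy. The toy data, learning rate, and iteration count below are illustrative, not taken from the original example:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iters=1000):
    """Batch gradient descent for linear regression.
    X is the (m, n+1) design matrix whose first column is all ones (x_0 = 1)."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        error = X @ theta - y                # h_theta(x^(i)) - y^(i) for every example
        theta -= alpha * (X.T @ error) / m   # simultaneous update of every theta_j
    return theta

# Toy data generated from y = 2 + 3*x1, so theta should approach [2, 3]
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 5.0, 8.0, 11.0])
theta = gradient_descent(X, y, alpha=0.1, iters=2000)
```

Note that the matrix form `X.T @ error` computes the sum over all m examples for every feature j at once, which is why no explicit loop over j is needed.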
Feature scaling for multiple features

If a machine learning problem has multiple features, and you can ensure that those features take values in similar ranges, gradient descent will converge faster. Otherwise it will not only converge slowly but may also fluctuate or even oscillate. This is what the feature scaling technique addresses.

Idea: standardize the values of each feature so that they roughly fall in the range −1 ≤ x ≤ 1.

Method one: simple normalization, i.e., divide each value by the maximum value of its feature. Then:

For example, in the housing-price problem: feature 1 is the size of the house (0-2000); feature 2 is the number of rooms (1-5).

The contours are as follows:


After simple normalization,


its contours become:


Method two: mean normalization. The basic idea is to replace xi with xi − μi so that the feature's mean is approximately 0 (this is not applied to x0 = 1). The mean-normalization formula is:

xi := (xi − μi) / si
where:

si can be the range of values (maximum − minimum) or the standard deviation;

μi is the mean of feature xi over the training set.
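Mean normalization can be sketched as follows; the housing-like numbers are made up for illustration, and the range (max − min) is used for si:

```python
import numpy as np

def mean_normalize(X):
    """Scale each feature column to roughly [-1, 1] via (x - mean) / range.
    Using the standard deviation for s is an equally common choice."""
    mu = X.mean(axis=0)                  # mu_i: per-feature mean
    s = X.max(axis=0) - X.min(axis=0)    # s_i: per-feature range (max - min)
    return (X - mu) / s, mu, s

# Columns: house size, number of rooms (illustrative values)
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])
X_norm, mu, s = mean_normalize(X)
```

After this transformation every column has mean 0 and its values fit comfortably inside [−1, 1], so the cost function's contours become much closer to circular and gradient descent takes a more direct path to the minimum.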


The choice of learning rate

Two points to note for the gradient descent algorithm:

- "Debugging": how to ensure that gradient descent is executing correctly;

- how to choose a correct step size (learning rate) α.

The second point is important, and it is also key to ensuring the convergence of gradient descent. To verify that gradient descent is working correctly, check that J(θ) decreases on every iteration; if one iteration decreases J(θ) by less than some small value ϵ, the algorithm is considered to have converged. For example:


In the segment of the curve between iteration 300 and iteration 400, J(θ) no longer falls by much; by iteration 400 the curve looks nearly flat, which means gradient descent has essentially converged by then, because the cost function is no longer decreasing. A curve like this helps you determine whether gradient descent has converged.

In other words, if the cost function J(θ) decreases by less than some small value ε in an iteration, it is considered to have converged; however, choosing a suitable threshold ε is usually quite difficult. In practice, one tries a series of α values when running gradient descent, for example 0.001, 0.01, and so on, taking roughly a 10x step between values, plots J(θ) against the number of iterations for each α, and then picks an α value that makes J(θ) descend quickly.
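The trial-of-α procedure above can be sketched like this; the toy data, the ε threshold, and the fixed 400-iteration budget are all illustrative choices, not prescribed by the original text:

```python
import numpy as np

def cost(X, y, theta):
    """J(theta) = (1/2m) * sum of squared errors."""
    m = len(y)
    return ((X @ theta - y) ** 2).sum() / (2 * m)

def run(X, y, alpha, iters=400):
    """Run gradient descent and record J(theta) after every iteration."""
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(iters):
        theta -= alpha * (X.T @ (X @ theta - y)) / len(y)
        history.append(cost(X, y, theta))
    return history

# Toy data generated from y = 2 + 2*x1
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0]])
y = np.array([3.0, 4.0, 5.0, 6.0])

for alpha in (0.001, 0.01, 0.1):            # candidate rates roughly 10x apart
    J = run(X, y, alpha)
    converged = abs(J[-2] - J[-1]) < 1e-6   # epsilon-style convergence check
    print(f"alpha={alpha}: final J={J[-1]:.6f}, converged={converged}")
```

Plotting each `J` history against the iteration count (e.g. with matplotlib) reproduces the curves described above: too small an α gives a slowly sinking curve, while a well-chosen α drives J(θ) down quickly and then flattens out.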
