Supervised machine learning-Regression


I. Introduction

  This document is based on Andrew Ng's machine learning course (http://cs229.stanford.edu) and the Stanford UFLDL tutorial on unsupervised learning (http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial).

Regression problems in machine learning belong to supervised learning. In a regression problem we are given D-dimensional input variables X, each with a corresponding value Y, and we are asked to predict the continuous target value for new data. For example, suppose we have a dataset containing the area and price of 47 houses:

 

We can plot this dataset in MATLAB, as shown below:

Don't the points look roughly like a straight line? We can fit a curve to these data points as closely as possible; then, for a new input, we can read off the corresponding point on the fitted curve and thereby make a prediction. If the value to be predicted is continuous, such as the house price above, it is a regression problem; if the value to be predicted is discrete, i.e. a label, it is a classification problem. The learning process is shown below:

Common terms in the learning process above: the dataset containing house areas and prices is called the training set; the input variables X (the area in this example) are the features; the output value Y to be predicted (the house price in this example) is the target; the fitted function, usually written y = h(x), is called the hypothesis; and the number of entries in the training set, 47 in this example, is the number of training samples.
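To make these terms concrete, here is a minimal Python sketch (an illustration only: it uses NumPy and synthetic house data, since the original 47-row table is not reproduced here) that builds a training set, fits a straight-line hypothesis y = h(x), and predicts the price of a new house:

```python
import numpy as np

# Synthetic training set: feature x = house area, target y = price.
# The numbers are made up for illustration; the article's dataset had 47 real houses.
rng = np.random.default_rng(0)
area = rng.uniform(50, 250, size=47)              # features
price = 70 + 1.1 * area + rng.normal(0, 20, 47)   # targets with some noise

# Fit the hypothesis h(x) = theta0 + theta1 * x by least squares.
theta1, theta0 = np.polyfit(area, price, deg=1)   # returns [slope, intercept]
print(f"h(x) = {theta0:.2f} + {theta1:.3f} * x")

# Predict the price of a new house with area 120.
print("predicted price for area 120:", theta0 + theta1 * 120)
```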

 

II. Linear Regression Model

The linear regression model assumes that the input features and the corresponding results satisfy a linear relationship. If we add one more dimension to the preceding dataset, the number of rooms, the dataset becomes:

Now the input features X are two-dimensional vectors; for example, x_1^{(i)} denotes the area of the i-th house in the dataset, and x_2^{(i)} denotes its number of rooms. We can therefore assume that the input features X and the house price y satisfy a linear function, for example:

h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2

Here the θ_i are the parameters of the hypothesis, i.e. of the linear function h that maps the input features X to the result y. To simplify the notation, we add x_0 = 1 to the input features and get:

h_\theta(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x

Both θ and the input features x are vectors here, and n is the number of input features (excluding x_0).

Now, how can we learn the parameters θ from a given training set so as to fit it well? An intuitive idea is to make the predicted value h(x) as close as possible to y. To this end, we define a cost function that measures the closeness between h(x^{(i)}) and the corresponding y^{(i)}:

J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

The factor 1/2 above is there so that the constant coefficient cancels when we differentiate. Our goal, then, is to adjust θ so that the cost function J(θ) reaches its minimum; the methods for doing so include gradient descent and least squares.
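As a quick illustration, here is a small Python helper (a sketch, assuming the design matrix X already contains the added column x_0 = 1) that evaluates this cost function:

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2, with h_theta(x) = theta^T x."""
    residual = X @ theta - y
    return 0.5 * residual @ residual

# Tiny example: three samples, x0 = 1 plus one feature.
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]])
y = np.array([5.0, 7.0, 11.0])
print(cost(np.array([1.0, 2.0]), X, y))  # 0.0, since y = 1 + 2*x exactly
```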

 

  2.1 Gradient Descent Method

Now we need to adjust θ so that J(θ) attains its minimum. To achieve this, we can pick a random initial value for θ (the purpose of random initialization is to break symmetry), and then iteratively change θ so as to reduce J(θ), until it converges to a value of θ that minimizes J(θ). Gradient descent uses the following idea: start from a random initial value θ^0, then repeat the update

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)

until convergence. Here α is called the learning rate.

  The update direction is given by the partial derivative of J(θ) with respect to θ; since we want a minimum, we move along the negative of the gradient. Substituting J(θ) for a single training sample gives

\frac{\partial}{\partial \theta_j} J(\theta) = \left( h_\theta(x) - y \right) x_j

and hence the overall update formula

\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}

Such an update rule is called the LMS update rule (least mean squares), and is also known as the Widrow-Hoff learning rule.

The algorithm that updates the parameters using all training samples is then:

Repeat until convergence: \theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)} \quad (\text{for every } j)

Because every iteration examines all samples in the training set, this is called batch gradient descent. Running this algorithm on the house price dataset from the introduction gives θ_0 = 71.27 and θ_1 = 1.1345, with the following fitted line:
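A minimal batch gradient descent sketch in Python (the data, learning rate, and iteration count are illustrative assumptions; the θ_0 = 71.27, θ_1 = 1.1345 above came from the original house dataset, which is not reproduced here):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, iters=5000):
    """Repeat: theta_j := theta_j + alpha * sum_i (y_i - h(x_i)) * x_ij, for every j."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta += alpha * (y - X @ theta) @ X   # uses every training sample per step
    return theta

# Synthetic example; x0 = 1 is already prepended as the first column.
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0]])
y = np.array([2.0, 3.1, 3.9, 5.2])
print(batch_gradient_descent(X, y))   # approaches the least-squares solution
```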

If, instead, the parameters are updated as follows:

For i = 1 to m: \theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)} \quad (\text{for every } j)

then θ is updated from a single training sample at a time; this is called stochastic gradient descent. Comparing the two: batch gradient descent considers the whole dataset at every step, so each iteration is relatively expensive, whereas stochastic gradient descent converges quickly; moreover, in practice the minima of J(θ) found by the two methods are both generally close to the true minimum. Therefore, for large datasets, the more efficient stochastic gradient descent is generally used.
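For comparison, a sketch of the stochastic variant, which updates θ from one sample at a time (the shuffling, learning rate, and epoch count are illustrative choices):

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, epochs=200, seed=0):
    """theta_j := theta_j + alpha * (y_i - h(x_i)) * x_ij, one sample i per update."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):   # visit the samples in random order
            theta += alpha * (y[i] - X[i] @ theta) * X[i]
    return theta
```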

  2.2 Least Squares

The gradient descent algorithm gives one way of computing θ, but the iterative process is time-consuming and not very intuitive. The least squares method described below is a direct algorithm that obtains θ through matrix operations. To understand it, first review some matrix calculus:

Suppose the function f maps an m×n matrix A to a real number. The gradient of f(A) with respect to the matrix A is defined as

\nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}

The gradient is therefore itself an m×n matrix. For example, for the 2×2 matrix A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} and the mapping f(A) = 1.5 A_{11} + 5 A_{12}^2 + A_{21} A_{22}, the gradient is

\nabla_A f(A) = \begin{bmatrix} 1.5 & 10 A_{12} \\ A_{22} & A_{21} \end{bmatrix}

In addition, the following rules hold for gradients of matrix traces:

\operatorname{tr}(AB) = \operatorname{tr}(BA), \quad \operatorname{tr}(ABC) = \operatorname{tr}(CAB) = \operatorname{tr}(BCA)

\nabla_A \operatorname{tr}(AB) = B^T, \quad \nabla_{A^T} f(A) = \left( \nabla_A f(A) \right)^T, \quad \nabla_A \operatorname{tr}(A B A^T C) = CAB + C^T A B^T

Next, we write the input features X and the corresponding results y of the training set as a matrix and a vector:

X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}, \qquad \vec{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}

Since the prediction model satisfies h_\theta(x^{(i)}) = (x^{(i)})^T \theta, we get

X\theta - \vec{y} = \begin{bmatrix} h_\theta(x^{(1)}) - y^{(1)} \\ \vdots \\ h_\theta(x^{(m)}) - y^{(m)} \end{bmatrix}, \qquad J(\theta) = \frac{1}{2} (X\theta - \vec{y})^T (X\theta - \vec{y})

Having written the cost function J(θ) in matrix form, we can use the matrix operations above to obtain its gradient:

\nabla_\theta J(\theta) = X^T X \theta - X^T \vec{y}

Setting this gradient to zero yields the normal equation X^T X \theta = X^T \vec{y}, from which we obtain the value of θ:

\theta = (X^T X)^{-1} X^T \vec{y}

This is the parameter of the hypothesis model obtained by the least squares method.
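The closed-form solution is easy to compute directly. A minimal NumPy sketch (it solves the normal equation XᵀXθ = Xᵀy as a linear system rather than forming the inverse explicitly, which is the usual numerically safer choice):

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^{-1} X^T y, computed by solving X^T X theta = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0]])  # first column is x0 = 1
y = np.array([2.0, 3.1, 3.9, 5.2])
print(normal_equation(X, y))   # the same solution gradient descent converges to
```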

  2.3 Weighted Linear Regression

First, consider the following curve-fitting situations:

The leftmost plot uses a linear fit, but the data points do not lie exactly on a straight line, so the fit is not very good. If we add an x² term, we obtain a quadratic fit; as the middle plot shows, this curve fits the data points better. If we keep adding higher-order terms, we obtain the fit shown in the rightmost plot, which passes through every data point. The rightmost curve is a fifth-order polynomial, yet we know perfectly well that this curve is too perfect: it may not predict new data well at all. We call the leftmost curve underfitting (too small a feature set makes the model too simple to capture the structure of the data) and the rightmost curve overfitting (too large a feature set makes the model too complex).

As the example above shows, the choice of features has a great impact on the performance of the final learned model; the question of which features to use, and how much weight each training point should carry, leads to weighted linear regression. In traditional linear regression, the learning process is:

\text{Fit } \theta \text{ to minimize } \sum_i \left( y^{(i)} - \theta^T x^{(i)} \right)^2, \text{ then output } \theta^T x

The learning process of weighted linear regression is:

\text{Fit } \theta \text{ to minimize } \sum_i w^{(i)} \left( y^{(i)} - \theta^T x^{(i)} \right)^2, \text{ then output } \theta^T x

The difference between the two is that different non-negative weights are assigned to different training samples: the larger a sample's weight, the greater its influence on the cost function. The weights are typically computed as

w^{(i)} = \exp\left( -\frac{(x^{(i)} - x)^2}{2\tau^2} \right)

where x is the point at which we want to make a prediction: the closer a training sample x^{(i)} is to x, the larger its weight.
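A sketch of one weighted prediction at a query point x (the bandwidth τ and the data are illustrative; the weighted normal equation θ = (XᵀWX)⁻¹XᵀWy used here is the standard closed form for this weighted least-squares problem):

```python
import numpy as np

def locally_weighted_predict(x_query, X, y, tau=1.0):
    """Predict at x_query with weights w_i = exp(-||x_i - x||^2 / (2 tau^2))."""
    # X has x0 = 1 in its first column; distances use the remaining feature columns.
    d2 = np.sum((X[:, 1:] - x_query[1:]) ** 2, axis=1)
    w = np.exp(-d2 / (2 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # weighted normal equation
    return x_query @ theta

X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0]])
y = np.array([2.0, 3.1, 3.9, 5.2])
print(locally_weighted_predict(np.array([1.0, 1.2]), X, y, tau=0.5))
```

Note that a new θ is fit for every query point, which is why this scheme is also described as locally weighted regression.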

 

III. Logistic Regression and Softmax Regression

  3.1 Logistic Regression

  The following describes logistic regression. Although it is called regression, logistic regression is used for classification. It is essentially a linear regression model with a function mapping applied to the continuous regression value: the features are combined linearly and the result is mapped by g(z), which maps the continuous value to the discrete values 0/1 (for the sigmoid function; for the hyperbolic tangent tanh, to 1/-1). The hypothesis model is

h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}

where the sigmoid function g(z) is

g(z) = \frac{1}{1 + e^{-z}}

As z approaches -∞, g(z) approaches 0; as z approaches +∞, g(z) approaches 1, which realizes the 0/1 classification. A useful property of the sigmoid is its derivative, g'(z) = g(z)\left(1 - g(z)\right).

How do we adjust the parameters θ for such a logistic model? Let us assume that

P(y = 1 \mid x; \theta) = h_\theta(x), \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x)

Since this is a two-class problem, the two cases can be written compactly as p(y \mid x; \theta) = \left( h_\theta(x) \right)^{y} \left( 1 - h_\theta(x) \right)^{1 - y}, so the likelihood is

L(\theta) = \prod_{i=1}^{m} p\left( y^{(i)} \mid x^{(i)}; \theta \right) = \prod_{i=1}^{m} \left( h_\theta(x^{(i)}) \right)^{y^{(i)}} \left( 1 - h_\theta(x^{(i)}) \right)^{1 - y^{(i)}}

Taking the logarithm of the likelihood makes it easier to work with:

\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left( 1 - y^{(i)} \right) \log\left( 1 - h_\theta(x^{(i)}) \right) \right]

The next step is to maximize the likelihood with respect to θ. Analogous to the gradient descent method described above, we use gradient ascent:

\theta := \theta + \alpha \nabla_\theta \ell(\theta)

Differentiating ℓ(θ) for a single sample gives an update formula similar to LMS:

\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}

Although this update rule looks the same as the LMS formula, the two algorithms are different, because here h_\theta(x^{(i)}) is a nonlinear function of \theta^T x^{(i)}.
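A minimal sketch of logistic regression trained by gradient ascent on the log-likelihood (the synthetic two-class data, learning rate, and iteration count are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_fit(X, y, alpha=0.1, iters=2000):
    """Gradient ascent on l(theta): theta_j += alpha * sum_i (y_i - h(x_i)) * x_ij."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta += alpha * (y - sigmoid(X @ theta)) @ X
    return theta

# Synthetic data: the label is 1 when the second feature exceeds 1.0.
X = np.array([[1.0, 0.2], [1.0, 0.8], [1.0, 1.2], [1.0, 1.9]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = logistic_fit(X, y)
print(sigmoid(X @ theta))   # predicted probabilities for the training samples
```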

  3.2 Softmax Regression

Logistic regression is an algorithm for two-class problems. What if the target can take several discrete values? The softmax regression model solves this problem: it is the generalization of the logistic model to multi-class classification. In softmax regression, the class label y can take k different values (k > 2), so y^{(i)} \in \{1, 2, \ldots, k\}.

For a given input x, we use the hypothesis model to estimate the probability p(y = j \mid x) for each class j. The hypothesis h_\theta(x^{(i)}) takes the form

h_\theta(x^{(i)}) = \begin{bmatrix} p(y^{(i)} = 1 \mid x^{(i)}; \theta) \\ p(y^{(i)} = 2 \mid x^{(i)}; \theta) \\ \vdots \\ p(y^{(i)} = k \mid x^{(i)}; \theta) \end{bmatrix} = \frac{1}{\sum_{j=1}^{k} e^{\theta_j^T x^{(i)}}} \begin{bmatrix} e^{\theta_1^T x^{(i)}} \\ e^{\theta_2^T x^{(i)}} \\ \vdots \\ e^{\theta_k^T x^{(i)}} \end{bmatrix}

where θ_1, θ_2, ..., θ_k are the model parameters, and the coefficient on the right normalizes the distribution so that the probabilities sum to 1. Generalizing the logistic cost function, the new cost function is

J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\{ y^{(i)} = j \} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}} \right]

The softmax cost function is very similar to the logistic cost function, except that it sums over the k possible classes. In softmax regression, the probability of assigning x to class j is

p\left( y^{(i)} = j \mid x^{(i)}; \theta \right) = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}}

Gradient descent is used to minimize the softmax cost function; the gradient is

\nabla_{\theta_j} J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ x^{(i)} \left( 1\{ y^{(i)} = j \} - p\left( y^{(i)} = j \mid x^{(i)}; \theta \right) \right) \right]

which is the partial derivative of J(θ) with respect to θ_j. Each iteration then updates

\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta) \quad (j = 1, \ldots, k)
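A NumPy sketch of the softmax probabilities and of this gradient (Theta holds one parameter vector θ_j per row; subtracting the row-wise maximum is only for numerical stability and does not change the probabilities):

```python
import numpy as np

def softmax_probs(Theta, X):
    """Row i holds p(y_i = j | x_i; theta) for j = 1..k."""
    scores = X @ Theta.T                             # (m, k) matrix of theta_j^T x_i
    scores -= scores.max(axis=1, keepdims=True)      # stability; probabilities unchanged
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def softmax_grad(Theta, X, y):
    """grad wrt theta_j: -(1/m) * sum_i x_i * (1{y_i = j} - p(y_i = j | x_i))."""
    m, k = X.shape[0], Theta.shape[0]
    P = softmax_probs(Theta, X)
    Y = np.eye(k)[y]                                 # one-hot encoding of 1{y_i = j}
    return -(Y - P).T @ X / m                        # one row per theta_j

# y must contain integer class indices 0..k-1 for the one-hot encoding above.
```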

  3.3 Softmax Regression vs. Logistic Regression

In particular, when k = 2 softmax regression reduces to logistic regression. With k = 2, the softmax hypothesis model is

h_\theta(x) = \frac{1}{e^{\theta_1^T x} + e^{\theta_2^T x}} \begin{bmatrix} e^{\theta_1^T x} \\ e^{\theta_2^T x} \end{bmatrix}

Exploiting the fact that softmax regression is over-parameterized, we subtract θ_1 from both parameter vectors and obtain

h_\theta(x) = \begin{bmatrix} \frac{1}{1 + e^{(\theta_2 - \theta_1)^T x}} \\ \frac{e^{(\theta_2 - \theta_1)^T x}}{1 + e^{(\theta_2 - \theta_1)^T x}} \end{bmatrix} = \begin{bmatrix} \frac{1}{1 + e^{\theta'^T x}} \\ 1 - \frac{1}{1 + e^{\theta'^T x}} \end{bmatrix}, \qquad \theta' = \theta_2 - \theta_1

Therefore, the probability form of the softmax regression prediction is consistent with that of the logistic regression.
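A quick numeric check of this reduction (the vectors are arbitrary illustrative numbers): with two classes, the softmax probability of class 1 equals the sigmoid of (θ_1 - θ_2)ᵀx:

```python
import numpy as np

x = np.array([1.0, 0.7, -0.3])
theta1 = np.array([0.5, 1.2, -0.8])
theta2 = np.array([-0.4, 0.3, 0.6])

p1_softmax = np.exp(theta1 @ x) / (np.exp(theta1 @ x) + np.exp(theta2 @ x))
p1_logistic = 1.0 / (1.0 + np.exp(-(theta1 - theta2) @ x))
print(np.isclose(p1_softmax, p1_logistic))   # True
```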

Now, if there is a K-class Classification task, we can select softmax regression or k independent logistic regression classifiers. How should we choose?

This choice depends on whether the k categories are mutually exclusive. For example, if the four movie categories are Hollywood movies, Hong Kong and Taiwan movies, Japanese and Korean movies, and mainland movies, and every training sample must be labeled with exactly one of them, then softmax regression with k = 4 is the right choice. However, if the four movie categories are action, comedy, romance, and European/American, these categories are not mutually exclusive, so it is more reasonable to train four separate logistic regression classifiers.

 

IV. Generalized Linear Model

First, define the general exponential family of probability distributions:

p(y; \eta) = b(y) \exp\left( \eta^T T(y) - a(\eta) \right)

Consider first the Bernoulli distribution:

p(y; \phi) = \phi^{y} (1 - \phi)^{1 - y} = \exp\left( y \log\frac{\phi}{1 - \phi} + \log(1 - \phi) \right)

so that \eta = \log\frac{\phi}{1 - \phi} (equivalently \phi = \frac{1}{1 + e^{-\eta}}), T(y) = y, a(\eta) = -\log(1 - \phi) = \log(1 + e^{\eta}), and b(y) = 1.

Then consider the Gaussian distribution (with σ² = 1):

p(y; \mu) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} (y - \mu)^2 \right) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{y^2}{2} \right) \exp\left( \mu y - \frac{\mu^2}{2} \right)

so that \eta = \mu, T(y) = y, a(\eta) = \frac{\eta^2}{2}, and b(y) = \frac{1}{\sqrt{2\pi}} e^{-y^2/2}.

A generalized linear model satisfies the following three assumptions: 1. y | x; θ follows an exponential family distribution with natural parameter η; 2. given features x, the goal is to predict the expected value of T(y), i.e. the hypothesis outputs h(x) = E[y | x] (in most cases T(y) = y); 3. the natural parameter and the input are linearly related: η = θ^T x.

For the linear model of Part II, we assume that the result y follows the Gaussian distribution N(μ, σ²), so the expectation is μ = η, and therefore

h_\theta(x) = E[y \mid x; \theta] = \mu = \eta = \theta^T x

So the hypothesis model used in Part II follows directly from the generalized linear model.

For the logistic model, the results fall into two classes, so the Bernoulli distribution is the natural choice. Thus y | x; θ follows Bernoulli(φ) with E[y | x; θ] = φ, and so

h_\theta(x) = E[y \mid x; \theta] = \phi = \frac{1}{1 + e^{-\eta}} = \frac{1}{1 + e^{-\theta^T x}}

We thus recover the logistic hypothesis model, which also explains why logistic regression uses the sigmoid function.
