"Reprint" to the understanding of linear regression, logistic regression and general regression


Understanding Linear Regression, Logistic Regression, and General Regression

"Please specify the source when reproduced": Http://www.cnblogs.com/jerrylead

Jerrylead

February 27, 2011

As a machine learning beginner, my understanding is limited and the writing surely contains many mistakes; I welcome criticism and corrections.

1 Summary

This report is a summary of, and my understanding of, the first four lectures of Stanford's Machine Learning course together with the accompanying handouts. The first four lectures mainly cover the regression problem; regression is a method of supervised learning. The core idea of the method is to obtain a mathematical model from continuous statistical data, and then use that model for prediction or classification. The data handled by this method can be multidimensional.

The handout first introduces a basic problem, then derives the solution by linear regression, and then gives a probabilistic explanation of the error function. Logistic regression is introduced next. Finally, it rises to the theoretical level and proposes the general linear model (general regression).

2 Problem Introduction

This example comes from http://www.cnblogs.com/LeftNotEasy/archive/2010/12/05/mathmatic_in_machine_learning_1_regression_and_gradient_descent.html

Suppose we have house sales data as follows:

Area (m^2)    Sales price (10,000 yuan)
123           250
150           320
87            160
102           220
...           ...

This table resembles housing prices around Beijing's 5th Ring Road. We can plot the data, with house area on the x-axis and sales price on the y-axis. (Figure: scatter plot of sales price versus house area.)

If we are given a new area for which there is no sales-price record, what do we do?

We can fit the data as accurately as possible with a curve; then, for a new input, we can return the value of the corresponding point on the curve. If we fit with a straight line, it might look like this: (Figure: a straight line fitted to the data points.)

The green dots are the points we want to predict.
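As a concrete illustration, here is a minimal numpy sketch that fits a straight line to the four rows of the table above and reads off a prediction; the query area of 130 m^2 is an assumption made up for the example:

```python
import numpy as np

# Area (m^2) and price (10,000 yuan) from the table above
area = np.array([123.0, 150.0, 87.0, 102.0])
price = np.array([250.0, 320.0, 160.0, 220.0])

# Least-squares fit of a straight line: price ~ slope * area + intercept
slope, intercept = np.polyfit(area, price, deg=1)

new_area = 130.0                        # a new house with no sales record
print(slope * new_area + intercept)     # predicted price on the fitted line
```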

First, some concepts and commonly used symbols are given.

House sales record table: the training set (training data), the input data in our process, commonly denoted x.

House sales price: the output data, commonly denoted y.

Fitted function (also called hypothesis or model): generally written y = h(x).

Number of entries in the training data (m, #training set): one training example consists of a pair of input and output data. Dimension of the input data: n (the number of features, #features).

The features in this example are two-dimensional, and the result is one-dimensional. The regression method, however, can handle problems with multidimensional features whose result is a one-dimensional discrete value or a one-dimensional continuous value.

3 Learning Process

Here is a typical machine learning process: first, give the algorithm input data; through a series of steps it produces an estimated function, which can then give estimates for new data it has not seen before. This is also called building a model, just like the linear regression function above.

4 Linear Regression

Linear regression assumes that the features and the result satisfy a linear relationship. A linear relationship is actually quite expressive: the effect of each feature on the result is captured by the parameter in front of it, and each feature variable can first be mapped through a function before participating in the linear combination. In this way, nonlinear relationships between the features and the result can also be expressed.

We use x1, x2, ..., xn to describe the components of the feature, for example x1 = the area of the house, x2 = the orientation of the house, and so on. We can then write an estimate function:

h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n

θ here are called parameters; they adjust the influence of each component of the feature, that is, whether the area of the house or the location of the house matters more. If we set x0 = 1, we can write this in vector form:

h_\theta(x) = \theta^T x = \sum_{i=0}^{n} \theta_i x_i
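In code, the vectorized hypothesis is just a dot product. A minimal numpy sketch; the example values of θ and x are made up for illustration:

```python
import numpy as np

def h(theta, x):
    """Linear hypothesis h_theta(x) = theta^T x.
    x must carry the intercept term, i.e. x[0] == 1."""
    return theta @ x

theta = np.array([50.0, 1.5, 10.0])   # hypothetical theta_0, theta_1, theta_2
x = np.array([1.0, 123.0, 2.0])       # x0 = 1, area = 123, orientation code = 2
print(h(theta, x))                    # the estimated price
```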

Our program also needs a mechanism to evaluate whether a given θ is good, so we need a way to evaluate our h function. This is called the loss function (or error function), and it describes how bad the h function is. Below, we call this function the J function.

Here we can define the error function as follows:

J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

This error function takes the sum of the squared differences between the estimated value h_θ(x^(i)) and the true value y^(i) over all m training examples; the 1/2 in front is there so that the coefficient cancels when taking the derivative.
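The error function translates directly into code. A sketch, assuming X is the m × (n+1) design matrix whose first column is all ones and y is the vector of m true values:

```python
import numpy as np

def J(theta, X, y):
    """Squared-error cost J(theta) = (1/2) * sum_i (h_theta(x_i) - y_i)^2."""
    r = X @ theta - y          # residuals h_theta(x^(i)) - y^(i)
    return 0.5 * (r @ r)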

As for why the squared error is chosen as the error function, the handout later explains the origin of this formula from the perspective of probability distributions.

There are many methods for adjusting θ so that J(θ) attains its minimum, including least squares, which is a purely mathematical, closed-form method, and the gradient descent method.

5 Gradient Descent Method

After the linear regression model is selected, it can be used for prediction only once the parameter θ is determined. However, θ must be determined so that J(θ) is smallest. So the problem reduces to finding a minimum, for which we use the gradient descent method. The biggest problem with gradient descent is that it may find only a local minimum, which depends on the choice of the initial point.

The gradient descent method is carried out according to the following process:

1) First assign a value to θ; this can be random, or θ can be set to an all-zero vector.

2) Change the value of θ so that J(θ) decreases along the direction of gradient descent.

The gradient direction is given by the partial derivatives of J(θ) with respect to θ; since we want the minimum, we move in the direction opposite to the gradient. The result is

\frac{\partial}{\partial \theta_j} J(\theta) = \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) = \theta_j + \alpha \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}

There are two ways to iterate the update. One is batch gradient descent, which updates θ using the error over the entire training set; the other is stochastic (incremental) gradient descent, which updates θ after scanning each single example. The former converges steadily, while the result of the latter may keep hovering around the point of convergence.
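Both update schemes can be sketched in a few lines; the learning rate alpha and the iteration counts below are illustrative assumptions, and X and y are as in the cost-function sketch above:

```python
import numpy as np

def batch_gd(X, y, alpha=1e-4, iters=1000):
    """Batch gradient descent: each update of theta uses the
    error over the whole training set."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= alpha * (X.T @ (X @ theta - y))   # alpha * dJ/dtheta
    return theta

def stochastic_gd(X, y, alpha=1e-4, epochs=50):
    """Stochastic (incremental) gradient descent: theta is updated
    after scanning each single example."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            theta += alpha * (yi - xi @ theta) * xi
    return theta
```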

In general, the convergence rate of gradient descent is relatively slow.

Another method, which computes the result directly, is least squares.

6 Least Squares

The training features are arranged as a matrix X (one example per row), and the results as a vector y; the linear regression model and the error function stay the same. Then θ can be obtained directly from the following formula:

\theta = (X^T X)^{-1} X^T y

However, this method requires X to have full column rank (so that X^T X is invertible), and computing the matrix inverse is relatively slow.
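A sketch of the closed-form solution; it uses a linear solve rather than an explicit inverse, which is the usual numerically safer choice:

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^{-1} X^T y; requires X to have full column rank."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```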

7 Probabilistic Interpretation of Using the Sum of Squares as the Error Function

Assume there is an error between the result predicted from the features and the actual result; then the prediction and the true result satisfy the following formula:

y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}

In general, the error ε^(i) is assumed to follow a Gaussian (normal) distribution with mean 0. So the conditional probability of y given x is

p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right)

This gives the probability of the result for a single sample, but we want the model to fit all the samples as well as possible, i.e., to maximize the product of these probabilities. Note that this product is a product of probability density values: the probability density function of a continuous variable is not the same as the probability mass function of a discrete one. This product is the likelihood; choosing θ to maximize it is maximum likelihood estimation. Maximizing the likelihood is done by taking the logarithm and differentiating; the resulting derivation is sketched below.
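For completeness, here is the standard step from the log-likelihood to the squared error (σ is the standard deviation of the Gaussian noise):

```latex
\ell(\theta) = \log \prod_{i=1}^{m}
  \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\!\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)
= m \log \frac{1}{\sqrt{2\pi}\,\sigma}
  - \frac{1}{\sigma^2} \cdot \frac{1}{2}
    \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2
```

Since the first term does not depend on θ, maximizing ℓ(θ) is the same as minimizing the familiar J(θ).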

This explains why the error function uses the sum of squares.

Of course, some assumptions are made in the derivation (independent, zero-mean Gaussian errors), but these assumptions agree with objective reality.

8 Linear Regression with Weights

In the linear regression above, the coefficient of every term in the error function is 1; there are no weights. Weighted linear regression adds weight information.

The basic assumption is that we fit θ to minimize

\sum_{i=1}^{m} w^{(i)} \left( y^{(i)} - \theta^T x^{(i)} \right)^2

where the weight w^(i) is assumed to satisfy the formula

w^{(i)} = \exp\!\left( -\frac{(x^{(i)} - x)^2}{2\tau^2} \right)

where x is the point at which we want to predict. Samples closer to x receive larger weights, and the farther a sample is from x, the smaller its effect. The formula resembles a Gaussian distribution, but it is not one, because w^(i) is not a random variable.

This method is a non-parametric learning algorithm: because the error function changes with the query point, θ cannot be determined in advance, and it must be recomputed for every prediction. It feels similar to KNN.
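A sketch of a single prediction with locally weighted regression; it solves the weighted normal equation theta = (X^T W X)^{-1} X^T W y, and the bandwidth tau is an assumed hyperparameter:

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=1.0):
    """Locally weighted linear regression at one query point.
    Weights w_i = exp(-|x_i - x_query|^2 / (2 tau^2)) shrink with
    distance, so theta must be re-solved for every prediction."""
    d2 = np.sum((X[:, 1:] - x_query[1:]) ** 2, axis=1)  # skip intercept column
    w = np.exp(-d2 / (2.0 * tau ** 2))
    XtW = X.T * w                           # X^T W, with W diagonal
    theta = np.linalg.solve(XtW @ X, XtW @ y)
    return x_query @ theta
```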

9 Classification and Logistic Regression

In general, regression is not used for classification problems, because regression is a continuous model and is heavily affected by noise. If you insist on applying it, though, you can use logistic regression.

Logistic regression is essentially linear regression, except that a function mapping is added between the features and the result: the features are first linearly combined, and then the function g(z) is applied as the final hypothesis function. g(z) maps a continuous value into the interval (0, 1).

The hypothesis function of logistic regression is as follows, whereas the linear regression hypothesis function is just θ^T x:

h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}
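The mapping g(z) in code, a minimal sketch:

```python
import numpy as np

def g(z):
    """Logistic (sigmoid) function, mapping R into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h_logistic(theta, x):
    """Logistic hypothesis: the linear sum theta^T x passed through g."""
    return g(theta @ x)
```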

Logistic regression is used for 0/1 classification, i.e., the binary classification problem where the predicted result belongs to 0 or 1. It assumes the binary outcome satisfies a Bernoulli distribution, i.e.

P(y = 1 \mid x; \theta) = h_\theta(x)

P(y = 0 \mid x; \theta) = 1 - h_\theta(x)

Of course, one could also assume it satisfies a Poisson distribution, an exponential distribution, and so on, but that is more complicated; the general form of linear regression will be discussed below.

As in Section 7, we again form the maximum likelihood estimate, take the derivative, and obtain the iterative update

\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}

You can see that it looks the same as the linear regression update, but here h_θ(x) is actually mapped through g(z).
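The resulting update, sketched in the same style as the linear-regression code above; alpha and epochs are illustrative assumptions:

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_sgd(X, y, alpha=0.01, epochs=100):
    """Stochastic gradient ascent on the logistic log-likelihood.
    The update mirrors linear regression, but h = g(theta^T x)."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            theta += alpha * (yi - g(xi @ theta)) * xi
    return theta
```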

10 Newton's Method for Maximum Likelihood Estimation

The maximum likelihood estimation methods used in Sections 7 and 9 are derivative-based iteration. The handout then introduces Newton's method, which makes the results converge quickly.

To solve f(θ) = 0 when f is differentiable, we can iterate the formula

\theta := \theta - \frac{f(\theta)}{f'(\theta)}

to find the solution iteratively.

When applied to maximum likelihood estimation, the problem becomes finding a root of the derivative ℓ′(θ) of the log-likelihood.

So the iterative formula is written

\theta := \theta - \frac{\ell'(\theta)}{\ell''(\theta)}

When θ is a vector, Newton's method can be written as

\theta := \theta - H^{-1} \nabla_\theta \ell(\theta)

where H is the n×n Hessian matrix with entries

H_{ij} = \frac{\partial^2 \ell(\theta)}{\partial \theta_i \, \partial \theta_j}

Although Newton's method converges quickly, computing the inverse of the Hessian matrix at each iteration is time-consuming.

When the initial point x0 is near the minimum point x*, Newton's method converges fastest. But when x0 is far from the minimum, Newton's method may not converge, and even descent cannot be guaranteed. The reason is that the iterate x_{k+1} is not necessarily the minimizer of the objective function f along the Newton direction.
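A sketch of Newton's method for the logistic log-likelihood, using the standard gradient X^T (y - h) and Hessian -X^T diag(h(1-h)) X; the iteration count is an assumption, and as noted above there is no convergence guarantee far from the optimum:

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, iters=10):
    """theta := theta - H^{-1} grad for the logistic log-likelihood."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = g(X @ theta)
        grad = X.T @ (y - h)                  # gradient of the log-likelihood
        H = -(X.T * (h * (1.0 - h))) @ X      # Hessian (negative definite)
        theta -= np.linalg.solve(H, grad)     # Newton step
    return theta
```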

11 General Linear Model

The reason logistic regression uses the formula

g(z) = \frac{1}{1 + e^{-z}}

is that it is supported by a body of theory. That theory is the general linear model.

First, if a probability distribution can be expressed as

p(y; \eta) = b(y) \exp\!\left( \eta^T T(y) - a(\eta) \right)

then this probability distribution is called an exponential family distribution.

The Bernoulli, Gaussian, Poisson, beta, and Dirichlet distributions all belong to the exponential family.

In logistic regression the Bernoulli distribution is used, and the Bernoulli probability can be expressed as

p(y; \phi) = \phi^y (1 - \phi)^{1-y} = \exp\!\left( \log\!\frac{\phi}{1-\phi} \cdot y + \log(1 - \phi) \right)

in which the natural parameter is

\eta = \log \frac{\phi}{1 - \phi}

Solving for φ, we get

\phi = \frac{1}{1 + e^{-\eta}}

This explains why logistic regression uses this function.

The main points of the general linear model are:

1) y | x; θ satisfies an exponential family distribution with parameter η; from this the expression for the distribution can be obtained.

2) Given x, our goal is to predict the expected value of T(y); in most cases T(y) = y, so what we actually want is h(x) = E[y | x]. (In logistic regression the expected value is φ, so h is φ = 1/(1 + e^{-η}); in linear regression the expected value is μ, and in the Gaussian distribution μ = η, so for linear regression h = η = θ^T x.)

3) η = θ^T x (η and x are linearly related; if η is a vector, then η_i = θ_i^T x).

Softmax Regression

Finally, an example of using a general linear model is given.

Assume the predicted value y takes k possible values, that is, y ∈ {1, 2, ..., k}.

For example, with k = 3, the task could be classifying an incoming email into three categories: spam, personal mail, or work mail.

Define

\phi_i = p(y = i; \phi), \quad i = 1, \ldots, k

so that

\sum_{i=1}^{k} \phi_i = 1

and therefore

\phi_k = 1 - \sum_{i=1}^{k-1} \phi_i

That is, the last probability can be expressed through the others, so the problem can be treated as (k−1)-dimensional.

To express the multinomial distribution as an exponential family distribution, we introduce T(y), a set of (k−1)-dimensional vectors. Note that here T(y) is not equal to y; rather, T(y) is a vector whose i-th component is written (T(y))_i.

Applied to the general linear model, the result y must be one of the k values. We use the indicator function 1{·}: 1{y = k} equals 1 when y = k and 0 otherwise, and set (T(y))_i = 1{y = i}. Then p(y) can be expressed as

p(y; \phi) = \phi_1^{1\{y=1\}} \, \phi_2^{1\{y=2\}} \cdots \phi_k^{1\{y=k\}}

This is actually easy to understand: when y takes some value m (m from 1 to k), p(y) = φ_m; the expression above just writes this formally.

So

p(y; \phi) = \exp\!\left( \sum_{i=1}^{k-1} (T(y))_i \log\frac{\phi_i}{\phi_k} + \log \phi_k \right)

with natural parameters η_i = log(φ_i / φ_k). At last, inverting this relation (and taking η_k = 0), we obtain

\phi_i = \frac{e^{\eta_i}}{\sum_{j=1}^{k} e^{\eta_j}}

And when y = i, with η_i = θ_i^T x,

p(y = i \mid x; \theta) = \phi_i = \frac{e^{\theta_i^T x}}{\sum_{j=1}^{k} e^{\theta_j^T x}}

Taking expectations gives the hypothesis

h_\theta(x) = \mathrm{E}[\, T(y) \mid x; \theta \,] = (\phi_1, \phi_2, \ldots, \phi_{k-1})^T

With the hypothesis function established, the maximum likelihood estimate can be formed in the same way as before.

The resulting formula can be solved using gradient descent or Newton's method.
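A sketch of the resulting hypothesis, with one parameter vector θ_i per class; fixing the last row to zero matches η_k = 0 in the derivation, and the 3-class example values are made up for illustration:

```python
import numpy as np

def softmax_probs(Theta, x):
    """phi_i = exp(theta_i^T x) / sum_j exp(theta_j^T x) for each class i.
    Theta is (k, n+1), one row per class; x includes the intercept term."""
    z = Theta @ x
    z = z - z.max()        # shift for numerical stability (probs unchanged)
    e = np.exp(z)
    return e / e.sum()

# Hypothetical 3-class example (spam / personal / work)
Theta = np.array([[0.2, 0.1], [0.0, -0.3], [0.0, 0.0]])  # last row fixed to 0
x = np.array([1.0, 2.5])
print(softmax_probs(Theta, x))   # three probabilities summing to 1
```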

This solves the problem of building and predicting with a multi-valued model.

Learning Summary

The handout is clearly structured and thoughtfully reasoned, giving derivations as well. What is valuable is that it spells out the basic approach to each problem and extends the ideas; more importantly, it explains why the relevant methods are used and where the problems come from. From seemingly concrete problem-solving it leads to more abstract, general problem-solving ideas; the theoretical level is very high.

These methods can be used for multidimensional analysis and multi-valued prediction of data, and they are especially suitable when some probabilistic model underlies the data.

Several questions

One: when using the iterative methods, how is the step size best determined?

Two: is the matrix (closed) form of least squares generally applicable?

"Reprint" to the understanding of linear regression, logistic regression and general regression
