Logistic Regression


What is logistic regression?

Logistic regression and multiple linear regression have many similarities; the biggest difference is the type of the dependent variable, while the rest is largely the same. For this reason, the two kinds of regression belong to the same family: generalized linear models (GLM).

The models in this family have essentially the same form; they differ in the distribution of the dependent variable.

If the dependent variable is continuous, the model is multiple linear regression; if it follows a binomial distribution, it is logistic regression; if it follows a Poisson distribution, it is Poisson regression; if it follows a negative binomial distribution, it is negative binomial regression.

The dependent variable of logistic regression can be binary or multinomial, but the binary case is more common and easier to interpret, so binary logistic regression is what is used most often in practice.

Main uses of logistic regression:

Finding risk factors: identifying the risk factors of a particular disease.

Prediction: using the fitted model to predict the probability that a disease or outcome occurs under different values of the independent variables.

Discrimination: similar to prediction; the model is used to judge the probability that a person belongs to a certain disease group or condition, that is, how likely it is that this person has the disease.

Logistic regression is used mainly in epidemiology; typical applications are exploring the risk factors of a disease and predicting the probability of its occurrence from those factors. For example, to explore the risk factors of gastric cancer, you can choose two groups of people, one with gastric cancer and one without; the two groups will differ in physical signs and lifestyle. The dependent variable here is whether the subject has gastric cancer ("yes" or "no"), and the independent variables can include age, sex, eating habits, Helicobacter pylori infection, and so on. The independent variables can be either continuous or categorical.

General Steps

The general steps for a regression problem are:

(1) Choose the hypothesis function h;

(2) Construct the loss function J;

(3) Find a way to minimize J and obtain the regression parameters θ.


Constructing the prediction function h

Although its name says "regression", logistic regression is actually a classification method, used mainly for binary classification problems (the output takes only two values, representing the two categories). It uses the logistic function (also called the sigmoid function):

g(z) = \frac{1}{1 + e^{-z}}

The sigmoid function has a graceful "S" shape (figure omitted: the S-shaped sigmoid curve, from Wikipedia).
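As a minimal sketch of this function (assuming Python with NumPy, which the article itself does not use), the S-shape can be checked numerically:

    import numpy as np

    def sigmoid(z):
        """Logistic (sigmoid) function g(z) = 1 / (1 + exp(-z)); works elementwise."""
        return 1.0 / (1.0 + np.exp(-z))

    # g maps the real line into (0, 1): large negative z -> ~0, zero -> 0.5, large positive z -> ~1
    print(sigmoid(np.array([-6.0, 0.0, 6.0])))  # approx [0.0025, 0.5, 0.9975]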

The left figure shows a linear decision boundary; the right figure shows a nonlinear decision boundary.



For a linear decision boundary, the boundary has the form:

\theta^T x = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n = \sum_{j=0}^{n} \theta_j x_j = 0

The prediction function is constructed as:

h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}

The value of h_\theta(x) has a special meaning: it is the probability that the result is 1. For an input x, the probabilities of category 1 and category 0 are therefore:

P(y = 1 \mid x; \theta) = h_\theta(x)

P(y = 0 \mid x; \theta) = 1 - h_\theta(x)
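A small sketch of the prediction function under these definitions (Python/NumPy assumed; the parameter values are illustrative, and the feature vector includes the intercept term x_0 = 1):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def h(theta, x):
        """h_theta(x) = g(theta^T x): the probability that y = 1 for input x."""
        return sigmoid(np.dot(theta, x))

    theta = np.array([-1.0, 2.0])   # illustrative parameters
    x = np.array([1.0, 0.8])        # x_0 = 1 (intercept), x_1 = 0.8
    p1 = h(theta, x)                # P(y = 1 | x; theta)
    p0 = 1.0 - p1                   # P(y = 0 | x; theta)
    print(p1, p0)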



Constructing the loss function J

The cost function and the loss function J are as follows; they are derived from maximum likelihood estimation:

Cost(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} Cost(h_\theta(x^{(i)}), y^{(i)}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]
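A direct transcription of J(θ) in NumPy (a sketch; the array shapes are assumptions: x is m × (n+1) with an intercept column, y is a 0/1 vector of length m):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cost_J(theta, x, y):
        """J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ]."""
        m = y.size
        hx = sigmoid(x @ theta)  # h_theta for all m samples at once
        return -(1.0 / m) * np.sum(y * np.log(hx) + (1 - y) * np.log(1 - hx))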



The derivation is as follows.

The two probability expressions above can be combined into one:

P(y \mid x; \theta) = (h_\theta(x))^y (1 - h_\theta(x))^{1 - y}

Take the likelihood function:

L(\theta) = \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m} (h_\theta(x^{(i)}))^{y^{(i)}} (1 - h_\theta(x^{(i)}))^{1 - y^{(i)}}

The log-likelihood is:

l(\theta) = \log L(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]
Maximum likelihood estimation looks for the θ that maximizes l(θ), which could be solved with gradient ascent; the θ found this way is the required best parameter. In Andrew Ng's course, however, the function is rescaled as:

J(\theta) = -\frac{1}{m} l(\theta)

Because of the negative factor -1/m, maximizing l(θ) turns into minimizing J(θ), and the θ at the minimum is the required best parameter.


Minimizing J(θ) with the gradient descent method

The θ update process is:

\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}

After working out the partial derivative, the update can be written as:

\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
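A sketch of one gradient-descent step written the way the formula reads, with an explicit loop over the m samples (Python/NumPy assumed; variable names are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gradient_step(theta, x, y, alpha):
        """One update theta_j := theta_j - alpha*(1/m)*sum_i (h(x_i) - y_i)*x_ij."""
        m, n = x.shape
        grad = np.zeros(n)
        for i in range(m):  # the summation is an explicit loop here
            error = sigmoid(np.dot(theta, x[i])) - y[i]
            grad += error * x[i]
        return theta - alpha * grad / m

The explicit loop is what the vectorization in the next section removes.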

 


Vectorization

Vectorization replaces for loops with matrix operations, which simplifies the computation and improves efficiency.

In the update rule above, Σ(...) is a summation over the m samples, which would normally require a for loop running m times, so the formula as written is not fully vectorized.


The vectorization proceeds as follows.

Agree on the following matrix form for the training data, where each row of x is a training sample and each column is a feature:

x = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}, \qquad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}, \qquad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}

Let A = xθ, a column vector; the argument A of g(A) is therefore a column vector, so the implementation of g must accept a column vector as its argument and return a column vector. From the formulas above, the whole error vector h_θ(x) − y can then be obtained in a single computation as E = g(A) − y.

The θ update process can then be changed to:

\theta := \theta - \frac{\alpha}{m} \, x^T \left( g(x\theta) - y \right) = \theta - \frac{\alpha}{m} \, x^T E
To sum up, one vectorized θ update consists of the following steps:

(1) Compute A = xθ;

(2) Compute E = g(A) − y;

(3) Compute θ := θ − (α/m) x^T E.
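The three steps collapse into a few matrix operations; a sketch of the fully vectorized update (NumPy assumed):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))  # accepts and returns a vector, as required of g

    def vectorized_step(theta, x, y, alpha):
        """(1) A = x.theta   (2) E = g(A) - y   (3) theta := theta - (alpha/m) * x^T.E"""
        m = y.size
        A = x @ theta                             # step (1)
        E = sigmoid(A) - y                        # step (2)
        return theta - (alpha / m) * (x.T @ E)    # step (3)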

Regularization

The overfitting problem

In the loss function of a linear or logistic regression model, some weights may become very large and others very small, leading to overfitting (fitting the training data too closely). This raises the complexity of the model and hurts its generalization ability (the ability to predict unseen data).

In the figure, the left panel is underfitting, the middle panel a proper fit, and the right panel overfitting.


Main cause of the problem

Overfitting often originates from having too many features.

Solutions

1. Reduce the number of features (dropping features loses some information, even when the features are well chosen).

You can select which features to keep manually, or use a model selection algorithm.

2. Regularization (more effective when there are many features).

Keep all the features, but reduce the magnitude of the parameters θ.

Regularization Method

Regularization implements the strategy of structural risk minimization: a regularization term (penalty term) is added to the empirical risk. The regularization term is generally a monotonically increasing function of model complexity; the more complex the model, the larger the term.

Start from the housing-price prediction example, which is a polynomial regression. The left figure shows a proper fit; the right figure shows overfitting.


Intuitively, to remove the overfitting in this example it is best to eliminate the influence of the high-order terms, that is, to make θ_3 ≈ 0 and θ_4 ≈ 0. Suppose we penalize θ_3 and θ_4 to keep them small; an easy way is to add two penalty terms for them to the original cost function, for example:

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2

When this cost function is minimized, θ_3 ≈ 0 and θ_4 ≈ 0.

The regularization term can take different forms. In regression problems it is common to take the squared magnitude, i.e. the L2 norm of the parameters; the L1 norm can also be used. With the L2 penalty, the loss function of the model becomes:

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2
Here λ is the regularization factor:

If λ is large, complex models are heavily penalized and errors in fitting the training data are weighted less; the model will not overfit the data, the bias on the training data will be larger and the variance on unseen data smaller, but underfitting may occur. If λ is small, fitting the training data matters more; the bias on the training data will be small, but overfitting may result.

After regularization, the gradient-descent update for θ becomes (θ_0 is not regularized):

\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}

\theta_j := \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}, \qquad j = 1, \dots, n
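In vectorized form, the regularized update can be sketched as follows (NumPy assumed; θ_0 is excluded from the penalty):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def regularized_step(theta, x, y, alpha, lam):
        """Gradient step for the L2-regularized logistic loss; theta[0] is not shrunk."""
        m = y.size
        E = sigmoid(x @ theta) - y
        grad = (x.T @ E) / m
        grad[1:] += (lam / m) * theta[1:]  # penalty applies to theta_1..theta_n only
        return theta - alpha * grad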


After regularization, the normal equation for linear regression becomes:

\theta = \left( X^T X + \lambda \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix} \right)^{-1} X^T y

where the diagonal matrix is (n+1) × (n+1) and its top-left 0 keeps θ_0 out of the penalty.
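A sketch of this closed form in NumPy (note this is for linear regression, not logistic regression):

    import numpy as np

    def normal_equation_regularized(X, y, lam):
        """theta = (X^T X + lam * M)^{-1} X^T y, with M = diag(0, 1, ..., 1)."""
        n = X.shape[1]
        M = np.eye(n)
        M[0, 0] = 0.0  # do not regularize the intercept
        return np.linalg.solve(X.T @ X + lam * M, X.T @ y)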



Other optimization algorithms

Conjugate gradient method (CG)

Quasi-Newton methods

BFGS

L-BFGS (limited-memory BFGS)

The latter two are derived from the quasi-Newton method. Compared with the gradient descent algorithm, these algorithms have two advantages:

First, there is no need to choose the step size (learning rate) manually; second, they are usually faster than gradient descent.

Their downside is that they are more complicated.
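In practice these optimizers are usually called from a library rather than hand-written. As a sketch (assuming SciPy, which the article does not mention), the cost and gradient derived earlier can be handed to scipy.optimize.minimize; the synthetic data here is purely illustrative:

    import numpy as np
    from scipy.optimize import minimize

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cost_and_grad(theta, x, y):
        """Return J(theta) and its gradient; the optimizer picks the step size itself."""
        m = y.size
        hx = sigmoid(x @ theta)
        J = -(1.0 / m) * np.sum(y * np.log(hx) + (1 - y) * np.log(1 - hx))
        grad = (x.T @ (hx - y)) / m
        return J, grad

    # tiny synthetic example: 4 samples, intercept column plus one feature
    x = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, -0.5], [1.0, -1.5]])
    y = np.array([1.0, 0.0, 1.0, 0.0])
    res = minimize(cost_and_grad, np.zeros(2), args=(x, y), jac=True, method="L-BFGS-B")
    print(res.x)  # fitted theta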

Multi-class classification

A multi-class problem can be decomposed into binary classification problems (one-vs-all): keep one class as the positive class and merge all the remaining classes into the negative class.

For each class i, train a logistic regression classifier h_\theta^{(i)}(x) that predicts the probability that y = i. For a new input x, run every classifier and take the class with the highest probability as the classification result:

\max_i \, h_\theta^{(i)}(x)
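A compact sketch of one-vs-all on top of the vectorized update above (NumPy assumed; the fixed learning rate and iteration count are illustrative choices):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_binary(x, y, alpha=0.1, iters=1000):
        """Plain gradient descent for one binary logistic classifier."""
        theta = np.zeros(x.shape[1])
        m = y.size
        for _ in range(iters):
            theta -= (alpha / m) * (x.T @ (sigmoid(x @ theta) - y))
        return theta

    def one_vs_all(x, y, classes):
        """Train one classifier per class i, relabeling y == i as the positive class."""
        return {i: train_binary(x, (y == i).astype(float)) for i in classes}

    def predict(models, x_new):
        """Pick the class whose classifier outputs the highest probability."""
        return max(models, key=lambda i: sigmoid(models[i] @ x_new))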


Reference Links

http://blog.csdn.net/dongtingzhizi/article/details/15962797

Coursera open course notes: "Logistic Regression", Stanford University Machine Learning, Lecture 6

Coursera open course notes: "Regularization", Stanford University Machine Learning, Lecture 7

