The principle and implementation of the logistic regression algorithm (LR)

Source: Internet
Author: User
Tags: natural logarithm

Logistic regression, although called "regression", is a classification learning method. It has two main use cases: making predictions, and identifying the factors that affect the dependent variable. Logistic regression (LR), also known as logistic regression analysis, is a classification and prediction algorithm: it predicts the probability of a future outcome from the patterns in historical data. For example, we can set the probability of a purchase as the dependent variable and set the user's characteristic attributes, such as gender, age, and registration time, as independent variables, then predict the probability of purchase from those attributes. Logistic regression has much in common with regression analysis, so we review regression analysis before introducing logistic regression.

Regression analysis describes the relationship between an independent variable x and a dependent variable y, or the degree to which x affects y, and predicts y. The dependent variable is the result we want to obtain; the independent variables are the underlying factors that affect that result, and there may be one or several of them. Regression with one independent variable is called univariate regression analysis; regression with more than one is called multivariate regression analysis.

Below is a set of advertising cost and exposure data, where each cost corresponds to a number of exposures. The number of exposures is what we want to predict, and cost is the factor that affects it, so we set cost as the independent variable x and the number of exposures as the dependent variable y. Using a simple linear regression equation and its coefficient of determination, we can find how cost (x) affects exposures (y).

The following is the simple linear regression model, where y is the dependent variable and x is the independent variable. We only need the intercept b0 and the slope b1 to describe the relationship between cost and exposures and to predict the number of exposures. Here we use the least squares method to calculate b0 and b1; least squares finds the best-fitting function by minimizing the sum of squared errors.
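The least-squares calculation described here can be sketched in a few lines of Python; the data below is made up for illustration, not the article's advertising figures:

```python
def least_squares(xs, ys):
    # Fit y = b0 + b1*x by minimizing the sum of squared errors.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # b1 = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x   # the fitted line passes through (mean_x, mean_y)
    return b0, b1

# made-up data: cost (x) and exposures (y), here exactly y = 1 + 2x
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
b0, b1 = least_squares(xs, ys)
print(b0, b1)  # 1.0 2.0
```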

The following table shows the key computations for fitting the regression equation with least squares. The two leftmost columns are the independent variable x and the dependent variable y. We first calculate the means of the independent and dependent variables, then the difference between each observation and its mean, which is the data needed to calculate the slope b1 of the regression equation.

According to the data in the table, the slope b1 of the regression equation is calculated by the formula below. The slope expresses the relationship between the independent and dependent variables: a positive slope means they are positively correlated, a negative slope means they are negatively correlated, and a slope of 0 means they are unrelated.

After the slope b1 is obtained, the y-axis intercept b0 can be calculated with the following formula.

Substituting the slope b1 and intercept b0 into the regression equation gives the relationship between the independent and dependent variables: for each 1-yuan increase in cost, the number of exposures increases by 7,437. The regression equation and its illustration follow.

In the diagram of the regression equation there is also an R², called the coefficient of determination, which measures how well the regression equation fits the sample data. The coefficient of determination lies between 0 and 1; the larger the value, the better the fit, in other words, the more of the dependent variable's variation the equation explains. It is based on the decomposition SST = SSR + SSE, where SST is the total sum of squares, SSR is the regression sum of squares, and SSE is the sum of squared errors; R² = SSR/SST. The following table shows the calculations needed for these three quantities.

Based on the regression sum of squares (SSR) and the total sum of squares (SST), the coefficient of determination here is 0.94344.
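The decomposition SST = SSR + SSE and the resulting R² can be sketched as follows, again with toy data rather than the article's figures:

```python
def r_squared(ys, y_hats):
    # SST = SSR + SSE; the coefficient of determination is R^2 = SSR / SST
    mean_y = sum(ys) / len(ys)
    sst = sum((y - mean_y) ** 2 for y in ys)               # total sum of squares
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))  # sum of squared errors
    ssr = sst - sse                                        # regression sum of squares
    return ssr / sst

ys = [3, 5, 7, 9, 11]
y_hats = [3, 5, 7, 9, 10]   # last prediction is off by one
print(r_squared(ys, y_hats))  # 0.975
```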

The above is the calculation process of the regression equation; with it we can forecast the number of exposures from the cost. Compared with linear regression, logistic regression adds a logistic function on top of the linear model. For example, a user's attributes and characteristics determine whether the user will eventually make a purchase: the probability of purchase is the dependent variable y, and the user's attributes and characteristics are the independent variables x. The greater the value of y, the more likely the user is to buy. Here we use the odds of the event to express the ratio of purchasing to not purchasing.

Using E for the purchase event, P(E) is the probability of purchase, P(E') is the probability of not purchasing, and Odds(E) is the odds of event E (purchase) occurring.

Odds range from 0 to infinity; the greater the odds, the more likely the event is to occur. Now we want to convert the odds into a probability function bounded between 0 and 1. First, take the natural logarithm of the odds to obtain the logit equation; the logit takes values from negative infinity to positive infinity.

Based on the logit equation above, the following formula is obtained:

Here π replaces P(E) in the formula, π = P(E). From the exponential function and the rules of logarithms, the following formula is obtained:

And finally we get the logistic regression equation:

The following uses the logistic regression equation to estimate the probability of a user's purchase. The table below contains each user's number of registered days and whether they purchased, where registered days is the independent variable x and whether the user purchased is the dependent variable y. We mark a purchase as 1 and a non-purchase as 0.

Next we will calculate the slope and intercept of the logistic regression equation in Excel in 8 steps, and then use the equation to predict whether new users will buy.

    • The first step is to use Excel's sort function to order the original data by the dependent variable y, separating purchased from non-purchased records and making the data's structure easier to see.
    • The second step is to preset values for the slope b1 and the intercept b0 of the logit equation; here we preset both to 0.1. Excel will later replace these with the optimal solution.
    • The third step is to calculate the L value from the preset slope and intercept according to the logit equation.

    • The fourth step is to exponentiate the L value, that is, to compute e^L.
    • The fifth step is to compute P(x) = e^L/(1 + e^L), the probability that the event (a purchase) occurs.
    • See the detailed calculation steps and procedures.

    • The sixth step is to calculate the log-likelihood value for each observation: ln(P) if the user purchased, ln(1 - P) if not. See methods and procedures.
    • The seventh step is to sum the log-likelihood values.

    • The eighth step is to use Excel's Solver to maximize the summed log-likelihood: set the summed log-likelihood cell as the objective to maximize, set the preset slope b1 and intercept b0 as the variable cells, uncheck the "make unconstrained variables non-negative" option, and solve.

Excel automatically finds the optimal slope and intercept of the logistic regression equation, as shown in the results.
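For readers outside Excel, the eight-step procedure can be sketched in Python, with plain gradient ascent standing in for Excel's Solver; the data below is made up for illustration, not the article's:

```python
import math

def sigmoid(z):
    # numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_likelihood(b0, b1, xs, ys):
    # steps 3-7: logit L, probability P, per-row log-likelihood, then the sum
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def fit(xs, ys, lr=1e-4, steps=30000):
    # step 8, with gradient ascent replacing Excel's Solver
    b0, b1 = 0.1, 0.1  # the same starting guesses used in the text
    for _ in range(steps):
        g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys))
        g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys))
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

# made-up data: registration days (x) and purchased (1) or not (0)
xs = [5, 10, 20, 30, 45, 60, 80, 100]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit(xs, ys)
```

The log-likelihood is concave, so any reasonable optimizer (Solver, gradient ascent, Newton's method) converges to the same coefficients.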

After obtaining the slope and intercept of the logistic regression equation, we can substitute them into the equation to obtain a model of registration days versus purchase probability, and use it to predict the purchase probability (y) of users with different numbers of registered days (x). The calculation process follows.

    • The first step is to enter the value of the independent variable, registration days (x); here we enter 50 days.
    • The second step is to substitute the x value, the slope, and the intercept into the logit equation to find the L value.
    • The third step is to exponentiate the L value, computing e^L.
    • The fourth step is to compute P(x), the probability that the event occurs.

The probability that a user registered for 50 days makes a purchase is approximately 17.6%.
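The four prediction steps look like this in Python; the slope and intercept below are illustrative stand-ins, since the article's fitted values appear only in its Excel screenshots:

```python
import math

# Illustrative slope and intercept: hypothetical stand-ins, not the
# article's fitted coefficients.
b0, b1 = -3.0, 0.03

x = 50                   # step 1: registration days
L = b0 + b1 * x          # step 2: logit value L
eL = math.exp(L)         # step 3: e^L
p = eL / (1.0 + eL)      # step 4: purchase probability P(x)
print(round(p, 3))       # 0.182 with these stand-in coefficients
```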

Substituting every value of registered days into the purchase-probability model gives a curve of how registration days affect purchase probability. The curve shows that for users with few or many registered days, the purchase probability is relatively stable, while in the middle range it changes sharply.

We now add a new independent variable, "age", to the above calculation. The following is the raw data; there are now two independent variables, age and registration days, and one dependent variable.

Following the earlier method, we compute the optimal slope and intercept, obtain the logistic regression equation, and substitute different ages and registration days into it to get a prediction model based on the user's age and registered days. We use an Excel three-dimensional chart to draw the effect of age and registration days on purchase probability.

As can be seen, the purchase probability increases with the number of registered days, and for the same number of registered days, younger users have a relatively higher purchase probability.

Reproduced from: http://bluewhale.cc/2016-05-18/logistic-regression.html#ixzz4RbUh8R3T

1. From linear regression to logistic regression

Both linear regression and logistic regression are special cases of generalized linear models.

Suppose there is a dependent variable y and a set of independent variables x1, x2, x3, ..., xn, where y is a continuous variable; we can fit a linear equation:

y = β0 + β1*x1 + β2*x2 + β3*x3 + ... + βn*xn

The values of each β coefficient are estimated by the least squares method.

If y is a binary categorical variable that can only take the value 0 or 1, the linear regression equation runs into difficulty: the right side of the equation is a continuous value ranging from negative infinity to positive infinity, while the left side can only take values in [0, 1], so the two cannot correspond. To keep using the idea of linear regression, statisticians looked for a transformation that maps the right side of the equation into [0, 1], and finally chose the logistic function:

y = 1/(1 + e^(-x))

This is an S-shaped function with range (0, 1); it can map any value into (0, 1), and it is infinitely differentiable, among other nice mathematical properties.
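A minimal sketch of the logistic function and its inverse, the logit transformation discussed below:

```python
import math

def logistic(x):
    # the S-shaped function y = 1 / (1 + e^(-x)); its output is always in (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def logit(y):
    # the inverse: ln(y / (1 - y)), mapping (0, 1) back onto the whole real line
    return math.log(y / (1.0 - y))

print(logistic(0))                     # 0.5
print(round(logit(logistic(2.0)), 6))  # 2.0 (round-trip recovers the input)
```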

We rewrite the linear regression equation to:

y = 1/(1 + e^(-z)),

where z = β0 + β1*x1 + β2*x2 + β3*x3 + ... + βn*xn

The values on both sides of the equation are between 0 and 1.

With further mathematical transformation, this can be written as:

ln(y/(1-y)) = β0 + β1*x1 + β2*x2 + β3*x3 + ... + βn*xn

ln(y/(1-y)) is called the logit transformation. If we view y as the probability P(y=1) that y takes the value 1, then 1-y is the probability P(y=0) that y takes the value 0, and the above can be rewritten as:

P(y=1) = e^z/(1+e^z),

P(y=0) = 1/(1+e^z),

where z = β0 + β1*x1 + β2*x2 + β3*x3 + ... + βn*xn.

The coefficients β can then be estimated with the maximum likelihood method.

2. Review of odds and OR

Odds: the ratio of the probability that an event occurs to the probability that it does not occur. Using p for the probability that the event occurs: odds = p/(1-p).

OR (odds ratio): the odds of the event in the experimental group (odds1) divided by the odds in the control group (odds2).
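The two conversions between probability and odds can be sketched directly:

```python
def odds(p):
    # odds = p / (1 - p); e.g. probability 0.8 gives odds of 4 (4-to-1 in favor)
    return p / (1.0 - p)

def prob(o):
    # inverse conversion: p = odds / (1 + odds)
    return o / (1.0 + o)

print(odds(0.5))  # 1.0
print(prob(4.0))  # 0.8
```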

3. Interpreting logistic regression results

We use an example to illustrate. The dataset contains 200 students, with 1 dependent variable and 4 independent variables:

Dependent variable: hon, indicating whether the student is in the honors class; 1 = yes, 0 = no.

Independent variables:

female: gender, a categorical variable; 1 = female, 0 = male

read: reading score, a continuous variable

write: writing score, a continuous variable

math: math score, a continuous variable

1. A logistic regression containing no variables

First, fit a logistic regression that contains no variables.

The model is ln(p/(1-p)) = β0

The regression results are as follows (the results are edited):

Hon          Coefficient β    Standard error    P
Intercept    -1.12546         0.164             0.000

The coefficient β here is the β0 in the model: β0 = -1.12546.

We use p to denote the probability that a student is in the honors class, so ln(p/(1-p)) = β0 = -1.12546.

Solving the equation gives p = 0.245.

odds = p/(1-p) = 0.3245

What does p mean here? p is the proportion of hon = 1 in the whole dataset.

Let's look at the distribution of hon:

Hon    Number of cases    Percentage
0      151                75.5%
1      49                 24.5%

The probability p that hon takes the value 1 is 49/(151+49) = 24.5% = 0.245. We can manually calculate ln(p/(1-p)) = -1.12546, equal to the coefficient β0. This gives the relationship:

β0 = ln(odds).
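We can verify β0 = ln(odds) directly from the counts in the table above:

```python
import math

# Counts from the table: 49 students with hon = 1, 151 with hon = 0.
p = 49 / (49 + 151)      # 0.245
odds = p / (1 - p)       # = 49/151, approximately 0.3245
beta0 = math.log(odds)
print(round(beta0, 4))   # -1.1255, matching the fitted intercept -1.12546
```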

2. A model containing a binary categorical independent variable

Fit a logistic regression containing the binary categorical independent variable female.

The model is ln(p/(1-p)) = β0 + β1*female.

The regression results are as follows (the results are edited):

Hon          Coefficient β    Standard error    P
female       0.593            0.3414294         0.083
Intercept    -1.47            0.2689555         0.000

Before interpreting this result, take a look at the crosstab of hon and female:

Hon      Male    Female    Total
0        74      77        151
1        17      32        49
Total    91      109       200

According to this crosstab, for men (male), the probability of being in the honors class is 17/91 and the probability of being in a non-honors class is 74/91, so the odds of a man being in the honors class are odds1 = (17/91)/(74/91) = 17/74 = 0.23. The odds of a woman being in the honors class are odds2 = (32/109)/(77/109) = 32/77 = 0.42. The odds ratio of females to males is OR = odds2/odds1 = 0.42/0.23 = 1.809. We can say the odds of a woman being in the honors class are 80.9% higher than a man's.

Return to the logistic regression results. The intercept coefficient -1.47 is the logarithm of the male odds (since female = 0 denotes males, the control group): ln(0.23) = -1.47. The coefficient of the variable female, 0.593, is the logarithm of the female-to-male OR: ln(1.809) = 0.593. So we obtain the relationship OR = exp(β), or β = ln(OR) (exp(x) is the exponential function, e raised to the power x).
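Both coefficients can be recovered from the crosstab counts above:

```python
import math

# Crosstab counts: 17 of 91 men and 32 of 109 women are in the honors class.
odds_male = 17 / 74      # men's odds of being in the honors class
odds_female = 32 / 77    # women's odds
OR = odds_female / odds_male

print(round(math.log(odds_male), 2))  # -1.47, the fitted intercept
print(round(OR, 3))                   # 1.809
print(round(math.log(OR), 3))         # 0.593, the fitted female coefficient
```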

3. A model containing a continuous variable

Fit a logistic regression containing the continuous variable math.

The model is ln(p/(1-p)) = β0 + β1*math.

The regression results are as follows (the results are edited):

Hon          Coefficient β    Standard error    P
math         0.1563404        0.0256095         0.000
Intercept    -9.793942        1.481745          0.000

The intercept here is the logarithm of the odds of being in the honors class when the math score is 0. We calculate odds = exp(-9.793942) = 0.00005579, which is very small. Since no student in our data has a math score of 0, this is an extrapolated, hypothetical value.

How do we interpret math's coefficient? According to the fitted model:

ln(p/(1-p)) = -9.793942 + 0.1563404*math

First assume math = 54:

ln(p/(1-p))(math=54) = -9.793942 + 0.1563404*54

Then raise math by one unit so that math = 55:

ln(p/(1-p))(math=55) = -9.793942 + 0.1563404*55

The difference between the two:

ln(p/(1-p))(math=55) - ln(p/(1-p))(math=54) = 0.1563404,

which is exactly the coefficient of the variable math.

Thus we can say that the logarithm of the odds (that is, ln(p/(1-p))) of being in the honors class increases by 0.1563404 for each 1-unit increase in math.

So how much do the odds themselves increase? By the rules of logarithms:

ln(p/(1-p))(math=55) - ln(p/(1-p))(math=54) = ln((p/(1-p))(math=55) / (p/(1-p))(math=54)) = ln(odds(math=55)/odds(math=54)) = 0.1563404.

So:

odds(math=55)/odds(math=54) = exp(0.1563404) = 1.169.

So we can say that each 1-unit increase in math increases the odds by 16.9%, and this is independent of the absolute value of math.

The attentive reader will have noticed that odds(math=55)/odds(math=54) is precisely an OR!
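A quick numeric check of this multiplicative interpretation:

```python
import math

beta_math = 0.1563404   # fitted coefficient of math
multiplier = math.exp(beta_math)
print(round(multiplier, 3))  # 1.169, i.e. about +16.9% odds per math point

# The ratio is the same at any math level, e.g. 54 -> 55 or 90 -> 91:
def odds(m):
    return math.exp(-9.793942 + beta_math * m)

r1 = odds(55) / odds(54)
r2 = odds(91) / odds(90)
print(abs(r1 - r2) < 1e-9)  # True
```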

4. A model with multiple variables (no interaction effect)

Fit a logistic regression containing female, math, and read.

The model is ln(p/(1-p)) = β0 + β1*math + β2*female + β3*read.

The regression results are as follows (the results are edited):

Hon          Coefficient β    Standard error    P
math         0.1229589        (omitted)         0.000
female       0.979948         (omitted)         0.020
read         0.0590632        (omitted)         0.026
Intercept    -11.77025        (omitted)         0.000

The results show that:

(1) Gender: with math and read scores held constant, the odds of a woman (female=1) entering the honors class are exp(0.979948) = 2.66 times those of a man (female=0); in other words, women's odds are 166% higher than men's.

(2) Math score: with female and read held constant, each 1-point increase in math increases the odds of entering the honors class by 13% (since exp(0.1229589) = 1.13).

(3) Read score: interpreted in the same way as math.

5. A model including an interaction effect

Fit a logistic regression containing female, math, and the interaction between the two.

The model is ln(p/(1-p)) = β0 + β1*female + β2*math + β3*female*math.

An interaction effect means that the effect of one variable on the outcome differs depending on the value of another variable.

The regression results are as follows (the results are edited):

Hon            Coefficient β    Standard error    P
female         -2.899863        (omitted)         0.349
math           0.1293781        (omitted)         0.000
female*math    0.0669951        (omitted)         0.210
Intercept      -8.745841        (omitted)         0.000

Note: the p-value for the female*math term is 0.21, so we could conclude there is no interaction effect. But since we want to illustrate interaction effects here, we temporarily ignore the p-value and treat the interaction as real.

Because of the interaction, we cannot describe the effect of female while "holding math and female*math constant", because when female changes, female*math cannot stay constant!

For this simple case, we can write out two separate equations.

For men (female=0):

ln(p/(1-p)) = β0 + β2*math.

For women (female=1):

ln(p/(1-p)) = (β0 + β1) + (β2 + β3)*math.

Then explain them separately.
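Plugging the fitted coefficients from the table above into these two equations gives the group-specific intercepts and slopes (a numeric sketch, not output from a statistics package):

```python
# Fitted coefficients from the regression table above.
b0 = -8.745841   # intercept
b1 = -2.899863   # female
b2 = 0.1293781   # math
b3 = 0.0669951   # female*math

# men (female = 0):   ln(p/(1-p)) = b0 + b2*math
# women (female = 1): ln(p/(1-p)) = (b0 + b1) + (b2 + b3)*math
slope_male, intercept_male = b2, b0
slope_female, intercept_female = b2 + b3, b0 + b1

print(round(slope_female, 7))      # 0.1963732: math has a steeper effect for women
print(round(intercept_female, 6))  # -11.645704
```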
