The principle and implementation of the logistic regression algorithm (LR)

Source: Internet
Author: User
Tags: natural logarithm

Logistic regression, although called "regression", is a classification learning method. It has two main use cases: making predictions, and identifying the factors that affect the dependent variable. Logistic regression (LR), also known as logistic regression analysis, is a classification and prediction algorithm: it predicts the probability of a future outcome from the patterns in historical data. For example, we can set the probability of a purchase as the dependent variable and set the user's characteristic attributes, such as gender, age, and registration time, as independent variables, then predict the probability of purchase from those attributes. Logistic regression has much in common with regression analysis, so we review regression analysis before introducing logistic regression.

Regression analysis describes the relationship between an independent variable x and a dependent variable y, or the degree to which x affects y, and predicts y. The dependent variable is the result we want to obtain; the independent variables are the underlying factors that affect that result, and there may be one or several of them. Regression with one independent variable is called univariate regression analysis; regression with more than one is called multivariate regression analysis.

Below is a set of advertising cost and exposure data, where each cost corresponds to a number of exposures. The number of exposures is what we want to predict, and cost is the factor that affects it, so we set cost as the independent variable x and the number of exposures as the dependent variable y. Using a simple linear regression equation and its coefficient of determination, we can find how cost (x) affects exposures (y).

The following is the simple linear regression model, where y is the dependent variable and x is the independent variable. We only need the intercept b0 and the slope b1 to describe the relationship between cost and exposures and to predict the number of exposures. Here we use the least squares method to calculate b0 and b1; least squares finds the best-fitting function by minimizing the sum of squared errors.
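The least-squares calculation described here can be sketched in a few lines of Python; the data below is made up for illustration, not the article's advertising figures:

```python
def least_squares(xs, ys):
    # Fit y = b0 + b1*x by minimizing the sum of squared errors.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # b1 = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x   # the fitted line passes through (mean_x, mean_y)
    return b0, b1

# made-up data: cost (x) and exposures (y), here exactly y = 1 + 2x
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
b0, b1 = least_squares(xs, ys)
print(b0, b1)  # 1.0 2.0
```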

The following table shows the key computations for fitting the regression equation with least squares. The two leftmost columns are the independent variable x and the dependent variable y. We first calculate the means of the independent and dependent variables, then the difference between each observation and its mean, which is the data needed to calculate the slope b1 of the regression equation.

According to the data in the table, the slope b1 of the regression equation is calculated by the formula below. The slope expresses the relationship between the independent and dependent variables: a positive slope means they are positively correlated, a negative slope means they are negatively correlated, and a slope of 0 means they are unrelated.

After the slope b1 is obtained, the y-axis intercept b0 can be calculated with the following formula.

Substituting the slope b1 and intercept b0 into the regression equation gives the relationship between the independent and dependent variables: for each 1-yuan increase in cost, the number of exposures increases by 7,437. The regression equation and its illustration follow.

In the diagram of the regression equation there is also an R², called the coefficient of determination, which measures how well the regression equation fits the sample data. The coefficient of determination lies between 0 and 1; the larger the value, the better the fit, in other words, the more of the dependent variable's variation the equation explains. It is based on the decomposition SST = SSR + SSE, where SST is the total sum of squares, SSR is the regression sum of squares, and SSE is the sum of squared errors; R² = SSR/SST. The following table shows the calculations needed for these three quantities.

Based on the regression sum of squares (SSR) and the total sum of squares (SST), the coefficient of determination here is 0.94344.
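The decomposition SST = SSR + SSE and the resulting R² can be sketched as follows, again with toy data rather than the article's figures:

```python
def r_squared(ys, y_hats):
    # SST = SSR + SSE; the coefficient of determination is R^2 = SSR / SST
    mean_y = sum(ys) / len(ys)
    sst = sum((y - mean_y) ** 2 for y in ys)               # total sum of squares
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))  # sum of squared errors
    ssr = sst - sse                                        # regression sum of squares
    return ssr / sst

ys = [3, 5, 7, 9, 11]
y_hats = [3, 5, 7, 9, 10]   # last prediction is off by one
print(r_squared(ys, y_hats))  # 0.975
```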

The above is the calculation process of the regression equation; with it we can forecast the number of exposures from the cost. Compared with linear regression, logistic regression adds a logistic function on top of the linear model. For example, a user's attributes and characteristics determine whether the user will eventually make a purchase: the probability of purchase is the dependent variable y, and the user's attributes and characteristics are the independent variables x. The greater the value of y, the more likely the user is to buy. Here we use the odds of the event to express the ratio of purchasing to not purchasing.

Using E for the purchase event, P(E) is the probability of purchase, P(E') is the probability of not purchasing, and Odds(E) is the odds of event E (purchase) occurring.

Odds range from 0 to infinity; the greater the odds, the more likely the event is to occur. Now we want to convert the odds into a probability function bounded between 0 and 1. First, take the natural logarithm of the odds to obtain the logit equation; the logit takes values from negative infinity to positive infinity.

Based on the logit equation above, the following formula is obtained:

Here π replaces P(E) in the formula, π = P(E). From the exponential function and the rules of logarithms, the following formula is obtained:

And finally we get the logistic regression equation:

The following uses the logistic regression equation to estimate the probability of a user's purchase. The table below contains each user's number of registered days and whether they purchased, where registered days is the independent variable x and whether the user purchased is the dependent variable y. We mark a purchase as 1 and a non-purchase as 0.

Next we will calculate the slope and intercept of the logistic regression equation in Excel in 8 steps, and then use the equation to predict whether new users will buy.

    • The first step is to use Excel's sort function to order the original data by the dependent variable y, separating purchased from non-purchased records and making the data's structure easier to see.
    • The second step is to preset values for the slope b1 and the intercept b0 of the logit equation; here we preset both to 0.1. Excel will later replace these with the optimal solution.
    • The third step is to calculate the L value from the preset slope and intercept according to the logit equation.

    • The fourth step is to exponentiate the L value, that is, to compute e^L.
    • The fifth step is to compute P(x) = e^L/(1 + e^L), the probability that the event (a purchase) occurs.
    • See the detailed calculation steps and procedures.

    • The sixth step is to calculate the log-likelihood value for each observation: ln(P) if the user purchased, ln(1 - P) if not. See methods and procedures.
    • The seventh step is to sum the log-likelihood values.

    • The eighth step is to use Excel's Solver to maximize the summed log-likelihood: set the summed log-likelihood cell as the objective to maximize, set the preset slope b1 and intercept b0 as the variable cells, uncheck the "make unconstrained variables non-negative" option, and solve.

Excel automatically finds the optimal slope and intercept of the logistic regression equation, as shown in the results.
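For readers outside Excel, the eight-step procedure can be sketched in Python, with plain gradient ascent standing in for Excel's Solver; the data below is made up for illustration, not the article's:

```python
import math

def sigmoid(z):
    # numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_likelihood(b0, b1, xs, ys):
    # steps 3-7: logit L, probability P, per-row log-likelihood, then the sum
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def fit(xs, ys, lr=1e-4, steps=30000):
    # step 8, with gradient ascent replacing Excel's Solver
    b0, b1 = 0.1, 0.1  # the same starting guesses used in the text
    for _ in range(steps):
        g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys))
        g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys))
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

# made-up data: registration days (x) and purchased (1) or not (0)
xs = [5, 10, 20, 30, 45, 60, 80, 100]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit(xs, ys)
```

The log-likelihood is concave, so any reasonable optimizer (Solver, gradient ascent, Newton's method) converges to the same coefficients.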

After obtaining the slope and intercept of the logistic regression equation, we can substitute them into the equation to obtain a model of registration days versus purchase probability, and use it to predict the purchase probability (y) of users with different numbers of registered days (x). The calculation process follows.

    • The first step is to enter the value of the independent variable, registration days (x); here we enter 50 days.
    • The second step is to substitute the x value, the slope, and the intercept into the logit equation to find the L value.
    • The third step is to exponentiate the L value, computing e^L.
    • The fourth step is to compute P(x), the probability that the event occurs.

The probability that a user registered for 50 days makes a purchase is approximately 17.6%.
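The four prediction steps look like this in Python; the slope and intercept below are illustrative stand-ins, since the article's fitted values appear only in its Excel screenshots:

```python
import math

# Illustrative slope and intercept: hypothetical stand-ins, not the
# article's fitted coefficients.
b0, b1 = -3.0, 0.03

x = 50                   # step 1: registration days
L = b0 + b1 * x          # step 2: logit value L
eL = math.exp(L)         # step 3: e^L
p = eL / (1.0 + eL)      # step 4: purchase probability P(x)
print(round(p, 3))       # 0.182 with these stand-in coefficients
```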

Substituting every value of registered days into the purchase-probability model gives a curve of how registration days affect purchase probability. The curve shows that for users with few or many registered days, the purchase probability is relatively stable, while in the middle range it changes sharply.

We now add a new independent variable, "age", to the above calculation. The following is the raw data; there are now two independent variables, age and registration days, and one dependent variable.

Following the earlier method, we compute the optimal slope and intercept, obtain the logistic regression equation, and substitute different ages and registration days into it to get a prediction model based on the user's age and registered days. We use an Excel three-dimensional chart to draw the effect of age and registration days on purchase probability.

As can be seen, the purchase probability increases with the number of registered days, and for the same number of registered days, younger users have a relatively higher purchase probability.

Reproduced from: http://bluewhale.cc/2016-05-18/logistic-regression.html#ixzz4RbUh8R3T

1. From linear regression to logistic regression

Both linear regression and logistic regression are special cases of generalized linear models.

Suppose there is a dependent variable y and a set of independent variables x1, x2, x3, ..., xn, where y is a continuous variable; we can fit a linear equation:

y = β0 + β1*x1 + β2*x2 + β3*x3 + ... + βn*xn

The values of each β coefficient are estimated by the least squares method.

If y is a binary categorical variable that can only take the value 0 or 1, the linear regression equation runs into difficulty: the right side of the equation is a continuous value ranging from negative infinity to positive infinity, while the left side can only take values in [0, 1], so the two cannot correspond. To keep using the idea of linear regression, statisticians looked for a transformation that maps the right side of the equation into [0, 1], and finally chose the logistic function:

y = 1/(1 + e^(-x))

This is an S-shaped function with range (0, 1); it can map any value into (0, 1), and it is infinitely differentiable, among other nice mathematical properties.
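A minimal sketch of the logistic function and its inverse, the logit transformation discussed below:

```python
import math

def logistic(x):
    # the S-shaped function y = 1 / (1 + e^(-x)); its output is always in (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def logit(y):
    # the inverse: ln(y / (1 - y)), mapping (0, 1) back onto the whole real line
    return math.log(y / (1.0 - y))

print(logistic(0))                     # 0.5
print(round(logit(logistic(2.0)), 6))  # 2.0 (round-trip recovers the input)
```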

We rewrite the linear regression equation to:

y = 1/(1 + e^(-z)),

where z = β0 + β1*x1 + β2*x2 + β3*x3 + ... + βn*xn

The values on both sides of the equation are between 0 and 1.

With further mathematical transformation, this can be written as:

ln(y/(1-y)) = β0 + β1*x1 + β2*x2 + β3*x3 + ... + βn*xn

ln(y/(1-y)) is called the logit transformation. If we view y as the probability P(y=1) that y takes the value 1, then 1-y is the probability P(y=0) that y takes the value 0, and the above can be rewritten as:

P(y=1) = e^z/(1+e^z),

P(y=0) = 1/(1+e^z),

where z = β0 + β1*x1 + β2*x2 + β3*x3 + ... + βn*xn.

The coefficients β can then be estimated with the maximum likelihood method.

2. Review of odds and OR

Odds: the ratio of the probability that an event occurs to the probability that it does not occur. Using p for the probability that the event occurs: odds = p/(1-p).

OR (odds ratio): the odds of the event in the experimental group (odds1) divided by the odds in the control group (odds2).
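The two conversions between probability and odds can be sketched directly:

```python
def odds(p):
    # odds = p / (1 - p); e.g. probability 0.8 gives odds of 4 (4-to-1 in favor)
    return p / (1.0 - p)

def prob(o):
    # inverse conversion: p = odds / (1 + odds)
    return o / (1.0 + o)

print(odds(0.5))  # 1.0
print(prob(4.0))  # 0.8
```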

3. Interpreting logistic regression results

We use an example to illustrate. The dataset contains 200 students, with 1 dependent variable and 4 independent variables:

Dependent variable: hon, indicating whether the student is in the honors class; 1 = yes, 0 = no.

Independent variables:

female: gender, a categorical variable; 1 = female, 0 = male

read: reading score, a continuous variable

write: writing score, a continuous variable

math: math score, a continuous variable

1. A logistic regression containing no variables

First, fit a logistic regression that contains no variables.

The model is ln(p/(1-p)) = β0

The regression results are as follows (the results are edited):

Hon          Coefficient β    Standard error    P
Intercept    -1.12546         0.164             0.000

The coefficient β here is the β0 in the model: β0 = -1.12546.

We use p to denote the probability that a student is in the honors class, so ln(p/(1-p)) = β0 = -1.12546.

Solving the equation gives p = 0.245.

odds = p/(1-p) = 0.3245

What does p mean here? p is the proportion of hon = 1 in the whole dataset.

Let's look at the distribution of hon:

Hon    Number of cases    Percentage
0      151                75.5%
1      49                 24.5%

The probability p that hon takes the value 1 is 49/(151+49) = 24.5% = 0.245. We can manually calculate ln(p/(1-p)) = -1.12546, equal to the coefficient β0. This gives the relationship:

β0 = ln(odds).
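We can verify β0 = ln(odds) directly from the counts in the table above:

```python
import math

# Counts from the table: 49 students with hon = 1, 151 with hon = 0.
p = 49 / (49 + 151)      # 0.245
odds = p / (1 - p)       # = 49/151, approximately 0.3245
beta0 = math.log(odds)
print(round(beta0, 4))   # -1.1255, matching the fitted intercept -1.12546
```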

2. A model containing a binary categorical independent variable

Fit a logistic regression containing the binary categorical independent variable female.

The model is ln(p/(1-p)) = β0 + β1*female.

The regression results are as follows (the results are edited):

Hon          Coefficient β    Standard error    P
female       0.593            0.3414294         0.083
Intercept    -1.47            0.2689555         0.000

Before interpreting this result, take a look at the crosstab of hon and female:

Hon      Male    Female    Total
0        74      77        151
1        17      32        49
Total    91      109       200

According to this crosstab, for men (male), the probability of being in the honors class is 17/91 and the probability of being in a non-honors class is 74/91, so the odds of a man being in the honors class are odds1 = (17/91)/(74/91) = 17/74 = 0.23. The odds of a woman being in the honors class are odds2 = (32/109)/(77/109) = 32/77 = 0.42. The odds ratio of females to males is OR = odds2/odds1 = 0.42/0.23 = 1.809. We can say the odds of a woman being in the honors class are 80.9% higher than a man's.

Return to the logistic regression results. The intercept coefficient -1.47 is the logarithm of the male odds (since female = 0 denotes males, the control group): ln(0.23) = -1.47. The coefficient of the variable female, 0.593, is the logarithm of the female-to-male OR: ln(1.809) = 0.593. So we obtain the relationship OR = exp(β), or β = ln(OR) (exp(x) is the exponential function, e raised to the power x).
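Both coefficients can be recovered from the crosstab counts above:

```python
import math

# Crosstab counts: 17 of 91 men and 32 of 109 women are in the honors class.
odds_male = 17 / 74      # men's odds of being in the honors class
odds_female = 32 / 77    # women's odds
OR = odds_female / odds_male

print(round(math.log(odds_male), 2))  # -1.47, the fitted intercept
print(round(OR, 3))                   # 1.809
print(round(math.log(OR), 3))         # 0.593, the fitted female coefficient
```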

3. A model containing a continuous variable

Fit a logistic regression containing the continuous variable math.

The model is ln(p/(1-p)) = β0 + β1*math.

The regression results are as follows (the results are edited):

Hon          Coefficient β    Standard error    P
math         0.1563404        0.0256095         0.000
Intercept    -9.793942        1.481745          0.000

The intercept here is the logarithm of the odds of being in the honors class when the math score is 0. We calculate odds = exp(-9.793942) = 0.00005579, which is very small. Since no student in our data has a math score of 0, this is an extrapolated, hypothetical value.

How do we interpret math's coefficient? According to the fitted model:

ln(p/(1-p)) = -9.793942 + 0.1563404*math

First assume math = 54:

ln(p/(1-p))(math=54) = -9.793942 + 0.1563404*54

Then raise math by one unit so that math = 55:

ln(p/(1-p))(math=55) = -9.793942 + 0.1563404*55

The difference between the two:

ln(p/(1-p))(math=55) - ln(p/(1-p))(math=54) = 0.1563404,

which is exactly the coefficient of the variable math.

Thus we can say that the logarithm of the odds (that is, ln(p/(1-p))) of being in the honors class increases by 0.1563404 for each 1-unit increase in math.

So how much do the odds themselves increase? By the rules of logarithms:

ln(p/(1-p))(math=55) - ln(p/(1-p))(math=54) = ln((p/(1-p))(math=55) / (p/(1-p))(math=54)) = ln(odds(math=55)/odds(math=54)) = 0.1563404.

So:

odds(math=55)/odds(math=54) = exp(0.1563404) = 1.169.

So we can say that each 1-unit increase in math increases the odds by 16.9%, and this is independent of the absolute value of math.

The attentive reader will have noticed that odds(math=55)/odds(math=54) is precisely an OR!
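A quick numeric check of this multiplicative interpretation:

```python
import math

beta_math = 0.1563404   # fitted coefficient of math
multiplier = math.exp(beta_math)
print(round(multiplier, 3))  # 1.169, i.e. about +16.9% odds per math point

# The ratio is the same at any math level, e.g. 54 -> 55 or 90 -> 91:
def odds(m):
    return math.exp(-9.793942 + beta_math * m)

r1 = odds(55) / odds(54)
r2 = odds(91) / odds(90)
print(abs(r1 - r2) < 1e-9)  # True
```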

4. A model with multiple variables (no interaction effect)

Fit a logistic regression containing female, math, and read.

The model is ln(p/(1-p)) = β0 + β1*math + β2*female + β3*read.

The regression results are as follows (the results are edited):

Hon          Coefficient β    Standard error    P
math         0.1229589        (omitted)         0.000
female       0.979948         (omitted)         0.020
read         0.0590632        (omitted)         0.026
Intercept    -11.77025        (omitted)         0.000

The results show that:

(1) Gender: with math and read scores held constant, the odds of a woman (female=1) entering the honors class are exp(0.979948) = 2.66 times those of a man (female=0); in other words, women's odds are 166% higher than men's.

(2) Math score: with female and read held constant, each 1-point increase in math increases the odds of entering the honors class by 13% (since exp(0.1229589) = 1.13).

(3) Read score: interpreted in the same way as math.

5. A model including an interaction effect

Fit a logistic regression containing female, math, and the interaction between the two.

The model is ln(p/(1-p)) = β0 + β1*female + β2*math + β3*female*math.

An interaction effect means that the effect of one variable on the outcome differs depending on the value of another variable.

The regression results are as follows (the results are edited):

Hon            Coefficient β    Standard error    P
female         -2.899863        (omitted)         0.349
math           0.1293781        (omitted)         0.000
female*math    0.0669951        (omitted)         0.210
Intercept      -8.745841        (omitted)         0.000

Note: the p-value for the female*math term is 0.21, so we could conclude there is no interaction effect. But since we want to illustrate interaction effects here, we temporarily ignore the p-value and treat the interaction as real.

Because of the interaction, we cannot describe the effect of female while "holding math and female*math constant", because when female changes, female*math cannot stay constant!

For this simple case, we can write out two separate equations.

For men (female=0):

ln(p/(1-p)) = β0 + β2*math.

For women (female=1):

ln(p/(1-p)) = (β0 + β1) + (β2 + β3)*math.

Then explain them separately.
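Plugging the fitted coefficients from the table above into these two equations gives the group-specific intercepts and slopes (a numeric sketch, not output from a statistics package):

```python
# Fitted coefficients from the regression table above.
b0 = -8.745841   # intercept
b1 = -2.899863   # female
b2 = 0.1293781   # math
b3 = 0.0669951   # female*math

# men (female = 0):   ln(p/(1-p)) = b0 + b2*math
# women (female = 1): ln(p/(1-p)) = (b0 + b1) + (b2 + b3)*math
slope_male, intercept_male = b2, b0
slope_female, intercept_female = b2 + b3, b0 + b1

print(round(slope_female, 7))      # 0.1963732: math has a steeper effect for women
print(round(intercept_female, 6))  # -11.645704
```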
