Logistic Regression and Its Relationship to Other Models | Machine Learning

Analysis of "Machine Learning Algorithm Series II": Logistic Regression. Published 2016-01-09 in Project Experience.

"Reprint please indicate the source" https://chenrudan.github.io/blog/2016/01/09/logisticregression.html

This article was inspired by Rickjin. It discusses some aspects of logistic regression; although excellent treatments already exist, I still wanted to write a summary of my own. The more material I gathered, the more I felt that LR is truly profound and contains a great deal of theory; this article can only touch plainly on some aspects. The contents are as follows:

1. Origin
2. Model Introduction and Formula Derivation
 2.1 Logistic Distribution
 2.2 Binomial Logistic Regression Model
3. Solution
 3.1 Gradient Descent
 3.2 Newton's Method
 3.3 BFGS
4. Regularization
 4.1 Overfitting
 4.2 Two Regularization Methods
5. The Relationship Between Logistic Regression and Other Models
 5.1 Logistic Regression and Linear Regression
 5.2 Logistic Regression and Maximum Entropy
 5.3 Logistic Regression and SVM
 5.4 Logistic Regression and Naive Bayes
 5.5 Logistic Regression and the Energy Function
6. Parallelization
7. Summary
8. References

1. Origin

The origin of logistic regression unfolded in several stages: first the word "logistic" itself, then the discovery of the logistic function, then the derivation of the logit function, and finally the name "logistic regression". These steps were the joint work of many researchers, though in the long river of history many of them have been gradually forgotten.

Logistic originated in the study of population growth. The key figure was Pierre François Verhulst, who proposed a population-growth formula in 1838. (He was Belgian and wrote in French, of which I cannot read a word; what follows comes from a book on the history of population-growth research [1].) Verhulst graduated from the mathematics department of the University of Ghent and was a professor of mathematics and a demographer. In 1835 Verhulst's compatriot Adolphe Quetelet published an article on population growth arguing that a population cannot grow geometrically (exponentially) forever, but is held back by a resistance proportional to the square of the growth rate. That paper offered only conjecture, with no mathematical foundation, but it greatly inspired Verhulst. So in 1838 Verhulst published a paper on population growth in which he derived the logistic equation. The paper makes an important point: over time, the size of a country (which I understand as its resources) and the nation's fertility limit population growth, so the population gradually approaches a stable value. Crucially, he captured this process in a formula, starting from the rate of population growth, i.e. the derivative of the population P(t) with respect to time t:

$$\frac{\partial P}{\partial t} = rP\left(1 - \frac{P}{K}\right)$$

where $K$ is the level at which he believed the population stabilizes. When $P(t)$ is much smaller than $K$, the term $P/K$ is approximately 0 and the equation reduces to $\frac{\partial P}{\partial t} \simeq rP$: the growth rate is proportional to the current population times a constant, and keeps increasing. Solving this first-order linear differential equation gives $P(t) \simeq P(0)e^{rt}$. When $P(t)$ approaches $K$, the growth rate begins to shrink; solving in that regime (the second-order equation reduced to a first-order solution) and combining the two pieces gives the initial form:

$$P(t) = \frac{P(0)e^{rt}}{1 + P(0)(e^{rt} - 1)/K}$$

He compared the formula against just over ten years of actual population data from France and Britain and found a good fit. He did not have data for more years; Figure 1 below, compiled after his death, shows 300 years of population growth, and you can see how beautifully it fits the trend of the cumulative distribution function of the logistic distribution. The formula had no name until 1845, when he published another important article [2] in which he gave it one: "logistic". In that article he also found (by analyzing the second derivative; details omitted here) that the growth rate is greatest when $P(t) = K/2$. This growth trend resembles the probability density function of the logistic distribution.

Figure 1: Population growth in Belgium (figure source: [2])

In the following decades, however, people were unaware of the importance of this work, and many studied the growth curve independently, until 1922, when a demographer named Raymond Pearl noticed that Verhulst had already presented the phenomenon and the formula in 1838. Pearl also used the name "logistic function" in his own articles, and the name has stuck to this day. In a 1920 study of population growth in the United States, Pearl [3] gave another way of expressing the logistic function:

$$Y = \frac{be^{ax}}{1 + ce^{ax}}$$

Building on this expression, Joseph Berkson proposed the logit function in 1944: $\mathrm{logit}(q) = \ln\frac{1-q}{q}$. If $q = \frac{1}{1+e^{a-bx}}$, the result is $\mathrm{logit} = a - bx$.

Later, in 1958, David Cox presented logistic regression [4]. His article addresses the following problem: given a set of observations taking values in {0, 1}, whose values $y_i$ depend on some independent variables $x_i$, the probability corresponding to $y_i = 1$ is $\theta_i = \Pr(y_i = 1)$. Because $\theta_i$ is confined to [0, 1], he assumed that the relationship between $\theta_i$ and $x_i$ follows the logit function, i.e. $\mathrm{logit}\,\theta_i \equiv \log\frac{\theta_i}{1-\theta_i} = \alpha + \beta x_i$. The article mainly analyzes how to solve for the parameter $\beta$, which I will not cover here. Because the logistic function is used, and the problem itself is a regression problem (establishing the relationship between observations and independent variables), the method came to be called logistic regression.

It seems Cox did not deliberately set out to propose "logistic regression" in that article, but the term does first appear there; although plenty of research had been done before Cox, nobody had given it a name, so Cox became the one credited with putting logistic regression forward. The story teaches us a lesson: whether you publish a paper or release software, give it a concise, nice, catchy name...

The above are a few representative episodes in the historical development of logistic regression (I think... there are still many papers I have not had time to read...); J. S. Cramer [5] gives a more detailed account in his article. Logistic regression grew out of mathematicians' study of the laws of population growth, was later applied to the study of microbial growth, then to economic problems, and today exists as a very important algorithm in all walks of life. As a branch of regression analysis, logistic regression drew inspiration from many related techniques; for example, Berkson proposed the logit function by analogy with the probit function. Its story from origin to application alone could fill a book; no wonder Rickjin says LR is actually very, very complex... 2. Model Introduction and Formula Derivation

Having covered the origin of logistic regression, we now discuss the complete model: first what the logistic distribution is, and then how logistic regression follows from it. 2.1 Logistic Distribution

A random variable $X$ obeys the logistic distribution when its cumulative distribution function takes the form mentioned above. Differentiating the distribution function gives the probability density function. The formulas are as follows; for the effect of the parameters see Figure 2 (the graph is from Wikipedia; its parameter s is the $\gamma$ of Statistical Learning Methods).

$$F(x) = P(X \leqslant x) = \frac{1}{1 + e^{-(x-\mu)/\gamma}}$$

$$f(x) = F'(x) = \frac{e^{-(x-\mu)/\gamma}}{\gamma\left(1 + e^{-(x-\mu)/\gamma}\right)^2}$$
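As a quick sanity check on these two formulas, here is a minimal sketch in Python (the function names are my own; only the standard library is used):

```python
import math

def logistic_cdf(x, mu=0.0, gamma=1.0):
    # F(x) = 1 / (1 + e^{-(x - mu)/gamma})
    return 1.0 / (1.0 + math.exp(-(x - mu) / gamma))

def logistic_pdf(x, mu=0.0, gamma=1.0):
    # f(x) = F'(x) = e^{-(x - mu)/gamma} / (gamma * (1 + e^{-(x - mu)/gamma})^2)
    z = math.exp(-(x - mu) / gamma)
    return z / (gamma * (1.0 + z) ** 2)

# With mu = 0, gamma = 1 the CDF is exactly the sigmoid used in deep learning.
def sigmoid(t):
    return logistic_cdf(t, mu=0.0, gamma=1.0)
```

At the center $x = \mu$ the CDF equals 0.5, and for $\gamma = 1$ the density peaks there at 0.25.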

Figure 2: Effect of different parameters on the logistic distribution (figure source: Wikipedia)

You can see that $\mu$ shifts the center of symmetry, while a smaller $\gamma$ makes the curve rise faster around the center. The sigmoid function, the nonlinear transformation commonly used in deep learning, is the special case of the logistic distribution with $\gamma = 1$, $\mu = 0$. 2.2 Binomial Logistic Regression Model

Figure 3 Data sample

Logistic regression solves classification problems: a model is trained on known data and then used to predict which class new data belongs to. As shown in Figure 3, given some data from two classes, the goal is to decide which class the circle belongs to. In other words, the goal of logistic regression is to find a decision boundary good enough to separate the two classes. Suppose such a boundary already exists; in the linearly separable case shown in the figure, it is a linear combination of the input feature vector. Assume the input feature vector is $x \in \mathbb{R}^n$ (in the figure the input is two-dimensional) and $y$ takes values 0 and 1. The decision boundary can then be written as $w_1x_1 + w_2x_2 + b = 0$, and if a sample satisfies $h_w(x) = w_1x_1 + w_2x_2 + b > 0$, we can judge its class to be 1. This procedure is actually a perceptron: it decides the class purely from the sign of the decision function. Logistic regression goes a step further: it seeks a direct relationship between the classification probability $P(y=1)$ and the input vector $x$, and then judges the class by comparing probability values. The logit function from the previous section satisfies this need: we set the output of the decision function $w^Tx + b$ equal to the logarithm of the probability ratio, $\log\frac{P(y=1|x)}{1-P(y=1|x)}$. Solving this equation gives the two class probabilities induced by the input vector $x$:

$$P(y=1|x) = \frac{e^{w \cdot x + b}}{1 + e^{w \cdot x + b}} \tag{1}$$

$$P(y=0|x) = \frac{1}{1 + e^{w \cdot x + b}} \tag{2}$$

where $w$ is called the weight and $b$ the bias, and $w \cdot x + b$ is viewed as a linear function of $x$. We then compare the two probability values; $x$ belongs to the class with the larger probability. Sometimes, for notational convenience, $b$ is folded into $w$, i.e. $w = (w_0, w_1, \ldots, w_n)$ with $w_0 = b$ and $x_0 = 1$. The odds of an event are the ratio of the probability that it occurs to the probability that it does not; in the two-class case, $\frac{P(y=1|x)}{P(y=0|x)} = \frac{P(y=1|x)}{1-P(y=1|x)}$. The logarithm of the odds is exactly the logit function mentioned above: $\mathrm{logit}(P(y=1|x)) = \log\frac{P(y=1|x)}{1-P(y=1|x)} = w \cdot x$. This yields one definition of logistic regression: the model in which the log-odds of the output $y=1$ is a linear function of the input $x$ (Hangyuan Li, Statistical Learning Methods). Studying formula (1) directly gives another characterization: the closer the value of the linear function is to positive infinity, the closer the probability is to 1; the closer it is to negative infinity, the closer the probability is to 0; such a model is the logistic regression model (Hangyuan Li, Statistical Learning Methods). So the idea of logistic regression is to fit a decision boundary (not necessarily linear; it may be polynomial), then establish a probabilistic relationship between the boundary and the classification, yielding the probabilities of the two classes. A great post [6] elaborating this way of thinking about logistic regression is recommended.
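To make formulas (1) and (2) and the odds concrete, here is a small sketch (the weights and the sample are made up for illustration):

```python
import math

def predict_proba(w, b, x):
    # Eq. (1): P(y=1|x) = e^{w.x+b} / (1 + e^{w.x+b})
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return math.exp(z) / (1.0 + math.exp(z))

def log_odds(w, b, x):
    # log of the odds, log(P(y=1|x) / (1 - P(y=1|x))): recovers w.x + b exactly
    p = predict_proba(w, b, x)
    return math.log(p / (1.0 - p))

w, b, x = [1.0, -2.0], 0.5, [0.3, 0.1]   # hypothetical parameters and input
p1 = predict_proba(w, b, x)              # P(y=1|x), eq. (1)
p0 = 1.0 - p1                            # P(y=0|x), eq. (2)
```

Since $w \cdot x + b = 0.6 > 0$ here, `p1 > 0.5` and the sample is assigned class 1, and the log-odds recovers 0.6.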

For the multi-class case, assume $w_1^Tx + b_1 = \log\frac{P(y=1|x)}{P(y=K|x)}$, $w_2^Tx + b_2 = \log\frac{P(y=2|x)}{P(y=K|x)}$, and so on for each of the first $K-1$ classes. One can then derive $P(y=K|x) = \frac{1}{1 + \sum_{k=1}^{K-1} e^{w_k^Tx}}$, $P(y=1|x) = \frac{e^{w_1^Tx}}{1 + \sum_{k=1}^{K-1} e^{w_k^Tx}}$, and so on (with the bias folded into each $w_k$).
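These multi-class probabilities are just the softmax with class K as the reference; a minimal sketch (the bias is assumed folded into each $w_k$, and the function name is my own):

```python
import math

def multiclass_proba(W, x):
    # W holds w_1 .. w_{K-1}; class K is the reference class with w_K = 0.
    # Returns [P(y=1|x), ..., P(y=K-1|x), P(y=K|x)].
    scores = [sum(wj * xj for wj, xj in zip(wk, x)) for wk in W]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / denom for s in scores]
    probs.append(1.0 / denom)  # P(y=K|x) = 1 / (1 + sum_k e^{w_k.x})
    return probs
```

The returned probabilities always sum to 1, and with $K = 2$ the formula collapses back to equations (1) and (2).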

With the classification probabilities above, we can build the likelihood function and determine the model parameters by maximum likelihood estimation. Setting $P(y=1|x) = h_w(x)$, the likelihood function is $\prod_i [h_w(x_i)]^{y_i}[1 - h_w(x_i)]^{1-y_i}$, and the log-likelihood function is

$$L(w) = \sum_{i=1}^{N} \log P(y_i|x_i; w) = \sum_{i=1}^{N} \left[ y_i \log h_w(x_i) + (1-y_i)\log(1-h_w(x_i)) \right] \tag{3}$$
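Equation (3) translates directly into code; a minimal sketch with the bias folded into $w$ (names are my own):

```python
import math

def h(w, x):
    # h_w(x) = P(y=1|x), the sigmoid of the linear function w.x
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(w, X, y):
    # Eq. (3): L(w) = sum_i [ y_i log h_w(x_i) + (1 - y_i) log(1 - h_w(x_i)) ]
    total = 0.0
    for xi, yi in zip(X, y):
        p = h(w, xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total
```

With $w = 0$ every prediction is 0.5, so each sample contributes $\log 0.5$; maximizing $L(w)$ pushes the predicted probabilities toward the labels.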

3. Solution

There are very many methods for optimizing logistic regression [7], with various Python implementations [8]; here I cover only gradient descent, Newton's method, and BFGS. The main goal of optimization is to find a direction such that moving the parameters along it decreases the loss; this direction is usually obtained from some combination of first-order and second-order partial derivatives. The loss function of logistic regression is

$$\min J(w) = \min\; -\frac{1}{m}\left[\sum_{i=1}^{m} y_i \log h_w(x_i) + (1-y_i)\log(1-h_w(x_i))\right] \tag{4}$$

First we take the first- and second-order partial derivatives of $J(w)$ with respect to $w_j$, denoted $g$ and $H$ respectively: $g$ is the gradient vector and $H$ is the Hessian matrix. Here we consider only the contribution of a single example $(x^{(i)}, y^{(i)})$ to the derivative with respect to $w_j$.

$$g_j = \frac{\partial J(w)}{\partial w_j} = -\left[y^{(i)}\frac{1}{h_w(x^{(i)})}h_w(x^{(i)})(1-h_w(x^{(i)}))x_j^{(i)} + (1-y^{(i)})\frac{1}{1-h_w(x^{(i)})}h_w(x^{(i)})(1-h_w(x^{(i)}))(-x_j^{(i)})\right] = \left(h_w(x^{(i)}) - y^{(i)}\right)x_j^{(i)} \tag{5}$$

$$H_{mn} = \frac{\partial^2 J(w)}{\partial w_m \partial w_n} = h_w(x^{(i)})\left(1-h_w(x^{(i)})\right)x_m^{(i)} x_n^{(i)} \tag{6}$$
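The gradient (5) is all the first method below needs. As a preview of section 3.1, here is a minimal gradient-descent training loop (the toy data, learning rate, and stopping rule are my own choices; the per-sample gradient $(h_w(x^{(i)}) - y^{(i)})x_j^{(i)}$ is averaged over the batch):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_gd(X, y, alpha=0.1, eps=1e-6, max_iter=10000):
    # Minimise J(w) of eq. (4): w_j <- w_j - alpha * g_j, with
    # g_j = (1/m) * sum_i (h_w(x_i) - y_i) * x_ij   (eq. (5), averaged)
    m, n = len(X), len(X[0])
    w = [0.0] * n
    for _ in range(max_iter):
        g = [0.0] * n
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(n):
                g[j] += err * xi[j] / m
        w_new = [wj - alpha * gj for wj, gj in zip(w, g)]
        if max(abs(a - c) for a, c in zip(w_new, w)) < eps:  # ||w^{k+1} - w^k|| < eps
            return w_new
        w = w_new
    return w

# toy 1-D data with x_0 = 1 as the bias feature; labels switch around x = 1.5
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 1]
w = train_gd(X, y)
```

On this separable toy set the learned boundary sits between x = 1 and x = 2, so samples at x = 0 get probability below 0.5 and samples at x = 3 above it.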

These methods generally approach the minimum iteratively: a starting point $w^0$ is given, and a threshold $\epsilon$ is needed to decide when the iteration stops. 3.1 Gradient Descent

Gradient descent uses the first derivative of $J(w)$ with respect to $w$ to find the descent direction and updates the parameters iteratively: $w_j^{k+1} = w_j^k - \alpha g_j$, where $k$ is the iteration count. After each parameter update, you can compare $\|J(w^{k+1}) - J(w^k)\|$ or $\|w^{k+1} - w^k\|$ against the threshold $\epsilon$ and stop the iteration once the quantity falls below it. 3.2 Newton's Method

The basic idea of Newton's method is to make a second-order Taylor expansion of $J(w)$ around the current estimate of the minimum and then solve for the next estimate of the minimum point [9]. Assuming $w^k$ is the current estimate, we have

$$\varphi(w) = J(w^k) + J'(w^k)(w - w^k) + \frac{1}{2}J''(w^k)(w - w^k)^2 \tag{7}$$

Setting $\varphi'(w) = 0$ gives $w = w^k - \frac{J'(w^k)}{J''(w^k)}$, so the iterative update is

$$w^{k+1} = w^k - \frac{J'(w^k)}{J''(w^k)} = w^k - H_k^{-1} \cdot g_k \tag{8}$$
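A one-dimensional illustration of update (8), on a function that is twice continuously differentiable (the example function and starting point are my own):

```python
import math

def newton_minimise(dJ, d2J, w0=2.0, eps=1e-10, max_iter=100):
    # Eq. (8): w_{k+1} = w_k - J'(w_k) / J''(w_k); stop when |g_k| < eps
    w = w0
    for _ in range(max_iter):
        g = dJ(w)
        if abs(g) < eps:
            break
        w -= g / d2J(w)
    return w

# J(w) = e^w + e^{-w} has its minimum at w = 0
w_star = newton_minimise(lambda w: math.exp(w) - math.exp(-w),
                         lambda w: math.exp(w) + math.exp(-w))
```

Convergence is quadratic: starting from w = 2 the iterate reaches roughly machine precision within a handful of steps.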

This method also requires a threshold $\epsilon$; iteration stops when $\|g_k\| < \epsilon$. In addition, it requires the objective function to be twice continuously differentiable, and the $J(w)$ used here meets that requirement. 3.3 BFGS

Because Newton's method must compute second-order partial derivatives, the computational cost is relatively heavy, and sometimes the Hessian of the objective function cannot be kept positive definite, so quasi-Newton methods were proposed. "Quasi-Newton" is a general term for algorithms whose goal is to approximate the Hessian matrix (or its inverse) in some way. BFGS is one such method, named after the initials of its four inventors, and is one of the most common methods for solving unconstrained nonlinear optimization problems. Its goal is to approximate the Hessian $H$ iteratively: assuming the approximation is $B_k \approx H_k$, we hope to achieve this by computing $B_{k+1} = B_k + \Delta B_k$. Suppose $\Delta B_k = \alpha uu^T + \beta vv^T$, and impose the secant condition $B_{k+1}\Delta w = \Delta g$, where $\Delta w = w^{k+1} - w^k$ and $\Delta g = g_{k+1} - g_k$ (compare the Newton step in 3.2, where $\Delta w = H^{-1}\Delta g$). Substituting the update for $B_{k+1}$ gives

$$\Delta g = B_k \Delta w + (\alpha u^T \Delta w)u + (\beta v^T \Delta w)v \tag{9}$$

Here, directly set $\alpha u^T\Delta w = 1$, $\beta v^T\Delta w = -1$, $u = \Delta g$ and $v = B_k\Delta w$; this gives $\alpha = \frac{1}{(\Delta g)^T\Delta w}$ and $\beta = -\frac{1}{(\Delta w)^T B_k \Delta w}$, and hence an update formula for $\Delta B_k$:

$$\Delta B_k = \frac{\Delta g (\Delta g)^T}{(\Delta g)^T \Delta w} - \frac{B_k \Delta w (\Delta w)^T B_k}{(\Delta w)^T B_k \Delta w} \tag{10}$$

Transforming (10) with the Sherman-Morrison formula lets us compute $B_{k+1}^{-1}$ directly from $B_k^{-1}$; writing $D_{k+1}$ and $D_k$ for these inverses, the update formula becomes

$$D_{k+1} = \left(I - \frac{\Delta w (\Delta g)^T}{(\Delta g)^T \Delta w}\right) D_k \left(I - \frac{\Delta g (\Delta w)^T}{(\Delta g)^T \Delta w}\right) + \frac{\Delta w (\Delta w)^T}{(\Delta g)^T \Delta w} \tag{11}$$

The process of updating parameters with BFGS is as follows:
1. Determine the search direction: $(\Delta w)_k = -D_k \cdot g_k$
2. Update the parameters: $w^{k+1} = w^k + \lambda(\Delta w)_k$
3. Compute $\Delta g = g_{k+1} - g_k$
4. Compute $D_{k+1}$ from (11)
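These steps can be checked numerically on a toy quadratic objective $J(w) = \frac{1}{2}w^TQw - b^Tw$, whose exact minimiser solves $Qw = b$ (the matrices, the exact line search for $\lambda$, and the helper names below are my own choices for the sketch):

```python
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def mat_vec(M, v): return [dot(row, v) for row in M]
def outer(a, b): return [[x * y for y in b] for x in a]
def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]
def mat_add(A, B, s=1.0):
    return [[A[i][j] + s * B[i][j] for j in range(len(A[0]))] for i in range(len(A))]

def bfgs_D_update(D, dw, dg):
    # Eq. (11) with rho = (dg . dw):
    # D' = (I - dw dg^T / rho) D (I - dg dw^T / rho) + dw dw^T / rho
    n = len(dw)
    rho = dot(dg, dw)
    I = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    A = mat_add(I, outer(dw, dg), -1.0 / rho)
    B = mat_add(I, outer(dg, dw), -1.0 / rho)
    return mat_add(mat_mul(mat_mul(A, D), B), outer(dw, dw), 1.0 / rho)

Q = [[3.0, 1.0], [1.0, 2.0]]          # J(w) = 0.5 w^T Q w - b^T w
b = [1.0, 1.0]
grad = lambda w: [gi - bi for gi, bi in zip(mat_vec(Q, w), b)]

w = [0.0, 0.0]
D = [[1.0, 0.0], [0.0, 1.0]]          # initial guess for the inverse Hessian
for _ in range(20):
    g = grad(w)
    if dot(g, g) < 1e-16:             # gradient numerically zero: converged
        break
    d = [-x for x in mat_vec(D, g)]   # step 1: direction -D_k g_k
    lam = -dot(g, d) / dot(d, mat_vec(Q, d))   # exact line search on a quadratic
    w_new = [wi + lam * di for wi, di in zip(w, d)]   # step 2
    dw = [a - c for a, c in zip(w_new, w)]
    dg = [a - c for a, c in zip(grad(w_new), g)]      # step 3
    D = bfgs_D_update(D, dw, dg)                      # step 4, eq. (11)
    w = w_new
```

On this example the iterates reach the exact solution w = (0.2, 0.4) in two steps, as expected for BFGS with exact line search on a two-dimensional quadratic.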

The coefficient $\lambda = \arg\min_\lambda J(w^k + \lambda(\Delta w)_k)$, i.e. a line search along the descent direction for the best step size; I think a fixed learning rate can substitute for it here. The difference from Newton's method, then, is that BFGS updates its approximation of the (inverse) Hessian after the parameter $w$ is updated, whereas Newton's method fully computes the Hessian before updating $w$. There is also a memory-efficient variant of BFGS called L-BFGS, which does not store the approximate inverse Hessian directly; instead it stores the recent pairs $\Delta w, \Delta g$ from steps $k-m+1, k-m+2, \ldots, k$ of the computation, reducing the space required to store the parameters. 4. Regularization

Regularization is not specific to logistic regression; it is a general technique and way of thinking, and any algorithm prone to overfitting can use regularization to avoid it. Before discussing regularization, let us discuss what overfitting is. 4.1 Overfitting

The model introduction and solution methods above let us train on a training set (the triangles and stars in Figure 3) to predict the class of new data (such as the pink circle in Figure 3); this ability to predict new data is called generalization. Poor prediction on new data means poor generalization ability, and poor generalization is generally due to overfitting. Overfitting is the phenomenon of predicting the training data well but the unknown data poorly, usually because the model is too complex or the training data too small; that is, overfitting occurs when the ratio of model complexity to training-set size is too large. Model complexity shows up in two ways: too many parameters, and parameter values that are too large. Overly large parameter values make the derivatives very large, so the fitted function fluctuates violently. The figure below shows, from left to right, underfitting, fitting, and overfitting.

Figure 4: The same data underfitted, well fitted, and overfitted (figure source: [12])

When the model is too complex, it learns many features and may end up fitting every training sample, as in the rightmost panel above, where the fitted curve classifies every point correctly. For example, suppose you want to predict whether a house is expensive or cheap. The house's area and its neighborhood are useful features; but if, in the training set, almost all the expensive houses happen to be built by developer A and the cheap ones by developer B, then as the model grows more complex and learns more features, it will treat the developer as a useful feature, even though the developer is not a valid criterion in reality. This phenomenon is overfitting. The example also shows the two remedies: reduce the features learned, so that the developer feature is not learned, or enlarge the training set so that it also contains expensive houses built by developer B.

Thus overfitting can be addressed from two directions: reduce the complexity of the model, or increase the size of the training set. Regularization is a way of reducing model complexity. 4.2 Two Regularization Methods

Since the number of parameters of a model is generally specified by a person, regularization is usually used to limit the parameter values from becoming too large; the added term is also called a penalty term. Typically a regularization term $\varphi(w)$ is added to the objective function (the empirical risk):

$$J(w) = -\frac{1}{m}\left[\sum_{i=1}^{m} y_i \log h_w(x_i) + (1-y_i)\log(1-h_w(x_i))\right] + \lambda\varphi(w) \tag{12}$$

The regularization term generally uses the L1 norm or the L2 norm, i.e. $\varphi(w) = \|w\|_1$ or $\varphi(w) = \|w\|_2$.

First, for the L1 norm $\varphi(w) = |w|$: when the objective is optimized by gradient descent, differentiating the regularization term gives a gradient contribution of

$$\frac{\partial \varphi(w)}{\partial w_j} = \begin{cases} 1 & w_j > 0 \\ -1 & w_j < 0 \end{cases} \tag{13}$$

The parameter $w_j$ is then updated by subtracting the product of the learning rate and (13): when $w_j$ is greater than 0, $w_j$ subtracts a positive number and decreases; when $w_j$ is less than 0, it subtracts a negative number and increases. Either way, this regularization term pushes $w_j$ toward 0. This is why L1 can make the weights sparse: parameter values are driven close to 0. L1 regularization is also known as the Lasso.

Then, for the L2 norm $\varphi(w) = \sum_{j=1}^{n} w_j^2$, differentiating likewise gives the gradient contribution $\frac{\partial\varphi(w)}{\partial w_j} = 2w_j$ (usually $\frac{\lambda}{2}$ is used so the coefficient 2 cancels). The same kind of update keeps the value of $w_j$ from becoming especially large. In machine learning, L2 is also called weight decay; in regression problems, regression with L2 regularization is called ridge regression. Weight decay has the further advantage of making the objective function strictly convex, so gradient descent and L-BFGS can converge to the global optimum.
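The different behavior of the two penalties is easy to see on a single weight by updating only the regularization term (a deliberately stripped-down sketch; the step sizes are my own):

```python
def l1_step(w, lam, alpha):
    # (sub)gradient step on lam * |w|, eq. (13): the pull toward 0 has
    # constant size alpha * lam regardless of how small |w| already is
    sign = 1.0 if w > 0 else (-1.0 if w < 0 else 0.0)
    return w - alpha * lam * sign

def l2_step(w, lam, alpha):
    # gradient step on (lam/2) * w^2: the pull alpha * lam * w shrinks with |w|
    return w - alpha * lam * w

w_l1 = w_l2 = 1.0
for _ in range(100):
    w_l1 = l1_step(w_l1, lam=0.5, alpha=0.05)
    w_l2 = l2_step(w_l2, lam=0.5, alpha=0.05)
# L1 marches into a small band around 0 in constant-size steps (sparse weights);
# L2 only decays the weight geometrically and never reaches 0.
```

After 100 steps the L1-penalized weight has collapsed to (a small band around) zero, while the L2-penalized weight has only decayed to about 0.08.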

Note that L1 regularization tends to drive parameter values to exactly 0, while L2 only shrinks them: the L1 derivative is fixed, so each update changes the parameter by a fixed amount, whereas the L2 change is proportional to the (ever smaller) parameter value. The $\lambda$ in (12) also plays a very important role: it balances the model's fitting ability against its generalization ability; the larger $\lambda$, the heavier the penalty on the parameter values and the better the generalization.

In addition, from a Bayesian perspective, the regularization term encodes prior knowledge about the model: L2 is equivalent to a Gaussian prior with mean 0 and covariance $1/\lambda$ (with the L2 term written as $\frac{\lambda}{2}w^Tw$). When $\lambda$ is 0, i.e. no regularization term is added, the covariance is infinite and $w$ can grow arbitrarily large without control; the larger $\lambda$, the smaller the covariance, the smaller the variance of the parameter values, and the more stable the model (see the top answer at [10]). 5. The Relationship Between Logistic Regression and Other Models 5.1 Logistic Regression and Linear Regression

Before discussing the relationship between the two, it is worth asking what role the sigmoid function actually plays in logistic regression. Consider the example below of judging whether a tumor is malignant or benign: the horizontal axis is tumor size and the vertical axis is the value of the linear function $h_w(x) = w^Tx + b$. In the left panel, a decision rule can be found from the training set (the red crosses in the figure) with 0.5 as the threshold: cases with $h_w(x) \geqslant 0.5$ are predicted malignant. This works when the data are fairly concentrated, but once an outlier appears, as in the right panel, the learned linear function is pulled off course (the outlier produces a large weight change), and the previously chosen threshold of 0.5 no longer works; we must either adjust the threshold or adjust the linear function. If we adjust the threshold: in this figure the linear function happens to range over roughly 0 to 1, but in other cases it may range from $-\infty$ to $\infty$, so the threshold is hard to pick; if the value of $w^Tx + b$ could be transformed into a controllable range, a threshold could be fixed. Hence the sigmoid function, which maps the value of $w^Tx + b$ into (0, 1), where it can be interpreted as a probability. If we adjust the linear function: what is most needed is to reduce the influence of outliers, and outliers tend to produce large $|w^Tx + b|$; the sigmoid weakens exactly this kind of value, since after the sigmoid such values land close to 0 or 1, and the partial derivative with respect to $w_j$, $h_w(x^{(i)})(1 - h_w(x^{(i)}))x_j^{(i)}$, is very small whether the output is near 0 or near 1. So the sigmoid plays two roles in logistic regression: it maps the output of the linear function into (0, 1), and it reduces the influence of outliers.

Figure 5: Classification of benign and malignant tumors (figure source: [12])

With the above analysis in hand, consider the relationship between logistic regression and linear regression (I will not introduce linear regression here; see [11] if it is unfamiliar). Some people feel that logistic regression is essentially linear regression: both learn a linear function, and logistic regression merely adds a layer of function mapping. But my understanding is that in linear regression the linear function fits the trend of the data points themselves, while in logistic regression the linear function fits the decision boundary; their targets differ. So I would not say logistic regression is just linear regression with a mapping on top; they solve different problems. They can, however, both be subsumed under one framework: the generalized linear model, GLM [12]. First, what is the exponential family: a random variable belongs to the exponential family when its probability distribution can be written as $P(y;\eta) = b(y)\exp(\eta^T T(y) - a(\eta))$; different distributions are obtained by adjusting $\eta$. The Gaussian and Bernoulli distributions, corresponding to linear regression and logistic regression respectively, are both in the exponential family. For example, substituting $T(y) = y$, $a(\eta) = -\log(1-\phi) = \log(1+e^{\eta})$ and $b(y) = 1$ into the form above gives $P(y;\eta) = \exp\left(y\log\frac{\phi}{1-\phi} + \log(1-\phi)\right) = \exp\left(y\log\phi + (1-y)\log(1-\phi)\right) = \phi^y(1-\phi)^{1-y}$.

A GLM must satisfy the following three conditions: (1) given the observation x and parameter w, the output y obeys an exponential-family distribution with natural parameter $\eta$; (2) the model predicts the expected value, $h_w(x) = E[y|x]$; (3) $\eta = w^Tx$.

Therefore, with an appropriate choice of distribution, GLM analysis yields linear regression and logistic regression as special cases; sometimes you will see people derive the logistic regression formula starting from the GLM. In a word, linear regression and logistic regression belong to the same family of models but solve different problems: the former solves regression, the latter classification; the former outputs continuous values, the latter discrete ones; the former's loss function corresponds to a Gaussian distribution of the output y, the latter's to a Bernoulli distribution. 5.2 Logistic Regression and Maximum Entropy

The maximum entropy model reduces to logistic regression when solving two-class problems, and to a multi-class analogue of logistic regression (softmax) when solving multi-class problems. To prove the relationship between the maximum entropy model and logistic regression, one must show that the two models have the same form, i.e. that $h(x)$ comes out the same. Maximum entropy is solved by turning a constrained conditional-extremum problem into its Lagrangian dual; the entropy of the model is

$$-\sum_{v=1}^{K}\sum_{i=1}^{m} h(x^{(i)})_v \log\left(h(x^{(i)})_v\right) \tag{14}$$

Assume the constraints below, where $v, u$ index the output classes, $j$ indexes the components of the input vector $x$, and $A(u, y^{(i)})$ is the indicator function: it outputs 1 when the two arguments are equal and 0 otherwise [13]. The third constraint is obtained by setting formula (5) equal to zero; its meaning is that the best value of parameter $w_{u,j}$ is the one that makes each sample's $h(x^{(i)})_u$ behave like the indicator $A(u, y^{(i)})$.

$$\begin{cases} h(x)_v \geqslant 0 & \text{always} \\ \sum_{v=1}^{K} h(x)_v = 1 & \text{always} \\ \sum_{i=1}^{m} h(x^{(i)})_u x_j^{(i)} = \sum_{i=1}^{m} A(u, y^{(i)}) x_j^{(i)} & \text{for all } u, j \end{cases} \tag{15}$$

The softmax formula can be derived directly from the constraints (15). With this in hand, look back at the constraint in Statistical Learning Methods: if we assume $P(y|x) = h(x)$, and the feature function $f(x, y)$ on the left-hand side in fact takes the value 1, then the two constraints are actually the same.

$$\sum_{x,y} \tilde{P}(x)\, P(y|x)\, f(x,y) = \sum_{x,y} \tilde{P}(x,y)\, f(x,y) \tag{16}$$

Therefore it can be said that the maximum entropy model is logistic regression when solving two-class problems, and a collection of logistic regressions (softmax) when solving multi-class problems. In addition, both maximum entropy and logistic regression are called log-linear models. 5.3 Logistic Regression and SVM

Logistic regression and SVM, both classic classification algorithms, are very often discussed together; the answers on Zhihu and Quora analyze them from many interesting angles, and [14][15][16] are worth a look. Here I only discuss the points I agree with. If you do not know where SVM comes from, see JerryLead's series of blog posts [17]; I will not cover it here.

Similarities:

- Both are classification algorithms and both are supervised learning algorithms.
- Both are discriminative models.
- Both can use the kernel trick for nonlinear classification.
- Both aim to find a separating hyperplane.
- Both can reduce the effect of outliers.
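The contrast in loss functions, which drives most of the differences discussed next, can be made concrete (a minimal sketch; the margin values $y\,w^T x$ are illustrative):

```python
import numpy as np

def cross_entropy_loss(margin):
    # Logistic regression: log(1 + exp(-y * w^T x)), with margin = y * w^T x
    return np.log1p(np.exp(-margin))

def hinge_loss(margin):
    # SVM: max(0, 1 - y * w^T x)
    return np.maximum(0.0, 1.0 - margin)

margins = np.array([-2.0, 0.0, 1.0, 3.0])
# Hinge loss is exactly zero once the margin reaches 1, so those points stop
# contributing to the gradient; cross-entropy stays positive everywhere, so
# every sample keeps contributing.
assert np.all(hinge_loss(margins[margins >= 1]) == 0)
assert np.all(cross_entropy_loss(margins) > 0)
```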

Differences:

- The loss functions differ: logistic regression uses the cross-entropy loss, SVM the hinge loss.
- In logistic regression, every sample point contributes when optimizing the parameters; SVM uses only the support vectors closest to the separating hyperplane. This is also why logistic regression rarely uses kernel functions: too many samples would enter the computation. And because logistic regression is influenced by all samples, class imbalance must be corrected by balancing the number of samples per class.
- Logistic regression models the probability; SVM models the separating hyperplane.
- Logistic regression performs empirical risk minimization, SVM structural risk minimization. This shows up in SVM's built-in L2 regularization term, which logistic regression lacks.
- Logistic regression weakens the influence of points far from the separating plane through its nonlinear (sigmoid) transformation; SVM removes their influence entirely by keeping only the support vectors.
- Logistic regression is a statistical method; SVM is a geometric one.

5.4 Logistic Regression and Naive Bayes

The two algorithms share some similarities, and they are often cited as the typical pair when contrasting discriminative models with generative models, so here is a short summary.

The similarities: both solve classification problems and both are supervised learning algorithms. Interestingly, when naive Bayes assumes the conditional probability $P(x|y=c_k)$ follows a Gaussian distribution (Gaussian naive Bayes), the $P(y=1|x)$ it computes has the same form as logistic regression [18].
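This equivalence can be checked numerically in one dimension (a minimal sketch assuming class-conditional Gaussians with a shared variance; all parameter values are illustrative):

```python
import numpy as np

# Assumed 1-D Gaussian naive Bayes with shared variance (illustrative values)
mu0, mu1, sigma2, prior1 = 0.0, 2.0, 1.0, 0.5

def gnb_posterior(x):
    # P(y=1|x) by Bayes' rule with Gaussian class-conditionals
    p1 = prior1 * np.exp(-(x - mu1) ** 2 / (2 * sigma2))
    p0 = (1 - prior1) * np.exp(-(x - mu0) ** 2 / (2 * sigma2))
    return p1 / (p0 + p1)

# The same posterior in logistic form: sigmoid(w * x + b)
w = (mu1 - mu0) / sigma2
b = (mu0 ** 2 - mu1 ** 2) / (2 * sigma2) + np.log(prior1 / (1 - prior1))

def logistic(x):
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

xs = np.linspace(-3.0, 5.0, 9)
assert np.allclose(gnb_posterior(xs), logistic(xs))
```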

The difference is that logistic regression models $P(y|x)$ discriminatively, while naive Bayes models the joint $P(x,y)$ generatively. The former requires iterative optimization; the latter does not. With little data the latter tends to beat the former, while with enough data the former is better. Because naive Bayes assumes the conditional probabilities $P(x|y=c_k)$ are conditionally independent, i.e. each feature acts independently given the class, naive Bayes will not classify as well as logistic regression when the data violate this assumption.

5.5 Logistic Regression and the Energy-Based Model

(Added March 3)

The energy-based model is not a concrete algorithm but a framework: the dependency between the input and output variables is represented by a scalar $E(x, y)$, called the energy, and the function that models this relationship is called the energy function. Such a model is good if, with the input held fixed, the energy is low for the correct output and high for wrong outputs.

Given a training set $S$, constructing and training an energy-based model consists of four parts [19]:

1. A suitable energy function $E(W, Y, X)$.
2. An inference algorithm that, given an input $X$ and the form of the energy function, finds the $Y$ that minimizes the energy, i.e. $Y^* = \arg\min_{Y \in \mathcal{Y}} E(W, Y, X)$.
3. A loss function $L(W, S)$ that measures how good an energy function is on the training set $S$.
4. A learning algorithm that finds suitable parameters $W$, i.e. selects, from a family of energy functions, the one that minimizes the loss.

Steps 1 and 3 construct different models through the choice of energy function and loss function, while steps 2 and 4 describe how to train such a model.

Now suppose we solve a binary classification problem with $y \in \{-1, 1\}$, take the energy function to be $E(W, Y, X) = -y\, G_W(x) = -y\, w^T x$, and take the loss to be the negative log-likelihood loss. The loss function then comes out as

$$L_{nll}(W, S) = \frac{1}{P} \sum_{i=1}^{P} \log\bigl(1 + \exp(-2\, y_i\, w^T x_i)\bigr)$$

which, after substituting formulas (1) and (2), is consistent with the loss function of formula (3), showing that the algorithm is logistic regression. Logistic regression is thus a special case of the energy-based model.
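The claim that the negative log-likelihood of this energy function reproduces the logistic loss can be verified numerically (a minimal sketch; the weight and input values are illustrative):

```python
import numpy as np

# Illustrative (assumed) parameter and input values
w = np.array([0.5, -1.2, 0.3])
x = np.array([1.0, 2.0, -0.5])

def energy(w, y, x):
    # Energy function from the text: E(W, Y, X) = -y * w^T x, with y in {-1, 1}
    return -y * (w @ x)

def nll(w, y, x):
    # Gibbs distribution over the two labels: P(y|x) = exp(-E(y)) / sum_y' exp(-E(y'))
    z = np.exp(-energy(w, 1, x)) + np.exp(-energy(w, -1, x))
    return -np.log(np.exp(-energy(w, y, x)) / z)

# Matches the closed form log(1 + exp(-2 * y * w^T x)) for both labels
for y in (-1, 1):
    assert np.isclose(nll(w, y, x), np.log1p(np.exp(-2 * y * (w @ x))))
```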

Likewise, for the binary problem, if the energy function is kept unchanged but the loss function is taken to be the hinge loss plus a regularization term, the loss expression of SVM can be derived (SVM's kernel function would show up in the energy function; for clarity this is not expanded here). This is why the most essential difference between logistic regression and SVM is the loss function.

6. Parallelization

Since I could not find much material on parallelization, here I analyze the implementation given by blogger Feng Yan [20]. The heart of parallelizing logistic regression is computing the gradient. After changing the labels to $-1$ and $1$, the gradient terms can be combined into

$$\sum_{i=1}^{m} \left( \frac{1}{1 + \exp(y^{(i)}\, w^T x^{(i)})} - 1 \right) y^{(i)} x^{(i)}$$

The dominant cost in the gradient computation is matrix multiplication, and the usual approach is to cut the matrix into blocks of a suitable size. For binary classification with $m$ samples and $n$ features, if there are $M \times N$ compute nodes arranged in $M$ rows and $N$ columns, each node is assigned $m/M$ samples and $n/N$ features, as shown in the figure below.

Fig. 6 Data partitioning in parallel LR (image source: [20])

I am not used to the original notation, so below I switch to $i, j$ and draw a process diagram of the matrix operations, where $X_{(i,j)}$, $i \in [1, M]$, $j \in [1, N]$, denotes the block in row $i$, column $j$ after the input data is partitioned.

$X_{(i,j),k}$ denotes row $k$ within that block, and $w_j$ denotes the $j$-th block of the parameter vector.

Step 1: each node computes

$$d_{(i,j),k} = w_j^T X_{(i,j),k}$$

Step 2: sum across each row of nodes

$$d_{i,k} = \sum_{j=1}^{N} d_{(i,j),k}$$

Fig. 7 Steps 1 and 2 of the parallel LR gradient computation

Step 3: each node computes its partial gradient

$$G_{(i,j)} = \sum_{k=1}^{m/M} \left( \frac{1}{1 + \exp(y_{i,k}\, d_{i,k})} - 1 \right) y_{i,k}\, X_{(i,j),k}$$

Step 4: sum across each column of nodes

$$G_j = \sum_{i=1}^{M} G_{(i,j)}$$

Fig. 8 Steps 3 and 4 of the parallel LR gradient computation
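The four steps above can be simulated on a single machine by emulating the node grid with index blocks (a minimal NumPy sketch; $m$, $n$, $M$, $N$ and the data are illustrative), confirming that the block-wise result matches the serial gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 6        # samples, features (illustrative)
M, N = 2, 3        # node grid: M row-blocks of samples, N column-blocks of features
X = rng.normal(size=(m, n))
y = rng.choice([-1.0, 1.0], size=m)
w = rng.normal(size=n)

# Serial reference: sum_i (1/(1+exp(y_i w^T x_i)) - 1) * y_i * x_i
d = X @ w
g_serial = ((1.0 / (1.0 + np.exp(y * d)) - 1.0) * y) @ X

rows = np.split(np.arange(m), M)   # each node row holds m/M samples
cols = np.split(np.arange(n), N)   # each node column holds n/N features

# Steps 1-2: partial dot products per node, summed across each row of nodes
D = np.zeros(m)
for i in range(M):
    for j in range(N):
        D[rows[i]] += X[np.ix_(rows[i], cols[j])] @ w[cols[j]]

# Steps 3-4: partial gradients per node, summed across each column of nodes
g_parallel = np.zeros(n)
for i in range(M):
    coeff = (1.0 / (1.0 + np.exp(y[rows[i]] * D[rows[i]])) - 1.0) * y[rows[i]]
    for j in range(N):
        g_parallel[cols[j]] += coeff @ X[np.ix_(rows[i], cols[j])]

assert np.allclose(g_serial, g_parallel)
```

Each node only ever touches its own $m/M \times n/N$ block of the data; the two summations are the row-wise and column-wise reductions across the node grid.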

With the decomposition into the steps above, the gradient of logistic regression can be computed in parallel.

7. Summary

This article took several days to write, and at times it went in circles, because there are far too many places that could be expanded. After writing it I looked through some interview questions; the theory here basically covers them, but real applications still need time to understand. My grasp of the final parallelization part is also not deep enough: I have implemented matrix multiplication on a GPU, but I have never touched truly large data and do not know where the real problems would arise. Logistic regression can be explained from many angles; it really is a beautiful algorithm.

8. References

[1] Verhulst and the logistic equation (1838)

[2] Mathematical enquiries on the population growth

[3] Proceedings of the National Academy of Sciences

[4] The regression analysis of binary sequences

[5] The Origins of Logistic regression

[6] Machine Learning Series (2): interpreting logistic regression from the perspective of elementary mathematics

[7] A comparison of numerical optimizers for logistic regression

[8] numerical optimizers for Logistic regression

[9] Newton's method and quasi-Newton method learning notes (I): Newton's method

[10] Zhihu: What is the principle behind using regularization to prevent overfitting in machine learning?

[11] Multivariable linear regression (linear regression with multiple variables)

[12] CS229 lecture notes

[13] Equivalence of regression and maximum entropy models

[14] Zhihu: What similarities and differences do linear SVM and LR have?

[15] Under what conditions should SVM and logistic regression each be used?

[16] Support Vector Machines: what is the difference between linear SVMs and logistic regression?

[17] JerryLead's support vector machine (SVM) series

[18] Generative and discriminative classifiers: naive Bayes and logistic regression

[19] A tutorial on energy-based models

[20] Parallel logistic regression
