Logistic regression is a machine learning method commonly used in industry to estimate the possibility of something happening: the possibility that a user buys a product, that a patient has a certain disease, or that an advertisement gets clicked. (Note: "possibility", not "probability" in the mathematical sense. The output of logistic regression is not a probability value as defined in mathematics and cannot be used directly as one. In practice this result is usually combined with other feature values in a weighted sum, rather than multiplied with them directly.)
So what exactly does it do, and in which situations does it apply or not apply?
I. Official definition
Figure 1. The logistic function, with z on the horizontal axis and f(z) on the vertical axis.
Logistic regression is a method for learning a function f: X -> Y, or P(Y | X). Here Y is discrete, and X = <x1, x2, ..., xn> is a vector whose components can each be discrete or continuous.
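For reference (this is the standard textbook form, not something quoted from this article's figure): the logistic function in Figure 1 is f(z) = 1 / (1 + e^(-z)), and the model it produces is P(Y = 1 | X) = 1 / (1 + e^(-(w1*x1 + w2*x2 + ... + wn*xn + b))), i.e. a linear combination of the features squashed through the logistic function.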
II. My explanation
Staring at the formula is painful, so let's just talk it through piece by piece. Logistic regression has three main ingredients: regression, linear regression, and the logistic function.
1) Regression
Logistic regression is a kind of linear regression, and linear regression is a kind of regression. So what on earth is regression?
In fact, regression means estimating the unknown parameters of a known formula. For example, say the formula is y = a*x + b and the unknown parameters are a and b. We have a pile of real (x, y) data points (the training samples), and regression uses this data to automatically estimate the values of a and b. The estimation can be understood crudely like this: given the training sample points and the known formula, for one or more unknown parameters, the machine automatically enumerates all possible values of the parameters (for multiple parameters, it enumerates their combinations) until it finds the parameter value (or combination) that best matches the distribution of the sample points. (Of course, real implementations use optimization algorithms and certainly do not brute-force enumerate.)
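Just to make the "enumerate parameter values" idea concrete, here is a minimal sketch (the data points and step sizes are made up for illustration; real libraries use optimization instead of brute force):

```python
# Brute-force "regression": enumerate candidate (a, b) for y = a*x + b
# and keep the pair with the smallest total squared error.
samples = [(1, 4.1), (2, 5.9), (3, 8.2), (4, 9.8)]  # made-up (x, y) pairs

best_a, best_b, best_err = None, None, float("inf")
for a10 in range(-50, 51):          # try a in [-5.0, 5.0] with step 0.1
    for b10 in range(-50, 51):      # try b in [-5.0, 5.0] with step 0.1
        a, b = a10 / 10.0, b10 / 10.0
        err = sum((a * x + b - y) ** 2 for x, y in samples)
        if err < best_err:
            best_a, best_b, best_err = a, b, err

print(best_a, best_b, best_err)     # roughly a = 2, b = 2 for this toy data
```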
Note: the premise of regression is that the formula is known; otherwise there is nothing to regress. And in real life, where do known formulas come from? (Haha.) So the formulas used in regression are basically a data analyst's guess after staring at a lot of data (in fact, most of them are made up off the top of someone's head, um...). Depending on these formulas, regression is divided into linear regression and nonlinear regression. In linear regression the formulas are all first-degree (linear equations in one variable, in two variables, ...), while nonlinear formulas can take all kinds of forms (degree-n polynomials, log equations, and so on). A concrete example follows under linear regression.
2) Linear Regression
Let's look at a simple single-variable example. Suppose we want to find the relationship between y and x, where x is the price of a pair of shoes and y is that shoe's sales volume. (Why do we want this relationship? So we can help set prices and make more money, of course.) We have the sales data from previous years, (x0, y0), (x1, y1), ..., (xn, yn), as the sample set, and we assume they satisfy a linear relationship y = a*x + b (with the specific values of a and b unknown). Linear regression is then the job of finding, from the historical data, the values of a and b that make y = a*x + b have the smallest error over the whole sample set.
Maybe you're thinking: ugh, that's easy! Who needs regression for that? I can sketch an x-y coordinate system on scratch paper, plot a few points, and eyeball the line myself! (Fine, I admit we were all tortured by exactly that kind of plotting exercise in junior high.) And indeed, the single-variable case really is that intuitive. But with multiple variables it becomes very hard to see by eye. For example, besides the price of the shoe, its quality, the advertising spend, and the foot traffic around the store all affect sales, and we want a formula like sales = a*x + b*y + c*z + d*zz + e. Now you can't draw the picture, and the pattern is very hard to spot, so it is much better to hand the job to linear regression. (How linear regression is implemented internally doesn't matter for now; as programmers we can just treat it as a library call. If you want to dig deeper after reading this article, see Note 1 at the end.) That is exactly the value of linear regression algorithms. A small sketch of what the multivariate case looks like with a library follows below.
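A minimal sketch with NumPy's least-squares solver; the feature values below are invented, and the column meanings just follow the shoe example above:

```python
import numpy as np

# Each row: [price, quality_score, ad_spend, foot_traffic], made-up numbers.
X = np.array([
    [30.0, 7.0, 100.0, 500.0],
    [50.0, 8.0, 200.0, 800.0],
    [80.0, 9.0, 150.0, 300.0],
    [40.0, 6.0,  50.0, 900.0],
    [60.0, 8.0, 300.0, 700.0],
])
y = np.array([120.0, 150.0, 90.0, 140.0, 160.0])  # made-up sales volumes

# Add a column of ones so the model is sales = a*price + b*quality + c*ads + d*traffic + e.
X1 = np.hstack([X, np.ones((X.shape[0], 1))])
coeffs, *_ = np.linalg.lstsq(X1, y, rcond=None)
a, b, c, d, e = coeffs
print(a, b, c, d, e)
```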
Note that the premise for linear regression to work well here is that y = a*x + b at least roughly makes sense (we believe that the more expensive a shoe is, the fewer pairs sell, and the cheaper it is, the more sell; shoe quality, advertising spend, and foot traffic follow similarly plausible rules). Not every kind of variable is suitable for linear regression, though. For example, if x is not the price but the shoe size, then no matter what (a, b) the regression finds, the error will be huge (because in reality both very large and very small sizes hurt sales). In short: if our formula is wrong, no amount of regression will give good results.
3) Logistic Equation
The predicted value above is a concrete real number. In many cases, however, we need something that behaves like a probability, a value between 0 and 1 (for example: can this pair of shoes be sold today? will this advertisement be clicked by a user? We want such a value to help us decide whether to put the shoes on the shelf or show the ad). The required value must lie between 0 and 1, but the raw prediction obviously does not respect that range, so the logistic function is introduced to normalize it. To repeat: this value is not a probability in the mathematical sense, so what we get is not a probability. Then why bother normalizing it into the 0-to-1 range at all? The benefit of normalization is that the value becomes bounded and comparable, so that when you keep computing with it (for example, you care not only about whether the shoes will sell, but want to take a weighted sum of the possibility of selling shoes, the local public-security situation, the local transportation cost, and other factors, and use the combined result to decide whether to open a shoe store in this location), normalization guarantees that this result will not drown out other features, or be drowned out by them, just because its range is too large or too small. (For example, if the shoe-sales value starts at 100 and goes up from there, while the security score is a value between 0 and 1, summing the two directly would make the security term completely negligible.) This is the main reason for using logistic regression rather than plain linear regression. And by now you may have started to realize it: yes, logistic regression is simply linear regression normalized through the logistic function.
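A tiny sketch of that normalize-then-combine idea (the scores and the 0.7/0.3 weights are made-up numbers, purely for illustration):

```python
import math

def logistic(z):
    """Squash any real-valued score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Raw scores (made-up numbers).
shoe_score = 3.2          # unbounded output of the "will this shoe sell" model
security_score = 0.8      # already a value between 0 and 1

# Without normalization the shoe score would swamp the security score.
shoe_possibility = logistic(shoe_score)   # ~0.96, now comparable to security_score

# Weighted sum of comparable quantities to decide whether to open a store here.
decision_score = 0.7 * shoe_possibility + 0.3 * security_score
print(shoe_possibility, decision_score)
```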
As for why we use the logistic function rather than some other normalization: this way of normalizing is often simply more reasonable (it does call itself "logistic", after all). Values that are extremely large or extremely small (often noise) get squashed, which ensures the mainstream results are not drowned out. For the exact formula and plot, see section I, Official definition: f(x) there is the real-valued prediction from the example above, and y is the resulting value between 0 and 1 expressing the possibility of a sale. (Here again it is a "possibility", not a "probability"; thanks to zjtchow for pointing this out in his reply.)
III. Applicability of logistic regression
1) It can be used to predict a likelihood, and it can be used for classification
Not every machine learning method can output such a likelihood score (SVM, for instance, cannot; it only gives you 1 or -1). The advantage of predicting a score is that the results are comparable: once we have the clicked-possibility of different advertisements, we can show the n ads most likely to be clicked. Whether the scores are all high or all low, we still get the optimal top-n. When used for classification, you only need to set a threshold: scores above the threshold go to one class, scores below it to the other.
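A minimal sketch of both uses with scikit-learn's LogisticRegression (the training data, threshold, and top-n size are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data: each row is an ad's features, label = was it clicked.
X_train = np.array([[0.1, 1.0], [0.4, 0.2], [0.8, 0.9], [0.3, 0.5], [0.9, 0.1]])
y_train = np.array([0, 0, 1, 0, 1])

model = LogisticRegression().fit(X_train, y_train)

# New ads to rank.
X_new = np.array([[0.2, 0.3], [0.7, 0.8], [0.5, 0.5], [0.95, 0.2]])
scores = model.predict_proba(X_new)[:, 1]     # clicked-possibility of each ad

top_n = np.argsort(scores)[::-1][:2]          # show the 2 most likely to be clicked
labels = (scores > 0.5).astype(int)           # or classify with a threshold
print(scores, top_n, labels)
```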
2) It can only be used for linear problems
Logistic regression should only be used when the features and the target are (roughly) linearly related (unlike SVM in this respect). This has two practical implications: on one hand, if we know in advance that the model is nonlinear, we definitely should not use logistic regression; on the other hand, when using logistic regression, we should take care to select features that have a linear relationship with the target.
3) The features do not need to satisfy a conditional independence assumption, but each feature's contribution is computed independently
Unlike naive Bayes, logistic regression does not require the conditional independence assumption to hold (because it does not compute a posterior probability that way). However, the contribution of each feature is computed independently, i.e. LR will not automatically combine different features for you to produce new features (never hold that fantasy; that is the job of decision trees, LSA, pLSA, LDA, or of you yourself). So if you need a feature such as TF*IDF, you must provide it explicitly. Giving it the two separate dimensions TF and IDF is not enough: that only yields results like a*TF + b*IDF, with no c*TF*IDF term.
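A tiny sketch of "combine the features yourself" (the TF and IDF numbers are made up; the point is only the extra crossed column):

```python
import numpy as np

# Two raw features per document: tf and idf (made-up values).
tf  = np.array([0.10, 0.40, 0.05, 0.30])
idf = np.array([2.0, 1.5, 3.0, 0.5])

# LR trained on [tf, idf] alone can only ever learn a*tf + b*idf.
X_plain = np.column_stack([tf, idf])

# If you believe tf*idf matters, you must add that column yourself.
X_crossed = np.column_stack([tf, idf, tf * idf])
print(X_crossed)
```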
(End)
====================================================================
I am a dividing line. The content below does not affect how you use logistic regression, and skipping it costs you nothing. I wrote it only for those who, like me, wanted to know the reasons behind it.
====================================================================
Note 1: how linear regression is solved
Linear regression works for any number of features. But to keep things intuitive (and to cut down on typing...), the example above uses a single one-dimensional feature, x: the shoe price, with target y: the shoe sales. Solving the linear regression means finding a straight line such that the actual data points (xi, yi) deviate from it as little as possible overall. Two problems are hidden here: 1) how do we define "smallest overall deviation"? In other words, what is the formula for the overall error (also known as the cost function)? Without it, we cannot claim to have found the line with the smallest error. 2) Given the cost function, how do we find the line f(x) = a*x + b with the smallest error?
Let's take them one at a time.
1) The cost function (least squares)
The way linear regression is solved is usually referred to as the "least squares method" (don't bother clicking through to look it up; the formulas on the encyclopedia pages are a mess, and I explain it below). Strictly speaking, the "least squares method" is not the solving procedure at all: it is an evaluation criterion, i.e. the cost function.
The "squares" part refers to squaring: the overall error is defined as the square of the difference between each point's "predicted value" and its "real value" (that's the "squares"), summed over all points. The formula is sum[(f(xi) - yi)^2], where f(x) is the current candidate line. The "least" part means that when this sum of squares is smallest, the line has the least error. That is all the least squares method says.
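In code, this cost function is just a few lines (a minimal sketch; `samples` is assumed to be a list of (x, y) pairs):

```python
def cost(a, b, samples):
    """Least-squares cost of the line f(x) = a*x + b on (x, y) samples."""
    return sum((a * x + b - y) ** 2 for x, y in samples)
```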
So why squares? (Skip this paragraph if you don't care :D) First, because the differences can be positive or negative, and without squaring the positives and negatives would cancel when summed. But then wouldn't the sum of absolute values work just as well? One explanation I like best is this: if we believe the actual points deviate from the predicted line following a normal distribution, then we should minimize the sum of squares, not the sum of absolute values or anything else. The density of a normal distribution looks roughly like p(x) = constant * e^(-(x - mean)^2 / constant); note the -(x - mean)^2 inside, a square, not an absolute value. Minimizing the squared error is exactly maximizing this p(x) of the normal distribution, i.e. making the observed errors the most plausible explanation. Not so hard to grasp intuitively, right? Liu Weipeng explains this in detail in his article on Bayes; it is also the best piece on Bayesian reasoning I have read, well worth a look if you want to understand Bayes better. (But I've already typed a lot here, so finish this post first ~~ :D)
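Spelled out slightly more (this is the standard maximum-likelihood argument, added here for completeness): if each observed yi = f(xi) + noise with the noise normally distributed, then p(yi | xi) = constant * e^(-(yi - f(xi))^2 / constant). The probability of seeing all the samples together is the product of these terms, and taking the log turns the product into a sum: log-likelihood = constant - constant * sum[(yi - f(xi))^2]. So making the observed data as probable as possible (maximum likelihood) is exactly the same as making sum[(yi - f(xi))^2] as small as possible, which is the least-squares cost.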
2) Gradient descent (the actual solving step)
The "Least Squares" only defines the cost function,The gradient descent method (gradient descent) is generally used for specific solutions (finding the optimal straight line). Gradient Descent is a greedy search method. You can open another blog post to fully describe it... for this article, you only need to know that it is doing this: it first finds a straight line, calculates the error, and then continuously fine-tune it. Each step of fine-tuning reduces the error by a little, until the end cannot be reduced, the final line is the optimal line.
Concretely, we can picture the procedure as follows (note: the paragraph above is accurate, but what follows is only an intuitive picture, not a formal definition; for the rigorous description, see here). A small code sketch follows after step 3>.
1> It first picks some line f(x) = a0*x + b0 and computes its error (the cost function).
2> For a0 (and likewise for b0, and for c0, d0, e0, f0 if they exist), it tries a1 = a0 + some small number, producing a new line a1*x + b0, and checks whether the new error (the value of the cost function) is smaller than before. If so, it keeps this a1; otherwise it tries a1 = a0 - some small number and checks whether that reduces the error, keeping it if so. If neither helps, a0 was already the best value for this parameter and it stays put: a1 = a0. The parameter b0 is handled the same way, and we end up with a new line f(x) = a1*x + b1. (In practice this step is carried out by taking partial derivatives.)
3> Repeat step 2> until a and b can no longer be changed. The final f(x) = an*x + bn is the optimal line, and the solving is done.
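Here is the promised sketch of steps 1>-3> in code, using the partial derivatives mentioned above (a minimal illustration with a made-up learning rate, step count, and toy data, not a production implementation):

```python
def fit_line(samples, learning_rate=0.01, steps=10000):
    """Gradient descent for f(x) = a*x + b with the least-squares cost."""
    a, b = 0.0, 0.0                       # step 1>: start from some line
    for _ in range(steps):                # step 3>: repeat until (nearly) no change
        # step 2>: partial derivatives of sum[(a*x + b - y)^2] w.r.t. a and b
        grad_a = sum(2 * (a * x + b - y) * x for x, y in samples)
        grad_b = sum(2 * (a * x + b - y) for x, y in samples)
        a -= learning_rate * grad_a       # the "small number" shrinks as the
        b -= learning_rate * grad_b       # gradient approaches zero
    return a, b

samples = [(1, 4.1), (2, 5.9), (3, 8.2), (4, 9.8)]  # same toy data as before
print(fit_line(samples))                  # roughly (2.0, 2.0)
```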
Because of a special property of this problem (the least-squares cost function is convex), no matter which a0 and b0 you start from, the method above finds the one and only optimal line. Moreover, because the "small number" is computed from the gradient, it gets smaller and smaller as the line approaches the optimum, eventually approaching 0, so you do not have to worry about the steps being too large and jumping over the optimum.
From: Su Ran Xu, http://hi.baidu.com/hehehehello/item/40025c33d7d9b7b9633aff87