The content below is largely reproduced from another post, mainly to introduce the theory behind logistic regression; first, a summary of my own takeaways from reading it.
In simple terms, linear regression directly outputs the weighted sum of the feature values and their corresponding coefficients, whereas logistic regression passes that result through one more logistic function.
The logistic function here is the sigmoid function, which looks much like a step function when viewed on a large coordinate scale.
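For reference, the sigmoid function is

$$\sigma(z) = \frac{1}{1 + e^{-z}},$$

which squashes any real-valued input into the interval (0, 1).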
When determining the regression coefficients, i.e., the weights attached to each feature, the most commonly used method is maximum likelihood estimation (EM-style parameter estimation), which presupposes that setting the first derivative to zero yields a closed-form solution.
If the first derivative has no analytic solution, the gradient ascent method is generally chosen instead: the regression coefficients are updated through a finite number of iterations until the cost function converges.
///////////////////////////////////////////////////////////////////////////////////////
The following content is adapted from: http://blog.csdn.net/zouxy09/article/details/20319673
1. Logistic regression (logistic regression)
Logistic regression is one of the most commonly used machine-learning methods in industry for estimating the likelihood of something. The classic book "The Beauty of Mathematics" also mentions its use in ad prediction: estimate how likely each ad is to be clicked by a user, place the ads most likely to be clicked where the user will see them, and then say "click me, click me!" Once the user clicks, the money rolls in. That is why our computers are now awash with ads.
Similar quantities include the likelihood that a user buys a certain product, the likelihood that a patient has a certain disease, and so on. The world is random (except, of course, for man-made deterministic systems, which may still produce noise or wrong results, but the likelihood of such errors is so small, as small as once in tens of millions of years, that it is negligible), so the occurrence of anything can be expressed in terms of probability or odds. "Odds" refers to the ratio of the probability that something occurs to the probability that it does not occur.
Logistic regression can be used for regression, but also for classification, mainly binary classification. Remember the support vector machine (SVM) we discussed in the previous sections? It is a binary classifier too. It can separate samples from two different classes; its idea is to find the separating hyperplane that best distinguishes the two classes. But when you hand it a new sample, it gives you only one answer: this sample is either positive or negative. For example, if you ask SVM whether a girl likes you, it will only answer "likes" or "doesn't like". Such bluntness leaves us either ecstatic or despairing, which is not good for physical or mental health. It would be gentler if it could tell you "she likes you", "she likes you a little", "she neither likes nor dislikes you", or "she doesn't even want to think about you", or tell you that she has a 49% chance of liking you rather than flatly saying she doesn't. That also provides extra information: how much hope you have when she comes near, how many more times you should try; know yourself and your opponent and you win every battle, haha. Logistic regression is gentle in exactly this way: what it gives us is the likelihood that your sample belongs to the positive class.
Now for some math (see the references for a deeper understanding). Suppose our sample is {x, y}, where y is 0 or 1, indicating the negative or positive class, and x is our m-dimensional sample feature vector. Then the "probability" that this sample x belongs to the positive class, i.e., that y = 1, can be represented by the following logistic function:
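In standard notation (the symbols follow the definitions above):

$$P(y = 1 \mid x; \theta) = \sigma(\theta^{T} x) = \frac{1}{1 + e^{-\theta^{T} x}}$$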
Here θ is the model parameter, namely the regression coefficients, and σ is the sigmoid function. In fact, this function can be derived from the log-odds, i.e., the logarithm of the ratio of the probability that x belongs to the positive class to the probability that it belongs to the negative class:
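Written out, the log-odds are linear in x:

$$\ln \frac{P(y = 1 \mid x; \theta)}{P(y = 0 \mid x; \theta)} = \theta^{T} x$$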
In other words, y, the outcome we care about, for example whether she likes you, is related to a number of independent variables (factors), such as your personality, whether your ride has two wheels or four, whether your looks rival Pan An's or you could pass for Brother Sharp, whether you own a thousand-foot mansion or a three-inch snail shell of a home, and so on. We denote these factors as x1, x2, ..., xm. How does the girl weigh these factors? The quickest way is to add up the scores of all these factors: the higher the total, the more she likes you. But everyone has their own yardstick and weighs factors differently; to each their own. For example, this girl cares more about your personality, so the weight on personality is 0.6, and she doesn't care whether you have money, since you can struggle together, so the weight on money is 0.001, and so on. The weights corresponding to x1, x2, ..., xm are called the regression coefficients, written θ1, θ2, ..., θm. Their weighted sum is your total score. Now please pick your favorite boy; insincere suitors need not apply! Haha.
So the logistic regression above is a linear classification model. It differs from linear regression in that the linear regression output, which can range from negative infinity to positive infinity, is compressed into the range 0 to 1, so that the output value can be interpreted as a "likelihood" that anyone can make sense of. Compressing large values into this range also has the nice benefit of suppressing the influence of extreme variables (I am not sure this interpretation is correct). Achieving this great feat actually requires only one trivial addition: applying a logistic function to the output. Additionally, for binary classification we can simply say: if the probability that sample x belongs to the positive class is greater than 0.5, it is positive; otherwise it is negative. In fact, SVM's "class confidence" is just the sample's distance to the boundary, and the job of turning that into a probability is effectively handed off to logistic regression.
So, logistic regression is nothing more than a linear regression normalized by the logistic function, that's all.
Okay, that's enough gossip about LR. Back to the orthodox machine-learning framework: the model has been chosen, but the model parameter θ is unknown, and we need to solve for it by training on the data we have collected. So the next thing to do is construct the cost function.
The most basic learning algorithm for logistic regression is maximum likelihood. For what maximum likelihood is, see my other blog post, "A Shallow Explanation: From Maximum Likelihood to the EM Algorithm".
Suppose we have n independent training samples {(x1, y1), (x2, y2), ..., (xn, yn)}, with y ∈ {0, 1}. The probability of each observed sample (xi, yi) occurring is:
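In standard form, using σ(θᵀxi) = P(yi = 1 | xi; θ) as above:

$$P(y_i \mid x_i; \theta) = \sigma(\theta^{T} x_i)^{\,y_i}\,\bigl(1 - \sigma(\theta^{T} x_i)\bigr)^{1 - y_i}$$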
Why does it look like this? When yi = 1, the second factor disappears, leaving only the probability that x belongs to class 1; when yi = 0, the first factor disappears, leaving only the probability that x belongs to class 0 (which is 1 minus the probability that it belongs to class 1). So whether y is 0 or 1, the expression above gives the probability of (x, y) occurring. Then for the whole sample set, i.e., n independent samples, the likelihood function is (since each sample is independent, the probability of all n samples occurring is the product of their individual probabilities):
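Written out, that product is:

$$L(\theta) = \prod_{i=1}^{n} \sigma(\theta^{T} x_i)^{\,y_i}\,\bigl(1 - \sigma(\theta^{T} x_i)\bigr)^{1 - y_i}$$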
The maximum likelihood method looks for the model parameter θ* that maximizes this likelihood function. This likelihood to be maximized is our cost function.
OK, we have the cost function, and the next step is to optimize it. We first try differentiating the cost function above and see whether setting the derivative to 0 can be solved, i.e., whether an analytic solution exists. If it does, wonderful: we are done in a single step. If not, we need to iterate, which is time-consuming and laborious.
We first transform L(θ): take the natural logarithm and simplify (don't be scared by the pile of formulas; it is very simple and just takes a bit of patience, and you will see it once you derive it yourself. Note: xi here denotes the i-th sample; below the subscript is not always written out, but I am sure your eyes are sharp), and we get:
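The resulting log-likelihood is:

$$\ell(\theta) = \ln L(\theta) = \sum_{i=1}^{n}\Bigl[\,y_i \ln \sigma(\theta^{T} x_i) + (1 - y_i)\ln\bigl(1 - \sigma(\theta^{T} x_i)\bigr)\Bigr]$$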
Then, differentiating this log-likelihood with respect to θ, we get:
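Using the fact that σ'(z) = σ(z)(1 − σ(z)), the gradient works out to:

$$\frac{\partial \ell(\theta)}{\partial \theta} = \sum_{i=1}^{n}\bigl(y_i - \sigma(\theta^{T} x_i)\bigr)\,x_i$$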
Then we set this derivative to 0, and you will be disappointed to find that it cannot be solved analytically. Try it yourself if you don't believe me. So there is no way around it; we can only rely on iteration to get the job done. The classic gradient descent algorithm is chosen here.
2. Solving the optimization problem
2.1. Gradient descent
Gradient descent, also known as steepest descent, is a method that uses first-order gradient information to find a local optimum of a function, and it is also the simplest and most commonly used optimization method in machine learning. Its idea is very simple: as I said at the beginning, to find the minimum I only need every step to go downhill (that is, every step should make the cost function smaller), and if I keep walking I am sure to reach a minimum eventually.
But I also want to reach the minimum faster. How? At every step we should pick the direction in which the descent is steepest, that is, the direction that brings us closer to the minimum than any other would. And this fastest downhill direction is the negative direction of the gradient.
For logistic regression, the gradient descent algorithm comes out fresh from the oven, as follows:
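The update rule (written here as gradient ascent on the log-likelihood, consistent with the correction note below) is:

$$\theta := \theta + \alpha \sum_{i=1}^{n}\bigl(y_i - \sigma(\theta^{T} x_i)\bigr)\,x_i$$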
Here, the parameter α is called the learning rate; it is how far each step goes, and it is quite crucial. If you set it too large, it is easy to hover around the optimum, because your stride is too big. For example, you want to walk from Guangzhou to Shanghai, but each of your steps covers the distance from Guangzhou to Beijing, and there is no such thing as half a step. Is being able to take such a big stride lucky or unlucky? Everything has two sides: the advantage is that you can quickly get back near the optimum from far away, but once you are near the optimum it becomes powerless. If you set it too small, the convergence is too slow, like a snail: it will eventually land on the optimal point, but at that speed we might be waiting until who knows when, and we don't have that much patience. So some improvements take the knife to exactly this place, the learning rate: start the iterations with a large learning rate and slowly shrink it as the optimum gets closer, taking the best of both worlds. This refinement is shown concretely in section 2.3.
The pseudo-code for the gradient descent algorithm is as follows:
################################################
Initialize all regression coefficients to 1
Repeat the following steps until convergence {
    Compute the gradient over the entire data set
    Update the regression coefficients by alpha * gradient
}
Return the regression coefficient values
################################################
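As a concrete illustration, here is a minimal runnable sketch of this batch procedure in Python/NumPy (the function and variable names are my own, not taken from the original post's code):
################################################
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_grad_ascent(X, y, alpha=0.001, max_iter=500):
    """Batch gradient ascent on the logistic regression log-likelihood."""
    X = np.asarray(X, dtype=float)       # shape: (n_samples, n_features)
    y = np.asarray(y, dtype=float)       # labels in {0, 1}
    weights = np.ones(X.shape[1])        # initialize all coefficients to 1
    for _ in range(max_iter):
        error = y - sigmoid(X @ weights)   # y_i - sigma(theta^T x_i) for every sample
        weights += alpha * (X.T @ error)   # move along the gradient of the log-likelihood
    return weights
################################################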
Note: because the cost function of the logistic regression in this post is a likelihood function, it must be maximized, so strictly speaking we use a gradient ascent algorithm. Its principle is the same as gradient descent; one seeks a maximum, the other a minimum. To maximize, move in the direction of the gradient; to minimize, move in the negative direction of the gradient. This does not affect the explanation, but I forgot to change the wording at the time; thanks to @wxltt for pointing it out in the comments below. In addition, maximizing the likelihood can be converted into a minimization problem by taking the negative logarithm. The comments in the code were also wrong: the code implements gradient ascent but was annotated as gradient descent. Sorry for the inconvenience, and I hope everyone will forgive me.
2.2. Stochastic gradient descent (SGD)
The gradient descent algorithm needs to traverse the whole data set (compute the regression error over the entire data set) every time the regression coefficients are updated, which is fine for small data sets. But with billions of samples and thousands of features, its computational complexity becomes too high. The improved method is to update the regression coefficients with only one sample point (its regression error) at a time. This is the stochastic gradient descent algorithm. Because the classifier can be updated incrementally when a new sample arrives (say we have trained a classifier h on database A and a new sample x arrives; a non-incremental learning algorithm must merge x into A to form a new database B and retrain a new classifier from scratch, whereas an incremental algorithm only needs to update the parameters of the existing classifier h with the new sample x), it is an online learning algorithm. In contrast to online learning, processing the whole data set at once is called batch processing.
The pseudo-code of the random gradient descent algorithm is as follows:
################################################
Initialize all regression coefficients to 1
Repeat the following steps until convergence {
    For each sample in the data set:
        Compute the gradient of that sample
        Update the regression coefficients by alpha * gradient
}
Return the regression coefficient values
################################################
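Under the same assumptions as the earlier sketch (NumPy imported, the sigmoid helper defined), a single stochastic pass might look like this; the function name is hypothetical:
################################################
def stochastic_grad_ascent(X, y, alpha=0.01):
    """One pass of stochastic gradient ascent: one sample per coefficient update."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    weights = np.ones(X.shape[1])                # initialize all coefficients to 1
    for i in range(X.shape[0]):
        error = y[i] - sigmoid(X[i] @ weights)   # error on this single sample
        weights += alpha * error * X[i]          # update from this sample alone
    return weights
################################################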
2.3. Improved stochastic gradient descent
To evaluate an optimization algorithm, we mainly look at whether it converges, that is, whether the parameters reach a stable value or keep changing, and at how fast the convergence is.
Over 200 iterations of the stochastic gradient descent algorithm (you may want to read sections 3 and 4 first and then come back here), the three regression coefficients evolve as follows. Our data set has 100 two-dimensional samples, and each sample triggers one adjustment of the coefficients, so there are 200*100 = 20,000 adjustments in total. Coefficient X2 reaches a stable value after about 50 iterations, but coefficients X1 and X0 only stabilize after about 100 iterations, and, annoyingly, X1 and X2 keep fluctuating in a rather naughty periodic way; even after many iterations they just won't settle down. The reason is that some sample points cannot be classified correctly, i.e., our data set is not linearly separable, whereas our logistic regression is a linear classification model that can do little about non-linearity. Our optimization routine, however, cannot recognize these abnormal sample points and treats them like all the others, adjusting the coefficients to reduce their classification error, which causes the coefficients to swing drastically from one iteration to the next. What we want is for the algorithm to avoid bouncing back and forth, so that it quickly stabilizes and converges to some value.
To avoid the fluctuation problem above, we make two improvements to the stochastic gradient descent algorithm:
1) At each iteration, adjust the update step size alpha. As the iterations proceed, alpha becomes smaller, which mitigates the high-frequency fluctuation of the coefficients (i.e., the coefficients changing too much from one iteration to the next, jumping with too large a span). Of course, to prevent alpha from shrinking to nearly 0 as the iterations proceed (at which point the coefficients would hardly be adjusted any more and iterating would be meaningless), we constrain alpha to always stay above a small constant term; see the code.
2) At each iteration, change the order in which the samples are used for optimization, i.e., pick samples at random to update the regression coefficients. This reduces the periodic fluctuations, because the sample order changes so that each iteration is no longer cyclical.
The pseudo code of the improved stochastic gradient descent algorithm is as follows:
################################################
Initialize all regression coefficients to 1
Repeat the following steps until convergence {
    For each sample in a randomly shuffled traversal of the data set:
        Decrease the value of alpha as the iterations proceed
        Compute the gradient of that sample
        Update the regression coefficients by alpha * gradient
}
Return the regression coefficient values
################################################
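A minimal sketch of these two tweaks, reusing the helpers from the earlier sketches; the step-size constants 4.0 and 0.01 are illustrative choices for the decaying alpha with a constant floor, not values taken from the original post:
################################################
def improved_stochastic_grad_ascent(X, y, num_iter=150):
    """Stochastic gradient ascent with a decaying step size and random sample order."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_samples, n_features = X.shape
    weights = np.ones(n_features)                     # initialize all coefficients to 1
    for j in range(num_iter):
        indices = np.random.permutation(n_samples)    # improvement 2: random sample order
        for k, i in enumerate(indices):
            alpha = 4.0 / (1.0 + j + k) + 0.01        # improvement 1: decaying alpha, floored at 0.01
            error = y[i] - sigmoid(X[i] @ weights)
            weights += alpha * error * X[i]
    return weights
################################################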
Comparing the original stochastic gradient descent with the improved version, you can see two differences:
1) The coefficients no longer fluctuate periodically. 2) The coefficients stabilize quickly, i.e., convergence is fast: here only 20 iterations are needed before convergence, whereas the stochastic gradient descent above needs 200 iterations to stabilize.