I. Introduction to Logistic Regression
Logistic regression, also known as logistic regression analysis, is a generalized linear model commonly used in data mining, automatic disease diagnosis, economic forecasting, and other fields.
Because logistic regression is a generalized linear model, it shares many similarities with multiple linear regression analysis.
The formula (the sigmoid function) is as follows:

$$ g(z) = \frac{1}{1 + e^{-z}} $$

Its graph is the S-shaped sigmoid curve:
We can see from the graph that the output of logistic regression lies in (0, 1). When the input is 0, the output is 0.5; as the input falls further below 0, the output gets closer and closer to 0; conversely, as the input rises further above 0, the output gets closer and closer to 1.
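To make this concrete, here is a minimal Python sketch (not from the original post) that evaluates the sigmoid at a few points:

```python
import numpy as np

def sigmoid(z):
    """The logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# 0.5 at z = 0; approaches 0 for large negative z, 1 for large positive z.
for z in [-10, -1, 0, 1, 10]:
    print(f"sigmoid({z:>3}) = {sigmoid(z):.4f}")
```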
We usually use linear regression to predict continuous values. Although logistic regression carries the word "regression" in its name, it is usually used to solve binary classification problems.
When its output is greater than 0.5, we can say the sample belongs to one class (the positive class); when the output is less than 0.5, the sample belongs to the other class (the negative class).
However, because a data sample usually has multiple features, we cannot feed it into the logistic formula directly. We first use the linear regression described earlier to combine the sample's feature values into a single value z, which is then substituted into the formula above. The expression for z is as follows:

$$ z = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^T x $$

Substituting z gives the full logistic regression expression for a data sample:

$$ h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}} $$
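As a hedged sketch in Python (the θ and x values here are made-up examples, with x[0] = 1 serving as the intercept term):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, x):
    """h_theta(x) = g(theta^T x): estimated probability of the positive class."""
    return sigmoid(np.dot(theta, x))

def predict(theta, x):
    """Classify with the 0.5 threshold described above."""
    return 1 if predict_proba(theta, x) > 0.5 else 0

theta = np.array([-1.0, 0.5, 2.0])   # hypothetical learned parameters
x = np.array([1.0, 2.0, 0.3])        # one sample: intercept + two features
print(predict_proba(theta, x), predict(theta, x))
```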
With the above, we can perform logistic regression analysis on arbitrary data. But one problem remains: the value of θ. Only when the θ in the formula is known can we apply the formula to unclassified data. So how is θ calculated?
Take a look at the formula derivation below.
II. Logistic Regression Formula Derivation
As mentioned above, we need to obtain θ; here we analyze in detail how to do so.
In machine learning we usually have a process called training. Training means using data whose classification (or label) is known to obtain a model (or classifier), and then using that model to label (or classify) data whose label is unknown.
So we use samples (data whose classification is known) to make a series of estimates and obtain θ. In probability theory this process is called parameter estimation.
Here we use the derivation of maximum likelihood estimation to obtain the formula for calculating θ:
(1) First we let:

$$ P(y=1 \mid x; \theta) = h_\theta(x), \qquad P(y=0 \mid x; \theta) = 1 - h_\theta(x) $$

(2) Combine the two expressions above into one:

$$ P(y \mid x; \theta) = h_\theta(x)^{y} \bigl(1 - h_\theta(x)\bigr)^{1-y} $$

(3) Write the likelihood function:

$$ L(\theta) = \prod_{i=1}^{m} P\bigl(y^{(i)} \mid x^{(i)}; \theta\bigr) = \prod_{i=1}^{m} h_\theta\bigl(x^{(i)}\bigr)^{y^{(i)}} \bigl(1 - h_\theta(x^{(i)})\bigr)^{1-y^{(i)}} $$

(4) Take the logarithm of the likelihood function:

$$ \ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} \Bigl[ y^{(i)} \log h_\theta\bigl(x^{(i)}\bigr) + \bigl(1-y^{(i)}\bigr) \log \bigl(1 - h_\theta(x^{(i)})\bigr) \Bigr] $$
(5) The θ at which the likelihood function reaches its maximum can be taken as the parameters of the model. To find that maximum we could use gradient ascent, but with a slight manipulation of the likelihood function we can turn the problem into gradient descent and then solve it with the gradient descent method. The transformed expression is as follows:

$$ J(\theta) = -\frac{1}{m} \ell(\theta) $$

(Because a negative coefficient is multiplied in, gradient ascent becomes gradient descent.)
(6) Since we want to update the current θ to obtain a new θ, we need to know the update direction (that is, whether the current θ should have a number added to it or subtracted from it to move toward the final result). Taking the derivative of J(θ) gives the update direction. (Why does the derivative give the update direction, and why does the update then follow the formula below? See the derivation of the gradient descent formula further down.) The derivation process is as follows:
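A reconstruction of the standard steps, using the sigmoid identity $g'(z) = g(z)\,(1 - g(z))$:

$$
\begin{aligned}
\frac{\partial J(\theta)}{\partial \theta_j}
&= -\frac{1}{m}\sum_{i=1}^{m}\left[\frac{y^{(i)}}{h_\theta(x^{(i)})} - \frac{1-y^{(i)}}{1-h_\theta(x^{(i)})}\right] h_\theta\bigl(x^{(i)}\bigr)\bigl(1-h_\theta(x^{(i)})\bigr)\, x_j^{(i)} \\
&= -\frac{1}{m}\sum_{i=1}^{m}\Bigl[ y^{(i)}\bigl(1-h_\theta(x^{(i)})\bigr) - \bigl(1-y^{(i)}\bigr)\, h_\theta\bigl(x^{(i)}\bigr) \Bigr]\, x_j^{(i)} \\
&= \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x_j^{(i)}
\end{aligned}
$$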
(7) After obtaining the update direction, we can iterate with the following formula until convergence to get the final result:

$$ \theta_j := \theta_j - \alpha\, \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x_j^{(i)} $$
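As an illustration, here is a minimal batch implementation of this update rule in Python (the synthetic data, learning rate, and iteration count are assumptions of this sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, alpha=0.1, n_iters=1000):
    """Fit theta by iterating theta := theta - alpha * (1/m) * X^T (h - y)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)           # h_theta(x) for all m samples at once
        gradient = X.T @ (h - y) / m     # dJ/dtheta_j from step (6)
        theta -= alpha * gradient        # update rule from step (7)
    return theta

# Tiny synthetic example; the first column of ones is the intercept term.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = train_logistic(X, y)
print("theta:", theta)
print("predictions:", (sigmoid(X @ theta) > 0.5).astype(int))
```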
III. Derivation of the Gradient Descent Formula
To find the optimal value of a function (its maximum or minimum), in mathematics we usually take the derivative of the function, set it equal to 0, and solve the resulting equation to get the result directly. But in machine learning our functions are often high-dimensional and high-order, so the equation obtained by setting the derivative to 0 is hard to solve directly (sometimes it cannot be solved at all). Other methods are therefore needed to get the result, and gradient descent is one of them.
Take one of the simplest functions, y = x²: how do we find the x at which y is minimal (without solving 2x = 0)?
(1) First pick a value for x, such as x = -4; this gives a corresponding value of y.
(2) Find the update direction. (If we update x without regard to direction, such as x - 0.5 or x + 0.5, we get the situation shown in the graph below.)
We can see that if we update x in the negative direction, we move away from the final result; here we should update in the positive direction. So before updating x we need to find the update direction. (This direction is not fixed; it depends on the current value. For example, when x = 4, the update should be in the negative direction.)
We take the derivative of the function at the current point: y' = 2x, so at x = -4 we get y' = -8. The update direction is given by y': to update x we simply compute x := x - αy', where α (greater than 0) is the update step size, called the learning rate in machine learning.
PS: What was said earlier is that a high-dimensional, high-order equation cannot be solved, not that it cannot be differentiated; so we can still take the derivative and plug in the current x.
(3) Repeat steps (1) and (2) above until x converges.
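A tiny Python sketch of these three steps for y = x² (the learning rate and iteration count are arbitrary choices for illustration):

```python
def gradient_descent_1d(x=-4.0, alpha=0.1, n_iters=50):
    """Minimize y = x^2 by repeating x := x - alpha * y', where y' = 2x."""
    for _ in range(n_iters):
        grad = 2 * x          # derivative of x^2 at the current point
        x = x - alpha * grad  # step against the gradient
    return x

print(gradient_descent_1d())  # converges toward the minimum at x = 0
```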
Gradient Descent Method:

$$ \theta_j := \theta_j - \alpha\, \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x_j^{(i)} $$

For this formula, if:
(1) If m is the total number of samples, i.e., every update iteration considers all samples, the method is called batch gradient descent (BGD). This method easily reaches the global optimum, but when the number of samples is large the training process is very slow; use it when the sample count is small.
(2) If m = 1, i.e., each update iteration considers only one sample, the formula is called stochastic gradient descent (SGD). This method trains quickly, but accuracy drops and the result is not necessarily the global optimum. For example, for the function shown below, a start at x = 9.5 ends up at a local optimum rather than the global one:
(3) So, combining the two methods, when m is a portion of the total sample count (say m = 10), i.e., each update iteration considers a small batch of samples, the formula is called mini-batch gradient descent (MBGD). It overcomes the disadvantages of the two methods above while keeping their advantages, and it is the most commonly used in practice (see the sketch after this list).
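A hedged sketch of the mini-batch variant (the shuffling scheme, batch size, and synthetic data are assumptions of this sketch; batch_size=1 would recover SGD, and batch_size=m would recover BGD):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_minibatch(X, y, alpha=0.1, batch_size=10, n_epochs=100, seed=0):
    """Mini-batch gradient descent: each update uses only batch_size samples."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        order = rng.permutation(m)                    # reshuffle every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            h = sigmoid(X[idx] @ theta)
            theta -= alpha * X[idx].T @ (h - y[idx]) / len(idx)
    return theta

# Synthetic data: label is 1 when the second feature exceeds 1.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(1.0, 1.0, 200)])
y = (X[:, 1] > 1.0).astype(float)
print("theta:", train_minibatch(X, y))
```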