1. The main idea of linear regression is to fit a straight line through historical data and then use that line to predict new data. (For example: samples of classes A and B lie on opposite sides of a linear function.)
2. In the real world an event (outcome) is influenced by many factors, so we need a multivariate linear function to describe it.
3. Multivariate linear function: a multivariable analysis of the relationship between a two-class observation and several influential factors (x1, x2, x3, ..., xn). For example, in medicine, a patient's symptoms are used to determine whether the patient suffers from a certain disease.
4. Multivariate linear regression formula:
   z = θ0 + θ1·x1 + θ2·x2 + ... + θn·xn = θ^T x
5. Sigmoid function:
   g(z) = 1 / (1 + e^(-z))
Substituting the multivariate linear function z into the sigmoid function gives the generalized linear regression (logistic regression) model:
   hθ(x) = g(θ^T x) = 1 / (1 + e^(-θ^T x))
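The sigmoid and the model hθ(x) can be sketched in a few lines of Python (the function names `sigmoid` and `h` are illustrative, not from the original):

```python
import math

def sigmoid(z):
    """Sigmoid g(z) = 1 / (1 + e^(-z)); maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    """Logistic model h_theta(x) = g(theta^T x), with theta[0] as the intercept."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return sigmoid(z)
```

For example, `h([0.0, 1.0], [0.0])` evaluates the sigmoid at z = 0 and returns 0.5, the midpoint.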
6. The output of the sigmoid function lies in (0, 1) with midpoint 0.5, so the sigmoid output can be interpreted as a probability for the sample data.
Because the output hθ(x) lies in (0, 1), it indicates the probability that the data belongs to a certain class, for example:
hθ(x) < 0.5 indicates that the current data belongs to class A
hθ(x) > 0.5 indicates that the current data belongs to class B
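The 0.5 threshold rule above can be sketched as follows (the class labels and the tie-breaking toward 'A' are illustrative choices):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def classify(theta, x):
    """Return 'B' when h_theta(x) > 0.5, otherwise 'A'."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    p = sigmoid(z)
    return 'B' if p > 0.5 else 'A'
```

With theta = [0.0, 1.0], a positive feature value gives z > 0, hence a probability above 0.5 and class 'B'.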
7. How to use the generalized linear regression model
Consider the vector x = (x1, x2, x3, ..., xn) of n independent variables, and let the conditional probability P(y=1|x) = p be the probability that the event occurs given the observations. Then the logistic regression model can be expressed as
   p = P(y=1|x) = 1 / (1 + e^(-(w0 + w1·x1 + ... + wn·xn)))
So the ratio of the probability that the event occurs to the probability that it does not occur is
   p / (1 - p) = e^(w0 + w1·x1 + ... + wn·xn)
This ratio is called the odds of the event; taking its logarithm gives
   ln(p / (1 - p)) = w0 + w1·x1 + ... + wn·xn
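The relation between the odds and the linear part can be checked numerically: when p comes from the sigmoid, the log-odds recover the linear term z exactly (the value z = 1.7 below is illustrative):

```python
import math

def log_odds(p):
    """Logit: ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# For a probability produced by the sigmoid, the logit recovers z.
z = 1.7
p = 1.0 / (1.0 + math.exp(-z))   # sigmoid(z)
```

Here `log_odds(p)` equals z up to floating-point error, illustrating that the sigmoid and the logit are inverses of each other.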
If there are m observation samples with observed values y1, y2, y3, ..., ym, and pi = P(yi = 1 | xi) is the probability that yi = 1 under the given conditions, then the probability that yi = 0 is P(yi = 0 | xi) = 1 - pi, so the probability of obtaining one observation is
   P(yi) = pi^yi · (1 - pi)^(1 - yi)
Because the observation samples are independent of each other, their joint distribution is the product of the marginal distributions, which gives the likelihood function
   L(w) = ∏ pi^yi · (1 - pi)^(1 - yi)   (product over i = 1, ..., m)
Our goal is then the maximum likelihood estimate of the parameters: find w0, w1, w2, w3, ..., wn such that L(w) takes its maximum value. Taking the logarithm of L(w) gives
   ln L(w) = Σ [ yi·ln(pi) + (1 - yi)·ln(1 - pi) ]   (sum over i = 1, ..., m)
Substituting pi = hθ(xi), the final form is
   ln L(w) = Σ [ yi·ln(hθ(xi)) + (1 - yi)·ln(1 - hθ(xi)) ]
where yi is the true value and hθ(xi) is the predicted value.
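The log-likelihood above can be computed directly; a minimal sketch (function and variable names are illustrative):

```python
import math

def log_likelihood(w, X, y):
    """ln L(w) = sum_i [ y_i ln p_i + (1 - y_i) ln(1 - p_i) ],
    where p_i = sigmoid(w0 + w . x_i)."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
        p = 1.0 / (1.0 + math.exp(-z))
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total
```

A weight vector that fits the data better yields a higher (less negative) log-likelihood, which is exactly what maximum likelihood estimation exploits.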
8. Determining the optimal regression coefficients, i.e., the process of training on the data set.
The steps to find the best regression coefficients are as follows:
1. List the classification function: when h(x) > 0 the data is class B, and when h(x) < 0 it is class A (h(x) > 0 corresponds to a sigmoid output above 0.5).
(θ refers to the regression coefficients; in practice the result is often passed through a sigmoid transformation.)
2. Give the error estimate function corresponding to the classification function:
   J(θ) = -(1/m) · Σ [ yi·ln(hθ(xi)) + (1 - yi)·ln(1 - hθ(xi)) ]
(m is the number of samples)
Only the θ vector that makes the error estimate function J(θ) take its minimum value is the best regression coefficient vector.
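A sketch of an error estimate function of this kind, assuming the cross-entropy form, i.e. the negative average log-likelihood (the name `cost` is illustrative):

```python
import math

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum_i [ y_i ln h(x_i) + (1 - y_i) ln(1 - h(x_i)) ]."""
    m = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        z = theta[0] + sum(t * xj for t, xj in zip(theta[1:], xi))
        hx = 1.0 / (1.0 + math.exp(-z))   # sigmoid of the linear part
        total += yi * math.log(hx) + (1 - yi) * math.log(1 - hx)
    return -total / m
```

With all-zero coefficients every prediction is 0.5, so J(θ) equals ln 2 regardless of the labels; training should drive J(θ) below this baseline.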
3. Use the gradient descent method (or least squares) to obtain the value of θ that minimizes the error function. The update from the previous state to the next state is
   θj := θj - α · ∂J(θ)/∂θj
(α is the step size, i.e., the learning rate).
For convenience, the update above is written for a single sample; in practice the gradient is a sum over multiple samples (unless the stochastic gradient ascent algorithm described later is used). If the minus sign in the error function of step 2 is dropped, the minimization problem becomes a maximization problem, and the gradient descent method becomes a gradient ascent method.