Reprint: please credit the source: http://www.cnblogs.com/ymingjingr/p/4271742.html
Directory:
Machine Learning Cornerstone Note 1 -- When Can You Use Machine Learning (1)
Machine Learning Cornerstone Note 2 -- When Can You Use Machine Learning (2)
Machine Learning Cornerstone Note 3 -- When Can You Use Machine Learning (3) (revised version)
Machine Learning Cornerstone Note 4 -- When Can You Use Machine Learning (4)
Machine Learning Cornerstone Note 5 -- Why Can Machines Learn (1)
Machine Learning Cornerstone Note 6 -- Why Can Machines Learn (2)
Machine Learning Cornerstone Note 7 -- Why Can Machines Learn (3)
Machine Learning Cornerstone Note 8 -- Why Can Machines Learn (4)
Machine Learning Cornerstone Note 9 -- How Can Machines Learn (1)
Machine Learning Cornerstone Note 10 -- How Can Machines Learn (2)
Machine Learning Cornerstone Note 11 -- How Can Machines Learn (3)
Machine Learning Cornerstone Note 12 -- How Can Machines Learn (4)
Machine Learning Cornerstone Note 13 -- How Can Machines Learn Better (1)
Machine Learning Cornerstone Note 14 -- How Can Machines Learn Better (2)
Machine Learning Cornerstone Note 15 -- How Can Machines Learn Better (3)
Machine Learning Cornerstone Note 16 -- How Can Machines Learn Better (4)

Logistic Regression
This lecture covers logistic regression (the most common translation of the term).
10.1 Logistic Regression Problem
The heart disease recurrence problem was analyzed earlier as a binary classification: the output space contains only two values, {+1, -1}, indicating recurrence and non-recurrence respectively. In the presence of noise, the target function f can be expressed through the target distribution P, as shown in Equation 10-1; the corresponding machine learning flow is shown in Figure 10-1.
f(x) = sign( P(+1|x) - 1/2 ) ∈ {+1, -1}    (Equation 10-1)
Figure 10-1 Machine learning flow for the binary classification of heart disease recurrence
In practice, however, a doctor does not tell the patient with certainty that the heart disease will or will not recur; instead the patient is told the likelihood of recurrence as a probability. As in Figure 10-2, for example, the probability that the patient's heart disease recurs is 80%.
Figure 10-2 The possibility of recurrence expressed as a probability
This setting is called soft binary classification. The target function f is expressed as in Equation 10-2; its output is a probability and therefore lies between 0 and 1.
f(x) = P(+1|x) ∈ [0, 1]    (Equation 10-2)
For a target function like Equation 10-2, the ideal data set D (inputs together with outputs) would look like Figure 10-3.
Figure 10-3 Ideal Data Set D
In the ideal data set every output is a probability. In reality, taking heart disease recurrence as the example, a patient's record shows only one of two outcomes, recurrence or no recurrence; a medical history cannot record the probability of recurrence. The actual training data therefore look like Figure 10-4.
Figure 10-4 Actual training data
The actual training data can be considered as ideal training data with noise.
The question is how to use these actual training data to solve the soft binary classification problem, that is, how to design the hypothesis function.
First, recall the part that the two hypothesis functions mentioned in earlier chapters (binary classification and linear regression) have in common.
The answer is the weighted total score s computed over the attributes (do you remember the meaning of the weighted sum from Chapter 2?), which can be written as Equation 10-3.
s = Σ_{i=0}^{d} w_i x_i = w^T x    (Equation 10-3)
How can this score, which ranges over all real numbers, be converted into a value between 0 and 1? This is where the topic of this chapter, the logistic function, comes in. The larger the score s, the higher the risk; the smaller the score s, the lower the risk. The hypothesis function h is given in Equation 10-4, and its curve is shown in Figure 10-5.
h(x) = θ(s), where s = w^T x and θ is the logistic function    (Equation 10-4)
Figure 10-5 Curve of the logistic function
The mathematical expression of the specific logistic function is shown in Equation 10-5.
θ(s) = e^s / (1 + e^s) = 1 / (1 + e^{-s})    (Equation 10-5)
A few special values can be checked: substituting negative infinity gives 0, substituting 0 gives 1/2, and substituting positive infinity gives 1. The logistic function thus maps the entire real line onto the interval between 0 and 1.
Looking at the graph, the function is smooth (differentiable everywhere), monotonic, and S-shaped, and is therefore also called the sigmoid function.
Using the expression of the logistic function, the hypothesis function for soft binary classification can be rewritten as Equation 10-6.
h(x) = θ(w^T x) = 1 / (1 + exp(-w^T x))    (Equation 10-6)
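To make Equations 10-5 and 10-6 concrete, here is a minimal Python sketch (not part of the original notes; the names sigmoid and h are chosen only for this illustration):

import numpy as np

def sigmoid(s):
    # Logistic function of Equation 10-5: theta(s) = 1 / (1 + exp(-s))
    return 1.0 / (1.0 + np.exp(-s))

def h(w, x):
    # Hypothesis of Equation 10-6: theta(w^T x)
    return sigmoid(np.dot(w, x))

# The special values mentioned above: very negative scores map close to 0,
# a score of 0 maps to 0.5, and very positive scores map close to 1.
print(sigmoid(-50.0), sigmoid(0.0), sigmoid(50.0))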
10.2 Logistic Regression Error
Compare logistic regression with the binary classification and linear regression models learned earlier, as shown in Figure 10-7.
Figure 10-7 Comparison of binary classification, linear regression and logistic regression
The score s appears in every hypothesis function. The first two learning models have their corresponding error measures, the 0/1 error and the squared error; how the error function err of logistic regression should be defined is the subject of this section.
Starting from the target function of logistic regression, Equation 10-7 can be derived.
P(y|x) = f(x) when y = +1;  P(y|x) = 1 - f(x) when y = -1    (Equation 10-7)
The upper case inside the braces is easy to understand: it is simply the target function f(x) = P(+1|x) rearranged. The lower case is also straightforward, because the probabilities of +1 and -1 must sum to 1.
Assume there is a data set D = {(x_1, y_1), ..., (x_N, y_N)}. The probability that the target function f generates this data sample can be expressed as Equation 10-8.
probability(D) = P(x_1) P(y_1|x_1) × P(x_2) P(y_2|x_2) × ... × P(x_N) P(y_N|x_N)    (Equation 10-8)
That is, the product, over all input samples, of the probability of generating each input together with its output label. Using Equation 10-7, Equation 10-8 can be rewritten as Equation 10-9.
probability(D) = ∏_{n=1}^{N} P(x_n) P(y_n|x_n), where P(y_n|x_n) = f(x_n) when y_n = +1 and 1 - f(x_n) when y_n = -1    (Equation 10-9)
But the function f is unknown; only the hypothesis function h is available. Can h be substituted for f in Equation 10-9? What would doing so mean? It means computing how likely it is that the hypothesis h generates the same data sample D; in mathematics this quantity is called the likelihood. The formula after the substitution is shown in Equation 10-10.
likelihood(h) = ∏_{n=1}^{N} P(x_n) × ( h(x_n) when y_n = +1, 1 - h(x_n) when y_n = -1 )    (Equation 10-10)
If the hypothesis h is very close to the unknown function f (that is, the error is very small), then the likelihood that h generates the data sample D is close to the probability that f generates the same data D. And since f actually did generate the data sample D, that probability can be assumed to be large. The best hypothesis g can therefore be taken to be the most likely hypothesis h, as expressed in Equation 10-11.
g = argmax_h likelihood(h)    (Equation 10-11)
When the hypothesis h is the logistic function of Equation 10-6, it has the special property shown in Equation 10-12.
1 - h(x) = h(-x)    (Equation 10-12)
(This follows directly from the definition: 1 - θ(s) = 1 - 1/(1 + e^{-s}) = e^{-s}/(1 + e^{-s}) = 1/(1 + e^{s}) = θ(-s).)
So Equation 10-10 can be rewritten as Equation 10-13.

likelihood(h) = ∏_{n=1}^{N} P(x_n) h(y_n x_n)    (Equation 10-13)
Note that when looking for the maximum, the factors P(x_n) have no effect, because every hypothesis function is multiplied by the same factors. The likelihood of h therefore depends only on the product of h evaluated at each sample, as in Equation 10-14.
max_h ∏_{n=1}^{N} h(y_n x_n)    (Equation 10-14)
The label y_n has been moved inside the hypothesis function, replacing the separate plus and minus cases and making the whole expression more concise. We are looking for the hypothesis h with the largest likelihood, so starting from Equation 10-14 and applying a series of transformations we obtain Equation 10-15.
max_h ∏_{n=1}^{N} h(y_n x_n)
⇔ max_w ∏_{n=1}^{N} θ(y_n w^T x_n)
(the hypothesis function h corresponds one-to-one to the weight vector w)
⇔ max_w Σ_{n=1}^{N} ln θ(y_n w^T x_n)
(a product is hard to maximize, so the logarithm, with the natural base e, is taken)
⇔ min_w (1/N) Σ_{n=1}^{N} -ln θ(y_n w^T x_n)
(previous chapters always dealt with minimization, so a minus sign turns the maximization into a minimization; a factor 1/N is also added to resemble the earlier error measures)
⇔ min_w (1/N) Σ_{n=1}^{N} ln( 1 + exp(-y_n w^T x_n) )
(substituting the expression of θ from Equation 10-5 gives the result above)

E_in(w) = (1/N) Σ_{n=1}^{N} ln( 1 + exp(-y_n w^T x_n) )    (Equation 10-15)
The error function in Equation 10-15 is called the cross-entropy error.
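As an illustration only (a sketch under the assumption that the labels are +1/-1, not code from the original notes), the cross-entropy error of Equation 10-15 can be computed as follows:

import numpy as np

def cross_entropy_error(w, X, y):
    # E_in(w) of Equation 10-15: (1/N) * sum_n ln(1 + exp(-y_n * w^T x_n)),
    # where X is an N x d matrix of inputs and y a length-N vector of +1/-1 labels.
    scores = y * (X @ w)                      # y_n * w^T x_n for every sample
    # np.logaddexp(0, -s) computes ln(1 + exp(-s)) in a numerically stable way
    return np.mean(np.logaddexp(0.0, -scores))

# Tiny check: with w = 0 every summand is ln(1 + exp(0)) = ln 2 ≈ 0.693.
X = np.array([[1.0, 2.0], [1.0, -1.0]])
y = np.array([1.0, -1.0])
print(cross_entropy_error(np.zeros(2), X, y))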
10.3 Gradient of Logistic Regression Error
Having derived the error measure of logistic regression, the next task is to find the weight vector w that minimizes E_in(w), whose expression is shown in Equation 10-16.
E_in(w) = (1/N) Σ_{n=1}^{N} ln( 1 + exp(-y_n w^T x_n) )    (Equation 10-16)
Careful inspection of this formula shows that E_in(w) is a continuous, differentiable, convex function of w, so its minimum is attained where the gradient is zero. How is that gradient computed? To take the partial derivative with respect to each component of the weight vector w, the chain rule of differentiation is applied to this fairly complicated formula. The complicated sub-expressions of Equation 10-16 are denoted by temporary symbols; to emphasize their temporary nature, shapes such as ○ and □ are used instead of letters, as in Equation 10-17.
○ = -y_n w^T x_n,   □ = 1 + exp(○),   so that E_in(w) = (1/N) Σ_{n=1}^{N} ln □    (Equation 10-17)
The computation of the partial derivative with respect to a single component w_i of the weight vector w is shown in Equation 10-18.
∂E_in(w)/∂w_i = (1/N) Σ_{n=1}^{N} (1/□) · exp(○) · (-y_n x_{n,i}) = (1/N) Σ_{n=1}^{N} θ(○) (-y_n x_{n,i})    (Equation 10-18)
Here θ is the logistic function described in Section 10.1. The whole gradient can therefore be written as Equation 10-19.
∇E_in(w) = (1/N) Σ_{n=1}^{N} θ(-y_n w^T x_n) (-y_n x_n)    (Equation 10-19)
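A corresponding sketch of the gradient in Equation 10-19 (again only a hypothetical illustration, with the same X and y conventions as before):

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def gradient(w, X, y):
    # Gradient of Equation 10-19: (1/N) * sum_n theta(-y_n w^T x_n) * (-y_n x_n)
    scores = y * (X @ w)                     # y_n * w^T x_n
    theta = sigmoid(-scores)                 # per-sample weights theta(-y_n w^T x_n)
    return (X * (-y * theta)[:, None]).mean(axis=0)

# Tiny check: with w = 0 every weight is theta(0) = 0.5.
X = np.array([[1.0, 2.0], [1.0, -1.0]])
y = np.array([1.0, -1.0])
print(gradient(np.zeros(2), X, y))           # prints [ 0.   -0.75]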
Once the gradient is available, since E_in(w) is convex, the weight vector that makes the gradient zero is the w that minimizes E_in(w).
Observe Equation 10-19: the gradient is a weighted sum of the vectors -y_n x_n, with θ(-y_n w^T x_n) acting as the weights.
Consider a special case in which all of these weights are zero, that is, every θ(-y_n w^T x_n) is zero. This requires every -y_n w^T x_n to approach negative infinity, which means every y_n has the same sign as w^T x_n; in other words, the data are linearly separable.
Outside this special case, the equation "weighted sum equals zero" is nonlinear in w, so no closed-form solution like the one used for linear regression is available. How, then, can the minimum be found?
Recall the method used by the earliest algorithm, PLA: it solves the problem iteratively, and its update steps can be combined into the form of Equation 10-20.
w_{t+1} ← w_t + [[ sign(w_t^T x_n) ≠ y_n ]] · y_n x_n    (Equation 10-20)
(where [[·]] equals 1 when the condition inside holds and 0 otherwise)
When the sample is classified correctly, the bracket is 0 and the weight vector does not change; when a mistake is made, y_n x_n is added to it. Introducing some extra symbols, the formula can be written more generically, as shown in Equation 10-21.
w_{t+1} ← w_t + η v    (Equation 10-21)
For PLA the update is simply multiplied by 1: η denotes the step size of each update, and the correction term of PLA is represented by v, which indicates the direction of the update. Algorithms of this kind are called iterative optimization approaches.
10.4 Gradient Descent
Minimizing E_in(w) for logistic regression also uses the iterative optimization approach of the previous section: the weight vector is changed step by step until the weight vector that minimizes E_in is found. The update formula of the iterative optimization approach is shown in Equation 10-22.
w_{t+1} ← w_t + η v    (Equation 10-22)
For the logistic regression problem, how to design the parameters η and v in this formula is the main question this section answers.
Recall PLA, whose parameters come from correcting misclassified samples; for logistic regression we instead examine the characteristics of E_in(w) and design a method that finds the optimal weight vector quickly.
Figure 10-8 shows E_in(w) for logistic regression: it is a smooth, differentiable, convex function of the weight vector w, and the point at the bottom of the valley corresponds to the best w, the one that makes E_in smallest. How should the parameters η and v be chosen so that the update formula reaches that point quickly?
Figure 10-8 Logistic regression
For a clear division of labour, v is taken to be a unit vector that represents only the direction, while η represents the step size of each update. With η fixed, how should the direction be chosen so that the update makes the fastest progress? By moving in the steepest downhill direction. That is, with η fixed, find the direction v that decreases E_in fastest, as shown in Equation 10-23.
min_{||v||=1} E_in(w_t + η v)    (Equation 10-23)
Equation 10-23 is a nonlinear optimization problem with a constraint, and it is still very difficult to solve directly. Consider converting it into an approximate formula and finding the minimizer of the approximation instead; the tool used here is the Taylor expansion. Recall the Taylor formula in one dimension, shown in Equation 10-24.
f(x) ≈ f(x_0) + (x - x_0) f'(x_0),   for x close to x_0    (Equation 10-24)
Similarly, when η is very small, Equation 10-23 can be written as a multidimensional Taylor expansion, as shown in Equation 10-25.
E_in(w_t + η v) ≈ E_in(w_t) + η v^T ∇E_in(w_t)    (Equation 10-25)
Here w_t plays the role of x_0 and η v the role of (x - x_0) in Equation 10-24. In plain words, a small piece of the original curve is treated as a straight segment: near any point the curve can be viewed as a very short line segment.
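A quick numerical sanity check of the linear approximation of Equation 10-25 on hypothetical random data (a sketch, not from the original notes; e_in and grad simply repeat the formulas derived above):

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def e_in(w, X, y):
    return np.mean(np.log(1.0 + np.exp(-y * (X @ w))))                    # Equation 10-16

def grad(w, X, y):
    return (X * (-y * sigmoid(-y * (X @ w)))[:, None]).mean(axis=0)       # Equation 10-19

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.where(rng.normal(size=20) > 0, 1.0, -1.0)
w_t = rng.normal(size=3)
v = rng.normal(size=3)
v /= np.linalg.norm(v)                       # unit direction, as required in Equation 10-23

for eta in (1e-1, 1e-2, 1e-3):
    exact = e_in(w_t + eta * v, X, y)
    approx = e_in(w_t, X, y) + eta * (v @ grad(w_t, X, y))
    print(eta, abs(exact - approx))          # the gap shrinks as eta shrinks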
Thus, solving for the v that minimizes Equation 10-26 can be regarded as an approximate way of solving for the v that minimizes Equation 10-23.
min_{||v||=1} [ E_in(w_t) + η v^T ∇E_in(w_t) ]    (Equation 10-26)
In Equation 10-26, E_in(w_t) is a known value and η is a given value greater than 0, so the problem of minimizing Equation 10-26 can be converted into the problem of minimizing Equation 10-27.

min_{||v||=1} v^T ∇E_in(w_t)    (Equation 10-27)
The inner product of two vectors is smallest when they point in opposite directions (the product is then as negative as possible), and since v is a unit vector, the best direction is given by Equation 10-28.

v = -∇E_in(w_t) / ||∇E_in(w_t)||    (Equation 10-28)
For very small η, substituting Equation 10-28 into Equation 10-22 gives the concrete update formula, Equation 10-29.

w_{t+1} ← w_t - η ∇E_in(w_t) / ||∇E_in(w_t)||    (Equation 10-29)
This update formula means that at each step the weight vector w moves a small step in the direction opposite to the gradient; updating in this way quickly finds the w that minimizes E_in. The approach is called gradient descent, abbreviated GD, and it is a common and simple method.
Having settled the choice of the direction v, now look back at the effect of the value of the given parameter η on gradient descent, shown in Figure 10-9.
Figure 10-9 The effect of the step size η on gradient descent
On the left of Figure 10-9, η is too small: the descent works but is very slow, so the search for the best w takes a long time. In the middle of Figure 10-9, η is too large: the descent is unstable, and the error may even go up instead of down. The appropriate choice, on the far right, is to reduce η as the gradient becomes smaller, that is, to make η variable and proportional to the size of the gradient.
Based on the condition that η should be proportional to the size of the gradient, η can be re-expressed as in Equation 10-30, where η' is a new fixed constant.

η = η' · ||∇E_in(w_t)||    (Equation 10-30)
Substituting this into Equation 10-29, the final update formula can be written as Equation 10-31.

w_{t+1} ← w_t - η' ∇E_in(w_t)    (Equation 10-31)
The constant η' is called a fixed learning rate, and Equation 10-31 is gradient descent with a fixed learning rate.
The steps of the logistic regression learning algorithm are as follows (a minimal code sketch follows the steps):
Initialize the weight vector to w_0, and fix the number of iterations T and the learning rate η;
Compute the gradient ∇E_in(w_t);
Update the weight vector: w_{t+1} ← w_t - η ∇E_in(w_t);
Repeat the two steps above until the gradient is (close to) zero or the number of iterations reaches T, and return the final w as g.
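A minimal sketch of the whole procedure, assuming inputs X (an N x d matrix), labels y in {+1, -1} and a fixed learning rate; this is an illustration, not the author's code:

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def logistic_regression_gd(X, y, eta=0.1, T=1000):
    w = np.zeros(X.shape[1])                 # initial weight vector w_0
    for _ in range(T):                       # repeat for T iterations
        # gradient of Equation 10-19
        grad = (X * (-y * sigmoid(-y * (X @ w)))[:, None]).mean(axis=0)
        # fixed-learning-rate update of Equation 10-31
        w = w - eta * grad
    return w                                 # return the final w as g

# Hypothetical toy data, only to show the call.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = np.where(sigmoid(X @ np.array([1.5, -2.0, 0.5])) > rng.uniform(size=100), 1.0, -1.0)
print(logistic_regression_gd(X, y))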