Recall logistic regression. The training set is $\{(x^{(1)},y^{(1)}),\dots,(x^{(m)},y^{(m)})\}$, where each input $x^{(i)}\in\mathbb{R}^{n+1}$ (the original features are $n$-dimensional; one extra dimension, fixed to 1, is added for the intercept term) and the labels are $y^{(i)}\in\{0,1\}$ (the labels could just as well be written as $\pm 1$ or any other pair of values; this does not change the model, as long as the problem is binary). The hypothesis is

$$h_\theta(x)=\frac{1}{1+e^{-\theta^T x}}.$$

After training the parameter $\theta$ of this model, given a new input $x$ we can use the model to predict the probability that its label is 1 or 0.
That handles binary problems. We now want to extend it to the multiclass case: based on a trained model, for a new input $x$ we predict the probability of each of the labels $y=1,2,\dots,K$. The first and most important step is to decide what this new model should be. For a given $x$, the $K$ predicted probabilities $P(y=j\mid x)$, $j=1,2,\dots,K$, must add up to 1.
We assume the new model (the softmax hypothesis) is

$$h_\theta(x)=\begin{bmatrix} P(y=1\mid x;\theta) \\ P(y=2\mid x;\theta) \\ \vdots \\ P(y=K\mid x;\theta) \end{bmatrix}=\frac{1}{\sum_{j=1}^{K} e^{\theta_j^T x}}\begin{bmatrix} e^{\theta_1^T x} \\ e^{\theta_2^T x} \\ \vdots \\ e^{\theta_K^T x} \end{bmatrix} \qquad (1)$$

(as before, each input $x$ has one extra dimension fixed to 1 for the intercept term). The parameters of the model are the vectors $\theta_1,\theta_2,\dots,\theta_K\in\mathbb{R}^{n+1}$. When implementing softmax regression, it is convenient to stack them by rows into a single matrix:

$$\theta=\begin{bmatrix} \text{---}\ \theta_1^T\ \text{---} \\ \text{---}\ \theta_2^T\ \text{---} \\ \vdots \\ \text{---}\ \theta_K^T\ \text{---} \end{bmatrix},\qquad \theta\in\mathbb{R}^{K\times(n+1)}.$$
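As a concrete illustration, here is a minimal NumPy sketch of hypothesis (1). The names `theta` and `x` are only illustrative, and the max-subtraction is a standard numerical-stability trick rather than part of formula (1) itself.

```python
import numpy as np

def softmax_hypothesis(theta, x):
    """Evaluate hypothesis (1): P(y = j | x; theta) for j = 1, ..., K.

    theta : (K, n+1) parameter matrix, one row per class
    x     : (n+1,) input vector whose first component is the intercept 1
    Returns a length-K vector of probabilities that sums to 1.
    """
    scores = theta @ x                      # theta_j^T x for every class j
    scores -= scores.max()                  # subtract a constant for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()    # normalize so the K probabilities add up to 1

# Tiny usage example: K = 3 classes, n = 2 features (plus the intercept dimension)
theta = np.array([[0.1, 0.5, -0.3],
                  [0.0, -0.2, 0.4],
                  [0.2, 0.1, 0.1]])
x = np.array([1.0, 2.0, -1.0])              # first component is the intercept 1
print(softmax_hypothesis(theta, x))         # three probabilities summing to 1
```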
Here a question arises: in logistic regression there are two classes but we use only one $\theta$; can we likewise use only $K-1$ vectors $\theta_1,\dots,\theta_{K-1}$ to represent the whole model here? Specifically, we just need to fix $\theta_K=0$, so that $e^{\theta_K^T x}=1$ and formula (1) uses one parameter vector fewer. To verify, take $K=2$ (a binary problem) and set $\theta_2=0$; the model then degenerates into logistic regression:

$$P(y=1\mid x;\theta)=\frac{e^{\theta_1^T x}}{e^{\theta_1^T x}+e^{\theta_2^T x}}=\frac{e^{\theta_1^T x}}{e^{\theta_1^T x}+1}=\frac{1}{1+e^{-\theta_1^T x}},$$

which is exactly the logistic hypothesis, as was to be shown. So our parameter matrix does indeed have parameter redundancy; we return to this question below.
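A quick numerical sanity check of this degeneration, with arbitrary illustrative values: with $K=2$ and $\theta_2$ fixed to the zero vector, the first softmax probability matches the logistic sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta1 = np.array([0.3, -1.2, 0.7])
theta2 = np.zeros(3)                         # theta_2 fixed to the zero vector
x = np.array([1.0, 0.5, 2.0])                # intercept dimension first

# P(y = 1 | x) under hypothesis (1) with K = 2
p_softmax = np.exp(theta1 @ x) / (np.exp(theta1 @ x) + np.exp(theta2 @ x))
# P(y = 1 | x) under plain logistic regression
p_logistic = sigmoid(theta1 @ x)

print(np.isclose(p_softmax, p_logistic))     # True: the two models agree when theta_2 = 0
```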
What we need to do next is derive the cost function.
We know that the (unregularized) cost function of logistic regression is built by plugging each sample into the part of the model corresponding to its label (into $h_\theta(x^{(i)})$ when the label is 1, and into $1-h_\theta(x^{(i)})$ when it is 0), multiplying these values over all samples to get the likelihood, taking the logarithm, and averaging. We do the same here; since we now have many classes, we use an "indicator function" to keep the formula tidy:
$1\{\cdot\}$ is the indicator function, whose rule is: its value is 1 when the expression inside the braces is true, and 0 when it is false. For example, $1\{2+2=4\}=1$, while $1\{1+1=5\}=0$.
Our cost function is then (without regularization):

$$J(\theta)=-\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{j=1}^{K} 1\{y^{(i)}=j\}\,\log\frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{K} e^{\theta_l^T x^{(i)}}}\right].$$
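A minimal sketch of this cost in NumPy, assuming the labels `y` are stored as integers $1,\dots,K$ and `X` already contains the intercept column; all names here are illustrative.

```python
import numpy as np

def softmax_cost(theta, X, y):
    """Unregularized softmax cost J(theta).

    theta : (K, n+1) parameter matrix
    X     : (m, n+1) design matrix, each row an x^(i) with leading intercept 1
    y     : (m,) integer labels in {1, ..., K}
    """
    m = X.shape[0]
    scores = X @ theta.T                              # scores[i, j] = theta_j^T x^(i)
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability only
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # 1{y^(i) = j} picks out the log-probability of each sample's true class
    return -np.mean(log_probs[np.arange(m), y - 1])
```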
We already know how to minimize the cost function of the logistic regression model; here we illustrate minimizing $J(\theta)$ with gradient descent.
Here $\theta$ is the $K\times(n+1)$ matrix holding all the parameters of the model. Suppose we currently have a value of the parameter matrix $\theta$; what is the value of the new parameter matrix $\theta'$ produced by one step of gradient descent? Take, for example, the element $\theta(v,u)$ (row $v$, column $u$) that we want to update.
First we compute the partial derivative of $J$ with respect to this element of $\theta$. For a whole row $\theta_j$ the gradient is

$$\nabla_{\theta_j} J(\theta)=-\frac{1}{m}\sum_{i=1}^{m} x^{(i)}\left(1\{y^{(i)}=j\}-P(y^{(i)}=j\mid x^{(i)};\theta)\right),$$

and the partial derivative with respect to $\theta(v,u)$ is the $u$-th component of $\nabla_{\theta_v} J(\theta)$.
Evaluating this derivative with all the training samples and the current value of the parameter matrix $\theta$ gives a number $A$ (it is not an increment added to an element of $\theta$; it is already the derivative with respect to that element). Some treatments update by the gradient, which is a vector, but that gradient is simply the vector assembled from these per-parameter derivative numbers. We write things element by element here only to make the idea easier to follow (in a program, or in matrix form, the computation will look different from this formula, but the core idea is the same). The new parameter matrix $\theta'$ then has element $\theta'(v,u)=\theta(v,u)-\alpha A$, where $\alpha$ is the learning rate. The same procedure gives every other element $\theta'(v,u)$ of the new parameter matrix. Once we have $\theta'$, we iterate again to obtain $\theta''$, and so on, until the model parameters converge.
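The following sketch computes the full gradient matrix and performs one gradient descent step in NumPy, matching the element-wise description above; `alpha` is the learning rate and all names are illustrative.

```python
import numpy as np

def softmax_gradient(theta, X, y):
    """Gradient of the unregularized cost J(theta).

    Returns a (K, n+1) matrix whose (v, u) entry is dJ / d theta(v, u).
    """
    m, K = X.shape[0], theta.shape[0]
    scores = X @ theta.T
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)          # probs[i, j] = P(y^(i) = j | x^(i))
    indicator = np.zeros((m, K))
    indicator[np.arange(m), y - 1] = 1.0               # 1{y^(i) = j}
    return -(indicator - probs).T @ X / m

def gradient_descent_step(theta, X, y, alpha=0.1):
    """One update: every element theta(v, u) moves against its own derivative."""
    return theta - alpha * softmax_gradient(theta, X, y)
```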
At this point let us return to the parameter redundancy issue mentioned earlier.
Suppose the parameter matrix of our model is $\theta$ and we are given a sample $x$; we want the probability that its label equals each $i$ ($i=1,2,\dots,K$), which is computed with formula (1).
Now subtract the same vector $\psi$ from every row of the matrix $\theta$, so that $\theta_j$ becomes $\theta_j-\psi$. Then for any $j\in\{1,2,\dots,K\}$,

$$P(y=j\mid x;\theta)=\frac{e^{(\theta_j-\psi)^T x}}{\sum_{l=1}^{K} e^{(\theta_l-\psi)^T x}}=\frac{e^{\theta_j^T x}\,e^{-\psi^T x}}{\sum_{l=1}^{K} e^{\theta_l^T x}\,e^{-\psi^T x}}=\frac{e^{\theta_j^T x}}{\sum_{l=1}^{K} e^{\theta_l^T x}}.$$

That is, subtracting a constant vector from every row of the parameter matrix $\theta$ produces a new parameter matrix that is equivalent to the old one: for any sample, the probability that its label equals each $i$ ($i=1,2,\dots,K$) is the same under both parameter matrices. Consequently, if $\theta$ is a minimum point of the cost function, then $\theta$ with $\psi$ subtracted from every row is also a minimum point, and $\psi$ can be any vector, so the minimizer of the cost function is not unique. (Interestingly, the cost is still a convex function, so plain gradient descent will not run into the problem of local optima.) But the Hessian matrix is singular/non-invertible, which directly causes numerical problems when optimizing with Newton's method, so we still want a way to resolve the numerical problems caused by parameter redundancy when using gradient descent, Newton's method, or other algorithms.
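A short numerical demonstration of this invariance, with arbitrary illustrative values: subtracting the same vector `psi` from every row of `theta` leaves all $K$ probabilities unchanged.

```python
import numpy as np

def softmax_probs(theta, x):
    scores = theta @ x
    scores -= scores.max()
    e = np.exp(scores)
    return e / e.sum()

theta = np.array([[0.1, 0.5, -0.3],
                  [0.0, -0.2, 0.4],
                  [0.2, 0.1, 0.1]])
psi = np.array([3.0, -1.0, 0.5])             # any constant vector
x = np.array([1.0, 2.0, -1.0])

p_original = softmax_probs(theta, x)
p_shifted = softmax_probs(theta - psi, x)    # psi subtracted from every row
print(np.allclose(p_original, p_shifted))    # True: the two parameter matrices are equivalent
```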
In particular, we can take $\psi=\theta_K$; then $\theta_K$ becomes the zero vector, the new parameter matrix has one fewer group of parameters, and only $K-1$ groups are needed to build the model. With this choice the optimization of our cost function has a unique solution, and that is exactly what we did in the logistic regression formulation.
In practical applications, however, to keep the algorithm simple and clear, all the parameters are usually retained rather than arbitrarily setting one group of them to 0. But then we need to modify the cost function by adding weight decay, which resolves the numerical problems caused by the parameter redundancy of softmax regression.
We modify the cost function by adding a weight-decay term that penalizes overly large parameter values; our cost function now becomes

$$J(\theta)=-\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{j=1}^{K} 1\{y^{(i)}=j\}\,\log\frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{K} e^{\theta_l^T x^{(i)}}}\right]+\frac{\lambda}{2}\sum_{i=1}^{K}\sum_{j=0}^{n}\theta_{ij}^2.$$
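A sketch of this regularized cost, extending the earlier `softmax_cost` idea; `lam` stands for the weight-decay coefficient $\lambda$ and, as before, every name is illustrative.

```python
import numpy as np

def softmax_cost_regularized(theta, X, y, lam=1e-4):
    """Softmax cost with the weight-decay (L2) term (lambda / 2) * sum(theta ** 2)."""
    m = X.shape[0]
    scores = X @ theta.T
    scores -= scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    data_term = -np.mean(log_probs[np.arange(m), y - 1])
    decay_term = 0.5 * lam * np.sum(theta ** 2)
    return data_term + decay_term
```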
Why does adding this weight-decay term, which is an L2 regularization term, solve the numerical problems caused by parameter redundancy? With the weight-decay term (for any $\lambda>0$), the cost function becomes strictly convex, so a unique solution is guaranteed. The Hessian matrix then becomes an invertible matrix, and because $J(\theta)$ is a convex function, gradient descent and the L-BFGS algorithm are guaranteed to converge to the global optimal solution.
When computing the new $\theta'$ at each iteration, the only change relative to the unregularized case is in the derivative $A$ above. To update an element $\theta(v,u)$, the corresponding derivative becomes $A'=A+\lambda\,\theta(v,u)$ (the data-term derivative plus $\lambda$ times the corresponding element of the original parameter matrix), and the update is then $\theta'(v,u)=\theta(v,u)-\alpha A'$.
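A corresponding sketch of the regularized gradient step: the data-term gradient gains `lam * theta`, and each element is then updated against its own derivative; `alpha` and `lam` are illustrative hyperparameters.

```python
import numpy as np

def softmax_gradient_regularized(theta, X, y, lam=1e-4):
    """Gradient of the weight-decayed cost: data-term gradient plus lam * theta."""
    m, K = X.shape[0], theta.shape[0]
    scores = X @ theta.T
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    indicator = np.zeros((m, K))
    indicator[np.arange(m), y - 1] = 1.0
    return -(indicator - probs).T @ X / m + lam * theta

def train(theta, X, y, alpha=0.1, lam=1e-4, iters=500):
    """Repeat theta'(v, u) = theta(v, u) - alpha * A'(v, u) until (approximate) convergence."""
    for _ in range(iters):
        theta = theta - alpha * softmax_gradient_regularized(theta, X, y, lam)
    return theta
```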