Softmax regression is essentially an extension of logistic regression: logistic regression is a binary classifier, while Softmax regression handles multiple categories.
1 Logistic regression
Before studying Softmax regression, let us first review the relevant background on logistic regression.
(See http://blog.csdn.net/bea_tree/article/details/50432411#t6)
The hypothesis function of logistic regression is:
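In the standard notation used by the UFLDL notes, this is the sigmoid of a linear score:

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$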
Although it is called regression, it is used for classification: the curve fitted to the data is the logistic (sigmoid) curve, so its output values are close to either 1 or 0.
In addition, the objective function is the likelihood obtained by multiplying the probabilities of all training examples; for convenience of computation we take its logarithm, and the fit is best when this log-likelihood is at its maximum.
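Written out in the usual notation, with labels $y \in \{0, 1\}$ and $h_\theta(x)$ interpreted as $P(y = 1 \mid x; \theta)$, the likelihood and log-likelihood are

$$L(\theta) = \prod_{i=1}^{m} h_\theta(x^{(i)})^{y^{(i)}} \left(1 - h_\theta(x^{(i)})\right)^{1 - y^{(i)}}, \qquad
\ell(\theta) = \sum_{i=1}^{m} y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right).$$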
The iterative update obtained with gradient descent has the same form as the update for linear regression, which is a nice coincidence; other algorithms (for example, the perceptron learning algorithm) have updates of the same form as well.
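Concretely, the stochastic gradient update on the log-likelihood has exactly the LMS-like form familiar from linear regression:

$$\theta_j := \theta_j + \alpha \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}$$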
In the UFLDL notes, the log-likelihood is negated to form a cost function, and its minimum is computed directly instead.
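As a minimal sketch of the above (assuming NumPy and batch gradient descent on the negated, averaged log-likelihood; the function and variable names are illustrative, not from the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent on the negated (averaged) log-likelihood.
    X: (m, n) inputs, y: (m,) labels in {0, 1}."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)       # P(y = 1 | x) for every example
        grad = X.T @ (h - y) / m     # gradient of the cost (negated log-likelihood)
        theta -= lr * grad           # gradient descent step
    return theta
```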
2 Softmax Regression
From the formulas above we can see that the main idea of logistic regression is to work with probabilities: when y = 0 we use the probability that y equals 0, and when y = 1 the probability that y equals 1.
The idea of Softmax is to obtain the probability of each of the multiple categories separately; the formula is as follows:
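In the UFLDL notation, the probability that example $x^{(i)}$ belongs to class $j$ (out of $k$ classes) is

$$P\left(y^{(i)} = j \mid x^{(i)}; \theta\right) = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}}.$$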
It is easier to understand when written in the following form, which can be seen as the k per-class formulas stacked into a vector, giving, for one input x, the probability of each of the k categories:
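Stacked into a vector, the hypothesis becomes

$$h_\theta(x^{(i)}) = \frac{1}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}}
\begin{bmatrix} e^{\theta_1^T x^{(i)}} \\ e^{\theta_2^T x^{(i)}} \\ \vdots \\ e^{\theta_k^T x^{(i)}} \end{bmatrix}.$$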
Now that we have assumed a form for the probability of each category, we can imitate maximum likelihood estimation to obtain the following cost function:
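In UFLDL notation (where 1{·} is the indicator function):

$$J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}} \right].$$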
If the input x belongs to category j, the corresponding probability is the j-th term above, selected by the indicator.
Multiplying these probabilities over all m examples gives the likelihood; taking the logarithm (which turns the product into a sum) and adding a minus sign, finding the minimum is then equivalent to finding the maximum of the likelihood. That is what the cost function above expresses.
The gradient used in its iterative updates is as follows:
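In UFLDL notation, the gradient with respect to $\theta_j$ is

$$\nabla_{\theta_j} J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ x^{(i)} \left( 1\left\{y^{(i)} = j\right\} - P\left(y^{(i)} = j \mid x^{(i)}; \theta\right) \right) \right],$$

which is then plugged into a gradient descent step $\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)$.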
3 Parameter characteristics of Softmax
Knowing its rationale, a question comes to mind: if we already know the probabilities of the first k-1 categories, do we still need separate parameters to determine the probability of the k-th? Obviously not. This is a direct way of seeing that Softmax's parameters are redundant (overparameterized) (this is the blogger's own understanding, so take it with caution).
The explanation in the UFLDL notes is more rigorous:
If we subtract the same vector ψ from every parameter vector θ_j in the probability formula, the resulting probabilities are unchanged:
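Concretely, for any vector $\psi$,

$$\frac{e^{(\theta_j - \psi)^T x^{(i)}}}{\sum_{l=1}^{k} e^{(\theta_l - \psi)^T x^{(i)}}}
= \frac{e^{\theta_j^T x^{(i)}}\, e^{-\psi^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}\, e^{-\psi^T x^{(i)}}}
= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}},$$

since the common factor $e^{-\psi^T x^{(i)}}$ cancels.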
That is to say, subtracting such a vector from an optimal solution still satisfies the optimality conditions; in other words, there are infinitely many optimal solutions. As a result the Hessian matrix is singular (non-invertible), and Newton's method cannot be used directly.
Based on this observation we could simply fix one of the parameter vectors to all zeros so that there is no redundancy, but in practice this is not done; instead a regularization term is added. Here, however, it is not called regularization but weight decay.
4 Weight Decay
After adding the penalty term, the cost function looks like this:
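In UFLDL notation, with weight decay parameter $\lambda > 0$:

$$J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}} \right] + \frac{\lambda}{2} \sum_{i=1}^{k} \sum_{j=0}^{n} \theta_{ij}^2.$$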
Since the added term is strictly greater than 0 for nonzero θ, the Hessian matrix is no longer non-invertible, so the cost becomes a strictly convex function and all the usual solvers can be used. It can also be understood directly: every parameter gains an extra constraint pulling it toward zero, so the optimal solution is unique. The iteration formula is as follows:
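The gradient then gains a $\lambda \theta_j$ term,

$$\nabla_{\theta_j} J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ x^{(i)} \left( 1\left\{y^{(i)} = j\right\} - P\left(y^{(i)} = j \mid x^{(i)}; \theta\right) \right) \right] + \lambda \theta_j,$$

and each step updates $\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)$. A minimal NumPy sketch of this cost and gradient (names and the one-hot construction are illustrative, not from the notes):

```python
import numpy as np

def softmax_cost_grad(Theta, X, y, lam):
    """Softmax regression cost and gradient with weight decay.
    Theta: (k, n) parameters, X: (m, n) inputs, y: (m,) labels in {0..k-1}."""
    m = X.shape[0]
    scores = X @ Theta.T                          # (m, k) class scores theta_j^T x
    scores -= scores.max(axis=1, keepdims=True)   # shift for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)     # P(y = j | x) for each example
    one_hot = np.zeros_like(probs)
    one_hot[np.arange(m), y] = 1.0                # indicator 1{y^(i) = j}
    cost = -np.sum(one_hot * np.log(probs)) / m + 0.5 * lam * np.sum(Theta ** 2)
    grad = -(one_hot - probs).T @ X / m + lam * Theta
    return cost, grad
```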
5 Softmax VS. Binary classifiers
When we have k classes, should we choose Softmax or k separate binary classifiers?
The answer: if the k classes are mutually exclusive, choose Softmax; but if the classes can overlap, for example categories such as man, woman, child, and little girl, Softmax cannot be used, and k independent binary (logistic) classifiers should be trained instead.
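As a small illustration of the difference (a hypothetical NumPy snippet; the scores are made up): with mutually exclusive classes the softmax probabilities sum to 1, whereas for overlapping labels one applies k independent sigmoids, each giving its own probability.

```python
import numpy as np

scores = np.array([2.0, 1.0, 0.5, -1.0])     # theta_j^T x for k = 4 classes

# Mutually exclusive classes: softmax, probabilities sum to 1.
softmax_probs = np.exp(scores) / np.exp(scores).sum()

# Overlapping labels (e.g. "female" and "child" can both be true):
# k independent logistic classifiers, each probability stands on its own.
sigmoid_probs = 1.0 / (1.0 + np.exp(-scores))
```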
UFLDL Softmax Regression