Softmax, Softmax Loss, and Cross Entropy in Convolutional Neural Networks (TensorFlow series)


We know that convolutional neural networks (CNNs) are used very widely in image applications. A typical CNN mainly consists of convolution layers, pooling layers, fully connected layers, a loss layer, and so on. With the many open-source deep learning frameworks now available (such as MXNet, Caffe, etc.), training a model has become very simple, but do you know how these layers are actually implemented? Do you really understand softmax, softmax loss, and cross entropy? I believe many people are not entirely clear on them. There is plenty of material online, but its quality is uneven and it can be dizzying. To save everyone some detours, I have organized the ins and outs of these points of knowledge, hoping not only to consolidate my own understanding but also to help others understand this content.

This article mainly covers the fully connected layer and the loss layer, which are fairly fundamental parts of the network. First, let's walk through the computation from the fully connected layer to the loss layer. Look at the picture below, taken from reference 1 (I was too lazy to draw one myself).

The left part of the equals sign in this figure is what the fully connected layer does. W is the parameter of the fully connected layer, also called the weights, and x is the input of the fully connected layer, i.e., the feature. You can see from the figure that the feature x is an n*1 vector. Where does it come from? It is produced by the convolution and pooling layers in front of the fully connected layer. Suppose the layer right before the fully connected layer is a convolution layer whose output is 100 feature maps (that is, the channel count of the feature map is 100), each of size 4*4; these features are flattened into an n*1 vector (here n = 100*4*4 = 1600) before being fed into the fully connected layer. That explains x. Now look at W. W is the parameter of the fully connected layer, a t*n matrix, where n matches the n of x and t is the number of classes; for example, if you have 7 classes, then t is 7. What we call training the network, as far as the fully connected layer is concerned, means finding the most suitable matrix W. So the fully connected layer simply performs Wx and gets a t*1 vector (the logits[t*1] in the figure), and each number in that vector has no size restriction, i.e., it can range from negative infinity to positive infinity. Then, if you are solving a multi-class problem, you will usually put a softmax layer after the fully connected layer. The input of softmax is that t*1 vector, and its output is also a t*1 vector (the prob[t*1] in the figure), where each value of the vector represents the probability that the sample belongs to the corresponding class. The only difference is that each value of the output vector now lies between 0 and 1.
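To make the shapes concrete, here is a minimal NumPy sketch of this step, using the assumed numbers from above (100 channels of 4*4 features, 7 classes); the variable names are my own, not taken from the figure:

    import numpy as np

    # assumed example: output of the last convolution layer, 100 channels of size 4x4
    feature_maps = np.random.randn(100, 4, 4)

    # flatten into the n*1 feature vector x (n = 100*4*4 = 1600)
    x = feature_maps.reshape(-1, 1)          # shape (1600, 1)

    # fully connected layer: W is a t*n matrix, here t = 7 classes
    t, n = 7, x.shape[0]
    W = 0.01 * np.random.randn(t, n)         # randomly initialized weights

    # logits = Wx, a t*1 vector whose values are unbounded
    logits = W @ x                           # shape (7, 1)

    # softmax turns the logits into a t*1 probability vector
    prob = np.exp(logits) / np.sum(np.exp(logits))
    print(prob.ravel(), prob.sum())          # 7 values in (0, 1) that sum to 1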

Now you know what the softmax output vector means: the probability that the sample belongs to each class.

So how does softmax turn its input into probabilities between 0 and 1? Let's look at the softmax formula (I was put off by formulas myself when I first read this, but please take a quiet look):

S_j = e^{a_j} / Σ_k e^{a_k}

The formula is very simple. The input to softmax is Wx. Suppose the model's input sample is i, and consider a 3-class problem (classes 1, 2, 3) where the true class of sample i is 2. Then Wx, the value reached just before the softmax layer after sample i has passed through all the earlier layers of the network, is a 3*1 vector. a_j in the formula above is the j-th value of that 3*1 vector (the formula eventually gives S1, S2, S3), and a_k in the denominator runs over all 3 values of the vector, hence the summation symbol (the sum is over k from 1 to T, where T is the same T as in the figure above, i.e., the number of classes; j also ranges from 1 to T). Because e^x is always greater than 0, the numerator is always positive; the denominator is a sum of several positive numbers, so it is also positive; therefore S_j is positive and lies in the range (0, 1). If we are not training the model but testing it, then when a sample passes through the softmax layer and outputs a t*1 vector, we take the index of the largest value in that vector as the predicted label for the sample.
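As a small sketch of this, the softmax function below is written exactly as in the formula; the max-subtraction line is only a common numerical-stability trick and does not change the result, and the logits here are made-up numbers for a 3-class problem:

    import numpy as np

    def softmax(a):
        # S_j = e^{a_j} / sum_k e^{a_k}; subtracting max(a) first does not
        # change the result but avoids overflow for large inputs
        e = np.exp(a - np.max(a))
        return e / e.sum()

    logits = np.array([0.5, 2.0, -1.0])      # made-up Wx for a 3-class problem
    prob = softmax(logits)
    print(prob)                              # three positive values that sum to 1
    print(np.argmax(prob))                   # index of the largest value = predicted label (here 1)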

Therefore, the goal when training W in the fully connected layer is to make Wx, after passing through the softmax layer, assign the highest predicted probability to the class that corresponds to the true label.

For example: suppose your Wx = [1, 2, 3]. After the softmax layer you get [0.09, 0.24, 0.67], which means the probability of this sample belonging to class 1, 2, or 3 is 0.09, 0.24, and 0.67 respectively.
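You can verify these numbers with a few self-contained lines (the rounding matches the values quoted above):

    import numpy as np

    wx = np.array([1.0, 2.0, 3.0])
    prob = np.exp(wx) / np.sum(np.exp(wx))
    print(np.round(prob, 2))                 # [0.09 0.24 0.67]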

—————————————— Gorgeous split line ——————————————

Now that we understand softmax, it is time to talk about softmax loss.
What is softmax loss? It is defined as follows:

L = - Σ_j y_j * log(S_j)

First, L is the loss. S_j is the j-th value of the softmax output vector s described above, i.e., the probability that this sample belongs to class j. y_j sits inside a summation, and j again ranges from 1 to the number of classes T, so y is a 1*t vector; among its T values exactly one is 1 and the other T-1 values are 0. Which position holds the 1? The position corresponding to the true label. So this formula actually has a simpler form:

L = -log(S_j)

Here, of course, j refers to the true label of the current sample.

Let me give an example. Suppose we have a 5-class problem, and a sample i has label y = [0, 0, 0, 1, 0], i.e., the true label of sample i is class 4. Suppose the probabilities predicted by the model (the softmax output) are p = [0.1, 0.15, 0.05, 0.6, 0.1]. You can see this prediction is correct, and the corresponding loss is L = -log(0.6); that is, when the sample produces such a prediction p under these network parameters, the loss is -log(0.6). Now suppose p = [0.15, 0.2, 0.4, 0.1, 0.15]. This prediction is way off, because the true label is 4 but the model thinks the probability of this sample being class 4 is only 0.1 (far less than some of the other probabilities; if this were the test phase, the model would predict that the sample belongs to class 3), and the corresponding loss is L = -log(0.1). Now suppose p = [0.05, 0.15, 0.4, 0.3, 0.1]. This prediction is also wrong, but not as outrageous as the previous one, and the corresponding loss is L = -log(0.3). We know that the log of an input less than 1 is negative, and that log is an increasing function, so -log(0.6) < -log(0.3) < -log(0.1). Simply put, a wrong prediction costs more than a correct one, and a wildly wrong prediction costs much more than a slightly wrong one.
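Here is a quick check of these three cases in code, with the one-hot label y and the three vectors p copied from the example above:

    import numpy as np

    y = np.array([0, 0, 0, 1, 0])                       # true label is class 4
    predictions = [
        np.array([0.10, 0.15, 0.05, 0.60, 0.10]),       # correct prediction
        np.array([0.15, 0.20, 0.40, 0.10, 0.15]),       # badly wrong prediction
        np.array([0.05, 0.15, 0.40, 0.30, 0.10]),       # slightly wrong prediction
    ]
    for p in predictions:
        loss = -np.sum(y * np.log(p))                   # L = -sum_j y_j * log(S_j)
        print(round(loss, 3))
    # prints 0.511 (= -log 0.6), 2.303 (= -log 0.1), 1.204 (= -log 0.3)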

—————————————— Gorgeous split line ——————————————

With softmax loss cleared up, we can look at cross entropy.
Cross entropy is defined by the following formula:

E = - Σ_j y_j * log(p_j)

Doesn't it look just like the softmax loss formula? When the input p of the cross entropy is the output of softmax, the cross entropy equals the softmax loss. p_j is the j-th value of the input probability vector p, so if your probabilities come from the softmax formula, then the cross entropy is exactly the softmax loss. This is my own understanding; please correct me if it is wrong.
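Here is a small sketch of this equivalence: feed the softmax output into the cross entropy formula and compare it with -log of the true-class probability. (In TensorFlow, for instance, tf.nn.softmax_cross_entropy_with_logits computes this combined quantity directly from the logits in a more numerically stable way.) The logits and label below are made-up example values:

    import numpy as np

    def softmax(a):
        e = np.exp(a - np.max(a))
        return e / e.sum()

    def cross_entropy(p, y):
        # E = -sum_j y_j * log(p_j) for a probability vector p and one-hot label y
        return -np.sum(y * np.log(p))

    logits = np.array([1.0, 2.0, 3.0])       # made-up Wx
    y = np.array([0.0, 0.0, 1.0])            # the true class is class 3

    p = softmax(logits)
    print(cross_entropy(p, y))               # cross entropy of the softmax output ...
    print(-np.log(p[2]))                     # ... equals the softmax loss -log(S_3) ≈ 0.408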

Reference 1: http://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/
