Transferred from: http://blog.csdn.net/u014380165/article/details/77284921
We know that convolutional neural networks (CNN) are widely used in the image field. In general, a CNN consists of convolutional layers, pooling layers, fully connected layers, a loss layer, and so on. Although the many open deep learning frameworks available today (such as MXNet, Caffe, etc.) make it very easy to train a model, do you really know how these layers are implemented? Do you understand softmax, softmax loss, and cross entropy? I believe many people do not. There is plenty of material online, but its quality is uneven and it is easy to get lost in it. To save everyone some detours, I have organized these knowledge points here, hoping not only to consolidate my own understanding but also to help others understand the content.
This article mainly introduces the fully connected layer and the loss layer, which are fairly basic building blocks of a network. First, let's sort out the computation from the fully connected layer to the loss layer. Take a look at this picture from Reference 1 (I'm too lazy to draw my own).
The part to the left of the equals sign in this figure is the fully connected layer. W is the parameter of the fully connected layer, also called the weights, and x is the input of the fully connected layer, i.e. the feature. As the figure shows, the feature x is an n*1 vector. Where does it come from? It is produced by the convolutional and pooling layers in front of the fully connected layer. Suppose the fully connected layer is preceded by a convolutional layer whose output is 100 feature maps (that is, the channel count of the feature map is 100), each of size 4*4. These features are flattened into an n*1 vector before being fed into the fully connected layer (in this case n = 100*4*4 = 1600).

Having explained x, let's look at W. W is the parameter matrix of the fully connected layer, of size T*n, where n matches the n of x and T is the number of classes; for example, in a 7-class problem T is 7. What we call training the network is essentially finding the most suitable W matrix for the fully connected layer. So the fully connected layer simply computes Wx and obtains a T*1 vector (the logits[T*1] in the figure). The numbers in this vector are unbounded, ranging from negative infinity to positive infinity. For a multi-class problem, a softmax layer is usually placed after the fully connected layer. The input of this softmax layer is the T*1 vector, and its output is also a T*1 vector (the prob[T*1] in the figure), where each value represents the probability that the sample belongs to the corresponding class. The difference is that every value of the output vector lies in the range 0 to 1.
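Here is a minimal numpy sketch of this pipeline, using the dimensions from the example above; the random values and variable names are only for illustration, and the bias term is omitted to match the Wx description:

    import numpy as np

    # 100 feature maps of size 4*4 flattened to n = 1600, and T = 7 classes
    n, T = 100 * 4 * 4, 7
    x = np.random.randn(n)        # flattened feature vector, shape (n,)
    W = np.random.randn(T, n)     # fully connected layer weights, shape (T, n)

    logits = W.dot(x)             # Wx, a T*1 vector; each entry can be any real number

    # softmax turns the logits into probabilities in (0, 1) that sum to 1
    exp = np.exp(logits - logits.max())   # subtracting the max is a common numerical-stability trick
    prob = exp / exp.sum()

    print(logits)                 # unbounded values
    print(prob, prob.sum())       # values in (0, 1), summing to 1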
Now you know what the output vector of softmax means: each value is the probability that the sample belongs to the corresponding class.
So what does softmax do to turn its input into probabilities between 0 and 1? Take a look at the softmax formula (I also used to find formulas off-putting, but if you read them calmly they are not hard):
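(The formula image from the original post is not reproduced here; written out, it is:)

s_j = \frac{e^{a_j}}{\sum_{k=1}^{T} e^{a_k}}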
The formula is very simple. As said before, the input to softmax is Wx. Suppose the model's input sample is I and we are discussing a 3-class problem (the classes are denoted 1, 2, 3), and the true class of sample I is class 2. Then, after passing through all the layers of the network before the softmax layer, sample I yields Wx, which here is a 3*1 vector. In the formula above, a_j denotes the j-th value of this 3*1 vector (so we will eventually get s_1, s_2, s_3), and a_k in the denominator denotes each of the 3 values of the vector, hence the summation symbol (the sum runs over k from 1 to T, where T is the same T as in the figure above, i.e. the number of classes; j also ranges from 1 to T). Because e^x is always greater than 0, the numerator is always positive; the denominator is a sum of positive numbers, so it is also positive; therefore s_j is positive and lies in the range (0, 1). If you are not training the model but testing it, then when a sample passes through the softmax layer and outputs a T*1 vector, the index of the largest value in that vector is taken as the predicted label of the sample.
So the goal of training the fully connected layer's W is to make its output Wx, after passing through the softmax layer, assign the highest predicted probability to the true label.
For example: suppose your Wx = [1, 2, 3]; after the softmax layer you get [0.09, 0.24, 0.67], and these three numbers mean that the probabilities of the sample belonging to the first, second, and third class are 0.09, 0.24, and 0.67 respectively.
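You can check these numbers with a couple of lines of numpy:

    import numpy as np

    wx = np.array([1.0, 2.0, 3.0])
    s = np.exp(wx) / np.exp(wx).sum()
    print(s.round(2))   # -> [0.09 0.24 0.67], matching the numbers above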
———————————————————————— Gorgeous split-line ———————————————————————-
Having understood softmax, we need to talk about softmax loss.
So what does softmax loss mean? It is as follows:
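(Writing the formula out, based on the description below:)

L = -\sum_{j=1}^{T} y_j \log s_j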
First, L is the loss. s_j is the j-th value of the softmax output vector s, described earlier, and indicates the probability that the sample belongs to class j. y_j has a summation symbol in front of it, and j again ranges from 1 to the number of classes T, so y is a 1*T vector with T values; only one of these values is 1, and the other T-1 values are 0. Which position is 1? The position corresponding to the true label is 1, and the others are 0. So this formula actually has a simpler form:
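L = -\log s_j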
Of course, this time j is restricted to the index of the current sample's true label.
Let me give an example. Consider a 5-class problem, where the label of sample I is y = [0, 0, 0, 1, 0], i.e. the true class of sample I is class 4. Suppose the probabilities predicted by the model (the softmax output) are p = [0.2, 0.3, 0.4, 0.6, 0.5] (these numbers are only illustrative; a real softmax output would sum to 1). We can see that the prediction is correct, and the corresponding loss is L = -log(0.6); that is, when the sample produces this prediction p under these network parameters, its loss is -log(0.6). Now suppose p = [0.2, 0.3, 0.4, 0.1, 0.5]. This prediction is way off, because the true label is class 4 and you think the probability of this sample being class 4 is only 0.1 (far less than the other probabilities; at test time the model would predict that the sample belongs to class 5), so the corresponding loss is L = -log(0.1). Now suppose p = [0.2, 0.3, 0.4, 0.3, 0.5]. This prediction is also wrong, but not as badly as the previous one; the corresponding loss is L = -log(0.3). We know that the log of an input between 0 and 1 is negative, and log is an increasing function, so -log(0.6) < -log(0.3) < -log(0.1). Simply put, the more wrong your prediction, the larger the loss, and a wildly wrong prediction is penalized more heavily than a slightly wrong one.
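A small sketch of the three cases above (the probability vectors are copied from the example as-is):

    import numpy as np

    # y = [0, 0, 0, 1, 0]: the true class is the 4th one, i.e. index 3 when counting from 0
    true_index = 3
    for p in ([0.2, 0.3, 0.4, 0.6, 0.5],
              [0.2, 0.3, 0.4, 0.1, 0.5],
              [0.2, 0.3, 0.4, 0.3, 0.5]):
        loss = -np.log(p[true_index])
        print(round(loss, 2))   # prints 0.51, then 2.3, then 1.2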
———————————————————————-Gorgeous split-line —————————————————————————-
Having cleared up softmax loss, we can look at cross entropy.
Cross entropy's formula is as follows:
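(Writing the formula out, based on the description below:)

\text{cross entropy} = -\sum_{j=1}^{T} y_j \log p_j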
Does it look just like the softmax loss formula? When the input p of the cross entropy is the output of softmax, cross entropy equals softmax loss. p_j is the j-th value of the input probability vector p, so if your probabilities are obtained with the softmax formula, then cross entropy is exactly softmax loss. This is my own understanding; please correct me if I am wrong.
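A quick numerical check of this equivalence (the numbers here are arbitrary, just for illustration):

    import numpy as np

    logits = np.array([1.0, 2.0, 3.0])
    y = np.array([0.0, 0.0, 1.0])                # one-hot label, true class is the 3rd

    s = np.exp(logits) / np.exp(logits).sum()    # softmax output

    softmax_loss = -np.log(s[y.argmax()])        # the "simpler form" of softmax loss
    cross_entropy = -(y * np.log(s)).sum()       # cross entropy with p = softmax output

    print(np.isclose(softmax_loss, cross_entropy))   # True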
Reference 1: http://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/