Understanding the function of cross-entropy as a loss function in neural networks

The role of cross-entropy

One of the most common ways to solve multi-class classification problems with neural networks is to place N output nodes in the last layer, whether in a shallow neural network or in a CNN. For example, the final output layer of AlexNet has 1000 nodes.

Even ResNet, which removes the fully connected layers, still has a 1000-node output layer at the end.

In general, the number of nodes in the final output layer equals the number of classes in the classification task. Assuming there are n output nodes, for each example the neural network produces an n-dimensional array as its output, and each dimension of the array corresponds to one category. In the ideal case, if a sample belongs to class k, the output of node k should be 1 and the outputs of all other nodes should be 0, i.e. [0, 0, 1, 0, ..., 0, 0]. This array is the label of the sample and is the output the neural network is expected to produce. Cross-entropy is used to measure how close the actual output is to this desired output.

Softmax regression processing

The raw output of a neural network is not a probability value; essentially it is just the result of a weighted combination of the inputs followed by a nonlinear transformation. So how can this output be turned into a probability distribution?
This is the function of the softmax layer. Assuming the raw outputs of the neural network are y1, y2, ..., yn, the output after softmax regression processing is:
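y_i' = e^(y_i) / (e^(y_1) + e^(y_2) + ... + e^(y_n)),   for i = 1, 2, ..., n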

It is clear that:
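0 < y_i' < 1 for each i,   and   y_1' + y_2' + ... + y_n' = 1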

The output of each node thus becomes a probability value, and the result after softmax processing is used as the final output of the neural network.

The principle of cross-entropy

Cross-entropy measures the distance between the actual output (a probability distribution) and the desired output (a probability distribution): the smaller the cross-entropy, the closer the two distributions are. Assuming the probability distribution P is the desired output, the probability distribution Q is the actual output, and H(P, Q) is the cross-entropy, then:
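H(P, Q) = - Σ_x P(x) · log Q(x)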

How does this formula characterize distance? For example:
Assume n = 3, the desired output is p = (1, 0, 0), and two actual outputs are q1 = (0.5, 0.2, 0.3) and q2 = (0.8, 0.1, 0.1). Then:
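H(p, q1) = -(1 × log 0.5 + 0 × log 0.2 + 0 × log 0.3) ≈ 0.30
H(p, q2) = -(1 × log 0.8 + 0 × log 0.1 + 0 × log 0.1) ≈ 0.10

(the values here use base-10 logarithms)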

Obviously, q2 is closer to p, and its cross-entropy is smaller.
In addition, cross-entropy has another form of expression. Using the same assumptions as above:
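One standard alternative form, assumed here since it is the form usually quoted alongside the one above, treats each output node as an independent binary prediction:

H(P, Q) = - Σ_x [ P(x) · log Q(x) + (1 - P(x)) · log(1 - Q(x)) ]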

The result is:
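Under that form, again with base-10 logarithms:

H(p, q1) = -(log 0.5 + log 0.8 + log 0.7) ≈ 0.55
H(p, q2) = -(log 0.8 + log 0.9 + log 0.9) ≈ 0.19

so q2 is again the closer one.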

All of the above applies to a single sample. In actual training, the data are usually combined into batches, so the output of the neural network is an m×n two-dimensional matrix, where m is the batch size and n is the number of classes; the corresponding label is also a two-dimensional matrix. Taking the data above and combining it into a batch of 2:
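Label (desired output):
[[1, 0, 0],
 [1, 0, 0]]

Network output (actual output):
[[0.5, 0.2, 0.3],
 [0.8, 0.1, 0.1]]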


So the result of the cross-entropy should be a column vector (according to the first method):
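[0.30,
 0.10]

(one cross-entropy value per sample)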

For the batch, the final average is 0.2.

Implementing cross-entropy in TensorFlow

This form can be used in TensorFlow:
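A minimal runnable sketch of that expression, assuming the TensorFlow 1.x API (in TensorFlow 2.x, tf.log becomes tf.math.log) and reusing the example batch from above for the two tensors:

import tensorflow as tf

# Desired output (one-hot labels) and actual output (softmax probabilities)
# for the batch-of-2 example above.
y_ = tf.constant([[1.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0]])
y = tf.constant([[0.5, 0.2, 0.3],
                 [0.8, 0.1, 0.1]])

# Clip to avoid log(0), multiply element-wise, then average over the whole matrix.
cross_entropy = -tf.reduce_mean(y_ * tf.log(tf.clip_by_value(y, 1e-10, 1.0)))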

where y_ represents the desired output, y represents the actual output (probability values), and * denotes element-wise multiplication between matrices, not matrix multiplication.
The above code implements the first form of the cross-entropy calculation. Note that the calculation differs slightly from the formula above: according to the steps described earlier, the batch cross-entropy should be obtained by first computing the cross-entropy of each sample and then averaging over the batch, whereas tf.reduce_mean actually computes the mean over the entire matrix. The result therefore differs (by a constant factor of n), but this does not change its practical meaning.
Besides tf.reduce_mean, the tf.clip_by_value function limits the range of the output values in order to avoid log(0), which would be negative infinity; here the values are limited to (1e-10, 1.0). The upper limit of 1.0 is actually redundant, since a probability never exceeds 1.
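For comparison, a sketch of the per-sample calculation described above (first sum over the class dimension, axis=1, then average over the batch; the variable names are only illustrative):

cross_entropy_per_sample = -tf.reduce_sum(
    y_ * tf.log(tf.clip_by_value(y, 1e-10, 1.0)), axis=1)
cross_entropy_batch_mean = tf.reduce_mean(cross_entropy_per_sample)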

Because cross-entropy is often used together with the softmax function in neural networks, TensorFlow encapsulates the two together:
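A sketch of the encapsulated version; tf.nn.softmax_cross_entropy_with_logits applies softmax to the logits internally and returns one cross-entropy value per sample, which is then averaged:

# y_ : one-hot labels; y : raw (pre-softmax) output of the last layer (the logits)
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))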

The difference from the first code snippet is that y here is the raw output of the last layer of the neural network (the logits), not the probabilities after softmax.
