Understanding the function of cross-entropy as a loss function in neural networks

The role of cross-entropy

One of the most common ways to solve multi-class classification problems with neural networks is to place N output nodes in the last layer, whether in a shallow neural network or in a CNN. For example, the final output layer of AlexNet has 1000 nodes.

Even ResNet, which removes the fully connected layers, still has a 1000-node output layer at the end.

In general, the number of nodes in the final output layer equals the number of classes in the classification task. Assuming there are n output nodes, for each example the neural network produces an n-dimensional array as its output, and each dimension of the array corresponds to one category. In the ideal case, if a sample belongs to class k, the output of node k should be 1 and the outputs of all other nodes should be 0, i.e. [0, 0, 1, 0, ..., 0, 0]. This array is the label of the sample and is the output the neural network is expected to produce. Cross-entropy is used to measure how close the actual output is to this desired output.

Softmax regression processing

The raw output of a neural network is not a probability value; essentially it is just the result of a weighted combination of the inputs followed by a nonlinear transformation. So how can this output be turned into a probability distribution?
This is the function of the softmax layer. Assuming the raw outputs of the neural network are y1, y2, ..., yn, the output after softmax regression processing is:
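y_i' = e^(y_i) / (e^(y_1) + e^(y_2) + ... + e^(y_n)),   for i = 1, 2, ..., n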

It is clear that:
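0 < y_i' < 1 for each i,   and   y_1' + y_2' + ... + y_n' = 1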

The output of each node thus becomes a probability value, and the result after softmax processing is used as the final output of the neural network.

The principle of cross-entropy

Cross-entropy measures the distance between the actual output (a probability distribution) and the desired output (a probability distribution): the smaller the cross-entropy, the closer the two distributions are. Assuming the probability distribution P is the desired output, the probability distribution Q is the actual output, and H(P, Q) is the cross-entropy, then:
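H(P, Q) = - Σ_x P(x) · log Q(x)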

How does this formula characterize distance? For example:
Assume n = 3, the desired output is p = (1, 0, 0), and two actual outputs are q1 = (0.5, 0.2, 0.3) and q2 = (0.8, 0.1, 0.1). Then:
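H(p, q1) = -(1 × log 0.5 + 0 × log 0.2 + 0 × log 0.3) ≈ 0.30
H(p, q2) = -(1 × log 0.8 + 0 × log 0.1 + 0 × log 0.1) ≈ 0.10

(the values here use base-10 logarithms)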

Obviously, q2 is closer to p, and its cross-entropy is smaller.
In addition, cross-entropy has another form of expression. Using the same assumptions as above:
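One standard alternative form, assumed here since it is the form usually quoted alongside the one above, treats each output node as an independent binary prediction:

H(P, Q) = - Σ_x [ P(x) · log Q(x) + (1 - P(x)) · log(1 - Q(x)) ]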

The result is:
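Under that form, again with base-10 logarithms:

H(p, q1) = -(log 0.5 + log 0.8 + log 0.7) ≈ 0.55
H(p, q2) = -(log 0.8 + log 0.9 + log 0.9) ≈ 0.19

so q2 is again the closer one.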

All of the above applies to a single sample. In actual training, the data are usually combined into batches, so the output of the neural network is an m×n two-dimensional matrix, where m is the batch size and n is the number of classes; the corresponding label is also a two-dimensional matrix. Taking the data above and combining it into a batch of 2:
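Label (desired output):
[[1, 0, 0],
 [1, 0, 0]]

Network output (actual output):
[[0.5, 0.2, 0.3],
 [0.8, 0.1, 0.1]]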


So the result of the cross-entropy should be a column vector (according to the first method):
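[0.30,
 0.10]

(one cross-entropy value per sample)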

For the batch, the final average is 0.2.

Implementing cross-entropy in TensorFlow

This form can be used in TensorFlow:
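A minimal runnable sketch of that expression, assuming the TensorFlow 1.x API (in TensorFlow 2.x, tf.log becomes tf.math.log) and reusing the example batch from above for the two tensors:

import tensorflow as tf

# Desired output (one-hot labels) and actual output (softmax probabilities)
# for the batch-of-2 example above.
y_ = tf.constant([[1.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0]])
y = tf.constant([[0.5, 0.2, 0.3],
                 [0.8, 0.1, 0.1]])

# Clip to avoid log(0), multiply element-wise, then average over the whole matrix.
cross_entropy = -tf.reduce_mean(y_ * tf.log(tf.clip_by_value(y, 1e-10, 1.0)))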

where y_ represents the desired output, y represents the actual output (probability values), and * denotes element-wise multiplication between matrices, not matrix multiplication.
The above code implements the first form of the cross-entropy calculation. Note that the calculation differs slightly from the formula above: according to the steps described earlier, the batch cross-entropy should be obtained by first computing the cross-entropy of each sample and then averaging over the batch, whereas tf.reduce_mean actually computes the mean over the entire matrix. The result therefore differs (by a constant factor of n), but this does not change its practical meaning.
Besides tf.reduce_mean, the tf.clip_by_value function limits the range of the output values in order to avoid log(0), which would be negative infinity; here the values are limited to (1e-10, 1.0). The upper limit of 1.0 is actually redundant, since a probability never exceeds 1.
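For comparison, a sketch of the per-sample calculation described above (first sum over the class dimension, axis=1, then average over the batch; the variable names are only illustrative):

cross_entropy_per_sample = -tf.reduce_sum(
    y_ * tf.log(tf.clip_by_value(y, 1e-10, 1.0)), axis=1)
cross_entropy_batch_mean = tf.reduce_mean(cross_entropy_per_sample)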

Because cross-entropy is often used together with the softmax function in neural networks, TensorFlow encapsulates the two together:
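A sketch of the encapsulated version; tf.nn.softmax_cross_entropy_with_logits applies softmax to the logits internally and returns one cross-entropy value per sample, which is then averaged:

# y_ : one-hot labels; y : raw (pre-softmax) output of the last layer (the logits)
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))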

The difference from the first code snippet is that y here is the raw output of the last layer of the neural network (the logits), not the probabilities after softmax.
