The cross-entropy cost function (its role and formula derivation)


The cross-entropy cost function is a way to measure the difference between the predicted values and the actual values of an artificial neural network (ANN). Compared with the quadratic cost function, it promotes the training of an ANN more effectively. Before introducing the cross-entropy cost function, this article briefly reviews the quadratic cost function and its shortcomings.


1. Shortcomings of the quadratic cost function

One of the aims of an ANN is to enable machines to learn knowledge the way humans do. When a person learning something new finds that they have made a mistake, the larger the mistake, the larger the correction. Take shooting a basketball: when an athlete finds that his shot went far off target, he makes a larger adjustment to the shooting angle, and the ball is more likely to go into the basket. Similarly, we hope that when an ANN is training, the larger the error between the predicted value and the actual value, the larger the parameter adjustments made during backpropagation, so that training converges faster. However, if the quadratic cost function is used to train the ANN, the actual effect is that the larger the error, the smaller the parameter adjustments may be, and the slower training proceeds.

Take the binary-classification training of a single neuron as an example, and run two experiments (the usual activation function of an ANN is the sigmoid function, which these experiments also use): input the same sample data x = 1.0 (the actual class of this sample is y = 0), and randomly initialize the parameters in each experiment, so that each experiment obtains a different output value after its first forward pass and therefore a different cost (error):

Experiment 1: The first output value is 0.82

Experiment 2: The first output value is 0.98

In experiment 1, the parameters are randomly initialized so that the first output value is 0.82 (the actual value of the sample is 0); after 300 iterations of training, the output value drops from 0.82 to 0.09, approaching the actual value. In experiment 2, the first output value is 0.98, and after 300 iterations the output value has only dropped to 0.20.
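These two experiments can be reproduced with a few lines of Python. The sketch below is a minimal reconstruction: the learning rate and the initial values of w and b are illustrative assumptions, chosen only so that the first outputs come out at roughly 0.82 and 0.98; they are not settings stated in the original experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_quadratic(w, b, x=1.0, y=0.0, eta=0.15, iters=300):
    """Train one sigmoid neuron on a single sample with the quadratic cost C = (a - y)^2 / 2."""
    for _ in range(iters):
        a = sigmoid(w * x + b)
        # Quadratic-cost gradients: dC/dw = (a - y) * sigma'(z) * x,  dC/db = (a - y) * sigma'(z),
        # where sigma'(z) = a * (1 - a).
        delta = (a - y) * a * (1 - a)
        w -= eta * delta * x
        b -= eta * delta
    return sigmoid(w * x + b)

# Experiment 1: initial parameters chosen (assumption) so the first output is about 0.82
print(train_quadratic(w=0.6, b=0.9))   # falls close to the actual value 0
# Experiment 2: initial parameters chosen (assumption) so the first output is about 0.98
print(train_quadratic(w=2.0, b=2.0))   # stays much higher after the same number of iterations
```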

From the cost curves of the two experiments, it can be seen that the cost of experiment 1 drops rapidly as training proceeds, while the cost of experiment 2 drops very slowly at first. Intuitively, the larger the initial error, the slower the convergence.

In fact, the reason training is slow when the error is large lies in the use of the quadratic cost function. The formula of the quadratic cost function is as follows:

C = \frac{1}{2n} \sum_x \| y(x) - a(x) \|^2
where C is the cost, x represents a sample, y represents the actual value, a represents the output value, and n represents the total number of samples. For simplicity, use a single sample as an example; the quadratic cost function then becomes:

C = \frac{(y - a)^2}{2}
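As a quick numeric illustration of the single-sample form, using experiment 1's first output from above:

```python
def quadratic_cost(a, y):
    """Single-sample quadratic cost: C = (y - a)^2 / 2."""
    return 0.5 * (y - a) ** 2

# First output of experiment 1: a = 0.82, actual value y = 0
print(quadratic_cost(0.82, 0.0))   # 0.3362
```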
At present, the most effective algorithm for training an ANN is the backpropagation algorithm. In short, training an ANN means propagating the cost backwards and adjusting the parameters so as to reduce the cost. The main parameters are the connection weights w between neurons and the bias b of each neuron. The parameters are adjusted with the gradient descent algorithm, which moves them along the gradient direction. The gradients of w and b are derived as follows:

\frac{\partial C}{\partial w} = (a - y)\,\sigma'(z)\,x
\frac{\partial C}{\partial b} = (a - y)\,\sigma'(z)
where z = wx + b represents the input of the neuron and σ(·) represents the activation function. As can be seen from the formulas above, the gradients of w and b are proportional to the derivative of the activation function: the larger the derivative of the activation function, the faster w and b are adjusted and the faster training converges. The usual activation function of a neural network is the sigmoid function, whose curve is as follows:

[Figure: the sigmoid curve, which is nearly flat (small slope) where the output approaches 0 or 1]
As shown in the figure, the gradient (slope) at experiment 2's initial output (0.98) is significantly smaller than at experiment 1's initial output (0.82), so the parameter gradients of experiment 2 are smaller and its parameters adjust more slowly. This is why a larger initial cost (error) makes training slower, which is contrary to our expectation: unlike a person, the network does not correct more strongly when the error is larger, and so does not learn faster.
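Since a = σ(z), the sigmoid's derivative can be written as σ'(z) = a(1 − a) (see the appendix), so the saturation effect can be checked directly at the two initial outputs; a small sketch:

```python
def sigmoid_prime_from_output(a):
    """Derivative of the sigmoid written in terms of its output: sigma'(z) = a * (1 - a)."""
    return a * (1 - a)

print(sigmoid_prime_from_output(0.82))   # ~0.148 (experiment 1)
print(sigmoid_prime_from_output(0.98))   # ~0.020 (experiment 2, roughly 7x smaller)
```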

One might say: just choose an activation function whose derivative does not shrink (or does not change), and the problem is solved. That would indeed solve it in a simple and crude way, but it may bring other, more troublesome problems. Moreover, functions like the sigmoid and tanh have many advantages and are well suited as activation functions; the details can easily be looked up on your own.



2. Cross-entropy cost function

In other words, instead of changing the activation function, we replace the quadratic cost function with the cross-entropy cost function:

C = -\frac{1}{n} \sum_x \left[\, y \ln a + (1 - y) \ln(1 - a) \,\right]
where x represents a sample and n represents the total number of samples. Now recompute the gradient of the parameter w with this cost:

\frac{\partial C}{\partial w_j} = -\frac{1}{n} \sum_x \left( \frac{y}{\sigma(z)} - \frac{1 - y}{1 - \sigma(z)} \right) \frac{\partial \sigma(z)}{\partial w_j} = \frac{1}{n} \sum_x \frac{\sigma'(z)\, x_j}{\sigma(z)\,(1 - \sigma(z))}\,\bigl(\sigma(z) - y\bigr)
where (see the appendix for the proof):

\sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)
Thus the σ'(z) factor in the gradient formula for w cancels out, leaving:

\frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j\,\bigl(\sigma(z) - y\bigr)

In addition, the gradient formula now contains (σ(z) − y), the error between the output value and the actual value. Therefore, the larger the error, the larger the gradient, the faster the parameter w adjusts, and the faster training proceeds. In the same vein, the gradient of b is:

\frac{\partial C}{\partial b} = \frac{1}{n} \sum_x \bigl(\sigma(z) - y\bigr)
Practice shows that the training effect of the cross-entropy cost function is better than that of the quadratic cost function.
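For comparison, experiment 2 can be repeated with the cross-entropy cost: only the gradient computation changes, because σ'(z) has cancelled out. A minimal sketch, using the same assumed initial parameters and learning rate as the earlier quadratic-cost sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_cross_entropy(w, b, x=1.0, y=0.0, eta=0.15, iters=300):
    """Train one sigmoid neuron on a single sample with the cross-entropy cost.

    Gradients: dC/dw = x * (a - y), dC/db = (a - y) -- the sigma'(z) factor is gone.
    """
    for _ in range(iters):
        a = sigmoid(w * x + b)
        w -= eta * (a - y) * x
        b -= eta * (a - y)
    return sigmoid(w * x + b)

# Experiment 2's saturated start (same assumed initial parameters, first output ~0.98)
print(train_cross_entropy(w=2.0, b=2.0))   # drops far below the 0.20 reached with the quadratic cost
```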



3. How the cross-entropy cost function is derived

Taking the gradient calculation for the bias b as an example, let us derive the cross-entropy cost function.



In section 1, the gradient formula for b derived from the quadratic cost function is:

\frac{\partial C}{\partial b} = (a - y)\,\sigma'(z)
To eliminate σ'(z) from this equation, we would like to find a cost function C such that:

\frac{\partial C}{\partial b} = a - y
That is, since \frac{\partial C}{\partial b} = \frac{\partial C}{\partial a}\,\sigma'(z) and \sigma'(z) = a\,(1 - a):

\frac{\partial C}{\partial a} = \frac{a - y}{a\,(1 - a)}
Integrating both sides with respect to a, we get:

C = -\left[\, y \ln a + (1 - y) \ln(1 - a) \,\right] + \text{constant}
And this is exactly the cross-entropy cost function described earlier (written for a single sample; averaging over the n samples gives the earlier formula).
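The derivation above can be verified symbolically, for instance with sympy (a verification sketch, not part of the original derivation):

```python
import sympy as sp

a, y = sp.symbols('a y', positive=True)

# Candidate single-sample cost: C = -[y*ln(a) + (1 - y)*ln(1 - a)]
C = -(y * sp.log(a) + (1 - y) * sp.log(1 - a))

dC_da = sp.simplify(sp.diff(C, a))
# dC/da should equal (a - y) / (a * (1 - a)) ...
print(sp.simplify(dC_da - (a - y) / (a * (1 - a))))   # expect 0
# ... so dC/db = dC/da * sigma'(z), with sigma'(z) = a*(1 - a), reduces to (a - y)
print(sp.simplify(dC_da * a * (1 - a)))               # expect a - y
```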




Appendix:

The sigmoid function is:

\sigma(z) = \frac{1}{1 + e^{-z}}
It can be shown that:

\sigma'(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)
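A quick numeric sanity check of this identity, comparing a central finite difference against σ(z)(1 − σ(z)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)   # central finite difference
analytic = sigmoid(z) * (1 - sigmoid(z))                # sigma(z) * (1 - sigma(z))
print(np.max(np.abs(numeric - analytic)))               # tiny (~1e-10): the two agree
```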