This article is based on the third chapter of *Neural Networks and Deep Learning*, and discusses the cross-entropy cost function used in machine learning algorithms.

1. Starting from the quadratic cost function

The cost function is often taken to be the quadratic cost (i.e., the mean squared error, MSE). For a single neuron (single input, single output, sigmoid activation) and a single training example, it is defined as:

C = (y − a)² / 2

where y is the desired output and a is the actual output of the neuron: a = σ(z), with z = wx + b.

In training the neural network, we update w and b with the gradient descent algorithm, so we need the derivatives of the cost function with respect to w and b:

∂C/∂w = (a − y) σ′(z) x

∂C/∂b = (a − y) σ′(z)

(For a training example with x = 1 and desired output y = 0, as in the book's illustration, both derivatives reduce to a · σ′(z).)

Then update W, B:

w ← w − η ∂C/∂w = w − η · a · σ′(z)

b ← b − η ∂C/∂b = b − η · a · σ′(z)

Because of the shape of the sigmoid function, σ′(z) is very small for most values of z (the curve is almost flat at both ends), which makes the updates to w and b very slow (because η · a · σ′(z) is close to 0).
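This saturation effect is easy to verify numerically. The following is a minimal sketch (not from the original article) that evaluates σ′(z) at a few points; the derivative peaks at z = 0 and vanishes in the flat tails:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# sigma'(z) peaks at 0.25 for z = 0 and shrinks rapidly as |z| grows,
# so a saturated neuron learns very slowly under the quadratic cost
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z = {z:5.1f}   sigma'(z) = {sigmoid_prime(z):.6f}")
```

At z = 10 the derivative is on the order of 10⁻⁵, so the update step η · a · σ′(z) is essentially zero.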

2. Cross-entropy cost function (cross-entropy)

To overcome this shortcoming, the cross-entropy cost function is introduced (the following formula is for one neuron with multiple inputs and a single output):

C = −(1/n) Σₓ [y ln a + (1 − y) ln(1 − a)]

where the sum runs over all n training examples x.

where y is the desired output and a is the actual output of the neuron: a = σ(z), with z = Σⱼ wⱼxⱼ + b.

Like the quadratic cost, the cross-entropy cost function has **two properties**:

- It is non-negative. (So our goal is to minimize the cost function.)
- It is close to 0 when the actual output a is close to the desired output y. (For example, when y = 0, a ≈ 0 or y = 1, a ≈ 1, the cost function is close to 0.)
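Both properties can be checked numerically. Here is a minimal sketch (not from the original article) for a single neuron and a single example:

```python
import math

def cross_entropy(y, a):
    # C = -[y ln a + (1 - y) ln(1 - a)], single neuron, single example
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

# non-negative for any actual output a in (0, 1)
print(cross_entropy(0.0, 0.5))   # worst guess for y = 0: cost ln 2

# approaches 0 as the actual output approaches the desired output
print(cross_entropy(1.0, 0.999))
print(cross_entropy(0.0, 0.001))
```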

In addition, it overcomes the problem of the quadratic cost updating the weights too slowly. Let us look at its derivatives:

∂C/∂wⱼ = (1/n) Σₓ xⱼ (σ(z) − y)

∂C/∂b = (1/n) Σₓ (σ(z) − y)

As you can see, there is no σ′(z) in the derivatives; the updates of the weights are governed by σ(z) − y, that is, by the error. So when the error is large, the weights update quickly, and when the error is small, they update slowly. This is a very desirable property.
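The contrast can be made concrete with a badly saturated neuron. The following sketch (not from the original article; single neuron, single example, bias gradient only) compares the two cost functions when the actual output is near 1 but the target is 0:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def quadratic_grad_b(a, y, z):
    # dC/db for C = (a - y)^2 / 2 is (a - y) * sigma'(z)
    return (a - y) * sigmoid(z) * (1.0 - sigmoid(z))

def cross_entropy_grad_b(a, y):
    # dC/db for C = -[y ln a + (1 - y) ln(1 - a)] is simply (a - y)
    return a - y

# saturated neuron: z is large, actual output near 1, desired output 0
z = 10.0
a = sigmoid(z)
y = 0.0
print(quadratic_grad_b(a, y, z))   # tiny gradient: learning stalls
print(cross_entropy_grad_b(a, y))  # gradient ~ 1: learning proceeds
```

The quadratic gradient is crushed by the σ′(z) factor even though the error is nearly maximal, while the cross-entropy gradient equals the error itself.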

3. Summary

When we use the sigmoid function as the activation function of a neuron, it is better to use the cross-entropy cost function instead of the quadratic cost, to keep the training process from becoming too slow.

However, you may ask: why the cross-entropy function in particular? There are countless functions whose derivatives contain no σ′(z), so how would one arrive at cross-entropy? There is naturally a story behind this; a deeper discussion is beyond the scope of these notes, and interested readers can explore it on their own.

In addition, why does the cross-entropy take the form −[y ln a + (1 − y) ln(1 − a)] rather than −[a ln y + (1 − a) ln(1 − y)]? Because ln y is undefined when the desired output is y = 0, and ln(1 − y) is undefined when y = 1. Since a is the actual output of the sigmoid function, it never equals exactly 0 or 1 but only approaches them, so no such problem arises.

4. A further note: the log-likelihood cost

The log-likelihood function is also often used as the cost function for softmax regression. In the discussion above, our last layer (the output layer) used the sigmoid function, so the cross-entropy cost was adopted. The more common practice in deep learning is to use softmax as the last layer, in which case the cost function is the log-likelihood cost.

In fact, it's useful to think of a softmax output layer with log-likelihood cost as being quite similar to a sigmoid output layer with cross-entropy cost.

In fact, the two are consistent: logistic regression uses the sigmoid function, and softmax regression is the multi-class extension of logistic regression. In the two-class case, the log-likelihood cost function reduces to the form of the cross-entropy cost. Refer to the UFLDL tutorial for details.
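The two-class reduction can be verified directly. The sketch below (not from the original article) applies softmax to the logits (z, 0): the probability of the first class is exactly σ(z), so the log-likelihood cost −ln p matches the sigmoid cross-entropy cost with target y = 1:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    # subtract the max for numerical stability
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

# two-class softmax over logits (z, 0) gives sigmoid(z) for the first class
z = 1.3
p = softmax([z, 0.0])
print(p[0], sigmoid(z))  # the two values agree

# and the log-likelihood cost -ln p[0] matches the cross-entropy cost
# of a sigmoid neuron with desired output y = 1
print(-math.log(p[0]), -math.log(sigmoid(z)))
```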

When reprinting, please cite the source: http://blog.csdn.net/u012162613/article/details/44239919
