Activation Function
Sigmoid Output Layer
If the output layer uses sigmoid activations with the mean squared error loss, the network can run into a "large error but slow learning" situation: the partial derivative of the loss with respect to the weights carries a factor of σ′(z), and when the sigmoid saturates, σ′(z) is close to zero. No matter how large the error is, the partial derivative stays small, so learning is slow.
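To see why, here is the standard single-neuron derivation (this is the setup used in the reference below; the scalar input x, weight w, and bias b are just for illustration):

```latex
% Quadratic (MSE) cost for one sigmoid neuron with output a = \sigma(z), z = wx + b:
C = \frac{(y - a)^2}{2}
% Both gradients carry a \sigma'(z) factor:
\frac{\partial C}{\partial w} = (a - y)\,\sigma'(z)\,x, \qquad
\frac{\partial C}{\partial b} = (a - y)\,\sigma'(z)
% When the neuron saturates, \sigma'(z) \approx 0, so the gradients are tiny
% no matter how large the error (a - y) is.
```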
Using cross-entropy as the loss function instead eliminates this slow learning, because the σ′(z) factor cancels out of the gradient; see reference 1.
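For comparison, with the cross-entropy cost for the same single-neuron setup, the identity σ′(z) = σ(z)(1 − σ(z)) makes the σ′(z) factor cancel, so the gradient scales directly with the error:

```latex
C = -\left[\, y \ln a + (1 - y) \ln(1 - a) \,\right]
% Using \sigma'(z) = \sigma(z)\,(1 - \sigma(z)), the \sigma'(z) factor cancels:
\frac{\partial C}{\partial w} = x\,(a - y), \qquad
\frac{\partial C}{\partial b} = (a - y)
```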
K sigmoid nodes can be used in the output layer to solve a K-class classification problem, one node per class, as in the sketch below.
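A minimal sketch of this one-node-per-class scheme, with made-up logits for a 3-class problem (the variable names and values are illustrative, not from the reference):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# K = 3 independent sigmoid output nodes, one per class; z holds example
# pre-activations. Each output is a per-class score in (0, 1).
z = np.array([2.0, -1.0, 0.5])
scores = sigmoid(z)
predicted_class = int(np.argmax(scores))  # class whose node fires the strongest
print(scores, predicted_class)
```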
Softmax Output Layer

The output layer uses the softmax(x) function, which takes an M-dimensional feature vector x as input and outputs an M-dimensional vector a. It can serve as the output layer of an M-class neural network, with the output a interpreted as a probability distribution over the M classes.
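A minimal NumPy sketch of such a softmax output layer (the max-subtraction is a standard numerical-stability trick, not something the text above requires):

```python
import numpy as np

def softmax(x):
    """Map an M-dimensional vector to an M-dimensional probability distribution."""
    shifted = x - np.max(x)   # subtracting the max avoids overflow in exp
    exps = np.exp(shifted)
    return exps / np.sum(exps)

a = softmax(np.array([2.0, 1.0, 0.1]))
print(a)          # all components positive
print(a.sum())    # sums to 1, so a can be read as class probabilities
```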
ReLU

ReLU mitigates the vanishing-gradient problem: its derivative is 1 for all positive inputs, so it does not saturate the way sigmoid does.
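A short sketch contrasting the ReLU gradient with the sigmoid gradient (example values are arbitrary):

```python
import numpy as np

def relu_grad(z):
    # ReLU'(z) is 1 for positive inputs and 0 otherwise -- it never
    # saturates on the positive side.
    return (z > 0).astype(float)

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)     # at most 0.25, and near 0 for large |z|

z = np.array([-5.0, 0.5, 5.0])
print(relu_grad(z))      # [0. 1. 1.]
print(sigmoid_grad(z))   # [~0.0066  ~0.2350  ~0.0066]
```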
Loss Function

Euclidean distance (mean squared error)
Cross-entropy
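A quick numerical comparison of the two losses at a saturated sigmoid output (single neuron; the values are chosen only to illustrate the saturation effect):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A saturated sigmoid output: target y = 0, but z is large, so a ~ 1 (big error).
z, y, x = 5.0, 0.0, 1.0
a = sigmoid(z)
sigma_prime = a * (1.0 - a)

grad_mse = (a - y) * sigma_prime * x   # quadratic cost: carries the sigma'(z) factor
grad_ce = (a - y) * x                  # cross-entropy: the sigma'(z) factor cancels

print(grad_mse)  # ~0.0066 -- tiny gradient despite a large error
print(grad_ce)   # ~0.9933 -- gradient proportional to the error
```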
Correlation Between Learning Speed and the Derivative of the Activation Function

Learning speed is tied to the size of the error, but with a sigmoid output and the quadratic cost it is also scaled by the activation's derivative σ′(z), as the sketch above shows.

Reference
1. Michael Nielsen, Neural Networks and Deep Learning, Chapter 3. http://neuralnetworksanddeeplearning.com/chap3.html