The state of the art for non-linearity is to use ReLU instead of the sigmoid function in deep neural networks. What are the advantages? I know that training a network is faster when ReLU is used, and that it is more biologically inspired. What are the other advantages? (That is, what are the disadvantages of using sigmoid?)
Best answer on Stack Exchange:
Two additional major benefits of ReLUs are sparsity and a reduced likelihood of vanishing gradient. But first, recall that the definition of a ReLU is h = max(0, a), where a = Wx + b.
One major benefit is the reduced likelihood of the gradient vanishing. This arises when a > 0. In this regime the gradient has a constant value. In contrast, the gradient of the sigmoid becomes increasingly small as the absolute value of x increases. The constant gradient of ReLUs results in faster learning.
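To see the difference numerically, here is a minimal NumPy sketch (our illustration, not part of the quoted answer; the helper names relu_grad and sigmoid_grad are made up for this example):

```python
import numpy as np

def relu_grad(a):
    # Derivative of ReLU h = max(0, a): exactly 1 where a > 0, 0 elsewhere.
    return (a > 0).astype(float)

def sigmoid_grad(a):
    # Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 at a = 0
    # and decays toward 0 as |a| grows.
    s = 1.0 / (1.0 + np.exp(-a))
    return s * (1.0 - s)

a = np.array([0.5, 2.0, 5.0, 10.0])
print(relu_grad(a))     # [1. 1. 1. 1.] -- constant, no matter how large a gets
print(sigmoid_grad(a))  # [2.35e-01 1.05e-01 6.65e-03 4.54e-05] -- shrinks fast
```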
The other benefit of ReLUs is sparsity. Sparsity arises when a ≤ 0. The more such units that exist in a layer, the more sparse the resulting representation. Sigmoids, on the other hand, are always likely to generate some non-zero value, resulting in dense representations. Sparse representations seem to be more beneficial than dense representations.
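As a quick sanity check of the sparsity claim (again our own sketch, assuming standard-normal pre-activations rather than any real network), roughly half of a ReLU layer's outputs are exactly zero, while sigmoid outputs never are:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=1000)  # simulated pre-activations a = Wx + b

relu_out = np.maximum(0.0, a)
sigmoid_out = 1.0 / (1.0 + np.exp(-a))

# Units with a <= 0 output exactly zero under ReLU, so roughly half of this
# layer's representation is zero; sigmoid outputs are never exactly zero.
print(np.mean(relu_out == 0.0))     # ~0.5 -> sparse representation
print(np.mean(sigmoid_out == 0.0))  # 0.0  -> dense representation
```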
Reference: http://stats.stackexchange.com/questions/126238/what-are-the-advantages-of-relu-over-sigmoid-function-in-deep-neural-network
ReLU
ReLU stands for rectified linear unit. The answer above basically covers the ways in which it beats the sigmoid function:
- faster
- more biologically inspired
- sparsity
- less chance of vanishing gradients
Early deep learning models that used sigmoid or tanh activation functions often failed to converge during unsupervised pre-training because of the vanishing gradient problem. ReLU does not have this issue.
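A toy calculation (ours, with weights ignored so only the activation derivatives matter) shows why depth makes this fatal for sigmoids: the per-layer factor s * (1 - s) is at most 0.25, so after 30 layers the gradient has shrunk by a factor of roughly 10^18, while an active ReLU path keeps it intact:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Backpropagating through 30 stacked activation layers (weights ignored for
# simplicity): each sigmoid layer multiplies the gradient by s * (1 - s),
# which is at most 0.25, so the product decays geometrically with depth.
# Each active ReLU layer multiplies the gradient by exactly 1.
grad_sigmoid = grad_relu = 1.0
for _ in range(30):
    s = sigmoid(0.0)               # best case for sigmoid: a = 0, s = 0.5
    grad_sigmoid *= s * (1.0 - s)  # factor of 0.25 per layer
    grad_relu *= 1.0               # factor of 1 per layer

print(grad_sigmoid)  # ~8.7e-19 -- far too small to drive weight updates
print(grad_relu)     # 1.0
```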