In addition to the weight decay described earlier, deep learning models often use dropout (the "discard" method) to cope with overfitting.
Methods and principles
To keep the model deterministic at test time, dropout is applied only while the model is being trained, not while it is being tested. When a layer of a neural network uses dropout, each neuron in that layer is dropped with a certain probability.
Let the drop probability be \(p\). Specifically, after the activation function is applied, each neuron's output is set to 0 with probability \(p\), and with probability \(1-p\) it is divided by \(1-p\) (stretched). The drop probability is a hyperparameter of dropout.
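In other words, for a unit whose pre-dropout value is \(h\), dropout computes
\[h' = \begin{cases} 0 & \text{with probability } p, \\ \dfrac{h}{1-p} & \text{with probability } 1-p. \end{cases}\]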
In a multilayer perceptron, the output of a hidden unit is
\[h_i = \phi (x_1 w_1^{(i)} + x_2 w_2^{(i)} + x_3 w_3^{(i)} + x_4 w_4^{(i)} + b^{(i)}), \]
Let the drop probability be \(p\), and let the random variable \(\xi_i\) take the value 0 with probability \(p\) and 1 with probability \(1-p\). Then the hidden unit \(h_i\) computed with dropout becomes
\[h_i = \frac{\xi_i}{1-p} \phi(x_1 w_1^{(i)} + x_2 w_2^{(i)} + x_3 w_3^{(i)} + x_4 w_4^{(i)} + b^{(i)}).\]
Note that dropout is not used when testing the model. Since \(\mathbb{E}(\xi_i) = 0 \cdot p + 1 \cdot (1-p) = 1-p\), we have \(\mathbb{E}\left(\frac{\xi_i}{1-p}\right) = \frac{\mathbb{E}(\xi_i)}{1-p} = 1\), so the expected output of each neuron is the same during training and testing.
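As a quick numeric check, the following minimal sketch samples the mask \(\xi\) and verifies that, after the \(1/(1-p)\) stretch, the mean output stays close to the mean input (the constant input and \(p = 0.5\) are chosen only for illustration):

from mxnet import nd

p = 0.5                                        # drop probability
x = nd.ones((1000, 1000))                      # constant input, mean exactly 1
xi = nd.random.uniform(0, 1, x.shape) < 1 - p  # xi is 1 with probability 1 - p
h = xi * x / (1 - p)                           # dropout with the 1/(1-p) stretch
print(x.mean(), h.mean())                      # both close to 1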
Since the hidden units are dropped at random during training, the output layer
\[o_1 = \phi(h_1 w_1' + h_2 w_2' + h_3 w_3' + h_4 w_4' + h_5 w_5' + b')\]
cannot rely too heavily on any one of \(h_1, \ldots, h_5\), which discourages any single weight among \(w_1', \ldots, w_5'\) in the expression for \(o_1\) from growing excessively large. Dropout therefore acts as a form of regularization and can be used to deal with overfitting.
Implementation
The following dropout function drops the elements of X with probability drop_prob.
import sys
sys.path.append('..')
import gluonbook as gb
from mxnet import autograd, nd
from mxnet.gluon import loss as gloss

def dropout(X, drop_prob):
    assert 0 <= drop_prob <= 1
    keep_prob = 1 - drop_prob
    # In this case, all elements are dropped.
    if keep_prob == 0:
        return X.zeros_like()
    mask = nd.random.uniform(0, 1, X.shape) < keep_prob
    return mask * X / keep_prob
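As a quick check (the small array and drop probabilities below are purely illustrative), the function can be applied with different drop probabilities:

X = nd.arange(16).reshape((2, 8))
print(dropout(X, 0))    # unchanged
print(dropout(X, 0.5))  # roughly half the entries zeroed, the rest scaled by 2
print(dropout(X, 1))    # all zeros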
Define the network parameters: a three-layer network (two hidden layers plus the output layer) for the Fashion-MNIST classification task.
num_inputs = 784
num_outputs = 10
num_hiddens1 = 256
num_hiddens2 = 256

W1 = nd.random.normal(scale=0.01, shape=(num_inputs, num_hiddens1))
b1 = nd.zeros(num_hiddens1)
W2 = nd.random.normal(scale=0.01, shape=(num_hiddens1, num_hiddens2))
b2 = nd.zeros(num_hiddens2)
W3 = nd.random.normal(scale=0.01, shape=(num_hiddens2, num_outputs))
b3 = nd.zeros(num_outputs)

params = [W1, b1, W2, b2, W3, b3]
for param in params:
    param.attach_grad()
The fully connected layers and the ReLU activation function are chained together, and dropout is applied to the output of each activation function. The drop probability of each layer can be set separately; in general, it is recommended to use a smaller drop probability closer to the input layer. The network is defined as follows:
drop_prob1 = 0.2
drop_prob2 = 0.5

def net(X):
    X = X.reshape((-1, num_inputs))
    H1 = (nd.dot(X, W1) + b1).relu()
    # Use dropout only when training the model.
    if autograd.is_training():
        # Apply dropout after the first fully connected layer.
        H1 = dropout(H1, drop_prob1)
    H2 = (nd.dot(H1, W2) + b2).relu()
    if autograd.is_training():
        # Apply dropout after the second fully connected layer.
        H2 = dropout(H2, drop_prob2)
    return nd.dot(H2, W3) + b3
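Because net checks autograd.is_training(), dropout is active only inside autograd.record(); a forward pass outside of it (with an illustrative random input) drops nothing:

X = nd.random.normal(shape=(2, num_inputs))
y_eval = net(X)          # evaluation mode: both dropout calls are skipped
with autograd.record():
    y_train = net(X)     # training mode: dropout is applied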
Training and testing:
num_epochs = 5
lr = 0.5
batch_size = 256
loss = gloss.SoftmaxCrossEntropyLoss()
train_iter, test_iter = gb.load_data_fashion_mnist(batch_size)
gb.train_cpu(net, train_iter, test_iter, loss, num_epochs, batch_size,
             params, lr)
Result output:
epoch 1, loss 0.9913, train acc 0.663, test acc 0.931
epoch 2, loss 0.2302, train acc 0.933, test acc 0.954
epoch 3, loss 0.1601, train acc 0.953, test acc 0.958
epoch 4, loss 0.1250, train acc 0.964, test acc 0.973
epoch 5, loss 0.1045, train acc 0.969, test acc 0.974
Gluon implementation
During training, Gluon's Dropout layer randomly drops the output elements of the previous layer with the specified drop probability; when the model is being tested, the Dropout layer does nothing and passes its input through unchanged.
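For example (a minimal standalone sketch with an illustrative input), a Dropout layer only drops elements when running inside autograd.record():

from mxnet import autograd, nd
from mxnet.gluon import nn

drop_layer = nn.Dropout(0.5)
drop_layer.initialize()

X = nd.ones((2, 8))
print(drop_layer(X))      # prediction mode: output equals the input
with autograd.record():
    print(drop_layer(X))  # training mode: about half the entries zeroed, the rest scaled by 2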
With Gluon, we can construct a multilayer neural network and apply dropout more concisely.
import sys
sys.path.append('..')
import gluonbook as gb
from mxnet import autograd, gluon, init, nd
from mxnet.gluon import loss as gloss, nn

drop_prob1 = 0.2
drop_prob2 = 0.5

net = nn.Sequential()
net.add(nn.Flatten())
net.add(nn.Dense(256, activation="relu"))
# Add a dropout layer after the first fully connected layer.
net.add(nn.Dropout(drop_prob1))
net.add(nn.Dense(256, activation="relu"))
# Add a dropout layer after the second fully connected layer.
net.add(nn.Dropout(drop_prob2))
net.add(nn.Dense(10))
net.initialize(init.Normal(sigma=0.01))
Training and Results:
num_epochs = 5
batch_size = 256
loss = gloss.SoftmaxCrossEntropyLoss()
train_iter, test_iter = gb.load_data_fashion_mnist(batch_size)
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.5})
gb.train_cpu(net, train_iter, test_iter, loss, num_epochs, batch_size,
             None, None, trainer)

# output
epoch 1, loss 0.9815, train acc 0.668, test acc 0.927
epoch 2, loss 0.2365, train acc 0.931, test acc 0.952
epoch 3, loss 0.1634, train acc 0.952, test acc 0.968
epoch 4, loss 0.1266, train acc 0.963, test acc 0.972
epoch 5, loss 0.1069, train acc 0.969, test acc 0.976
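After training, predictions run outside autograd.record(), so the Dropout layers are inactive; as a quick illustrative check on one test batch:

for X, y in test_iter:
    y_hat = net(X)
    print(y_hat.argmax(axis=1)[:10])  # predicted classes for the first few test images
    break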