In addition to the weight decay described earlier, deep learning models often use dropout (the "discard" method) to cope with overfitting.
Methods and principles
To keep the model deterministic at test time, dropout is applied only while the model is being trained, not while it is being tested. When a layer of a neural network uses dropout, each neuron in that layer is dropped with a certain probability.
Let the drop probability be \(p\). Specifically, after the activation function is applied, each neuron's output is set to 0 with probability \(p\), and with probability \(1-p\) it is divided by \(1-p\) (stretched). The drop probability is a hyperparameter of dropout.
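In other words, for a unit whose pre-dropout value is \(h\), dropout computes
\[h' = \begin{cases} 0 & \text{with probability } p, \\ \dfrac{h}{1-p} & \text{with probability } 1-p. \end{cases}\]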
In a multilayer perceptron, the output of a hidden unit is
\[h_i = \phi (x_1 w_1^{(i)} + x_2 w_2^{(i)} + x_3 w_3^{(i)} + x_4 w_4^{(i)} + b^{(i)}), \]
Let the drop probability be \(p\), and let the random variable \(\xi_i\) take the value 0 with probability \(p\) and 1 with probability \(1-p\). Then the hidden unit \(h_i\) computed with dropout becomes
\[h_i = \frac{\xi_i}{1-p} \phi(x_1 w_1^{(i)} + x_2 w_2^{(i)} + x_3 w_3^{(i)} + x_4 w_4^{(i)} + b^{(i)}).\]
Note that dropout is not used when testing the model. Since \(\mathbb{E}(\xi_i) = 0 \cdot p + 1 \cdot (1-p) = 1-p\), we have \(\mathbb{E}\left(\frac{\xi_i}{1-p}\right) = \frac{\mathbb{E}(\xi_i)}{1-p} = 1\), so the expected output of each neuron is the same during training and testing.
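As a quick numeric check, the following minimal sketch samples the mask \(\xi\) and verifies that, after the \(1/(1-p)\) stretch, the mean output stays close to the mean input (the constant input and \(p = 0.5\) are chosen only for illustration):

from mxnet import nd

p = 0.5                                        # drop probability
x = nd.ones((1000, 1000))                      # constant input, mean exactly 1
xi = nd.random.uniform(0, 1, x.shape) < 1 - p  # xi is 1 with probability 1 - p
h = xi * x / (1 - p)                           # dropout with the 1/(1-p) stretch
print(x.mean(), h.mean())                      # both close to 1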
Since the hidden units are dropped at random during training, the output layer
\[o_1 = \phi(h_1 w_1' + h_2 w_2' + h_3 w_3' + h_4 w_4' + h_5 w_5' + b')\]
cannot rely too heavily on any one of \(h_1, \ldots, h_5\), which discourages any single weight among \(w_1', \ldots, w_5'\) in the expression for \(o_1\) from growing excessively large. Dropout therefore acts as a form of regularization and can be used to deal with overfitting.
Implementation
The following dropout function drops the elements of X with probability drop_prob.
import sys
sys.path.append('..')
import gluonbook as gb
from mxnet import autograd, nd
from mxnet.gluon import loss as gloss

def dropout(X, drop_prob):
    assert 0 <= drop_prob <= 1
    keep_prob = 1 - drop_prob
    # In this case, all elements are dropped.
    if keep_prob == 0:
        return X.zeros_like()
    mask = nd.random.uniform(0, 1, X.shape) < keep_prob
    return mask * X / keep_prob
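As a quick check (the small array and drop probabilities below are purely illustrative), the function can be applied with different drop probabilities:

X = nd.arange(16).reshape((2, 8))
print(dropout(X, 0))    # unchanged
print(dropout(X, 0.5))  # roughly half the entries zeroed, the rest scaled by 2
print(dropout(X, 1))    # all zeros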
Define the network parameters: a three-layer network (two hidden layers plus the output layer) for the Fashion-MNIST classification task.
num_inputs = 784
num_outputs = 10
num_hiddens1 = 256
num_hiddens2 = 256

W1 = nd.random.normal(scale=0.01, shape=(num_inputs, num_hiddens1))
b1 = nd.zeros(num_hiddens1)
W2 = nd.random.normal(scale=0.01, shape=(num_hiddens1, num_hiddens2))
b2 = nd.zeros(num_hiddens2)
W3 = nd.random.normal(scale=0.01, shape=(num_hiddens2, num_outputs))
b3 = nd.zeros(num_outputs)

params = [W1, b1, W2, b2, W3, b3]
for param in params:
    param.attach_grad()
The fully connected layers and the ReLU activation function are chained together, and dropout is applied to the output of each activation function. The drop probability of each layer can be set separately; in general, it is recommended to use a smaller drop probability closer to the input layer. The network is defined as follows:
drop_prob1 = 0.2
drop_prob2 = 0.5

def net(X):
    X = X.reshape((-1, num_inputs))
    H1 = (nd.dot(X, W1) + b1).relu()
    # Use dropout only when training the model.
    if autograd.is_training():
        # Apply dropout after the first fully connected layer.
        H1 = dropout(H1, drop_prob1)
    H2 = (nd.dot(H1, W2) + b2).relu()
    if autograd.is_training():
        # Apply dropout after the second fully connected layer.
        H2 = dropout(H2, drop_prob2)
    return nd.dot(H2, W3) + b3
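Because net checks autograd.is_training(), dropout is active only inside autograd.record(); a forward pass outside of it (with an illustrative random input) drops nothing:

X = nd.random.normal(shape=(2, num_inputs))
y_eval = net(X)          # evaluation mode: both dropout calls are skipped
with autograd.record():
    y_train = net(X)     # training mode: dropout is applied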
Training and testing:
num_epochs = 5
lr = 0.5
batch_size = 256
loss = gloss.SoftmaxCrossEntropyLoss()
train_iter, test_iter = gb.load_data_fashion_mnist(batch_size)
gb.train_cpu(net, train_iter, test_iter, loss, num_epochs, batch_size,
             params, lr)
Result output:
epoch 1, loss 0.9913, train acc 0.663, test acc 0.931
epoch 2, loss 0.2302, train acc 0.933, test acc 0.954
epoch 3, loss 0.1601, train acc 0.953, test acc 0.958
epoch 4, loss 0.1250, train acc 0.964, test acc 0.973
epoch 5, loss 0.1045, train acc 0.969, test acc 0.974
Gluon implementation
During training, Gluon's Dropout layer randomly drops the output elements of the previous layer with the specified drop probability; when the model is being tested, the Dropout layer does nothing and passes its input through unchanged.
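For example (a minimal standalone sketch with an illustrative input), a Dropout layer only drops elements when running inside autograd.record():

from mxnet import autograd, nd
from mxnet.gluon import nn

drop_layer = nn.Dropout(0.5)
drop_layer.initialize()

X = nd.ones((2, 8))
print(drop_layer(X))      # prediction mode: output equals the input
with autograd.record():
    print(drop_layer(X))  # training mode: about half the entries zeroed, the rest scaled by 2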
With Gluon, we can construct a multilayer neural network and apply dropout more concisely.
import sys
sys.path.append('..')
import gluonbook as gb
from mxnet import autograd, gluon, init, nd
from mxnet.gluon import loss as gloss, nn

drop_prob1 = 0.2
drop_prob2 = 0.5

net = nn.Sequential()
net.add(nn.Flatten())
net.add(nn.Dense(256, activation="relu"))
# Add a dropout layer after the first fully connected layer.
net.add(nn.Dropout(drop_prob1))
net.add(nn.Dense(256, activation="relu"))
# Add a dropout layer after the second fully connected layer.
net.add(nn.Dropout(drop_prob2))
net.add(nn.Dense(10))
net.initialize(init.Normal(sigma=0.01))
Training and Results:
num_epochs = 5
batch_size = 256
loss = gloss.SoftmaxCrossEntropyLoss()
train_iter, test_iter = gb.load_data_fashion_mnist(batch_size)
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.5})
gb.train_cpu(net, train_iter, test_iter, loss, num_epochs, batch_size,
             None, None, trainer)

# output
epoch 1, loss 0.9815, train acc 0.668, test acc 0.927
epoch 2, loss 0.2365, train acc 0.931, test acc 0.952
epoch 3, loss 0.1634, train acc 0.952, test acc 0.968
epoch 4, loss 0.1266, train acc 0.963, test acc 0.972
epoch 5, loss 0.1069, train acc 0.969, test acc 0.976
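After training, predictions run outside autograd.record(), so the Dropout layers are inactive; as a quick illustrative check on one test batch:

for X, y in test_iter:
    y_hat = net(X)
    print(y_hat.argmax(axis=1)[:10])  # predicted classes for the first few test images
    break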