MXNET: Weight Decay


Weight decay is a common method for dealing with overfitting.

\(L_2\) Norm Regularization

In deep learning, we often use \(L_2\) norm regularization: an \(L_2\) norm penalty is added to the model's original loss function, and training then minimizes this new loss.

The \(L_2\) norm penalty term is the product of a hyperparameter and the sum of squares of the model's weight parameters. For example, if \(w_1, w_2\) are the weight parameters and \(b\) is the bias parameter, the new loss function with the \(L_2\) norm penalty is:

\[\ell(w_1, w_2, b) + \frac{\lambda}{2}\left(w_1^2 + w_2^2\right),\]

where the hyperparameter \(\lambda > 0\) controls the strength of the penalty.
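For instance, with illustrative values \(w_1 = 1\), \(w_2 = -2\) and \(\lambda = 4\), the penalty term is \(\frac{4}{2}\left(1^2 + (-2)^2\right) = 10\); increasing \(\lambda\) makes large weights more costly, while \(\lambda = 0\) recovers the original loss function.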

With the \(L_2\) norm penalty, taking mini-batch stochastic gradient descent on a single-layer neural network as an example, the weight update formulas become:

\[w_1 \leftarrow w_1 - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} x_1^{(i)} \left(x_1^{(i)} w_1 + x_2^{(i)} w_2 + b - y^{(i)}\right) - \eta\lambda w_1,\]

\[w_2 \leftarrow w_2 - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} x_2^{(i)} \left(x_1^{(i)} w_1 + x_2^{(i)} w_2 + b - y^{(i)}\right) - \eta\lambda w_2.\]

Here \(\eta\) is the learning rate and \(\mathcal{B}\) is the mini-batch, with \(|\mathcal{B}|\) the number of samples in it. As the updates show, the \(L_2\) norm penalty subtracts an extra \(\eta\lambda w_1\) and \(\eta\lambda w_2\) at every step, so each update shrinks the weights toward zero. This is why \(L_2\) norm regularization is also called weight decay.
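To make the shrinking effect concrete, here is a minimal Python sketch (the values of eta, lambd, w and grad are made up for illustration and are not taken from the experiment below) that compares one plain SGD step with one weight-decay step on a single weight:

# One SGD step with and without weight decay (illustrative values only).
eta, lambd = 0.1, 0.5
w = 2.0      # current weight
grad = 0.3   # gradient of the unpenalized loss w.r.t. w

w_plain = w - eta * grad                     # plain SGD update
w_decay = w - eta * grad - eta * lambd * w   # update with the L2 penalty
# equivalently: w_decay = (1 - eta * lambd) * w - eta * grad

print(w_plain)  # 1.97
print(w_decay)  # 1.87, the weight is additionally shrunk toward zero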

In practice, we sometimes also add the sum of squares of the bias parameters to the penalty term.

Suppose a neuron in a neural network takes inputs \(x_1, x_2\), uses the activation function \(\phi\), and outputs \(\phi(x_1 w_1 + x_2 w_2 + b)\). If \(\phi\) is ReLU, tanh, or sigmoid, then when \(w_1, w_2, b\) are all very close to 0, the output is also close to 0. In other words, the neuron contributes little, almost as if it were missing from the network. This effectively reduces the complexity of the model and mitigates overfitting.

High-dimensional linear regression experiment

We use high-dimensional linear regression as an example to demonstrate an overfitting problem, and then try to mitigate the overfitting with \(L_2\) norm regularization.

Generating the dataset

Set the dimension of the data sample features to \(p\). For any sample in the training or test dataset with features \(x_1, x_2, \ldots, x_p\), we use the following linear function to generate its label:

\[y = 0.05 + \sum_{i = 1}^p 0.01x_i + \epsilon,\]

where \(\epsilon\) is a noise term that follows a normal distribution with mean 0 and standard deviation 0.01.

To make overfitting easier to observe, we consider a high-dimensional linear regression problem, for example with dimension \(p = 200\), and we deliberately keep the number of training samples low, for example 20.

import gluonbook as gb  # utility package that accompanies the original tutorial
from mxnet import autograd, nd

n_train = 20
n_test = 100
num_inputs = 200
true_w = nd.ones((num_inputs, 1)) * 0.01
true_b = 0.05
features = nd.random.normal(shape=(n_train + n_test, num_inputs))
labels = nd.dot(features, true_w) + true_b
labels += nd.random.normal(scale=0.01, shape=labels.shape)
train_features, test_features = features[:n_train, :], features[n_train:, :]
train_labels, test_labels = labels[:n_train], labels[n_train:]
Initializing model parameters
def init_params():
    w = nd.random.normal(scale=1, shape=(num_inputs, 1))
    b = nd.zeros(shape=(1,))
    params = [w, b]
    for param in params:
        param.attach_grad()
    return params
Defining the L2 norm penalty
def l2_penalty(w):
    return (w ** 2).sum() / 2
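As a quick sanity check (a hypothetical call, not part of the original article), the penalty of the weight vector [3, 4] should be \((3^2 + 4^2)/2 = 12.5\):

w_check = nd.array([3.0, 4.0])   # hypothetical weights for the check
print(l2_penalty(w_check))       # approximately [12.5]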
Defining training and testing
batch_size = 1
num_epochs = 10
lr = 0.003
net = gb.linreg
loss = gb.squared_loss

def fit_and_plot(lambd):
    w, b = params = init_params()
    train_ls = []
    test_ls = []
    for _ in range(num_epochs):
        for X, y in gb.data_iter(batch_size, n_train, features, labels):
            with autograd.record():
                # Add the L2 norm penalty term to the loss.
                l = loss(net(X, w, b), y) + lambd * l2_penalty(w)
            l.backward()
            gb.sgd(params, lr, batch_size)
        train_ls.append(loss(net(train_features, w, b),
                             train_labels).mean().asscalar())
        test_ls.append(loss(net(test_features, w, b),
                            test_labels).mean().asscalar())
    gb.semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'loss',
                range(1, num_epochs + 1), test_ls, ['train', 'test'])
    return 'w[:10]:', w[:10].T, 'b:', b

With lambd=0 (no penalty), the training error is much smaller than the test (generalization) error, which is a typical overfitting phenomenon.

fit_and_plot(lambd=0)
# output:
# ('w[:10]:', [[ 0.30343655 -0.08110731  0.64756584 -1.51627898  0.16536537  0.42101485
#    0.41159022  0.8322348  -0.66477555  3.56285167]]
#  <NDArray 1x10 @cpu(0)>, 'b:', [ 0.12521751]
#  <NDArray 1 @cpu(0)>)

With regularization (lambd=5), the overfitting is alleviated to some extent. However, the learned parameters are still not very accurate, mainly because the training dataset has too few samples relative to the number of dimensions.

fit_and_plot(lambd=5)
# output:
# ('w[:10]:', [[ 0.01602661 -0.00279179  0.03075662 -0.07356022  0.01006496  0.02420521
#    0.02145572  0.04235912 -0.03388886  0.17112994]]
#  <NDArray 1x10 @cpu(0)>, 'b:', [ 0.08771407]
#  <NDArray 1 @cpu(0)>)
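Weight decay is also available directly through Gluon's Trainer via the wd hyperparameter. The following is only a rough sketch of a concise version under that assumption, not part of the original experiment; the names net_gluon, trainer_w and trainer_b are illustrative. It applies decay to the weights but not to the bias by giving each its own trainer.

from mxnet import gluon, init
from mxnet.gluon import nn

net_gluon = nn.Sequential()
net_gluon.add(nn.Dense(1))
net_gluon.initialize(init.Normal(sigma=1))
# 'wd' is Gluon's built-in weight decay hyperparameter; the bias
# parameters get a separate trainer without decay.
trainer_w = gluon.Trainer(net_gluon.collect_params('.*weight'), 'sgd',
                          {'learning_rate': lr, 'wd': 5})
trainer_b = gluon.Trainer(net_gluon.collect_params('.*bias'), 'sgd',
                          {'learning_rate': lr})
for _ in range(num_epochs):
    for X, y in gb.data_iter(batch_size, n_train, features, labels):
        with autograd.record():
            l = loss(net_gluon(X), y)
        l.backward()
        # Step both trainers so all parameters are updated.
        trainer_w.step(batch_size)
        trainer_b.step(batch_size)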
