Weight decay is a common method for dealing with overfitting.
\(L_2\) Norm Regularization
In deep learning, we often use \(L_2\) norm regularization: we add an \(L_2\) norm penalty term to the model's original loss function and minimize the resulting function during training.
The \(L_2\) norm penalty term is the product of a hyperparameter and the sum of the squares of the model's weight parameters. For example, let \(w_1\) and \(w_2\) be the weight parameters and \(b\) the bias parameter; the new loss function with the \(L_2\) norm penalty is
\[\ell(w_1, w_2, b) + \frac{\lambda}{2}(w_1^2 + w_2^2),\]
where the hyperparameter \(\lambda\) controls the strength of the penalty.
With the \(L_2\) norm penalty, and taking mini-batch stochastic gradient descent on a single-layer neural network as an example, the iterative formulas for the weights become
\[w_1 \leftarrow w_1 - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} x_1^{(i)} \left(x_1^{(i)} w_1 + x_2^{(i)} w_2 + b - y^{(i)}\right) - \eta\lambda w_1,\]
\[w_2 \leftarrow w_2 - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} x_2^{(i)} \left(x_1^{(i)} w_1 + x_2^{(i)} w_2 + b - y^{(i)}\right) - \eta\lambda w_2.\]
Here \(\eta\) is the learning rate and \(\mathcal{B}\) is the mini-batch, containing \(|\mathcal{B}|\) samples. As can be seen, \(L_2\) norm regularization adds an extra \(-\eta\lambda w_1\) and \(-\eta\lambda w_2\) to each weight update, shrinking the weights toward zero at every step. This is why \(L_2\) norm regularization is also called weight decay.
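The extra term comes directly from differentiating the penalty in the new loss function. As a one-line check for \(w_1\) (the case of \(w_2\) is analogous),
\[\frac{\partial}{\partial w_1}\left[\frac{\lambda}{2}\left(w_1^2 + w_2^2\right)\right] = \lambda w_1, \qquad \text{so} \qquad w_1 \leftarrow w_1 - \eta\left(\frac{\partial \ell}{\partial w_1} + \lambda w_1\right).\]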
In practice, we sometimes also add the sum of the squares of the bias parameters to the penalty term.
Suppose a neuron in a neural network takes inputs \(x_1, x_2\), uses the activation function \(\phi\), and outputs \(\phi(x_1 w_1 + x_2 w_2 + b)\). If \(\phi\) is ReLU or tanh and \(w_1, w_2, b\) are all very close to 0, then the output is also close to 0. In other words, the neuron contributes little, almost as if it were missing from the network. This effectively reduces the complexity of the model and mitigates overfitting.
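As a quick, illustrative sanity check of this claim (the snippet and its numbers are my own addition, not part of the original experiment), a near-zero weighted sum passed through ReLU or tanh gives a near-zero output:

from mxnet import nd

x = nd.array([1.5, -2.0])        # arbitrary inputs x_1, x_2
w = nd.array([1e-4, -1e-4])      # weights close to 0
b = nd.array([1e-4])             # bias close to 0
z = (x * w).sum() + b            # pre-activation x_1*w_1 + x_2*w_2 + b
print(nd.relu(z), nd.tanh(z))    # both outputs are approximately 0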
High-dimensional linear regression experiment
We use high-dimensional linear regression as an example to demonstrate overfitting, and then try to cope with it using \(L_2\) norm regularization.
Generating the data set
Set the dimension of the data sample features to \(p\). For any sample in the training or test data set with features \(x_1, x_2, \ldots, x_p\), we use the following linear function to generate its label:
\[y = 0.05 + \sum_{i = 1}^p 0.01x_i + \epsilon,\]
where the noise term \(\epsilon\) follows a normal distribution with mean 0 and standard deviation 0.01.
To make overfitting easier to observe, we consider a high-dimensional problem, e.g. dimension \(p = 200\), and we deliberately keep the number of training samples low, e.g. 20.
# Imports assumed for this section: gluonbook (gb) provides the helper
# functions used below (linreg, squared_loss, data_iter, sgd, semilogy).
import gluonbook as gb
from mxnet import autograd, nd

n_train = 20
n_test = 100
num_inputs = 200
true_w = nd.ones((num_inputs, 1)) * 0.01
true_b = 0.05
features = nd.random.normal(shape=(n_train + n_test, num_inputs))
labels = nd.dot(features, true_w) + true_b
labels += nd.random.normal(scale=0.01, shape=labels.shape)
train_features, test_features = features[:n_train, :], features[n_train:, :]
train_labels, test_labels = labels[:n_train], labels[n_train:]
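As an optional sanity check (this snippet is an addition, not part of the original tutorial), we can confirm the sizes of the splits we just created:

print(train_features.shape, test_features.shape)  # (20, 200) (100, 200)
print(train_labels.shape, test_labels.shape)      # (20, 1) (100, 1)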
Initialize model parameters
def init_params():
    # Randomly initialize the weights, zero the bias,
    # and attach gradient buffers for autograd.
    w = nd.random.normal(scale=1, shape=(num_inputs, 1))
    b = nd.zeros(shape=(1,))
    params = [w, b]
    for param in params:
        param.attach_grad()
    return params
Defining the \(L_2\) norm penalty
def l2_penalty(w):
    # Half of the sum of squared weights; multiplied by lambd in the loss.
    return (w ** 2).sum() / 2
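As a tiny usage example (added here for illustration): for a weight vector \([1, 2]\), l2_penalty returns \((1^2 + 2^2)/2 = 2.5\), i.e. half the sum of squares, which becomes the \(\frac{\lambda}{2}\sum_i w_i^2\) term once it is multiplied by lambd in the training loss:

print(l2_penalty(nd.array([1, 2])))  # an NDArray containing 2.5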
Define training and testing
batch_size = 1
num_epochs = 10
lr = 0.003
net = gb.linreg
loss = gb.squared_loss

def fit_and_plot(lambd):
    w, b = params = init_params()
    train_ls = []
    test_ls = []
    for _ in range(num_epochs):
        for X, y in gb.data_iter(batch_size, n_train, features, labels):
            with autograd.record():
                # Add the L2 norm penalty term to the loss.
                l = loss(net(X, w, b), y) + lambd * l2_penalty(w)
            l.backward()
            gb.sgd(params, lr, batch_size)
        train_ls.append(loss(net(train_features, w, b),
                             train_labels).mean().asscalar())
        test_ls.append(loss(net(test_features, w, b),
                            test_labels).mean().asscalar())
    gb.semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'loss',
                range(1, num_epochs + 1), test_ls, ['train', 'test'])
    return 'w[:10]:', w[:10].T, 'b:', b
When we set lambd=0, i.e. use no regularization, the training error is much smaller than the test (generalization) error, which is a typical overfitting phenomenon.
fit_and_plot(lambd=0)

# Output:
# ('w[:10]:',
#  [[ 0.30343655 -0.08110731  0.64756584 -1.51627898  0.16536537  0.42101485
#     0.41159022  0.8322348  -0.66477555  3.56285167]]
#  <NDArray 1x10 @cpu(0)>, 'b:',
#  [ 0.12521751]
#  <NDArray 1 @cpu(0)>)
When we instead use regularization (lambd=5), the overfitting is alleviated to some extent. However, the learned model parameters are still not very accurate, mainly because the number of training samples is too small relative to the number of dimensions.
fit_and_plot(lambd=5)

# Output:
# ('w[:10]:',
#  [[ 0.01602661 -0.00279179  0.03075662 -0.07356022  0.01006496  0.02420521
#     0.02145572  0.04235912 -0.03388886  0.17112994]]
#  <NDArray 1x10 @cpu(0)>, 'b:',
#  [ 0.08771407]
#  <NDArray 1 @cpu(0)>)