Self-taught learning is a sparse autoencoder whose features feed a softmax classifier. As shown in the previous section, training it for 400 iterations reached an accuracy of 98.2%.
On this basis we can build our first deep network: a stacked autoencoder (2 layers) + a softmax classifier.
In short, we use the output of one sparse autoencoder as the input to the next, higher-level sparse autoencoder.
Compared with self-taught learning it may look as if we have simply added one more layer, but that is not the whole story:
the new ingredient is a fine-tuning step that backpropagates the error from the top layer all the way down to the input layer and adjusts the weights of the entire network.
As we will see later, this fine-tuning improves network performance dramatically.
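Concretely, fine-tuning is ordinary backpropagation over the whole stack. As a sketch of the standard relations (sigmoid activations, $m$ training examples, $h$ the softmax output, $y$ the one-hot labels), matching what the cost-function code below computes:

$$\delta^{(3)} = \bigl(\theta_{\mathrm{softmax}}^{T}(h - y)\bigr)\circ a^{(3)}\circ\bigl(1-a^{(3)}\bigr),\qquad \delta^{(2)} = \bigl((W^{(2)})^{T}\delta^{(3)}\bigr)\circ a^{(2)}\circ\bigl(1-a^{(2)}\bigr)$$

$$\nabla_{W^{(l)}} J = \tfrac{1}{m}\,\delta^{(l+1)}\bigl(a^{(l)}\bigr)^{T},\qquad \nabla_{b^{(l)}} J = \tfrac{1}{m}\textstyle\sum_{i=1}^{m}\delta^{(l+1)}_{i}$$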
Network Structure:
Figure 1
Pre-Load
minFunc
computeNumericalGradient
display_network
feedForwardAutoencoder
initializeParameters
loadMNISTImages
loadMNISTLabels
softmaxCost
softmaxTrain
sparseAutoencoderCost
train-images.idx3-ubyte
train-labels.idx1-ubyte
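Before training, the MNIST data has to be loaded and the hyperparameters set. A minimal sketch using the loaders listed above; the concrete hidden-layer sizes and penalty weights are typical values assumed here, not taken from the text:

% Load the MNIST training set
trainData   = loadMNISTImages('train-images.idx3-ubyte');
trainLabels = loadMNISTLabels('train-labels.idx1-ubyte');
trainLabels(trainLabels == 0) = 10;   % remap digit 0 to label 10 for 1-based indexing

% Network sizes and penalty weights (assumed typical values)
inputSize     = 28 * 28;   % each MNIST image is 28x28 pixels
numClasses    = 10;
hiddenSizeL1  = 200;       % hidden layer of the first autoencoder
hiddenSizeL2  = 200;       % hidden layer of the second autoencoder
sparsityParam = 0.1;       % desired average activation of the hidden units
lambda        = 3e-3;      % weight decay
beta          = 3;         % weight of the sparsity penalty

% Random initialization of both autoencoders
sae1Theta = initializeParameters(hiddenSizeL1, inputSize);
sae2Theta = initializeParameters(hiddenSizeL2, hiddenSizeL1);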
Train the first sparse autoencoder
addpath minFunc/
options.Method = 'lbfgs';
options.maxIter = 400;
options.display = 'on';

[sae1OptTheta, cost] = minFunc( @(p) sparseAutoencoderCost(p, ...
                                     inputSize, hiddenSizeL1, ...
                                     lambda, sparsityParam, ...
                                     beta, trainData), ...
                                sae1Theta, options);
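The second autoencoder is trained on the features produced by the first. A minimal sketch of computing those features, assuming feedForwardAutoencoder follows the usual UFLDL signature:

% Hidden-layer activations of the first autoencoder on the raw training data
sae1Features = feedForwardAutoencoder(sae1OptTheta, hiddenSizeL1, ...
                                      inputSize, trainData);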
Train the second sparse autoencoder
sae2options.Method = 'lbfgs';
sae2options.maxIter = 400;
sae2options.display = 'on';

[sae2OptTheta, cost] = minFunc( @(p) sparseAutoencoderCost(p, ...
                                     hiddenSizeL1, hiddenSizeL2, ...
                                     lambda, sparsityParam, ...
                                     beta, sae1Features), ...
                                sae2Theta, sae2options);
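Likewise, the softmax classifier is trained on the features of the second autoencoder; a sketch under the same assumption about feedForwardAutoencoder:

% Hidden-layer activations of the second autoencoder on the first layer's features
sae2Features = feedForwardAutoencoder(sae2OptTheta, hiddenSizeL2, ...
                                      hiddenSizeL1, sae1Features);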
Train a softmax classifier
smoptions.maxIter = 100;

[softmaxModel] = softmaxTrain(hiddenSizeL2, numClasses, lambda, ...
                              sae2Features, trainLabels, smoptions);
saeSoftmaxOptTheta = softmaxModel.optTheta(:);
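Fine-tuning needs all the greedily trained weights packed into a single parameter vector. A minimal sketch, assuming the UFLDL helper stack2params and the standard parameter layout produced by initializeParameters:

% Collect the two autoencoder layers into a "stack" structure
stack = cell(2, 1);
stack{1}.w = reshape(sae1OptTheta(1 : hiddenSizeL1 * inputSize), ...
                     hiddenSizeL1, inputSize);
stack{1}.b = sae1OptTheta(2 * hiddenSizeL1 * inputSize + 1 : ...
                          2 * hiddenSizeL1 * inputSize + hiddenSizeL1);
stack{2}.w = reshape(sae2OptTheta(1 : hiddenSizeL2 * hiddenSizeL1), ...
                     hiddenSizeL2, hiddenSizeL1);
stack{2}.b = sae2OptTheta(2 * hiddenSizeL2 * hiddenSizeL1 + 1 : ...
                          2 * hiddenSizeL2 * hiddenSizeL1 + hiddenSizeL2);

% Flatten the stack and prepend the softmax weights; netconfig records the layer sizes
[stackparams, netconfig] = stack2params(stack);
stackedAETheta = [saeSoftmaxOptTheta; stackparams];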
Fine-tune the entire network
ftoptions.Method = 'lbfgs';
ftoptions.display = 'on';
ftoptions.maxIter = 100;

[stackedAEOptTheta, cost] = minFunc( @(p) stackedAECost(p, ...
                                          inputSize, hiddenSizeL2, ...
                                          numClasses, netconfig, ...
                                          lambda, trainData, trainLabels), ...
                                     stackedAETheta, ftoptions);
Cost Function and Gradient
% Forward pass through the two stacked autoencoder layers
a2 = sigmoid(bsxfun(@plus, stack{1}.w * data, stack{1}.b));
a3 = sigmoid(bsxfun(@plus, stack{2}.w * a2, stack{2}.b));

% Softmax layer
temp = softmaxTheta * a3;
temp = bsxfun(@minus, temp, max(temp, [], 1));              % prevent numerical overflow
hypothesis = bsxfun(@rdivide, exp(temp), sum(exp(temp)));   % probability matrix

% Cost and gradient for the softmax weights
cost = -(groundTruth(:)' * log(hypothesis(:))) / M + lambda / 2 * sumsqr(softmaxTheta);
softmaxThetaGrad = -(groundTruth - hypothesis) * a3' / M + lambda * softmaxTheta;

% Backpropagate the error through the stack
delta3 = (softmaxTheta' * (hypothesis - groundTruth)) .* a3 .* (1 - a3);
delta2 = (stack{2}.w' * delta3) .* a2 .* (1 - a2);

stackgrad{2}.w = delta3 * a2' / M;
stackgrad{2}.b = sum(delta3, 2) / M;
stackgrad{1}.w = delta2 * data' / M;
stackgrad{1}.b = sum(delta2, 2) / M;
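Before committing to the full two-hour run, it is worth checking the analytic gradient of stackedAECost against a numerical one on a tiny network. A minimal sketch, assuming computeNumericalGradient has the usual UFLDL signature; all sizes and data here are made up for the check:

% Tiny random problem: 8-dimensional inputs, 5 examples, 4 classes, 3 hidden units per layer
checkData   = randn(8, 5);
checkLabels = [1; 2; 3; 4; 1];

checkStack = cell(2, 1);
checkStack{1}.w = 0.1 * randn(3, 8);  checkStack{1}.b = zeros(3, 1);
checkStack{2}.w = 0.1 * randn(3, 3);  checkStack{2}.b = zeros(3, 1);
[checkStackParams, checkNetconfig] = stack2params(checkStack);
checkTheta = [0.005 * randn(4 * 3, 1); checkStackParams];   % softmax weights + stack

costFun = @(p) stackedAECost(p, 8, 3, 4, checkNetconfig, 0, checkData, checkLabels);
[~, grad] = costFun(checkTheta);
numGrad = computeNumericalGradient(costFun, checkTheta);

% The relative difference should be very small (e.g. below 1e-9)
disp(norm(numGrad - grad) / norm(numGrad + grad));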
Prediction Function
a2 = sigmoid(bsxfun(@plus, stack{1}.w * data, stack{1}.b));
a3 = sigmoid(bsxfun(@plus, stack{2}.w * a2, stack{2}.b));
[~, pred] = max(softmaxTheta * a3);   % take the index of the largest probability, not its value
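To reproduce the numbers below, the classifier is evaluated on the MNIST test set both before and after fine-tuning. A minimal sketch, assuming a stackedAEPredict wrapper built around the prediction code above and the standard MNIST test files (neither of which appears in the pre-load list, so treat them as assumptions):

testData   = loadMNISTImages('t10k-images.idx3-ubyte');
testLabels = loadMNISTLabels('t10k-labels.idx1-ubyte');
testLabels(testLabels == 0) = 10;

% Accuracy with greedy layer-wise pre-training only
[pred] = stackedAEPredict(stackedAETheta, inputSize, hiddenSizeL2, ...
                          numClasses, netconfig, testData);
fprintf('Before finetuning test accuracy: %0.3f%%\n', ...
        100 * mean(pred(:) == testLabels(:)));

% Accuracy after fine-tuning the whole network
[pred] = stackedAEPredict(stackedAEOptTheta, inputSize, hiddenSizeL2, ...
                          numClasses, netconfig, testData);
fprintf('After finetuning test accuracy: %0.3f%%\n', ...
        100 * mean(pred(:) == testLabels(:)));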
After more than two hours of training, the final result is very good:
Before finetuning test accuracy: 86.620%
After finetuning test accuracy: 99.800%
It can be seen that fine-tuning plays a vital role in deep network training.