Preface
On the 3rd I am going to attend the CAD/CG conference, where I submitted a paper on using generative models and discriminative models for style recognition of motion capture data. I have been busy with convolutional RBMs for a while and have almost forgotten the original experimental content, so here I review the training process of the discriminative Boltzmann machine.
As per usual practice, first a few links:
Paper 1 -- Energy-based learning: classification task using Restricted Boltzmann Machines
Link: http://pan.baidu.com/s/1i5foeEx Password: flq7
Paper 2 -- Classification using Discriminative Restricted Boltzmann Machines
Link: http://pan.baidu.com/s/1qYGT9z2 Password: HEBJ
Code -- rbm-on-classification
Link: http://pan.baidu.com/s/1qXWmwTQ Password: Xh12
Data -- MNIST handwritten digit database
Link: http://pan.baidu.com/s/1boGxPxX Password: 0l25
This article mainly interprets the discriminative Restricted Boltzmann Machine based on Paper 1 and its code; the code is mainly applied to handwritten digit recognition on the MNIST database.
First, a diagram of the model structure corresponding to the code (see the structure figure in the original post).
Step One
Start with the main function: trainingClassRBM.m
First comes a series of initializations, which I won't dwell on: the learning rate, momentum, weight decay, batch size, number of training epochs, and random initialization of the weights and biases. The main focus here is on two lines of the program:
discrim = 1;   % 1 to activate the discriminative learning, 0 otherwise
freeEnerg = 1; % if discrim is zero and one wants to compute the free energy
Regarding these two lines, you can open Chapter 5 of Paper 1 (page 11), which says something to this effect:
The idea is that to solve a particular classification problem, we can simply define an objective function, minimize it, and thereby obtain the model parameters of the RBM. The objective function can be chosen from three kinds:
1. The training objective function of the generative model
2. The training objective function of the discriminative model
3. The training objective function of the hybrid model
Clearly, from the code you can see that the program uses the variable discrim to control whether the third method is used: if discrim = 1 the third (hybrid) objective function is used; if discrim = 0 the first (generative) one is used.
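As a reading aid, here are the three objectives written out in (approximately) the notation of Paper 1, over a training set D = {(x_i, y_i)}; the hybrid weight alpha below matches the alpha that appears later in the update step:

$$\mathcal{L}_{\mathrm{gen}}(D) = -\sum_i \log p(x_i, y_i), \qquad \mathcal{L}_{\mathrm{disc}}(D) = -\sum_i \log p(y_i \mid x_i), \qquad \mathcal{L}_{\mathrm{hybrid}}(D) = \mathcal{L}_{\mathrm{disc}}(D) + \alpha\,\mathcal{L}_{\mathrm{gen}}(D)$$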
After that, the data is organized into batches and a validation set is carved off:
data_valid = data(end-valid_size+1:end, :);
label_valid = labels(end-valid_size+1:end, :);
% convert y in the notation one out of number of classes
lab = boolean(set_y(labels(1:end-valid_size, :), num_class));
% convert the dataset in batches
[batch_data, batch_labels, batch_size] = ...
    createBatches(data(1:end-valid_size, :), lab, num_batch);
Step Two
Formally enter the outer for loop, which controls the number of training epochs; at the end of each epoch the corresponding information is printed, including the reconstruction error, the validation-set error, and the change in free energy.
sum_rec_error = 0;
sum_class_error = 0;
sum_tot_free_energ = 0;
The next statement switches the momentum term at the fifth epoch:
if i_epoch == 5
    momentum = final_momentum;
end
and a simulated annealing schedule is used to decay the learning rate:
% simulated annealing
gamma = init_gamma / (1 + (i_epoch - 1) / tau);
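A minimal sketch of how this schedule behaves, with hypothetical values for init_gamma and tau (the actual values are set in the initialization section not quoted here):

% sketch: 1/(1 + t/tau) learning-rate decay with hypothetical hyperparameters
init_gamma = 0.05;   % hypothetical initial learning rate
tau = 10;            % hypothetical decay constant
for i_epoch = [1 5 10 20 50]
    gamma = init_gamma / (1 + (i_epoch - 1) / tau);
    fprintf('epoch %3d: gamma = %.4f\n', i_epoch, gamma);
end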
Step Three
Formally enter the inner loop, which doles out the data batch by batch: each iteration processes one batch of size 100 (the size of the last batch depends on how many samples remain).
x = batch_data{ibatch}';
y = batch_labels{ibatch}';
batch_size = size(x, 2);
First comes the computation for the generative model, i.e. the parameter updates corresponding to the first objective function mentioned in Step One.
The activation probability p(h|x,y) of the hidden layer is computed first; this is the positive phase so often mentioned for RBMs. According to the paper, $p(h_j = 1 \mid x, y) = \mathrm{sigm}\,(b_{h_j} + W_{j\cdot}\,x + u_{j\cdot}\,y)$, which matches the code below.
The corresponding code is:
hprob = prob_h_given_xy(x, y, W, u, b_h);
Where the function prob_h_given_xy is:
function [prob] = prob_h_given_xy(x, y, W, u, b_h)
    prob = sigmf(bsxfun(@plus, W * x + u * y, b_h), [1 0]);
end
Then comes the stochastic activation of the hidden units, a step that appears frequently in RBMs; the result is called the activation state:
h = sampling(hprob);
The sampling function is implemented as follows:
function [samp] = sampling(p)
    samp = p > rand(size(p));
end
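A minimal usage sketch (toy probabilities of my own), showing that each unit fires with its given probability:

% sketch: each entry becomes 1 with (approximately) its own probability
p = [0.9 0.1; 0.5 0.5];   % toy activation probabilities
h = sampling(p);          % e.g. [1 0; 1 0] on one draw; varies per call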
Then comes the corresponding negative phase, which uses the sampled hidden states to reconstruct the data of the x and y units.
According to the activation functions described in the paper, $p(x_i = 1 \mid h) = \mathrm{sigm}\,(b_{x_i} + W_{\cdot i}^{\top} h)$ and $p(y \mid h) = \mathrm{softmax}\,(b_y + u^{\top} h)$. The corresponding functions in the code are:
Computing p(x|h) is simple: multiply the weights by the activation states of the hidden units (not their activation probabilities), add the input-layer bias, and pass the result through the sigmoid.
function [prob] = prob_x_given_h(h, W, b_x)
    prob = sig(bsxfun(@plus, W' * h, b_x));
end
Computing p(y|h) is slightly more complex because of the normalization in the denominator: u'*h has size (number of classes) x (batch size), 30x716 in my experiment, and sum collapses the matrix column-wise into a single row, i.e. for each sample the unnormalized values of all label units are summed. The whole process amounts to a softmax normalization.
function [prob] = prob_y_given_h(h, u, b_y)
    prob_not_norm = exp(bsxfun(@plus, u' * h, b_y));
    norm_factor = sum(prob_not_norm);
    prob = bsxfun(@rdivide, prob_not_norm, norm_factor);
end
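A quick sanity check of my own (toy sizes; the small scale on u_toy just keeps exp from overflowing): every column of the output should sum to 1, one distribution over labels per sample.

% sketch: columns of prob_y_given_h form per-sample distributions
h_toy = rand(800, 3);            % hypothetical hidden states, 3 samples
u_toy = 0.01 * randn(800, 30);   % hypothetical label weights, 30 classes
b_toy = zeros(30, 1);            % hypothetical label biases
p = prob_y_given_h(h_toy, u_toy, b_toy);
sum(p)                           % each of the 3 entries is 1 (up to rounding)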
After computing the activation probabilities of x and y, their activation states are required as well:
xrprob = prob_x_given_h(h, W, b_x);
yrprob = prob_y_given_h(h, u, b_y);
xr = sampling(xrprob);
yr = sampling_y(yrprob);
The final step of the negative phase uses the new x and y state values (0 or 1) to compute the hidden-layer activation probabilities:
hrprob = prob_h_given_xy(xr, yr, W, u, b_h);
From the positive-phase and negative-phase values computed above, the update gradient for each parameter of the generative model is computed. These parameters are: the hidden-to-input connection weights W, the hidden-to-label connection weights u, the input-layer bias b_x, the hidden-layer bias b_h, and the label-layer bias b_y. Note that the updates use the activation probability values, not the activation states.
g_w_gen = (hprob * x' - hrprob * xrprob') / batch_size;
g_u_gen = (hprob * y' - hrprob * yrprob') / batch_size;
g_b_x_gen = (sum(x,2) - sum(xrprob,2)) / batch_size;
g_b_y_gen = (sum(y,2) - sum(yrprob,2)) / batch_size;
g_b_h_gen = (sum(hprob,2) - sum(hrprob,2)) / batch_size;
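For reference, this is the usual CD-1 style approximation of the generative gradient (my paraphrase, matching the code above rather than quoted from the paper); note the code plugs in the reconstruction probabilities xrprob, yrprob rather than the sampled states:

$$\frac{\partial \log p(x,y)}{\partial W} \approx \hat{h}\,x^{\top} - \hat{h}_r\,\hat{x}_r^{\top}, \qquad \hat{h} = p(h \mid x, y),\quad \hat{h}_r = p(h \mid x_r, y_r)$$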
Step Four
Now we enter the objective-function computation for the discriminative model, the second objective function described in Step One.
Of course, the first thing is to check whether we chose to use the discriminative model, and to initialize the quantities involved in its computation: the hidden-layer bias gradient and the gradients of the two weight matrices W and u.
if (discrim || freeEnerg)
    % initialization of the gradients
    g_b_h_disc_acc = 0;
    g_w_disc_acc = 0;
    g_u_disc_acc = zeros(h_size, num_class);
Note in particular that there is no input-layer bias here; Paper 1 explains this in a sentence to the effect that:
the gradient with respect to b (in the paper, b is the input-layer bias) is 0, because the input-layer bias is not used when computing p(y|x).
Let us look at how p(y|x) is computed:

$$p(y \mid x) = \frac{\exp(-F(x,y))}{\sum_{y^*} \exp(-F(x,y^*))}$$

where F denotes the free energy, whose negative expands to

$$-F(x,y) = d_y + \sum_j \mathrm{softplus}\big(b_{h_j} + u_{jy} + W_{j\cdot}\,x\big)$$

Written this way it is convenient to compute; the code later evaluates this free energy separately. Here d_y is the label-layer bias (b_y in the code).
The next step is to implement this computation piece by piece.
Note that softplus here is a function, $\mathrm{softplus}(z) = \log(1 + e^z)$, not a simple addition.
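A minimal sketch (toy values of my own) showing that softplus behaves as a smooth version of max(0, z):

% sketch: softplus, a smooth approximation of max(0, z)
softplus = @(z) log(1 + exp(z));
softplus([-5 0 5])   % ~ [0.0067  0.6931  5.0067]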
① First, observe that w*x appears in every term of the formula, so it is hoisted out to avoid recomputing it:
% auxiliary variables
w_times_x = w * x;
② Next, the free-energy function is computed:
for iClass = 1:num_class
    o_t_y(:,:,iClass) = bsxfun(@plus, b_h + u(:,iClass), w_times_x);
    neg_free_energ(iClass,:) = b_y(iClass) + ...
        sum(log(1 + exp(o_t_y(:,:,iClass))));
end;
Note the dimensions of the two variables here:
o_t_y has dimensions 800x716x30: the number of hidden units x the number of samples x the total number of labels. It is like 30 copies of the 800x716 hidden-unit values, one per label, and the computation proceeds accordingly: the for loop traverses the third dimension, handling one 800x716 matrix at a time. This is what is needed to compute the probability of each sample belonging to each label.
neg_free_energ has dimensions 30x716: the total number of labels x the total number of samples.
For each label unit, the contributions of all hidden units are summed per sample, in turn. It is somewhat like using x and y to obtain h, mapping h through softplus, and finally summing the 800 hidden-unit values of each sample to get that sample's value for the corresponding label unit. For example, when o_t_y is computed at iClass = 1 it is an 800x716 matrix; sum then yields a 1x716 vector, which is exactly the value of the first label unit across all samples.
Then the total free energy of the current batch is recorded, to be accumulated with that of the following batches:
sum_tot_free_energ = sum_tot_free_energ + sum(-neg_free_energ(y));
At this point, the freeEnerg part of the computation inside if (discrim || freeEnerg) is complete.
③ The next step is to compute the parameter-update gradients corresponding to discrim.
The main computation is p(y|x), following the formula given above.
In the code, the obtained neg_free_energ is first centered by subtracting its mean (which keeps exp from overflowing and, as checked below, does not change the normalized result), and then the numerator is computed:
med = mean2(neg_free_energ);
p_y_given_x = exp(neg_free_energ - med);
Using the numerator, compute numerator/denominator. Note that the summation here adds up the 30 label-layer values of each sample, i.e. sum(p_y_given_x) collapses the 30x716 matrix into a 1x716 vector:
p_y_given_x = bsxfun(@rdivide, p_y_given_x, sum(p_y_given_x));
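A small check of my own that subtracting a constant before exp leaves the normalized probabilities unchanged (toy scores):

% sketch: exp(s - c) ./ sum(exp(s - c)) equals exp(s) ./ sum(exp(s))
s = [1 2; 3 4];                                     % toy scores
p1 = bsxfun(@rdivide, exp(s), sum(exp(s)));
c = mean2(s);
p2 = bsxfun(@rdivide, exp(s - c), sum(exp(s - c))); % identical to p1
max(abs(p1(:) - p2(:)))                             % ~ floating-point eps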
④ Finally, the gradients are computed.
First the gradient of the label layer is computed. The original formula is

$$\frac{\partial \log p(y \mid x)}{\partial d} = \mathbf{1}_y - p(y \mid x)$$

where $\mathbf{1}_y$ is the one-hot vector of the true label.
In plain terms, the inferred label distribution p(y|x) is subtracted from the true label, and summing over the batch gives the update gradient of the label-layer bias:
dif_y_p_y = y - p_y_given_x;
g_b_y_disc_acc = sum(dif_y_p_y, 2);
Then the gradients of the remaining parameters, the two weights W and u and the hidden-layer bias b_h, are computed. Paper 1 gives the formula

$$\frac{\partial \log p(y \mid x)}{\partial \theta} = \sum_j \mathrm{sigm}(o_j(y,x))\,\frac{\partial o_j(y,x)}{\partial \theta} \;-\; \sum_{y^*} \sum_j \mathrm{sigm}(o_j(y^*,x))\, p(y^* \mid x)\,\frac{\partial o_j(y^*,x)}{\partial \theta}$$

where $o_j(y,x) = b_{h_j} + W_{j\cdot}\,x + u_{jy}$.
The relevant computations in the code are as follows:
First, the sigmoid of the o function appearing in the first and second terms is computed, and then it is plugged into the computations of b_h, W, and u respectively.
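The two lines that build sig_o_t_y and sig_o_t_y_selected are not quoted in this post; my best guess at them, consistent with the indexing used below, is:

% assumed (not shown here): sigmoid of the o function for this class
sig_o_t_y = sigmf(o_t_y(:,:,iClass), [1 0]);
% assumed: keep only the columns (samples) whose label is iClass
sig_o_t_y_selected = sig_o_t_y(:, y(iClass,:));

The gradient accumulation itself then reads: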
g_b_h_disc_acc = g_b_h_disc_acc + sum(sig_o_t_y_selected, 2) ...
    - sum(bsxfun(@times, sig_o_t_y, p_y_given_x(iClass,:)), 2);
% update the gradient of W for the class 'iClass'
g_w_disc_acc = g_w_disc_acc + sig_o_t_y_selected * x(:, y(iClass,:))' ...
    - bsxfun(@times, sig_o_t_y, p_y_given_x(iClass,:)) * x';
% update the gradient of U for the class 'iClass'
g_u_disc_acc(:,iClass) = sum(bsxfun(@times, sig_o_t_y, dif_y_p_y(iClass,:)), 2);
Note that both pieces of code above are wrapped in a for loop:
for iClass = 1:num_class
As mentioned above, the loop traverses the third dimension of o_t_y (dimensions: number of hidden units x number of samples x number of label classes). Each pass treats one (hidden units x batch size) slice as the hidden-unit values, giving sig_o_t_y (800x716 in the code), and then extracts the samples whose label equals iClass (e.g. out of 716 samples, 29 carry the label of the current loop's iClass), giving sig_o_t_y_selected (800x29 when iClass = 1 in the article's example). Seen this way it is easier to understand, much like the RBM of the generative model:
When computing the hidden-layer bias b_h, take the difference between sig_o_t_y_selected and sig_o_t_y * p(y|x);
When computing the connection weights W, take the difference between sig_o_t_y_selected * x_selected (the sub-matrix of the batch samples x whose label is iClass) and the corresponding sig_o_t_y * p(y|x) * x;
When computing the connection weights u, just take the product of sig_o_t_y and dif_y_p_y(iClass,:); since dif_y_p_y already contains the difference, no explicit subtraction is needed as in the b_h and W computations above.
⑤ With everything computed, the discriminative-model gradients are averaged over the batch:
g_b_y_disc = g_b_y_disc_acc / batch_size;
g_b_h_disc = g_b_h_disc_acc / batch_size;
g_w_disc = g_w_disc_acc / batch_size;
g_u_disc = g_u_disc_acc / batch_size;
Step Five
The objective function introduced in Step One is now used to perform the overall parameter update:
delta_w = delta_w * momentum + ...
    gamma * (discrim * g_w_disc + alpha * g_w_gen - weight_cost * w);
delta_u = delta_u * momentum + ...
    gamma * (discrim * g_u_disc + alpha * g_u_gen - weight_cost * u);
delta_b_x = delta_b_x * momentum + gamma * alpha * g_b_x_gen;
delta_b_y = delta_b_y * momentum + ...
    gamma * (discrim * g_b_y_disc + alpha * g_b_y_gen);
delta_b_h = delta_b_h * momentum + ...
    gamma * (discrim * g_b_h_disc + alpha * g_b_h_gen);
w = w + delta_w;
u = u + delta_u;
b_x = b_x + delta_b_x;
b_y = b_y + delta_b_y;
b_h = b_h + delta_b_h;
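In equation form (my paraphrase of the code above, writing lambda for weight_cost and Delta-theta for the momentum buffer), each parameter theta is updated by

$$\Delta\theta \leftarrow \mu\,\Delta\theta + \gamma\,\big(\mathrm{discrim}\cdot g^{\mathrm{disc}}_{\theta} + \alpha\, g^{\mathrm{gen}}_{\theta} - \lambda\,\theta\big), \qquad \theta \leftarrow \theta + \Delta\theta$$

with the weight decay applied only to w and u, and b_x receiving only the generative term.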
After that, an averaging step is performed; I am not quite sure what it is for. It looks like some kind of running average of the parameters(?). Moreover, the validation-set error printed at each epoch still uses the b_y, b_h, w, u above rather than the averaged b_y_avg, b_h_avg, w_avg, u_avg. This remains to be explored.
Step Six
Afterwards, all the parameters need to be evaluated. It is very simple: given the inputs testdata, the corresponding labels testlabels, the total number of classes num_class, and the model parameters b_y, b_h, w, u, the free-energy function F is used to compute each sample's score for belonging to each label. As said earlier, neg_free_energ has dimensions (number of labels) x (number of samples), so each sample gets a score for every label; just take the label with the largest one.
function [err] = predict(testdata, testlabels, num_class, b_y, b_h, w, u)
    numcases = size(testdata, 1);
    w_times_x = w * testdata';
    neg_free_energ = zeros(num_class, numcases);
    for iClass = 1:num_class
        neg_free_energ(iClass,:) = b_y(iClass) + sum(log(1 + ...
            exp(bsxfun(@plus, b_h + u(:,iClass), w_times_x))));
    end;
    [~, prediction] = max(neg_free_energ);
    err = sum(prediction ~= testlabels') / numcases * 100;
end
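A minimal usage sketch, assuming testdata and testlabels are already loaded in the same layout as the training data:

% sketch: evaluate the trained model on a held-out set
err = predict(testdata, testlabels, num_class, b_y, b_h, w, u);
fprintf('classification error: %.2f%%\n', err);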
Step Seven
One more thing: how do you tell whether the model is working well?
First, the reconstruction error goes without saying: it must come down gradually.
Second is the free energy. By the usual description of RBMs, the system is most stable when its energy is lowest, so the free energy should trend downward.
Of course, the classification error must also decrease.
Also, the energy sometimes rises suddenly. Don't panic; first wait and see whether it keeps rising. It may well come back down once a certain iteration completes (see the training-curve figure in the original post).
This concludes the interpretation of the source code for the paper on the discriminative-model Boltzmann machine.