Energy-based models (EBM)
Energy-based models associate a scalar energy with each configuration of the variables of interest. Learning consists of modifying the energy function so that its shape has the properties we need; for example, we want desirable configurations to have low energy. An energy-based probabilistic model defines a probability distribution through the energy function:

$$p(x) = \frac{e^{-E(x)}}{Z} \tag{1}$$

The normalizing factor $Z$ is called the partition function, by analogy with physical systems:

$$Z = \sum_x e^{-E(x)}$$

An energy-based model can be learned by stochastic gradient descent (SGD) on the negative log-likelihood (NLL) of the training data. As for logistic regression, we first define the log-likelihood and then the loss function as the negative log-likelihood:

$$\mathcal{L}(\theta, \mathcal{D}) = \frac{1}{N} \sum_{x^{(i)} \in \mathcal{D}} \log p(x^{(i)}), \qquad \ell(\theta, \mathcal{D}) = -\mathcal{L}(\theta, \mathcal{D})$$

The stochastic gradient is $-\partial \log p(x^{(i)}) / \partial \theta$, where $\theta$ denotes the model parameters (weights, biases, etc.).
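As a small illustration of how energies become probabilities, the sketch below normalizes a hand-picked list of energies and evaluates the average NLL of a few observations. The three-state model, the function names, and the observed indices are assumptions made for this example; they do not come from the original tutorial.

```python
import math

def boltzmann_probabilities(energies):
    """Turn energies E(x) into probabilities p(x) = exp(-E(x)) / Z."""
    unnormalized = [math.exp(-e) for e in energies]
    z = sum(unnormalized)  # partition function Z
    return [u / z for u in unnormalized]

def negative_log_likelihood(energies, observed_indices):
    """Average NLL of observed states, each given by its index into `energies`."""
    probs = boltzmann_probabilities(energies)
    return -sum(math.log(probs[i]) for i in observed_indices) / len(observed_indices)

# Hypothetical toy model with three states; state 0 has the lowest energy.
energies = [0.0, 1.0, 2.0]
probs = boltzmann_probabilities(energies)

# Hypothetical data set: state 0 observed twice, state 1 once.
nll = negative_log_likelihood(energies, [0, 0, 1])
```

The lowest-energy state receives the highest probability, and the NLL equals the mean energy of the observations plus $\log Z$. Enumerating every state to compute $Z$ only works for tiny models, which is why sampling is needed for larger ones.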
EBMs with hidden units
In many cases we cannot observe the sample $x$ fully, or we want to introduce unobserved variables to increase the expressive power of the model. We therefore consider an observed part $x$ and a hidden part $h$:

$$P(x) = \sum_h P(x, h) = \sum_h \frac{e^{-E(x, h)}}{Z} \tag{2}$$

To map this formulation to one similar to Eq. (1), we introduce the notation of free energy (borrowed from physics), defined as:

$$F(x) = -\log \sum_h e^{-E(x, h)} \tag{3}$$

which allows us to write

$$P(x) = \frac{e^{-F(x)}}{Z}, \qquad Z = \sum_{\tilde{x}} e^{-F(\tilde{x})}$$

where $\tilde{x}$ ranges over all possible configurations of $x$. The gradient of the negative log-likelihood (NLL) of the data then has a particularly interesting form:

$$-\frac{\partial \log p(x)}{\partial \theta} = \frac{\partial F(x)}{\partial \theta} - \sum_{\tilde{x}} p(\tilde{x}) \frac{\partial F(\tilde{x})}{\partial \theta} \tag{4}$$

The first term comes from differentiating the numerator and the second from differentiating the denominator, using $\partial e^{-F(x)} / \partial \theta = e^{-F(x)} \, \partial (-F(x)) / \partial \theta$.

The gradient has two terms, referred to as the positive phase and the negative phase. These names do not refer to the sign of each term in the equation, but to their effect on the probability density defined by the model. The first term increases the probability of the training data (by reducing the corresponding free energy), while the second term decreases the probability of samples generated by the model.

It is usually difficult to compute this gradient analytically, because it involves $E_P\left[\partial F(x) / \partial \theta\right]$, an expectation over all possible configurations of the input $x$ under the distribution $P$ defined by the model. The first step toward making this tractable is to estimate the expectation using a fixed number of model samples. The samples used to estimate the negative-phase gradient are called negative particles, and the set of them is denoted $\mathcal{N}$. The gradient can then be written as:

$$-\frac{\partial \log p(x)}{\partial \theta} \approx \frac{\partial F(x)}{\partial \theta} - \frac{1}{|\mathcal{N}|} \sum_{\tilde{x} \in \mathcal{N}} \frac{\partial F(\tilde{x})}{\partial \theta} \tag{5}$$

Ideally, the elements $\tilde{x}$ of $\mathcal{N}$ are sampled according to $P$ (for example, by Monte Carlo). With this formula we almost have a practical stochastic algorithm for training an EBM. The only missing ingredient is how to extract these negative particles. Markov chain Monte Carlo methods are especially well suited to RBMs, a special case of EBM.

Restricted Boltzmann machines (RBM)
A Boltzmann machine (BM) is a particular form of log-linear Markov random field, i.e., one whose energy function is linear in its free parameters. To make such models powerful enough to represent complicated distributions (going from a limited parametric setting to a non-parametric one), we assume that some of the variables are never observed (they are called hidden).
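By adding more hidden variables (hidden units), we can increase the modeling capacity of the BM. A restricted Boltzmann machine further restricts the BM to have no visible-visible and no hidden-hidden connections.

[Figure: diagram of an RBM]

The energy function $E(v, h)$ of an RBM is defined as:

$$E(v, h) = -b^\top v - c^\top h - h^\top W v \tag{6}$$

where $W$ holds the weights connecting hidden and visible units, and $b$ and $c$ are the biases of the visible and hidden layers, respectively. This gives the following free energy formula, where $W_i$ denotes the $i$-th row of $W$:

$$F(v) = -b^\top v - \sum_i \log \sum_{h_i} e^{h_i (c_i + W_i v)}$$

Because of the special structure of the RBM, the visible units are conditionally independent of each other given the hidden units, and the hidden units are conditionally independent of each other given the visible units. Using this property, we obtain:

$$p(h \mid v) = \prod_i p(h_i \mid v), \qquad p(v \mid h) = \prod_j p(v_j \mid h)$$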
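RBMs with binary units
In the common case of binary units (where $v_j, h_i \in \{0, 1\}$), we obtain from Eqs. (6) and (2) a probabilistic version of the usual neuron activation function:

$$P(h_i = 1 \mid v) = \mathrm{sigm}(c_i + W_i v) \tag{7}$$

$$P(v_j = 1 \mid h) = \mathrm{sigm}(b_j + W_{\cdot j}^\top h) \tag{8}$$

where $W_i$ is the $i$-th row of $W$ and $W_{\cdot j}$ is its $j$-th column. The free energy of an RBM with binary units simplifies to:

$$F(v) = -b^\top v - \sum_i \log\left(1 + e^{c_i + W_i v}\right) \tag{9}$$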
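The simplified free energy can be checked numerically on a tiny model. The sketch below compares a brute-force sum over all binary hidden configurations, $-\log \sum_h e^{-E(v, h)}$, with the closed form $-b^\top v - \sum_i \log(1 + e^{c_i + W_i v})$. The parameter values and function names are hypothetical, and `W` is assumed to have shape (hidden, visible) so that it matches the $h^\top W v$ term of the energy.

```python
import itertools
import math

def energy(v, h, W, b, c):
    """E(v, h) = -b'v - c'h - h'Wv, with W of shape (n_hidden, n_visible)."""
    bv = sum(bj * vj for bj, vj in zip(b, v))
    ch = sum(ci * hi for ci, hi in zip(c, h))
    hWv = sum(h[i] * W[i][j] * v[j]
              for i in range(len(h)) for j in range(len(v)))
    return -bv - ch - hWv

def free_energy_brute_force(v, W, b, c):
    """F(v) = -log sum_h exp(-E(v, h)), enumerating every binary h."""
    total = sum(math.exp(-energy(v, h, W, b, c))
                for h in itertools.product([0, 1], repeat=len(c)))
    return -math.log(total)

def free_energy_closed_form(v, W, b, c):
    """F(v) = -b'v - sum_i log(1 + exp(c_i + W_i v))."""
    bv = sum(bj * vj for bj, vj in zip(b, v))
    hidden_term = sum(
        math.log1p(math.exp(ci + sum(w * vj for w, vj in zip(row, v))))
        for row, ci in zip(W, c))
    return -bv - hidden_term

# Hypothetical parameters: 2 hidden units, 3 visible units.
W = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
b = [0.1, -0.1, 0.2]
c = [0.0, 0.3]
v = [1, 0, 1]

f_brute = free_energy_brute_force(v, W, b, c)
f_closed = free_energy_closed_form(v, W, b, c)
```

The two values agree up to floating-point error, because the sum over $h$ factorizes into one independent two-term sum per hidden unit.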
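Update equations with binary units
Combining Eqs. (5) and (9), we obtain the following log-likelihood gradients for an RBM with binary units, where $v$ is a training example and the expectations are taken over visible configurations $\tilde{v}$ drawn from the model (in practice they are estimated with negative particles, as in Eq. (5)):

$$-\frac{\partial \log p(v)}{\partial W_{ij}} = E_{\tilde{v}}\left[p(h_i = 1 \mid \tilde{v}) \, \tilde{v}_j\right] - v_j \, \mathrm{sigm}(c_i + W_i v)$$

$$-\frac{\partial \log p(v)}{\partial c_i} = E_{\tilde{v}}\left[p(h_i = 1 \mid \tilde{v})\right] - \mathrm{sigm}(c_i + W_i v)$$

$$-\frac{\partial \log p(v)}{\partial b_j} = E_{\tilde{v}}\left[\tilde{v}_j\right] - v_j \tag{10}$$

In practice we do not use these formulas directly. Instead, we obtain the gradient by applying Theano's T.grad to Eq. (4).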
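For a model small enough to enumerate, the weight gradient can be computed exactly and compared against a finite-difference estimate. The sketch below is only a sanity check under that assumption (3 visible and 2 hidden units, with hypothetical parameter values and function names). It is not how the gradient is obtained in practice, where the model expectation is replaced by negative particles.

```python
import itertools
import math

def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_input(i, v, W, c):
    """Pre-activation c_i + W_i v of hidden unit i."""
    return c[i] + sum(W[i][j] * v[j] for j in range(len(v)))

def free_energy(v, W, b, c):
    bv = sum(bj * vj for bj, vj in zip(b, v))
    return -bv - sum(math.log1p(math.exp(hidden_input(i, v, W, c)))
                     for i in range(len(c)))

def model_distribution(W, b, c):
    """Exact p(v) for every binary v (feasible only for tiny models)."""
    states = list(itertools.product([0, 1], repeat=len(b)))
    weights = [math.exp(-free_energy(s, W, b, c)) for s in states]
    z = sum(weights)
    return states, [w / z for w in weights]

def nll(v, W, b, c):
    states, probs = model_distribution(W, b, c)
    return -math.log(probs[states.index(tuple(v))])

def nll_grad_W(v, W, b, c):
    """-d log p(v)/dW_ij = E_model[p(h_i=1|v~) v~_j] - p(h_i=1|v) v_j."""
    states, probs = model_distribution(W, b, c)
    grad = [[0.0] * len(b) for _ in c]
    for i in range(len(c)):
        for j in range(len(b)):
            model_term = sum(p * sigm(hidden_input(i, s, W, c)) * s[j]
                             for s, p in zip(states, probs))
            data_term = sigm(hidden_input(i, v, W, c)) * v[j]
            grad[i][j] = model_term - data_term
    return grad

# Hypothetical parameters: 2 hidden units, 3 visible units.
W = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
b = [0.1, -0.1, 0.2]
c = [0.0, 0.3]
v = [1, 0, 1]
grad = nll_grad_W(v, W, b, c)
```

Perturbing a single weight by a small amount and differencing `nll` should reproduce the corresponding entry of `grad`, which is a quick way to catch sign errors in the two phases.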
Sampling in an RBM
Samples of $p(x)$ can be obtained by running a Markov chain to convergence, using Gibbs sampling as the transition operator.

Gibbs sampling of the joint distribution of $N$ random variables $S = (S_1, \ldots, S_N)$ is done through a sequence of $N$ sampling sub-steps of the form $S_i \sim p(S_i \mid S_{-i})$, where $S_{-i}$ contains the $N-1$ other random variables in $S$.

For RBMs, $S$ consists of the set of visible and hidden units. Since these are conditionally independent, one can perform block Gibbs sampling: the visible units are sampled simultaneously given fixed values of the hidden units, and the hidden units are sampled simultaneously given the values of the visible units. A step in the Markov chain is thus:

$$h^{(n+1)} \sim \mathrm{sigm}(W v^{(n)} + c)$$

$$v^{(n+1)} \sim \mathrm{sigm}(W^\top h^{(n+1)} + b)$$

where $\mathrm{sigm}(x) = \mathrm{sigmoid}(x) = 1 / (1 + e^{-x})$ and $h^{(n)}$ denotes the set of all hidden units at the $n$-th step of the Markov chain. This means that, for example, $h_i^{(n+1)}$ is randomly set to 1 (versus 0) with probability $\mathrm{sigm}(W_i v^{(n)} + c_i)$, and similarly $v_j^{(n+1)}$ is randomly set to 1 (versus 0) with probability $\mathrm{sigm}(W_{\cdot j}^\top h^{(n+1)} + b_j)$. As $t \to \infty$, the samples $(v^{(t)}, h^{(t)})$ are guaranteed to be accurate samples of $p(v, h)$.

In theory, each parameter update during learning would require running such a chain to convergence. This would be prohibitively expensive, so several algorithms have been devised to sample from $p(v, h)$ efficiently during learning.
Contrastive divergence (CD-k)
Contrastive divergence uses two tricks to speed up the sampling process:
- Since we ultimately want $p(v) \approx p_{\text{train}}(v)$ (the true distribution of the data), we initialize the Markov chain with a training example. The chain then starts from a distribution that is expected to be close to $p$, so it is already close to having converged to its final distribution $p$.
- CD does not wait for the chain to converge. Samples are obtained after only $k$ steps of Gibbs sampling. In practice, $k = 1$ works surprisingly well.
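The two tricks above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the Theano implementation: the function names, the learning rate, the toy data set, and the use of hidden-unit probabilities (rather than samples) in the update statistics are all choices made for this example. `W` has shape (hidden, visible).

```python
import math
import random

def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_h_given_v(v, W, c, rng):
    """Return (P(h_i = 1 | v) for all i, a binary sample of h)."""
    means = [sigm(c[i] + sum(W[i][j] * v[j] for j in range(len(v))))
             for i in range(len(c))]
    return means, [1 if rng.random() < m else 0 for m in means]

def sample_v_given_h(h, W, b, rng):
    """Return (P(v_j = 1 | h) for all j, a binary sample of v)."""
    means = [sigm(b[j] + sum(W[i][j] * h[i] for i in range(len(h))))
             for j in range(len(b))]
    return means, [1 if rng.random() < m else 0 for m in means]

def cd_k_update(v0, W, b, c, rng, k=1, lr=0.1):
    """One CD-k step on a single training vector v0 (k >= 1).

    Updates W, b, c in place and returns the negative particle vk.
    """
    h0_mean, h = sample_h_given_v(v0, W, c, rng)   # chain starts at the data
    for _ in range(k):                             # only k Gibbs steps
        _, vk = sample_v_given_h(h, W, b, rng)
        hk_mean, h = sample_h_given_v(vk, W, c, rng)
    # Descend the NLL: add the data statistics, subtract the model statistics.
    for i in range(len(c)):
        for j in range(len(b)):
            W[i][j] += lr * (h0_mean[i] * v0[j] - hk_mean[i] * vk[j])
        c[i] += lr * (h0_mean[i] - hk_mean[i])
    for j in range(len(b)):
        b[j] += lr * (v0[j] - vk[j])
    return vk

# Hypothetical toy problem: two binary patterns, 4 visible and 3 hidden units.
rng = random.Random(0)
data = [[1, 1, 0, 0], [0, 0, 1, 1]]
W = [[rng.uniform(-0.1, 0.1) for _ in range(4)] for _ in range(3)]
b = [0.0] * 4
c = [0.0] * 3
for epoch in range(200):
    for v0 in data:
        vk = cd_k_update(v0, W, b, c, rng, k=1)
```

Each update costs only $k$ Gibbs steps instead of a full run to convergence, which is what makes the estimate cheap and also what makes it biased.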
Persistent CD (PCD)
Persistent contrastive divergence uses another approximation for sampling from $p(v, h)$. It relies on a single Markov chain with a persistent state (i.e., the chain is not restarted for each observed example). For each parameter update, we extract new samples by simply running the chain for $k$ steps. The state of the chain is then preserved for subsequent updates. The general intuition is that if the parameter updates are small enough compared to the mixing rate of the chain, the Markov chain should be able to "catch up" with changes in the model.
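The difference from CD is only where the negative particle comes from. The sketch below keeps one chain alive across updates in a small helper class. The class name `PersistentChain`, the choice to store the visible (rather than hidden) state, the random initial state, and the learning rate are all assumptions made for this illustration.

```python
import math
import random

def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_means(v, W, c):
    return [sigm(ci + sum(w * vj for w, vj in zip(row, v)))
            for row, ci in zip(W, c)]

def visible_means(h, W, b):
    return [sigm(b[j] + sum(W[i][j] * h[i] for i in range(len(h))))
            for j in range(len(b))]

def bernoulli(means, rng):
    return [1 if rng.random() < m else 0 for m in means]

class PersistentChain:
    """Holds the visible state of one Markov chain across parameter updates."""

    def __init__(self, n_visible, rng):
        self.rng = rng
        self.v = bernoulli([0.5] * n_visible, rng)  # arbitrary starting state

    def advance(self, W, b, c, k=1):
        """Run k block Gibbs steps from the saved state and keep the result."""
        for _ in range(k):
            h = bernoulli(hidden_means(self.v, W, c), self.rng)
            self.v = bernoulli(visible_means(h, W, b), self.rng)
        return self.v

def pcd_update(v_data, chain, W, b, c, k=1, lr=0.05):
    """One PCD step; the negative particle comes from the persistent chain."""
    h_data = hidden_means(v_data, W, c)
    v_model = chain.advance(W, b, c, k)   # continues from where it left off
    h_model = hidden_means(v_model, W, c)
    for i in range(len(c)):
        for j in range(len(b)):
            W[i][j] += lr * (h_data[i] * v_data[j] - h_model[i] * v_model[j])
        c[i] += lr * (h_data[i] - h_model[i])
    for j in range(len(b)):
        b[j] += lr * (v_data[j] - v_model[j])
    return v_model

# Hypothetical toy problem: two binary patterns, 4 visible and 3 hidden units.
rng = random.Random(0)
data = [[1, 1, 0, 0], [0, 0, 1, 1]]
W = [[rng.uniform(-0.1, 0.1) for _ in range(4)] for _ in range(3)]
b = [0.0] * 4
c = [0.0] * 3
chain = PersistentChain(4, rng)
for epoch in range(200):
    for v_data in data:
        v_model = pcd_update(v_data, chain, W, b, c, k=1)
```

Note that `v_data` never touches the chain: the training example only contributes the positive-phase statistics, while the chain state carries over from one update to the next.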
Implementation
We construct an RBM class. The network parameters can either be initialized by the constructor or passed in as arguments. The second option is useful when the RBM is used as a building block of a deep network, in which case the weight matrix and the hidden-layer bias are shared with the corresponding sigmoidal layer of an MLP.
- Write the constructor, which sets default values for some parameters and performs the initialization.
  - The weights are initialized from a uniform distribution.
  - The biases are initialized to 0.
- Define the symbolic graph associated with Eqs. (7) and (8).
  - propup() / propdown(): given the state of the visible (hidden) layer, return the [pre-activation, mean/probability] of the hidden (visible) layer. These two methods are called by the following two.
  - sample_h_given_v() / sample_v_given_h(): given the state of the units in the visible (hidden) layer, return the [pre-activation, mean/probability, sample] of the hidden (visible) layer.
- Use these four functions to define the symbolic graph for a Gibbs sampling step.
  - gibbs_vhv performs one step of Gibbs sampling starting from the visible units, which is useful for sampling from the RBM.
  - gibbs_hvh performs one step starting from the hidden units, which is useful for the CD and PCD updates.
  - Both return the [pre-activation, mean, sample] of each of the two layers visited during the step.
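The method layout described above can be sketched without Theano as follows. This is a plain-Python stand-in, not the tutorial's symbolic implementation: the class name `TinyRBM`, the argument names `hbias` and `vbias`, the `scale` of the uniform initialization, and the (hidden, visible) layout of `W` are assumptions made for this example.

```python
import math
import random

def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyRBM:
    """Plain-Python stand-in mirroring the method names listed above."""

    def __init__(self, n_visible, n_hidden, W=None, hbias=None, vbias=None,
                 scale=0.1, seed=0):
        self.rng = random.Random(seed)
        if W is None:
            # Weights drawn from a uniform distribution (the range is an assumption).
            W = [[self.rng.uniform(-scale, scale) for _ in range(n_visible)]
                 for _ in range(n_hidden)]
        self.W = W  # passing W and hbias in lets another layer share them
        self.hbias = hbias if hbias is not None else [0.0] * n_hidden
        self.vbias = vbias if vbias is not None else [0.0] * n_visible

    def propup(self, v):
        """Visible state -> (pre-activation, mean) of the hidden layer."""
        pre = [c + sum(w * x for w, x in zip(row, v))
               for row, c in zip(self.W, self.hbias)]
        return pre, [sigm(p) for p in pre]

    def propdown(self, h):
        """Hidden state -> (pre-activation, mean) of the visible layer."""
        pre = [self.vbias[j] + sum(self.W[i][j] * h[i] for i in range(len(h)))
               for j in range(len(self.vbias))]
        return pre, [sigm(p) for p in pre]

    def _sample(self, means):
        return [1 if self.rng.random() < m else 0 for m in means]

    def sample_h_given_v(self, v):
        pre, mean = self.propup(v)
        return pre, mean, self._sample(mean)

    def sample_v_given_h(self, h):
        pre, mean = self.propdown(h)
        return pre, mean, self._sample(mean)

    def gibbs_vhv(self, v0):
        """One Gibbs step starting from the visible units."""
        pre_h, h_mean, h = self.sample_h_given_v(v0)
        pre_v, v_mean, v = self.sample_v_given_h(h)
        return pre_h, h_mean, h, pre_v, v_mean, v

    def gibbs_hvh(self, h0):
        """One Gibbs step starting from the hidden units."""
        pre_v, v_mean, v = self.sample_v_given_h(h0)
        pre_h, h_mean, h = self.sample_h_given_v(v)
        return pre_v, v_mean, v, pre_h, h_mean, h

rbm = TinyRBM(n_visible=4, n_hidden=3)
step = rbm.gibbs_vhv([1, 0, 1, 0])
```

Returning the pre-activation alongside the mean and the sample keeps every intermediate quantity available to the caller, which is the same triple the list above describes for each layer.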