The following content is translated from
A Fast Learning Algorithm for Deep Belief Nets by Geoffrey E. Hinton and Simon Osindero
Notation:
W.T denotes the transpose of the weight matrix W
1 Introduction
Problems faced when learning a belief network with many hidden layers:
1. It is difficult to infer the posterior distribution over the hidden states in all layers, conditioned on a data vector.
2. Local optima and vanishing gradients.
3. The learning process is slow (because all of the parameters have to be learned at once).
Model: the top two hidden layers form an associative memory, and the other hidden layers form a directed acyclic graph (DAG) that passes the representations held in the associative memory down to the input layer, reconstructing the visible variables, i.e., the sample pixels.
We will describe a model in which the top two layers form an associative memory (Figure 1),
Figure 1
while the other hidden layers form a directed acyclic graph (DAG) that converts the representations in the associative memory into visible variables, such as the pixels of an image.
Before continuing with the following sections, consider a question:
why is it difficult to learn a belief network one layer at a time?
The reasons are as follows. To learn W, we need the posterior distribution over the first hidden layer (see the derivation in the RBM notes), and that posterior is intractable:
1. because of the "explaining away" phenomenon;
2. because the posterior depends on the prior as well as the likelihood, so even if we only learn one layer at a time, all of the weight matrices influence each other;
3. because computing the prior over the first hidden layer requires combining all of the higher-level parameters.
Figure 2
2 Complementary priors
Figure 3 shows a simple logistic belief network containing two independent hidden causes; once we observe that the house jumps, they become highly anti-correlated. The bias of -10 on the earthquake node means that, in the absence of any input, the node is $e^{10}$ times more likely to be off than on. If the earthquake node is on and the truck node is off, the total input to the house-jumps node is 0 (the earthquake contributes an input that cancels the bias, the truck contributes nothing), so the house-jumps node has an even chance of turning on. Similarly, when both the earthquake and the truck nodes are off, the house-jumps node receives a total input of -20 and is very unlikely to turn on. Turning on both hidden causes to explain the observation is very wasteful, because the prior probability of them both occurring is roughly $e^{-10} \times e^{-10} = e^{-20}$. And if the earthquake node is on, it "explains away" the truck node and lowers its probability. Why? Suppose a house collapse can be caused either by an earthquake or by a terrorist attack. If we find that the house has collapsed, then very likely one of the two happened; but if we then learn that the house was brought down by a plane in a terrorist attack, we can no longer say that an earthquake probably occurred. The probability of an earthquake drops from "very likely" back to merely "possible"; this is explaining away.
Figure 2
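As a quick sanity check on the explaining-away story, the posterior over the two hidden causes can be computed by brute-force enumeration. Below is a minimal sketch (my own illustration, not from the paper or the original notes; numpy and the function name joint are assumptions), using the biases and weights described above: -10 on each hidden cause, -20 on the visible node, +20 connections.

```python
import numpy as np

# Biases/weights of the earthquake / truck-hits-house example described above
# (assumed values: -10 on each hidden cause, -20 on "house jumps", +20 connections).
b_e, b_t, b_h, w = -10.0, -10.0, -20.0, 20.0
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def joint(e, t):
    # Unnormalised joint p(e, t, house_jumps = 1) = p(e) p(t) p(h = 1 | e, t)
    p_e = sigmoid(b_e) if e else 1.0 - sigmoid(b_e)
    p_t = sigmoid(b_t) if t else 1.0 - sigmoid(b_t)
    p_h = sigmoid(b_h + w * e + w * t)          # probability that the house jumps
    return p_e * p_t * p_h

post = {(e, t): joint(e, t) for e in (0, 1) for t in (0, 1)}
Z = sum(post.values())
for (e, t), p in post.items():
    print(f"P(earthquake={e}, truck={t} | house jumps) = {p / Z:.4f}")
# (1,0) and (0,1) each get ~0.5, while (1,1) is vanishingly unlikely:
# observing one cause explains away the other.
```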
The explaining-away phenomenon makes inference in belief networks very complicated. In densely connected networks, the posterior distribution over the hidden variables is intractable (in the generative direction, the layer above, earthquake and truck hits, produces the layer below, house jumps; inference has to go the other way), except in a few special cases such as mixture models or linear models with additive Gaussian noise.
Figure 4
Complementary Priors
Suppose we have a layered graph structure (Figure 4), and suppose we are given the following conditional probabilities.
Note that these conditionals run in the reverse direction, from layer i+1 to layer i.
This means we are free to modify the prior used by the model.
Figure 5
In this case, sampling from the posterior distribution amounts to running a Markov chain upward through the layers; a complementary prior is, in effect, the stationary distribution of this Markov chain.
(This gives the factorial distribution mentioned below.)
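For reference, here is a short statement of what "complementary" means, in my own generic layer notation $x^{(i)}$ (a paraphrase, not the original note's formula): given the top-down conditionals $p(x^{(i)} \mid x^{(i+1)})$, a prior over the layers above is complementary when the posterior over every hidden layer factorizes into independent per-unit terms, so inference reduces to independent sampling of each unit:

$$p\bigl(x^{(i+1)} \mid x^{(i)}\bigr) \;=\; \prod_j p\bigl(x_j^{(i+1)} \mid x^{(i)}\bigr) \quad \text{for every layer } i.$$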
————————————————————
Explanations of explaining away:
Organizing my notes (1): a simple understanding of explaining away
"Explaining away" in Bayesian networks
Explanations of complementary priors: [20140410] Complementary Prior
Setting the Stage: Complementary Priors and Variational Bounds
————————————————————
2.1 An infinite directed model with tied weights
Figure 6
Figure 6 shows an infinite logistic belief network with tied weights; the downward arrows represent the generative model and the upward arrows the recognition model.
We can generate data by giving the infinitely deep hidden layer a random state and then passing that state down from layer to layer (the binary state of each unit in each layer is sampled from a Bernoulli distribution with probability sigmoid(states of the layer above times the weight matrix, plus a bias); see the Gibbs-sampling part of the RBM notes for details).
The variables in H0 are conditionally independent given V0, so the inference process is very simple: just multiply V0 by W.T.
The model above H0 implements the complementary prior.
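The following is a minimal sketch of generation and inference in the tied-weight network (my own illustrative code; the layer sizes, random seed, and the names generate / infer are assumptions, not from the paper). The top-down pass samples each layer from sigmoid of the layer above times the weights plus a bias, and the bottom-up pass applies W.T in the same way:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p):
    # Binary states: compare each probability with uniform noise.
    return (rng.random(p.shape) < p).astype(np.float64)

n_visible, n_hidden = 6, 4
W = rng.normal(0.0, 0.1, size=(n_hidden, n_visible))   # tied generative weights, shared by every pair of layers
b_vis, b_hid = np.zeros(n_visible), np.zeros(n_hidden)

def generate(h_top, n_layers=50):
    """Top-down ancestral pass: alternate hidden -> visible -> hidden with the same W."""
    h = h_top
    for _ in range(n_layers):
        v = sample_bernoulli(sigmoid(h @ W + b_vis))      # generative direction uses W
        h = sample_bernoulli(sigmoid(v @ W.T + b_hid))    # going back up uses W.T
    return v

def infer(v0):
    """Because of the complementary prior, the posterior over H0 is factorial:
    just multiply the data vector by W.T and squash."""
    return sigmoid(v0 @ W.T + b_hid)

v = generate(sample_bernoulli(0.5 * np.ones(n_hidden)))
print("generated visible sample:", v)
print("factorial posterior over H0:", infer(v))
```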
From this point of view, the model is just like any other directed acyclic belief network. Unlike other directed networks, however, we can sample from the true posterior distribution over all of the hidden layers: start with a data vector on the visible units and use the transposed weight matrix to infer the factorial distribution over each hidden layer in turn.
Starting from the visible units, use W.T to infer the factorial distribution over each hidden layer.
Computing the posterior distribution for one layer in this way is one step of Gibbs sampling.
For unit j in layer H0 and unit i in layer V0 (Figure 6), the gradient of the log-likelihood of a single data vector $v^0$ is

$$\frac{\partial \log p(v^0)}{\partial w_{ij}^{00}} = \bigl\langle h_j^0\,(v_i^0 - \hat v_i^0)\bigr\rangle \qquad (2)$$

where $\langle\cdot\rangle$ denotes an average over the sampled states and $\hat v_i^0$ is the probability that unit i would be turned on if the visible vector were stochastically reconstructed from the sampled hidden states.
The posterior distribution over the second hidden layer, V1, is computed from the sampled binary states H0 of the first hidden layer, and this is exactly the same process as reconstructing the data, so $v_i^1$ is a sample from a Bernoulli random variable with probability $\hat v_i^0$. The learning rule can therefore be written as

$$\frac{\partial \log p(v^0)}{\partial w_{ij}^{00}} = \bigl\langle h_j^0\,(v_i^0 - v_i^1)\bigr\rangle \qquad (3)$$

In going from (2) to (3), the dependence of $v_i^1$ on $h_j^0$ causes no problem for the derivative, because $\hat v_i^0$ is an expectation that is conditional on $h_j^0$. Since the weights are tied, the full derivative for a weight is obtained by summing the derivatives of the weights between all pairs of adjacent layers.
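Writing that sum out explicitly (my reconstruction of the intermediate step in the paper's notation; presumably this is the content of Eqs. (4) and (5) referenced below), the per-layer terms telescope and only the boundary terms survive:

$$\frac{\partial \log p(v^0)}{\partial w_{ij}}
= \bigl\langle h_j^0 (v_i^0 - v_i^1)\bigr\rangle
+ \bigl\langle v_i^1 (h_j^0 - h_j^1)\bigr\rangle
+ \bigl\langle h_j^1 (v_i^1 - v_i^2)\bigr\rangle
+ \cdots
= \bigl\langle v_i^0 h_j^0\bigr\rangle - \bigl\langle v_i^\infty h_j^\infty\bigr\rangle$$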
3 Restricted Boltzmann machines and contrastive divergence learning
Define $\langle v_i^0 h_j^0\rangle$ as the correlation obtained when a data vector is clamped on the visible units and the hidden states are sampled from their conditional distribution, and $\langle v_i^\infty h_j^\infty\rangle$ as the correlation measured after alternating Gibbs sampling (Figure 7) has run long enough to reach the stationary distribution. The gradient of the log probability of the data is then $\langle v_i^0 h_j^0\rangle - \langle v_i^\infty h_j^\infty\rangle$, Eq. (5), which agrees with Eq. (4): running the infinite directed net top-down until equilibrium is equivalent to letting an RBM settle to its stationary distribution, i.e., the infinite directed net defines the same distribution as an RBM.
Figure 7
Contrastive divergence learning in an RBM is very efficient, but it does not work well for deep networks in which each layer has its own weights, because when there are many layers it takes far too long to reach conditional equilibrium even with the data vector clamped.
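Below is a minimal CD-1 sketch for a single binary RBM (my own illustrative code; the learning rate, sizes, and the name cd1_update are assumptions). One Gibbs half-step reconstructs the data, and the weight update is the difference between the data correlation and the reconstruction correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
sample = lambda p: (rng.random(p.shape) < p).astype(np.float64)

def cd1_update(W, b_vis, b_hid, v0, lr=0.1):
    """One contrastive-divergence (CD-1) update for a binary RBM.
    v0: data batch of shape (batch, n_visible); W: (n_visible, n_hidden)."""
    # Positive phase: clamp the data and sample the hidden states.
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = sample(p_h0)
    # Negative phase: one Gibbs step -- reconstruct the data, recompute hidden probabilities.
    v1 = sample(sigmoid(h0 @ W.T + b_vis))
    p_h1 = sigmoid(v1 @ W + b_hid)
    # <v0 h0> - <v1 h1>: 1-step approximation to the maximum-likelihood gradient.
    W += lr * (v0.T @ p_h0 - v1.T @ p_h1) / v0.shape[0]
    b_vis += lr * (v0 - v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid

# Toy usage on random binary "data".
n_visible, n_hidden = 8, 3
W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
b_vis, b_hid = np.zeros(n_visible), np.zeros(n_hidden)
data = sample(0.5 * np.ones((16, n_visible)))
for _ in range(100):
    W, b_vis, b_hid = cd1_update(W, b_vis, b_hid, data)
```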
4 A Greedy Learning algorithm for transforming representations
Figure 8
Figure 5 shows a hybrid network. The top two layers are joined by undirected connections and form an associative memory. The layers below it are directed: the top-down edges, called generative connections, are used to map a state of the associative memory to an image, while the bottom-up edges, called recognition connections, are used to infer a factorial representation in each layer from the layer below (factorial was explained above; Ctrl+F it if you do not remember).
The undirected connections at the top of Figure 5 are equivalent to a black box containing infinitely many layers with tied weights. For ease of analysis, assume that all layers have the same number of units. Suppose the layers above form a complementary prior; then we can learn a reasonable value of W0 (not the optimal one, since this is equivalent to assuming that all the weight matrices above W0 are equal to W0).
Greedy algorithm:
1. Assume all the weight matrices are tied. Learn W0, then fix it and use W0.T to infer factorial approximate posterior distributions over the states of the first hidden layer. This is equivalent to learning an RBM (Figure 8). Because we run n-step contrastive divergence, this amounts to ignoring the small gradients contributed by the tied weights in the higher layers.
Figure 9
2. Fix W0 (it is used in both directions, as generative and as recognition weights for the first layer) and learn the weights of the next level up, treating the activities inferred for the first hidden layer as data; this is equivalent to learning another RBM (a sketch of this stacking procedure appears after Figure 10).
Figure 10
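A sketch of the greedy stacking procedure itself (illustrative only; it reuses the cd1_update function from the RBM sketch above, and greedy_pretrain, the epoch count, and the layer sizes are my own assumptions): each layer's RBM is trained on the factorial activities inferred by the layer below, then frozen.

```python
import numpy as np

def greedy_pretrain(data, layer_sizes, epochs=50):
    """Greedy layer-wise pre-training: train an RBM on the current inputs, freeze it,
    use sigmoid(inputs @ W + b_hid) as the data for the next RBM, and repeat.
    Assumes cd1_update from the RBM sketch above is in scope."""
    rng = np.random.default_rng(0)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    stack, inputs = [], data
    for n_hidden in layer_sizes:
        n_visible = inputs.shape[1]
        W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
        b_vis, b_hid = np.zeros(n_visible), np.zeros(n_hidden)
        for _ in range(epochs):
            W, b_vis, b_hid = cd1_update(W, b_vis, b_hid, inputs)
        stack.append((W, b_vis, b_hid))                  # freeze this layer's weights
        inputs = sigmoid(inputs @ W + b_hid)             # factorial activities feed the next RBM
    return stack

# e.g. stack = greedy_pretrain(data, layer_sizes=[500, 500, 2000])  # sizes used for MNIST in the paper
```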
If, after this greedy stage, we go on to change the higher-level weight matrices, we can guarantee that the generative model improves. As Neal and Hinton argue in "A view of the EM algorithm that justifies incremental, sparse and other variants", the log probability of a single data vector $v^0$ under a multilayer generative model is related to a free energy: the free energy is the expected energy under an approximating distribution minus the entropy of that distribution, and the log probability is bounded below by the negative free energy. For a directed model, the energy of a configuration $(v^0, h^0)$ is

$$E(v^0, h^0) = -\bigl[\log p(h^0) + \log p(v^0 \mid h^0)\bigr]$$

so the bound is

$$\log p(v^0) \;\ge\; \sum_{h^0} Q(h^0 \mid v^0)\bigl[\log p(h^0) + \log p(v^0 \mid h^0)\bigr] \;-\; \sum_{h^0} Q(h^0 \mid v^0)\log Q(h^0 \mid v^0) \qquad (8)$$
Here $h^0$ is a binary configuration of the units in the first hidden layer, $p(h^0)$ is the prior probability of $h^0$ under the current model (defined by the weights above W0), and $Q(\cdot \mid v^0)$ is any probability distribution over the binary configurations of the first hidden layer. The bound (8) becomes an equality if and only if $Q(\cdot \mid v^0)$ is the true posterior distribution.
When all of the weight matrices are tied together, the factorial distribution over H0 produced by applying W0.T to a data vector is the true posterior distribution, so at step 2 of the greedy algorithm $\log p(v^0)$ equals the bound. Step 2 freezes both $Q(\cdot \mid v^0)$ and $p(v^0 \mid h^0)$, and with these fixed, the derivative of the bound equals the derivative of $\sum_{h^0} Q(h^0 \mid v^0)\log p(h^0)$.
So maximizing the bound is the same as maximizing the log probability of a dataset in which $h^0$ occurs with probability $Q(h^0 \mid v^0)$. If the bound becomes tighter, it is possible for $\log p(v^0)$ to fall even though the lower bound on it rises, but it can never fall below its value at step 2 of the greedy algorithm, because at that point the bound is tight and the bound always increases.
The greedy algorithm can be applied recursively. So if we use the maximum-likelihood RBM learning algorithm to learn each set of tied weights and then untie the bottom-level weights from the weights above, we can learn one layer at a time with a guarantee that we never decrease the log probability of the data under the full generative model. In practice we replace maximum-likelihood RBM learning with contrastive divergence learning because it is much faster and works well. Using contrastive divergence voids this guarantee, but it is still reassuring to know that, if each layer is learned carefully enough, adding extra layers improves an imperfect model.
To guarantee that the greedy algorithm keeps improving the generative model as more layers are learned, we design the model so that all layers are the same size; the higher-level weights can then be initialized to the values learned before they were untied from the weights of the layer below. The same greedy algorithm can also be applied when the layers have different sizes.
5 Back-fitting with the up-down algorithm
Local optimum: learning the weight matrices one layer at a time is efficient, but it may not give the best results. Once the weights in the higher layers have been learned, neither the weights nor the simple inference procedure are optimal for the lower layers.
1. This sub-optimality of greedy learning is a relatively minor problem for supervised methods, but it can be a real problem for unsupervised methods.
2. Labels may carry only a few bits of information, so over-fitting is usually a bigger concern than under-fitting.
Wake-sleep algorithm
1. "Wake" or "up-pass", use recognition weights, randomly initialize hidden parameters, use (2) to adjust generaive weights from bottom to top
2. The associative memory at the top has no directed edges; it is trained in the same way as during pre-training.
When training the top-level weights, the labels are provided as part of the input, and a softmax unit is used for classification.
3. "Sleep" or "down-pass" stage: use the generative weights, starting from the binary state of the two-layer associative memory at the top and working downward, to adjust the recognition weights.
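A heavily simplified sketch of the wake and sleep passes for the directed layers only (my own illustration: the top-level associative memory, the label softmax, and the contrastive-divergence update for the top RBM are all omitted, and every name and size is an assumption). The up-pass uses the recognition weights and trains the generative weights with the delta rule of Eq. (2); the down-pass does the mirror image:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
sample = lambda p: (rng.random(p.shape) < p).astype(np.float64)

def up_down_step(v0, rec, gen, rec_b, gen_b, lr=0.01):
    """One simplified wake/sleep pass over a stack of directed layers.
    rec[i]: recognition weights from layer i up to layer i+1; gen[i]: generative weights back down."""
    n = v0.shape[0]
    # Wake ("up") pass: sample states upward with the recognition weights,
    # then train each generative matrix to reconstruct the layer below (delta rule, Eq. (2)).
    states = [v0]
    for i, R in enumerate(rec):
        states.append(sample(sigmoid(states[-1] @ R + rec_b[i])))
    for i, G in enumerate(gen):
        pred = sigmoid(states[i + 1] @ G + gen_b[i])          # top-down prediction of layer i
        G += lr * states[i + 1].T @ (states[i] - pred) / n
    # Sleep ("down") pass: sample a "dream" downward with the generative weights,
    # then train each recognition matrix to recover the layer above.
    dream = [states[-1]]
    for i in range(len(gen) - 1, -1, -1):
        dream.insert(0, sample(sigmoid(dream[0] @ gen[i] + gen_b[i])))
    for i, R in enumerate(rec):
        pred = sigmoid(dream[i] @ R + rec_b[i])               # bottom-up prediction of layer i+1
        R += lr * dream[i].T @ (dream[i + 1] - pred) / n
    return rec, gen
```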
Deep Belief Network DBN