(Continued from the previous part.)
Note: The following two sections on deep learning methods still need polishing, but to preserve the continuity and integrity of the article, they are included here first and will be revised later.
9.3 Restricted Boltzmann Machine (RBM)
Assume we have a bipartite graph with no links between nodes within each layer. One layer is the visible layer, i.e., the input data layer (v), and the other is the hidden layer (h). If all nodes are random binary variables (taking only the values 0 or 1), and the joint probability distribution P(v, h) satisfies a Boltzmann distribution, we call this model a Restricted Boltzmann Machine (RBM).
Let's look at why it is a deep learning method. First, because the model is a bipartite graph, all hidden nodes are conditionally independent given the visible layer (there are no connections between them), that is, P(h|v) = P(h_1|v)...P(h_n|v). Similarly, given the hidden layer h, all visible nodes are conditionally independent. At the same time, because v and h jointly satisfy a Boltzmann distribution, when the input v is given we can obtain the hidden layer h through P(h|v), and once h is known we can reconstruct the visible layer through P(v|h). By adjusting the parameters, we want the visible layer v1 reconstructed from the hidden layer to be the same as the original visible layer v; if so, the hidden layer is another representation of the visible layer, and can therefore serve as features of the input data. That is why it is a deep learning method.
How do we train the model? That is, how do we determine the weights between the visible nodes and the hidden nodes? We need some mathematical analysis, i.e., the model itself.

The energy of a joint configuration (v, h) can be expressed as:

E(v, h; \theta) = -\sum_{i,j} w_{ij} v_i h_j - \sum_i b_i v_i - \sum_j a_j h_j

where \theta = \{W, a, b\} are the model parameters.
The joint probability distribution of a configuration is given by the Boltzmann distribution and the energy of that configuration:

P(v, h) = \frac{1}{Z} e^{-E(v, h; \theta)}, \quad Z = \sum_{v, h} e^{-E(v, h; \theta)}

where Z is the partition function.
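As a minimal sketch of the two formulas above (with assumed toy sizes and randomly drawn weights), the following computes the energy E(v, h) of a binary RBM configuration and the Boltzmann distribution P(v, h), obtaining the partition function Z by brute-force enumeration of all joint configurations, which is only feasible at this toy scale:

```python
# Toy binary RBM: energy function and Boltzmann distribution,
# with the partition function Z computed by exhaustive enumeration.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 3, 2
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))  # weights w_ij
b = np.zeros(n_visible)                                # visible biases b_i
a = np.zeros(n_hidden)                                 # hidden biases a_j

def energy(v, h):
    # E(v, h) = -sum_ij w_ij v_i h_j - sum_i b_i v_i - sum_j a_j h_j
    return -(v @ W @ h) - b @ v - a @ h

# All 2^(n_visible + n_hidden) joint configurations of binary units
configs = [(np.array(v), np.array(h))
           for v in itertools.product([0, 1], repeat=n_visible)
           for h in itertools.product([0, 1], repeat=n_hidden)]

# Partition function Z = sum over all configurations of exp(-E)
Z = sum(np.exp(-energy(v, h)) for v, h in configs)

def p_joint(v, h):
    # P(v, h) = exp(-E(v, h)) / Z
    return np.exp(-energy(v, h)) / Z

# Sanity check: the probabilities over all configurations sum to 1
total = sum(p_joint(v, h) for v, h in configs)
print(round(float(total), 6))
```

Enumerating Z this way is exponential in the number of units, which is exactly why the learning procedure described below avoids computing it directly.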
Because the hidden nodes are conditionally independent given the visible layer (there are no connections between them), we have:

P(h|v) = \prod_j P(h_j|v)
Then we can easily obtain, given the visible layer v, the probability that the j-th hidden node is 1 or 0 (the distribution factorizes):

P(h_j = 1 | v) = \sigma\left(\sum_i w_{ij} v_i + a_j\right)

where \sigma(x) = 1/(1 + e^{-x}) is the sigmoid function.
Similarly, given the hidden layer h, the probability that the i-th visible node is 1 or 0 is just as easy to obtain:

P(v_i = 1 | h) = \sigma\left(\sum_j w_{ij} h_j + b_i\right)
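The two factorized conditionals can be sketched directly in code. The weight values below are made up purely for illustration; the point is that both conditionals are just a sigmoid of a weighted sum plus a bias:

```python
# Factorized RBM conditionals:
#   P(h_j = 1 | v) = sigmoid(sum_i w_ij v_i + a_j)
#   P(v_i = 1 | h) = sigmoid(sum_j w_ij h_j + b_i)
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

W = np.array([[ 0.5, -0.2],
              [ 0.1,  0.4],
              [-0.3,  0.2]])   # shape (n_visible, n_hidden)
a = np.zeros(2)                # hidden biases
b = np.zeros(3)                # visible biases

v = np.array([1.0, 0.0, 1.0])  # an observed visible vector

p_h = sigmoid(v @ W + a)       # one probability per hidden unit
h = (p_h > 0.5).astype(float)  # deterministic rounding, just for the demo
p_v = sigmoid(W @ h + b)       # reconstruction probabilities per visible unit

print(p_h.shape, p_v.shape)
```

In actual training, h would be sampled stochastically from p_h rather than rounded; the rounding here just keeps the demo deterministic.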
Given a sample set of independent, identically distributed samples D = \{v^{(1)}, v^{(2)}, \ldots, v^{(N)}\}, we need to learn the parameters \theta = \{W, a, b\}.
We maximize the following log-likelihood function (maximum likelihood estimation: for a probabilistic model, we choose the parameters that maximize the probability of the currently observed samples):

L(\theta) = \sum_{n=1}^{N} \log P(v^{(n)}; \theta)
That is to say, by maximizing the log-likelihood function, we obtain the parameters W corresponding to the maximum of L.
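The exact gradient of this log-likelihood, \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model}, is intractable because the model expectation requires the partition function. In practice it is approximated with Hinton's contrastive divergence (CD-1), which replaces the model expectation with a single Gibbs step. A sketch of one CD-1 weight update, with assumed toy sizes:

```python
# One CD-1 update for a binary RBM: positive phase on the data,
# one Gibbs step for the negative phase, then the weight update.
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_v, n_h, lr = 4, 3, 0.1
W = rng.normal(scale=0.01, size=(n_v, n_h))  # weights
a = np.zeros(n_h)                            # hidden biases
b = np.zeros(n_v)                            # visible biases

v0 = np.array([1.0, 0.0, 1.0, 1.0])          # one training vector

# Positive phase: hidden probabilities and a sample given the data
ph0 = sigmoid(v0 @ W + a)
h0 = (rng.random(n_h) < ph0).astype(float)

# Negative phase: one Gibbs step (reconstruct v, re-infer h)
pv1 = sigmoid(W @ h0 + b)
v1 = (rng.random(n_v) < pv1).astype(float)
ph1 = sigmoid(v1 @ W + a)

# CD-1 parameter updates: data statistics minus reconstruction statistics
W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
b += lr * (v0 - v1)
a += lr * (ph0 - ph1)

print(W.shape)
```

Looping this update over the whole sample set D, for several epochs, approximately maximizes L(\theta).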
If we increase the number of hidden layers, we obtain a Deep Boltzmann Machine (DBM). If instead we use a Bayesian belief network (a directed graphical model, still with no links between nodes within a layer) in the part close to the visible layer, and Restricted Boltzmann Machines in the part far from the visible layer, we obtain a Deep Belief Net (DBN).
9.4 Deep Belief Networks (DBNs)
DBNs are probabilistic generative models. In contrast to the discriminative models of traditional neural networks, a generative model establishes a joint distribution between the observations and the labels, evaluating both P(observation | label) and P(label | observation), whereas a discriminative model evaluates only the latter, P(label | observation). Applying the traditional BP algorithm directly to deep neural networks runs into the following problems, which motivated DBNs:
(1) training requires a labeled sample set;
(2) the learning process is slow;
(3) improper parameter selection leads learning to converge to a poor local optimum.
A DBN consists of multiple layers of Restricted Boltzmann Machines; a typical network of this type is shown in Figure 3. These networks are "restricted" to one visible layer and one hidden layer, with connections between the layers but no connections between units within a layer. The hidden units are trained to capture the correlations of the higher-order data exhibited at the visible layer.
First, setting aside the top two layers, which form an associative memory, a DBN's connections are determined by top-down generative weights. Like building blocks, RBMs make these connection weights easier to learn than in traditional, deeply layered sigmoid belief networks.
At the beginning, an unsupervised greedy layer-by-layer method is used to pre-train and obtain the generative model's weights. This unsupervised greedy layer-by-layer method was proven effective by Hinton and is called contrastive divergence.
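The greedy layer-by-layer pre-training can be sketched as follows. The `train_rbm` helper below is an assumed toy CD-1 trainer (not from the original text); the key idea is only the outer loop: each layer's RBM is trained on the hidden activations produced by the layer below it:

```python
# Greedy layer-by-layer pre-training of a DBN stack of RBMs.
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, lr=0.1, epochs=5):
    """Toy CD-1 trainer; returns (weights, hidden_biases)."""
    n_vis = data.shape[1]
    W = rng.normal(scale=0.01, size=(n_vis, n_hidden))
    a = np.zeros(n_hidden)
    b = np.zeros(n_vis)
    for _ in range(epochs):
        for v0 in data:
            ph0 = sigmoid(v0 @ W + a)                 # positive phase
            h0 = (rng.random(n_hidden) < ph0).astype(float)
            pv1 = sigmoid(W @ h0 + b)                 # one Gibbs step
            v1 = (rng.random(n_vis) < pv1).astype(float)
            ph1 = sigmoid(v1 @ W + a)
            W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
            a += lr * (ph0 - ph1)
            b += lr * (v0 - v1)
    return W, a

data = (rng.random((20, 6)) > 0.5).astype(float)  # fake binary dataset
layer_sizes = [4, 3]                              # two stacked RBMs

layers, inputs = [], data
for n_hidden in layer_sizes:
    W, a = train_rbm(inputs, n_hidden)            # train this layer's RBM
    layers.append((W, a))
    inputs = sigmoid(inputs @ W + a)              # feed activations upward

print([W.shape for W, _ in layers])
```

Each pass through the outer loop freezes the layer just trained and uses its hidden activations as the "data" for the next RBM, which is exactly the greedy scheme described above.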
In this training phase, a vector v is produced at the visible layer and its values are passed to the hidden layer. In turn, the visible-layer input is reconstructed stochastically in an attempt to recover the original input signal. Finally, these new visible unit activations are passed forward again to obtain the hidden-layer activations h. (During training, the visible vector is first mapped to the hidden units; then the visible units are reconstructed from the hidden units; these new visible units are mapped to the hidden units again, giving new hidden units. Repeating these steps is called Gibbs sampling.) These back-and-forth steps are the familiar Gibbs sampling, and the difference in correlation between the hidden activations and the visible input serves as the main basis for the weight update.
Training time is significantly reduced, because only a single step is needed to approximate maximum likelihood learning. Each layer added to the network improves the log-probability of the training data, which we can understand as getting closer to the true representation of the energy. This meaningful expansion, together with the ability to use unlabeled data, is a decisive factor in any deep learning application.
In the top two layers, the weights are connected together, so that the lower layers' output provides a reference clue or association for the top layer, which the top layer then links to its memory contents. What we care about most in the end is discriminative performance, for example in a classification task.
After pre-training, the DBN can use labeled data and the BP algorithm to fine-tune its discriminative performance. Here, a label set is attached to the top layer (augmenting the associative memory), and a classification boundary for the network is obtained through bottom-up, learned recognition weights. This performance is better than that of a network trained with the BP algorithm alone. This can be explained intuitively: the BP algorithm in a DBN only needs to perform a local search of the weight-parameter space, so training is faster, and convergence time is shorter, than for a plain feed-forward neural network.
The flexibility of DBNs makes them easy to extend. One extension is Convolutional Deep Belief Networks (CDBNs). DBNs do not take into account the two-dimensional structure of an image, because the input is simply a one-dimensional vectorization of the image matrix. CDBNs address this issue: they exploit the spatial relationships of neighboring pixels, achieve translation invariance of the generative model through a model region called convolutional RBMs, and scale easily to high-dimensional images. DBNs also do not explicitly handle learning the temporal relationships between observed variables, although research in that direction exists, such as stacked temporal RBMs, and as sequence-learning variants, the so-called temporal convolution machines. This application of sequence learning offers an exciting future research direction for speech signal processing.
Current research related to DBNs also includes stacked autoencoders, which replace the RBMs in a traditional DBN with autoencoders. This allows deep multi-layer neural network architectures to be trained with the same rules, but without the strict requirements on layer parameterization. Unlike DBNs, autoencoders use a discriminative model, which makes it hard to sample the input space and thus harder for the network to capture its internal representation. However, denoising autoencoders avoid this problem and perform better than traditional DBNs: they add random corruption during training, and stacking them yields better generalization. Training a single denoising autoencoder follows the same process as training a generative model with an RBM.
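The core of a denoising autoencoder step can be sketched as follows (toy sizes and tied encoder/decoder weights assumed for brevity): the input is randomly corrupted, encoded, decoded, and the reconstruction loss is measured against the clean input:

```python
# One denoising-autoencoder forward pass: corrupt, encode, decode,
# and score the reconstruction against the CLEAN input.
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_in, n_hidden = 8, 4
W = rng.normal(scale=0.1, size=(n_in, n_hidden))  # tied weights
b_h = np.zeros(n_hidden)                          # hidden biases
b_o = np.zeros(n_in)                              # output biases

x = (rng.random(n_in) > 0.5).astype(float)        # clean binary input
mask = (rng.random(n_in) > 0.3).astype(float)     # keep ~70% of entries
x_noisy = x * mask                                # corrupted input

h = sigmoid(x_noisy @ W + b_h)                    # encode the noisy input
x_hat = sigmoid(W @ h + b_o)                      # decode (tied weights)
loss = float(np.mean((x_hat - x) ** 2))           # compare to the clean x

print(round(loss, 4))
```

Minimizing this loss by gradient descent, then stacking trained encoders layer by layer, mirrors the greedy DBN pre-training scheme described earlier, with the autoencoder in place of the RBM.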
(To be continued.)
Source: http://blog.csdn.net/zouxy09/article/details/8781396