After three years of frantically reading theory, I thought it was time to stop and do something useful, so I decided to write things down on my blog: first, to sort out the theory I have learned, and second, to keep myself accountable and share it with you. Let's talk about deep learning first, because it has a certain degree of practicality (people say it is "very close to money"), and many domestic experts have already written about it, so I will try to explain it from a different angle.
Deep learning is Professor Geoffrey Hinton's (University of Toronto) second spring; the first spring was the traditional neural network. Because the traditional multi-layer perceptron easily falls into a local minimum (when the back-propagation algorithm is used directly), its classification results were not satisfactory. The first reason is that the features are hand-crafted, and the second is the local-minimum problem. Deep learning introduces generative models from probabilistic graphical models, which can automatically extract the required features from the training set; the typical model is the Restricted Boltzmann Machine (RBM for short). The automatically extracted features solve the problem of designing features by hand, and they initialize the neural network weights, after which the back-propagation algorithm can be used for classification; experiments have achieved good results. For this reason, deep learning is hailed as the next generation of neural networks. Today's topic is the RBM:
Before we talk about the RBM, we should first mention energy-based models. The energy method comes from thermodynamics: molecules move vigorously at high temperature and can overcome local constraints (physical constraints between molecules, such as bonding forces); as the temperature gradually drops, the molecules eventually settle into a regular structure, which is also a low-energy state. Inspired by this, the early simulated annealing algorithm tries to jump out of local minima at high temperature. As a model borrowed from physics, the random field also adopts this idea. In a Markov Random Field (MRF), the energy model mainly plays two roles: 1. a global measure of a solution (the objective function); 2. the minimum-energy solution (the configuration of all the variables). Whether the optimal solution can be embedded into the energy function is crucial to finding the target solution; it determines whether our specific problem can be solved. One of the main tasks of statistical pattern recognition is to capture the correlation between variables, and the energy model also needs to capture this correlation: the degree of correlation between variables determines the energy level. We can use a graph to represent the correlations between variables and introduce a probability measure, which gives the energy model of a probabilistic graphical model. In fact, we can also skip the probabilistic representation; for example, in stereo matching, the difference between two pixels is used as the energy, and the configuration that minimizes the total energy over all pixel pairs is the target solution. As a probabilistic graphical model, the RBM introduces probability to make sampling convenient, because in the CD (Contrastive Divergence) algorithm sampling plays the role of simulating the gradient.
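To make the energy-probability connection concrete, the usual Boltzmann/Gibbs form (my own notation, not a formula from the original figures) turns an energy E(x) over configurations x into a probability by exponentiating and normalizing:

p(x) = \frac{e^{-E(x)}}{Z}, \qquad Z = \sum_{x} e^{-E(x)}

Low-energy configurations therefore get high probability, which is why minimizing energy and maximizing probability are two views of the same objective.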
Let's take a look at the RBM model definition (Figure 1) and the energy definition (Formula 1):
(Figure 1)
(Formula 1)
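The image for Formula 1 is not reproduced here. Assuming it follows the standard definition in Salakhutdinov's paper (with W the weights, b the visible biases and a the hidden biases, matching the description below), the RBM energy is:

E(v, h; \theta) = -\sum_i b_i v_i - \sum_j a_j h_j - \sum_{i,j} v_i W_{ij} h_j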
v denotes the visible (data) nodes and h denotes the hidden nodes. Theta = {W, b, a}: W is the network weights, b is the bias of the visible nodes, and a is the bias of the hidden nodes. These are the parameters we need to solve for, and generally we initialize them randomly at the beginning. For a 16*16 binary image there are 256 visible nodes, each of which can take two states (0 or 1). The number of hidden nodes is chosen by ourselves, and each hidden node also has two states (0 or 1). Once every node is assigned a state, the RBM has a corresponding energy; the joint assignment of states to all nodes is called a configuration. Summing e^(-E) over all configurations gives the normalization constant Z (the partition function). The e^(-E) of each configuration divided by the constant Z is the probability of that configuration. In practice, to avoid numerical problems, we work with the exponential-family form (logarithm and exponential functions), and the model probability is then defined as (Formula 2):
(Formula 2)
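The image for Formula 2 is also missing; the standard form, which is presumably what it showed, marginalizes the hidden units out of the Boltzmann distribution:

P(v; \theta) = \frac{1}{Z(\theta)} \sum_{h} e^{-E(v, h; \theta)}, \qquad Z(\theta) = \sum_{v, h} e^{-E(v, h; \theta)}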
Because the RBM is a generative model, our goal is to find the parameters theta that maximize the probability p(v) of the data. The general idea is maximum likelihood: a gradient algorithm is used to maximize the likelihood function. The gradient derivation is shown in Figure 2 (do not be intimidated by the formulas; the CD algorithm does not use this gradient directly, because it is hard to compute):
(Figure 2)
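The derivation image is not reproduced either; for the weights it arrives at the well-known two-term gradient of the log-likelihood (my own rendering, in the notation above):

\frac{\partial \ln P(v)}{\partial W_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}

The first expectation is taken with the visible units clamped to the training data; the second is taken under the model's own distribution, and it is this term that requires the intractable partition function Z.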
Although the formula can be written down this way, the partition function Z cannot actually be computed, because the number of joint configurations grows exponentially. However, this formula provides the prototype of the Contrastive Divergence algorithm (Formula 3):
(Formula 3)
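The image for Formula 3 is missing as well; judging from the description in the next paragraph, it is the CD weight update:

\Delta W_{ij} = \alpha \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}} \right)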
Alpha is the learning rate; the first term in the brackets is the expectation under the training data, and the second is the expectation under the model. The core of the CD algorithm is computing (or rather approximating) these two values. The two expectations in the formula above are described below; a sampling method is required (I will explain the steps first, since the steps are the most important):
Term 1 (the data-dependent expectation): multiply the visible nodes v1 by the weights, then pass the result through the sigmoid function to get the probability of each hidden node being on. Use these probabilities to sample the hidden nodes, obtaining the hidden state h1, and compute the expectation of the pairwise products (<v1 * h1>) over all node pairs, which is the first term.
Term 2 (the model expectation): multiply the hidden state h1 from Term 1 by the weights, compute the probabilities of the visible nodes with the sigmoid function, and sample to obtain the state v2; then compute h2 from v2 in the same way, and compute the expectation of the pairwise products (<v2 * h2>). Plugging both terms into the formula above gives deltaW.
With this approximation of the gradient, you can update the weights and iterate; the updates for b and a are similar.
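To make the two terms concrete, here is a minimal NumPy sketch of one CD-1 update following the steps above. It is my own illustration, not code from Ruslan's package; the names (cd1_update, v1, h1, v2, and so on) are hypothetical, binary units with sigmoid activations are assumed, W has shape (visible, hidden), and v1 is a batch of training vectors.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v1, W, b, a, alpha=0.1):
    # One CD-1 step for a binary RBM. v1: batch of shape (n, num_visible).
    n = v1.shape[0]
    # Term 1 (data expectation): v1 -> hidden probabilities -> sampled state h1
    p_h1 = sigmoid(v1 @ W + a)                        # P(h = 1 | v1)
    h1 = (np.random.rand(*p_h1.shape) < p_h1) * 1.0   # sampled hidden state
    # Term 2 (model expectation): h1 -> reconstructed v2 -> hidden probabilities again
    p_v2 = sigmoid(h1 @ W.T + b)                      # P(v = 1 | h1)
    v2 = (np.random.rand(*p_v2.shape) < p_v2) * 1.0   # sampled visible state
    p_h2 = sigmoid(v2 @ W + a)
    # Approximate gradient: <v1 h1>_data - <v2 h2>_model, averaged over the batch.
    # (The text uses the sampled h1 in the statistics; the probabilities p_h1 are a
    # common lower-variance choice, while the sample h1 still drives the reconstruction.)
    dW = (v1.T @ p_h1 - v2.T @ p_h2) / n
    db = (v1 - v2).mean(axis=0)                       # visible-bias update
    da = (p_h1 - p_h2).mean(axis=0)                   # hidden-bias update
    return W + alpha * dW, b + alpha * db, a + alpha * da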
This is how the CD algorithm simulates the gradient; the key lies in the sampling method used. MCMC-based sampling methods (such as Gibbs sampling, Metropolis-Hastings, and rejection sampling) are chosen as appropriate, and a deep understanding of sampling helps us understand why CD can simulate a gradient. In Hinton's words, this is an amazing fact. My own intuitive summary is this: the data expectation is the center of the distribution of the whole data set and can be seen as the minimum point; the model expectation is obtained by sampling, so it is random and scattered around that center; subtracting the two simulates a gradient. In fact, there is more than one way to solve for the theta parameters. For example, Ruslan's paper also gives another method: use a stochastic approximation procedure to estimate the model expectation and a variational method to estimate the data-dependent expectation. Given the author's limited level, I will not go into it.
In addition, when the parameters are estimated with the deeper generative model (the DBN), the resulting update is similar to the CD formula, so the DBN can also be trained with similar methods; for details, refer to Ruslan's code.
The DBM is trained by training its RBMs separately, layer by layer. For the other derived applications and techniques mentioned in Ruslan's work, you can read them in detail as needed. For example, estimating the partition function (Z) is an important cornerstone of graphical models; the magic of the autoencoder is that it simulates the distributed representations described in neuroscience, which amounts to finding a low-dimensional manifold; fine-tuning (the fine-tuning stage of a deep network) means that after the unsupervised pre-training described above, the labels are brought in and BP or similar methods are used to slightly update the weights. It is worth mentioning that the recognition stage can use either a generative model or a discriminative model.
For specific applications, you can try different methods.
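As a rough picture of the layer-wise training just mentioned, here is a hypothetical sketch (mine, not Ruslan's code) that reuses the sigmoid and cd1_update functions from the earlier sketch: each RBM is trained on the hidden activations produced by the RBM below it, and the resulting weights can then initialize a deep network for fine-tuning.

def pretrain_stack(data, layer_sizes, epochs=10, alpha=0.1):
    # Greedy layer-wise pre-training: one RBM per layer, trained bottom-up.
    rbms = []
    x = data                                    # input fed to the current layer
    for n_hid in layer_sizes:
        n_vis = x.shape[1]
        W = 0.01 * np.random.randn(n_vis, n_hid)
        b = np.zeros(n_vis)                     # visible biases
        a = np.zeros(n_hid)                     # hidden biases
        for _ in range(epochs):
            W, b, a = cd1_update(x, W, b, a, alpha)
        rbms.append((W, b, a))
        x = sigmoid(x @ W + a)                  # hidden probabilities become the next layer's "data"
    return rbms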
What I wanted to cover here are the basics. Besides the foundations of probabilistic graphical models and neural networks, deep learning also relies on a lot of tricks.
Oh, by the way, Hinton has also said that neural networks can simulate the vast majority of current machine learning methods, including SVM and AdaBoost, and of course the whole family of probabilistic graphical models (including structured learning); that has a bit of the flavor of swallowing up the graphical-model world. The author's field of view is limited, so I dare neither agree nor disagree; let's just leave it at that. This post is written somewhat casually, and the formulas are taken from other people's papers.
In addition, the author's level is limited, so some of it may be wrong; I hope you will point out and correct any mistakes.
One last note: the relationship between energy and probability in probabilistic graphical models is established by the corresponding theorem, which plays a very important role in converting an energy model into a probability. Also, why does maximizing p(x) (or P(v)) minimize the energy? I did not understand this at first. Later, a rigorous netizen (beiliyunlang) spent a long time finding the answer and generously shared it with me: take the logarithm of both sides of the formula on the right of Figure 2 (formula 5.7): ln P(v) = -FreeEnergy(v) - ln Z. Thus, when P(v) is largest, FreeEnergy(v) is smallest, and minimum free energy means the system reaches statistical equilibrium. For details, refer to:
Free energy.
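For completeness, the free-energy relation referred to above can be written out with the standard definitions, in the notation of this post:

F(v) = -\ln \sum_{h} e^{-E(v, h)}, \qquad P(v) = \frac{e^{-F(v)}}{Z}, \qquad \ln P(v) = -F(v) - \ln Z

Since ln Z is a constant that does not depend on v, a larger P(v) directly corresponds to a smaller free energy F(v).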
References:
[1] Learning Deep Generative Models. Ruslan Salakhutdinov.
[2] Learning Deep Architectures for AI. Yoshua Bengio.
[3] A Tutorial on Energy-Based Learning. Yann LeCun.
Please indicate the source when reprinting: http://blog.csdn.net/cuoqu/article/details/8886971