Document directory
- 1.1 Usage of the restricted Boltzmann machine (RBM)
- 1.2 The restricted Boltzmann machine (RBM) energy model
- 1.3 From the energy model to probability
- 1.4 Maximum Likelihood
- 1.5 The sampling method used
- 1.6 Introduction to Markov chain Monte Carlo
- References
Reading Notes on RBM for Deep Learning
Statement:
1) I saw this kind of statement on other blogs, such as @zouxy09's, so I have copied the veterans' format here.
2) This blog post is based on materials selflessly shared by many online experts and machine learning researchers; see the references for details. For the specific copyright statements, please refer to the original documents.
3) This document is only for academic exchange and is not for commercial use. The references for each part therefore do not correspond in detail, and some passages are copied directly from other blogs. If any of this accidentally infringes on anyone's interests, I hope you will forgive me and contact me to delete or modify the content until the parties concerned are satisfied.
4) I am a beginner and will inevitably make mistakes in this summary. I hope experienced readers will not hesitate to correct me. Thank you.
5) Reading this article requires some background in machine learning, statistics, and neural networks. (If you lack this background, you can still skim it as a general introduction.)
6) This is the first version, so there are bound to be errors, and I will keep revising, adding, and deleting content. I hope you will give plenty of advice; please reply directly and I will try to address it.
7) Since this is my first blog post, I am not yet familiar with the editing tools, and many of the formulas were lost; the formatting is, unfortunately, rather unsightly. If you know how to use the posting tools well, please give me some advice.
8) I have a Word version and a PDF version, but I do not know how to upload them. If you need them, you can get them in the deep learning discussion group. The group number cannot be published here because I have not obtained the group owner's consent; you can contact the group owner @tornadomeet.
Preface
This article is long. Please read it with patience.
If you really do not have the patience to read it all, at least read the sentences marked in red; otherwise you will be left with a lot of questions.
The structure of this article is somewhat scattered. The general flow is as follows:
How to use RBM --> what it is used for --> why the energy model is used --> why probability is defined --> the relationship between the solution objective and maximum likelihood --> how to solve it --> some other supplements
1.1 Usage of the restricted Boltzmann machine (RBM)
1.1.1 How an RBM is used
A common RBM network structure is as follows.
The RBM network above has n visible nodes and m hidden nodes. Each visible node is related only to the m hidden nodes and is independent of the other visible nodes; that is, the state of a visible node is affected only by the m hidden nodes. Likewise, each hidden node is affected only by the n visible nodes. This property makes RBM training easy.
Before getting into anything else, an analogy: suppose you go out for a stroll and reach a fork in the road. You just want to wander, so you have a 0.5 probability of taking the left road and a 0.5 probability of taking the right road; but you cannot decide which road to take, so you toss a coin: heads you go left, tails you go right. You toss it once, it comes up heads, and off you go to the left.
-- Back to the problem above: suppose the probability that some node A takes the value 1 is 0.6. This can be seen as an unfair coin whose probability of landing heads is 0.6 and of landing tails is 0.4. To assign a value to node A, toss this coin: if it lands heads, set the value to 1; if it lands tails, set it to 0. This is just like deciding which road to take.
-- If such an unfair coin cannot be found, replace it with a random number generator that produces a floating-point number between 0 and 1: the probability that the generated number is less than 0.6 is 0.6, and the probability that it is greater than 0.6 is 0.4.
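As a minimal sketch of this trick in Python (the function name and the value 0.6 are just for illustration, not from the original text):

```python
import random

def sample_binary(p_one):
    """Return 1 with probability p_one, otherwise 0 -- the 'unfair coin'."""
    return 1 if random.random() < p_one else 0

# Example: assign a value to node A, whose probability of being 1 is 0.6.
value_of_a = sample_binary(0.6)
```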
1.1.2 Uses of RBM
RBM is mainly used in two ways. One is to encode the data and then hand it to a supervised learning method for classification or regression. The other is to obtain the weight matrix and biases, which are used to initialize a BP neural network before training.
The first method is to use it as a dimensionality reduction method.
The second use may look strange. The reason is that a neural network also needs to train a weight matrix and biases; however, if a back-propagation neural network is trained directly and the initial values are not chosen well, it often falls into a local minimum. Practical results show that using the weight matrix and biases obtained from RBM training as the initial values of a BP neural network gives very good results.
This is similar to mountain climbing: a scenic area has many peaks and you may pick one to climb, hoping to reach the top of the highest one; but your energy is limited, you can climb only one mountain, and you do not know which is the highest, so it is easy to end up on a peak that is not the highest. If, however, a helicopter drops you near the summit of the highest peak, you can easily reach its top. Here RBM plays the role of the helicopter.
In fact, there are two more uses, described below.
Third, RBM can estimate the joint probability P(v,h). If v is taken as a training sample and h as a class label (with only one hidden node, we obtain the probability that the hidden node takes the value 1), then Bayes' formula can be used to compute P(h|v) and classification can be performed, as in naive Bayes, LDA, and HMM. In other words, RBM can be used as a generative model.
Fourth, RBM can directly compute the conditional probability P(h|v). If v is taken as a training sample and h as a class label (with only one hidden node, we obtain the probability that the hidden node takes the value 1), RBM can be used for classification. In other words, RBM can be used as a discriminative model.
1.2 The restricted Boltzmann machine (RBM) energy model
1.2.1 Definition of the Energy Model
Before talking about RBM itself, let's first talk about something else: the energy model.
What is the energy model? An intuitive picture: take a small ball with a rough, not quite round surface and put it into a bowl whose inner surface is also rough; just toss it in and see where it comes to rest. Usually it is most likely to stop at the bottom of the bowl, but it may also stop somewhere else near the bottom, or even near the rim (if the bowl is fairly shallow). The energy model treats each position where the ball can come to rest as a state; each state corresponds to an energy, defined by an energy function. The probability that the ball ends up in a given state (for example, the probability of stopping at the bottom is certainly different from the probability of stopping at the rim) can be defined through the energy of that state. Put another way, if the ball stops near the rim, that is one state, corresponding to an energy E, and the probability P of the state "the ball stops near the rim" can be expressed in terms of E as P = f(E), where f is determined by the energy function. That is my understanding of the energy model.
In this way, we have both energy functions and probabilities.
The Boltzmann network is a stochastic network. Describing a stochastic network comes down to two main points.
First, the probability distribution functions. Because the values of the network nodes are random, three kinds of probability distributions are used to describe the whole network: the joint probability distribution, the marginal probability distributions, and the conditional probability distributions. Understanding these three kinds of distributions is the key to understanding a stochastic network. Here I recommend Zhang Lianwen's introduction to Bayesian networks. In many documents the restricted Boltzmann network is described as an undirected graph; from the Bayesian-network point of view, it can also be seen as a bidirectional directed graph, that is, the probability that a hidden-layer node takes a certain state value can be computed from the input-layer nodes, and vice versa.
Second, the energy function. Stochastic neural networks are rooted in statistical mechanics; inspired by the energy functional in statistical mechanics, an energy function is introduced. The energy function is a measure describing the state of the whole system: the more ordered the system, or the more concentrated its probability distribution, the lower its energy; conversely, the more disordered the system, or the more uniform its probability distribution, the higher its energy. The minimum of the energy function corresponds to the most stable state of the system.
1.2.2 The role of the energy model
Why is this energy model necessary? There are several reasons.
1. RBM is an unsupervised learning method, and unsupervised learning aims to fit the input data as closely as possible; therefore the purpose of training the RBM network is to make the network fit the input data as well as it can.
2. For a group of input data, learning is very difficult when their distribution is unknown. For example, if you know the data follow a Gaussian distribution, you can write down the likelihood function and solve it to find which Gaussian it is. But if you do not know what distribution the data follow, you cannot even write down a likelihood function, and there is no way to get started on the problem.
Fortunately, there is a way out: a conclusion from statistical mechanics says that any probability distribution can be transformed into an energy-based model, and many distributions can take advantage of the particular properties and learning procedures of energy models; some even yield general-purpose learning methods. With something this good available, of course we use it.
3. In a Markov random field (MRF), the energy model mainly plays two roles: (1) it measures the quality of a global solution (the objective function); (2) the minimum-energy solution (the configuration of the variables) is the target solution. That is, the energy model gives unsupervised learning two things: a) an objective function; b) a target solution.
In other words, using the energy model makes it easier to learn the distribution of the data.
Whether the optimal solution can be embedded into an energy function is crucial to whether a specific problem can be solved. One of the main tasks of statistical pattern recognition is to capture the correlations between variables, and the energy model must likewise capture these correlations; the degree of correlation between variables determines the energy level. The correlations between variables are expressed as a graph, and by introducing a probability measure we obtain the energy model of a probabilistic graphical model.
As a probabilistic graphical model, RBM introduces probability so that sampling techniques can be used in the solution. In the contrastive divergence algorithm, sampling plays the role of approximating the gradient.
The energy model requires a defined energy function. The energy function of RBM is defined as follows:
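(The formula image did not survive here; the following is the standard form, as given for example in reference [1]. Assigning b to the visible biases and c to the hidden biases is my own notational choice.)

$$E(v,h) = -\sum_{i=1}^{n} b_i v_i - \sum_{j=1}^{m} c_j h_j - \sum_{i=1}^{n}\sum_{j=1}^{m} v_i W_{ij} h_j$$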
This energy function means that every joint configuration of the visible nodes and the hidden nodes has an energy: each assignment of values to the visible nodes together with each assignment of values to the hidden nodes corresponds to one energy value. If a particular assignment of the visible nodes (that is, a training sample) and a particular assignment of the hidden nodes (that is, the encoding of that sample) are substituted into the formula above, the energy of that joint configuration is obtained.
The meaning of the energy function can also be explained with the concept of a product of experts (PoE), a theory also proposed by Hinton. He regards each hidden node as an "expert"; each "expert" influences the distribution over the states of the visible nodes. A single "expert" may constrain the distribution of the visible nodes only weakly, but the combined effect of all the "experts" is strong enough. I am not very familiar with the details; if you are interested you can read Hinton's paper, and there is also a Chinese document, "Principle and Application of Product of Experts Systems" by Sun Zheng and Li Ning.
Another question is: Why probability? The following is an explanation.
The energy model requires two things: the energy function and the probability. With the probability, the model can be connected to the problem we actually want to solve.
Next we introduce how to go from the energy model to probability.
1.3 From the energy model to probability
1.3.1 From the energy function to probability
To introduce probability, a probability distribution must be defined. Based on the energy model, the energy function can be used to define the joint probability of the visible nodes and the hidden nodes.
That is, the probability P(v,h) of one assignment (state) of the visible nodes together with one assignment (state) of the hidden nodes is defined through the energy function.
This probability is not defined arbitrarily; it is interpreted via statistical thermodynamics: when a system is in thermal equilibrium with its surroundings, the probability that it is in state i is given by the following formula.
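(The formula itself was lost with the original image; it is the familiar Boltzmann distribution, reproduced here in standard form.)

$$P(i) = \frac{e^{-E_i/(k_B T)}}{Z}, \qquad Z = \sum_{j} e^{-E_j/(k_B T)}$$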
Here E_i denotes the energy of the system in state i, T is the absolute temperature in Kelvin, k_B is the Boltzmann constant, and Z is a constant independent of the state.
For RBM, E_i becomes E(v,h), because (v,h) is also a state; the other parameters, T and k_B, are set to 1 because they are irrelevant to the solution. Z is the denominator of the joint probability distribution; this denominator makes the probabilities sum to 1, so that P(v,h) really is a probability.
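(Written out, again reconstructing the lost formula in the standard way, the joint distribution of the RBM is:)

$$P(v,h) = \frac{e^{-E(v,h)}}{Z}, \qquad Z = \sum_{v,h} e^{-E(v,h)}$$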
Now we have not only a probability but a whole distribution. This distribution even has a nice name: the Gibbs distribution. Of course it is not an arbitrary Gibbs distribution; it is the particular one determined by the parameters of the energy function, namely W, b, and c.
With this joint probability, we can also obtain the conditional probabilities, which are derived by integrating (summing) out the variables that are not needed.
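(The resulting formulas did not survive either; by the conditional independence of the nodes they take the standard sigmoid form below, consistent with the energy function sketched above, where σ(x) = 1/(1+e^{-x}).)

$$P(h_j = 1 \mid v) = \sigma\Big(c_j + \sum_i v_i W_{ij}\Big), \qquad P(v_i = 1 \mid h) = \sigma\Big(b_i + \sum_j W_{ij} h_j\Big)$$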
1.3.2 From probability to maximum likelihood
We have now obtained the joint probability of a sample and its corresponding encoding, that is, the probability function of the Gibbs distribution represented by the RBM network. The purpose of introducing the energy model was precisely to make the solution easier.
Now return to the goal of the solution: make the Gibbs distribution represented by the RBM network fit the input data as well as possible.
In fact, the goal of the solution can also be considered to be to make the distribution of the RBM network and the distribution of input samples as close as possible.
Now let's look at how "fitting the input data as well as possible" is defined.
Assume Ω denotes the sample space and q denotes the distribution of the input samples, that is, q(x) is the probability of training sample x; q is in fact the distribution we want the network to represent. Assume p is the marginal distribution, over the visible nodes, of the Gibbs distribution represented by the RBM network (the hidden nodes are summed out, so p can be understood as the distribution over the states of the visible nodes), and let the input sample set be S. We can now define the KL distance between the sample distribution and the marginal distribution represented by the RBM network.
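(The defining formula was lost; in the notation above, together with the Monte Carlo estimate of its second term that is discussed next, it reads:)

$$\mathrm{KL}(q \,\|\, p) = \sum_{x\in\Omega} q(x)\ln q(x) - \sum_{x\in\Omega} q(x)\ln p(x) \approx \sum_{x\in\Omega} q(x)\ln q(x) - \frac{1}{L}\sum_{l=1}^{L} \ln p\big(x^{(l)}\big)$$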
If the distribution of the input samples is exactly the same as the distribution represented by the RBM, the KL distance is 0; otherwise it is a number greater than 0.
The first term depends only on the entropy of the input samples (by the definition of entropy) and is therefore fixed. The second term cannot be computed directly; however, with Monte Carlo sampling (introduced later), using the input samples themselves as the samples (the input samples follow the distribution q(x)), the second term can be estimated, where L denotes the number of training samples. Since the KL distance is never less than 0, the first term is never less than the second, so maximizing the second term minimizes the KL distance. Maximizing the second term is in turn equivalent to maximizing the log-likelihood, and this is exactly maximum likelihood estimation.
The conclusion is that by maximizing the likelihood of the input samples, the distribution represented by the RBM network becomes as close as possible to the distribution of the samples themselves.
This is why the RBM problem can finally be converted into a maximum likelihood problem.
Maximizing the likelihood is of course meaningful: after the RBM network has been trained, if the network is run at random through a number of states (each state being a pair (v,h)), then among those states the training samples appear on the visible nodes (that is, as v) with the highest probability.
This ensures that in the decoding process (from the hidden nodes back to the visible nodes), the training sample appears with the highest probability, that is, the decoding error is minimized.
For example, if a sample is encoded into some hidden-node vector, then decoding that hidden-node vector back to the visible nodes should reproduce the original sample with high probability.
At this point the relationship between the energy model and maximum likelihood should be clear. It may still be slightly off; if you see anything wrong, please point it out and I will correct it.
I previously kept trying to figure out where the principle of minimum energy shows up in the RBM network when the likelihood function is maximized. I read a lot of material and still could not connect those two extrema; it was a long detour. I hope everyone takes note and avoids it.
The following describes how to solve the problem. The solution consists of the values of the parameters W, b, and c of the RBM network; the likelihood function (the log-likelihood function) is the objective function. The relationship between the solution and the objective function in an optimization problem needs to be kept clear from the start.
1.4 Maximum Likelihood
To carry out maximum likelihood, we need to maximize the likelihood function.
The solution process is to differentiate with respect to the parameters and then use gradient ascent to keep improving the objective function until a stopping condition is reached (if this is unclear, see the reference "From maximum likelihood to EM").
If the parameters are denoted θ (note that θ stands for the whole set of parameters), the likelihood function can be written as follows.
Then we take its logarithm (differentiating a product directly is too complicated, so we turn it into a sum by taking the logarithm).
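(The two formulas were lost; with training samples v^{(1)}, ..., v^{(L)} they are the usual ones:)

$$L(\theta) = \prod_{l=1}^{L} P\big(v^{(l)}\big), \qquad \ln L(\theta) = \sum_{l=1}^{L} \ln P\big(v^{(l)}\big)$$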
It can be seen that differentiating it gives a sum of the derivatives of the individual ln P(v) terms; the derivation then proceeds as follows.
This is also explained in the paper. The formula above can be transformed further; take the partial derivative with respect to W as an example.
Note, regarding the step from the second "=" to the third "=": the first term inside the square brackets after the second "=" uses Monte Carlo sampling in reverse, going back from samples to an integral, so it becomes an expectation; the second term is the same for every training sample, so summing it over the samples and dividing by L changes nothing, and it remains an expectation, only written in a different form.
In this form the gradient can be understood as follows: the first term is the expectation of the gradient of the energy function under the distribution of the data themselves (q(v,h) = P(h|v)q(v), where q denotes the distribution of the input samples together with their corresponding hidden states), and the second term is the expectation of the gradient of the energy function under the Gibbs distribution represented by the RBM network.
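(In symbols, for any parameter θ, this is the standard result, see for example [1]:)

$$\frac{\partial \ln P(v)}{\partial \theta} = -\sum_h P(h \mid v)\,\frac{\partial E(v,h)}{\partial \theta} + \sum_{v,h} P(v,h)\,\frac{\partial E(v,h)}{\partial \theta}$$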
The first term can be computed, because the training samples are already available: when Monte Carlo is used to estimate that expectation, the samples are simply the training samples, so only an average needs to be taken.
The second term can in principle also be computed, but traversing all possible combinations of values of v and h is generally infeasible. If we want to be lazy and estimate it by sampling instead, the trouble is that we have no samples from the overall distribution represented by the RBM network (how to draw such samples is introduced later).
To continue the discussion, we simplify the gradient further and see what we get. According to the formula for the energy function, there are three kinds of parameters: W, b, and c.
Carrying out this step and analyzing the result, we obtain the following.
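(The three gradient formulas were lost; in the notation used above they work out to the familiar forms below, where the sums run over all possible visible vectors v'.)

$$\frac{\partial \ln P(v)}{\partial W_{ij}} = P(h_j=1\mid v)\,v_i - \sum_{v'} P(v')\,P(h_j=1\mid v')\,v'_i$$

$$\frac{\partial \ln P(v)}{\partial b_i} = v_i - \sum_{v'} P(v')\,v'_i$$

$$\frac{\partial \ln P(v)}{\partial c_j} = P(h_j=1\mid v) - \sum_{v'} P(v')\,P(h_j=1\mid v')$$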
To compute the second term, all possible values of v would have to be traversed and the gradient values computed from the formula, which is troublesome. Fortunately, Monte Carlo methods provide a lazy way out; see the next section.
As long as we draw a batch of samples that follow the Gibbs distribution represented by the RBM network (that is, the distribution p(x) determined by the current parameters), the three partial derivatives above can be estimated.
Concretely, for each training sample x, "some sampling method" is used to draw a corresponding sample, say y, that follows the Gibbs distribution represented by the RBM network (that is, the distribution p(x) determined by the parameters). For the whole training set {x1, x2, ..., xL} we thus obtain a set of samples {y1, y2, ..., yL}, and this set is used to estimate the second term. The gradient can then be approximated with the formula below.
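(A sketch of the missing approximation, written only for W; the formulas for b and c are analogous. Superscripts index samples and subscripts index components.)

$$\frac{\partial \ln L(\theta)}{\partial W_{ij}} \approx \sum_{k=1}^{L}\Big[ P\big(h_j=1 \mid x^{k}\big)\,x^{k}_i - P\big(h_j=1 \mid y^{k}\big)\,y^{k}_i \Big]$$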
Here xk denotes the k-th training sample, and yk denotes the sample corresponding to the k-th training sample drawn from the Gibbs distribution represented by the RBM network (call this distribution R). The sample yk is drawn according to R, starting from xk, using Gibbs sampling; that is, yk follows the distribution R and can be used to estimate the second term, while at the same time yk is related to xk. v^{yk} denotes the state in which the visible nodes take the values yk, and v^{yk}_j denotes the value of the j-th feature of yk.
With the gradient in hand, the maximum likelihood problem can be solved: in theory, after a number of iterations, the values of the parameters W, b, and c are obtained.
In the formula, v refers to one sample from {x1, x2, ..., xL}; when accumulating over the samples, the first term sums over all the samples and so does the second, so after accumulation the 1/L disappears and only the sum over the y's remains. In the CD-k algorithm below, only one x and one y are processed at a time, but after the outermost loop over all L values of x, the accumulated result is the same.
As mentioned above, "Some Sampling Method" is usually used to sample the data with a pair of unique parameters. Professor Hinton also developed a CD-K algorithm based on this method, which is used to solve RBM sampling. Let's take a look at the following chapter.
1.5 The sampling method used
Generally speaking, before proposing CD-k, Professor Hinton used Gibbs sampling to solve the RBM sampling problem.
Gibbs sampling is a sampling method based on the Markov chain Monte Carlo (MCMC) strategy. Specifically, for a D-dimensional random vector X = (X1, X2, ..., XD), suppose we cannot obtain the joint probability distribution p(X), but we do know the conditional distribution of the i-th component Xi given the other components, that is, p(Xi | Xi-), where Xi- = (X1, ..., Xi-1, Xi+1, ..., XD). Then, starting from an arbitrary state of X, say (x1(0), x2(0), ..., xD(0)), we iteratively sample each component in turn from its conditional distribution p(Xi | Xi-). As the number of sampling steps n increases, the distribution of (x1(n), x2(n), ..., xD(n)) converges to the joint distribution p(X) at a geometric rate in n.
In other words, even without knowing the joint distribution explicitly, we can still draw samples from it.
Based on the symmetric structure of the RBM model and the conditional independence of its nodes, Gibbs sampling can be used to obtain random samples that follow the distribution defined by the RBM. The k-step Gibbs sampling algorithm in an RBM initializes the state of the visible nodes to a training sample v0 (or to a random initial state of the visible nodes) and then performs the following sampling steps alternately:
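(The alternating steps, reconstructed in the standard form, use the two conditional distributions from section 1.3:)

$$h^{(t)} \sim P\big(h \mid v^{(t)}\big), \qquad v^{(t+1)} \sim P\big(v \mid h^{(t)}\big), \qquad t = 0, 1, \ldots, k-1$$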
When the number of sampling steps is large enough, we obtain samples from the distribution defined by the RBM (that is, samples that follow the Gibbs distribution determined by the parameters). With these samples, the second term of the gradient can be computed.
As can be seen, the sampling above is carried out for k steps, and this k usually has to be fairly large, so it is time-consuming, especially when the training samples have many features (that is, when the number of visible nodes is large). So Professor Hinton devised a simplified version, called CD-k, which stands for contrastive divergence.
Unlike plain Gibbs sampling, Hinton pointed out that when the training samples are used to initialize v0, only a small number of sampling steps (usually just one) is needed to obtain a good enough approximation.
At the start of the CD algorithm, the states of the visible units are set to a training sample; the conditional probabilities above are then used to sample a value in {0, 1} for each hidden unit, and after that a value in {0, 1} is sampled for each visible unit. This yields v1, and in general v1 is already enough to estimate the gradient.
The following describes the CD-k based fast learning algorithm for RBM.
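(The pseudocode image is missing here, so as an illustration only, not the author's original pseudocode, here is a minimal numpy sketch of one CD-1 parameter update for a binary RBM, assuming visible biases b and hidden biases c as in the earlier formulas.)

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, lr=0.1, rng=None):
    """One CD-1 update for a binary RBM.

    v0 : (n,) binary training sample
    W  : (n, m) weight matrix, b : (n,) visible biases, c : (m,) hidden biases
    Returns the updated (W, b, c).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Up pass: sample the hidden state h0 from P(h | v0).
    p_h0 = sigmoid(c + v0 @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Down pass: sample the reconstruction v1 from P(v | h0).
    p_v1 = sigmoid(b + W @ h0)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    # Hidden probabilities for the reconstruction (probabilities are commonly
    # used instead of samples in the negative phase to reduce noise).
    p_h1 = sigmoid(c + v1 @ W)
    # Gradient approximation: positive phase minus negative phase.
    W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    b += lr * (v0 - v1)
    c += lr * (p_h0 - p_h1)
    return W, b, c
```

Looping this update over the whole training set (and over several epochs) corresponds to the accumulation over the L samples discussed next.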
Incidentally, the reason the second term in the update carries no 1/L is that the gradient is accumulated over all samples (the maximum-likelihood gradient is the sum of the gradients of all training samples). There are exactly L samples, so after the accumulation the result is the same as keeping the 1/L inside and summing the L copies; either way the same value is obtained.
1.6 Introduction to Markov chain Monte Carlo
The following briefly introduces the Markov chain Monte Carlo (MCMC) method.
Monte Carlo methods estimate an expectation under a distribution p(x) by averaging the quantity of interest over samples drawn from p(x); this is the basic idea of the Monte Carlo method.
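(In symbols, the standard estimator this alludes to is:)

$$E_{p}[f(x)] = \int f(x)\,p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^{N} f\big(x^{(i)}\big), \qquad x^{(i)} \sim p(x)$$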
What remains is how to draw samples that follow the distribution p(x). In short, one starts from a random initial sample and moves it through many transitions of a Markov chain, finally obtaining a sample that follows p(x). The Gibbs sampling described above is one commonly used algorithm of this kind.
Thank you
Many members of the deep learning discussion group: @zeonsunlight, @Marvin, @tornadomeet, @Long time no see, @zouxy09
They provided a wide range of materials while I was writing this note.
In particular, @zeonsunlight helped me figure out many concepts, and @Long time no see helped me correct the symbol errors in my blog.
I am a slow learner, and I sincerely thank them for patiently answering so many of my questions.
I think the other parts of this note are reasonably clear, but I feel I have not explained the energy model itself clearly; I hope more experienced readers can offer guidance.
Finally, I must say again that my learning is shallow and there are surely many mistakes; I hope you will forgive me and point them out so that I can correct them in time.
References
[1] An Introduction to Restricted Boltzmann Machines. Asja Fischer and Christian Igel.
[2] Introduction to the Restricted Boltzmann Machine. Zhang Chunxia, Ji Nannan, and Guan Wei.
[3] Learning Deep Architectures for AI. Yoshua Bengio.
[4] http://blog.csdn.net/cuoqu/article/details/8886971 Deep Learning: Principle and Implementation (I). @marvin521
[5] http://blog.csdn.net/zouxy09/article/details/8775360 Deep Learning learning notes series. @zouxy09
[6] http://www.cnblogs.com/tornadomeet/archive/2013/03/27/2984725.html Deep Learning: 19 (a simple understanding of RBM). @tornadomeet
[7] http://blog.csdn.net/celerychen2009/article/details/8984316 Restricted Boltzmann Machine. @celerychen2009
[8] http://blog.csdn.net/zouxy09/article/details/8537620 From Maximum Likelihood to the EM Algorithm. @zouxy09
[9] http://www.sigvc.org/bbs/thread-513-1-3.html RBM and DBN Training. Wang Ying
[10] Neural Network Principles [M]. Ye Shiwei, Shi Zhongzhi. China Machine Press.
[11] http://www.sigvc.org/bbs/thread-512-1-1.html Restricted Boltzmann Machine (RBM) Derivation. Zhu Feiyun