9. Common models or methods of deep learning
9.1 AutoEncoder (automatic encoder)
One of the simplest approaches to deep learning exploits the hierarchical nature of artificial neural networks (ANNs). Take a neural network, require that its output equal its input, and train it so that the weights in each layer minimize this reconstruction error. We then naturally obtain several different representations of the input I (each layer gives one representation), and these representations are the features. An autoencoder is exactly such a neural network: one that tries to reproduce its input signal as faithfully as possible. To achieve this, the autoencoder must capture the most important factors that represent the input data, much as PCA finds the principal components that can represent the original information.
The specific process is described as follows:
1) Learn features from the given unlabeled data with unsupervised learning:
In the neural networks discussed earlier, each input sample carries a label, i.e. an (input, target) pair, so the difference between the current output and the target (label) can be used to adjust the parameters of the preceding layers until the network converges. But now we only have unlabeled data. Where does the error signal come from?
Suppose we feed the input into an encoder, which produces a code; this code is a representation of the input. How do we know this code actually represents the input? We add a decoder, which maps the code back to an output signal. If that output is very similar to the original input (ideally identical), we have reason to believe the code is reliable. Therefore, by adjusting the parameters of the encoder and decoder to minimize the reconstruction error, we obtain the first representation of the input signal, namely the code. Because there is no labeled data, the error comes from comparing the reconstruction directly with the original input.
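To make this concrete, here is a minimal sketch of such an encoder/decoder pair trained to minimize reconstruction error. It is an illustration added to these notes, not the author's code; it assumes PyTorch is available, and the layer sizes, learning rate, and random data are arbitrary placeholders.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """One encoder layer and one decoder layer; the code is the hidden activation."""
    def __init__(self, n_input=784, n_hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_input, n_hidden), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(n_hidden, n_input), nn.Sigmoid())

    def forward(self, x):
        code = self.encoder(x)       # the representation of the input
        recon = self.decoder(code)   # the attempt to reproduce the input
        return code, recon

# Unsupervised training: the error is the difference between reconstruction and input.
model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)              # a batch of unlabeled samples (placeholder data)
for step in range(100):
    code, recon = model(x)
    loss = nn.functional.mse_loss(recon, x)   # reconstruction error, no labels needed
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```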
2) Generate features with the encoder, then train the next layer, layer by layer:
We now have the code produced by the first layer. The minimal reconstruction error lets us believe that this code is a good expression of the original input signal, or, to stretch the point, exactly the same information as the original signal (expressed differently, but reflecting the same thing). Training the second layer is no different from training the first: we treat the code output by the first layer as the input signal of the second layer, again minimize the reconstruction error, and obtain the parameters of the second layer together with the second layer's code, i.e. the second expression of the original input. The remaining layers are handled in the same way (while training the current layer, the parameters of the earlier layers are kept fixed, and their decoders are no longer needed and can be discarded).
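A sketch of this layer-by-layer procedure, under the assumption of the hypothetical AutoEncoder class above: each new autoencoder is trained on the codes of the previous one, whose parameters are then left fixed and whose decoder is discarded.

```python
import torch
import torch.nn as nn

# AutoEncoder is the hypothetical class from the previous sketch.

def pretrain_stack(x, layer_sizes):
    """Greedy layer-by-layer training: each autoencoder is trained on the codes of
    the previous one; only the trained encoders are kept (decoders are discarded)."""
    encoders = []
    inputs = x
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        ae = AutoEncoder(n_input=n_in, n_hidden=n_out)
        opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
        for step in range(100):
            code, recon = ae(inputs)
            loss = nn.functional.mse_loss(recon, inputs)
            opt.zero_grad()
            loss.backward()
            opt.step()
        encoders.append(ae.encoder)             # keep the encoder, drop the decoder
        with torch.no_grad():
            inputs = ae.encoder(inputs)         # codes become the next layer's input
    return encoders

encoders = pretrain_stack(torch.rand(64, 784), [784, 256, 64])
```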
3) Supervised fine-tuning:
Using the method above, we can obtain many layers. As for how many layers are needed (i.e. how deep the network should be, for which there is currently no principled criterion), this has to be determined by experiment and tuning. Each layer yields a different expression of the original input, and we generally expect the higher ones to be more abstract and therefore better, much like the human visual system.
At this point, the autoencoder cannot yet be used to classify data, because it has not learned how to link an input to a class; it has only learned how to reconstruct or reproduce its input. In other words, it has only learned a feature that represents the input well, one that preserves the original input signal as much as possible. To perform classification, we can add a classifier (such as logistic regression or an SVM) on top of the final coding layer of the autoencoder, and then train with the standard supervised training method for multi-layer neural networks (gradient descent).
That is, at this point we feed the feature code of the last layer into the final classifier and fine-tune it with labeled samples through supervised learning. There are two variants. One is to adjust only the classifier.
The other is to fine-tune the entire system with the labeled samples (if there is enough data, this is the best option: end-to-end learning).
Once supervised training is complete, the network can be used for classification. The top layer of the neural network acts as a linear classifier, and we can replace it with a better-performing classifier if desired.
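For illustration (reusing the hypothetical encoders from the sketches above), supervised fine-tuning stacks the pretrained encoders under a linear classifier and trains on labeled data; optimizing only the classifier's parameters corresponds to the first option, optimizing everything corresponds to end-to-end fine-tuning.

```python
import torch
import torch.nn as nn

# 'encoders' is the list produced by the hypothetical pretrain_stack sketch above.
classifier = nn.Sequential(*encoders, nn.Linear(64, 10))   # linear classifier on top
loss_fn = nn.CrossEntropyLoss()

x_labeled = torch.rand(32, 784)
y_labeled = torch.randint(0, 10, (32,))                    # placeholder labels

# Option 1: adjust only the classifier  -> classifier[-1].parameters()
# Option 2: fine-tune the whole system  -> classifier.parameters()  (end-to-end)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-4)

for step in range(100):
    logits = classifier(x_labeled)
    loss = loss_fn(logits, y_labeled)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```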
In practice it has been found that adding the automatically learned features to the original features can greatly improve accuracy, and can even make the classifier outperform the best existing classification algorithms on some problems!
There are several variants of the autoencoder. Two of them are briefly introduced here:
Sparse AutoEncoder:
We can of course add constraints to obtain new deep learning methods. For example, if an L1 regularization penalty is added on top of the autoencoder (L1 mainly constrains most nodes in each layer to be 0 and only a few to be non-zero, which is where the name "sparse" comes from), we obtain the sparse autoencoder.
In other words, we constrain the code produced for each input to be sparse. Sparse expressions are often more effective than other expressions (the brain seems to work this way: a given input excites only some neurons, while most of the other neurons are inhibited).
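A minimal sketch of the sparse variant, again assuming the AutoEncoder class above: an L1 penalty on the code is added to the reconstruction loss so that most code units are pushed toward zero. The penalty weight is an arbitrary illustrative value.

```python
import torch
import torch.nn as nn

# AutoEncoder is the hypothetical class from the earlier sketch.
sparse_ae = AutoEncoder(n_input=784, n_hidden=256)
optimizer = torch.optim.Adam(sparse_ae.parameters(), lr=1e-3)
l1_weight = 1e-3                        # strength of the sparsity constraint

x = torch.rand(64, 784)
for step in range(100):
    code, recon = sparse_ae(x)
    recon_loss = nn.functional.mse_loss(recon, x)
    sparsity = code.abs().mean()        # L1 penalty drives most code units toward 0
    loss = recon_loss + l1_weight * sparsity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```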
Denoising AutoEncoder:
The denoising autoencoder (DA) builds on the basic autoencoder by adding noise to the training data, so the autoencoder must learn to remove this noise and recover the true, uncorrupted input. This forces the encoder to learn a more robust expression of the input signal, which is why it generalizes better than an ordinary autoencoder. A DA can be trained with gradient descent.
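A sketch of the denoising variant under the same assumptions; Gaussian noise is used here purely for illustration (masking noise is another common choice), and the reconstruction is always compared against the clean input.

```python
import torch
import torch.nn as nn

# AutoEncoder is the hypothetical class from the earlier sketch.
denoise_ae = AutoEncoder(n_input=784, n_hidden=256)
optimizer = torch.optim.Adam(denoise_ae.parameters(), lr=1e-3)

x_clean = torch.rand(64, 784)
for step in range(100):
    x_noisy = x_clean + 0.3 * torch.randn_like(x_clean)   # corrupt the input
    code, recon = denoise_ae(x_noisy)
    loss = nn.functional.mse_loss(recon, x_clean)          # target is the clean input
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```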
9.2 Sparse Coding
If we relax the requirement that the output must equal the input, and use a basic concept from linear algebra, namely
O = a1*Φ1 + a2*Φ2 + ... + an*Φn,
where the Φi are basis vectors and the ai are coefficients, we obtain the following optimization problem:
Min |I − O|, where I denotes the input and O the output.
By solving this optimization problem we obtain the coefficients ai and the bases Φi, which together form another, approximate expression of the input.
Therefore they can be used to express the input I, and this expression is learned automatically. If we add an L1 regularization penalty to the formula above, we get:
Min |I − O| + u * (|a1| + |a2| + ... + |an|)
This method is called Sparse Coding. Informally, it expresses a signal as a linear combination of a set of basis vectors, requiring that only a few of them are needed to express the signal. "Sparsity" here means having only a few non-zero elements, or only a few elements far from zero. Requiring the coefficients ai to be sparse means that, for a given input vector, we want as few coefficients as possible to be significantly non-zero. There is a good reason to use sparse components to represent our input data: the vast majority of sensory data, such as natural images, can be expressed as the superposition of a small number of basic elements, which in an image can be surfaces or lines. The same idea also arises from the analogy with the primary visual cortex (the human brain has a large number of neurons, but for a given image or edge only a few are excited while the rest remain suppressed).
The sparse coding algorithm is an unsupervised learning method that searches for a set of "over-complete" basis vectors with which to represent the sample data more efficiently. Although Principal Component Analysis (PCA) lets us easily find a set of "complete" basis vectors, what we want here is an "over-complete" set to represent the input vectors (i.e. the number of basis vectors is larger than the dimension of the input vectors). The advantage of an over-complete basis is that it can capture structures and patterns hidden in the input data more effectively. However, with an over-complete basis, the coefficients ai are no longer uniquely determined by the input vector. Therefore, the sparse coding algorithm adds the additional criterion of "sparsity" to resolve the degeneracy caused by over-completeness. (For a detailed derivation, see the UFLDL tutorial on sparse coding.)
Take the extraction of low-level image features, such as an edge detector, as an example. The job here is to select small patches at random from natural images and use these patches to learn a set of "bases" that can describe them, for example 8*8 = 64 basis patches. Then, given a new test patch, we can express it as a linear combination of these bases according to the formula above, where a is the sparse coefficient vector. Of the 64 entries of a, only 3 are non-zero in this example, hence the name "sparse".
Here one may wonder: why take the low-level bases to be edge detectors, and what does the higher level look like? A simple explanation: edge detectors can describe the entire image because edges in different directions are enough to compose it, so edges in different directions are the bases of the image... and combinations of those bases form the bases of the next level up... (this is what we said in Part 4 above).
Sparse Coding is divided into two parts:
1) Training stage: given a series of sample images [x1, x2, ...], we learn a set of bases [Φ1, Φ2, ...], i.e. the dictionary.
Sparse coding can be seen as a variant of the k-means algorithm, and its training process is similar (the idea of the EM algorithm: if the objective function to be optimized contains two variables, such as L(W, B), we can first fix W and adjust B to minimize L, then fix B and adjust W to minimize L; iterating alternately like this pushes L down to a minimum. For more on the EM algorithm, see my blog post "From maximum likelihood to a brief introduction to the EM algorithm").
The training process is a repeated iteration. As mentioned above, we alternately adjust a and Φ so as to minimize an objective function of the form:
min over a and Φ of  Σ_i || x_i − Σ_k a_{i,k} * Φ_k ||² + u * Σ_i Σ_k | a_{i,k} |
Each iteration involves two steps:
a) Fix the dictionary Φ[k], then adjust a[k] so as to minimize the objective above (this amounts to solving a LASSO problem).
b) Then fix a[k] and adjust Φ[k] so as to minimize the objective above (this amounts to solving a convex QP problem).
Keep iterating until convergence. In this way we obtain a set of bases that represents the series of x well, i.e. the dictionary.
2) Coding stage: given a new image x and the dictionary obtained above, we obtain a sparse vector a by solving a LASSO problem. This sparse vector is the sparse expression of the input vector x.
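To make the two stages concrete, here is a small NumPy sketch of the alternating minimization described above (an illustration added to these notes, not the author's code): the codes a are updated with a few ISTA steps for the LASSO subproblem, and the dictionary Φ by least squares with its columns renormalized. All sizes, step counts, and the sparsity weight (u in the formulas above, lam below) are arbitrary choices.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_coding(X, n_bases=64, lam=0.1, n_iter=50):
    """X: (n_samples, n_features). Returns dictionary Phi (n_features, n_bases)
    and sparse codes A (n_samples, n_bases) such that X ~ A @ Phi.T."""
    n_samples, n_features = X.shape
    rng = np.random.default_rng(0)
    Phi = rng.standard_normal((n_features, n_bases))
    Phi /= np.linalg.norm(Phi, axis=0, keepdims=True)         # unit-norm bases
    A = np.zeros((n_samples, n_bases))

    for _ in range(n_iter):
        # a) fix Phi, update codes A with a few ISTA steps (LASSO subproblem)
        L = np.linalg.norm(Phi, 2) ** 2                        # Lipschitz constant
        for _ in range(20):
            grad = (A @ Phi.T - X) @ Phi
            A = soft_threshold(A - grad / L, lam / L)
        # b) fix A, update Phi by least squares, then renormalize its columns
        Phi = np.linalg.lstsq(A, X, rcond=None)[0].T
        Phi /= np.linalg.norm(Phi, axis=0, keepdims=True) + 1e-12
    return Phi, A

# Training stage on random "patches" (placeholder data, e.g. flattened 8x8 patches).
X = np.random.rand(200, 64)
Phi, A = sparse_coding(X)

# Coding stage: given the learned dictionary, solve the LASSO for a new image x.
x_new = np.random.rand(1, 64)
a_new = np.zeros((1, 64))
L = np.linalg.norm(Phi, 2) ** 2
for _ in range(100):
    a_new = soft_threshold(a_new - ((a_new @ Phi.T - x_new) @ Phi) / L, 0.1 / L)
```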
9.3 Restricted Boltzmann Machine (RBM)
Assume there is a bipartite graph with no links between the nodes within each layer. One layer is the visible layer, i.e. the input data layer (v), and the other is the hidden layer (h). If all nodes are binary random variables (taking only the values 0 or 1), and the full joint probability distribution p(v, h) satisfies the Boltzmann distribution, we call this model a Restricted Boltzmann Machine (RBM).
Let's see why it is a deep learning method. First, because the model is a bipartite graph, the hidden nodes are conditionally independent given the visible layer (since there are no connections between them), i.e. p(h|v) = p(h1|v)...p(hn|v). Similarly, given the hidden layer h, all visible nodes are conditionally independent. At the same time, since all v and h satisfy the Boltzmann distribution, when the input v is given we can obtain the hidden layer h through p(h|v), and once we have h we can obtain the visible layer through p(v|h). By adjusting the parameters so that the visible layer v1 obtained from the hidden layer matches the original visible layer v, the hidden layer becomes another representation of the visible layer, i.e. the hidden layer can be used as a feature of the visible input data. This is why it counts as a deep learning method.
How do we train it, that is, how do we determine the weights between the visible-layer nodes and the hidden nodes? This requires some mathematical analysis, i.e. the model itself.
The energy of a joint configuration (v, h) can be expressed as:
E(v, h; θ) = − Σ_i a_i*v_i − Σ_j b_j*h_j − Σ_{i,j} v_i*w_ij*h_j, where θ = {W, a, b} are the model parameters.
The joint probability distribution of a configuration is then determined by the Boltzmann distribution (and the energy of that configuration):
P(v, h) = e^{−E(v, h)} / Z, where Z = Σ_{v,h} e^{−E(v, h)} is the normalizing constant (partition function).
Because the hidden nodes are conditionally independent (there are no connections between them), we have:
P(h|v) = Π_j P(h_j | v)
Then, given the visible layer v, the probability that the j-th hidden node is 1 (or 0) is easy to obtain, since the distribution factorizes:
P(h_j = 1 | v) = σ(b_j + Σ_i w_ij*v_i), where σ(x) = 1 / (1 + e^{−x})
Similarly, given the hidden layer h, the probability that the i-th visible node is 1 is also easy to obtain:
P(v_i = 1 | h) = σ(a_i + Σ_j w_ij*h_j)
Given a sample set drawn independently from the same distribution, D = {v(1), v(2), ..., v(N)}, we need to learn the parameters θ = {W, a, b}.
We maximize the following log-likelihood function (maximum likelihood estimation: for a probability model, we choose the parameters that maximize the probability of the observed samples):
L(θ) = Σ_{n=1}^{N} log P(v(n))
That is, by maximizing this log-likelihood function we obtain the parameters W corresponding to the maximum of L.
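Because the partition function Z makes the exact gradient of this log-likelihood intractable, in practice it is approximated, for example with the contrastive divergence procedure mentioned in the DBN section below. The following NumPy sketch (an illustration, not the author's code) performs CD-1 updates for a binary RBM; the sizes, learning rate, and random data are placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    def __init__(self, n_visible, n_hidden, lr=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((n_visible, n_hidden))
        self.a = np.zeros(n_visible)     # visible biases
        self.b = np.zeros(n_hidden)      # hidden biases
        self.lr = lr

    def p_h_given_v(self, v):
        return sigmoid(self.b + v @ self.W)

    def p_v_given_h(self, h):
        return sigmoid(self.a + h @ self.W.T)

    def cd1_update(self, v0):
        """One step of contrastive divergence (CD-1) on a batch v0 of 0/1 vectors."""
        ph0 = self.p_h_given_v(v0)
        h0 = (self.rng.random(ph0.shape) < ph0).astype(float)   # sample hidden units
        pv1 = self.p_v_given_h(h0)                               # reconstruction
        ph1 = self.p_h_given_v(pv1)
        # The correlation difference between data and reconstruction drives the update.
        batch = v0.shape[0]
        self.W += self.lr * (v0.T @ ph0 - pv1.T @ ph1) / batch
        self.a += self.lr * (v0 - pv1).mean(axis=0)
        self.b += self.lr * (ph0 - ph1).mean(axis=0)

rbm = RBM(n_visible=784, n_hidden=256)
v_batch = (np.random.rand(64, 784) > 0.5).astype(float)   # placeholder binary data
for epoch in range(10):
    rbm.cd1_update(v_batch)
hidden_features = rbm.p_h_given_v(v_batch)   # the hidden layer as features
```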
If we increase the number of hidden layers, we obtain a Deep Boltzmann Machine (DBM). If instead we use a Bayesian belief network (a directed graphical model, still with no links between nodes within a layer) in the part close to the visible layer, and Restricted Boltzmann Machines in the part farthest from the visible layer, we obtain a Deep Belief Net (DBN).
9.4 Deep Belief Networks (DBN)
DBNs are probabilistic generative models. In contrast with neural networks of the traditional discriminative type, a generative model establishes a joint distribution between the observed data and the labels, evaluating both P(observation|label) and P(label|observation), whereas a discriminative model only evaluates the latter, P(label|observation). When the traditional BP algorithm is applied to deep neural networks, the following problems arise, which DBNs are designed to address:
(1) A labeled sample set needs to be provided for training;
(2) The learning process is slow;
(3) Improper parameter selection causes learning to converge to a poor local optimum.
DBNs are composed of multiple layers of Restricted Boltzmann Machines. These networks are "restricted" to a visible layer and a hidden layer, with connections between the layers but no connections between the units within a layer. The hidden units are trained to capture the correlations of the higher-order data exhibited at the visible layer.
For now, set aside the top two layers, which form an associative memory. The connections of a DBN are determined by generating weights from the top down; like building blocks, RBMs make it easier to learn the connection weights than traditional, deeply layered sigmoid belief networks.
At the start, an unsupervised greedy layer-by-layer method is used to pre-train and obtain the weights of the generative model. This unsupervised greedy layer-by-layer method was shown to be effective by Hinton and is called contrastive divergence.
In this training phase, a vector v is presented at the visible layer and its values are propagated to the hidden layer. In turn, the visible-layer inputs are stochastically reconstructed in an attempt to recover the original input signal. Finally, these new visible neuron activations are propagated forward again to reconstruct the hidden-layer activations and obtain h. (During training, the visible vector is first mapped to the hidden units; the visible units are then reconstructed from the hidden units; these new visible units are mapped to the hidden units again to obtain new hidden units. Repeating these steps is known as Gibbs sampling.) These back-and-forth steps are the familiar Gibbs sampling, and the difference in correlation between the hidden-layer activations and the visible-layer inputs serves as the main basis for the weight update.
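A minimal sketch of this greedy pre-training, reusing the hypothetical RBM class from the previous section: each RBM is trained with contrastive divergence on the hidden activations of the RBM below it.

```python
import numpy as np

# RBM is the hypothetical class from the sketch in section 9.3.

def pretrain_dbn(v_data, layer_sizes, epochs=10):
    """Greedy layer-by-layer pre-training: each RBM models the hidden
    activations of the RBM below it."""
    rbms = []
    data = v_data
    for n_vis, n_hid in zip(layer_sizes[:-1], layer_sizes[1:]):
        rbm = RBM(n_visible=n_vis, n_hidden=n_hid)
        for epoch in range(epochs):
            rbm.cd1_update(data)
        rbms.append(rbm)
        data = rbm.p_h_given_v(data)    # activations become the next layer's data
    return rbms

v_data = (np.random.rand(64, 784) > 0.5).astype(float)     # placeholder binary data
dbn = pretrain_dbn(v_data, [784, 512, 256])
```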
The training time is significantly reduced, because only a single step is needed to approximate maximum-likelihood learning. Each layer added to the network improves the log-probability of the training data, which we can understand as getting closer to the true representation in terms of energy. This meaningful extension, together with the use of unlabeled data, is a decisive factor for any deep learning application.
In the top two layers, the weights are connected together, so that the output of the lower layers provides a reference clue or association for the top layer, which then links it to its memory contents. What we care about most in the end is discriminative performance, for example in a classification task.
After pre-training, the DBN can use the labeled data and the BP algorithm to adjust its discriminative performance. Here a label set is attached to the top layer (extending the associative memory), and a classification boundary for the network is obtained through bottom-up, learned recognition weights. This performs better than a network trained purely with the BP algorithm, which can be explained intuitively: the BP algorithm in a DBN only needs to perform a local search of the weight-parameter space, so compared with a feed-forward neural network, training is faster and convergence takes less time.
The flexibility of DBNs makes them easy to extend. One extension is Convolutional Deep Belief Networks (CDBNs). DBNs do not take the two-dimensional structure of an image into account, because the input is simply the one-dimensional vectorization of the image matrix. CDBNs address this issue: they exploit the spatial relationships of neighbouring pixels and, through a model unit called convolutional RBMs, achieve translation invariance in the generative model, and they scale more easily to high-dimensional images. DBNs also do not explicitly handle the learning of temporal relationships between observed variables, although research in that direction exists, such as stacked temporal RBMs and, as a sequence-learning generalization, models dubbed temporal convolution machines. This application of sequence learning brings an exciting future research direction to the problem of speech signal processing.
Current research related to DBNs also includes stacked autoencoders, which replace the RBMs in a traditional DBN with autoencoders. This allows a deep multi-layer neural network architecture to be trained with the same rules, but without the strict requirements on layer parameterization. Unlike DBNs, autoencoders use a discriminative model, which makes it hard to sample from the input space and therefore harder for the network to capture its internal representation. However, denoising autoencoders can avoid this problem and perform better than traditional DBNs: adding random corruption during training and stacking the layers yields better generalization performance. Training a single denoising autoencoder proceeds in the same way as training an RBM as a generative model.