Deep Learning paper notes (7): Visualizing high-level features of a deep network
Zouxy09@qq.com
http://blog.csdn.net/zouxy09
I read papers from time to time, but I always feel that I slowly forget them afterwards, as if I had never read them at all. So I want to summarize the useful knowledge points of the papers I read: on the one hand my understanding becomes deeper, and on the other hand it makes later review easier. Posting them on my blog also lets me share them with everyone. Because my background is limited, some of my understanding of the papers may be wrong, so please do not hesitate to point out mistakes and discuss them. Thank you.
The paper discussed in this article:
Dumitru Erhan, Aaron Courville, Yoshua Bengio, and Pascal Vincent. Visualizing Higher-Layer Features of a Deep Network. Spotlight presentation and poster at the ICML 2009 Workshop on Learning Feature Hierarchies, Montréal, Canada.
The following is my understanding of some of the knowledge points in it:
Visualizing Higher-Layer Features of a Deep Network
Deep Learning is very attractive, and also somewhat mysterious: everyone says it can extract hierarchical, increasingly abstract features layer by layer. But for us, seeing is believing. So every time we train a deep model, we want to visualize what it has learned, to understand whether what it learned makes sense and whether it is really as remarkable as claimed. How can we obtain a meaningful visualization? Think about what a deep network is for: extracting features. What features does it extract? If we can show that, from bottom to top, it extracts features such as edges, then shapes, then objects, that proves our goal has been achieved.
In addition, we need a qualitative analysis method to compare the features learned by different deep architectures. The purpose of this paper is to find better qualitative interpretations of the high-level features extracted by deep models. The authors train stacked denoising auto-encoders and DBNs on several vision datasets and compare several high-level feature visualization methods. Although these visualizations operate at the level of individual units, which may run against intuition, they are easy to implement and the results obtained with the different methods are consistent. The hope is that these methods give researchers a clearer understanding of how deep learning works and why it works. The paper describes three visualization methods: activation maximization, sampling, and linear combination.
I. Overview
Some deep architectures (such as DBNs) are closely associated with a generative procedure, so we can use that generative process to see what an individual hidden-layer unit represents; here we study one such sampling method. However, on the one hand it is sometimes difficult to obtain samples that fully cover the Boltzmann or RBM distribution, and on the other hand this sampling-based visualization cannot be applied to other deep architectures, such as those based on auto-encoders or semi-supervised models that embed a similarity-preserving criterion at each layer.
A typical qualitative analysis of the features extracted at the first layer of a deep architecture is to look at the filters learned by the model, i.e., the rows of the weight matrix from the input layer to the first layer. These filters are expressed in input space, which is very convenient: when the inputs are images or waveforms, they can be visualized directly.
In general, when trained on handwritten digit data, these filters can be viewed as detectors for different strokes; when trained on natural images, they correspond to different edge detectors (Gabor-like or wavelet filters).
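As a concrete illustration, here is a minimal numpy/matplotlib sketch of this kind of first-layer filter display. The weight matrix W below is a random stand-in for a trained one, and the 28x28 input size is an assumption (it matches MNIST):

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in for a learned first-layer weight matrix:
# 100 hidden units, 28x28 = 784 inputs; each row is one filter in input space.
rng = np.random.default_rng(0)
W = rng.normal(size=(100, 784))

fig, axes = plt.subplots(10, 10, figsize=(8, 8))
for ax, w in zip(axes.ravel(), W):
    ax.imshow(w.reshape(28, 28), cmap="gray")  # reshape each filter to an image
    ax.axis("off")
plt.tight_layout()
plt.show()
```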
The goal of this paper is to study ways to visualize the features computed or extracted by any unit in any layer of a deep architecture. To that end, the visualization should live in input (image) space, be efficient to compute, and be general, i.e., applicable to different deep network models. Several methods are explored here and compared qualitatively on two datasets to study the connections between them.
A very surprising finding of the experiments is that the response of a hidden-layer unit to an input image, viewed as a function over input space, is unimodal: no matter where you randomly initialize, you can reliably find that maximum. This is very nice for iterative optimization, and it also means what each unit does can be seen at a glance.
II. Models
We discuss two models here, both common in deep architectures. The first is the DBN, obtained by greedily stacking multiple RBM layers: we first train an RBM with the CD (contrastive divergence) algorithm, then fix it and feed its hidden-layer outputs as inputs to another RBM, which is trained in turn. Repeating this process yields an unsupervised deep architecture that models the training distribution. Note that it is a generative model, so it is easy to draw samples from a trained model.
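To make the layer-wise training concrete, here is a minimal numpy sketch of one CD-1 update for a binary RBM. The function name, shapes, and learning rate are my own illustrative choices, not the paper's code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(W, b, c, v0, rng, lr=0.1):
    """One CD-1 update for a binary RBM.
    W: (n_hidden, n_visible) weights; b: visible bias; c: hidden bias;
    v0: a batch of binary training vectors, shape (n, n_visible)."""
    # Positive phase: hidden probabilities and samples given the data.
    ph0 = sigmoid(v0 @ W.T + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step back to the visible layer and up again.
    pv1 = sigmoid(h0 @ W + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W.T + c)
    # Contrastive-divergence approximation of the log-likelihood gradient.
    n = v0.shape[0]
    W += lr * (ph0.T @ v0 - ph1.T @ v1) / n
    b += lr * (v0 - v1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c
```

Once the first RBM converges, its hidden probabilities sigmoid(v @ W.T + c) become the "data" for the next RBM, which is how the stack is grown.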
The second model we discuss is the denoising auto-encoder, a stochastic variant of the traditional auto-encoder with stronger capabilities. It does not learn the identity function (i.e., h(x) = x, which achieves zero reconstruction error but is meaningless); it is forced to learn an essential representation of the input.
The key point is that its training keeps raising a lower bound on the log-likelihood of a generative model. It is more powerful than a traditional auto-encoder: stacked into a deep supervised architecture, it is equivalent to or even better than RBMs. Another way to avoid learning the identity function when there are more hidden units than inputs is to add a sparsity constraint on the hidden-layer code.
Here we summarize the training of stacked denoising auto-encoders. Given an input x, we apply random corruption (noise) to it, and train the denoising auto-encoder to reconstruct the original input x. The output of each auto-encoder is a code vector h(x); as in a traditional neural network, h(x) = sigmoid(b + W x). Let C(x) denote a random corruption of x. One choice is C_i(x) = x_i or 0; in other words, we randomly choose a fixed-size subset of the components of x and set them to 0. We can also add salt-and-pepper noise: randomly choose a subset of a given size and set each chosen element to Bernoulli(0.5).
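The two corruption processes C(x) just described are easy to write down. A small numpy sketch (the function names and the corruption fraction `frac` are my own choices):

```python
import numpy as np

def masking_noise(x, frac, rng):
    """Set a random subset (fraction `frac`) of the components of x to 0."""
    out = x.copy()
    mask = rng.random(x.shape) < frac
    out[mask] = 0.0
    return out

def salt_and_pepper(x, frac, rng):
    """Set a random subset of the components of x to Bernoulli(0.5), i.e. 0 or 1."""
    out = x.copy()
    mask = rng.random(x.shape) < frac
    out[mask] = (rng.random(x.shape) < 0.5)[mask].astype(x.dtype)
    return out
```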
For a binary image, the input x_i at a particular pixel i and its reconstruction x̂_i can both be regarded as Bernoulli probabilities for that pixel: the probability that the pixel is painted black. We compare the original input x_i at pixel position i with its reconstruction x̂_i using cross-entropy, summed over all pixels: L(x, x̂) = −Σ_i [x_i log x̂_i + (1 − x_i) log(1 − x̂_i)]. The Bernoulli interpretation is meaningful only when the input and reconstruction values lie in [0, 1]; otherwise a Gaussian distribution can be chosen instead, in which case the mean squared error criterion is used.
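The two reconstruction criteria can be sketched as follows (a minimal illustration; the cross-entropy form assumes x and x̂ lie in [0, 1]):

```python
import numpy as np

def cross_entropy(x, x_hat, eps=1e-12):
    """Bernoulli cross-entropy summed over pixels; assumes values in [0, 1]."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))

def squared_error(x, x_hat):
    """Gaussian alternative for unbounded inputs: squared reconstruction error."""
    return np.sum((x - x_hat) ** 2)
```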
III. Maximizing the Activation
The first idea is very simple: we look for the input pattern that maximizes the activation of a given hidden-layer unit. Because each first-layer unit's activation is a linear function of the input, for the first layer that optimal input pattern is proportional to the filter itself.
Let's recall the Nobel-Prize-winning experiments of David Hubel and Torsten Wiesel in medicine. They discovered a kind of neuron called an orientation-selective cell: when the eye sees an edge of an object that points in a certain direction, this neuron becomes active. That is to say, a "neuron for a specific direction" is excited only by image edges in that specific direction. In layman's terms: if a neuron extracts a certain feature, then an image that satisfies this feature (think of it as being similar to the feature) will make the neuron's output very large, i.e., excited. (Some accounts suggest there are "grandmother cells" in the higher layers of the human brain, each excited only by one specific target; for example, there might be a cell in your brain that remembers your girlfriend, and once your girlfriend appears in front of you, that cell gets excited and tells the brain: ah, this is my girlfriend!) If you know template convolution, you know that the more similar a convolution template is to a patch of the image, the larger the response. Conversely, if some input image maximizes the output activation of a neuron, we have reason to believe that this neuron is extracting a feature similar to that input. So by finding the x that maximizes the neuron's activation, we obtain a visual and meaningful representation of the feature learned by that neuron.
Mathematically, once network training is complete the parameters θ (including W) are fixed, so we look for the input x that maximizes the unit's activation, i.e., x* = argmax_{x : ||x|| = ρ} h_{ij}(θ, x), where h_{ij}(θ, x) denotes the activation of unit i in layer j and the norm constraint ||x|| = ρ keeps the optimization bounded.
However, this optimization problem is in general non-convex, i.e., there may be many local maxima. The simplest approach is to find a local maximum by gradient ascent. Two scenarios arise: either different random initializations all converge to the same maximum, or they yield two or more distinct local maxima. In either case, the feature extracted by the unit can be described by one or several of these maxima: if there are several, one can keep the one with the highest activation, average them, or display them all.
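Below is a minimal numpy sketch of activation maximization by projected gradient ascent for a second-layer unit h(x) = v' sigmoid(W x), with several random restarts as described above. The norm constraint ||x|| = ρ and all names are my framing of the idea, not the authors' code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def maximize_activation(W, v, rho=1.0, n_restarts=9, n_steps=500, lr=0.1, seed=0):
    """Gradient ascent on h(x) = v' sigmoid(W x) subject to ||x|| = rho."""
    rng = np.random.default_rng(seed)
    best_x, best_h = None, -np.inf
    for _ in range(n_restarts):                 # several random initializations
        x = rng.normal(size=W.shape[1])
        x *= rho / np.linalg.norm(x)
        for _ in range(n_steps):
            s = sigmoid(W @ x)
            grad = (v * s * (1.0 - s)) @ W      # dh/dx
            x += lr * grad
            x *= rho / np.linalg.norm(x)        # project back onto the sphere
        h = v @ sigmoid(W @ x)
        if h > best_h:
            best_x, best_h = x, h
    return best_x, best_h
```

For a first-layer unit the objective is linear in x, so under the norm constraint the optimum is simply x proportional to the filter w, consistent with the remark above.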
IV. Sampling from a Unit of a Deep Belief Network
Consider a DBN with j layers, in which layers j − 1 and j form an RBM. We can run block Gibbs sampling, alternating between samples from p(h_{j−1} | h_j) and p(h_j | h_{j−1}) (here h_j denotes the vector of binary units in layer j). Along this Markov chain we clamp one unit h_{ij} to 1 and leave the other units of layer j at 0. Then, in the DBN, we perform top-down sampling from layer j − 1 down to the input layer. This produces a distribution p_j(x | h_{ij} = 1); that is, we use the distribution p_j(x | h_{ij} = 1) to characterize h_{ij}. As in Section III, we can draw enough samples from this distribution to describe the hidden unit, or compute the expectation E[x | h_{ij} = 1]. The method has only one parameter to choose: the number of samples used to estimate the expectation.
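Here is a toy numpy sketch of the procedure for a 2-layer DBN: block Gibbs sampling in the top RBM with the chosen unit clamped to 1, followed by a top-down pass to the input layer. The weights/biases and the mean-field top-down pass are simplifying assumptions of mine:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_unit(W1, b0, W2, b1, b2, unit, n_samples=50, n_gibbs=100, seed=0):
    """Estimate E[x | h2_unit = 1] in a 2-layer DBN.
    W1: (n_h1, n_in) first RBM weights; W2: (n_h2, n_h1) top RBM weights;
    b0, b1, b2: input, h1, and h2 biases."""
    rng = np.random.default_rng(seed)
    n_h2, n_h1 = W2.shape
    xs = []
    for _ in range(n_samples):
        h1 = (rng.random(n_h1) < 0.5).astype(float)       # arbitrary start
        for _ in range(n_gibbs):                          # block Gibbs in top RBM
            h2 = (rng.random(n_h2) < sigmoid(h1 @ W2.T + b2)).astype(float)
            h2[unit] = 1.0                                # keep the unit clamped
            h1 = (rng.random(n_h1) < sigmoid(h2 @ W2 + b1)).astype(float)
        xs.append(sigmoid(h1 @ W1 + b0))                  # top-down pass to x
    return np.mean(xs, axis=0)
```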
There is a very subtle relationship between maximizing the activation and computing the expectation E[x | h_{ij} = 1]. By the definition of conditional expectation (written out below), consider the extreme case where the distribution concentrates all of its mass at a single point x⁺, so that p_j(x | h_{ij} = 1) ≈ δ_{x⁺}(x). Then the expectation is E[x | h_{ij} = 1] = x⁺, which is also the input that maximizes the unit's response.
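To spell out that step in formulas (my own restatement using the notation above):

```latex
\mathbb{E}[x \mid h_{ij} = 1] = \int x \, p_j(x \mid h_{ij} = 1) \, dx ,
\quad\text{and if}\quad
p_j(x \mid h_{ij} = 1) \approx \delta_{x^+}(x),
\quad\text{then}\quad
\mathbb{E}[x \mid h_{ij} = 1] \approx \int x \, \delta_{x^+}(x) \, dx = x^+ .
```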
In fact, we observe that although the samples, or their average, may look much like training examples, the image obtained by maximizing the activation looks more like parts of images. The latter may therefore represent more accurately what a particular unit does.
V. Linear Combination of Previous Layers' Filters
Lee et al. (2008) presented in their paper a method for visualizing the features of units in the second hidden layer. It rests on the assumption that a unit can be characterized by the previous-layer filters to which it is most strongly connected: the visualization of a unit is obtained as a linear weighted combination of the previous layer's filters, where the weight of each filter is its connection weight to the unit.
They trained a DBN with a sparse-activation constraint on natural images and used this method to show that the second layer had learned corner detectors. Lee et al. later extended the method to visualize the third layer: simply weight the second-layer filters by their connection weights to the third-layer unit, keeping only the connections with the largest weights.
This method is simple and effective. Its drawback, however, is that there is no clear rule for automatically choosing the appropriate number of filters at each layer; if we keep only the few filters most strongly connected to the unit, we may obtain a disappointing, meaningless, blurred mixture, because the method essentially ignores the other, unselected filters of the previous layer. Moreover, it ignores the non-linearities between layers, which are an essential part of the model.
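The linear-combination visualization itself is a one-liner; the sketch below also shows the variant that keeps only the top-k most strongly connected filters (names and shapes are my assumptions):

```python
import numpy as np

def linear_combination(W1, V, unit, top_k=None):
    """Visualize a second-layer unit as a weighted sum of first-layer filters.
    W1: (n_h1, n_in) first-layer filters; V: (n_h2, n_h1) second-layer weights."""
    weights = V[unit]
    if top_k is not None:
        # Keep only the filters most strongly connected to this unit,
        # as in the extension by Lee et al.; the rest are ignored.
        idx = np.argsort(np.abs(weights))[::-1][:top_k]
        return weights[idx] @ W1[idx]
    return weights @ W1   # full linear combination over all filters
```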
It is worth noting, however, that there is a subtle relationship between the gradient update for maximizing a unit's activation and this linear weighted combination. Consider unit i of the second layer, h_i^2 = v' sigmoid(W x), where v is the unit's weight vector and W is the first-layer weight matrix. Then ∂h_i^2/∂x = v' diag(sigmoid(W x) ⊙ (1 − sigmoid(W x))) W, where ⊙ is element-wise multiplication, diag is the operator that builds a diagonal matrix from a vector, and 1 is the all-ones vector. If none of the first-layer units is saturated, ∂h_i^2/∂x points roughly in the direction of v'W, which can in turn be approximated by keeping only the elements of v with the largest absolute values. (I still don't fully understand this point.)
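A quick numeric check of this claim (my own experiment, not from the paper): with small weights no unit is saturated, the sigmoid derivative is close to sigmoid'(0) = 1/4, and the gradient direction should nearly coincide with v'W:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(50, 100))  # small weights: far from saturation
v = rng.normal(size=50)
x = rng.normal(size=100)

s = sigmoid(W @ x)
grad = (v * s * (1.0 - s)) @ W   # exact gradient of h = v' sigmoid(W x)
approx = 0.25 * (v @ W)          # sigmoid'(0) = 1/4, so grad ~ 0.25 * v'W

cos = grad @ approx / (np.linalg.norm(grad) * np.linalg.norm(approx))
print(f"cosine similarity between gradient and v'W: {cos:.4f}")  # ~1.0
```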
VI. Experiments
6.1. Data and Setup
We train DBN and SDAE models on the MNIST handwritten digit database and on a natural image database. Then the three methods are used to visualize the features extracted by some units in some of the layers.
6.2. Activation Maximization
Figure: visualizations obtained by activation maximization on the MNIST handwritten digit database. Left: features extracted by 36 units on the first (first column), second (second column), and third (third column) layers; the top row is obtained from DBN training, the bottom row from SDAE training. Right: for a unit on the third layer of the DBN, the optimization is started from nine random initializations and reaches the same result each time.
In general, the activation of a third-layer unit should be a highly non-convex function of its input. We do not know whether we were simply lucky, or whether this is something special about training networks on MNIST or natural images, but we are pleasantly surprised to find that the activation functions of these units tend to be rather "unimodal".
6.3. Sampling from a Unit
Figure: Left: a DBN is trained on the MNIST database, and sampling is used to visualize 6 units on the second layer. Right: the same for a DBN trained on natural images. The samples in each row are drawn from the distribution of one unit, and the mean of each row is shown in the first column.
It is worth noting that the sampling method and the activation maximization method give different kinds of results. The samples (or their distribution) are more like the part of the training distribution (digits or patches) that the unit captures. Activation maximization produces a feature, and it is then up to us to judge which inputs would match it; sampling produces many samples, and it is then up to us to judge which features the samples share. At this level, the two methods are complementary.
6.4. Comparison of Methods
Here we compare the three visualization methods: activation maximization, sampling, and linear combination.
Figure: the three methods side by side. Left: sampling; middle: linear combination of previous-layer filters; right: maximizing the unit's activation. The DBN model is trained on the MNIST database (top) and on natural images (bottom), and 36 units on the second layer are visualized.
Among the three methods, the linear combination method comes from previous work, while the other two are new contributions that combine existing ideas.
Appendix: the original paper contains more content; for details please refer to it. Also, I have not had time to implement these methods yet; I will try them later and discuss them with you. If you have already implemented them, I hope you will share your results. Thank you.