Deep Learning Paper Notes (VII): Visualizing High-Level Features in Deep Networks


zouxy09@qq.com

http://blog.csdn.net/zouxy09

I often read papers, but the impressions fade soon afterward; when I pick a paper up again one day, it feels as if I had never read it. So I want to get into the habit of summarizing the useful knowledge points of the papers I read. On the one hand, my own understanding deepens in the process of writing them up; on the other hand, it makes it easier to look things up later. Better still, I can post them on the blog and discuss them with everyone. Since my background is limited, some of my understanding of the paper may be incorrect, so please do not hesitate to point things out. Thank you.

The paper discussed in these notes is:

Dumitru Erhan, Aaron Courville, Yoshua Bengio, and Pascal Vincent. Visualizing Higher-Layer Features of a Deep Network. Spotlight presentation and poster at the ICML 2009 Workshop on Learning Feature Hierarchies, Montréal, Canada.

Below is my understanding of some of the key points in:

"Visualizing Higher-layer Features of a deep Network"

Deep learning is very attractive, but also somewhat mystifying, on one point: it is said to extract features at increasing levels of abstraction. For us, though, that has always been hearsay, and seeing is believing. So every time we train a deep model, we especially want to visualize what that model has learned, to figure out whether what it learns is meaningful, and whether it is really as magical as the legend says. And how do we make the visualization meaningful? Right: what do we use a deep network for? To extract features. Then what features does it extract? If, from the bottom layers to the top, it extracts edges, then shapes, then objects and so on, as we claim, that proves our purpose has been achieved.

In addition to quantitative analysis of deep models, we also need qualitative methods to compare the features learned by different deep architectures. The goal of this paper is to find better qualitative interpretations of the high-level features extracted by deep models. The authors train stacked denoising autoencoders and DBNs on several vision datasets and compare several different ways of visualizing high-level features. Although these visualizations are at the level of individual units and may not match visual intuition perfectly, they are easy to implement, and the results obtained by the different methods are consistent. The hope is that these methods will give researchers a clearer understanding of how deep learning works and why it works. The paper describes three visualization approaches: activation maximization, sampling, and linear combination.

I. Overview

Some deep architectures (such as DBNs) are closely tied to a generative procedure, so we can use that procedure to glimpse what a single hidden-layer unit represents; one such sampling method is examined here. However, on the one hand it is sometimes difficult to obtain samples that cover the distribution of a Boltzmann machine or RBM well, and on the other hand, this kind of sampling-based visualization cannot be applied to other deep architectures, such as those based on autoencoders, or on semi-supervised models that preserve similarity at each layer.

A typical qualitative way to examine the features extracted by the first layer of a deep architecture is to look at the filters the model has learned. These filters are the rows of the weight matrix connecting the input to the first layer, so they are expressed in the input space. This is very convenient: when the input consists of images or waveforms, the filters can be visualized directly.

In general, when trained on a digit dataset these filters look like detectors for different digit strokes; when trained on natural images, they act like different edge detectors (wavelet-like filters).

The objective of this paper is a method for visualizing the feature computed or extracted by any unit in a deep architecture. To achieve this, the visualization should live in the input (image) space, it should be efficient to compute, and it should be generic, i.e. usable across different deep network models. Several methods are explored here, and then compared qualitatively on two datasets to study how they relate.

A very surprising finding in the experiments is that the response of a hidden-layer unit, as a function of the input image, is apparently unimodal: no matter where the optimization is randomly initialized, it reliably finds the same maximum. This is very good news for iterative optimization, and it means what each unit does can be seen at a glance.

II. The Models

Here we discuss two models, both common in deep architectures. The first is the DBN, obtained by greedily stacking several RBM layers. We train one RBM with the contrastive divergence (CD) algorithm, then fix its parameters and feed its hidden-layer outputs as the input for training the next RBM. Repeating this process yields an unsupervised deep model of the training distribution. One thing worth noting is that it is a generative model, so a well-trained model can easily be sampled from.
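As a concrete illustration of the inner step of this greedy procedure, here is a minimal sketch of a CD-1 update for a binary RBM. It uses common simplifying choices (mean probabilities in the negative phase); the function and variable names are mine, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, b_vis, b_hid, lr=0.1):
    """One contrastive-divergence (CD-1) step for a binary RBM.
    v0: batch of visible vectors (batch, n_vis); W: weights (n_vis, n_hid)."""
    # Positive phase: hidden probabilities and a binary sample given the data.
    ph0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step down to the visible layer and back up.
    pv1 = sigmoid(h0 @ W.T + b_vis)
    ph1 = sigmoid(pv1 @ W + b_hid)
    # Approximate log-likelihood gradient; update the parameters in place.
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    b_vis += lr * (v0 - pv1).mean(axis=0)
    b_hid += lr * (ph0 - ph1).mean(axis=0)
    return ph0  # these hidden activations become the input of the next RBM
```

The returned hidden activations are exactly what gets fixed and fed upward when the next RBM in the stack is trained.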

The second model we discuss is the denoising autoencoder, a stochastic variant of the classical autoencoder with stronger representational power. It cannot simply learn the identity function (h(x) = x, which achieves zero reconstruction error but is meaningless); it is forced to learn the essential structure of the input.

A key property is that training it amounts to increasing a lower bound on the log-likelihood of a generative model. It performs better than the classical autoencoder, and when stacked into a deep supervised architecture it is comparable to RBMs, sometimes even better. Another way to avoid learning the identity when there are more hidden units than input units is to add a sparsity constraint on the hidden code.

Here we summarize the training of stacked denoising autoencoders. For an input x, we apply stochastic corruption or noise, and then train the denoising autoencoder to reconstruct the original input x. The output of each autoencoder is a code vector h(x); as in a traditional neural network, h(x) = sigmoid(b + Wx). We write C(x) for a stochastic corruption of x; here C_i(x) = x_i or 0, in other words we randomly pick a fixed-size subset of the components of x and set them to 0. We can also apply salt-and-pepper noise: randomly pick a subset of a certain size and set each of its elements to a Bernoulli(0.5) draw.
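A minimal sketch of these two corruption schemes and the encoder, using NumPy; the function names and the corruption fraction are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_zeroing(x, frac=0.25):
    """C(x): set a random subset of the inputs to 0."""
    x_noisy = x.copy()
    x_noisy[rng.random(x.shape) < frac] = 0.0
    return x_noisy

def corrupt_salt_pepper(x, frac=0.25):
    """Set a random subset of the inputs to Bernoulli(0.5) draws."""
    x_noisy = x.copy()
    mask = rng.random(x.shape) < frac
    x_noisy[mask] = (rng.random(mask.sum()) < 0.5).astype(float)
    return x_noisy

def encode(x, W, b):
    """The code vector h(x) = sigmoid(b + W x) from the text."""
    return 1.0 / (1.0 + np.exp(-(b + W @ x)))
```

The autoencoder is then trained so that the decoder's reconstruction of encode(corrupt_zeroing(x), W, b) matches the uncorrupted x.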

For real images, the input pixel x_i at position i and its reconstruction x̂_i can both be regarded as Bernoulli probabilities: the probability that the pixel at that position is painted black. We compare the original pixel x_i with the distribution of its reconstruction x̂_i by cross-entropy, summed over all pixels. Note that the Bernoulli interpretation is only meaningful when the input and reconstruction values lie in [0, 1]. The other option is a Gaussian distribution, which corresponds to a mean-squared-error criterion.
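Written out, the cross-entropy criterion summed over all pixels is

$$ L(x, \hat{x}) = -\sum_i \left[ x_i \log \hat{x}_i + (1 - x_i) \log(1 - \hat{x}_i) \right], $$

where $\hat{x}$ is the reconstruction computed from the corrupted input $C(x)$.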

III. Activation Maximization

The first idea is simple: look for the input pattern that maximizes a given hidden unit's activation. For the first layer, each unit's activation is a linear function of the input, so the maximizing input pattern is proportional to the unit's filter itself.

Recall the great experiment of David Hubel and Torsten Wiesel, who won the Nobel Prize in medicine. They discovered a type of neuron they called the "orientation selective cell": when the eye sees the edge of an object in front of it, and that edge points in a particular direction, the neuron becomes active. In other words, a neuron tuned to a specific orientation is excited only by image edges in that particular orientation. Put simply, if a neuron extracts a certain feature, then an input image that matches that feature (i.e. is very similar to it) makes the neuron's output large, that is, excited. (There is even evidence for a "grandmother cell" in the brain, a cell that fires only for one particular target; say a cell in your brain remembers your girlfriend, then whenever your girlfriend appears in front of you, that cell fires and tells the brain: ah, this is my girlfriend.)

If you know about template convolution, you know that the more similar a convolution template is to a patch of the image, the larger the response. Conversely, if some input image causes a neuron to produce its largest activation, we have reason to believe the neuron is extracting a feature similar to that input. So we look for the x that maximizes the neuron's activation; visualizing that x gives a meaningful picture of the feature the neuron has learned.

In mathematical terms: once the network is trained, the parameters θ (including W) are fixed, and we look for the input x that maximizes the unit's activation, that is:

$$ x^* = \arg\max_{x\ \text{s.t.}\ \|x\| = \rho} h_{ij}(\theta, x), $$

where h_ij(θ, x) is the activation of unit i in layer j, and the norm constraint ‖x‖ = ρ keeps the problem bounded.

But this optimization problem is in general non-convex, with many local optima. The simplest approach is gradient ascent to a local maximum. Two scenarios can occur: either iterating from different random initializations always reaches the same maximum, or we obtain two or more distinct local maxima. In either case, the feature extracted by the unit can be described by the maximum or maxima found; if there are several, we can keep the one with the largest activation, average them all, or display all of them.
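A minimal sketch of this search as projected gradient ascent under the norm constraint; grad_h, the step size, and the first-layer sanity check are my own illustrative assumptions, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def maximize_activation(grad_h, x0, rho=1.0, lr=0.05, steps=500):
    """Gradient ascent on the input, re-projected onto the sphere ||x|| = rho.
    grad_h(x) must return the gradient of the unit's activation w.r.t. x."""
    x = x0.copy()
    for _ in range(steps):
        x = x + lr * grad_h(x)
        x = x * (rho / (np.linalg.norm(x) + 1e-12))  # enforce ||x|| = rho
    return x

# Sanity check on a first-layer unit h(x) = sigmoid(w @ x): the gradient is
# always proportional to w, so the optimum should be the filter w itself.
w = rng.normal(size=64)
x_star = maximize_activation(lambda x: w, rng.normal(size=64))
cos = x_star @ w / (np.linalg.norm(x_star) * np.linalg.norm(w))
print(f"alignment with the filter: {cos:.4f}")  # ~1.0, as claimed above
```

Running the same ascent from several random x0 and comparing the optima found is exactly the single-peak experiment described earlier.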

IV. Sampling from a Unit of a Deep Belief Network

We illustrate with a DBN of j layers, in which layers j-1 and j form an RBM. We can run block Gibbs sampling, alternately sampling from the conditionals p(h^{j-1} | h^j) and p(h^j | h^{j-1}) (here h^j denotes the vector of all binary units of layer j). In this Markov chain, the unit h_ij is clamped to 1. Then, in the DBN, we sample top-down from layer j-1 all the way to the input layer. This produces a distribution p_j(x | h_ij = 1); that is, we use the distribution p_j(x | h_ij = 1) to characterize h_ij. As in Section III, we can describe the hidden unit either by generating enough samples from this distribution, or by summarizing the information with the expectation E[x | h_ij = 1]. This method has only one parameter to choose: how many samples to draw in order to estimate the expectation.
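A minimal sketch of this sampler, assuming binary units, weight matrices stored bottom-up with shape (units below, units above), biases omitted for brevity, and a deterministic mean-field top-down pass; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def expectation_given_unit(weights, unit, n_gibbs=200, n_samples=100):
    """Estimate E[x | h_ij = 1] for unit i at the top layer j of a DBN.
    weights[-1] is the RBM between layers j-1 and j; biases are omitted."""
    W_top = weights[-1]
    samples = []
    for _ in range(n_samples):
        h_below = (rng.random(W_top.shape[0]) < 0.5).astype(float)
        for _ in range(n_gibbs):
            # Upward half of the block Gibbs step, with h_ij clamped to 1.
            p_top = sigmoid(h_below @ W_top)
            h_top = (rng.random(p_top.shape) < p_top).astype(float)
            h_top[unit] = 1.0
            # Downward half of the block Gibbs step.
            p_below = sigmoid(h_top @ W_top.T)
            h_below = (rng.random(p_below.shape) < p_below).astype(float)
        # Top-down pass from layer j-1 to the input space (mean activations).
        x = h_below
        for W in reversed(weights[:-1]):
            x = sigmoid(x @ W.T)
        samples.append(x)
    return np.mean(samples, axis=0)  # the estimate of E[x | h_ij = 1]
```

The single parameter the text mentions corresponds to n_samples here; returning the raw samples instead of their mean gives the "show enough samples" variant.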

There is a very subtle connection between the two methods of maximizing the activation and computing the expectation E[x | h_ij = 1]. By the definition of conditional expectation,

$$ E[x \mid h_{ij} = 1] = \int x \, p_j(x \mid h_{ij} = 1) \, dx. $$

Consider the extreme case in which this distribution concentrates entirely at a single point x⁺, so that p_j(x | h_ij = 1) ≈ δ_{x⁺}(x). Then its expectation is E[x | h_ij = 1] = x⁺.

In practice, we observed that although the samples, or their average, may look like training examples, the images produced by activation maximization look more like parts of images. The latter may therefore be more accurate about what a particular unit does.

V. Linear Combination of Previous Layers' Filters

Lee et al. (2008) presented a method in their paper to visualize the features of units in the second hidden layer. It rests on the assumption that a unit can be characterized by the first-layer filters it is most strongly connected to. The visualization of a second-layer unit is obtained as a linear weighted combination of the first-layer filters, where each filter's weight is the connection weight between that filter and the unit.

Using natural images, they trained a DBN with a sparsity constraint on the activations, and used this method to show that its second layer learns corner detectors. Lee et al. later extended the approach to visualize what the third layer learns: simply weight the second-layer filter visualizations by their connection weights to the third-layer unit, keeping only the largest weights.
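A minimal sketch of this weighting scheme, assuming the first-layer filters are stored as the rows of W1; the function name and the top-k selection are illustrative:

```python
import numpy as np

def visualize_second_layer_unit(W1, V, unit, top_k=5):
    """Approximate a second-layer unit in input space.
    W1: first-layer filters as rows, shape (n_hid1, n_input).
    V:  layer-1 -> layer-2 weights, shape (n_hid2, n_hid1)."""
    w = V[unit]
    strongest = np.argsort(np.abs(w))[-top_k:]  # indices of strongest links
    return w[strongest] @ W1[strongest]         # weighted sum of filters
```

The third-layer variant repeats the same step one level up, combining the second-layer visualizations produced by this function.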

This method is simple and effective. One drawback, however, is that there is no clear criterion for automatically choosing the right number of filters at each layer. Moreover, if we keep only the few filters with the strongest connections, we may well end up with a disappointing, meaningless blend of images, because this approach essentially ignores the previous-layer filters that were not selected. On the other hand, it also ignores the hidden layers' nonlinearities, which are an essential part of the model.

It is noteworthy, though, that there is a subtle link between the gradient updates of activation maximization and this linear weighted combination. For example, take a second-layer unit h_i² = v′ sigmoid(W x), where v is the unit's weight vector and W is the first-layer weight matrix. Then ∂h_i²/∂x = v′ diag(sigmoid(W x) ∗ (1 − sigmoid(W x))) W, where ∗ is element-wise multiplication, diag is the operator that builds a diagonal matrix from a vector, and 1 is a vector of ones. If the first-layer units are not saturated, then ∂h_i²/∂x roughly points in the direction of v′W, which can be approximated by keeping the entries of v with the largest absolute values. (I do not fully understand this part yet.)
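A quick numerical check of that claim, using small random weights so the first-layer units stay away from saturation (the sizes and scales are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = 0.1 * rng.normal(size=(8, 20))  # small weights: units stay unsaturated
v = rng.normal(size=8)              # second-layer unit h(x) = v' sigmoid(W x)
x = rng.normal(size=20)

s = sigmoid(W @ x)
grad = (v * s * (1 - s)) @ W        # v' diag(sigmoid(Wx) * (1 - sigmoid(Wx))) W
linear = v @ W                      # the linear-combination direction v'W

cos = grad @ linear / (np.linalg.norm(grad) * np.linalg.norm(linear))
print(f"cosine similarity: {cos:.3f}")  # close to 1 while units are unsaturated
```

As the weights grow and the units saturate, the diagonal factor varies more across units and the two directions drift apart, which is exactly where the linear-combination approximation breaks down.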

VI. Experiments

6.1. Data and Setup

We train the two models, a DBN and an SDAE, on the MNIST handwritten-digit database and on a natural-image database, and then use the three methods above to visualize the features of some units in some of the layers.

6.2. Activation Maximization

Figure: activation maximization on the MNIST handwriting database. Left: features of 36 units from the first (first column), second (second column), and third (third column) layers; the top row is from DBN training, the bottom row from SDAE training. Right: for one unit in the third layer of the DBN, iterating from 9 random initializations yields the same result.

In general, the activation function of a third-layer unit should be a highly non-convex function of its input. So either we were simply lucky, or networks trained on MNIST or natural images happen to be a very special case; in any event, we were pleasantly surprised to find that the activation functions of these units tend to be rather "unimodal".

6.3. Sampling from a Unit

Figure. Left: a DBN trained on the MNIST database, with sampled visualizations of 6 units in the second layer. Right: the same, trained on natural images. Each row shows samples drawn from one unit's distribution, with the mean of the row in the first column.

It is worth noting that sampling and activation maximization give different kinds of results: the samples (or their distribution) look more like the underlying training distribution (digits or patches). Activation maximization produces a set of features and then lets us judge which samples would match them; sampling produces many samples and then lets us judge which features they have in common. In that sense, the two approaches are complementary.

6.4. Comparison of Methods

Here we compare the three visualization methods: activation maximization, sampling, and the linear combination method.

Figure: the three methods side by side. Left: sampling; middle: linear combination of previous-layer filters; right: activation maximization. A DBN was trained on the MNIST database (top) and on natural images (bottom), and 36 units of the second layer are visualized.

Of these three methods, the linear combination method had been proposed previously; the other two are, as far as the authors know, new contributions of this paper.

Appendix: the original paper covers much more; for a deeper understanding, please refer to it. Also, I have not yet had time to implement some of this; I will post my implementations later to discuss with everyone. If you have implementations of your own, please share them too. Thank you.
