Autoencoders (Deep Learning UFLDL Tutorial Translation)

1. Autoencoders

So far, we have described the application of neural networks to supervised learning, in which we have labeled training examples. Now suppose we have only a set of unlabeled training examples {x^(1), x^(2), x^(3), ...}, where each x^(i) is n-dimensional. An autoencoder neural network is an unsupervised learning algorithm that applies backpropagation, setting the target values to be equal to the inputs, i.e. y^(i) = x^(i).

Here is an autoencoder:


The autoencoder tries to learn a function h_{W,b}(x) ≈ x. In other words, it tries to learn an approximation to the identity function, so that the output x̂ is similar to x. The identity function seems a particularly trivial function to try to learn, but by placing constraints on the network, such as limiting the number of hidden units, we can discover interesting structure in the data. Concretely, suppose the inputs x are the pixel intensity values of a 10x10 image, so n = 100, and there are s_2 = 50 hidden units in layer L2. Note that y is also 100-dimensional. Since there are only 50 hidden units, the network is forced to learn a "compressed" representation of the input: given only the vector of hidden unit activations a^(2), which is 50-dimensional, it must try to "reconstruct" the 100-pixel input. If the input were completely random, say each x_i drawn from a Gaussian independently of the other features, this compression task would be very difficult. But if there is structure in the data, for example if some of the input features are correlated, then this algorithm will be able to discover those correlations. In fact, this simple autoencoder often ends up learning a low-dimensional representation very similar to PCA (principal component analysis).
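
As a concrete illustration, here is a minimal numpy sketch of the forward pass of such a 100-50-100 autoencoder. The sigmoid activation, the random initialization, and all variable names are assumptions made only for this example, not the tutorial's reference code.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Assumed sizes for illustration: 100 input pixels, 50 hidden units.
    n, s2 = 100, 50
    rng = np.random.default_rng(0)

    # Small random weights and zero biases (a common, but not the only, choice).
    W1 = rng.normal(scale=0.01, size=(s2, n))   # input  -> hidden
    b1 = np.zeros(s2)
    W2 = rng.normal(scale=0.01, size=(n, s2))   # hidden -> output
    b2 = np.zeros(n)

    def forward(x):
        """Forward pass: the hidden activation a2 is the 50-dimensional 'code'."""
        a2 = sigmoid(W1 @ x + b1)        # compressed representation
        x_hat = sigmoid(W2 @ a2 + b2)    # reconstruction of the 100-pixel input
        return a2, x_hat

    x = rng.random(n)              # a stand-in for one flattened 10x10 image
    a2, x_hat = forward(x)
    print(a2.shape, x_hat.shape)   # (50,) (100,)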

The argument above relied on the number of hidden units s_2 being small. But even when the number of hidden units is large (perhaps even larger than the number of input pixels), we can still discover interesting structure by imposing other constraints on the network. In particular, if we impose a "sparsity" constraint on the hidden units, then we can still discover interesting structure in the data, even when the number of hidden units is large.

Informally, we will think of a neuron as "active" (or "firing") if its output value is close to 1, and as "inactive" if its output value is close to 0. We would like to constrain the neurons to be inactive most of the time. This discussion assumes a sigmoid activation function. If you are using a tanh activation function, then we think of a neuron as inactive when its output is close to -1.
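
For reference, a quick sketch of the two activation functions mentioned above; the "inactive" output is near 0 for the sigmoid and near -1 for tanh:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))    # outputs in (0, 1); "inactive" near 0

    def tanh(z):
        return np.tanh(z)                  # outputs in (-1, 1); "inactive" near -1

    print(sigmoid(-10.0), np.tanh(-10.0))  # ~0.00005 (inactive), ~-1.0 (inactive)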

Recall that a_j^(2) denotes the activation of hidden unit j in the autoencoder. However, this notation does not make explicit which input x led to that activation. So we will write a_j^(2)(x) to denote the activation of this hidden unit when the network is given the specific input x.

Further, let

    \hat{\rho}_j = \frac{1}{m} \sum_{i=1}^{m} \left[ a_j^{(2)}\left(x^{(i)}\right) \right]

be the average activation of hidden unit j (averaged over the m training examples). We would like to (approximately) enforce the constraint ρ̂_j = ρ, where ρ is a "sparsity parameter", typically a small value close to 0 (for example, ρ = 0.05). In other words, we would like the average activation of each hidden unit j to be close to 0.05 (say). To satisfy this constraint, most of the hidden unit's activations must be close to 0.
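
A minimal numpy sketch of this average, continuing the forward-pass sketch above (sigmoid, W1, b1, and rng are defined there); the placeholder training set X is an assumption made only for illustration:

    # Continues the earlier sketch (sigmoid, W1, b1, rng defined there).
    m = 1000
    X = rng.random((m, 100))     # placeholder training set: m flattened 10x10 images

    A2 = sigmoid(X @ W1.T + b1)  # (m, 50): hidden activations for every example
    rho_hat = A2.mean(axis=0)    # (50,): average activation rho_hat_j of each hidden unit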

To achieve this, we will add an extra penalty term to our optimization objective that penalizes ρ̂_j deviating significantly from ρ. Many choices of the penalty term will give reasonable results. We will choose the following:

    \sum_{j=1}^{s_2} \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j}

Here, s_2 is the number of neurons in the hidden layer, and the index j sums over the hidden units in our network. If you are familiar with the concept of KL divergence, this penalty term is based on it, and can also be written

    \sum_{j=1}^{s_2} \mathrm{KL}\left(\rho \,\middle\|\, \hat{\rho}_j\right)

where KL(ρ || ρ̂_j) = ρ log(ρ / ρ̂_j) + (1 - ρ) log((1 - ρ) / (1 - ρ̂_j)) is the Kullback-Leibler (KL) divergence between a Bernoulli random variable with mean ρ and a Bernoulli random variable with mean ρ̂_j. KL divergence is a standard measure of how different two distributions are. (If you have not seen KL divergence before, don't worry; everything you need to know about it is contained in this section.)

This penalty function has the property that KL(ρ || ρ̂_j) = 0 if ρ̂_j = ρ, and that it otherwise increases monotonically as ρ̂_j diverges from ρ. For example, in the figure below we have set ρ = 0.2 and plotted the value of KL(ρ || ρ̂_j) for a range of values of ρ̂_j:


We see that the KL divergence reaches its minimum of 0 at ρ̂_j = ρ, and blows up (it actually approaches infinity) as ρ̂_j approaches 0 or 1. Thus, minimizing this penalty term has the effect of causing ρ̂_j to be close to ρ.
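
A small numpy sketch of this KL penalty; the choice ρ = 0.2 mirrors the figure, and the probe values of rho_hat are arbitrary:

    import numpy as np

    def kl_penalty(rho, rho_hat):
        """Elementwise KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
        return (rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

    rho = 0.2
    rho_hat = np.array([0.01, 0.1, 0.2, 0.5, 0.9, 0.99])
    print(kl_penalty(rho, rho_hat))
    # Zero at rho_hat == rho, growing rapidly as rho_hat approaches 0 or 1.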

Our overall cost function is now

    J_{\text{sparse}}(W, b) = J(W, b) + \beta \sum_{j=1}^{s_2} \mathrm{KL}\left(\rho \,\middle\|\, \hat{\rho}_j\right)

where J(W, b) is as defined previously, and β controls the weight of the sparsity penalty term. The term ρ̂_j also depends (implicitly) on W and b, because it is the average activation of hidden unit j, and the activation of a hidden unit depends on the parameters W and b.
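
Putting the pieces together, a hedged sketch of this cost, continuing the earlier snippets (sigmoid, kl_penalty, W1, b1, W2, b2). It assumes a plain average squared-error reconstruction term for J(W, b) and omits the weight-decay term for brevity; the default values of rho and beta are arbitrary illustrative choices:

    def sparse_cost(X, rho=0.05, beta=3.0):
        """J_sparse(W, b) = J(W, b) + beta * sum_j KL(rho || rho_hat_j)."""
        A2 = sigmoid(X @ W1.T + b1)       # (m, s2) hidden activations
        X_hat = sigmoid(A2 @ W2.T + b2)   # (m, n) reconstructions
        m = X.shape[0]
        J = 0.5 * np.sum((X_hat - X) ** 2) / m   # average squared reconstruction error
        rho_hat = A2.mean(axis=0)                # average activation of each hidden unit
        return J + beta * np.sum(kl_penalty(rho, rho_hat))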

To incorporate the KL divergence term into your derivative calculation, there is a simple-to-implement trick involving only a small change to your code. Specifically, where previously for the second layer (l = 2) you would have computed during backpropagation

    \delta_i^{(2)} = \left( \sum_{j=1}^{s_3} W_{ji}^{(2)} \delta_j^{(3)} \right) f'\left(z_i^{(2)}\right)

now instead compute

    \delta_i^{(2)} = \left( \left( \sum_{j=1}^{s_3} W_{ji}^{(2)} \delta_j^{(3)} \right) + \beta \left( -\frac{\rho}{\hat{\rho}_i} + \frac{1 - \rho}{1 - \hat{\rho}_i} \right) \right) f'\left(z_i^{(2)}\right)

One subtle point is that you need to know ρ̂_i in order to compute this term. So you will need to compute a forward pass on all the training examples first, to compute the average activations on the training set, before computing backpropagation on any example. If your training set is small enough to fit comfortably in computer memory (as it is for the programming assignment), you can compute forward passes on all of your examples, keep the resulting activations in memory, and compute the ρ̂_i. You can then use your precomputed activations to perform backpropagation on all of your examples. If your data is too large to fit in memory, you may instead have to scan through your examples, computing a forward pass on each one to accumulate (sum up) the activations and compute ρ̂_i (discarding the result a_i^(2) of each forward pass once you have used it to update ρ̂_i). Then, after having computed ρ̂_i, you would do a second forward pass for each example so that you can do backpropagation on that example. In this latter case, you end up computing two forward passes for each example in the training set, which makes it computationally less efficient.
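
A hedged sketch of the modified layer-2 delta for a single example, continuing the earlier snippets; for the sigmoid, f'(z) = a * (1 - a), and rho_hat is assumed to have been computed from a first forward pass over the whole training set as described above:

    def backprop_deltas(x, rho_hat, rho=0.05, beta=3.0):
        """Error terms for one example, with the sparsity term added to layer 2.
        Continues the earlier sketch (sigmoid, W1, b1, W2, b2)."""
        a2 = sigmoid(W1 @ x + b1)
        x_hat = sigmoid(W2 @ a2 + b2)

        # Output-layer delta for a squared-error cost with sigmoid outputs.
        delta3 = (x_hat - x) * x_hat * (1 - x_hat)

        # Sparsity term added to the usual backpropagated error, then times f'(z2).
        sparsity = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat))
        delta2 = (W2.T @ delta3 + sparsity) * a2 * (1 - a2)
        return delta2, delta3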

The full derivation showing that the algorithm above results in gradient descent is beyond the scope of this section. But if you implement the autoencoder using backpropagation modified in this way, you will be performing gradient descent on the objective J_sparse(W, b). Using the derivative checking method, you can verify this for yourself as well.
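
Derivative checking compares your analytic gradient against a finite-difference estimate; a generic sketch follows (the step size 1e-4 is a conventional choice, not something prescribed by this section):

    import numpy as np

    def numerical_gradient(cost, theta, eps=1e-4):
        """Two-sided finite-difference estimate of d cost / d theta."""
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            step = np.zeros_like(theta)
            step[i] = eps
            grad[i] = (cost(theta + step) - cost(theta - step)) / (2 * eps)
        return grad

    # Usage: flatten (W, b) into a vector theta, define cost(theta) = J_sparse for
    # those parameters, and check that the difference from your analytic gradient
    # is tiny, e.g. np.max(np.abs(numerical_gradient(cost, theta) - analytic_grad)).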

2. Visualizing a Trained Autoencoder

Having trained a (sparse) autoencoder, we would like to visualize the function learned by the algorithm, to try to understand what it has learned. Consider the case of an autoencoder trained on 10x10 images, so that n = 100. Each hidden unit i computes a function of the input:

    a_i^{(2)} = f\left( \sum_{j=1}^{100} W_{ij}^{(1)} x_j + b_i^{(1)} \right)

We will visualize the function computed by hidden unit i, which depends on the parameters W_ij^(1) (ignoring the bias term for now), using a 2D image. In particular, we think of a_i^(2) as some nonlinear feature of the input x. We ask: what input image x would cause a_i^(2) to be maximally activated? (Less formally, what feature is hidden unit i looking for?) For this question to have a non-trivial answer, we must impose some constraint on x. If we suppose that the input is constrained in norm, ||x||^2 = ∑_i x_i^2 ≤ 1, then one can show (try doing it yourself) that the input which maximally activates hidden unit i is given by setting pixel x_j (for all 100 pixels, j = 1, ..., 100) to

    x_j = \frac{W_{ij}^{(1)}}{\sqrt{\sum_{j=1}^{100} \left(W_{ij}^{(1)}\right)^2}}

By displaying the image formed by these pixel intensity values, we can begin to understand what feature hidden unit i is looking for.
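
A sketch of this computation, continuing the earlier snippets: each row of W1 yields one hidden unit's maximally activating image. Reshaping to 10x10 patches is just one convenient way to display them; the actual plotting (for example with matplotlib's imshow) is left out:

    def max_activation_images(W1, patch_shape=(10, 10)):
        """One image per hidden unit: x_j = W_ij / sqrt(sum_j W_ij^2)."""
        norms = np.sqrt((W1 ** 2).sum(axis=1, keepdims=True))
        X_max = W1 / norms                        # each row maximizes one hidden unit
        return X_max.reshape(-1, *patch_shape)    # (number of hidden units, 10, 10)

    images = max_activation_images(W1)   # e.g. show images[i] as a grayscale patch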

If we have an autoencoder with 100 hidden units (say), then our visualization will contain 100 such images, one per hidden unit. By examining these 100 images, we can try to understand what the ensemble of hidden units is learning.

When we do this for a sparse autoencoder (trained with 100 hidden units on 10x10 pixel inputs), we get the following result (each 10x10 image represents one hidden unit, 100 hidden units in total):


Each square in the figure above shows the (norm-bounded) input image x that maximally activates one of the 100 hidden units. We see that the different hidden units have learned to detect edges at different positions and orientations in the image.

These features are, not surprisingly, useful for tasks such as object recognition and other vision tasks. When applied to other input domains (such as audio), this algorithm also learns useful representations/features for those domains.

The learned features above were obtained by training on "whitened" natural images. Whitening is a preprocessing step which removes redundancy in the input, so that adjacent pixels become less correlated.
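
Whitening is covered in its own section of the tutorial; purely as a rough illustration, one common variant (ZCA whitening) can be sketched as follows, where the small eps is an assumed regularization constant:

    import numpy as np

    def zca_whiten(X, eps=1e-5):
        """ZCA whitening: decorrelate the input features (one flattened image per row)."""
        X = X - X.mean(axis=0)           # zero-mean each pixel across the dataset
        cov = X.T @ X / X.shape[0]       # (n, n) covariance matrix
        U, S, _ = np.linalg.svd(cov)     # eigenvectors/values of the covariance
        W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T
        return X @ W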
