Variational Autoencoder


Autoencoder

The encoding dimension of the middle layer is much smaller than that of the input data. The training objective of the whole model is to minimize the error in reconstructing the input data.
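As a minimal sketch of this structure (a toy NumPy example; the 784/32 layer sizes, random untrained weights, and tanh activation are illustrative assumptions, not a trained model):

import numpy as np

rng = np.random.default_rng(0)
x = rng.random((16, 784))               # a toy batch of flattened 28x28 inputs

# The bottleneck code (32 dims) is much smaller than the input (784 dims).
W_enc = rng.normal(scale=0.01, size=(784, 32))
W_dec = rng.normal(scale=0.01, size=(32, 784))

code = np.tanh(x @ W_enc)               # encoder: compress the input
recon = code @ W_dec                    # decoder: reconstruct the input

# Training would adjust W_enc and W_dec to minimize this reconstruction error.
reconstruction_error = np.mean((recon - x) ** 2)
print(reconstruction_error)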

The problem faced by the standard autoencoder is:

The standard autoencoder maps input data to isolated points in the latent space rather than to a continuous region, so the decoder cannot decode the areas that lie between classes. The variational autoencoder was therefore proposed.

Variational Autoencoder

The latent space of the variational autoencoder is designed to be a continuous distribution that supports random sampling and interpolation. The encoder outputs two n-dimensional vectors, namely the mean vector μ and the standard deviation vector σ; a random value is then drawn for each dimension using μ and σ as the mean and standard deviation. After n such draws, the n-dimensional sampling result is produced as the code and sent to the subsequent decoder.
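As a small illustrative sketch in NumPy (the latent dimensionality and the values of μ and σ below are made up for the example):

import numpy as np

rng = np.random.default_rng(0)

n = 4                                   # latent dimensionality
mu = np.array([0.5, -1.0, 0.0, 2.0])    # mean vector output by the encoder
sigma = np.array([1.0, 0.5, 2.0, 0.1])  # standard deviation vector output by the encoder

# Sample each of the n dimensions from N(mu_i, sigma_i^2) to form the code z.
z = rng.normal(loc=mu, scale=sigma, size=n)
print(z)                                # this n-dimensional sample is fed to the decoder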

During model training, we hope the codes are as close to each other as possible while still keeping some distance, so that we can interpolate in the latent space and generate new samples. To obtain codes that meet this requirement, the Kullback-Leibler divergence (KL divergence) must be introduced into the loss function. KL divergence describes how much two probability distributions diverge. Minimizing the KL divergence here means optimizing the probability distribution parameters (μ, σ) to approach the target distribution as closely as possible.
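As a sketch of how the two requirements are typically combined into one objective (assuming the Gaussian code above and a standard normal target distribution; the exact form of the reconstruction term depends on the data):

loss(x) = reconstruction_error(x, x̂) + KL( N(μ(x), σ(x)²) ‖ N(0, I) )

The first term keeps each code informative enough to rebuild its input; the second keeps the distribution of codes close to the target so that nearby points in the latent space remain decodable.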

 

 

 

We have already introduced the GAN (generative adversarial network), a family of models that shows the power of generative and discriminative models from the perspective of an adversarial game. Next, let's look at another way of combining generative and discriminative models: the variational autoencoder, VAE for short.

The variational autoencoder involves some involved formulas. Before starting the formal derivation, let's look at a basic concept: the Kullback-Leibler divergence, or KL divergence.

What is KL divergence?

The meaning of KL divergence can be motivated from probability theory or from information theory; since it is a fairly standard concept, we will not dwell on it. In variational inference, we hope to find a relatively simple probability distribution q that approximates the posterior probability p(z | x) to be analyzed as closely as possible; here z is a latent variable and x is an observed variable. The "loss function" here is the KL divergence, which measures the distance between two probability distributions: the closer the two distributions, the smaller the KL divergence; the farther apart they are, the larger it is.

The KL divergence formula is as follows.

For a discrete probability distribution:

KL(P ‖ Q) = Σ_i P(i) · log( P(i) / Q(i) )

For a continuous probability distribution:

KL(P ‖ Q) = ∫ p(x) · log( p(x) / q(x) ) dx

We will not go into the other properties of KL divergence here.
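A small NumPy sketch of the discrete formula (the two example distributions are made up):

import numpy as np

def kl_divergence(p, q):
    # Discrete KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)); assumes q_i > 0 wherever p_i > 0
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

print(kl_divergence([0.4, 0.6], [0.5, 0.5]))   # close distributions: small KL
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))   # distant distributions: larger KL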

KL divergence in practice: 1-dimensional Gaussian distributions

Let's take a relatively simple example. Suppose we have two random variables X1 and X2, each following a Gaussian distribution: X1 ~ N(μ1, σ1²) and X2 ~ N(μ2, σ2²).

How can we calculate the KL divergence of the two distributions?

We know the Gaussian density:

p_i(x) = 1/(√(2π) σ_i) · exp( −(x − μ_i)² / (2σ_i²) ),  i = 1, 2

Then KL(P1 ‖ P2) equals:

KL(P1 ‖ P2) = ∫ p1(x) · log( p1(x) / p2(x) ) dx = log(σ2/σ1) + (σ1² + (μ1 − μ2)²) / (2σ2²) − 1/2

Pause here for a moment. Some readers ask how the last term on the right is simplified. Does the expression inside the integral look familiar? That's right, it is the ordinary variance, so combining the terms inside and outside the parentheses gives the final −1/2.

Okay. Continue.

To be honest, I have never liked writing out long stretches of derivation. First, there is little originality in it (it has all been worked out by our predecessors; I am merely nature's porter). Second, heavy derivation is easy to doze off over. In any case, the conclusion is this: let N2 be the standard normal distribution, that is, μ2 = 0 and σ2 = 1.
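As a quick sanity check of the closed form above (a minimal sketch; the parameter values are arbitrary and the Monte Carlo estimate is only approximate):

import numpy as np

def kl_gauss_1d(mu1, sigma1, mu2, sigma2):
    # Closed-form KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) )
    return np.log(sigma2 / sigma1) + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5

# Monte Carlo estimate of the same quantity, averaging log(p1/p2) over samples from p1
rng = np.random.default_rng(0)
mu1, s1, mu2, s2 = 0.3, 1.5, 0.0, 1.0
x = rng.normal(mu1, s1, size=1_000_000)
log_ratio = (np.log(s2 / s1)
             - (x - mu1)**2 / (2 * s1**2)
             + (x - mu2)**2 / (2 * s2**2))
print(kl_gauss_1d(mu1, s1, mu2, s2), log_ratio.mean())  # the two values should roughly agree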

So what should N1 look like to minimize the KL divergence? In this case the formula becomes:

KL( N1 ‖ N(0, 1) ) = −log σ1 + (σ1² + μ1²) / 2 − 1/2

Just by eyeballing it we can guess that when μ1 = 0 and σ1 = 1

the KL divergence is smallest. From the formula we can see that if the mean deviates from 0, the KL divergence will certainly increase. The variance is a little more subtle, because σ1 appears in two competing terms:

• When σ1 is greater than 1, the −log σ1 term decreases while the σ1²/2 term increases.

• When σ1 is less than 1, the −log σ1 term increases while the σ1²/2 term decreases.

Which term wins? We can plot the function and see:

import numpy as np
import matplotlib.pyplot as plt

# KL( N(0, sigma^2) || N(0, 1) ) as a function of sigma
x = np.linspace(0.5, 2, 100)
y = -np.log(x) + x * x / 2 - 0.5

plt.plot(x, y)
plt.show()

From the plot we can see that the quadratic term dominates, and the function is always non-negative, which matches the non-negativity of KL divergence mentioned earlier.

Now, after reading this simple example, let's look at a complicated example.

A more complex example: KL divergence of multi-dimensional Gaussian distribution

Last time we saw how to compute the KL divergence between 1-dimensional Gaussian distributions; now let's see what the KL divergence between multi-dimensional Gaussian distributions looks like. To be honest, this formula will play a very important role when we introduce the VAE later!

First, the density of the multi-dimensional (k-dimensional) Gaussian distribution:

N(x; μ, Σ) = (2π)^(−k/2) · |Σ|^(−1/2) · exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) )

Because the variable is multi-dimensional, most of the calculations involve vectors and matrices. We usually assume the dimensions are independent of one another, so the covariance matrix Σ is in fact a diagonal matrix.

Considering the length of this article, we will skip the tedious derivation and give the result directly:

KL( N(μ1, Σ1) ‖ N(μ2, Σ2) ) = (1/2) [ tr(Σ2⁻¹ Σ1) + (μ2 − μ1)ᵀ Σ2⁻¹ (μ2 − μ1) − k + log( det(Σ2) / det(Σ1) ) ]
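A NumPy sketch that transcribes this result directly (the example means and covariances are made up):

import numpy as np

def kl_mvn(mu1, Sigma1, mu2, Sigma2):
    # KL( N(mu1, Sigma1) || N(mu2, Sigma2) ) for k-dimensional Gaussians
    k = mu1.shape[0]
    Sigma2_inv = np.linalg.inv(Sigma2)
    diff = mu2 - mu1
    return 0.5 * (np.trace(Sigma2_inv @ Sigma1)
                  + diff @ Sigma2_inv @ diff
                  - k
                  + np.log(np.linalg.det(Sigma2) / np.linalg.det(Sigma1)))

mu1, Sigma1 = np.array([0.5, -0.5]), np.diag([1.5, 0.5])
mu2, Sigma2 = np.zeros(2), np.eye(2)
print(kl_mvn(mu1, Sigma1, mu2, Sigma2))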

In fact, we have not yet really explained the significance and use of KL divergence; we have only piled up formulas, which may feel dry and unmotivated. Don't worry: next, when we show the VAE in action, the role of KL divergence will become clear.

Readers who have made it this far are in luck: the figure below shows digits generated by the VAE decoder on the MNIST dataset.

Judging by this effect, it behaves a bit like a GAN, so let's dig further and reveal its true nature!

We have seen many excellent deep learning models achieve excellent accuracy and results. When people analyze carefully why these models work so well, the conclusion is that the nonlinear fitting capability of deep models is genuinely strong: no matter how complicated the problem, a deep model can often handle it. VAE also exploits this feature: we use a deep model to fit some complex functions to solve practical problems.

Let's remember this trick; we will use it later. The next topic is generative models. Most of the models we have seen so far are, in principle, discriminative models: we have an object x to be classified, with a category y, and we build a model f(x; W) to make the probability P(y | x) as large as possible, or equivalently to make f(x; W) as close as possible to y.

If we want to use a generative model to solve this problem, we need Bayes' formula to convert it:

p(z | x) = p(x | z) · p(z) / p(x)

To follow the notation used in most textbooks, y is renamed z here. Of course, the z here may be more complex than the "category" y above. In many generative models we call z the latent (hidden) variable and x the observed variable. In general, x is easy to observe, but the z behind it is not so easy to see. In many cases x is generated by z; for example, whether a day's weather is good or bad is determined by many factors we do not observe directly. So we naturally have a need: given these x, we want to know what z lies behind them, hence the formula above.

For some simple problems the formula above is still relatively easy to solve, for example in the naive Bayes model, but many models are not so easy, especially when the latent variable lives in a high-dimensional continuous space:

p(x) = ∫ p(x | z) · p(z) dz

This integral is not so easy to handle. As a result, various clever people began looking for ways to make the formula above more tractable.

At this point a question naturally arises: since a discriminative model can directly solve the left-hand side of the formula, why bother turning it into the pile of things on the right and making life harder for ourselves? Nobody wants to create trouble for themselves, but the point is that the right-hand side can not only solve this problem, it also offers a more advanced capability: generating x at will from the model.

Let's think about it. If we only have the p(z | x) on the left of the formula, how would we generate an x that matches a given z?

• Step 1: pick an x at random;

• Step 2: compute p(z | x); if the probability is satisfactory, stop; otherwise go back to Step 1.

As a result, generating x with a discriminative model becomes a game of pure luck, and nobody knows when Step 2 will ever succeed. A generative model is different: we can customize the output as needed. First fix z, then sample from p(x | z); the process of generating x is reliable and controllable.
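A minimal sketch of this controllable generation process (the decode function and the output noise scale here are hypothetical stand-ins for a real decoder):

import numpy as np

rng = np.random.default_rng(0)

def decode(z):
    # Hypothetical decoder: maps a latent z to the parameters (here, the mean) of p(x | z).
    return np.tanh(z.sum()) * np.ones(4)

def sample_x(latent_dim=2):
    z = rng.standard_normal(latent_dim)        # step 1: fix z by sampling the prior p(z)
    x_mean = decode(z)                         # step 2: compute the parameters of p(x | z)
    return rng.normal(loc=x_mean, scale=0.1)   # step 3: sample x from p(x | z)

print(sample_x())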

After all this talk, let's move on to the formula derivation.

Variational Inference

Although we have advocated many benefits of generative models, we are still helpless in the face of that pile of things on the right side of the equal sign. Fortunately, our predecessors came up with some subtle solutions. Since the exact quantity is hard to compute with probability theory alone, can we make a substitution? For example, can we use a variational function q(z) to replace p(z | x)? Don't worry, we will see the benefits this brings shortly.

We introduce variational inference only briefly here; a more detailed discussion is left for another time.

Since q(z) is used to replace p(z | x), we certainly hope the two are as close as possible, so we choose KL divergence to measure their similarity. Because both can be viewed as probability distributions over z, KL divergence is a very suitable metric.

So we have:

KL( q(z) ‖ p(z | x) ) = ∫ q(z) · log( q(z) / p(z | x) ) dz

Expanding p(z | x) with Bayes' formula, we get:

KL( q(z) ‖ p(z | x) ) = ∫ q(z) · [ log q(z) − log p(x, z) + log p(x) ] dz

Taking the term unrelated to z out of the integral sign (q(z) integrates to 1) gives:

KL( q(z) ‖ p(z | x) ) = log p(x) + ∫ q(z) · [ log q(z) − log p(x, z) ] dz

Rearranging (and using p(x, z) = p(x | z) p(z)), we finally obtain:

log p(x) − KL( q(z) ‖ p(z | x) ) = E_{q(z)}[ log p(x | z) ] − KL( q(z) ‖ p(z) )

Well, we have gone around in a circle and the formula still looks messy. But this formula gives us a glimmer of hope, thanks to a special property of KL divergence: it is always non-negative, so

KL( q(z) ‖ p(z | x) ) ≥ 0

Although p(x) is not easy to compute, we know that p(x) is a fixed value once x is given. So making KL( q(z) ‖ p(z | x) ) as small as possible is equivalent to making the right-hand side of the equation as large as possible. The first term on the right is the expected log-likelihood under q(z), and the second term is a negative KL divergence. So to find a good q(z), one that is as close as possible to p(z | x), we need to:

• maximize the expected log-likelihood in the first term on the right;

• minimize the KL divergence in the second term on the right.

In variational inference as it was done before the VAE, the derivation continues from here with further assumptions. For example, one makes a mean-field assumption (to be honest, I don't know a more intuitive translation of "mean field", so the English term is kept here): for a z composed of multiple latent variables, the components are assumed to be independent of one another, and the formula can be further simplified based on this property. Since our topic today is the VAE, we will not go into detail about that part. At this point, recall the sentence mentioned at the beginning of the article:

"Vae also utilizes this feature. We use a deep model to fit some complex functions"

So it is time to let this sentence do its work; how it does so is explained in the next part.

We have cleared away some basic conceptual obstacles in the previous two parts. Let's go straight to the main topic: the VAE!

Since this piece gathers the whole series into one article, some of the important formulas described above are repeated here for convenience.

First, the first formula of the series, the KL divergence between multi-dimensional Gaussian distributions:

KL( N(μ1, Σ1) ‖ N(μ2, Σ2) ) = (1/2) [ tr(Σ2⁻¹ Σ1) + (μ2 − μ1)ᵀ Σ2⁻¹ (μ2 − μ1) − k + log( det(Σ2) / det(Σ1) ) ]

I hope it rings a bell. If not, please go back and review it!

Then there is the result of the variational inference derivation above:

log p(x) − KL( q(z) ‖ p(z | x) ) = E_{q(z)}[ log p(x | z) ] − KL( q(z) ‖ p(z) )

The last sentence:

"Vae also utilizes this feature. We use a deep model to fit some complex functions"

Okay... writing this column can be exhausting. To keep each part from getting so long that readers lose the patience to finish it, this article simply spends a bit of space on review.

Well, now it is time to witness the miracle.

Variational autoencoder

At last we meet the main character. Let's look at the right-hand side of the variational inference formula. Before that, we need to make a small change to the formula, and then we will give the detailed modeling process.

Reparameterization trick

To solve the formula above more conveniently, we need a little trick here. As mentioned above, the variational function q(z | x) represents the distribution of z given an x. We can imagine that z follows some distribution. So, can we separate the random part of z out of its value?

Suppose we have a random variable A that follows the Gaussian distribution N(1, 1). We can define a random variable B = A − 1, which then follows the Gaussian distribution N(0, 1). In other words, we can represent the current random variable as a random variable with mean 0 and variance 1, plus 1. In this way, we split a random variable into two parts: one part is deterministic and the other part is random.

For the q(z | x) above, we can do the same thing. We split a z that follows this conditional distribution into two parts: one part is a (possibly complex) deterministic function of x, and the other is a separately defined random variable responsible for the randomness. For consistency of notation, we write the z that follows this conditional distribution as z = μ(x) + σ(x) ⊙ ε, where ε carries the randomness.

What is the benefit of doing this? Now the conditional value of z depends entirely on the random variable ε used to generate it. That is to say, the variational derivation above becomes the following formula:

log p(x) − KL( q(z | x) ‖ p(z | x) ) = E_ε [ log p( x | z = μ(x) + σ(x) ⊙ ε ) ] − KL( q(z | x) ‖ p(z) )

This is a small step of substitution but a big step for the solution! In fact, we are already close to the final answer; what remains is up to us: what distribution should we assume this random part follows?

Of course we can choose! Since we generally take the prior assumption on z to be a multi-dimensional independent Gaussian distribution, and since we want to reuse the KL divergence between two multi-dimensional Gaussian distributions derived in the previous section for convenient computation, we decide to let the random part after the substitution also follow a multi-dimensional independent Gaussian distribution.
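A minimal NumPy sketch of the substitution (assuming, as just decided, that the random part ε follows an independent multi-dimensional Gaussian N(0, I), and that the encoder outputs μ and log σ²):

import numpy as np

def reparameterize(mu, log_var, rng=np.random.default_rng()):
    # z = mu(x) + sigma(x) * eps with eps ~ N(0, I):
    # the deterministic part comes from the encoder, the randomness is isolated in eps.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

mu = np.array([0.5, -1.0])        # example encoder outputs (made up)
log_var = np.array([0.0, -2.0])
print(reparameterize(mu, log_var))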

Next, let's take a look at how to calculate the two parts of this formula.

The second item on the right, KL divergence part -- Encoder

First, let's look at the second term on the right of the formula. As mentioned earlier, we generally take the prior of z to be a multi-dimensional independent Gaussian distribution. Here we make an even stronger assumption: the mean of this Gaussian is 0 and its covariance is the identity matrix. Then the KL divergence formula above goes from:

KL( N(μ, Σ) ‖ N(μ2, Σ2) ) = (1/2) [ tr(Σ2⁻¹ Σ) + (μ2 − μ)ᵀ Σ2⁻¹ (μ2 − μ) − k + log( det(Σ2) / det(Σ) ) ]

to the instantly simplified (with μ2 = 0 and Σ2 = I):

KL( N(μ, Σ) ‖ N(0, I) ) = (1/2) [ tr(Σ) + μᵀμ − k − log det(Σ) ]

Suddenly the world feels quiet... The goal now is to have the encoder compute the mean and variance of z from x. For this part we can use a deep neural network. In actual training we use mini-batches, so a batch of x is fed in for model computation and optimization.

If we use a vector to represent the main diagonal of the covariance matrix above, things get even better:

KL( N(μ, diag(σ²)) ‖ N(0, I) ) = (1/2) · Σ_i ( σ_i² + μ_i² − 1 − log σ_i² )

At this point, the function to be fitted here is quite clear: its inputs and outputs are well defined, and the loss has been simplified to a fairly simple form. The concrete calculation follows.
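A NumPy sketch of this specific calculation (assuming the encoder outputs μ and log σ² for each sample; the batch values below are made up):

import numpy as np

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions for each sample
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

mu = np.array([[0.1, -0.3], [1.2, 0.0]])       # a toy batch of encoder means
log_var = np.array([[0.0, -1.0], [0.5, 0.2]])  # and the corresponding log-variances
print(kl_to_standard_normal(mu, log_var))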

The first item on the right, expected part -- Decoder

From the KL divergence term above we can see that if the KL divergence between the two distributions were 0, the distribution of our random part would be exactly the prior distribution of z.

The good news is that we can directly use the mean and variance obtained from the encoder in the previous step. What we need now is another deep function, the decoder, to take us from z back to x. As mentioned above, our goal is to maximize the expected likelihood. In practice, for a batch of x samples, the encoder first produces the distribution of z, we sample from it, and then we maximize the likelihood p(x | z).

There are many ways to maximize the likelihood. This part will be detailed in practice.
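One common choice, shown here as a sketch rather than as what the linked Caffe code necessarily does, is a Bernoulli decoder for MNIST pixels, where maximizing the likelihood amounts to minimizing a cross-entropy between the input and its reconstruction:

import numpy as np

def bernoulli_nll(x, x_recon, eps=1e-7):
    # Negative log-likelihood of p(x | z) under a Bernoulli decoder (pixels in [0, 1]).
    # Minimizing this maximizes the expected likelihood term of the objective.
    x_recon = np.clip(x_recon, eps, 1 - eps)
    return -np.sum(x * np.log(x_recon) + (1 - x) * np.log(1 - x_recon), axis=-1)

x = np.array([[0.0, 1.0, 1.0, 0.0]])           # a toy binarized input
x_recon = np.array([[0.1, 0.8, 0.9, 0.2]])     # the decoder's reconstruction
print(bernoulli_nll(x, x_recon))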

With this, the derivation of the VAE's core computation is complete. Having spent this much space on the model, how could we stop here? Next, let's look at the implementation code and see what models built on the classic VAE look like.

Finally, we have reached the implementation. The dry formula derivations and theory above have probably made many readers sleepy, so let's look at a good implementation of this model: GitHub - cdoersch/vae_tutorial (Caffe code to accompany my tutorial on Variational autoencoders). This implementation accompanies a tutorial article; interested readers can also take a look at that tutorial, which should provide further insight into the model.

The target dataset for this implementation is MNIST, the same one used for our earlier DCGAN. The tutorial actually presents three models in total. Next, let's start from the prototxt files and look at the classic VAE that we are most familiar with.

VAE

To be honest, the figure provided on GitHub may look a little daunting even to readers with some background in VAE models. If we hide some details of the model and keep only the core data flow and loss computation, the model looks like the following:

The black boxes in the figure indicate the flow of data, and the red boxes indicate where the losses are computed. Double red lines indicate data shared between two different parts. The upper part of the graph is the encoder, that is, the path from x to z; below it is the path from z back to x. The formulas were given in the earlier part; now that we have the network model, we can compare the two.

In addition, the internals of the encoder and decoder are omitted; in an actual network each of them can be a deep neural network model. The figure contains three main parts:

• the loss computation for q(z | x);

• the random generation of z;

• the loss computation for p(x | z).

The most complex part is the loss computation for q(z | x). In its actual computation Caffe mainly works with vectors, so the earlier formula needs to be rewritten as element-wise operations, producing one term per latent dimension:

(1/2) · ( σ_i² + μ_i² − 1 − log σ_i² )

After the element-wise vector computation, the last step is a reduction operation, that is, summing the elements; with that, the computation is complete.

Once these parts are understood, together with our earlier understanding of the VAE, we should have a clearer picture of the VAE model.

MNIST generation model visualization

The figure below was generated during an experiment. It looks a bit like all the digits laid out in one plane, with transitional regions between the digits. How is this image produced?

A simple method is to set the Z dimension to 2. The process of generating this image is as follows:

• Train the VAE model and obtain μ and σ.

• Sample noise on a regular grid in the two-dimensional space according to N(0, I), and combine the noise with μ and σ to obtain z.

• Use the decoder to convert each sampled z into an x and display the results.

 

After these steps, we get the final image. In fact, the GAN model mentioned earlier can generate similar images in a similar way.
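A sketch of this procedure (the decode function standing in for a trained decoder is hypothetical; the 15x15 grid and 28x28 image size are assumptions for MNIST):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def plot_latent_grid(decode, n=15, digit_size=28):
    # Regular grid in the 2-D latent space, spaced by Gaussian quantiles so it covers N(0, I)
    grid = norm.ppf(np.linspace(0.05, 0.95, n))
    canvas = np.zeros((n * digit_size, n * digit_size))
    for i, zy in enumerate(grid):
        for j, zx in enumerate(grid):
            digit = decode(np.array([zx, zy]))          # decoder: z -> 28x28 image
            canvas[i * digit_size:(i + 1) * digit_size,
                   j * digit_size:(j + 1) * digit_size] = digit
    plt.imshow(canvas, cmap="gray")
    plt.axis("off")
    plt.show()

# Example with a dummy decoder; a trained VAE decoder would be used in practice.
plot_latent_grid(lambda z: np.full((28, 28), 0.5 + 0.5 * np.tanh(z.sum())))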
