In recent years, unsupervised learning has become a focus of research, since most of the low-hanging fruit of supervised learning has already been picked. Models such as the VAE (variational auto-encoder) [1, 2] and the GAN (generative adversarial network) are attracting more and more attention.

I have also been studying VAE recently (from a deep learning perspective). First, as an engineer, I want to implement the VAE algorithm correctly and understand which practical problems it can help solve. As an artificial intelligence practitioner, I also want to understand the underlying principles to some extent.

As a study note, this article proceeds from simple to complex: it first presents a concrete implementation of the VAE algorithm, then gives an intuitive explanation of how VAE works, and finally reviews the mathematics behind it. The concepts of variational, auto-encoding, unsupervised, and generative model are introduced where they come up.

We will see that, as with many machine learning algorithms, the math behind VAE is fairly involved, but the engineering implementation is very simple.

The article "Conditional Variational Autoencoders" also introduces VAE starting from intuition, and its diagrams are very helpful for understanding.

1. Algorithm implementation

This article presents a relatively simple implementation of VAE, kept as consistent as possible with the experimental settings in section 3 of [1]. The complete code can be found in the repo.

1.1 Input:

Data set $X \subset \mathbb{R}^n$.

As an example, think of X as the MNIST dataset. We then have 60,000 grayscale images of the handwritten digits 0 to 9 (the training set), each of size 28x28. After normalizing every pixel to [0, 1], we have $X \subset [0,1]^{784}$.

Figure 1. MNIST demo (photo source)

1.2 Output:

A neural network with m-dimensional input and n-dimensional output, called the decoder [1] (or generative model [2]) (Figure 2).

Figure 2. Decoder

As long as the input and output dimensions meet the requirements, the decoder can have any structure: MLP, CNN, RNN, or something else. Since we have normalized the input data to the interval [0, 1], we also constrain the decoder output to this range. This can be done by adding a sigmoid activation to the last layer of the decoder:

$$f(x) = \frac{1}{1 + e^{-x}}$$

As an example, we take m = 100 and use the most common fully connected network (MLP) as the decoder. The definition based on the Keras functional API is as follows:

from keras.layers import Input, Dense

n, m = 784, 2      # n: data dimension, m: latent dimension
hidden_dim = 256
batch_size = m

## Decoder
z = Input(batch_shape=(batch_size, m))
h_decoded = Dense(hidden_dim, activation='tanh')(z)
x_hat = Dense(n, activation='sigmoid')(h_decoded)

1.3 Training

Figure 3. VAE structural framework

1.3.1 Encoder

To train the decoder, we need an auxiliary encoder network (also known as the recognition model) (Figure 3). The encoder's input is n-dimensional and its output is 2×m-dimensional. As with the decoder, the encoder can have any structure.

Figure 4. Encoder

1.3.2 Sampling

We treat the encoder's output (2×m numbers) as the means (z_mean) and the logarithms of the variances (z_log_var) of m Gaussian distributions, respectively.

Continuing the example above, the encoder is defined as follows:

## Encoder
x = Input(batch_shape=(batch_size, n))
h_encoded = Dense(hidden_dim, activation='tanh')(x)
z_mean = Dense(m)(h_encoded)      # mean
z_log_var = Dense(m)(h_encoded)   # logarithm of the variance

Then, using the mean and variance output by the encoder, we generate random numbers that follow the corresponding Gaussian distribution:

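from keras import backend as K
epsilon_std = 1.0  # assumed value, matching the "standard Gaussian" comment below
# Standard Gaussian noise (Keras 1 argument `std`; Keras 2 names it `stddev`)
epsilon = K.random_normal(shape=(batch_size, m), mean=0., std=epsilon_std)
z = z_mean + K.exp(z_log_var / 2) * epsilon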

z can then be fed to the decoder defined above to produce an n-dimensional output $\hat{x}$.

Figure 5. Sampling

The reparameterization trick is used here. Since $z \sim N(\mu, \sigma)$, we should sample from $N(\mu, \sigma)$, but this sampling operation is not differentiable with respect to μ and σ, so conventional gradient descent (GD) via error backpropagation cannot be applied. With reparameterization, we first sample ε from N(0, 1) and then compute $z = \sigma \cdot \epsilon + \mu$. This way $z \sim N(\mu, \sigma)$ still holds, and the step from the encoder output to z is linear in μ and σ (ε is just a constant as far as the neural network is concerned), so GD can be used for optimization as usual. The correctness of the method is shown in section 2.3 of [1] and section 3 of [2] (stochastic backpropagation).
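As a minimal sketch of this idea, the plain-Python snippet below (no Keras; the function name `reparameterize` and the numeric values are illustrative assumptions, not part of the original code) draws ε from N(0, 1) and forms z = μ + σ·ε. The samples come out with roughly the requested mean and standard deviation, and for a fixed ε the result is an ordinary differentiable function of μ, which is what lets gradients pass through the sampling step.

```python
import math
import random

def reparameterize(mu, log_var, eps):
    """Return z = mu + sigma * eps, where sigma = exp(log_var / 2)."""
    return mu + math.exp(log_var / 2.0) * eps

random.seed(0)
mu, log_var = 1.5, math.log(0.25)  # sigma = 0.5

# Sampling: the randomness comes only from eps ~ N(0, 1).
samples = [reparameterize(mu, log_var, random.gauss(0.0, 1.0))
           for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))

# Differentiability: with eps held fixed, dz/dmu is 1 (finite-difference check).
eps, h = 0.7, 1e-6
dz_dmu = (reparameterize(mu + h, log_var, eps)
          - reparameterize(mu, log_var, eps)) / h

print(sample_mean, sample_std, dz_dmu)
```

In a framework such as Keras the same expression is written with tensors, and automatic differentiation supplies the gradients that are checked numerically here.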

Figure 6. Reparameterization (Photo source)

The price of reparameterization is that the latent variable must be continuous [7].

1.3.3 Optimization objective

With the encoder and decoder combined, we can output an $\hat{x}$ of the same dimension for each $x \in X$. Our goal is to make $\hat{x}$ as close as possible to x itself. That is, after x is encoded and then decoded, the original information should be recovered as fully as possible.

Note: Strictly speaking, under the model's assumptions, what we optimize is not the distance between x and $\hat{x}$; rather, we maximize the likelihood of x. Different loss functions correspond to different probability distributions p(x|z). For the sake of intuition, the detailed discussion is deferred to later in this article (see also Appendix C of [1]).

Because $x \in [0,1]$, we use cross entropy to measure the difference between x and $\hat{x}$:

$$xent = \sum_{i=1}^{n} -\left[x_i \cdot \log(\hat{x}_i) + (1 - x_i) \cdot \log(1 - \hat{x}_i)\right]$$

The smaller xent is, the closer $\hat{x}$ is to x.

We can also use the mean square error to measure:

$$mse = \sum_{i=1}^{n} (x_i - \hat{x}_i)^2$$

The smaller the mse, the closer the two are.

During training, the target output is the input itself; this is the meaning of AE (auto-encoder) in VAE.

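In addition, we need to constrain the encoder outputs z_mean (μ) and z_log_var ($\log\sigma^2$). The KL divergence is used here (see below for the derivation):

$$KL = -0.5 \cdot (1 + \log\sigma^2 - \mu^2 - \sigma^2) = -0.5 \cdot (1 + \log\sigma^2 - \mu^2 - \exp(\log\sigma^2))$$

This expression, summed over the m latent dimensions, is the KL divergence between the encoder's Gaussian and the standard normal prior. It is non-negative and equals zero only when μ = 0 and σ² = 1. In the ELBO derived below it appears with a negative sign, which is why it enters the loss (to be minimized) with a positive sign.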

The overall optimization objective (to be minimized) is:

$$loss = xent + KL$$

or

$$loss = mse + KL$$

To sum up, with the objective function defined and every operation from input to output differentiable, we can train the network with SGD or one of its improved variants.
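To make the pieces of the objective concrete, here is a small plain-Python sketch of the loss for a single example. It is not the article's Keras model: `toy_decoder` is a hypothetical stand-in for a trained decoder, and the encoder outputs `z_mean` and `z_log_var` are simply given as numbers. The sketch samples z by reparameterization, decodes it, and returns the cross-entropy and KL terms whose sum is the loss.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def toy_decoder(z):
    # Hypothetical decoder: maps m = 2 latent values to n = 3 outputs in (0, 1).
    return [sigmoid(z[0]), sigmoid(z[1]), sigmoid(z[0] - z[1])]

def vae_loss_terms(x, z_mean, z_log_var, rng):
    # Reparameterized sample: z = mu + sigma * eps, with eps ~ N(0, 1).
    z = [mu + math.exp(lv / 2.0) * rng.gauss(0.0, 1.0)
         for mu, lv in zip(z_mean, z_log_var)]
    x_hat = toy_decoder(z)
    # Reconstruction term: cross entropy between x and x_hat.
    xent = -sum(xi * math.log(xh) + (1.0 - xi) * math.log(1.0 - xh)
                for xi, xh in zip(x, x_hat))
    # Regularization term: KL between N(mu, sigma^2) and N(0, 1), summed over dimensions.
    kl = -0.5 * sum(1.0 + lv - mu ** 2 - math.exp(lv)
                    for mu, lv in zip(z_mean, z_log_var))
    return xent, kl

rng = random.Random(0)
x = [1.0, 0.0, 1.0]
xent, kl = vae_loss_terms(x, z_mean=[0.5, -0.5], z_log_var=[-1.0, -1.0], rng=rng)
loss = xent + kl
print(xent, kl, loss)
```

The KL term depends only on the encoder outputs, while the cross-entropy term changes from call to call because a fresh ε is drawn each time.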

Since training uses only X (as both input and target output) and does not depend on any labels for X, this is unsupervised learning.

1.4 Summary

To summarize, as shown in Figure 2, VAE consists of two neural networks: the encoder (module 1) and the decoder (module 4). The two are connected into one large network through modules 2 and 3. Thanks to the reparameterization trick, we can train the network with regular SGD.

The best way to learn an algorithm is to read code. There are many VAE reference implementations on the web based on different frameworks, such as TensorFlow, Theano, Keras, and Torch.

2. Intuitive explanation

2.1 What is VAE useful for?

2.1.1 Data generation

Since we specify p(z) to be the standard normal distribution, combining it with the trained decoder (p(x|z)) lets us sample and generate new samples that are similar to, but different from, the training data.

Figure 7. Generate a new sample

Figure 8 (cross entropy) and Figure 9 (mean squared error) show images ($\hat{x}$) generated by sampling with the trained decoder.

Figure 8. Cross entropy loss

Figure 9. Mean square error loss

Strictly speaking, the code that generated the two figures above does not produce samples but E[x|z]. For both the Bernoulli distribution and the Gaussian distribution, this expectation is exactly the decoder output $\hat{x}$. See the discussion below.

2.1.2 High-dimensional data visualization

The encoder can map data x to a lower-dimensional z space, which can be displayed visually if it is 2-D or 3-D (Figures 10 and 11).

Figure 10. Cross entropy loss

Figure 11. Mean square error loss

2.1.3 Missing data filling (imputation)

In many practical problems, the dimensions of a sample point are correlated. Therefore, when some dimensions are missing or inaccurate, they can often be filled in from the related information. Figures 12 and 13 show a simple example of data imputation. The first row shows the original images, the second row shows the images with a few rows of pixels missing in the middle, and the third row shows the images recovered with the VAE model.

Figure 12. Cross entropy loss

Figure 13. Mean square error loss

2.1.4 Semi-supervised learning

Unlabeled data is easier to obtain than costly labeled data. Semi-supervised learning attempts to learn a better predictive model (classification or regression) using only a small amount of labeled data plus a large amount of unlabeled data.

VAE is unsupervised and can also learn good representations, so it can be used for semi-supervised learning [3, 12].

2.2 VAE principle

Lacking background in probabilistic graphical models and statistics, on first reading [1, 2] I could make little sense of the problem statement, related work, and motivation. So let us first set the formulas aside, return to the comfort zone, and understand how VAE works intuitively by analogy with familiar models.

2.2.1 Model structure

Judging from the model structure (and the name), VAE is very similar to the auto-encoder. In particular, VAE is very similar to CAE (contractive AE): both add a regularization constraint to the hidden-layer output. In addition, the sampling step in VAE's hidden layer plays a regularizing role similar to dropout. Therefore, VAE should train and behave much like CAE, and should not overfit easily.

2.2.2 Manifold learning

Although the data is high-dimensional, similar data may lie on a manifold in the high-dimensional space (for example, Figure 14). Feature learning amounts to learning this manifold, explicitly or implicitly.

Figure 14. Manifold Learning (picture source)

It is because of this manifold structure that we can recover the high-dimensional observed variables from the low-dimensional latent variables. As Figures 8 and 9 show, observed variables corresponding to nearby latent variables do look alike, and the transition between them is smooth.

3. Derivation

The background against which VAE was proposed involves maximum likelihood estimation (and maximum a posteriori estimation), the expectation-maximization (EM) algorithm, variational inference (VI), the KL divergence, MCMC, and other topics from probability. The mathematical derivation of the VAE algorithm itself, however, is not complicated; if you are familiar with this material, you can jump directly to 3.6.

3.1 Problem statement

The observed variable x follows a fixed but unknown distribution. The relationship between x and the latent variable z can be described by Figure 15, a simple probabilistic graphical model. (Note that both x and z are vectors.)

Figure 15. A two-layer directed probabilistic graph; x is the observed variable and z is the latent variable.

For this graph, p(z) (the prior of the latent variable z), p(x|z) (the conditional probability of x given z), and p(z|x) (the posterior of the latent variable) can fully describe the relationship between x and z, because the joint distribution of the two can be expressed as:

$$p(z, x) = p(x|z)\,p(z)$$

The marginal distribution of x can be computed as follows:

$$p(x) = \int_z p(x, z)\,dz = \int_z p(x|z) \cdot p(z)\,dz = E_z[p(x|z)]$$

We can only observe x; z is a latent variable and cannot be observed. Our task is to estimate the parameters of the probabilistic graph from an observation set X.

If a machine learning model models p(z) and p(x|z) (explicitly or implicitly), we call it a generative model. This has two implications:

1. Together the two determine the joint distribution p(x, z);

2. Together the two can be used to sample x (ancestral sampling). Concretely, first draw a sample point $z_i \sim p(z)$, and then draw $x_i \sim p(x|z_i)$.
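Ancestral sampling can be sketched in a few lines of plain Python. The model here is a made-up toy, not the VAE decoder: z is a scalar with p(z) = N(0, 1), and p(x|z) is assumed to be N(2z + 1, 0.5²). The names `sample_prior` and `sample_conditional` are illustrative.

```python
import random

def sample_prior(rng):
    # z_i ~ p(z) = N(0, 1)
    return rng.gauss(0.0, 1.0)

def sample_conditional(z, rng):
    # x_i ~ p(x | z_i) = N(2 * z_i + 1, 0.5^2)  (toy assumption)
    return rng.gauss(2.0 * z + 1.0, 0.5)

rng = random.Random(0)
xs = []
for _ in range(20000):
    z_i = sample_prior(rng)               # step 1: sample the latent variable
    xs.append(sample_conditional(z_i, rng))  # step 2: sample x given that z

# The draws follow the marginal p(x); for this toy model it is N(1, 2^2 + 0.5^2).
x_mean = sum(xs) / len(xs)
x_var = sum((x - x_mean) ** 2 for x in xs) / len(xs)
print(x_mean, x_var)
```

Note that the marginal p(x) is never evaluated; only p(z) and p(x|z) are needed, which is exactly the property that makes a model generative in the sense above.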

The simplest generative model is probably the naive Bayes model.

3.2 Maximum likelihood estimation (MLE)

The most classical method for estimating a probability distribution is maximum likelihood estimation.

Given a set of observed values $X = (x_i)$, $i = 1, \ldots, n$, the likelihood of the observed data is:

$$L(p_\theta(X)) = \prod_{i}^{n} p_\theta(x_i)$$

We usually take the logarithm of the likelihood:

$$\log L(p_\theta(X)) = \sum_{i}^{n} \log p_\theta(x_i)$$

MLE assumes that the parameter $\theta^*$ that maximizes the likelihood is the optimal parameter estimate. The parameter estimation problem is thus turned into the problem of maximizing $\log L(p_\theta(X))$.

From the Bayesian point of view, θ itself is also a random variable, following a distribution p(θ):

$$p(\theta|X) = \frac{p(\theta) \cdot p(X|\theta)}{p(X)} = \frac{p(\theta) \cdot p(X|\theta)}{\int_\theta p(X, \theta)\,d\theta} \propto p(\theta) \cdot p(X|\theta)$$

$$\log p(\theta|X) = \log p(\theta) + \log L(p(X|\theta)) + \text{const}$$

where the constant, $-\log p(X)$, does not depend on θ. Maximizing this quantity is maximum a posteriori (MAP) estimation.

3.3 Expectation-maximization algorithm (EM)

For our problem, using the MLE criterion, the goal would be to maximize:

$$\log p(X, z)$$

Since z is not observable, we can only try to optimize:

$$\log p(X) = \log \int_z p(X, z)\,dz$$

With MLE or MAP we now have an objective (the log-likelihood), but in our case the likelihood contains an integral over the latent variable z. Under suitable assumptions (choices) for the distributions p(z) and p(x|z), the problem can be solved with the expectation-maximization (EM) algorithm.

Randomly initialize $\theta^{old}$.

E-step: compute $p_{\theta^{old}}(z|x)$.

M-step: compute $\theta^{new}$, given by:

$$\theta^{new} = \arg\max_\theta Q(\theta, \theta^{old})$$

where

$$Q(\theta, \theta^{old}) = \int_z p_{\theta^{old}}(z|x) \log(p_\theta(x, z))\,dz$$

The most direct applications of EM are parameter estimation for the Gaussian mixture model (GMM) and K-means clustering. On the more complex side, the GMM-HMM model at the core of speech recognition is also trained with the EM algorithm [5].
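The E-step and M-step can be sketched for the GMM case mentioned above. The snippet below is a minimal illustration for a one-dimensional mixture of two Gaussians; the function name `em_gmm_1d`, the synthetic data, and the initial values are assumptions chosen for the example. For a GMM the latent z is the discrete component index, so the posterior in the E-step and the maximizer of Q in the M-step are both available in closed form.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, steps=50):
    k = len(means)
    for _ in range(steps):
        # E-step: responsibilities p_{theta_old}(z = j | x_i) for every data point.
        resp = []
        for x in data:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: closed-form argmax of Q(theta, theta_old) for a Gaussian mixture.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            weights[j] = nj / len(data)
    return means, variances, weights

random.seed(0)
data = ([random.gauss(-2.0, 0.5) for _ in range(300)]
        + [random.gauss(3.0, 1.0) for _ in range(300)])
means, variances, weights = em_gmm_1d(data, [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5])
print(means, variances, weights)
```

With well-separated clusters like these, the estimated means should end up close to the values used to generate the data; in harder cases EM only guarantees convergence to a local optimum.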

Here we state the EM algorithm directly and omit the most important part, the proof. EM is the foundation of variational inference, however, so if you are not familiar with it, it is recommended to first read Chapter 9 of [4] or [9].

3.4 MCMC

The EM algorithm involves an integral over p(z|x) (i.e., the posterior distribution of the latent variable). Although the examples above can be solved conveniently with EM, in general this integral is intractable because of the diversity of probability distributions and the high dimension of the variables.

Therefore, the integral in the M-step can be approximated numerically:

$$Q(\theta, \theta^{old}) = \int_z p_{\theta^{old}}(z|x) \log(p_\theta(x, z))\,dz \approx \frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x, z_i)$$

This involves sampling z according to p(z|x), which requires sampling techniques such as MCMC. Section 0.4.3 of "LDA Math Gossip" explains MCMC very clearly, so it is not repeated here. You can also refer to Chapter 11 of [4].

3.5 Variational inference (VI)

Because the MCMC algorithm is expensive (it requires a large number of samples for each data point), it may be hard to apply to large datasets. Therefore, other approximations are needed for the integral over p(z|x).

The idea of variational inference is to look for an easy-to-handle distribution q(z) that is as close as possible to the target distribution p(z|x), and then to use q(z) in place of p(z|x).

The closeness between distributions is measured by the Kullback-Leibler divergence (KL divergence), defined as follows:

$$KL(q\|p) = \int q(t) \log\frac{q(t)}{p(t)}\,dt = E_q(\log q - \log p) = E_q(\log q) - E_q[\log p]$$

Where it causes no ambiguity, we omit the subscript of E. Some important properties of KL are stated here without proof: $KL(q\|p) \ge 0$, and $KL(q\|p) = 0 \iff q = p$ [6].

Note: the KL divergence is not a distance metric; it satisfies neither symmetry nor the triangle inequality.

So the problem of finding q(z) turns into an optimization problem:

$$q^*(z) = \arg\min_{q(z) \in Q} KL(q(z)\|p(z|x))$$

$KL(q(z)\|p(z|x))$ is a function of q(z), and $q(z) \in Q$ is itself a function, so this is a functional (a function of functions). The calculus of variations is to extrema of functionals what differential calculus is to extrema of functions.

If this explanation of "variational" is hard to follow, you can simply treat "variational" as a term, like "Gaussian" in Gaussian distribution or "Fourier" in Fourier transform, without trying to understand it literally.

Also, do not confuse variation with variable, variance, and so on; they are unrelated.

ELBO (evidence lower bound objective)

According to the definition of KL and $p(z|x) = \frac{p(z, x)}{p(x)}$:

$$KL(q(z)\|p(z|x)) = E[\log q(z)] - E[\log p(z, x)] + \log p(x)$$

Let

$$ELBO(q) = E[\log p(z, x)] - E[\log q(z)]$$

By the non-negativity of KL, we have

$$\log p(x) = KL(q(z)\|p(z|x)) + ELBO(q) \ge ELBO(q)$$

The ELBO is a lower bound on the log-likelihood $\log p(x)$ (i.e., the evidence).

For a given dataset, p(x) is constant, so by

$$KL(q(z)\|p(z|x)) = \log p(x) - ELBO(q)$$

minimizing the KL divergence is equivalent to maximizing the ELBO.
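The decomposition of log p(x) into KL plus ELBO can be checked numerically on a tiny discrete model, where the integrals become finite sums. All numbers below are made up for illustration: z takes three values, `p_x_given_z` holds the likelihood of one fixed observation x, and `q` is an arbitrary candidate for the approximate posterior.

```python
import math

p_z = [0.5, 0.3, 0.2]           # prior p(z)
p_x_given_z = [0.9, 0.2, 0.4]   # p(x | z) for one fixed observation x
q = [0.6, 0.1, 0.3]             # arbitrary approximate posterior q(z)

joint = [pz * px for pz, px in zip(p_z, p_x_given_z)]  # p(z, x)
p_x = sum(joint)                                       # evidence p(x)
posterior = [j / p_x for j in joint]                   # exact posterior p(z | x)

# KL(q || p(z|x)) = E_q[log q(z) - log p(z|x)]
kl = sum(qz * math.log(qz / pz_x) for qz, pz_x in zip(q, posterior))
# ELBO(q) = E_q[log p(z, x)] - E_q[log q(z)]
elbo = sum(qz * (math.log(j) - math.log(qz)) for qz, j in zip(q, joint))

print(math.log(p_x), kl + elbo)
```

Replacing `q` with `posterior` drives the KL term to zero and makes the ELBO equal to log p(x), which is the sense in which maximizing the ELBO tightens the bound.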

This is as far as our brief introduction to variational inference goes. If you are interested, you can refer to [6], Chapter 10 of [4], and the more recent tutorial [10].

3.6 VAE

Here we discuss VAE mainly following the ideas of [1].

The log-likelihood of the observed data $x^{(i)}$ can be written as:

$$\log p_\theta(x^{(i)}) = KL(q_\phi(z|x^{(i)})\|p_\theta(z|x^{(i)})) + L(\theta, \phi; x^{(i)})$$

Here we write the ELBO as L to emphasize the parameters that need to be optimized.

We can indirectly optimize the likelihood by optimizing L.

In VI we optimize KL by optimizing L.

By the product rule of probability, after a simple transformation, L can be written as

$$L(\theta, \phi; x^{(i)}) = -KL(q_\phi(z|x^{(i)})\|p_\theta(z)) + E_{q_\phi(z|x)}[\log p_\theta(x^{(i)}|z)]$$

Therefore, our optimization objective decomposes into the two terms on the right-hand side.

3.6.1 First term

Let us first examine the first term, which is a KL divergence. q is the distribution we want to learn, and p is the prior distribution of the latent variable. With a sensible choice of distribution families, this term can be computed analytically.

If q is a Gaussian distribution with independent dimensions (that is, the output of the encoder in part 1) and p is the standard normal distribution, the KL divergence between the two can be computed in closed form:

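$$-KL(q_\phi(z|x^{(i)})\|p_\theta(z)) = 0.5 \cdot (1 + \log\sigma_i^2 - \mu_i^2 - \sigma_i^2) = 0.5 \cdot (1 + \log\sigma_i^2 - \mu_i^2 - \exp(\log\sigma_i^2))$$

with the right-hand side summed over the latent dimensions. Since the loss in part 1 of this article is the negative of L, its KL term is the negative of this expression, namely $-0.5 \cdot (1 + \log\sigma^2 - \mu^2 - \sigma^2)$.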
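As a sanity check on the closed form, the sketch below compares $-0.5 \cdot (1 + \log\sigma^2 - \mu^2 - \sigma^2)$ for a one-dimensional Gaussian against a Monte Carlo estimate of $E_q[\log q(z) - \log p(z)]$, which is the definition of the KL divergence. The values of `mu` and `log_var` and the sample count are arbitrary choices for illustration.

```python
import math
import random

def log_normal_pdf(z, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (z - mean) ** 2 / var)

mu, log_var = 0.8, math.log(0.5)
var = math.exp(log_var)

# Closed-form KL(N(mu, var) || N(0, 1)).
kl_closed = -0.5 * (1.0 + log_var - mu ** 2 - var)

# Monte Carlo estimate of E_q[log q(z) - log p(z)] with z ~ q = N(mu, var).
rng = random.Random(1)
zs = [mu + math.sqrt(var) * rng.gauss(0.0, 1.0) for _ in range(50000)]
kl_mc = sum(log_normal_pdf(z, mu, var) - log_normal_pdf(z, 0.0, 1.0)
            for z in zs) / len(zs)

print(kl_closed, kl_mc)
```

The two numbers should agree up to sampling noise, and the closed form is non-negative, as a KL divergence must be.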

See Appendix B of [1] for the proof.

3.6.2 Second term

Next we examine the second term on the right-hand side. $E_{q_\phi(z|x)}[\log p_\theta(x^{(i)}|z)]$ is the expected log-likelihood of $x^{(i)}$ under the approximate posterior $q_\phi(z|x)$.

Since VAE does not make strong assumptions about q(z|x) (in our case, a neural network), this expectation cannot be computed analytically. So we resort to sampling:

$$E_{q_\phi(z|x)}[\log p_\theta(x^{(i)}|z)] \approx \frac{1}{L} \sum_{j=1}^{L} \log p_\theta(x^{(i)}|z^{(j)})$$

Here $z^{(j)}$ is not sampled directly from the Gaussian distribution modeled by the encoder; instead it is obtained with the reparameterization method introduced in part 1, whose correctness is shown in section 2.3 of [1].

If only one sample point is drawn each time, then

$$E_{q_\phi(z|x)}[\log p_\theta(x^{(i)}|z)] \approx \log p_\theta(x^{(i)}|\tilde{z})$$

where $\tilde{z}$ is the sample point. Fortunately, this expression is a loss function commonly used in neural networks.

3.6.3 Loss function

Through the discussion above, VAE's optimization objective has taken a familiar and tractable form. Below, we derive the actual loss functions used in neural network training for specific choices of the distribution $p_\theta(x^{(i)}|\tilde{z})$ modeled by the decoder.

Part 1 introduced two loss functions, cross entropy and mean squared error. Below we briefly describe the probability distribution that each of them corresponds to. The distributions below assume that the dimensions of x are independent.

Cross entropy

Suppose $p(x_i|z)$, $i = 1, \ldots, n$, follows a Bernoulli distribution, that is:

$$p(x=1|z) = \alpha_z, \quad p(x=0|z) = 1 - \alpha_z$$

For an observed value x, the likelihood is:

$$L = \alpha_z^{x} \cdot (1 - \alpha_z)^{1-x}$$

The decoder output is the parameter of the Bernoulli distribution, namely $\alpha_z = decoder(z) = \hat{x}$. The log-likelihood is:

$$\log L = x \cdot \log(\hat{x}) + (1 - x) \log(1 - \hat{x})$$

$-\log L$ is exactly the cross entropy we use.

Mean square error

Suppose $p(x_i|z)$, $i = 1, \ldots, n$, follows a Gaussian distribution, i.e.

$$p(x|z) = \frac{1}{\sqrt{2\pi}\,\sigma} \cdot e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

The log-likelihood is:

$$\log L = -0.5 \cdot \log(2\pi) - 0.5 \cdot \log\sigma^2 - \frac{(x-\mu)^2}{2\sigma^2}$$

The decoder output is the expectation μ of the Gaussian distribution. We do not care about the variance here; that is, σ is treated as an unknown constant. The optimization target (dropping constant terms unrelated to the optimization) is:

$$\max -\frac{(x-\mu)^2}{2\sigma^2} = \min (x-\mu)^2$$

This is the mean squared error that we optimize.
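The correspondence between likelihoods and losses can be checked with a short plain-Python sketch. The function names and the candidate outputs are illustrative assumptions. The first part confirms that the log of the Bernoulli likelihood is the negative cross entropy; the second confirms that, for a fixed σ, the candidate with the highest Gaussian log-likelihood is also the one with the smallest squared error.

```python
import math

def bernoulli_lik(x, x_hat):
    # Product over dimensions of alpha^x * (1 - alpha)^(1 - x), with alpha = x_hat.
    return math.prod(xh ** xi * (1.0 - xh) ** (1.0 - xi) for xi, xh in zip(x, x_hat))

def cross_entropy(x, x_hat):
    return -sum(xi * math.log(xh) + (1.0 - xi) * math.log(1.0 - xh)
                for xi, xh in zip(x, x_hat))

def gaussian_log_lik(x, mu, sigma):
    return sum(-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
               - (xi - m) ** 2 / (2.0 * sigma ** 2) for xi, m in zip(x, mu))

def squared_error(x, mu):
    return sum((xi - m) ** 2 for xi, m in zip(x, mu))

# Bernoulli case: -log(likelihood) equals the cross entropy.
x_bin, x_hat = [1.0, 0.0, 1.0], [0.8, 0.3, 0.6]
neg_log_lik = -math.log(bernoulli_lik(x_bin, x_hat))

# Gaussian case: maximizing the likelihood picks the same candidate as minimizing the error.
x = [0.2, 0.9, 0.5]
candidates = [[0.1, 0.8, 0.6], [0.2, 0.9, 0.4], [0.7, 0.3, 0.5]]
best_by_likelihood = max(candidates, key=lambda mu: gaussian_log_lik(x, mu, sigma=0.3))
best_by_error = min(candidates, key=lambda mu: squared_error(x, mu))

print(neg_log_lik, cross_entropy(x_bin, x_hat), best_by_likelihood, best_by_error)
```

The value of σ changes the scale of the Gaussian log-likelihood but not which candidate wins, which is why it can be dropped from the objective.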

The relationship between different loss functions and probability distributions is discussed in detail in Chapter 5 of [4].

4. Conclusion

I have had little exposure to this field, my understanding is shallow, and I have read little of the literature, so what follows is mostly impressions and open questions:

VAE is a very elegant piece of work and a model example of theory guiding the design of a model's structure.

[1] and [2] proposed VAE independently. Although the final algorithms are roughly the same, the starting points and derivations differ noticeably, so the two should be read together and cross-referenced.

As a feature learning method, what are VAE's advantages and disadvantages compared with other unsupervised methods such as AE and RBM?

[2] discusses the relationship with the denoising AE, but in form VAE looks more similar to the contractive auto-encoder, and I do not know how to understand the relationship between the two.

Some work uses VAE for semi-supervised learning; from a cursory look, it does not show an advantage over other pre-training methods [3, 12].

Putting the above points together: although VAE is a good tool for producing new papers, from the perspective of deep learning alone it feels like just another new tool.

References

[1] Kingma et al. Auto-Encoding Variational Bayes.
[2] Rezende et al. Stochastic Backpropagation and Approximate Inference in Deep Generative Models.
[3] Kingma and Rezende et al. Semi-supervised Learning with Deep Generative Models.
[4] Bishop. Pattern Recognition and Machine Learning.
[5] Young et al. HTK Handbook.
[6] Blei et al. Variational Inference: A Review for Statisticians.
[7] Doersch. Tutorial on Variational Autoencoders.
[8] Kevin Frans. Variational Autoencoders Explained.
[9] Sridharan. Gaussian Mixture Models and the EM Algorithm.
[10] Blei et al. Variational Inference: Foundations and Modern Methods.
[11] Durr. Introduction to Variational Autoencoders.
[12] Xu et al. Variational Autoencoders for Semi-supervised Text Classification.

Further reading

Dilokthanakul et al. Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders.
Gaussian Mixture VAE: Lessons about Variational Inference, Generative Modeling, and Deep Nets.