TLDR: We propose a new way of viewing variational autoencoders that allows us to explain many existing problems in VAEs, such as fuzzy generation and low usage of the latent code.
Variational Autoencoders (VAEs) (Kingma and Welling 2013) are an interesting family of generative models and have received much attention since their emergence. We recently submitted two papers to arXiv that discuss several interesting aspects of VAEs. In one of them (https://arxiv.org/abs/1702.08658), we propose a novel interpretation of VAEs that explains several phenomena, including fuzzy generation and low usage of the latent code.
We use the common notation for VAEs. Suppose $p_{data}(x)$ is the underlying true data distribution, $p(z)$ is the prior of the latent code, the generative model is $p_\theta(x|z)$, and the recognition (or inference) model is $q_\phi(z|x)$.
As in representation learning, we assume that the latent variables $z$ represent certain features in $x$. For example, if we consider face data $x$ and glasses features $z$, then for a face $x$ with glasses, $z$ will represent “the existence of glasses”, which is determined by the inference network $q_\phi(z|x)$. Our observation comes from the following intuitive question:
Suppose we already have $q_\phi(z|x)$, which is a “glasses feature detector”. What type of $p_\theta(x|z)$ should we choose to generate the set of “faces with glasses”?
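To answer this question, recall that $\mathbb{E}_{p_{data}(x)}\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ is the VAE ELBO without the KL divergence term. In the Bayesian framework, we are using $q_\phi(z|x)$ as a variational approximator to the true posterior. However, we can also consider the variational approximation the other way:
$$\mathbb{E}_{p_{data}(x)}\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] = -\mathbb{E}_{q_\phi(z)}\left[\mathrm{KL}(q_\phi(x|z)\,\|\,p_\theta(x|z))\right] + C_\phi$$
Namely, we are treating the generative network $p_\theta(x|z)$ as a variational approximation to the true posterior $q_\phi(x|z) \propto p_{data}(x)\,q_\phi(z|x)$, where the likelihood is defined by the recognition network $q_\phi(z|x)$. The term $C_\phi$ does not depend on $\theta$.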
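One way to see why the generator can be read as a variational approximation is to regroup the reconstruction term. The steps below are a sketch under assumed notation: $q_\phi(x,z) = p_{data}(x)\,q_\phi(z|x)$ is the joint induced by the data and the recognition network, $q_\phi(z)$ and $q_\phi(x|z)$ are its marginal and conditional, and $H_{q_\phi}(x|z)$ is the conditional entropy under that joint. This is not quoted from the paper.

```latex
\begin{aligned}
\mathbb{E}_{p_{data}(x)}\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right]
  &= \mathbb{E}_{q_\phi(x,z)}\left[\log p_\theta(x|z)\right] \\
  &= \mathbb{E}_{q_\phi(z)}\mathbb{E}_{q_\phi(x|z)}\left[\log p_\theta(x|z)\right] \\
  &= -\mathbb{E}_{q_\phi(z)}\left[\mathrm{KL}\left(q_\phi(x|z)\,\|\,p_\theta(x|z)\right)\right]
     - H_{q_\phi}(x|z)
\end{aligned}
```

Since the entropy term does not depend on $\theta$, maximizing the reconstruction term over $\theta$ for a fixed recognition network pushes $p_\theta(x|z)$ toward $q_\phi(x|z)$ for each $z$.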
Why is this interpretation necessary? If we consider the “glasses” example, the true posterior $q_\phi(x|z)$ is going to be the distribution of all the faces with glasses, which is a truly complex distribution and requires a very complex $p_\theta(x|z)$ to approximate.
The most common $p_\theta(x|z)$ we see are factored Gaussians, which clearly cannot approximate this complex distribution. In fact, this explains the fuzzy generation of VAEs. Given a latent code, the generative network tries to fit a subset of the data with a Gaussian distribution, where the best fit for the mean is the average of that subset. If multiple $x$ map to the same $z$ (which is common since $q_\phi(z|x)$ is also a Gaussian), the Gaussian will try to learn an average of these $x$, which leads to fuzzy generation.
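The averaging effect can be seen in a tiny sketch. It assumes two hypothetical 2x2 binary "images" (`img_a` and `img_b`, not taken from the paper) that an encoder maps to the same latent code, and fits a factored Gaussian to them by maximum likelihood. The helper name `fit_factored_gaussian` is illustrative only.

```python
import numpy as np

def fit_factored_gaussian(batch):
    """Maximum-likelihood mean and per-pixel variance of a factored Gaussian.

    batch has shape [num_images, num_pixels].
    """
    mean = batch.mean(axis=0)  # the MLE of the mean is the pixelwise average
    var = batch.var(axis=0)    # the MLE of the variance (divides by N)
    return mean, var

# Two hypothetical images assumed to share the same latent code z.
img_a = np.array([[1.0, 0.0], [1.0, 0.0]])
img_b = np.array([[0.0, 1.0], [0.0, 1.0]])

batch = np.stack([img_a, img_b]).reshape(2, -1)
mean, var = fit_factored_gaussian(batch)
blurry = mean.reshape(2, 2)  # what the Gaussian decoder would output as its mean
```

Here every pixel of `blurry` is 0.5, a blend of the two inputs that matches neither of them. This is the toy analogue of a fuzzy sample.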
Since $p_\theta(x|z)$ is an approximation of $q_\phi(x|z)$, we can calculate the variance of $q_\phi(x|z)$ for any $z$ to measure the “fuzziness” of the generated samples given that particular $z$. In the following image, the left digits are generated with low variance of $q_\phi(x|z)$ and thus are sharp; the right digits, however, have high variance and hence look like some average between 5, 4, and 9.
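As a rough illustration of this variance measurement, the sketch below estimates the mean and variance of the posterior over $x$ for a given $z$ using Bayes' rule, with the data distribution replaced by a uniform distribution over a finite dataset. Everything here is an assumption made for the example: the 1-D toy data, the Gaussian encoder whose mean equals $x$, the standard deviation of 0.3, and the function name `posterior_mean_and_variance`.

```python
import numpy as np

def posterior_mean_and_variance(x, enc_mean, enc_std, z):
    """Mean and variance of q(x|z) over a finite dataset x (uniform p_data).

    Uses q(x_i|z) = q(z|x_i) / sum_j q(z|x_j) with a Gaussian encoder
    q(z|x_i) = N(z; enc_mean[i], enc_std^2).
    """
    # log q(z|x_i), dropping the constant shared by all data points.
    log_w = -0.5 * ((z - enc_mean) / enc_std) ** 2 - np.log(enc_std)
    # Normalize the weights in a numerically stable way.
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    mean = np.sum(w * x)
    var = np.sum(w * (x - mean) ** 2)
    return mean, var

# Toy 1-D dataset with two clusters, around -1 and +1.
x = np.array([-1.05, -0.95, 0.95, 1.05])
enc_mean = x.copy()   # hypothetical encoder: the code mean equals the data point
enc_std = 0.3

# A code inside one cluster versus a code halfway between the clusters.
_, var_sharp = posterior_mean_and_variance(x, enc_mean, enc_std, z=-1.0)
_, var_fuzzy = posterior_mean_and_variance(x, enc_mean, enc_std, z=0.0)
```

In this toy setting the code inside a cluster has a much smaller posterior variance than the code between clusters, where a Gaussian decoder would have to average over both.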
For a weak $p_\theta(x|z)$, there is only one solution to alleviate fuzziness: have a better inference model $q_\phi$. In our paper, we demonstrated that injecting latent codes during iterative generation can result in latent codes that have smaller variance, thus creating sharper samples. This is a generalization of the notion of “infusion training” (Bordes, Honari, and Vincent 2017), where the injected latent code is just a subset of the pixels of the image. Here, we show that we can generate sharp LSUN images using simple Gaussian VAEs.
Our code for this experiment is available at https://github.com/ermongroup/Sequential-Variational-Autoencoder.
What if a complex distribution, such as PixelCNN, is considered for $p_\theta(x|z)$?
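Let us write the VAE ELBO:
$$\mathcal{L}_{ELBO} = \mathbb{E}_{p_{data}(x)}\left[\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \mathrm{KL}(q_\phi(z|x)\,\|\,p(z))\right]$$
Up to an additive constant that does not depend on the model, this is equivalent to:
$$-\mathrm{KL}(p_{data}(x)\,\|\,p_\theta(x)) - \mathbb{E}_{p_{data}(x)}\left[\mathrm{KL}(q_\phi(z|x)\,\|\,p_\theta(z|x))\right]$$
where $p_\theta(x) = \int p(z)\,p_\theta(x|z)\,dz$, and $p_\theta(z|x)$ is the true posterior. If $p_\theta(x|z)$ can be arbitrarily complex, then a trivial solution will make both KL terms equal to zero:
$$p_\theta(x|z) = p_{data}(x), \qquad q_\phi(z|x) = p(z) \quad \text{for all } x, z$$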
This is exactly the case discussed in the Variational Lossy Autoencoder paper (Chen et al. 2016).
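The trivial solution can be checked numerically in a small discrete setting. The sketch below assumes finite $x$ and $z$ with randomly drawn, strictly positive distributions. It verifies that the average ELBO equals $-\mathrm{KL}(p_{data}\,\|\,p_\theta(x))$ minus the expected KL between the encoder and the model's true posterior minus the data entropy, and that a decoder ignoring $z$ paired with an encoder ignoring $x$ attains the maximum. All names (`elbo`, `decomposition`, the array layouts) are choices made for this example.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def elbo(p_data, prior, dec, enc):
    """Average ELBO, with dec[z, x] = p(x|z) and enc[x, z] = q(z|x)."""
    recon = np.sum(p_data[:, None] * enc * np.log(dec.T))
    kl_prior = sum(p_data[i] * kl(enc[i], prior) for i in range(len(p_data)))
    return float(recon - kl_prior)

def decomposition(p_data, prior, dec, enc):
    """-KL(p_data || p(x)) - E[KL(q(z|x) || p(z|x))] - H(p_data)."""
    joint = prior[:, None] * dec   # p(z) p(x|z), shape [z, x]
    p_x = joint.sum(axis=0)        # model marginal p(x)
    post = (joint / p_x).T         # true posterior p(z|x), shape [x, z]
    kl_post = sum(p_data[i] * kl(enc[i], post[i]) for i in range(len(p_data)))
    entropy = -float(np.sum(p_data * np.log(p_data)))
    return -kl(p_data, p_x) - kl_post - entropy

rng = np.random.default_rng(0)
n_x, n_z = 5, 3
p_data = rng.dirichlet(np.ones(n_x))
prior = rng.dirichlet(np.ones(n_z))
dec = rng.dirichlet(np.ones(n_x), size=n_z)   # arbitrary decoder, shape [z, x]
enc = rng.dirichlet(np.ones(n_z), size=n_x)   # arbitrary encoder, shape [x, z]

generic_elbo = elbo(p_data, prior, dec, enc)
generic_gap = generic_elbo - decomposition(p_data, prior, dec, enc)

# Trivial solution: the decoder ignores z and the encoder ignores x.
trivial_dec = np.tile(p_data, (n_z, 1))
trivial_enc = np.tile(prior, (n_x, 1))
trivial_elbo = elbo(p_data, prior, trivial_dec, trivial_enc)
neg_entropy = float(np.sum(p_data * np.log(p_data)))
```

With the trivial pair, `trivial_elbo` matches `neg_entropy`, which is the largest value the ELBO can take here, even though the latent code carries no information about $x$.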
However, if we are willing to give up the KL divergence term involving the prior, the latent code will be utilized, and it is still possible to generate samples through a Markov chain. Here, we illustrate generating samples through a Markov chain, where highly complex models such as PixelCNN++ are utilized.
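To give some intuition for why a Markov chain can produce samples without a prior, here is a toy discrete sketch. It assumes the chain alternates between encoding and decoding, and that the decoder matches the posterior induced by the data and the encoder exactly, which a trained network would only approximate. The distributions `p_data` and `q_z_given_x` are made-up numbers for illustration.

```python
import numpy as np

# Hypothetical data distribution over 4 values of x.
p_data = np.array([0.1, 0.2, 0.3, 0.4])

# Hypothetical encoder q(z|x): rows index x, columns index 2 values of z.
q_z_given_x = np.array([
    [0.9, 0.1],
    [0.7, 0.3],
    [0.2, 0.8],
    [0.1, 0.9],
])

# Ideal decoder via Bayes' rule: q(x|z) = p_data(x) q(z|x) / q(z).
joint = p_data[:, None] * q_z_given_x   # q(x, z), shape [x, z]
q_z = joint.sum(axis=0)                 # aggregate code distribution q(z)
q_x_given_z = (joint / q_z).T           # shape [z, x]

# One chain step x -> z -> x' as a transition matrix T[x, x'].
T = q_z_given_x @ q_x_given_z

# Start from a point mass on the first x and run the chain.
dist = np.array([1.0, 0.0, 0.0, 0.0])
for _ in range(200):
    dist = dist @ T
```

After enough steps `dist` agrees with `p_data`, regardless of the starting point, so in this idealized setting the chain samples from the data distribution without ever drawing $z$ from a prior.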
Our code for this experiment is available at https://github.com/ermongroup/Generalized-PixelVAE.
(We have recent work that extends this to multiple types of regularization, including adversarial distance (Goodfellow et al. 2014), moment matching distance (Li, Swersky, and Zemel 2015), and Stein variational gradient descent (Liu and Wang 2016).)
Through this work, we aim to promote the following argument: in unsupervised learning, learning features should be at least as important as generating samples.
Viewing VAEs from the inference side allows us to realize that if we want our features to have some alignment with the real world, then the type of $p_\theta(x|z)$ we utilize should match our inductive bias over that distribution. Since both Gaussian and recurrent distributions are far from perfect at encoding our inductive bias (as directed by features), this requires us to consider other types of distributions that could reflect it.
This also gives us another intuition. Instead of using a simple generator network and coming up with complex inference distributions (as in many prior works), we can take the opposite direction, where the generator is complex yet the inference is simple.
Following this work, our second arXiv submission (https://arxiv.org/abs/1702.08396) discusses current limitations of hierarchical VAEs and proposes an architecture that is extremely simple yet effective at learning structured features. We will discuss that in another blog post.