TLDR: Current ways of stacking variational autoencoders may not always provide meaningful structured features. In fact, we showed in a recent ICML paper that a simple ladder architecture is enough to learn structured, disentangled features with variational autoencoders.
Variational Autoencoders (VAE) (Kingma and Welling 2013) have been one of the most popular frameworks for deep generative models. A VAE defines a model that generates data x given latent variables z, where we can treat z as a kind of "feature" and use it for semi-supervised learning (Kingma et al. 2014).
For some time, researchers have proposed "hierarchical VAEs", which stack multiple layers of latent variables z_1, …, z_L, so that the generative model factorizes as p(x, z_1, …, z_L) = p(z_L) p(z_{L-1} | z_L) ⋯ p(x | z_1).
There are two potential advantages to using such hierarchical VAEs: first, stacking latent variables makes the model more expressive, which can improve the likelihood bound it achieves; second, the latent variables at different layers might capture a hierarchy of features, from low-level to high-level.
The first advantage has been validated by many recent works: hierarchical VAEs do help improve ELBO bounds on paper1. The second argument, however, has not been clearly demonstrated2. In fact, it is difficult to learn a meaningful hierarchy when there are many layers of z, and some have treated this as an inference problem (Sønderby et al. 2016).
We show in our recent ICML paper, Learning Hierarchical Features from Generative Models, that if the goal is to learn structured, hierarchical features, a hierarchical VAE has its problems. Instead, we can use a very simple approach, which we call the Variational Ladder Autoencoder (VLAE) 3.
The basic idea is simple: if there is a hierarchy of features, some complex and some simple, they should be expressed by neural networks of different capacity. Hence, we should use shallow networks to express low-level, simple features and deep networks to express high-level, complex features. One way to do this is to use an implicit generative model 4, where we inject Gaussian noise at different depths of the network:
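To make the noise-injection idea concrete, here is a minimal sketch of such an implicit generative model, assuming a toy fully connected network with made-up (untrained) weights and layer sizes; the function and variable names are hypothetical, not from the paper's code:

```python
import numpy as np

# Sketch of a VLAE-style generator: a Gaussian code z_l is injected at
# every depth. The code injected first passes through every subsequent
# block (a deep path, expressing complex features), while the code
# injected last passes through only the readout (a shallow path,
# expressing simple features). All sizes here are illustrative.

def layer(x, w, b):
    """One fully connected layer with a tanh nonlinearity."""
    return np.tanh(x @ w + b)

rng = np.random.default_rng(0)
dims = [2, 2, 2, 2]          # latent dimension at each of 4 levels
hidden, out = 16, 8          # hidden width and data dimension

# Random parameters: one block per level, plus a linear readout.
ws = [rng.standard_normal((hidden + d, hidden)) * 0.1 for d in dims]
bs = [np.zeros(hidden) for _ in dims]
w_out = rng.standard_normal((hidden, out)) * 0.1

def generate(zs):
    """Map per-level noise vectors [z_L, ..., z_1] to a sample x,
    injecting each code by concatenating it with the hidden state."""
    h = np.zeros(hidden)
    for z, w, b in zip(zs, ws, bs):
        h = layer(np.concatenate([h, z]), w, b)   # noise injected here
    return h @ w_out

zs = [rng.standard_normal(d) for d in dims]
x = generate(zs)
print(x.shape)   # (8,)
```

Because each code enters the computation at a different depth, the network's capacity acts as an inductive bias: codes injected early can only influence the output through many layers, while codes injected late affect it almost directly.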
Training is essentially the same as for a 1-layer VAE with a reconstruction error. We show some results on SVHN and CelebA: the models have 4 layers of z, and in each block we display the generations obtained by varying the z at one layer while fixing the z at all other layers (from left to right, lower layers to higher layers):
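The traversal procedure behind these figures can be sketched as follows, assuming a trained decoder; here `generate` is a hypothetical stand-in (a random linear map per level) just to illustrate the vary-one-level, fix-the-rest loop:

```python
import numpy as np

# Per-level latent traversal: resample the code at ONE level while
# holding the codes at all other levels fixed. Each resulting row of
# samples isolates the features controlled by that level.

rng = np.random.default_rng(0)
n_levels, dim, out = 4, 2, 8     # 4 layers of z, toy sizes
maps = [rng.standard_normal((dim, out)) for _ in range(n_levels)]

def generate(zs):
    """Toy stand-in for a trained decoder: sum of per-level maps."""
    return sum(z @ m for z, m in zip(zs, maps))

base = [rng.standard_normal(dim) for _ in range(n_levels)]

for level in range(n_levels):         # left to right: low to high
    samples = []
    for _ in range(5):                # a row of 5 variations
        zs = list(base)
        zs[level] = rng.standard_normal(dim)   # vary only this level
        samples.append(generate(zs))
    print(level, np.stack(samples).shape)
```

With a real decoder, the samples in each row would be decoded into images, so that each row shows what the corresponding level of z controls.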
We can see that lower layers learn simpler features such as image color, while higher layers learn more complex attributes such as the overall structure. This clearly demonstrates that VLAE is able to learn hierarchical features, similar to what InfoGAN (Chen et al. 2016) does.
Code is available at: https://github.com/ermongroup/Variational-Ladder-Autoencoder.
In terms of learning hierarchical features, deeper is not always better for VAEs; if we use a ladder architecture as an inductive bias over the complexity of features, we can learn structured features with ease.
Coming up with a good name for this model was difficult, given the many existing works with similar names. ↩
(Tran, Ranganath, and Blei 2017) mention a model with a similar architecture, but their motivation differs from ours: they focus on learning and inference with GANs, while we focus on learning hierarchical features. ↩