Learning Hierarchical Features from Generative Models

TLDR: Current ways of stacking variational autoencoders may not always provide meaningful structured features. In fact, we showed in a recent ICML paper that a simple ladder architecture is enough to learn structured, disentangled features from variational autoencoders.

Variational Autoencoders (VAEs) (Kingma and Welling 2013) have been one of the most popular frameworks for deep generative models. A VAE defines a model $p(x \mid z)$ that generates data $x$ given the latent variables $z$, where we can treat $z$ as some sort of “feature” and perform semi-supervised learning (Kingma et al. 2014).
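
For reference, a VAE of this kind is usually trained by maximizing the evidence lower bound (ELBO) on the log-likelihood. A standard way to write it is given below; the approximate posterior $q_\phi(z \mid x)$, the prior $p(z)$, and the parameter symbols $\theta$ and $\phi$ are notation assumed here, not taken from this post:

$$
\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
$$

The first term is the (negative) reconstruction error and the second term keeps the approximate posterior close to the prior.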

Deeper is not Always Better for Hierarchical VAEs

For some time, researchers have proposed using “hierarchical VAEs”, which stack several layers of latent variables.
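
One common way to write such a model, assuming $L$ layers of latent variables $z_1, \dots, z_L$ arranged in a Markov chain (the indexing is an assumption made here for illustration), is:

$$
p(x, z_1, \dots, z_L) \;=\; p(x \mid z_1)\, p(z_L) \prod_{l=1}^{L-1} p(z_l \mid z_{l+1})
$$

Here $z_L$ is the top-level latent variable, and each lower layer is generated conditionally on the layer above it.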

There are two potential advantages to using such hierarchical VAEs:

  1. It could improve the ELBO and decrease reconstruction error in most cases, which would look nice in a paper.
  2. The stack of latent variables might learn a “feature hierarchy”, similar to what convolutional neural networks learn.

The first has been validated by many recent works: hierarchical VAEs do help improve the ELBO on paper (footnote 1). The second argument, however, has not been clearly demonstrated (footnote 2). In fact, it is difficult to learn a meaningful hierarchy when there are many layers of $z$, and some have treated this as an inference problem (Sønderby et al. 2016).

A Simple yet Effective Method for Learning Hierarchical Features

We show in our recent ICML paper, Learning Hierarchical Features from Generative Models, that if the purpose is to learn structured, hierarchical features, using a hierarchical VAE has its problems. Instead, we can use a very simple approach, which we call the Variational Ladder Autoencoder (VLAE) (footnote 3).

The basic idea is simple: if there is a hierarchy of features, some complex and some simple, they should be expressed by neural networks of different capacity. Hence, we should use shallow networks to express low-level, simple features and deep networks to express high-level, complex features. One way to do this is to use an implicit generative model (footnote 4), where we inject Gaussian noise at different levels of the network.
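
The noise-injection idea can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: the layer sizes, the tanh nonlinearity, the use of concatenation to inject each code, and all function names are assumptions, and the weights are random and untrained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen only for illustration.
Z_DIM, H_DIM, X_DIM, NUM_LAYERS = 2, 16, 64, 4


def dense(in_dim, out_dim):
    """Return random (untrained) weights and a zero bias for one dense layer."""
    return rng.normal(0.0, 0.1, size=(in_dim, out_dim)), np.zeros(out_dim)


def apply_dense(params, h):
    """Apply one dense layer followed by a tanh nonlinearity."""
    w, b = params
    return np.tanh(h @ w + b)


# blocks[0] sees only the top-level code; every later block sees the running
# hidden state concatenated with the code injected at that level.
blocks = [dense(Z_DIM, H_DIM)] + [
    dense(H_DIM + Z_DIM, H_DIM) for _ in range(NUM_LAYERS - 1)
]
out_w, out_b = dense(H_DIM, X_DIM)


def generate(zs):
    """Map codes [z_1, ..., z_L] (each of shape (batch, Z_DIM)) to outputs.

    z_L (the last entry) passes through every block, so it is transformed by
    the deepest network; z_1 is injected last and passes through only one
    block before the linear output layer.
    """
    h = apply_dense(blocks[0], zs[-1])
    for block, z in zip(blocks[1:], reversed(zs[:-1])):
        h = apply_dense(block, np.concatenate([h, z], axis=1))
    return h @ out_w + out_b


def sample_prior(batch):
    """Draw independent standard Gaussian noise for every level."""
    return [rng.standard_normal((batch, Z_DIM)) for _ in range(NUM_LAYERS)]


x = generate(sample_prior(8))  # array of shape (8, X_DIM)
```

The point of the sketch is the asymmetry in depth: the path from the top-level code to the output is several layers long, while the path from the lowest-level code is short, which is the inductive bias described above.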

Training is essentially the same as for a 1-layer VAE with reconstruction error. We show some results on SVHN and CelebA: the models have 4 layers of $z$, and in each block we display the generation results when we change the $z$ in one layer and fix the $z$ in the other layers (from left to right, lower layers to higher layers).
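
The visualization procedure (vary the code of one layer, hold the others fixed) can be sketched as below. This is an illustrative sketch only: `traverse_layer`, `toy_decode`, the 2-dimensional codes, and the grid range are hypothetical, and `toy_decode` is a stand-in that simply concatenates the codes so the example runs without a trained model.

```python
import numpy as np


def traverse_layer(decode, base_codes, layer, grid):
    """Decode once per grid value, replacing only the code at index `layer`."""
    outputs = []
    for value in grid:
        codes = [c.copy() for c in base_codes]  # keep all other layers fixed
        codes[layer] = np.asarray(value, dtype=float).reshape(codes[layer].shape)
        outputs.append(decode(codes))
    return np.stack(outputs)


def toy_decode(codes):
    """Stand-in for a trained generator: just concatenates the codes."""
    return np.concatenate(codes)


rng = np.random.default_rng(0)
base = [rng.standard_normal(2) for _ in range(4)]  # 4 layers of 2-D codes

axis = np.linspace(-2.0, 2.0, 5)
grid = [(a, b) for a in axis for b in axis]  # 5 x 5 grid over one layer's code

# Vary the lowest layer (index 0) and keep the other three layers fixed.
samples = traverse_layer(toy_decode, base, layer=0, grid=grid)
```

With a real generator in place of `toy_decode`, arranging `samples` in a 5 x 5 image grid would give one block of the kind described above; repeating with `layer=1, 2, 3` would give the remaining blocks.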

We can see that lower layers learned simpler features such as the image color, while higher layers learned more complex attributes such as the overall structure. This clearly demonstrates that this VLAE is able to learn hierarchical features, similar to what InfoGAN (Chen et al. 2016) does.

Code is available at: https://github.com/ermongroup/Variational-Ladder-Autoencoder.


In terms of learning hierarchical features, deeper is not always better for VAEs; if we use a ladder architecture that serves as an inductive bias for the complexity of features, then we can learn structured features with ease.


  1. Kingma, Diederik P, and Max Welling. 2013. “Auto-Encoding Variational Bayes.” arXiv preprint arXiv:1312.6114.
  2. Kingma, Diederik P, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. 2014. “Semi-Supervised Learning with Deep Generative Models.” In Advances in Neural Information Processing Systems, 3581–89.
  3. Sønderby, Casper Kaae, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. 2016. “Ladder Variational Autoencoders.” In Advances in Neural Information Processing Systems, 3738–46.
  4. Chen, Xi, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. “InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets.” In Advances in Neural Information Processing Systems, 2172–80.


  1. Although the ELBO and log-likelihood in general are not necessarily good measures for generative models (Theis, Oord, and Bethge 2015).

  2. Except in (Gulrajani et al. 2016), where PixelCNN is used to model $p(x \mid z)$.

  3. Coming up with a good name for this model is so difficult, because of existing works. 

  4. Tran, Ranganath, and Blei (2017) mention a model with a similar architecture, but their motivation is different from ours (they focus on learning and inference with GANs, while we focus on learning hierarchical features).