TLDR of TLDR: With a dataset over “cats or dogs” labels, you are not going to learn a table detector even if cats appear on tables more often than dogs.
TLDR: Current state-of-the-art deep learning models for transfer learning are mostly trained with supervised learning. However, their ability to transfer is ultimately limited by the expressiveness of the labels. Unsupervised learning methods, on the other hand, do not suffer from this problem, and thus have the potential to learn all the useful features in the data.
Since the advent of AlexNet (Krizhevsky, Sutskever, and Hinton 2012), deep convolutional networks pretrained on ImageNet have been used for a variety of tasks. These include transfer learning to a new task (Yosinski et al. 2014), generating visually realistic data (Nguyen et al. 2016), and efficient evaluation of generative models (Salimans et al. 2016). These methods are based on one premise - that these pretrained models contain “rich features” that are generally useful in the larger domain.
However, there have been few investigations into how “rich” these features actually are. Do deep models trained with supervised learning learn all the features useful for transfer? We answer this question in a recently submitted ICLR workshop paper. (The answer is no.)
We introduce a concept called feature learning. In feature learning, we assume that features emerge from the weights of a deep neural network, so that learning a neural network is essentially learning features at the same time. If we have already learned features $f_1, \ldots, f_t$, then the ease of learning a new feature $f_{t+1}$ is measured by the “information gain”, or *signal*, obtained when we add it:

$$s(f_{t+1}) = I(f_{t+1}; y \mid f_1, \ldots, f_t),$$

where $y$ is the label[^1]. Therefore, a larger signal indicates that learning this new feature while conditioning on the previous ones will yield better predictive performance on the current task. However, by the chain rule of mutual information, the sum of all the signals is bounded by the entropy of the labels:

$$\sum_{t \geq 0} s(f_{t+1}) = I(f_1, f_2, \ldots; y) \leq H(y).$$
Note that this bound is independent of the size of the dataset. This suggests that the existence of previously learned features will negatively affect the ability to learn new ones.
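As a sanity check, the chain-rule identity behind the bound - that the sum of the conditional signals $I(f_{t+1}; y \mid f_1, \ldots, f_t)$ equals the total $I(f_1, f_2, \ldots; y)$, which is at most $H(y)$ - can be verified numerically. The probability table below is an arbitrary toy joint distribution chosen for illustration, not from the paper:

```python
import numpy as np

def H(p):
    """Shannon entropy in bits of a probability table."""
    p = np.asarray(p, float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Toy joint P(f1, f2, y) over binary variables: p[i, j, k] = P(f1=i, f2=j, y=k).
p = np.array([[[0.20, 0.05], [0.10, 0.05]],
              [[0.05, 0.15], [0.05, 0.35]]])

p_y   = p.sum(axis=(0, 1))   # P(y)
p_f1  = p.sum(axis=(1, 2))   # P(f1)
p_f1y = p.sum(axis=1)        # P(f1, y)
p_f12 = p.sum(axis=2)        # P(f1, f2)

# Signals: s(f1) = I(f1; y), s(f2) = I(f2; y | f1).
s1 = H(p_f1) + H(p_y) - H(p_f1y)
s2 = H(p_f12) + H(p_f1y) - H(p_f1) - H(p)

# Total information about the label, bounded by H(y).
total = H(p_f12) + H(p_y) - H(p)
print(s1 + s2, total, H(p_y))
```

The sum `s1 + s2` matches `total` exactly (the chain rule), and `total` can never exceed `H(p_y)` no matter how many features are added.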
With an infinite dataset over “cats or dogs” labels, learning cat-face and dog-face features will reduce the signal for learning a table-detector feature. We therefore hypothesize that highly predictive features, such as cat faces and dog faces, will prevent the learning of less predictive features, such as tables. We refer to this hypothesis as “feature competition”.
To test whether this is reasonable, we conduct the following experiment with two MNIST digits in the same input image. We would like to know whether features learned for the label of the left digit prevent learning features for the right digit. This is analogous to the hypothetical “cats, dogs, and tables” dataset, where the left digits play the role of “cats and dogs” and the right digits play the role of “tables”.
In the feature extraction phase, we train with only the label of the left digit, and obtain separate “feature extractors” $f_l$ and $f_r$ for the two digits, whose outputs are the input to the top fully-connected layer. In the feature evaluation phase, we fix $f_l$ and $f_r$, and plug in a new fully-connected layer trained on the label of the right digit. We use the test accuracy in the evaluation phase to approximate the quality of the features learned by $f_r$ - higher test accuracy would suggest that $f_r$ learned “better” features in the extraction phase.
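The mechanics of the evaluation phase - freeze the extractor, train only a fresh head - can be sketched in a few lines of NumPy. Everything here is a hypothetical stand-in for the MNIST setup: random vectors instead of digit images, and a fixed random projection instead of an extractor trained on the left-digit label:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "images": 32-dim vectors whose right half plays the right digit.
X = rng.normal(size=(512, 32))
y_right = (X[:, 16:].sum(axis=1) > 0).astype(int)  # "right digit" label

# Extraction phase (frozen afterwards). Here: a fixed random ReLU projection;
# in the actual experiment this comes from training on the left-digit label.
W_f = rng.normal(size=(32, 64)) / np.sqrt(32)
def f(x):
    return np.maximum(x @ W_f, 0.0)  # frozen feature extractor

# Evaluation phase: train only a new logistic head on top of the frozen f.
feats = f(X)
w, b = np.zeros(64), 0.0
for _ in range(300):                 # plain gradient descent on logistic loss
    prob = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
    g = prob - y_right
    w -= 0.5 * feats.T @ g / len(X)
    b -= 0.5 * g.mean()

acc = float(((feats @ w + b > 0).astype(int) == y_right).mean())
```

Only `w` and `b` are updated; the extractor's weights `W_f` never change, so `acc` measures purely how much information about the right label survives in the frozen features.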
During the extraction phase, we randomly corrupt the left digit (making it totally indistinguishable) with probability $1 - p_l$, and force the right digit to have the same label as the left one with probability $p_r$. Therefore, the signal for learning the right-digit feature would be:

$$s(f_r) = I(f_r; y \mid f_l) = (1 - p_l)\, I(f_r; y),$$

which increases if $p_l$ decreases or $p_r$ increases.
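Assuming this factorization - with $p_l$ the probability that the left digit is left intact and $p_r$ the probability that the right digit is forced to share the label - the signal can be computed in closed form for $k = 10$ uniform classes, since $I(f_r; y) = H(y) - H(y \mid f_r)$. A small sketch:

```python
import numpy as np

def signal(p_l, p_r, k=10):
    """s(f_r) = (1 - p_l) * I(f_r; y) in bits, assuming y is uniform over k
    classes and the right digit equals y with probability p_r (else uniform)."""
    if p_r >= 1:                       # degenerate case: f_r determines y
        return (1 - p_l) * np.log2(k)
    q = p_r + (1 - p_r) / k            # P(y = j | f_r = j)
    h_cond = -(q * np.log2(q)
               + (k - 1) * ((1 - p_r) / k) * np.log2((1 - p_r) / k))
    return (1 - p_l) * (np.log2(k) - h_cond)   # (1 - p_l) * [H(y) - H(y|f_r)]
```

The signal vanishes either when the left digit is never corrupted ($p_l = 1$) or when the right digit is uncorrelated with the label ($p_r = 0$), and grows monotonically as corruption or correlation increases.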
The following figure shows two heatmaps: the test accuracy and the signal for $f_r$ under different settings of $p_l$ and $p_r$. The results validate our assumption that a higher signal makes learning the feature easier.
Interestingly, if $p_l$ is high then $f_r$ learns almost nothing even when $p_r$ is as high as 0.5, due to the “feature competition” phenomenon. Hence, to learn a table detector with a “cats and dogs” dataset, we would actually need the majority of the cats to sit on tables.
On the other hand, some generative models based on unsupervised methods tend to avoid this feature competition and have the potential to learn all the features. This might seem trivial for autoencoding models such as the VAE (Kingma and Welling 2013), but it is less obvious for GANs (Goodfellow et al. 2014). In the paper, we point out that the discriminator in the GAN framework does not suffer from “feature competition”, and has the potential to learn all the features.
The discriminator is presented with a dataset over two labels - 1 for real data, and 0 for generated data. We assume that the discriminator is always in “a state of confusion”, where it assigns probability 0.5 to both real and fake data[^2]. Therefore $I(f_1, \ldots, f_t; y) = 0$, and the signal for learning a new feature then becomes:

$$s(f_{t+1}) = I(f_1, \ldots, f_{t+1}; y) - I(f_1, \ldots, f_t; y) = I(f_1, \ldots, f_{t+1}; y) \geq I(f_{t+1}; y).$$
If the generated distribution does not yet match the data distribution, then the value above is greater than zero. Importantly, this value does not depend on the previously learned features, which effectively eliminates the “feature competition” phenomenon that plagues label-based supervision.
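As an illustration, consider a single binary feature $f$ with balanced real/fake labels (the “state of confusion”). The signal then reduces to $I(f; y)$, which is positive exactly when the feature's distribution differs between real and generated data. A minimal sketch, with assumed Bernoulli feature probabilities:

```python
import numpy as np

def H(p):
    """Shannon entropy in bits of a probability vector."""
    p = np.asarray(p, float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def disc_signal(p_real, p_fake):
    """I(f; y) for a binary feature f with P(f=1 | real) = p_real and
    P(f=1 | fake) = p_fake, under balanced real/fake labels."""
    p_f1 = 0.5 * (p_real + p_fake)    # marginal P(f = 1)
    h_f = H([p_f1, 1 - p_f1])
    h_f_given_y = 0.5 * H([p_real, 1 - p_real]) + 0.5 * H([p_fake, 1 - p_fake])
    return h_f - h_f_given_y          # I(f; y) = H(f) - H(f | y)
```

A feature on which the two distributions disagree (say `disc_signal(0.9, 0.1)`) carries positive signal, while one on which they already match (`disc_signal(0.3, 0.3)`) carries none - regardless of which other features were learned before it.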
We empirically test this using the same extraction-evaluation setting, with the feature extractor being a convolutional neural network, and $p_l = 1$, $p_r = 0$ (no corruption of the left digit, no correlation with the right digit). We consider vanilla CNNs, autoencoders (AE), GANs, and Wasserstein GANs (WGAN), all using the same neural network as the feature extractor.
| | CNN | AE | GAN | WGAN |
|---|---|---|---|---|
| Accuracy (w/ extraction) | 67.93 | 89.95 | 90.38 | 91.37 |
| Accuracy (w/o extraction) | 84.31 | 82.18 | 82.27 | 84.97 |
GAN and WGAN achieve accuracy on par with the AE, while the CNN performs even worse than randomly initializing the weights of $f_r$. This is because the right digit is completely ignored during the learning of $f_r$ in the extraction phase.
In this post, we discussed the limitations of learning transferable features with label-based supervised learning. Current practice, however, tends to take CNN features trained on ImageNet for granted - for example, the Inception score for measuring the “quality” of generated samples. While these methods seem successful[^3], one should also be aware of their limitations. We propose further exploration in the direction of unsupervised representation learning.
[^1]: We assume that we have infinite data in this article; the dataset size affects other aspects of feature learning.

[^2]: Note that the generator is not allowed to “fool” the discriminator, but only to reach the decision boundary. If this condition is not satisfied, the argument breaks (see Appendix A of our paper). This is also the intuition behind Boundary-Seeking GANs (Hjelm et al. 2017).

[^3]: This is thanks to the highly expressive labels in ImageNet. Our theory also provides a strong motivation for datasets like Visual Genome, since the key to improving supervised feature learning is more expressive labels.