Do GANs really model the true data distribution, or are they just cleverly fooling us?
by Gal Yona | Towards Data Science

Since their introduction in 2014, Generative Adversarial Networks (GANs) have become a popular choice for the task of density estimation. The approach is simple: a GAN framework is composed of two networks, one for generating new samples, and another for discriminating between real samples (from the true data distribution) and generated samples. The term adversarial is used because the two have competing objectives, so one tries to “outwit” the other. The networks are trained jointly such that the gradient feedback from one network improves the other, until (hopefully) the generator is able to produce images that even a good discriminator cannot tell apart from real ones.
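To make the setup concrete, here is a minimal sketch of one alternating training step, assuming a toy generator and discriminator in PyTorch (the architectures, latent dimension, and hyperparameters below are placeholders, not those of any particular paper):

```python
import torch
import torch.nn as nn

# Hypothetical toy models -- any generator/discriminator pair with matching
# input/output shapes would do; this only illustrates the alternating updates.
latent_dim = 100
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real):
    """One GAN step. `real` is a (batch, 784) tensor scaled to [-1, 1]."""
    batch = real.size(0)
    z = torch.randn(batch, latent_dim)
    fake = G(z)

    # Discriminator update: push real samples toward 1, generated toward 0.
    d_loss = bce(D(real), torch.ones(batch, 1)) + bce(D(fake.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator update: try to make the discriminator label fakes as real.
    g_loss = bce(D(fake), torch.ones(batch, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```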

The theory behind GANs is promising. In fact, if at each step of the training procedure each network is trained to completion, the GAN objective can be shown to be equivalent to minimizing the Jensen-Shannon divergence between the true data density and the model density. In practice, however, the assumptions of this analysis don't hold. Indeed, successfully training GANs is a notoriously difficult task that has given rise to many improvements in the last two years. Either way, the extent to which GANs manage to faithfully model the true data distribution in practice is still an open question.
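For reference, this is the classical result from the original GAN paper (Goodfellow et al., 2014), written in my own notation: with the value function below and a discriminator trained to optimality, the generator's objective reduces to the Jensen-Shannon divergence between the data distribution and the generator distribution, up to a constant.

```latex
V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big]
        + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]

D^*(x) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_g(x)}

\max_D V(D, G) = -\log 4 + 2\,\mathrm{JSD}\big(p_{\mathrm{data}} \,\|\, p_g\big)
```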

One recent work that has shown very promising results is NVIDIA’s famous “Progressive Growing of GANs”, in which both the discriminator and the generator grow progressively until high quality 1024x1024 images are generated. The generated faces of “pseudo-celebrities” seem to suggest a positive answer to the above question. But is it so?

How do we know if GANs are successful?

The main challenge in evaluating the performance of GANs is that there is no inherent measure of how good the distributional fit is, leaving researchers (for the most part) with the task of qualitatively evaluating the results.

Simply looking at the generated outputs and marveling at how realistic the faces look is not enough. GANs are known to be susceptible to mode collapse, in which the target distribution is not fully modeled and the generator tends to produce very similar images (corresponding to only a few “modes” of the true distribution). When the training set is composed of thousands of images, how can we be sure that the generator is doing more than just smoothly interpolating between the training images (or, in the worst case, just memorizing the dataset)?

One good indication that mode collapse is not happening is diversity in the generated outputs. A simple check for diversity is the following:

Choose two random noise vectors that produce realistic images, and create “interpolated” images by generating from seeds lying on the line joining the two vectors. If many of these images are reasonable, then we have good evidence that the generated distribution is capable of producing a variety of images.
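A minimal sketch of that check, assuming a `generator` callable that maps latent vectors to images (the name, latent dimension, and array conventions are placeholders, not NVIDIA's actual code):

```python
import numpy as np

def latent_interpolation(generator, latent_dim=512, steps=8, seed=None):
    """Generate images along the line between two random latent vectors."""
    rng = np.random.default_rng(seed)
    z0 = rng.standard_normal(latent_dim)
    z1 = rng.standard_normal(latent_dim)

    images = []
    for t in np.linspace(0.0, 1.0, steps):
        z = (1.0 - t) * z0 + t * z1           # linear interpolation in latent space
        images.append(generator(z[None, :]))  # add a batch dimension for the model
    return images
```

As a side note, spherical interpolation (slerp) is often preferred over a straight line when the latent prior is Gaussian, since midpoints of the line have atypically small norm compared to real samples from the prior.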

The folks from NVIDIA were kind enough to provide a whole hour of such interpolations for your enjoyment:

However, this still doesn’t resolve the issue of originality: if the GAN managed to truly model the data distribution, then the chances of it generating an image that closely resembles a training image are slim to non-existent. While the YouTube video shows a lot of high quality generated faces, could the GAN be doing nothing more than clever and visually appealing interpolations between the training images it received? Does it really dream up new celebrities, or does it merely create skillful merges of existing ones?

The point is that demonstrating sample diversity (e.g., showing that the support of the generated distribution is large) is not enough. One also needs to establish that the generated outputs differ significantly from the training examples.

For this purpose, the authors of the NVIDIA paper used the most straightforward approach: for each generated face, show its nearest neighbor in the training set, where the metric used is the L1 distance in pixel space (of a center crop of the image).

top: generated outputs, bottom: nearest neighbor in training set, as calculated using L1 distance in the pixel space of the center crop
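For concreteness, that comparison looks roughly like the sketch below, assuming the images are already loaded as same-sized numpy arrays (the crop fraction and helper names are my assumptions, not the paper's exact settings):

```python
import numpy as np

def center_crop(img, frac=0.5):
    """Keep the central frac x frac region of an (H, W, C) image."""
    h, w = img.shape[:2]
    dh, dw = int(h * (1 - frac) / 2), int(w * (1 - frac) / 2)
    return img[dh:h - dh, dw:w - dw]

def l1_nearest_neighbor(generated, training_images, frac=0.5):
    """Return the index of the training image closest in L1 pixel distance."""
    query = center_crop(generated, frac).astype(np.float64)
    dists = [np.abs(query - center_crop(t, frac).astype(np.float64)).sum()
             for t in training_images]
    return int(np.argmin(dists))
```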

I guess the assumption here is that because the faces are normalized (aligned and frontal), measuring the distance in pixel space is a good proxy for true similarity, but the results are sometimes poor, as the rightmost pair demonstrates.

It felt like something was missing. Some of the generated images looked vaguely familiar to me; I just couldn’t come up with actual names. What if there really are images of these celebrities in the training set that a comparison of L1 pixel intensities simply doesn’t find?

To get some inspiration, I came up with my own test: I showed the generated outputs to my (non-AI-savvy) family members and asked them if they recognized any of these images. The responses were pretty funny. I got answers ranging from Beyoncé (second from left) to “Chris Rock with weird hair” (fourth from left) to “a female version of Michael Douglas” (rightmost), which is actually a pretty good guess.

left: generated image from GAN; right: Michael Douglas

Laughs aside, it got me thinking that there could be something there. I just needed a more methodical approach.

Face recognition to the rescue!

What if we replace the naive L1 distance with a semantic similarity measure that looks for the most similar people in the dataset? This is actually pretty easy. It turns out that the features from the final layer of a network trained on the task of face recognition are useful for computing semantic similarity between two individuals (in fact, that’s how Facebook knows whom to tag in the images you upload). In this case I was lazy and used dlib, a long-time go-to tool for face detection and facial landmark detection. They recently added pre-trained face recognition models which allow you to compute “semantic features” for people in just a few lines of code. This is super useful for a huge variety of downstream tasks involving images of people, so it’s a good tool to know.
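A minimal sketch of that pipeline using dlib's pre-trained face recognition model (the model file paths and the nearest-neighbor helper are my own, not the article's exact code, and the models need to be downloaded separately from dlib's site):

```python
import numpy as np
import dlib

# Pre-trained dlib models; file names correspond to the downloads on dlib.net.
detector = dlib.get_frontal_face_detector()
shape_predictor = dlib.shape_predictor("shape_predictor_5_face_landmarks.dat")
face_encoder = dlib.face_recognition_model_v1("dlib_face_recognition_resnet_model_v1.dat")

def face_descriptor(path):
    """Return a 128-d embedding for the first detected face in an image, or None."""
    img = dlib.load_rgb_image(path)
    detections = detector(img, 1)
    if not detections:
        return None
    landmarks = shape_predictor(img, detections[0])
    return np.array(face_encoder.compute_face_descriptor(img, landmarks))

def semantic_nearest_neighbor(generated_path, training_paths):
    """Find the training image whose face embedding is closest in Euclidean distance."""
    query = face_descriptor(generated_path)
    best_path, best_dist = None, np.inf
    for path in training_paths:
        desc = face_descriptor(path)
        if desc is None:
            continue
        dist = np.linalg.norm(query - desc)
        if dist < best_dist:
            best_path, best_dist = path, dist
    return best_path, best_dist
```

In practice you would compute and cache the descriptors for the whole training set once rather than per query, since CelebA contains roughly 200k images.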

I used this approach to search for semantically similar images in the non-HQ version of CelebA (the only one I had access to), which is the dataset NVIDIA’s GAN was trained on. Here are some of the “nearest neighbor” pairs I found. In all of the examples below, on the left is the high quality generated image, and on the right is a “semantic” neighbor from CelebA. These are much better than the L1 neighbors for sure, and some similarities are quite noticeable. (Note that these weren’t cherry-picked; I only did this for about 10 generated faces taken from the article.)

Summary

  • The evaluation of GANs forces us to resort to qualitative measures of “goodness of fit”.

  • When GANs are trained on big datasets, visually compelling and diverse outputs don’t, by themselves, prove that GAN training was successful in recovering the true data distribution. We need more rigorous evidence that GANs do more than just “intelligent memorization” of the training set.

  • When possible, opt for semantic similarity rather than simply using Euclidean distance on the raw data (especially if you’re trying to make a meaningful point).

  • Specifically for faces, you can get “off the shelf” high quality semantic features very easily. For this project I used dlib’s pre-trained face recognition network. It couldn’t be easier!
