Visual Chain-of-Thought Diffusion Models

Images generated by our baseline, EDM. They mostly look realistic, but occasional artifacts appear – see the blobs on the chin in the first and seventh images.

Images generated by our method. We don’t see any of the artifacts that were present in images from the baseline.

At this year’s CVPR workshop on Generative Models for Computer Vision we’ll present a simple new approach to unconditional and class-conditional image generation. It takes advantage of this fact: conditional diffusion generative models (DGMs) produce much more realistic images than unconditional DGMs. We show in the paper that images produced by conditional DGMs become even more realistic as you condition on more information. This holds true even if you add information simply by making your text prompt longer for Stable Diffusion (see the images we sample from Stable Diffusion at the end, and the sketch below). If we want to generate a large set of images (or a video), it seems we have to either (a) start by writing out a detailed description of each image or frame, or (b) accept inferior quality.
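To make the prompt-length effect concrete, here is a minimal sketch using the Hugging Face diffusers library. This is not the paper’s code; the model ID and prompts are placeholders we chose for illustration, and a GPU is assumed.

```python
import torch
from diffusers import StableDiffusionPipeline

# Illustrative model choice; any Stable Diffusion checkpoint works similarly.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

short_prompt = "Aerial photography."
long_prompt = (
    "Aerial photography of a patchwork of small green fields separated by "
    "brown dirt tracks, with a large tarmac road crossing from left to right."
)

# In line with the observation above, the longer, more detailed prompt
# tends to produce a more coherent, realistic image.
image_short = pipe(short_prompt).images[0]
image_long = pipe(long_prompt).images[0]
image_short.save("short_prompt.png")
image_long.save("long_prompt.png")
```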

Our paper proposes a third option: prompt a first DGM to generate a detailed image description, and then prompt a conditional DGM to generate the image given this detailed description. To avoid the cost of generating long paragraphs of text, we represent the image description as a vector, specifically a CLIP image embedding: a vector that encodes the semantically meaningful parts of an image in a compact format. A sketch of this two-stage procedure follows.
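Here is a minimal sketch of the two-stage pipeline. The interface is ours for illustration only: we assume an `embedding_prior` DGM that samples CLIP image embeddings and an `image_decoder` DGM that samples images conditioned on an embedding, each exposing a hypothetical `.sample(...)` method (see the paper for the actual models).

```python
import torch


def sample_image(embedding_prior, image_decoder, num_steps: int = 50) -> torch.Tensor:
    """Two-stage "visual chain-of-thought" sampling sketch.

    embedding_prior: DGM over CLIP image embeddings (stage 1).
    image_decoder: conditional DGM over images (stage 2).
    The .sample(...) interface is illustrative, not the paper's API.
    """
    # Stage 1: generate a compact "description" of the image to come,
    # in the form of a CLIP image embedding.
    clip_embedding = embedding_prior.sample(num_steps=num_steps)

    # Stage 2: generate the image conditioned on that embedding; this is
    # where the conditional DGM's quality advantage kicks in.
    return image_decoder.sample(condition=clip_embedding, num_steps=num_steps)
```

With that picture in mind, let’s look at some images sampled while conditioning on CLIP embeddings: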

Animal faces. We sampled every frame in this video using the same set of 20 CLIP embeddings, so high-level features (like animal species) are shared across all frames.

Human faces. We sampled every frame in this video using the same set of 20 CLIP embeddings, so, once again, high-level features are shared across frames.

We see that two images conditioned on the same CLIP embedding share many of the same features: the animals’ species and color patterns stay the same; the people’s age, facial expression, and accessories stay roughly the same. Even better, these images sampled from a conditional DGM are much more realistic than those sampled from an unconditional DGM: if we take a CLIP embedding of an image from the animal faces dataset and then sample from our conditional DGM, the resulting image is on average 56% more realistic than if we’d used an unconditional DGM, according to the Fréchet Inception Distance (FID), a commonly used measure of image quality in which lower is better.
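As an aside, here is how a relative improvement figure like “56% more realistic” can be computed from FID scores. The numbers below are placeholders, not the paper’s actual measurements.

```python
def relative_fid_improvement(fid_baseline: float, fid_ours: float) -> float:
    """Fraction by which FID improved; FID is lower-is-better."""
    return (fid_baseline - fid_ours) / fid_baseline


# Made-up example: an unconditional baseline at FID 10.0 and a
# conditional model at FID 4.4 correspond to a 56% improvement.
print(relative_fid_improvement(10.0, 4.4))  # 0.56
```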

Now, how well does the conditional DGM work as part of our proposed method, when we prompt it with a CLIP embedding generated by a second DGM? We find that our generated images are still 48% more realistic than those from an unconditional DGM, almost as good as when we “cheated” by conditioning on CLIP embeddings of real dataset images. In summary, even though our task is unconditional generation, we can make use of conditional DGMs, which typically produce better-looking images than unconditional DGMs!
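For reference, the “cheating” embeddings can be extracted from dataset images with an off-the-shelf CLIP model. This sketch uses the Hugging Face transformers implementation; the CLIP variant and file name are illustrative assumptions, not necessarily what the paper uses.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative CLIP variant; the paper may use a different one.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("animal_face.png")  # placeholder dataset image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    clip_embedding = model.get_image_features(**inputs)  # shape: (1, 512)

# This embedding can then be passed to the conditional DGM in place of
# one sampled from the embedding prior.
```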

We’re excited about future work following this direction. It’s likely that there are better quantities to condition on than CLIP embeddings – so far we have tried just a couple of alternatives. We might even be able to learn an embedder directly to maximize image quality – doing so could lead to a generalization of Variational Diffusion Models. We could also condition on multiple quantities – perhaps a future state-of-the-art generative model will consist of a “chain” of DGMs, each conditioning on the output of the one before. If this sounds too complex, an alternative is to simplify our method by learning a single DGM that jointly generates an image and a CLIP embedding. See our paper for the full details!

An unrealistic image featuring a distorted road. We prompted Stable Diffusion with “Aerial photography.”

A more realistic image. We prompted Stable Diffusion with “Aerial photography of a patchwork of small green fields separated by brown dirt tracks between them. A large tarmac road passes through the scene from left to right.”