Demystifying AI Generated Art
tl;dr: Diving into how diffusion models like DALL-E and Stable Diffusion generate art.
Generative AI has advanced in leaps and bounds over the past year with the release of LLMs (large language models) like ChatGPT from OpenAI and Llama 2 from Meta, as well as image generation models like DALL-E from OpenAI and Stable Diffusion from stability.ai.
These image generation models can convert a simple image description like
“A car on a road, sci-fi style”
into beautiful, photo-realistic images like this
In this article, we are going to take a deep dive into how Stable Diffusion (the current state-of-the-art image generation model) works.
How do diffusion models work?
Diffusion models are a type of generative model, which means they are built to generate outputs similar to the ones they were trained on.
The diffusion process has two stages:
Forward Diffusion
This process progressively adds noise to the image until it is converted into uncharacteristic noise. Uncharacteristic noise means you can’t tell whether the original image was a dog, a cat or maybe even a car. This is a very important step in the process.
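To make the forward process concrete, here is a rough PyTorch sketch. The linear noise schedule and tensor shapes here are illustrative assumptions, not Stable Diffusion's exact values.

```python
import torch

# Toy sketch: add noise to an image tensor `x0` at timestep `t`.
# The schedule values follow a standard DDPM-style linear beta schedule (assumption).
num_steps = 1000
betas = torch.linspace(1e-4, 0.02, num_steps)          # noise added per step
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)      # cumulative fraction of signal kept

def add_noise(x0: torch.Tensor, t: int):
    """Jump straight to step t of forward diffusion using the closed-form expression."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]
    xt = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    return xt, noise

x0 = torch.rand(3, 512, 512)        # a fake "image" in [0, 1]
xt, noise = add_noise(x0, t=750)    # heavily noised: the original content is barely recognizable
```

The larger `t` is, the closer `xt` gets to pure, uncharacteristic noise.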
Reverse Diffusion
This is the fun part! The reverse diffusion process tries to reconstruct the original training image from the noisy image we got from Forward Diffusion.
The reconstructed image isn’t always identical to the original, since randomness comes into play here. Now we know what the model needs to do. But the question is HOW 🤔
Training Process
To reverse the diffusion process, we need to find out how much noise was added to the image. To achieve this, we train a neural network to predict the noise that was added. This network is called a noise predictor, and it is a U-Net model.
The steps are as follows (a minimal training-loop sketch follows this list):
- Pick a random training image
- Generate some noise
- Add this noise to the training image for a certain number of steps
- Teach the noise predictor how much noise was added
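Here is a rough PyTorch sketch of that loop. The `unet`, `optimizer` and `images` objects are placeholders, and the noise schedule is the same illustrative assumption as before.

```python
import torch
import torch.nn.functional as F

# Assumed placeholders: `unet(noisy, t)` is a noise-predicting U-Net,
# `optimizer` is its optimizer, and `images` is a batch of training images.
num_steps = 1000
betas = torch.linspace(1e-4, 0.02, num_steps)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def train_step(unet, optimizer, images):
    t = torch.randint(0, num_steps, (images.shape[0],))        # pick a random timestep per image
    noise = torch.randn_like(images)                            # generate some noise
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = a_bar.sqrt() * images + (1 - a_bar).sqrt() * noise  # add it to the training images
    predicted_noise = unet(noisy, t)                            # the model guesses the noise
    loss = F.mse_loss(predicted_noise, noise)                   # teach it how much was actually added
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```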
🔥 Now we have a fully trained noise predictor
Inference
To use this noise predictor, we start from a randomly generated noisy image. The noise predictor estimates how much noise it contains, and we then remove that noise from the image. We repeat this process for a specified number of sampling steps.
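As a rough sketch (again, `unet` and the `remove_noise` scheduler update are placeholders, not a specific library API):

```python
import torch

# `unet(x, t)` predicts the noise present at step t; `remove_noise(x, pred, t)` stands in
# for the scheduler's update rule (e.g. a DDPM/DDIM step) -- both are assumed placeholders.
@torch.no_grad()
def sample(unet, remove_noise, shape=(1, 3, 512, 512), num_sampling_steps=50):
    x = torch.randn(shape)                        # start from pure noise
    for t in reversed(range(num_sampling_steps)):
        predicted_noise = unet(x, t)              # estimate how much noise is present
        x = remove_noise(x, predicted_noise, t)   # strip a little of it away
    return x                                      # the final denoised image
```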
The process we discussed above runs in pixel space, which holds data for the red, green and blue channels of every single pixel. A single image of resolution 512x512 therefore has 786,432 dimensions to it. Running these computations in pixel space is very, very slow and requires several GPUs at a minimum 🤯. But Stable Diffusion has a trick up its sleeve 👇
Stable Diffusion
Latent Space
To overcome these computational speed issues, we have Latent Diffusion Models, such as the Stable Diffusion family of models, which compress the high-dimensional pixel space into something called latent space.
Latent space is 48x smaller than pixel space, which makes the computation dramatically faster and unlocks the ability to run inference on a single GPU at decent speeds.
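The 48x figure falls straight out of the shapes involved (Stable Diffusion v1 maps a 512x512 RGB image to a 4-channel 64x64 latent):

```python
pixel_dims = 512 * 512 * 3    # RGB pixel space
latent_dims = 64 * 64 * 4     # Stable Diffusion's latent: 64x64 spatial, 4 channels
print(pixel_dims, latent_dims, pixel_dims // latent_dims)
# 786432 16384 48
```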
Variational Autoencoder (VAE)
To perform the compression we use a technique called the variational autoencoder (VAE). It consists of two parts:
- Encoder - It handles compressing the image from pixel space to latent space
- Decoder - It handles converting the image from latent space back to pixel space
The image can be compressed into latent space with very little loss of the information that matters. This is possible because natural images are not random. For example, faces follow a certain spatial relationship between the eyes, nose and other features. You can read more about this here → Manifold Hypothesis.
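As a hedged sketch of a VAE round trip using the Hugging Face diffusers library (assuming the publicly available `stabilityai/sd-vae-ft-mse` checkpoint; the input image here is just a random tensor standing in for a real photo):

```python
import torch
from diffusers import AutoencoderKL

# Sketch: round-trip an "image" through the VAE.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

image = torch.rand(1, 3, 512, 512) * 2 - 1               # fake image scaled to [-1, 1]
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()     # Encoder: pixel -> latent, (1, 4, 64, 64)
    reconstruction = vae.decode(latents).sample          # Decoder: latent -> pixel, (1, 3, 512, 512)
print(latents.shape, reconstruction.shape)
```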
Inference in Latent Space
Inference in latent space is mostly the same as in pixel space, except that a random latent-space matrix is used instead of a generated noise image. An additional VAE decoder step is added after inference completes to convert the latent matrix back into a regular image (pixel space), which is our final generated image.
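Sketched out, the only changes from the pixel-space loop are the starting point and the final decode (`unet`, `remove_noise` and `vae` are placeholders as before; a real pipeline also applies a latent scaling factor, which is omitted here):

```python
import torch

@torch.no_grad()
def sample_latent(unet, remove_noise, vae, num_sampling_steps=50):
    latents = torch.randn(1, 4, 64, 64)                   # random latent matrix, not a noisy image
    for t in reversed(range(num_sampling_steps)):
        predicted_noise = unet(latents, t)
        latents = remove_noise(latents, predicted_noise, t)
    image = vae.decode(latents).sample                    # extra VAE decoder step back to pixel space
    return image
```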
Conditioning
In the steps we discussed above, we never specified what we wanted the model to generate. Steering the model to produce a certain kind of desired result is known as "conditioning".
There are several types of conditioning, such as:
- Text conditioning (aka prompting)
- Inpainting
- Outpainting
- ControlNets
- and more…
In this article, we will only look at text conditioning, which is the most widely used method and also underpins several of the other conditioning methods.
Text Conditioning
Here is a high-level view of how the text prompts are processed and fed into the noise predictor (U-NET). This might look familiar to some who know about the transformer model architecture for LLMs.
Tokenizer
The text prompt is tokenized using the CLIP tokenizer. Tokenization allows the model to process the prompt without having to understand “words”: a word doesn’t always correspond to a single token, and one word may be split into multiple tokens.
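For example, with the Hugging Face transformers implementation of the CLIP tokenizer (assuming the public `openai/clip-vit-large-patch14` checkpoint):

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

# Prints the sub-word tokens for the prompt; hyphenated or rare words like
# "sci-fi" typically split into more than one token.
print(tokenizer.tokenize("A car on a road, sci-fi style"))
```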
Embedding Model
The embedding model converts these tokens into vectors. Each token maps to a fixed embedding vector that was learned while the embedding model was trained. Vector embeddings allow computers to understand how semantically similar two tokens are by measuring the distance between their vectors. Stable Diffusion uses OpenAI’s ViT-L/14 CLIP model.
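A rough sketch of producing those embeddings with Hugging Face transformers (assuming the same public CLIP checkpoint):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Pad/truncate to CLIP's fixed context length of 77 tokens.
tokens = tokenizer("A blue car on the road", padding="max_length",
                   max_length=77, return_tensors="pt")
with torch.no_grad():
    embeddings = text_encoder(tokens.input_ids).last_hidden_state
print(embeddings.shape)   # (1, 77, 768): one 768-dim vector per token position
```

This `(77, 768)` sequence of vectors is what eventually gets fed to the noise predictor.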
Text Transformer
The Text Transformer is the final step in the pipeline for processing the text prompt. It serves as an adapter for other conditioning methods. Its inputs are not limited to text; they can include images, depth maps and a variety of other conditioning inputs.
Cross-Attention Mechanism
The Noise Predictor (U-NET) ingests the output of the text transformer via a cross-attention mechanism. It has two parts to it:
(a) Self Attention (within the prompt)
Assume the text prompt is as follows:
A blue car on the road
The self-attention mechanism pairs up “blue” and “car” so the model generates images with a “blue car” and not a “blue road”. For an in-depth look into this, read the Attention is all you need paper.
(b) Cross Attention (between prompt and image)
The model then uses the information from (a) to guide the reverse diffusion process to generate images containing blue cars.
This is a very important part of the conditioning pipeline, so much so that modifying its functionality can change the style of the generated images. Small networks that modify these cross-attention layers to fine-tune model outputs are known as hypernetworks; you can read about them here.
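If you are curious what cross-attention boils down to, here is a toy PyTorch sketch. The dimensions are illustrative, not Stable Diffusion's exact ones: queries come from the image latents, keys and values come from the text embeddings.

```python
import torch
import torch.nn.functional as F

d = 64
to_q = torch.nn.Linear(d, d)     # projects latent features -> queries
to_k = torch.nn.Linear(768, d)   # projects text embeddings -> keys
to_v = torch.nn.Linear(768, d)   # projects text embeddings -> values

latent_features = torch.randn(1, 64 * 64, d)   # flattened spatial positions of the latent
text_embeddings = torch.randn(1, 77, 768)      # output of the text transformer

q, k, v = to_q(latent_features), to_k(text_embeddings), to_v(text_embeddings)
attn = F.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1)   # (1, 4096, 77): each image position attends to prompt tokens
out = attn @ v                                                # (1, 4096, 64): text-guided latent features
```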
End-to-End Pipeline Overview
Based on everything we have discussed above, here is what the finished pipeline looks like.
and this is a visualization of how the noise is converted into an image.
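And if you just want to run the whole pipeline yourself, a hedged sketch with the Hugging Face diffusers library looks roughly like this (assuming the `runwayml/stable-diffusion-v1-5` checkpoint is available and you have a CUDA GPU):

```python
import torch
from diffusers import StableDiffusionPipeline

# Loads the tokenizer, text encoder, U-Net, VAE and scheduler in one object.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Prompt in, image out -- tokenization, conditioning, latent denoising and VAE decoding
# all happen inside this single call.
image = pipe("A car on a road, sci-fi style", num_inference_steps=50).images[0]
image.save("sci_fi_car.png")
```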
If you made it this far, 👏. Thank you for reading the article. I hope you found the information useful.