Diffusion Models: From DDPM to Stable Diffusion Explained
Diffusion models have revolutionized the field of generative AI, enabling machines to create stunning images, art, and visual content from simple text descriptions. From the foundational denoising diffusion probabilistic models (DDPM) to the widely adopted Stable Diffusion, these generative models have transformed how we think about image generation and creative AI applications.

In this comprehensive guide, we’ll explore the mathematical foundations, architectures, and practical implementations of diffusion models. Whether you’re a researcher, developer, or AI enthusiast, understanding these models is crucial for working with modern generative AI systems.
1. Understanding the core concept of diffusion models
What are diffusion models?
Diffusion models are a class of generative models that learn to create data by reversing a gradual noising process. Imagine watching a video of ink slowly dispersing in water—then playing that video backward to see the ink reconstitute itself. That’s essentially what a diffusion model does with images.
The process works in two phases:
- Forward diffusion process: Gradually adds Gaussian noise to data over many timesteps until it becomes pure noise
- Reverse diffusion process: Learns to remove noise step-by-step, reconstructing the original data
This approach differs fundamentally from other generative models like GANs (Generative Adversarial Networks) or VAEs (Variational Autoencoders). While GANs use adversarial training and VAEs compress data into a latent space, diffusion models rely on iterative refinement through noise prediction.
The intuition behind diffusion
Think of a diffusion model as an artist who has forgotten how to draw. To relearn, they practice by taking finished paintings and gradually obscuring them with random brushstrokes. Once they understand this degradation process perfectly, they can work backward—starting with chaos and systematically removing randomness until a coherent image emerges.
This intuition translates into a powerful mathematical framework where the model learns the score (gradient) of the data distribution at different noise levels. By following these gradients, the model can navigate from pure noise back to realistic data samples.
2. DDPM: The foundation of modern diffusion models
Denoising diffusion probabilistic models explained
Denoising diffusion probabilistic models (DDPM), introduced by Ho et al. in 2020, formalized the diffusion process into a tractable probabilistic framework. DDPM consists of a fixed forward process and a learned reverse process.
The forward process adds noise according to a variance schedule \(\beta_t\) over \(T\) timesteps:
$$ q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I) $$
A key insight is that we can sample \(x_t\) directly from \(x_0\) using the reparameterization:
$$ x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon $$
where \(\bar{\alpha}_t = \prod_{i=1}^{t} (1 - \beta_i)\) and \(\epsilon \sim \mathcal{N}(0, I)\).
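This closed form is easy to check numerically. The sketch below (NumPy, using the common linear schedule from the DDPM paper) precomputes \(\bar{\alpha}_t\) and samples \(x_t\) directly; note that the squared coefficients sum to one, which is why this process is called variance-preserving:

```python
import numpy as np

# Common linear schedule: beta from 1e-4 to 0.02 over 1000 steps
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)

def forward_sample(x0, t, rng):
    """Sample x_t directly from x_0 via the closed-form reparameterization."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alphas_cumprod[t]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

rng = np.random.default_rng(0)
x0 = np.ones((8, 8))
x_mid = forward_sample(x0, 500, rng)    # partially noised
x_end = forward_sample(x0, T - 1, rng)  # nearly pure noise: sqrt(a_bar) ~ 0.006
```

By the final timestep the signal coefficient \(\sqrt{\bar{\alpha}_T}\) has shrunk to roughly 0.006, so \(x_T\) is statistically indistinguishable from Gaussian noise.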
The reverse process and noise prediction
The reverse process learns to denoise by predicting the noise that was added:
$$ p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)) $$
Instead of directly predicting \(x_{t-1}\), DDPM trains a neural network \(\epsilon_\theta\) to predict the noise \(\epsilon\), which simplifies the training objective to:
$$ L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon} \left[ \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2 \right] $$
This means the model learns to estimate what noise was added at each timestep, allowing it to iteratively remove that noise.
Implementing DDPM in Python
Here’s a simplified implementation of the core DDPM components:
```python
import torch
import torch.nn as nn

class DDPM:
    def __init__(self, num_timesteps=1000, beta_start=0.0001, beta_end=0.02):
        self.num_timesteps = num_timesteps
        # Linear beta schedule
        self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
        self.alphas = 1 - self.betas
        self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)
        # Precompute values for the forward process
        self.sqrt_alphas_cumprod = torch.sqrt(self.alphas_cumprod)
        self.sqrt_one_minus_alphas_cumprod = torch.sqrt(1 - self.alphas_cumprod)

    def forward_diffusion(self, x0, t, noise=None):
        """Add noise to data according to timestep t."""
        if noise is None:
            noise = torch.randn_like(x0)
        # Move precomputed tables to the data's device before indexing
        sqrt_alpha_cumprod_t = self.sqrt_alphas_cumprod.to(x0.device)[t].reshape(-1, 1, 1, 1)
        sqrt_one_minus_alpha_cumprod_t = self.sqrt_one_minus_alphas_cumprod.to(x0.device)[t].reshape(-1, 1, 1, 1)
        # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
        return sqrt_alpha_cumprod_t * x0 + sqrt_one_minus_alpha_cumprod_t * noise, noise

    def reverse_diffusion(self, model, x_t, t):
        """Remove one step of noise using the trained model."""
        # Predict the noise
        predicted_noise = model(x_t, t)
        alpha_t = self.alphas.to(x_t.device)[t].reshape(-1, 1, 1, 1)
        alpha_cumprod_t = self.alphas_cumprod.to(x_t.device)[t].reshape(-1, 1, 1, 1)
        beta_t = self.betas.to(x_t.device)[t].reshape(-1, 1, 1, 1)
        # Compute the mean of x_{t-1}
        mean = (1 / torch.sqrt(alpha_t)) * (
            x_t - (beta_t / torch.sqrt(1 - alpha_cumprod_t)) * predicted_noise
        )
        if t[0] > 0:
            noise = torch.randn_like(x_t)
            variance = beta_t
            return mean + torch.sqrt(variance) * noise
        else:
            return mean

# Training loop example
def train_ddpm(model, dataloader, ddpm, epochs=100, device='cuda'):
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
    for epoch in range(epochs):
        for batch in dataloader:
            x0 = batch[0].to(device)
            # Sample random timesteps
            t = torch.randint(0, ddpm.num_timesteps, (x0.shape[0],), device=device)
            # Add noise
            x_t, noise = ddpm.forward_diffusion(x0, t)
            # Predict the noise and regress it against the true noise
            predicted_noise = model(x_t, t)
            loss = nn.MSELoss()(predicted_noise, noise)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```
This implementation shows how DDPM adds noise during training and learns to predict that noise for removal during generation.
3. Score-based models and their connection to diffusion
Understanding score-based generative models
Score-based models provide an alternative perspective on diffusion that illuminates the underlying mathematics. The “score” refers to the gradient of the log probability density:
$$ s_\theta(x, t) = \nabla_x \log p_t(x) $$
This score tells us which direction to move in data space to increase probability density. Score-based models learn this gradient field at different noise levels, then use it to transform noise into data through a process called Langevin dynamics.
The equivalence between score matching and denoising
A remarkable discovery is that denoising and score matching are fundamentally equivalent. When a diffusion model predicts noise \(\epsilon_\theta(x_t, t)\), it implicitly estimates the score:
$$ s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}} $$
This connection reveals that DDPM performs score matching under the hood. The denoising objective naturally trains the model to estimate probability gradients, which can then guide the reverse diffusion process.
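The conversion is a one-liner. As a sanity check, consider data concentrated at \(x_0 = 0\): then \(x_t \sim \mathcal{N}(0, (1-\bar{\alpha}_t)I)\), whose true score is \(-x_t/(1-\bar{\alpha}_t)\), and a perfect noise predictor \(\epsilon = x_t/\sqrt{1-\bar{\alpha}_t}\) recovers exactly that (a minimal NumPy sketch, with illustrative values):

```python
import numpy as np

def noise_to_score(eps_pred, alpha_bar_t):
    """Turn a noise prediction into a score estimate:
    s(x_t, t) = -eps_theta(x_t, t) / sqrt(1 - alpha_bar_t)."""
    return -eps_pred / np.sqrt(1.0 - alpha_bar_t)

# Toy check with data concentrated at x0 = 0
a_bar = 0.3
x_t = np.array([0.5, -1.2])
eps_perfect = x_t / np.sqrt(1.0 - a_bar)    # the noise that produced x_t
score = noise_to_score(eps_perfect, a_bar)  # equals -x_t / (1 - a_bar)
```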
Stochastic differential equations (SDEs)
Score-based models formalize diffusion as a continuous-time stochastic differential equation:
$$ dx = f(x, t)dt + g(t)dw $$
where \(f\) is the drift coefficient, \(g\) is the diffusion coefficient, and \(dw\) represents Brownian motion. The reverse-time SDE allows sampling:
$$ dx = [f(x, t) - g(t)^2 \nabla_x \log p_t(x)]dt + g(t)d\bar{w} $$
This framework unifies different diffusion formulations and enables flexible sampling strategies through various numerical SDE solvers.
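As an illustration, here is a toy Euler-Maruyama integration of the reverse-time VP SDE, where \(f(x,t) = -\tfrac{1}{2}\beta(t)x\) and \(g(t) = \sqrt{\beta(t)}\). If the data distribution is \(\mathcal{N}(0, I)\), the perturbed distribution stays \(\mathcal{N}(0, I)\) and the true score is simply \(-x\), so sampling should recover a standard normal. The \(\beta(t)\) range follows a common VP-SDE choice; the step and sample counts are illustrative:

```python
import numpy as np

def reverse_sde_sample(score_fn, n_steps=500, n_samples=2000, dim=2, seed=0):
    """Euler-Maruyama solver for the reverse-time VP SDE,
    integrating from t = 1 down to t = 0."""
    rng = np.random.default_rng(seed)
    beta = lambda t: 0.1 + 19.9 * t            # linear beta(t) on [0, 1]
    x = rng.standard_normal((n_samples, dim))  # start from pure noise
    dt = 1.0 / n_steps
    for i in range(n_steps, 0, -1):
        t = i * dt
        b = beta(t)
        # Drift of the reverse SDE: f(x, t) - g(t)^2 * score
        drift = -0.5 * b * x - b * score_fn(x, t)
        x = x - drift * dt + np.sqrt(b * dt) * rng.standard_normal(x.shape)
    return x

# For N(0, I) data the exact score is -x; samples should be ~ N(0, I)
samples = reverse_sde_sample(lambda x, t: -x)
```

Swapping the Euler-Maruyama loop for a higher-order solver is exactly the kind of flexibility the SDE view enables.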
4. Latent diffusion models: Making diffusion efficient
The computational challenge
While DDPM produces impressive results, generating high-resolution images requires hundreds or thousands of denoising steps in pixel space. For a 512×512 RGB image, each step processes over 786,000 values (512 × 512 × 3). This computational expense limits practical applications.
Latent diffusion models solve this problem by performing diffusion in a compressed latent space rather than pixel space. This architectural innovation makes high-resolution image generation practical.
How latent diffusion works
The latent diffusion architecture consists of three main components:
- Autoencoder: A pretrained VAE compresses images into a lower-dimensional latent space
- Diffusion model: Operates in this compressed space, learning to denoise latent representations
- Conditioning mechanism: Allows control through text, images, or other modalities
The process looks like:
- Encode image to latent: \(z = \mathcal{E}(x)\)
- Apply diffusion in latent space: \(z_t = \sqrt{\bar{\alpha}_t} z + \sqrt{1-\bar{\alpha}_t} \epsilon\)
- Denoise latent: \(z_0 = \text{DiffusionModel}(z_T)\)
- Decode to image: \(x = \mathcal{D}(z_0)\)
This approach reduces computational requirements by 4-16× while maintaining or improving quality.
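The four steps above can be sketched end to end. The encoder and decoder below are hypothetical stand-ins (fixed average pooling and nearest-neighbor upsampling) used only to show shapes and data flow; a real latent diffusion model uses the learned VAE, and step 3 would run the trained denoiser:

```python
import numpy as np

def encode(x):
    """Stand-in encoder: 8x spatial downsampling, mirroring Stable Diffusion's
    512x512x3 image -> 64x64x4 latent shape."""
    h, w, _ = x.shape
    z = x.reshape(h // 8, 8, w // 8, 8, 3).mean(axis=(1, 3))
    return np.concatenate([z, z.mean(axis=-1, keepdims=True)], axis=-1)

def decode(z):
    """Stand-in decoder: nearest-neighbor upsampling back to pixel space."""
    return np.repeat(np.repeat(z[..., :3], 8, axis=0), 8, axis=1)

rng = np.random.default_rng(0)
x = rng.random((512, 512, 3))
z = encode(x)                     # 1. encode image to latent
a_bar = 0.5
z_t = np.sqrt(a_bar) * z + np.sqrt(1 - a_bar) * rng.standard_normal(z.shape)  # 2. noise the latent
z0 = z                            # 3. (the trained model would denoise z_t here)
x_rec = decode(z0)                # 4. decode back to pixels
```

Note that the diffusion model only ever touches the 64×64×4 latent, a small fraction of the 512×512×3 pixel grid.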
Cross-attention for conditioning
Latent diffusion models use cross-attention mechanisms to incorporate conditioning information like text prompts. The denoising U-Net includes cross-attention layers where:
- Query: Comes from the noisy latent representation
- Key and Value: Come from encoded conditioning information (e.g., text embeddings)
This allows the model to attend to relevant parts of the conditioning signal when denoising specific regions of the latent:
```python
class CrossAttentionBlock(nn.Module):
    def __init__(self, dim, context_dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(context_dim, dim, bias=False)
        self.to_v = nn.Linear(context_dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x, context):
        """
        x: latent features [batch, height*width, dim]
        context: conditioning [batch, seq_len, context_dim]
        """
        batch_size, seq_len, _ = x.shape
        # Queries come from the latent; keys and values from the context
        q = self.to_q(x)
        k = self.to_k(context)
        v = self.to_v(context)
        # Reshape for multi-head attention
        head_dim = q.shape[-1] // self.num_heads
        q = q.reshape(batch_size, seq_len, self.num_heads, head_dim).transpose(1, 2)
        k = k.reshape(batch_size, -1, self.num_heads, head_dim).transpose(1, 2)
        v = v.reshape(batch_size, -1, self.num_heads, head_dim).transpose(1, 2)
        # Scaled dot-product attention
        attention_scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale
        attention_probs = torch.softmax(attention_scores, dim=-1)
        # Apply attention to the values
        out = torch.matmul(attention_probs, v)
        out = out.transpose(1, 2).reshape(batch_size, seq_len, -1)
        return self.to_out(out)
```
5. Stable Diffusion: Architecture and innovations
The Stable Diffusion pipeline
Stable Diffusion represents the culmination of latent diffusion research, combining multiple innovations into a practical, open-source system for text-to-image generation. The complete pipeline includes:
- Text encoder: CLIP text encoder converts prompts into embeddings
- Latent diffusion model: U-Net denoises latent representations conditioned on text
- VAE decoder: Converts final latent back to pixel space
- Scheduler: Controls the denoising trajectory and sampling algorithm
The architecture processes a text prompt through these stages:
```python
class StableDiffusionPipeline:
    def __init__(self, vae, text_encoder, unet, scheduler):
        self.vae = vae
        self.text_encoder = text_encoder
        self.unet = unet
        self.scheduler = scheduler

    def generate(self, prompt, num_inference_steps=50, guidance_scale=7.5):
        # Encode the text prompt
        text_embeddings = self.text_encoder(prompt)
        # Unconditional embeddings for classifier-free guidance
        uncond_embeddings = self.text_encoder("")
        # Combine for guidance
        text_embeddings = torch.cat([uncond_embeddings, text_embeddings])
        # Start from random noise in latent space
        latent = torch.randn((1, 4, 64, 64))
        # Set the timestep schedule
        self.scheduler.set_timesteps(num_inference_steps)
        # Denoising loop
        for t in self.scheduler.timesteps:
            # Duplicate the latent for classifier-free guidance
            latent_model_input = torch.cat([latent] * 2)
            # Predict the noise residual
            noise_pred = self.unet(latent_model_input, t, text_embeddings)
            # Perform guidance
            noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
            noise_pred = noise_pred_uncond + guidance_scale * (
                noise_pred_text - noise_pred_uncond
            )
            # Step to the previous (less noisy) sample
            latent = self.scheduler.step(noise_pred, t, latent)
        # Decode the latent back to an image
        image = self.vae.decode(latent)
        return image
```
Classifier-free guidance
Classifier-free guidance is a crucial technique that allows Stable Diffusion to generate images that closely follow text prompts. During training, the model randomly drops conditioning information, learning both conditional and unconditional distributions.
During inference, the model predicts noise both with and without conditioning, then combines predictions:
$$ \tilde{\epsilon}_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, \emptyset) + w \cdot \left( \epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset) \right) $$
where \(w\) is the guidance scale. Higher values increase prompt adherence but may reduce diversity. Typical values range from 7 to 15.
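The combination itself is a one-liner, shown here as a standalone function with scalar stand-ins for the noise predictions (names and values are illustrative):

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Extrapolate from the unconditional toward the conditional prediction."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# guidance_scale = 1 reduces to the plain conditional prediction;
# larger values push the output further along the conditional direction
guided = classifier_free_guidance(0.2, 0.4, 7.5)  # 0.2 + 7.5 * 0.2 = 1.7
```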
Sampling algorithms and schedulers
The scheduler determines how the model traverses from noise to data. Different schedulers offer trade-offs between quality and speed:
- DDPM: Original sampling, requires many steps (1000+)
- DDIM: Deterministic sampling, allows fewer steps (50-100)
- DPM-Solver: Fast solver using differential equations (20-25 steps)
- Euler ancestral: Adds controlled noise for diversity
Here’s an example DDIM scheduler implementation:
```python
class DDIMScheduler:
    def __init__(self, num_train_timesteps=1000, beta_start=0.0001, beta_end=0.02):
        self.num_train_timesteps = num_train_timesteps
        self.betas = torch.linspace(beta_start, beta_end, num_train_timesteps)
        self.alphas = 1 - self.betas
        self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)

    def set_timesteps(self, num_inference_steps):
        # Use an evenly spaced subset of timesteps for faster sampling
        step_ratio = self.num_train_timesteps // num_inference_steps
        self.timesteps = torch.arange(0, self.num_train_timesteps, step_ratio).flip(0)

    def step(self, noise_pred, timestep, sample, eta=0.0):
        """Perform one DDIM step."""
        prev_timestep = timestep - self.num_train_timesteps // len(self.timesteps)
        alpha_prod_t = self.alphas_cumprod[timestep]
        alpha_prod_t_prev = (
            self.alphas_cumprod[prev_timestep] if prev_timestep >= 0
            else torch.tensor(1.0)
        )
        # Predict x_0 from the current sample and the noise estimate
        pred_original_sample = (
            sample - torch.sqrt(1 - alpha_prod_t) * noise_pred
        ) / torch.sqrt(alpha_prod_t)
        # Variance term (zero when eta == 0, i.e. fully deterministic DDIM)
        variance = (1 - alpha_prod_t_prev) / (1 - alpha_prod_t) * (1 - alpha_prod_t / alpha_prod_t_prev)
        std_dev_t = eta * torch.sqrt(variance)
        # Direction pointing toward x_t
        pred_sample_direction = torch.sqrt(1 - alpha_prod_t_prev - std_dev_t**2) * noise_pred
        # Compute x_{t-1}
        prev_sample = torch.sqrt(alpha_prod_t_prev) * pred_original_sample + pred_sample_direction
        if eta > 0:
            noise = torch.randn_like(sample)
            prev_sample = prev_sample + std_dev_t * noise
        return prev_sample
```
6. Advanced techniques and applications
Image-to-image and inpainting
Diffusion models excel at image-to-image translation and inpainting by starting from partially noised images rather than pure noise. For image-to-image generation:
- Encode the input image to latent space
- Add noise according to a strength parameter (e.g., 50% noise)
- Denoise using the text prompt as guidance
- Decode the result
This allows for controlled modifications while preserving structure:
```python
def img2img(pipeline, init_image, prompt, strength=0.8, num_steps=50):
    # Encode the initial image
    latent = pipeline.vae.encode(init_image)
    # strength controls how much of the schedule is re-run:
    # strength=1.0 starts from pure noise, strength=0.0 returns the input
    init_timestep = int(num_steps * strength)
    t_start = max(num_steps - init_timestep, 0)
    start_timestep = pipeline.scheduler.timesteps[t_start]
    # Noise the latent up to the starting timestep
    noise = torch.randn_like(latent)
    latent = pipeline.scheduler.add_noise(latent, noise, start_timestep)
    # Denoise from this point on
    for t in pipeline.scheduler.timesteps[t_start:]:
        latent = pipeline.denoise_step(latent, t, prompt)
    return pipeline.vae.decode(latent)
```
ControlNet and spatial conditioning
ControlNet adds precise spatial control to diffusion models by processing additional input conditions (edges, poses, depth maps) through a parallel network architecture. This enables applications like:
- Pose-guided character generation
- Depth-aware scene composition
- Edge-guided image synthesis
The ControlNet copies the U-Net weights and processes control images alongside the main diffusion process, adding its outputs to the main network through zero-initialized connections.
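The zero-initialization trick is easy to see in isolation. Below is a minimal sketch (not the actual ControlNet code): a 1×1 convolution initialized to zero contributes nothing at the start of training, so the frozen U-Net's behavior is preserved until the ControlNet branch learns something useful:

```python
import torch
import torch.nn as nn

def zero_conv(channels):
    """1x1 convolution with weights and bias initialized to zero."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

main_features = torch.randn(1, 8, 16, 16)     # from the frozen U-Net
control_features = torch.randn(1, 8, 16, 16)  # from the trainable copy
fused = main_features + zero_conv(8)(control_features)
# At initialization the fusion is an identity: fused == main_features
```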
Fine-tuning and personalization
Several techniques enable customizing Stable Diffusion for specific subjects or styles:
- DreamBooth: Fine-tunes the entire model on 3-5 images with a unique identifier
- Textual Inversion: Learns new token embeddings while keeping the model frozen
- LoRA: Learns low-rank adaptations of attention weights, requiring minimal storage
These methods make personalized image generation accessible without massive computational resources.
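As an example of how small LoRA updates are, here is a minimal sketch of a LoRA-wrapped linear layer. It is simplified relative to real implementations such as the peft library, which add dropout, weight merging, and per-module targeting:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable low-rank update: W x + (alpha/r) B A x."""
    def __init__(self, base: nn.Linear, r=4, alpha=4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # base weights stay frozen
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T) @ self.lora_b.T

layer = LoRALinear(nn.Linear(64, 64), r=4)
# B starts at zero, so the wrapped layer matches the base layer exactly at init,
# and only 4 * (64 + 64) = 512 parameters are trainable (vs 4160 in the base)
```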
7. Practical considerations and future directions
Training considerations
Training diffusion models requires careful attention to several factors:
- Batch size and resolution: Larger batches stabilize training but require more memory. Start at 64×64 or 128×128 resolution and increase gradually.
- Learning rate scheduling: Constant or cosine schedules work well. Typical values: 1e-4 to 2e-4.
- Noise schedule: Linear schedules work for DDPM, but cosine schedules often perform better for high-resolution images.
- EMA weights: Maintaining exponential moving averages of model weights improves sample quality.
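The EMA update itself is only a few lines. A minimal sketch follows; real trainers typically keep a full shadow copy of the model and swap it in for evaluation:

```python
import torch

@torch.no_grad()
def ema_update(ema_params, model_params, decay=0.9999):
    """In-place EMA step: ema <- decay * ema + (1 - decay) * current."""
    for ema_p, p in zip(ema_params, model_params):
        ema_p.mul_(decay).add_(p, alpha=1 - decay)

# Tiny demo: with decay 0.9 the EMA moves 10% of the way toward the new value
ema = [torch.tensor(0.0)]
ema_update(ema, [torch.tensor(1.0)], decay=0.9)
```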
Optimization and efficiency
Several techniques accelerate diffusion model inference:
- Reduced precision: Float16 or bfloat16 reduces memory and increases speed
- Flash attention: Optimized attention implementations reduce computational cost
- Model distillation: Trains smaller models to mimic larger ones in fewer steps
- Latent consistency models: Recent approaches enable single-step generation
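Reduced precision, for instance, is often a one-context-manager change with torch.autocast. The snippet below runs on CPU with bfloat16 for portability; on GPUs, float16 together with a gradient scaler is the usual training setup:

```python
import torch

x = torch.randn(8, 32)
w = torch.randn(32, 32)

# Matrix multiplies inside the autocast region run in reduced precision;
# the inputs themselves stay float32
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = x @ w
```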
Challenges and limitations
Despite their success, diffusion models face ongoing challenges:
- Text-image alignment: Models sometimes struggle with complex spatial relationships or counting
- Bias and fairness: Training data biases can propagate to generated images
- Computational cost: Generation still requires significant resources
- Fine details: Very small or intricate details may appear inconsistent
Emerging research directions
The field continues to evolve rapidly with exciting developments:
- Video diffusion: Extending models to generate coherent video sequences
- 3D generation: Creating three-dimensional objects and scenes
- Multi-modal fusion: Combining text, audio, and image conditioning
- Faster sampling: Reducing inference steps while maintaining quality
- Disentangled control: Separating style, content, and composition