Denoising for generation. Score matching, noise schedules.
Diffusion models turn the generative modeling problem into something deceptively simple: learn to undo noise. If you can reliably denoise a sample that has been corrupted in a controlled way, you can start from pure noise and iteratively “walk back” to realistic data.
A diffusion model defines (1) a forward Markov chain that gradually adds Gaussian noise to data using a schedule {βₜ}, and (2) a learned network ε_θ(xₜ, t) that predicts the added noise (or equivalently the score ∇_x log p(xₜ)). Generation runs the reverse process: start from x_T ∼ 𝒩(0, I) and repeatedly denoise to sample x₀.
Diffusion models are generative models built around one central trick: instead of trying to model a complex data distribution p_data(x₀) directly, we define a destruction process that is easy to analyze (adding Gaussian noise over many small steps), then learn a construction process that reverses it.
Why this helps: the destruction process is fully known and easy to analyze, so learning its reverse reduces to a sequence of tractable denoising problems.
A diffusion model typically has two coupled processes indexed by discrete time t ∈ {1, …, T} (T can be 1000 in classical DDPMs; modern samplers often use fewer steps with improved solvers):
1) Forward (noising) process q: fixed (not learned), it gradually corrupts data with Gaussian noise until almost nothing of x₀ remains.
2) Reverse (denoising) process p_θ: learned, it walks back from noise toward data one step at a time.
The key learned object is a neural network that depends on the current noisy sample and the time index: ε_θ(xₜ, t).
From ε_θ you can derive other equivalent parameterizations: an estimate of the clean sample \hat{x}₀, the score s_θ, or the v-prediction target.
Even if you only remember one sentence: diffusion models work because “denoise step-by-step” is a tractable supervised learning task.
Connection to ideas you already know (VAEs): like a VAE, a diffusion model decodes a Gaussian latent into data and is trained with a variational (ELBO-style) bound; unlike a VAE, the "encoder" (the forward noising process) is fixed rather than learned.
Throughout this lesson we’ll use bold for vectors (e.g., x, ε), and assume images are flattened into vectors in ℝ^d (but all formulas hold per-pixel / per-dimension).
The forward process is a time-indexed Markov chain that gradually destroys information by adding Gaussian noise according to a noise schedule.
Because it gives you a controlled way to generate paired training data: for any clean x₀ and any timestep t, you can synthesize a noisy xₜ together with the exact noise ε that produced it. This turns generative modeling into supervised learning.
A common discrete-time forward process (DDPM) is:
q(xₜ | x_{t−1}) = 𝒩( xₜ ; √(αₜ) x_{t−1}, βₜ I )
where αₜ = 1 − βₜ and βₜ ∈ (0, 1) is the per-step noise variance.
Intuition: each step shrinks the current sample toward zero by a factor √(αₜ) and adds fresh Gaussian noise of variance βₜ. Over many steps, the signal decays and noise dominates.
A crucial simplification is that the composition of Gaussians stays Gaussian, so we can sample xₜ directly from x₀ without simulating every intermediate step.
Define the cumulative product:
ᾱₜ = ∏_{s=1}^t α_s
Then:
q(xₜ | x₀) = 𝒩( xₜ ; √(ᾱₜ) x₀, (1 − ᾱₜ) I )
Equivalently, we can reparameterize:
xₜ = √(ᾱₜ) x₀ + √(1 − ᾱₜ) ε, where ε ∼ 𝒩(0, I)
This single equation is the workhorse of diffusion training.
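The closed form maps directly to code. A minimal NumPy sketch (the function name `q_sample` is our own, not from any library):

```python
import numpy as np

def q_sample(x0, alpha_bar_t, eps):
    """Draw x_t ~ q(x_t | x_0) in one shot:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

rng = np.random.default_rng(0)
x0 = np.array([2.0, -1.0])
eps = rng.standard_normal(2)

# High alpha_bar (early t): mostly signal. Low alpha_bar (late t): mostly noise.
x_early = q_sample(x0, 0.99, eps)
x_late = q_sample(x0, 0.01, eps)
```

At ᾱₜ = 1 the sample is exactly x₀; at ᾱₜ = 0 it is exactly ε.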
Pause and interpret each term: √(ᾱₜ) x₀ is the attenuated signal and √(1 − ᾱₜ) ε is the accumulated noise; for unit-variance data their variances sum to 1, and as t grows the noise term dominates.
The schedule determines how quickly you destroy information.
Design goals: keep early steps small so fine detail is destroyed gradually, make ᾱ_T ≈ 0 so x_T is nearly pure noise, and let the SNR decay smoothly in t.
Common schedules:
| Schedule | How it behaves | Pros | Cons |
|---|---|---|---|
| Linear βₜ | βₜ increases linearly | Simple, classic DDPM | Not optimal SNR allocation |
| Cosine ᾱₜ | ᾱₜ follows a cosine curve | Good empirical performance, smooth | Needs careful discretization |
| Learned / piecewise | Optimized for the sampling steps used | Can be very fast at inference | Adds complexity |
A useful quantity is the SNR at time t:
SNRₜ = ᾱₜ / (1 − ᾱₜ)
Practical consequence: the network must learn to denoise across wildly different regimes. This is why the time embedding (conditioning on t) is essential.
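To make the schedules and the SNR concrete, here is a sketch that builds a linear-β schedule and a cosine-ᾱ schedule. The endpoints 1e-4 and 0.02 and the offset s = 0.008 are conventional choices from the literature, not requirements:

```python
import numpy as np

T = 1000

# Linear beta schedule (classic DDPM-style endpoints).
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar_linear = np.cumprod(alphas)

# Cosine schedule: define alpha_bar directly via a squared-cosine curve.
s = 0.008
t = np.arange(T + 1) / T
f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
alpha_bar_cosine = f[1:] / f[0]

def snr(alpha_bar):
    """Signal-to-noise ratio at a given cumulative alpha_bar."""
    return alpha_bar / (1.0 - alpha_bar)
```

Both schedules drive ᾱₜ monotonically toward 0, so the SNR falls from very high at t = 1 to near zero at t = T.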
The network ε_θ(xₜ, t) is conditioned on time t (or a continuous time value). In practice, t is mapped to an embedding (commonly sinusoidal features passed through a small MLP) and injected into the network's layers.
This enables one shared network to act like a family of denoisers, one for each noise level.
At this point you have a forward process q that is fixed (no learned parameters), Gaussian and Markov by construction, and cheap to sample at any timestep via the closed form.
Next we learn the reverse.
The learned model sits at the center of diffusion: ε_θ(xₜ, t). Superficially it’s “just” a network that predicts noise, but understanding what it represents explains why diffusion models work and how score matching appears.
When we write
xₜ = √(ᾱₜ) x₀ + √(1 − ᾱₜ) ε
the only randomness (given x₀ and t) is ε ∼ 𝒩(0, I). If a network can infer the likely ε from xₜ, it can recover information about x₀.
Noise prediction is attractive because the regression target ε is known exactly during training, has unit variance at every noise level, and leads to a plain MSE objective.
Sample x₀ ∼ p_data, t ∼ Uniform{1, …, T}, and ε ∼ 𝒩(0, I), then form xₜ = √(ᾱₜ) x₀ + √(1 − ᾱₜ) ε.
Then minimize:
L(θ) = 𝔼_{x₀,t,ε} [ ‖ ε − ε_θ(xₜ, t) ‖² ]
This is the “simple loss” from DDPM.
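The objective can be sketched end-to-end with a stand-in predictor. Everything here (`eps_predictor`, the fixed ᾱ, the batch size) is illustrative; a real system uses a neural network conditioned on t:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16           # data dimension
alpha_bar = 0.5  # one fixed noise level, for illustration only

def eps_predictor(x_t, t):
    """Stand-in for eps_theta: a useless predictor that always outputs zero.
    A real model would be a neural network taking (x_t, t)."""
    return np.zeros_like(x_t)

def simple_loss(x0_batch, alpha_bar, rng):
    """Monte-Carlo estimate of E||eps - eps_theta(x_t, t)||^2 (DDPM simple loss)."""
    eps = rng.standard_normal(x0_batch.shape)
    x_t = np.sqrt(alpha_bar) * x0_batch + np.sqrt(1.0 - alpha_bar) * eps
    pred = eps_predictor(x_t, t=None)
    return np.mean(np.sum((eps - pred) ** 2, axis=-1))

x0_batch = rng.standard_normal((4096, d))
loss = simple_loss(x0_batch, alpha_bar, rng)
# For the zero predictor, the loss should be near E||eps||^2 = d.
```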
At a fixed t, xₜ is a noisy version of real data. There are many possible clean x₀ that could have produced a given xₜ, but the network learns the conditional expectation of noise given xₜ and t.
For MSE regression, the optimal predictor is the conditional mean:
ε_θ*(xₜ,t) = 𝔼[ ε | xₜ, t ]
That conditional expectation encodes the structure of the data distribution because the posterior over x₀ given xₜ is shaped by p_data.
Rearrange the reparameterization:
xₜ = √(ᾱₜ) x₀ + √(1 − ᾱₜ) ε
Solve for x₀:
x₀ = ( xₜ − √(1 − ᾱₜ) ε ) / √(ᾱₜ)
So given ε_θ, we can define an implicit estimate of the clean sample:
\hat{x}₀(xₜ,t) = ( xₜ − √(1 − ᾱₜ) ε_θ(xₜ,t) ) / √(ᾱₜ)
This is used during sampling and for guidance methods.
The score of a density p(x) is:
∇_{x} log p(x)
Diffusion theory says the optimal denoiser corresponds to the score of the noisy distribution at each noise level.
For the forward process q, one can show a relationship of the form:
s_θ(xₜ,t) ≈ ∇_{xₜ} log qₜ(xₜ), where qₜ is the marginal distribution of xₜ,
and with noise prediction parameterization:
s_θ(xₜ,t) = − ε_θ(xₜ,t) / √(1 − ᾱₜ)
Up to conventions and scaling, predicting noise is equivalent to predicting the score.
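The scaling relationship can be checked numerically in a toy case where the data distribution is a single point, so qₜ(xₜ) is an explicit Gaussian and both the score and the Bayes-optimal noise prediction have closed forms (a sanity check, not a proof):

```python
import numpy as np

alpha_bar = 0.81
x0 = 2.0  # data distribution is a delta at x0, so q_t is one Gaussian

def analytic_score(x_t):
    """Score of q_t(x_t) = N(sqrt(alpha_bar)*x0, 1 - alpha_bar) in 1D."""
    return -(x_t - np.sqrt(alpha_bar) * x0) / (1.0 - alpha_bar)

def optimal_eps(x_t):
    """Bayes-optimal noise prediction for this toy case."""
    return (x_t - np.sqrt(alpha_bar) * x0) / np.sqrt(1.0 - alpha_bar)

x_t = 1.3
# Predicting noise and rescaling recovers the score exactly.
score_from_eps = -optimal_eps(x_t) / np.sqrt(1.0 - alpha_bar)
```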
If you know the score field ∇ log p(x), you know in which direction probability increases most steeply. Sampling methods (reverse SDE / Langevin-like dynamics) can use the score to push noise toward data manifold regions.
This gives diffusion a deep conceptual link: it’s not merely denoising; it’s learning a time-indexed family of score functions.
Uniformly sampling t often works, but it can overweight uninformative very-noisy steps or underweight crucial mid-SNR steps.
Many systems use weighted losses:
L(θ) = 𝔼[ w(t) ‖ ε − ε_θ(xₜ,t) ‖² ]
Common ideas include SNR-dependent weights w(t), emphasizing mid-SNR timesteps, and importance sampling over t.
At this stage we have a trained ε_θ. Next we need to turn it into a generative procedure.
Generation is the reverse of the forward noising chain. The forward chain is easy because it’s Gaussian by construction; the reverse chain is hard because it depends on the unknown data distribution. The learned model ε_θ supplies the missing information.
We want transitions p_θ(x_{t−1} | xₜ) that approximately invert q(xₜ | x_{t−1}). DDPMs choose Gaussian reverse transitions:
p_θ(x_{t−1} | xₜ) = 𝒩( x_{t−1} ; μ_θ(xₜ,t), Σ_θ(t) )
A standard choice fixes Σ_θ(t) to a known variance (e.g., β̃ₜ I), and uses the network to compute the mean.
A common form (using noise prediction) is:
μ_θ(xₜ,t) = (1/√(αₜ)) ( xₜ − (βₜ/√(1 − ᾱₜ)) ε_θ(xₜ,t) )
Then sampling is:
x_{t−1} = μ_θ(xₜ,t) + σₜ z, z ∼ 𝒩(0, I)
with σₜ chosen from the variance schedule.
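One reverse update can be sketched as a small function (the name `ddpm_step` and the scalar example values are ours; a real sampler loops this from t = T down to 1):

```python
import numpy as np

def ddpm_step(x_t, eps_pred, alpha_t, alpha_bar_t, sigma_t, rng):
    """One reverse step x_t -> x_{t-1}: compute the noise-prediction mean,
    then optionally add fresh Gaussian noise scaled by sigma_t."""
    beta_t = 1.0 - alpha_t
    mean = (x_t - beta_t / np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_t)
    if sigma_t == 0.0:
        return mean  # deterministic variant (no injected noise)
    return mean + sigma_t * rng.standard_normal(np.shape(x_t))

rng = np.random.default_rng(0)
# With sigma_t = 0 the update is deterministic; with sigma_t > 0 it is stochastic.
det = ddpm_step(0.5, 1.0, alpha_t=0.98, alpha_bar_t=0.9, sigma_t=0.0, rng=rng)
```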
Pause: what is happening qualitatively? Each update subtracts a scaled estimate of the noise in xₜ (a small denoising move) and then re-injects a smaller amount of fresh noise σₜ z.
This stochasticity helps match the true reverse distribution and avoids collapsing to a single mode.
If you set the injected noise to zero (or modify the update), you get deterministic trajectories that still land on realistic samples. This is the basis for faster sampling variants.
You can think of DDPM vs DDIM as trading stochasticity for speed and determinism: DDIM follows (near-)deterministic trajectories and supports far fewer steps, while DDPM's injected noise better matches the true reverse distribution.
In the score-based modeling framework, the diffusion process is described by an SDE:
dx = f(x, t) dt + g(t) dw
where w is Brownian motion.
The reverse-time SDE has drift that involves the score:
dx = [ f(x, t) − g(t)² ∇_{x} log p_t(x) ] dt + g(t) d\bar{w}
If you approximate the score with s_θ(x, t), you can numerically solve the reverse SDE to sample.
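As a sketch of this view, take a variance-preserving SDE with constant β and a toy data distribution that is already 𝒩(0, 1); then the marginal p_t stays 𝒩(0, 1) for all t and the exact score is s(x, t) = −x, so Euler-Maruyama integration of the reverse SDE should approximately preserve unit variance. All discretization choices here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 1.0       # constant beta(t): f(x, t) = -0.5*beta*x, g(t) = sqrt(beta)
n_steps = 100
dt = 1.0 / n_steps

def true_score(x):
    # For data ~ N(0, 1) under a variance-preserving diffusion,
    # p_t stays N(0, 1), so the score is simply -x.
    return -x

# Start from "pure noise" and integrate the reverse SDE backward in time:
# x <- x - [f(x) - g^2 * score(x)] * dt + g * sqrt(dt) * z
x = rng.standard_normal(20000)
for _ in range(n_steps):
    drift = -0.5 * beta * x - beta * true_score(x)
    x = x - drift * dt + np.sqrt(beta * dt) * rng.standard_normal(x.shape)
```

The final ensemble should still have mean ≈ 0 and variance ≈ 1, up to discretization and Monte-Carlo error.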
You do not need to memorize this SDE form to use diffusion models, but it explains why so many samplers exist (they are different numerical solvers for the same reverse dynamics), why deterministic sampling via the probability-flow ODE works, and why better solvers can cut the number of steps.
If T is large enough and the schedule is designed properly, q(x_T) is close to a standard normal regardless of p_data (information destroyed). That’s why the model can start from pure noise.
In conditional generation (text-to-image, class-conditional), you train ε_θ(xₜ,t, c) with conditioning c, and also sometimes drop c during training to learn an unconditional path.
At sampling time, combine:
ε_guided = (1 + w) ε_θ(xₜ,t,c) − w ε_θ(xₜ,t, ∅)
where w ≥ 0 is the guidance scale.
Intuition: push samples toward regions that satisfy the condition more strongly, at the cost of reduced diversity if w is too high.
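The guidance combination is a one-liner; a sketch with made-up prediction vectors (`cfg_eps` is our own name):

```python
import numpy as np

def cfg_eps(eps_cond, eps_uncond, w):
    """Classifier-free guidance combination:
    eps_guided = (1 + w) * eps_cond - w * eps_uncond."""
    return (1.0 + w) * eps_cond - w * eps_uncond

eps_c = np.array([1.0, 0.0])   # hypothetical conditional prediction
eps_u = np.array([0.5, 0.5])   # hypothetical unconditional prediction
# w = 0 recovers the conditional prediction unchanged; larger w extrapolates
# further in the direction (eps_cond - eps_uncond).
```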
This is not strictly part of “diffusion basics,” but it’s a major reason diffusion works well in practice.
Diffusion models became dominant for high-fidelity generation because they combine stable training with flexible conditioning and strong likelihood-related foundations.
| Model family | Core idea | Strengths | Weaknesses |
|---|---|---|---|
| VAE | latent variable model trained with ELBO | stable training, explicit encoder | samples can be blurry; trade-off via KL |
| GAN | adversarial game | sharp samples, fast inference | unstable training, mode collapse, hard likelihood |
| Diffusion | learn to reverse noising | very high quality, stable objective, flexible conditioning | slow sampling (mitigated by fast samplers), compute-heavy |
Since you know VAEs, notice the philosophical similarity: both start from a Gaussian sample and decode it into data. In both cases, "start from something Gaussian" is the trick. Diffusion differs in that the latent is not low-dimensional; it has the same dimension as x, and decoding is spread across many small denoising steps.
Even if you never explicitly compute ∇ log p, the score view informs practical choices in architecture, data scaling and variance handling, sampling speed, and evaluation.
Diffusion models are iterative refinement. Each step is a small denoise move that, when composed many times, produces a complex global transformation from noise to data.
If you want one unifying picture: a trained diffusion model is a time-indexed vector field (the score, up to scaling) that, followed from pure noise, flows samples onto the data distribution.
Let a 1D “data point” be x₀ = 2. Suppose a diffusion schedule gives ᾱₜ = 0.81 at some timestep t. Sample ε ∼ 𝒩(0,1); take ε = −0.5 for this worked example. Compute xₜ.
Use the closed form:
xₜ = √(ᾱₜ) x₀ + √(1 − ᾱₜ) ε
Compute √(ᾱₜ):
√(0.81) = 0.9
Compute √(1 − ᾱₜ):
1 − ᾱₜ = 1 − 0.81 = 0.19
√(0.19) ≈ 0.43589
Plug in values:
xₜ = 0.9 · 2 + 0.43589 · (−0.5)
= 1.8 − 0.217945
= 1.582055
Interpretation: the clean value 2 is shrunk to 1.8 and perturbed by about −0.218; at ᾱₜ = 0.81 most of the signal survives, so xₜ still clearly reflects x₀.
Insight: The ability to sample xₜ directly from x₀ (without simulating t steps) is what makes diffusion training efficient: you can train on arbitrary noise levels with one formula.
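The arithmetic above can be verified in a few lines:

```python
import numpy as np

# Worked example: x0 = 2, alpha_bar_t = 0.81, eps = -0.5.
alpha_bar_t = 0.81
x0, eps = 2.0, -0.5
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
```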
Assume ᾱₜ = 0.36 and you observe a 2D noisy sample xₜ = (1.2, −0.3). A trained network predicts ε_θ(xₜ,t) = (0.5, −1.0). Compute \hat{x}₀.
Use the reconstruction formula:
\hat{x}₀ = ( xₜ − √(1 − ᾱₜ) ε_θ(xₜ,t) ) / √(ᾱₜ)
Compute √(ᾱₜ):
√(0.36) = 0.6
Compute √(1 − ᾱₜ):
1 − ᾱₜ = 0.64
√(0.64) = 0.8
Compute the noise term:
√(1 − ᾱₜ) ε_θ = 0.8 · (0.5, −1.0)
= (0.4, −0.8)
Subtract from xₜ:
xₜ − √(1 − ᾱₜ) ε_θ = (1.2, −0.3) − (0.4, −0.8)
= (0.8, 0.5)
Divide by √(ᾱₜ):
\hat{x}₀ = (0.8, 0.5) / 0.6
= (1.333…, 0.833…)
Insight: Noise prediction is not just a training trick: it provides a direct map from a noisy point back to an estimate of the clean data, which then defines the reverse-process mean updates.
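A quick check of the reconstruction above (variable names are ours):

```python
import numpy as np

# Worked example: alpha_bar_t = 0.36, 2D noisy sample, predicted noise.
alpha_bar_t = 0.36
x_t = np.array([1.2, -0.3])
eps_pred = np.array([0.5, -1.0])
x0_hat = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
```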
Consider 1D for simplicity. Suppose at timestep t you have xₜ = 0.7, αₜ = 0.9 (so βₜ = 0.1), ᾱₜ = 0.5, and the model predicts ε_θ(xₜ,t) = 0.2. Compute the reverse mean μ_θ(xₜ,t).
Use the standard mean (noise-prediction form):
μ_θ(xₜ,t) = 1/√(αₜ) · ( xₜ − (βₜ/√(1 − ᾱₜ)) ε_θ(xₜ,t) )
Compute √(αₜ):
√(0.9) ≈ 0.948683
So 1/√(αₜ) ≈ 1.054093
Compute √(1 − ᾱₜ):
1 − ᾱₜ = 0.5
√(0.5) ≈ 0.707107
Compute the scaled noise subtraction:
(βₜ/√(1 − ᾱₜ)) ε_θ = (0.1 / 0.707107) · 0.2
≈ 0.141421 · 0.2
≈ 0.028284
Compute inside parentheses:
xₜ − (...) = 0.7 − 0.028284 = 0.671716
Multiply by 1/√(αₜ):
μ_θ ≈ 1.054093 · 0.671716 ≈ 0.7081
Interpretation: the update barely moves the sample (0.7 → ≈0.7081). It subtracts a small scaled noise estimate and then rescales by 1/√(αₜ) to undo one step of signal attenuation.
Insight: Reverse diffusion is many small, structured moves. Each move uses ε_θ to decide how to shift the sample toward higher-probability regions at that noise level.
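A quick check of the reverse-mean arithmetic:

```python
import numpy as np

# Worked example: alpha_t = 0.9 (so beta_t = 0.1), alpha_bar_t = 0.5.
alpha_t, alpha_bar_t = 0.9, 0.5
beta_t = 1.0 - alpha_t
x_t, eps_pred = 0.7, 0.2
mu = (x_t - beta_t / np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_t)
```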
A diffusion model defines a forward Gaussian noising Markov chain q and learns a reverse denoising chain p_θ.
The closed form xₜ = √(ᾱₜ)x₀ + √(1 − ᾱₜ)ε enables efficient training at arbitrary timesteps.
ε_θ(xₜ,t) is commonly trained with an MSE objective to predict the exact noise ε used to create xₜ.
From ε_θ you can recover an estimate of the clean sample \hat{x}₀, which is used to build reverse transition means.
Noise prediction is (up to scaling) equivalent to learning the score ∇_{x} log p_t(x), connecting diffusion to score matching.
The noise schedule {βₜ} (or ᾱₜ) controls SNR over time and has a large impact on learnability and sampling quality.
Sampling starts from x_T ∼ 𝒩(0, I) and iteratively applies reverse updates; improved samplers reduce the number of steps.
Conditioning and guidance methods (e.g., classifier-free guidance) modify the effective denoising direction to satisfy prompts/labels.
Confusing αₜ with ᾱₜ: αₜ = 1 − βₜ is per-step, while ᾱₜ = ∏_{s≤t} α_s is cumulative and appears in the direct sampling formula.
Assuming ε_θ outputs “denoised x₀” by default; many implementations predict noise (ε), v, or x₀ depending on configuration, and the conversion matters.
Ignoring time conditioning: a network without a good embedding of t will fail because denoising at different SNRs is fundamentally different.
Thinking the forward process is learned: in standard diffusion, the forward noising q is fixed; only the reverse model is learned.
You have ᾱₜ = 0.64 and a clean vector x₀ = (3, 0). You sample ε = (−1, 2). Compute xₜ using xₜ = √(ᾱₜ)x₀ + √(1 − ᾱₜ)ε.
Hint: Compute √(0.64) and √(0.36) first, then scale and add componentwise.
√(ᾱₜ) = √(0.64) = 0.8 and √(1 − ᾱₜ) = √(0.36) = 0.6.
xₜ = 0.8(3,0) + 0.6(−1,2)
= (2.4, 0) + (−0.6, 1.2)
= (1.8, 1.2).
Derive the \hat{x}₀ reconstruction formula from xₜ = √(ᾱₜ)x₀ + √(1 − ᾱₜ)ε when you replace ε by ε_θ(xₜ,t).
Hint: Rearrange the equation to isolate x₀; treat √(ᾱₜ) as a scalar.
Start from:
xₜ = √(ᾱₜ)x₀ + √(1 − ᾱₜ)ε
Subtract the noise term:
xₜ − √(1 − ᾱₜ)ε = √(ᾱₜ)x₀
Divide by √(ᾱₜ):
x₀ = ( xₜ − √(1 − ᾱₜ)ε ) / √(ᾱₜ)
Replace ε with the model prediction ε_θ(xₜ,t):
\hat{x}₀(xₜ,t) = ( xₜ − √(1 − ᾱₜ) ε_θ(xₜ,t) ) / √(ᾱₜ).
Show that if ᾱₜ → 0, then q(xₜ|x₀) approaches 𝒩(0, I) regardless of x₀. Use the mean and covariance of q(xₜ|x₀).
Hint: Look at the closed form q(xₜ|x₀) = 𝒩(√(ᾱₜ)x₀, (1 − ᾱₜ)I) and take limits.
We have:
q(xₜ|x₀) = 𝒩( μₜ, Σₜ )
with
μₜ = √(ᾱₜ)x₀
Σₜ = (1 − ᾱₜ) I
Take ᾱₜ → 0:
μₜ → √0 · x₀ = 0
Σₜ → (1 − 0) I = I
Therefore q(xₜ|x₀) → 𝒩(0, I), independent of x₀. This formalizes the idea that the forward process eventually destroys all information about the original data.