Generative models with latent variables. ELBO, reparameterization.
Multi-session curriculum: substantial prior knowledge and complex material. Use mastery gates and deliberate practice.
Variational Autoencoders (VAEs) are the bridge between probabilistic latent-variable modeling (Bayes, priors, posteriors) and deep learning (powerful function approximation). They give you a principled way to learn both a generator and an inference procedure—by optimizing a single tractable objective: the ELBO.
A VAE posits a latent variable z that generates data x via a decoder p_θ(x|z) and a prior p(z). Because the true posterior p_θ(z|x) is intractable, we approximate it with an encoder q_φ(z|x). Training maximizes the ELBO: 𝔼_{q_φ(z|x)}[log p_θ(x|z)] − KL(q_φ(z|x) ‖ p(z)). The reparameterization trick (z = μ_φ(x) + σ_φ(x) ⊙ ε, ε ∼ 𝒩(0, I)) allows backpropagation through sampling.
In many problems we want a model that can both generate realistic data and explain data in terms of hidden factors.
A standard (deterministic) autoencoder learns an encoder f(x) → z and decoder g(z) → x̂, but it does not define a probability distribution over data. You can reconstruct, but “sampling” z and decoding often produces arbitrary garbage because the latent space has no probabilistic structure.
A VAE fixes this by making the model explicitly probabilistic. It’s an instance of a latent-variable generative model: a prior p(z) over latents plus a conditional likelihood p_θ(x|z).
This defines a joint distribution:
p_θ(x, z) = p_θ(x|z) p(z)
If we can learn θ well, then we can generate new data by sampling z ∼ p(z) and then x ∼ p_θ(x|z).
Given an observed x, the Bayesian posterior over latents is
p_θ(z|x) = p_θ(x|z) p(z) / p_θ(x)
where the marginal likelihood (evidence) is
p_θ(x) = ∫ p_θ(x|z) p(z) dz
In deep models, that integral is typically intractable.
But to learn the model, we’d like to maximize log p_θ(x) over θ for the dataset. And to do inference (encode x), we want p_θ(z|x). Both are blocked by the same intractable evidence integral.
Introduce a tractable approximation q_φ(z|x) (the variational posterior / encoder) and optimize a lower bound on log p_θ(x) that is differentiable and scalable.
The VAE has two neural networks: an encoder that parameterizes q_φ(z|x) and a decoder that parameterizes p_θ(x|z).
Unlike a deterministic autoencoder, the encoder outputs a distribution (often Gaussian) and the decoder defines a likelihood (often Gaussian for real-valued data, Bernoulli for binary pixels, categorical for discrete tokens, etc.).
A common (and very useful) baseline is a standard normal prior p(z) = 𝒩(0, I) with a diagonal Gaussian encoder q_φ(z|x) = 𝒩(μ_φ(x), diag(σ²_φ(x))).
This is not required, but it’s a common starting point because (1) sampling is easy, (2) KL terms often have closed form, and (3) reparameterization is straightforward.
Think of training a VAE as simultaneously fitting a generator (the decoder) and an amortized inference network (the encoder).
The quantity we would like to maximize for each datapoint x is log p_θ(x). But:
log p_θ(x) = log ∫ p_θ(x|z) p(z) dz
The log of an integral of a neural-network-defined density is generally not tractable.
Variational inference gives a workaround: introduce a distribution q_φ(z|x) that we can sample from and evaluate.
Start with the log evidence and multiply inside by q_φ(z|x) / q_φ(z|x):
log p_θ(x)
= log ∫ p_θ(x, z) dz
= log ∫ q_φ(z|x) · [p_θ(x, z) / q_φ(z|x)] dz
Now apply Jensen’s inequality to log 𝔼[·] (log is concave):
log ∫ q_φ(z|x) · [p_θ(x, z) / q_φ(z|x)] dz
= log 𝔼_{q_φ(z|x)} [ p_θ(x, z) / q_φ(z|x) ]
≥ 𝔼_{q_φ(z|x)} [ log p_θ(x, z) − log q_φ(z|x) ]
Define the ELBO:
ELBO(θ, φ; x) = 𝔼_{q_φ(z|x)} [ log p_θ(x, z) − log q_φ(z|x) ]
So we have the bound:
log p_θ(x) ≥ ELBO(θ, φ; x)
Use p_θ(x, z) = p_θ(x|z) p(z):
ELBO
= 𝔼_{q_φ(z|x)}[ log p_θ(x|z) + log p(z) − log q_φ(z|x) ]
Group the last two terms as a KL divergence:
KL(q_φ(z|x) ‖ p(z))
= 𝔼_{q_φ(z|x)}[ log q_φ(z|x) − log p(z) ]
So:
ELBO
= 𝔼_{q_φ(z|x)}[ log p_θ(x|z) ] − KL(q_φ(z|x) ‖ p(z))
This is the form you implement.
- Reconstruction term: 𝔼_{q_φ(z|x)}[ log p_θ(x|z) ] — rewards the decoder for explaining x from sampled latents.
- Regularization term: KL(q_φ(z|x) ‖ p(z)) — keeps the encoder’s posterior close to the prior.
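A minimal per-example sketch of this objective in pure Python, assuming a Bernoulli decoder and a diagonal Gaussian encoder (the function names are illustrative, and a real implementation would use an autodiff framework):

```python
import math
import random

def gaussian_kl(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dims."""
    return 0.5 * sum(m * m + math.exp(s) - s - 1.0 for m, s in zip(mu, logvar))

def bernoulli_loglik(x, probs):
    """log p(x|z) for binary x under independent Bernoulli outputs."""
    eps = 1e-7  # clamp probabilities away from 0/1 for numerical safety
    return sum(xi * math.log(max(p, eps)) + (1.0 - xi) * math.log(max(1.0 - p, eps))
               for xi, p in zip(x, probs))

def elbo_single_sample(x, mu, logvar, decoder):
    """One-sample Monte Carlo estimate of ELBO = E_q[log p(x|z)] - KL(q || p)."""
    noise = [random.gauss(0.0, 1.0) for _ in mu]
    z = [m + math.exp(0.5 * s) * e for m, s, e in zip(mu, logvar, noise)]  # reparameterize
    return bernoulli_loglik(x, decoder(z)) - gaussian_kl(mu, logvar)
```

With mu = 0 and logvar = 0 the KL term vanishes exactly, so the ELBO reduces to the reconstruction term.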
A key identity connects ELBO and the true posterior:
log p_θ(x) = ELBO(θ, φ; x) + KL(q_φ(z|x) ‖ p_θ(z|x))
Derivation sketch (showing the work):
KL(q ‖ p_θ(z|x))
= 𝔼_q[ log q(z|x) − log p_θ(z|x) ]
= 𝔼_q[ log q(z|x) − log (p_θ(x, z) / p_θ(x)) ]
= 𝔼_q[ log q(z|x) − log p_θ(x, z) + log p_θ(x) ]
= log p_θ(x) − 𝔼_q[ log p_θ(x, z) − log q(z|x) ]
= log p_θ(x) − ELBO
Rearrange:
log p_θ(x) = ELBO + KL(q ‖ p_θ(z|x))
Because KL ≥ 0, ELBO is a lower bound. It becomes tight when q_φ(z|x) matches the true posterior.
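Both the bound and its tightness can be checked in a toy conjugate model where everything is available in closed form (an illustration, not part of the lesson’s running example): z ∼ 𝒩(0, 1), x|z ∼ 𝒩(z, 1), so p(x) = 𝒩(0, 2) and the true posterior is 𝒩(x/2, 1/2).

```python
import math

# Toy conjugate model: z ~ N(0,1), x|z ~ N(z,1).
# Then p(x) = N(0,2) and the true posterior is p(z|x) = N(x/2, 1/2).
def log_evidence(x):
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x * x / 4.0

def elbo(x, m, s2):
    """Closed-form ELBO for q(z|x) = N(m, s2) in the toy model."""
    rec = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s2)  # E_q[log p(x|z)]
    kl = 0.5 * (m * m + s2 - math.log(s2) - 1.0)                      # KL(q || N(0,1))
    return rec - kl

x = 1.3
gap_tight = log_evidence(x) - elbo(x, x / 2.0, 0.5)  # q = true posterior -> gap is 0
gap_loose = log_evidence(x) - elbo(x, 0.0, 1.0)      # q = prior -> gap = KL(q || posterior) > 0
```

The gap is exactly KL(q ‖ p_θ(z|x)): zero when q is the true posterior, strictly positive otherwise.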
For a dataset {xᵢ}ᵢ₌₁ᴺ, maximize:
∑ᵢ ELBO(θ, φ; xᵢ)
This trains θ (the decoder) to explain the data through latents, and φ (the encoder) to perform amortized approximate inference.
We need gradients of
𝔼_{q_φ(z|x)}[ log p_θ(x|z) ]
with respect to both θ and φ, plus gradients of the KL term.
That’s exactly why the reparameterization trick matters.
Suppose we approximate the expectation with Monte Carlo:
𝔼_{q_φ(z|x)}[ f(z) ] ≈ (1/L) ∑_{ℓ=1}^L f(z^{(ℓ)}), where z^{(ℓ)} ∼ q_φ(z|x)
If z^{(ℓ)} is produced by a sampling step that depends on φ, naive backprop gets stuck: the computational graph has a stochastic node.
One option is the score-function (REINFORCE) estimator:
∇_φ 𝔼_{q_φ}[f(z)] = 𝔼_{q_φ}[ f(z) ∇_φ log q_φ(z) ]
It’s unbiased but typically high-variance.
Reparameterization gives a lower-variance, pathwise gradient by rewriting the random variable as a deterministic function of φ and external noise.
If you can write
z = g_φ(ε, x), ε ∼ p(ε)
where p(ε) does not depend on φ, then
𝔼_{q_φ(z|x)}[ f(z) ] = 𝔼_{p(ε)}[ f(g_φ(ε, x)) ]
Now the randomness is in ε, not in the parameters. Gradients can flow through g_φ.
Let
q_φ(z|x) = 𝒩(μ_φ(x), diag(σ²_φ(x)))
Sample ε ∼ 𝒩(0, I) and define:
z = μ_φ(x) + σ_φ(x) ⊙ ε
Here ⊙ is elementwise multiplication.
This produces z distributed exactly as q_φ(z|x). And μ_φ, σ_φ are outputs of a neural net.
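A quick sanity check (pure Python, illustrative constants) that the transform really does produce samples distributed as 𝒩(μ, σ²):

```python
import math
import random

random.seed(0)
mu, sigma, n = 0.7, 0.3, 100_000
zs = [mu + sigma * random.gauss(0.0, 1.0) for _ in range(n)]  # z = mu + sigma * eps

z_mean = sum(zs) / n
z_std = math.sqrt(sum((z - z_mean) ** 2 for z in zs) / n)
# z_mean is close to 0.7 and z_std close to 0.3: shifting and scaling standard
# normal noise gives exact samples from N(mu, sigma^2).
```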
Consider the reconstruction term for a single x:
ℒ_rec(θ, φ; x) = 𝔼_{q_φ(z|x)}[ log p_θ(x|z) ]
Using reparameterization:
ℒ_rec = 𝔼_{ε∼𝒩(0,I)}[ log p_θ(x| μ_φ(x) + σ_φ(x) ⊙ ε ) ]
Approximate with L samples:
ℒ_rec ≈ (1/L) ∑_{ℓ=1}^L log p_θ(x| μ_φ(x) + σ_φ(x) ⊙ ε^{(ℓ)})
Now ∇_φ is just ordinary backprop through μ_φ and σ_φ.
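To see concretely why this beats the score-function (REINFORCE) estimator mentioned earlier, compare the two on the toy gradient ∇_μ 𝔼_{z∼𝒩(μ,1)}[z²], whose true value is 2μ (the specific variance numbers below are particular to this toy f):

```python
import math
import random

random.seed(0)
mu, n = 1.0, 20_000
# Target: gradient of E_{z ~ N(mu, 1)}[z^2] w.r.t. mu; the true value is 2*mu = 2.
score_grads, path_grads = [], []
for _ in range(n):
    eps = random.gauss(0.0, 1.0)
    z = mu + eps
    score_grads.append(z * z * (z - mu))  # score function: f(z) * d/dmu log q(z), sigma = 1
    path_grads.append(2.0 * (mu + eps))   # pathwise: d/dmu f(mu + eps)

def mean_var(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)

m_score, v_score = mean_var(score_grads)
m_path, v_path = mean_var(path_grads)
# Both means are near 2 (both estimators are unbiased), but the pathwise
# variance is far smaller (analytically 4 vs 30 for this f and mu).
```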
With a standard normal prior p(z) = 𝒩(0, I) and diagonal Gaussian q_φ, the KL has a closed form.
Let q = 𝒩(μ, diag(σ²)) and p = 𝒩(0, I). Then:
KL(q ‖ p) = (1/2) ∑ⱼ ( μⱼ² + σⱼ² − log σⱼ² − 1 )
This is extremely useful: it’s exact, differentiable, and cheap.
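A small check, in the scalar case, that the closed form agrees with a Monte Carlo estimate of 𝔼_q[log q(z) − log p(z)] (function names are illustrative):

```python
import math
import random

def kl_closed_form(mu, s2):
    """Scalar KL( N(mu, s2) || N(0, 1) )."""
    return 0.5 * (mu * mu + s2 - math.log(s2) - 1.0)

def kl_monte_carlo(mu, s2, n=200_000, seed=0):
    """Estimate E_q[log q(z) - log p(z)] by sampling z ~ q = N(mu, s2)."""
    rng = random.Random(seed)
    sigma = math.sqrt(s2)
    total = 0.0
    for _ in range(n):
        z = mu + sigma * rng.gauss(0.0, 1.0)
        log_q = -0.5 * (math.log(2.0 * math.pi * s2) + (z - mu) ** 2 / s2)
        log_p = -0.5 * (math.log(2.0 * math.pi) + z * z)
        total += log_q - log_p
    return total / n
# kl_closed_form(0.5, 2.0) and kl_monte_carlo(0.5, 2.0) agree to ~2 decimal places.
```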
Each latent dimension j pays a penalty (1/2)(μⱼ² + σⱼ² − log σⱼ² − 1), which is zero exactly when μⱼ = 0 and σⱼ² = 1.
So the encoder is encouraged to produce a distribution “not too far” from 𝒩(0,1).
To keep σ positive, we usually output log σ² (call it s) and compute:
σ² = exp(s), σ = exp(0.5 s)
This avoids invalid (negative) variances and tends to be numerically stable.
| Piece | Object | Typical choice | Role |
|---|---|---|---|
| Prior | p(z) | 𝒩(0, I) | Defines the “sampling space” |
| Encoder | q_φ(z\|x) | 𝒩(μ_φ(x), diag(σ²_φ(x))) | Approximate posterior |
| Decoder | p_θ(x\|z) | Bernoulli or Gaussian | Likelihood model |
| Objective | ELBO | 𝔼_q[log p_θ(x\|z)] − KL(q ‖ p) | Trains θ, φ |
| Trick | z = μ + σ ⊙ ε | ε ∼ 𝒩(0, I) | Low-variance gradients |
Reparameterization is straightforward for continuous distributions like Gaussians. For discrete latents, you need alternatives (Gumbel-Softmax / Concrete distributions, score-function estimators, or other variational relaxations). Many VAE lessons stop at Gaussians because they cover the most common and useful case.
After training, generation is simple: sample z ∼ p(z), then sample x ∼ p_θ(x|z) (or take the decoder’s mean).
Because the KL term kept q_φ(z|x) near p(z), the decoder has been trained on z values that look like prior samples.
The latent z can be used as a learned feature representation. Common uses include input features for downstream classifiers, clustering, and interpolation or visualization in latent space.
Caution: VAEs trade off reconstruction fidelity vs. latent regularity. If the KL term dominates, representations can become less informative.
If you want generation conditioned on labels or attributes y, condition both networks on y: an encoder q_φ(z|x, y) and a decoder p_θ(x|z, y) (a conditional VAE).
This lets you generate samples with a chosen condition.
A common heuristic: points with low ELBO (or high reconstruction error) are considered anomalous. This works best when the model class fits normal data well.
In powerful decoders (e.g., autoregressive text decoders), the model may learn to ignore z entirely: the KL term is driven toward zero and q_φ(z|x) collapses onto the prior p(z).
Then z carries little information about x.
| Technique | What it changes | Why it helps |
|---|---|---|
| KL annealing | Slowly increase KL weight from 0 → 1 | Gives encoder time to learn informative latents |
| β-VAE | Use β · KL with β ≠ 1 | β < 1 encourages information; β > 1 encourages disentanglement |
| Free bits | KL term has a per-dim minimum | Prevents KL from collapsing too aggressively |
| Weaker decoder | Reduce decoder capacity | Forces use of z |
Objective:
ELBO_β = 𝔼_q[log p_θ(x|z)] − β KL(q ‖ p)
The standard ELBO uses one sample (or a few) and is a lower bound. IWAE (the importance-weighted autoencoder) tightens the bound using multiple samples:
log p_θ(x) ≥ 𝔼_{z₁:K ∼ q}[ log ( (1/K) ∑ₖ p_θ(x, zₖ) / q(zₖ|x) ) ]
This can improve generative modeling but changes optimization dynamics.
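The tightening effect can be seen numerically in a toy conjugate model (z ∼ 𝒩(0, 1), x|z ∼ 𝒩(z, 1), so log p(x) is known exactly), using a deliberately loose proposal q(z|x) = p(z); this is an illustration, not a trained model:

```python
import math
import random

# Toy conjugate model: z ~ N(0,1), x|z ~ N(z,1), so log p(x) = log N(x; 0, 2) is known.
def log_joint(x, z):
    log_prior = -0.5 * (math.log(2.0 * math.pi) + z * z)
    log_lik = -0.5 * (math.log(2.0 * math.pi) + (x - z) ** 2)
    return log_prior + log_lik

def iwae_bound(x, K, rng):
    """One K-sample IWAE estimate with a deliberately loose proposal q(z|x) = N(0, 1)."""
    log_w = []
    for _ in range(K):
        z = rng.gauss(0.0, 1.0)
        log_q = -0.5 * (math.log(2.0 * math.pi) + z * z)
        log_w.append(log_joint(x, z) - log_q)
    m = max(log_w)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(lw - m) for lw in log_w) / K)

rng = random.Random(0)
x = 1.3
log_px = -0.5 * math.log(2.0 * math.pi * 2.0) - x * x / 4.0
L1 = sum(iwae_bound(x, 1, rng) for _ in range(4000)) / 4000    # ordinary ELBO (K = 1)
L50 = sum(iwae_bound(x, 50, rng) for _ in range(4000)) / 4000  # tighter K = 50 bound
# On average L1 < L50 <= log_px, with L50 much closer to log_px.
```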
Diffusion models also involve latent variables, a fixed noising “encoder,” and a variational bound on the log-likelihood.
Conceptually, VAEs train a generator with latent variables via a variational bound; diffusion trains a generator via denoising/score objectives. Understanding latent-variable bounds, Gaussian KL terms, and reparameterization makes diffusion objectives feel much less mysterious.
Given a latent-variable model p_θ(x, z) = p_θ(x|z) p(z) and an approximate posterior q_φ(z|x), show that ELBO = 𝔼_q[log p_θ(x|z)] − KL(q_φ(z|x) ‖ p(z)).
Start from the ELBO definition:
ELBO = 𝔼_{q_φ(z|x)}[ log p_θ(x, z) − log q_φ(z|x) ].
Substitute the joint factorization:
log p_θ(x, z) = log p_θ(x|z) + log p(z).
Plug in and separate expectations:
ELBO = 𝔼_q[ log p_θ(x|z) + log p(z) − log q(z|x) ]
= 𝔼_q[ log p_θ(x|z) ] + 𝔼_q[ log p(z) − log q(z|x) ].
Recognize the KL divergence:
KL(q(z|x) ‖ p(z)) = 𝔼_q[ log q(z|x) − log p(z) ].
Therefore:
𝔼_q[ log p(z) − log q(z|x) ] = − KL(q(z|x) ‖ p(z)).
Conclude:
ELBO = 𝔼_{q_φ(z|x)}[ log p_θ(x|z) ] − KL(q_φ(z|x) ‖ p(z)).
Insight: The ELBO cleanly separates “fit the data” (expected log-likelihood) from “keep latents well-behaved for sampling” (KL to the prior). This is the central tradeoff in VAEs.
Let q(z) = 𝒩(μ, diag(σ²)) and p(z) = 𝒩(0, I). Derive KL(q ‖ p) = (1/2) ∑ⱼ ( μⱼ² + σⱼ² − log σⱼ² − 1 ).
Write log densities (up to constants) for d dimensions.
For p:
log p(z) = −(1/2) ∑ⱼ ( zⱼ² + log 2π ).
For q:
log q(z) = −(1/2) ∑ⱼ ( (zⱼ−μⱼ)²/σⱼ² + log σⱼ² + log 2π ).
Start from KL definition:
KL(q ‖ p) = 𝔼_q[ log q(z) − log p(z) ].
The log 2π constants cancel.
Compute the difference inside the expectation:
log q − log p
= −(1/2) ∑ⱼ [ (zⱼ−μⱼ)²/σⱼ² + log σⱼ² − zⱼ² ].
Take expectation under q. Use facts:
If zⱼ ∼ 𝒩(μⱼ, σⱼ²), then
𝔼[(zⱼ−μⱼ)²] = σⱼ²,
𝔼[zⱼ²] = Var(zⱼ) + (𝔼[zⱼ])² = σⱼ² + μⱼ².
Substitute expectations:
𝔼_q[(zⱼ−μⱼ)²/σⱼ²] = σⱼ²/σⱼ² = 1,
𝔼_q[zⱼ²] = σⱼ² + μⱼ².
So for each j:
𝔼_q[ (zⱼ−μⱼ)²/σⱼ² + log σⱼ² − zⱼ² ]
= 1 + log σⱼ² − (σⱼ² + μⱼ²).
Therefore:
KL(q ‖ p)
= −(1/2) ∑ⱼ [ 1 + log σⱼ² − σⱼ² − μⱼ² ]
= (1/2) ∑ⱼ [ μⱼ² + σⱼ² − log σⱼ² − 1 ].
Insight: This closed-form KL is why the Gaussian VAE is so popular: you get exact regularization without needing Monte Carlo estimates, and gradients are stable.
Let q_φ(z|x) = 𝒩(μ_φ(x), diag(σ²_φ(x))). Show how to rewrite an expectation 𝔼_{q_φ}[f(z)] so ∇_φ can be computed by backprop.
Define external noise ε ∼ 𝒩(0, I), independent of φ.
Construct a deterministic transform:
z = g_φ(ε, x) = μ_φ(x) + σ_φ(x) ⊙ ε.
Rewrite the expectation:
𝔼_{q_φ(z|x)}[ f(z) ] = 𝔼_{ε∼𝒩(0,I)}[ f( μ_φ(x) + σ_φ(x) ⊙ ε ) ].
Approximate with a Monte Carlo sample ε¹:
𝔼 ≈ f( μ_φ(x) + σ_φ(x) ⊙ ε¹ ).
Differentiate:
∇_φ f( μ_φ(x) + σ_φ(x) ⊙ ε¹ )
flows through μ_φ and σ_φ via the chain rule because ε¹ is treated as a constant during backprop.
Insight: Reparameterization moves the randomness to an input node (ε). Once you do that, the sampled z is just another differentiable layer in the network.
A VAE is a probabilistic latent-variable model: p_θ(x, z) = p_θ(x|z) p(z), enabling true sampling/generation.
Because p_θ(z|x) is usually intractable, we introduce an encoder q_φ(z|x) to approximate the posterior.
The ELBO is the tractable training objective: ELBO = 𝔼_{q_φ}[log p_θ(x|z)] − KL(q_φ(z|x) ‖ p(z)).
ELBO is a lower bound on log p_θ(x) and becomes tight when q_φ(z|x) = p_θ(z|x).
The reparameterization trick (Gaussian case: z = μ + σ ⊙ ε, ε ∼ 𝒩(0,I)) enables low-variance gradients through sampling.
With Gaussian prior and diagonal Gaussian encoder, KL has a closed form: (1/2)∑ⱼ(μⱼ² + σⱼ² − log σⱼ² − 1).
VAEs can fail via posterior collapse (KL→0, latents ignored), especially with very strong decoders; KL annealing, β-VAE, and capacity control help.
Understanding latent-variable objectives, KL structure, and reparameterization provides conceptual groundwork for later generative models, including diffusion.
Treating the VAE decoder output as a deterministic reconstruction x̂ without defining a likelihood p_θ(x|z); you need a distribution to make the ELBO meaningful.
Forgetting that the reconstruction term is an expectation over z ∼ q_φ(z|x); using only μ_φ(x) can work as a heuristic but changes the objective.
Implementing σ directly (which can go negative) instead of parameterizing log σ² and exponentiating; this often causes numerical instability.
Assuming the KL term is just a generic regularizer; it specifically matches q_φ(z|x) to the chosen prior p(z), which determines sampling behavior.
Show that log p_θ(x) = ELBO(θ, φ; x) + KL(q_φ(z|x) ‖ p_θ(z|x)).
Hint: Start from KL(q ‖ p_θ(z|x)) and substitute p_θ(z|x) = p_θ(x, z) / p_θ(x). Rearrange to isolate log p_θ(x).
Let q = q_φ(z|x).
KL(q ‖ p_θ(z|x))
= 𝔼_q[ log q(z|x) − log p_θ(z|x) ]
= 𝔼_q[ log q(z|x) − log (p_θ(x, z) / p_θ(x)) ]
= 𝔼_q[ log q(z|x) − log p_θ(x, z) + log p_θ(x) ]
= log p_θ(x) − 𝔼_q[ log p_θ(x, z) − log q(z|x) ]
= log p_θ(x) − ELBO.
Therefore log p_θ(x) = ELBO + KL(q_φ(z|x) ‖ p_θ(z|x)).
Assume p(z) = 𝒩(0, I) and q_φ(z|x) = 𝒩(μ, diag(σ²)) for a single datapoint. If μ = (2, 0) and σ² = (0.25, 4), compute KL(q ‖ p).
Hint: Use KL = (1/2)∑ⱼ(μⱼ² + σⱼ² − log σⱼ² − 1). Be careful: the formula uses σⱼ², not σⱼ.
Compute per dimension.
j=1: μ₁² = 4, σ₁² = 0.25, log σ₁² = log 0.25 = −1.386294...
Term₁ = 4 + 0.25 − (−1.386294) − 1 = 4.636294...
j=2: μ₂² = 0, σ₂² = 4, log σ₂² = log 4 = 1.386294...
Term₂ = 0 + 4 − 1.386294 − 1 = 1.613706...
Sum = 6.25
KL = (1/2) · 6.25 = 3.125.
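The arithmetic can be checked in two lines (the function name is illustrative):

```python
import math

def kl_diag_gauss_to_std(mu, s2):
    """KL( N(mu, diag(s2)) || N(0, I) ) = 0.5 * sum_j (mu_j^2 + s2_j - log s2_j - 1)."""
    return 0.5 * sum(m * m + v - math.log(v) - 1.0 for m, v in zip(mu, s2))

kl = kl_diag_gauss_to_std([2.0, 0.0], [0.25, 4.0])  # 3.125 (up to float rounding)
```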
A VAE uses Gaussian likelihood p_θ(x|z) = 𝒩(μ_θ(z), σ_x² I) with fixed σ_x². Show that maximizing 𝔼_q[log p_θ(x|z)] is equivalent (up to a constant) to minimizing 𝔼_q[‖x − μ_θ(z)‖²].
Hint: Write out the log density of a Gaussian with fixed variance and drop terms that do not depend on θ.
For d-dimensional x,
log p_θ(x|z) = −(d/2) log(2πσ_x²) − (1/(2σ_x²)) ‖x − μ_θ(z)‖².
Take expectation over q_φ(z|x):
𝔼_q[log p_θ(x|z)]
= −(d/2) log(2πσ_x²) − (1/(2σ_x²)) 𝔼_q[ ‖x − μ_θ(z)‖² ].
The first term is constant w.r.t. θ. Therefore maximizing 𝔼_q[log p_θ(x|z)] is equivalent to minimizing 𝔼_q[ ‖x − μ_θ(z)‖² ]. (The scaling 1/(2σ_x²) does not change the optimizer when σ_x² is fixed.)
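A small numeric confirmation (with illustrative values) that log-likelihood differences under a fixed-variance Gaussian reduce to scaled squared-error differences:

```python
import math

def gauss_loglik(x, mean, s2x):
    """log N(x; mean, s2x * I) for equal-length lists x and mean."""
    d = len(x)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * d * math.log(2.0 * math.pi * s2x) - sq / (2.0 * s2x)

x = [1.0, -0.5, 2.0]
recon_a = [0.9, -0.4, 1.8]  # close reconstruction
recon_b = [0.0, 0.0, 0.0]   # poor reconstruction
s2x = 0.5
mse_a = sum((xi - mi) ** 2 for xi, mi in zip(x, recon_a))
mse_b = sum((xi - mi) ** 2 for xi, mi in zip(x, recon_b))
diff = gauss_loglik(x, recon_a, s2x) - gauss_loglik(x, recon_b, s2x)
# diff equals (mse_b - mse_a) / (2 * s2x): lower squared error <=> higher log-likelihood,
# since the normalization constant cancels when s2x is fixed.
```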