Generator vs. discriminator training: the minimax game
GANs turn “learning to generate data” into a competitive game: one network forges samples, another network plays detective. The surprising part is that this duel—if balanced—pushes the forger toward the true data distribution without ever explicitly writing down a likelihood.
A Generative Adversarial Network (GAN) trains a generator G(z) to map latent noise z to synthetic samples, and a discriminator D(x) to classify real vs. generated. Training is a two-player zero-sum minimax game:
min_G max_D 𝔼_{x∼p_data}[log D(x)] + 𝔼_{z∼p(z)}[log(1 − D(G(z)))]
At the discriminator optimum, GAN training corresponds to minimizing a divergence (Jensen–Shannon) between the data distribution p_data and the model distribution p_g induced by G. In practice, stability depends on keeping G and D in balance, using good losses (often non-saturating), regularization (e.g., gradient penalty), architectural constraints, and careful optimization.
Many machine learning tasks are discriminative: given x, predict y. But in generation, we want to sample new x that look like they came from an unknown data distribution p_data(x) (images, audio, text embeddings, etc.). One classical approach is to define an explicit probabilistic model p_θ(x) and maximize likelihood. That can be hard when the data distribution is complex, multi-modal, and high-dimensional.
GANs offer a different route: instead of writing down p_θ(x) and computing likelihoods, you train a neural network to produce samples and another neural network to judge samples. The “judge” becomes a learned loss function that adapts to the generator’s current weaknesses.
A GAN contains two parametric functions:

- a generator G(z; θ_G), which maps latent noise z ∼ p(z) to a synthetic sample x̃ = G(z), and
- a discriminator D(x; θ_D), which outputs the probability that x is real rather than generated.

Intuitively:

- G is the forger, trying to produce samples that pass as real, and
- D is the detective, trying to tell forgeries apart from genuine data.
The classic GAN objective is:
V(D, G) = 𝔼_{x∼p_data}[log D(x)] + 𝔼_{z∼p(z)}[log(1 − D(G(z)))]
The game is:
min_G max_D V(D, G)

(D is trained in the inner maximization; G is trained in the outer minimization)
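To make the objective concrete, here is a minimal numpy sketch that evaluates V(D, G) on a small discrete support; the two distributions and the uninformative discriminator are made-up toy values, and the optimal discriminator used for comparison is the one derived later in this lesson:

```python
import numpy as np

# Toy discrete support with hypothetical example distributions.
p_data = np.array([0.7, 0.2, 0.05, 0.05])
p_g    = np.array([0.1, 0.2, 0.3, 0.4])

def value(D, p_data, p_g):
    """V(D, G) = E_{x~p_data}[log D(x)] + E_{x~p_g}[log(1 - D(x))]."""
    return np.sum(p_data * np.log(D)) + np.sum(p_g * np.log(1.0 - D))

# An uninformative discriminator (always 0.5) vs. the optimal one.
D_half = np.full(4, 0.5)
D_star = p_data / (p_data + p_g)

print(value(D_half, p_data, p_g))  # exactly 2*log(1/2) ≈ -1.386
print(value(D_star, p_data, p_g))  # strictly larger: D* maximizes V for fixed G
```

Note that the uninformative discriminator pins V at 2·log(1/2) regardless of how different p_data and p_g are, while a better discriminator raises the value.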
If training reaches an equilibrium:

- p_g = p_data (the generator reproduces the data distribution), and
- D(x) = 1/2 everywhere (the discriminator can do no better than chance).

In that ideal case:

- the value is V(D*, G) = −2 log 2, and generated samples are statistically indistinguishable from real data.
Think of p_data as a complicated cloud in a high-dimensional space. G pushes forward a simple latent distribution p(z) through a neural network to produce a new distribution p_g. The discriminator learns a moving “boundary” between the two clouds. G then moves its cloud to cross that boundary.
This creates an important theme you will see repeatedly: the training signal for G is not fixed; it is itself learned, and each network reshapes the other's optimization landscape.
By the end of this lesson, you should be able to:

- write down the GAN minimax objective and explain each term,
- derive the optimal discriminator D*(x) for a fixed generator,
- show that, at the discriminator optimum, GAN training minimizes the Jensen–Shannon divergence,
- explain why the minimax generator loss saturates and why the non-saturating loss helps, and
- describe mode collapse and vanishing gradients, and how Wasserstein objectives and regularization address them.
GAN training alternates between updating D and updating G. To understand what G is really optimizing, we first ask:
If G were fixed, what discriminator D is best?
This is the “inner loop” of the minimax problem. Solving it reveals the divergence GANs implicitly minimize.
Let p_g be the distribution of samples x = G(z) when z ∼ p(z). Then:
V(D, G) = 𝔼_{x∼p_data}[log D(x)] + 𝔼_{x∼p_g}[log(1 − D(x))]
Rewrite as:
V(D, G) = ∫ p_data(x) log D(x) dx + ∫ p_g(x) log(1 − D(x)) dx
Combine integrals:
V(D, G) = ∫ [ p_data(x) log D(x) + p_g(x) log(1 − D(x)) ] dx
For a fixed x, define:
f_x(D(x)) = p_data(x) log D(x) + p_g(x) log(1 − D(x))
Since x is fixed, p_data(x) and p_g(x) are constants. We maximize f_x over D(x) ∈ (0, 1).
Differentiate with respect to D(x):
∂f_x/∂D = p_data(x)/D(x) − p_g(x)/(1 − D(x))
Set derivative to 0:
p_data(x)/D(x) = p_g(x)/(1 − D(x))
Solve for D*(x):
p_data(x) (1 − D(x)) = p_g(x) D(x)
p_data(x) − p_data(x) D(x) = p_g(x) D(x)
p_data(x) = (p_data(x) + p_g(x)) D*(x)
D*(x) = p_data(x) / (p_data(x) + p_g(x))
This is the optimal discriminator for a fixed generator.
So D is estimating a density ratio.
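The pointwise optimum is easy to verify numerically. A toy sketch with hypothetical density values at a single fixed x (p_data(x) = 0.6, p_g(x) = 0.2), maximizing f_x over a grid:

```python
import numpy as np

# Hypothetical density values at one fixed point x.
p_d, p_gx = 0.6, 0.2

def f(u):
    # The pointwise objective: p_data(x) log u + p_g(x) log(1 - u)
    return p_d * np.log(u) + p_gx * np.log(1.0 - u)

u = np.linspace(1e-4, 1.0 - 1e-4, 100_000)
u_best = u[np.argmax(f(u))]

print(u_best)              # ≈ 0.75 (up to grid resolution)
print(p_d / (p_d + p_gx))  # closed form: 0.6 / 0.8 = 0.75
```

The grid maximizer agrees with the closed-form D*(x) = p_data(x)/(p_data(x)+p_g(x)).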
Compute:
V(D*, G) = ∫ p_data(x) log( p_data(x)/(p_data(x)+p_g(x)) ) dx + ∫ p_g(x) log( p_g(x)/(p_data(x)+p_g(x)) ) dx
Let m(x) = (1/2)(p_data(x) + p_g(x)) be the mixture distribution. Note that:
p_data(x)/(p_data(x)+p_g(x)) = p_data(x)/(2m(x))
p_g(x)/(p_data(x)+p_g(x)) = p_g(x)/(2m(x))
Then:
V(D*, G) = ∫ p_data(x) log( p_data(x)/(2m(x)) ) dx + ∫ p_g(x) log( p_g(x)/(2m(x)) ) dx
Split out log(1/2):
V(D*, G) = ∫ p_data(x) [log(p_data(x)/m(x)) + log(1/2)] dx + ∫ p_g(x) [log(p_g(x)/m(x)) + log(1/2)] dx
Use that ∫ p_data(x) dx = 1 and ∫ p_g(x) dx = 1:
V(D*, G) = (∫ p_data(x) log(p_data(x)/m(x)) dx) + (∫ p_g(x) log(p_g(x)/m(x)) dx) + 2 log(1/2)
Recognize KL divergences:
KL(p_data ‖ m) = ∫ p_data(x) log(p_data(x)/m(x)) dx
KL(p_g ‖ m) = ∫ p_g(x) log(p_g(x)/m(x)) dx
Thus:
V(D*, G) = KL(p_data ‖ m) + KL(p_g ‖ m) − 2 log 2
The Jensen–Shannon divergence is:
JSD(p_data ‖ p_g) = (1/2) KL(p_data ‖ m) + (1/2) KL(p_g ‖ m)
So:
V(D*, G) = 2·JSD(p_data ‖ p_g) − 2 log 2
When D is optimal, minimizing V(D*, G) with respect to G is equivalent to minimizing JSD(p_data ‖ p_g). The minimum is achieved when p_g = p_data.
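The identity V(D*, G) = 2·JSD(p_data ‖ p_g) − 2 log 2 can be checked numerically on small discrete distributions; the two distributions below are arbitrary examples:

```python
import numpy as np

def kl(p, q):
    # KL divergence for strictly positive discrete distributions
    return np.sum(p * np.log(p / q))

p_data = np.array([0.5, 0.3, 0.2])
p_g    = np.array([0.2, 0.3, 0.5])

m = 0.5 * (p_data + p_g)
jsd = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)

D_star = p_data / (p_data + p_g)
V = np.sum(p_data * np.log(D_star)) + np.sum(p_g * np.log(1.0 - D_star))

print(V)
print(2.0 * jsd - 2.0 * np.log(2.0))  # matches V up to floating-point error
```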
The original minimax generator loss is:
L_G^minimax = 𝔼_{z∼p(z)}[log(1 − D(G(z)))]
If D becomes too good early, then D(G(z)) ≈ 0, and:
log(1 − D(G(z))) ≈ log(1) = 0
Its gradient can become very small (the generator “stalls”). A common alternative is the non-saturating generator loss:
L_G^NS = − 𝔼_{z∼p(z)}[log D(G(z))]
This has stronger gradients when D(G(z)) is small.
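The difference is easiest to see through the discriminator's logit. If D = σ(a), then d/da log(1 − σ(a)) = −σ(a), while d/da [−log σ(a)] = −(1 − σ(a)). A small numeric sketch with a made-up, confidently negative logit:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Suppose D confidently rejects fakes: logit a = -6, so D(G(z)) ≈ 0.0025.
a = -6.0
D = sigmoid(a)

grad_minimax = -D         # d/da of log(1 - sigmoid(a)): vanishes as D -> 0
grad_ns = -(1.0 - D)      # d/da of -log(sigmoid(a)): stays near -1 as D -> 0

print(abs(grad_minimax))  # ≈ 0.0025: almost no learning signal
print(abs(grad_ns))       # ≈ 0.9975: strong learning signal
```

The more confident D becomes that a sample is fake, the weaker the minimax signal gets and the stronger the non-saturating signal stays.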
| Component | Classic (minimax) | Non-saturating (common in practice) |
|---|---|---|
| Discriminator objective | maximize 𝔼[log D(x)] + 𝔼[log(1−D(G(z)))] | same |
| Generator objective | minimize 𝔼[log(1−D(G(z)))] | minimize −𝔼[log D(G(z))] |
| Main benefit | clean theory | better gradients early |
| Main risk | saturation when D is strong | still unstable without regularization |
In supervised learning, you minimize a fixed loss. In GANs, the loss depends on D, which is being updated too. So optimization is not “rolling downhill” on a static surface; it’s closer to chasing a moving target in a game.
This can create:

- oscillation or cycling rather than convergence,
- mode collapse, and
- vanishing gradients when D becomes too strong.

Understanding these issues helps you choose objectives and regularizers.
Mode collapse occurs when G maps many latent vectors z to the same (or a few) outputs:
G(z₁) ≈ G(z₂) ≈ …
So p_g covers only a subset of modes of p_data.
D provides gradients that only punish current mistakes. If G finds a small set of outputs that D currently misclassifies as real, G can “exploit” that weakness. If D then adapts, G may hop to another exploit, producing cycling behavior.
A related failure is vanishing gradients. The JSD is well-behaved when distributions overlap, but in high dimensions, supports can be nearly disjoint early in training. Then D can perfectly separate real and fake, making gradients uninformative.
This motivates alternative distances/divergences with more useful gradients when supports don’t overlap much.
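This can be demonstrated numerically in one dimension. The sketch below (the shift distances and binning are arbitrary choices) compares the JSD of two disjoint uniform distributions against the 1-D Wasserstein-1 distance, which for equal-size empirical samples reduces to the mean absolute difference of sorted values:

```python
import numpy as np

def jsd_hist(x, y, bins):
    # JSD between histogram estimates of two sample sets
    p, _ = np.histogram(x, bins=bins)
    q, _ = np.histogram(y, bins=bins)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def w1_1d(x, y):
    # 1-D W1 between equal-size empirical samples: mean |sorted x - sorted y|
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 10_000)
bins = np.linspace(-1.0, 25.0, 200)

results = {}
for d in [2.0, 5.0, 20.0]:
    y = x + d  # supports [0,1] and [d, d+1] are disjoint for d > 1
    results[d] = (jsd_hist(x, y, bins), w1_1d(x, y))
    print(d, results[d])
# JSD is stuck at log 2 ≈ 0.693 for every shift, while W1 tracks the shift d.
```

The JSD cannot tell a generator "you are closer now" once supports are disjoint; W1 can, which is exactly why its gradients remain informative.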
The Wasserstein-1 (Earth Mover) distance measures how much “mass” must move to turn one distribution into another. It can provide meaningful gradients even when supports are disjoint.
Formally:
W(p_data, p_g) = inf_{γ ∈ Π(p_data, p_g)} 𝔼_{(x,y)∼γ}[‖x − y‖]
This is hard to compute directly. WGAN uses the Kantorovich–Rubinstein duality:
W(p_data, p_g) = sup_{‖f‖_L ≤ 1} 𝔼_{x∼p_data}[f(x)] − 𝔼_{x∼p_g}[f(x)]
So instead of a discriminator that outputs probabilities, WGAN uses a critic f (often still called D) that outputs real numbers, constrained to be 1-Lipschitz.
max_f 𝔼_{x∼p_data}[f(x)] − 𝔼_{z∼p(z)}[f(G(z))]
min_G − 𝔼_{z∼p(z)}[f(G(z))]
Original WGAN used weight clipping (crude). A widely used improvement is WGAN-GP (gradient penalty):
L_D = −(𝔼_{x∼p_data}[f(x)] − 𝔼_{z}[f(G(z))]) + λ 𝔼_{\hat{x}}[(‖∇_{\hat{x}} f(\hat{x})‖ − 1)²]
where \hat{x} are points interpolated between real and generated samples.
This penalty encourages ‖∇f‖ ≈ 1, approximating the 1-Lipschitz constraint.
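To see what the penalty measures, consider a deliberately simple critic whose input gradient is known in closed form: a linear map f(x) = w·x + b has ∇_x f = w everywhere. The sketch below (weights and sample distributions are made-up) computes the WGAN-GP term for that critic:

```python
import numpy as np

rng = np.random.default_rng(0)

# A linear critic f(x) = w @ x + b has input gradient w at every point,
# so the penalty reduces to lam * (||w|| - 1)^2.
w = np.array([3.0, -4.0])  # ||w|| = 5: a 5-Lipschitz critic, not 1-Lipschitz

def gradient_penalty(w, x_real, x_fake, lam=10.0):
    # Interpolate between real and fake samples to form the \hat{x} points
    eps = rng.uniform(size=(len(x_real), 1))
    x_hat = eps * x_real + (1.0 - eps) * x_fake
    # For this linear critic, grad_x f(x_hat) = w at every x_hat
    grad_norms = np.full(len(x_hat), np.linalg.norm(w))
    return lam * np.mean((grad_norms - 1.0) ** 2)

x_real = rng.normal(0.0, 1.0, size=(64, 2))
x_fake = rng.normal(3.0, 1.0, size=(64, 2))

gp = gradient_penalty(w, x_real, x_fake)
print(gp)  # 10 * (5 - 1)^2 = 160: the penalty pushes ||w|| toward 1
```

With a neural critic the gradient norm varies across the interpolated points and is computed by automatic differentiation, but the penalty has the same shape.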
Even for non-WGAN GANs, regularization helps prevent D from becoming too sharp (leading to vanishing gradients).
Common tools:
| Tool | Where applied | Why it helps |
|---|---|---|
| Spectral normalization | Discriminator weights | Controls Lipschitz constant, stabilizes D |
| Gradient penalty (various forms) | Discriminator | Prevents overly confident / spiky decision boundaries |
| Label smoothing / noisy labels | Discriminator targets | Reduces overconfidence, improves gradients |
| Data augmentation (DiffAugment/ADA) | D input | Prevents D from memorizing, improves sample efficiency |
If you take one simultaneous gradient step on both G and D, you can get rotational dynamics rather than convergence (common in games). Alternating updates instead approximate solving the inner maximization over D before each generator step.

A typical loop:

1) Repeat k times: sample real and generated minibatches, take one gradient ascent step on D.
2) Sample a generated minibatch, take one gradient descent step on G.
In WGAN, k is often > 1 (e.g., 5 critic steps per generator step) early in training.
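The loop can be made concrete with a deliberately tiny 1-D GAN: the data are N(4, 1), the generator only learns a shift G(z) = μ + z, and the discriminator is a logistic classifier, with gradients written out by hand. All of these modeling choices are illustrative simplifications, not a production recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

mu = 0.0          # generator parameter: G(z) = mu + z, starts far from data
w, b = 0.1, 0.0   # discriminator parameters: D(x) = sigmoid(w*x + b)
lr, batch, k = 0.05, 64, 1

for step in range(3000):
    for _ in range(k):  # k discriminator steps per generator step
        xr = rng.normal(4.0, 1.0, batch)        # real samples
        xf = mu + rng.normal(0.0, 1.0, batch)   # generated samples
        Dr, Df = sigmoid(w * xr + b), sigmoid(w * xf + b)
        # Gradients of -[mean log D(xr) + mean log(1 - D(xf))] w.r.t. w and b
        gw = np.mean(-(1.0 - Dr) * xr) + np.mean(Df * xf)
        gb = np.mean(-(1.0 - Dr)) + np.mean(Df)
        w, b = w - lr * gw, b - lr * gb
    # Non-saturating generator step: minimize -mean log D(G(z))
    xf = mu + rng.normal(0.0, 1.0, batch)
    Df = sigmoid(w * xf + b)
    g_mu = np.mean(-(1.0 - Df) * w)  # chain rule through the logit w*xf + b
    mu = mu - lr * g_mu

print(mu)  # should drift toward the data mean 4
```

Even in this toy setting the parameters hover near equilibrium rather than converging monotonically, which previews the oscillatory behavior of full-scale GANs.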
If D is too weak:

- it cannot tell real from fake, so its feedback gives G little useful signal.

If D is too strong:

- it separates real from fake almost perfectly, and G's gradients vanish (the saturation problem discussed earlier).

So “make D perfect” is not the goal; “make D a good teacher” is.
GANs are notoriously hard to evaluate, but you can still monitor:

- samples from a fixed set of latent vectors over training,
- both networks' loss curves (watching for divergence or collapse rather than expecting monotone decrease),
- discriminator accuracy on held-out real vs. generated batches, and
- diversity checks such as latent interpolations and nearest-neighbor comparisons to the training set.
A helpful habit: fix a set of latent vectors {zᵢ} and track G(zᵢ) over training. Mode collapse often shows up as many zᵢ converging to similar outputs.
GANs are most compelling when you need:

- fast, single-pass sampling,
- high perceptual quality, and
- a learned, adaptive loss rather than a hand-designed one.
Typical applications:
1) Image synthesis
2) Image-to-image translation
3) Super-resolution and inpainting
4) Data augmentation
Often you want control: generate x conditioned on y (label, text embedding, another image).
A simple conditional objective feeds y into both G and D:
G(z, y) → x̃
D(x, y) → probability real
Then:
max_D 𝔼_{(x,y)∼p_data}[log D(x,y)] + 𝔼_{z,y}[log(1 − D(G(z,y), y))]
Conditional setups make the mapping easier because y reduces ambiguity (less multi-modality per condition).
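One common way to implement the conditioning (a design choice, not the only one) is to concatenate a one-hot label or embedding onto the generator's latent input, and likewise onto the discriminator's input. A shape-level numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

num_classes, latent_dim, batch = 10, 16, 4

z = rng.normal(size=(batch, latent_dim))
y = rng.integers(0, num_classes, size=batch)
y_onehot = np.eye(num_classes)[y]          # (batch, num_classes)

# Generator sees [z, y]; the discriminator would see [x, y] the same way.
g_input = np.concatenate([z, y_onehot], axis=1)
print(g_input.shape)  # (4, 26): 16 latent dims + 10 label dims
```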
Modern diffusion models often dominate unconditional high-fidelity image generation because they are easier to train and cover modes better (at the cost of slower sampling). Autoregressive models dominate discrete sequences (text) because likelihood-based training is stable.
So a practical selection table:
| Goal | GANs | Diffusion | Autoregressive |
|---|---|---|---|
| Fast sampling | excellent | slower (many steps) | slow (token-by-token) |
| Training stability | challenging | good | good |
| Mode coverage | can be poor (collapse) | strong | strong |
| Likelihood | implicit | often implicit/approx | explicit |
| Best for | images, translation, perceptual tasks | high-fidelity images/audio | text, discrete sequences |
GANs sit at the intersection of:

- game theory (two-player minimax optimization),
- density-ratio estimation (what the optimal discriminator computes), and
- implicit generative modeling (distributions defined by pushing noise through a network).
If you understand GANs deeply, you also understand a general pattern:
Learn a generator by training an adversary that provides a task-specific discrepancy signal.
That pattern reappears in domain adaptation, imitation learning (GAIL), and robust representation learning.
Assume the generator G induces a distribution p_g over x. Consider the classic GAN value function:
V(D, G) = 𝔼_{x∼p_data}[log D(x)] + 𝔼_{x∼p_g}[log(1 − D(x))].
We want the discriminator D that maximizes V for fixed G.
Rewrite expectations as integrals:
V(D,G) = ∫ p_data(x) log D(x) dx + ∫ p_g(x) log(1 − D(x)) dx
= ∫ [p_data(x) log D(x) + p_g(x) log(1 − D(x))] dx.
Observe that the integrand depends on D only through D(x) at each x, so we can maximize pointwise.
For fixed x define:
f_x(u) = p_data(x) log u + p_g(x) log(1 − u), where u = D(x).
Differentiate with respect to u:
∂f_x/∂u = p_data(x)/u − p_g(x)/(1 − u).
Set derivative to zero and solve:
p_data(x)/u = p_g(x)/(1 − u)
⇒ p_data(x)(1 − u) = p_g(x)u
⇒ p_data(x) = (p_data(x) + p_g(x))u
⇒ u = p_data(x)/(p_data(x) + p_g(x)).
Conclude:
D*(x) = p_data(x) / (p_data(x) + p_g(x)).
Insight: The discriminator is not “mysterious”: at optimum it estimates a density ratio. This is why GANs can be viewed as divergence minimization—D is a learned critic that compares p_data and p_g.
Using the optimal discriminator from the previous example, compute V(D*, G) and relate it to JSD(p_data ‖ p_g).
Start with D*(x) = p_data(x)/(p_data(x)+p_g(x)). Plug into V:
V(D*,G) = ∫ p_data(x) log( p_data(x)/(p_data(x)+p_g(x)) ) dx + ∫ p_g(x) log( p_g(x)/(p_data(x)+p_g(x)) ) dx
Define the mixture distribution m(x) = (1/2)(p_data(x)+p_g(x)). Then:
p_data(x)/(p_data(x)+p_g(x)) = p_data(x)/(2m(x))
p_g(x)/(p_data(x)+p_g(x)) = p_g(x)/(2m(x)).
Rewrite V(D*,G):
V(D*,G) = ∫ p_data(x) log( p_data(x)/(2m(x)) ) dx + ∫ p_g(x) log( p_g(x)/(2m(x)) ) dx.
Split logs:
log(p_data/(2m)) = log(p_data/m) + log(1/2)
log(p_g/(2m)) = log(p_g/m) + log(1/2).
Use normalization of distributions:
∫ p_data(x) dx = 1, ∫ p_g(x) dx = 1.
So the constant terms contribute 2 log(1/2) = −2 log 2.
Recognize KL terms:
∫ p_data(x) log(p_data(x)/m(x)) dx = KL(p_data ‖ m)
∫ p_g(x) log(p_g(x)/m(x)) dx = KL(p_g ‖ m).
Therefore:
V(D*,G) = KL(p_data ‖ m) + KL(p_g ‖ m) − 2 log 2
= 2·JSD(p_data ‖ p_g) − 2 log 2.
Insight: This establishes the idealized story: if D is optimized, then improving G reduces a statistical divergence. The practical story is harder because D is never fully optimized and neural nets/finite data introduce instability.
Consider the original generator loss L_G^minimax = 𝔼_z[log(1 − D(G(z)))]. Suppose early in training the discriminator becomes very confident: D(G(z)) ≈ 0 for most z.
If D(G(z)) ≈ 0 then 1 − D(G(z)) ≈ 1.
Thus log(1 − D(G(z))) ≈ log(1) = 0, so the loss becomes near-constant for many samples.
A near-constant loss implies small gradients with respect to generator parameters θ_G because:
∇_{θ_G} log(1 − D(G(z)))
= (1/(1 − D(G(z)))) · (−∇_{θ_G} D(G(z))).
When D(G(z)) is extremely close to 0, D often lies in a saturated region of its sigmoid, making ∇ D(G(z)) small as well (depending on discriminator parametrization).
Compare to non-saturating loss:
L_G^NS = −𝔼_z[log D(G(z))].
If D(G(z)) ≈ 0, then log D(G(z)) is very negative, and the gradient signal is typically stronger because the loss strongly penalizes small D(G(z)).
Insight: This is one of the simplest reasons GAN training can stall: if D gets too good too fast, the minimax generator objective can provide weak learning signals. Many practical GAN recipes use the non-saturating loss and/or regularize D to remain a useful teacher.
A GAN defines a generator G(z) that induces p_g and a discriminator D(x) that estimates “realness”; training is a two-player zero-sum minimax game.
For fixed G, the optimal discriminator is D*(x) = p_data(x)/(p_data(x)+p_g(x)), which acts like a density-ratio estimator.
With the optimal discriminator D*, the value function becomes V(D*,G) = 2·JSD(p_data ‖ p_g) − 2 log 2, so the ideal equilibrium is p_g = p_data with D* = 1/2 everywhere.
The classic minimax generator loss can saturate when D becomes too strong; the non-saturating generator loss −𝔼[log D(G(z))] often yields better gradients.
GAN training is game optimization, not ordinary minimization; oscillations and instability are expected without constraints and regularization.
Mode collapse occurs when G finds a small set of outputs that exploit D; mitigations include stronger-but-regularized discriminators and alternative objectives.
Wasserstein GAN replaces probability discrimination with a 1-Lipschitz critic to obtain more informative gradients; gradient penalties or spectral normalization help enforce constraints.
Treating GAN losses like standard supervised losses and expecting monotonic convergence; in adversarial games, losses can oscillate while samples improve (or vice versa).
Letting the discriminator overfit or become too strong (near-perfect separation) without regularization, leading to vanishing/poor gradients for the generator.
Assuming that “D accuracy ≈ 50%” always means success; it can also mean both networks are weak or that D is confused due to poor training signals.
Ignoring mode collapse by evaluating only sample quality and not diversity (e.g., failing to test latent interpolations or nearest-neighbor comparisons).
Suppose p_data(x) = p_g(x) for all x. What is D*(x)? What is V(D*, G) in this case?
Hint: Use D*(x) = p_data(x)/(p_data(x)+p_g(x)). Then plug into V(D*,G) = 2·JSD(p_data ‖ p_g) − 2 log 2.
If p_data = p_g, then D*(x) = p_data(x)/(2p_data(x)) = 1/2 for all x.
Also JSD(p_data ‖ p_g) = 0, so:
V(D*,G) = 2·0 − 2 log 2 = −2 log 2.
Derive ∂/∂D(x) of the integrand p_data(x) log D(x) + p_g(x) log(1 − D(x)), and show it yields the optimal discriminator formula when set to zero.
Hint: Differentiate log D(x) and log(1−D(x)) carefully: d/dD log D = 1/D and d/dD log(1−D) = −1/(1−D).
Let u = D(x). The derivative is:
∂/∂u [p_data log u + p_g log(1−u)]
= p_data·(1/u) + p_g·(−1/(1−u))
= p_data/u − p_g/(1−u).
Set to 0:
p_data/u = p_g/(1−u)
⇒ p_data(1−u) = p_g u
⇒ p_data = (p_data+p_g)u
⇒ u = p_data/(p_data+p_g).
So D*(x) = p_data(x)/(p_data(x)+p_g(x)).
You observe the discriminator achieves ~99% accuracy quickly and the generator outputs barely change over time. Propose two interventions grounded in GAN theory/practice, and explain why each helps.
Hint: Think: gradient saturation, overfitting of D, imbalance. Consider non-saturating loss, regularization (spectral norm, gradient penalty), data augmentation, or changing update ratios.
Two plausible interventions:
1) Use the non-saturating generator loss L_G^NS = −𝔼_z[log D(G(z))] instead of the minimax loss 𝔼_z[log(1−D(G(z)))].
Reason: when D(G(z)) is near 0, log(1−D(G(z))) is near 0 and can provide weak gradients; −log D(G(z)) penalizes small D(G(z)) more strongly, typically producing larger, more useful gradients.
2) Regularize / constrain the discriminator so it remains a smooth teacher rather than an overconfident separator. Examples: spectral normalization on D, or a gradient penalty (e.g., WGAN-GP style), plus possibly data augmentation.
Reason: an overly sharp or overfit D can produce uninformative gradients for G (and can memorize training data). Regularization improves generalization and keeps gradients meaningful. Data augmentation reduces memorization and makes D learn more robust features.
Optionally, adjust the training balance (e.g., fewer D steps, lower D learning rate) so D does not outpace G.