Learning to learn. Few-shot learning, MAML.
Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.
A normal ML pipeline learns one task at a time: it starts random (or pretrained), sees lots of data, and slowly becomes good. Meta-learning flips the question: can we train a system so that seeing just a few examples from a brand-new task is enough to adapt immediately?
Meta-learning (“learning to learn”) trains over a distribution of tasks. In gradient-based meta-learning (e.g., MAML), we learn meta-parameters θ (often an initialization) so that a small inner-loop update on a new task produces adapted parameters θ′ with low task loss. The outer loop optimizes θ by differentiating through the inner-loop adaptation across many tasks.
In standard supervised learning, we assume a single task: one input space, one label space, one dataset, and one loss. If you want to solve a new but related task, you often retrain or fine-tune. That works, but it’s wasteful and slow when each new task brings only a handful of labeled examples and new tasks arrive frequently.
Humans seem to have a prior over tasks: you can learn a new character in a foreign alphabet from a couple examples because you’ve learned how learning tends to work in that domain. Meta-learning aims to build that capability into ML systems.
Meta-learning is not defined by a specific model class (you can meta-learn neural nets, linear models, optimizers). It’s defined by the training setup: instead of fitting one dataset, you sample tasks from a distribution and train the system to adapt quickly within each task.
A common formalization: tasks 𝒯 are drawn from a task distribution p(𝒯), and each task comes with a support set Dₛ(𝒯) and a query set D_q(𝒯).
The meta-learner uses Dₛ(𝒯) to adapt quickly, then is evaluated on D_q(𝒯). Meta-training adjusts shared structure so that this procedure works well for tasks drawn from p(𝒯).
A useful way to think about it is as a two-level optimization: an inner loop that adapts shared parameters to a single task, and an outer loop that adjusts those shared parameters so the inner loop works well across tasks.
In gradient-based meta-learning, the shared parameters are often an initialization θ. Given a new task, you start from θ and take one or a few gradient steps to obtain θ′ (task-adapted parameters).
Meta-learning is strongest when tasks are related but not identical.
Examples: few-shot classification where each task uses a different subset of classes; regression where each task is a sinusoid with its own amplitude and phase; RL where each task is a different environment layout.
If tasks are unrelated, no method can reliably transfer. If tasks are identical, ordinary training already works.
| Term | Meaning | Typical symbol |
|---|---|---|
| Task distribution | How tasks are generated | p(𝒯) |
| Task | A specific problem instance with its own loss | 𝒯 |
| Support set | Few-shot data used for adaptation | Dₛ(𝒯) |
| Query set | Data used to evaluate meta-objective | D_q(𝒯) |
| Meta-parameters | Shared parameters across tasks | θ |
| Adapted parameters | Parameters after inner update for task 𝒯 | θ′ |
The rest of the lesson focuses on a canonical approach: MAML (Model-Agnostic Meta-Learning), which cleanly illustrates the inner/outer loop idea and the role of θ and θ′.
If the goal is: “perform well after seeing only a few examples from a new task,” then the training procedure should match that goal.
Episodic training (a.k.a. meta-training) simulates test-time conditions repeatedly: sample a task, adapt on its support set, evaluate on its query set, and update the shared parameters based on that evaluation.
This forces the model to practice adapting under data scarcity.
A task 𝒯 typically specifies: an input and output space, a data distribution (or dataset D), and a loss ℒ_𝒯.
For supervised learning, a common choice is cross-entropy over D.
The meta-learning assumption is: tasks are drawn from a common distribution p(𝒯) and share structure that makes fast adaptation possible.
You can imagine a hidden variable z controlling each task, e.g., “which classes are chosen,” “which sinusoid parameters,” or “which environment layout.” Even if we don’t explicitly model z, meta-learning tries to learn parameters that work well across the induced distribution.
Within each task 𝒯, we split data into: a support set Dₛ(𝒯), used for adaptation, and a query set D_q(𝒯), used to evaluate the meta-objective.
This is subtle but crucial: if we adapted and evaluated on the same data, the meta-objective would reward memorizing the support set rather than generalizing after adaptation.
This resembles cross-validation, but nested inside a task distribution.
We want θ such that: after a small inner-loop update on Dₛ(𝒯), the adapted parameters θ′(𝒯) achieve low loss on D_q(𝒯).
So the outer objective is an expectation over tasks:
MetaLoss(θ) = 𝔼_{𝒯 ∼ p(𝒯)} [ ℒ_𝒯(θ′(𝒯); D_q(𝒯)) ]
But θ′(𝒯) is itself produced by a learning rule (inner loop), usually gradient descent.
A standard benchmark structure is N-way K-shot: each episode samples N classes, the support set contains K labeled examples per class, and the query set contains held-out examples of the same N classes.
You run many episodes, each with different class subsets.
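The episode construction above can be sketched as a small sampler. The `data_by_class` mapping and scalar “examples” here are hypothetical stand-ins for real images or feature vectors:

```python
import random

def sample_episode(data_by_class, n_way, k_shot, q_query, rng=random):
    """Sample one N-way K-shot episode: pick N classes, then K support
    and Q query examples per class, relabeling classes 0..N-1 per episode."""
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        examples = rng.sample(data_by_class[cls], k_shot + q_query)
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query

# toy data: 5 classes with 20 scalar "examples" each (hypothetical)
data = {c: list(range(c * 100, c * 100 + 20)) for c in range(5)}
support, query = sample_episode(data, n_way=3, k_shot=2, q_query=4)
print(len(support), len(query))  # 6 support and 12 query pairs
```

Relabeling classes per episode matters: the model must infer class identity from the support set, not memorize fixed labels.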
Normal training: sample minibatches of examples from one dataset and minimize a single loss.
Meta-training (episodic): sample tasks, adapt on each task’s support set, evaluate on its query set, and update the shared parameters from the query loss.
A practical comparison:
| Aspect | Standard supervised learning | Meta-learning (episodic) |
|---|---|---|
| Unit of sampling | Example (x, y) | Task 𝒯 (support + query) |
| Objective | Low loss on dataset | Low loss after adaptation |
| Generalization target | New examples from same task | New tasks from p(𝒯) |
| Overfitting risk | Overfit dataset | Overfit task distribution |
Meta-learning is sometimes described as learning a model of learning: the algorithm itself is trained.
Concretely, you are no longer just learning a function f_θ(x) → y.
You are learning parameters θ such that the procedure
θ → (adapt using Dₛ(𝒯)) → θ′(𝒯) → predictions on D_q(𝒯)
works well across tasks.
This sets up the next mechanic: the fast inner loop and how θ′ is computed.
If you already know SGD and backprop, a natural idea is: “Can we meta-learn an initialization that fine-tunes quickly?”
This is the core of MAML: learn an initialization θ such that one or a few gradient steps on a new task’s support set produce parameters that perform well on that task.
For a task 𝒯 with support set Dₛ, define the support loss:
ℒₛ(θ) = ℒ_𝒯(θ; Dₛ(𝒯))
A single gradient step with step size α gives:
θ′ = θ − α ∇_θ ℒₛ(θ)
This is the fast adaptation step. With multiple inner steps, you iterate:
θ⁽⁰⁾ = θ
θ⁽i+1⁾ = θ⁽i⁾ − α ∇_{θ⁽i⁾} ℒ_𝒯(θ⁽i⁾; Dₛ)
and set θ′ = θ⁽m⁾.
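A minimal sketch of this multi-step inner loop, assuming an analytic gradient for a toy support loss ℒₛ(θ) = ‖θ − t‖² whose optimum t plays the role of the task-specific solution:

```python
import numpy as np

def inner_adapt(theta, grad_fn, alpha=0.1, steps=20):
    """Iterate theta_{i+1} = theta_i - alpha * grad(theta_i); return theta' = theta_m."""
    for _ in range(steps):
        theta = theta - alpha * grad_fn(theta)
    return theta

# hypothetical support loss Ls(theta) = ||theta - t||^2 with task optimum t
t = np.array([1.0, -2.0])
grad_support = lambda th: 2.0 * (th - t)   # analytic gradient of Ls
theta_prime = inner_adapt(np.zeros(2), grad_support, alpha=0.1, steps=20)
print(theta_prime)  # close to t after 20 steps
```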
Now evaluate on the query set:
ℒ_q(θ′) = ℒ_𝒯(θ′; D_q(𝒯))
The meta-objective across tasks is:
min_θ 𝔼_{𝒯 ∼ p(𝒯)} [ ℒ_𝒯(θ′(θ, 𝒯); D_q(𝒯)) ]
The key is that θ′ depends on θ. Therefore, when we compute the meta-gradient ∇_θ ℒ_q(θ′), we must differentiate through the inner update.
For one inner step:
θ′(θ) = θ − α ∇_θ ℒₛ(θ)
The meta-gradient is:
∇_θ ℒ_q(θ′(θ))
Using the chain rule:
∇_θ ℒ_q(θ′) = (∂θ′/∂θ)ᵀ ∇_{θ′} ℒ_q(θ′)
Compute ∂θ′/∂θ:
θ′ = θ − α ∇_θ ℒₛ(θ)
Differentiate w.r.t. θ:
∂θ′/∂θ = I − α ∂(∇_θ ℒₛ(θ))/∂θ
But ∂(∇_θ ℒₛ)/∂θ is the Hessian:
∂(∇_θ ℒₛ(θ))/∂θ = ∇²_θ ℒₛ(θ)
So:
∂θ′/∂θ = I − α ∇²_θ ℒₛ(θ)
Therefore:
∇_θ ℒ_q(θ′)
= (I − α ∇²_θ ℒₛ(θ))ᵀ ∇_{θ′} ℒ_q(θ′)
Since the Hessian is symmetric (for twice continuously differentiable losses), the transpose can be dropped:
∇_θ ℒ_q(θ′)
= (I − α ∇²_θ ℒₛ(θ)) ∇_{θ′} ℒ_q(θ′)
This is why MAML is considered “second-order”: it involves Hessian-vector products.
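In practice the Hessian is never formed explicitly; the product ∇²ℒₛ · v can be approximated from two gradient evaluations. A sketch using a hypothetical quadratic loss whose Hessian is known exactly:

```python
import numpy as np

def hvp(grad_fn, theta, v, eps=1e-5):
    """Hessian-vector product via central differences of the gradient."""
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)

# hypothetical support loss Ls(theta) = theta^T A theta, so the Hessian is 2A
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
grad_support = lambda th: 2.0 * A @ th
v = np.array([1.0, -1.0])
print(hvp(grad_support, np.zeros(2), v))  # matches 2 A v = [3., -1.]
```

This is why the second-order cost is “Hessian-vector products,” not full Hessians: only gradient calls are needed.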
Computing the exact meta-gradient can be expensive. A common approximation is to ignore the Hessian term:
∂θ′/∂θ ≈ I
Then:
∇_θ ℒ_q(θ′) ≈ ∇_{θ′} ℒ_q(θ′)
This is FOMAML. It often works surprisingly well, trading some accuracy for speed and memory.
Another popular first-order method is Reptile, which repeatedly: samples a task, runs several SGD steps on its data to obtain θ′, and then moves θ a small amount toward θ′.
Update:
θ ← θ + ε(θ′ − θ)
Reptile can be derived as optimizing a meta-objective that encourages within-task generalization. It’s simpler (no second-order terms) and sometimes competitive.
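A sketch of Reptile on the linear-regression family used later in the lesson (y = a·x, model f_θ(x) = θx), with assumed hyperparameters; θ should drift toward the average task slope:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.0
alpha, eps_outer, inner_steps = 0.1, 0.05, 5   # assumed hyperparameters

for _ in range(2000):
    a = rng.uniform(-2.0, 2.0)                 # sample a task: slope a
    xs = rng.normal(size=5)                    # the task's support inputs
    theta_p = theta
    for _ in range(inner_steps):               # several SGD steps on the task
        theta_p -= alpha * 2.0 * (theta_p - a) * np.mean(xs**2)
    theta += eps_outer * (theta_p - theta)     # move theta toward adapted weights

print(theta)  # should drift toward E[a] = 0 for this symmetric distribution
```

Note there is no query set and no second-order term: the outer update is just an interpolation toward the task-adapted weights.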
It’s tempting to say: “MAML learns an initialization.” That’s true, but incomplete.
MAML learns θ such that: a small number of gradient steps on any sampled task’s support set yields a large improvement on that task’s query set.
In geometric terms, θ is placed in parameter space near many task-specific optima, in a way that a small step can reach each optimum.
The adaptation rule includes choices: the inner step size α, the number of inner steps m, and which parameters are adapted.
These strongly affect performance. Sometimes α is itself meta-learned.
For each meta-iteration: sample a batch of tasks from p(𝒯); for each task, compute θ′ by inner gradient steps on its support set; evaluate the query loss ℒ_q(θ′); then update θ ← θ − β ∇_θ Σ_𝒯 ℒ_q(θ′(θ, 𝒯)).
Here β is the outer learning rate.
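The meta-iteration can be sketched end-to-end on the linear-regression task family used later in the lesson (y = a·x, model f_θ(x) = θx), where the support-loss gradient and Hessian are available in closed form, so no autodiff is needed. Learning rates, batch size, and the task distribution below are assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.05, 0.02          # inner and outer learning rates (assumed)
theta = 0.0                       # meta-parameter: the slope initialization

def sample_task():
    """A task is a slope a plus support/query inputs (assumed distribution)."""
    a = rng.uniform(-2.0, 2.0)
    return a, rng.normal(size=5), rng.normal(size=10)

for _ in range(2000):
    tasks = [sample_task() for _ in range(4)]          # meta-batch of tasks
    meta_grad = 0.0
    for a, xs, xq in tasks:
        hs = 2.0 * np.mean(xs**2)                      # d^2 Ls / d theta^2
        theta_p = theta - alpha * (theta - a) * hs     # inner step on support
        gq = 2.0 * (theta_p - a) * np.mean(xq**2)      # query gradient at theta'
        meta_grad += (1.0 - alpha * hs) * gq           # chain rule through inner step
    theta -= beta * meta_grad / len(tasks)             # outer update with rate beta

# meta-test: adapt to a held-out task with one inner step
a, xs, _ = sample_task()
theta_new = theta - alpha * (theta - a) * 2.0 * np.mean(xs**2)
print(theta, theta_new)  # theta should sit near the mean slope; theta_new moves toward a
```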
At meta-test time: given a new task, adapt from θ using its small support set to obtain θ′, then predict on new data from that task.
This completes the central mechanism: θ is trained so that θ → θ′ quickly yields good performance.
In few-shot classification, the model must build a classifier for novel classes from very few labeled examples.
Two broad families of approaches:
| Family | Core idea | Examples |
|---|---|---|
| Metric-based | Learn an embedding where nearest-neighbor works | Prototypical Networks, Matching Networks |
| Optimization-based | Learn parameters/initialization to optimize quickly | MAML, FOMAML, Reptile |
MAML’s advantage is flexibility: it can adapt the whole network, not just a linear head. Its disadvantage is computational cost.
Because MAML only assumes differentiability, it applies to: regression, classification, and reinforcement learning, with essentially the same inner/outer structure.
In RL, tasks might be different environments. The inner loop becomes one or a few policy-gradient updates; the query loss is expected return after adaptation.
Meta-learning can overfit in two ways: within a task, the inner loop can overfit a tiny support set; and across tasks, the meta-learner can overfit the meta-training task distribution (meta-overfitting).
Meta-overfitting is especially likely if: there are few meta-training tasks, the tasks are too similar to each other, or the model and inner loop have high capacity relative to the task diversity.
Practical mitigations: hold out meta-validation and meta-test tasks, increase the number and diversity of training tasks, and limit inner-loop capacity (fewer steps, smaller α, regularization).
Exact MAML requires differentiating through inner-loop computation graphs.
Costs: memory grows with the number of inner steps (the unrolled computation graph must be stored), and the meta-gradient requires Hessian-vector products.
Common workarounds: first-order approximations (FOMAML, Reptile), fewer inner steps, and adapting only a subset of the parameters.
Meta-learning is not magic. It leverages task similarity. You should expect: large gains when tasks share structure, and little or no benefit (or even harm) when they don’t.
It helps to make θ and θ′ concrete: θ is the shared starting point learned across tasks; θ′ is what θ becomes after adapting to one specific task’s support set.
The quality of meta-learning is measured by how good θ′ becomes given a tiny Dₛ.
If you imagine each task has its own optimal parameters θ*(𝒯), then MAML tries to find θ such that: a few gradient steps from θ land close to θ*(𝒯) for tasks drawn from p(𝒯).
This is why, during meta-training, you must repeatedly practice: adapt on support, evaluate on query.
| Criterion | Metric-based (Prototypical) | MAML |
|---|---|---|
| Speed at meta-test | Very fast | Requires inner optimization |
| Flexibility | Often assumes class structure | Works for many losses/settings |
| Implementation complexity | Moderate | High (unrolling, stability) |
| Best when | Embedding is sufficient | Task requires deeper adaptation |
If your tasks differ mainly in labels/classes but share representation, metric methods are strong. If tasks require changing internal features or dynamics, optimization-based meta-learning can shine.
This closes the loop: meta-learning is a training paradigm over tasks, with a fast inner adaptation and a meta-objective optimizing θ so that θ′ performs well after few-shot updates.
Consider a simple 1D parameter θ ∈ ℝ. For a sampled task 𝒯, define the support loss ℒₛ(θ) and query loss ℒ_q(θ). We do one inner step: θ′ = θ − α dℒₛ/dθ. We want dℒ_q(θ′)/dθ.
Inner update (one step):
θ′(θ) = θ − α (dℒₛ(θ)/dθ)
Differentiate θ′ w.r.t. θ:
dθ′/dθ = 1 − α d/dθ (dℒₛ/dθ)
= 1 − α (d²ℒₛ/dθ²)
Apply chain rule to the meta-objective:
d/dθ ℒ_q(θ′(θ)) = (dℒ_q/dθ′) · (dθ′/dθ)
Substitute the expression for dθ′/dθ:
dℒ_q(θ′)/dθ = (dℒ_q/dθ′) · (1 − α d²ℒₛ/dθ²)
Insight: Even in 1D, the meta-gradient includes a curvature term from the support loss. MAML is optimizing θ not just for low loss, but for producing useful gradients from few examples.
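The 1D formula can be checked numerically: pick simple quadratic losses (hypothetical choices ℒₛ(θ) = (θ−1)² and ℒ_q(θ) = (θ−2)²), compute the analytic meta-gradient, and compare it against a finite-difference derivative of ℒ_q(θ′(θ)):

```python
alpha, theta = 0.1, 0.0

g_s = lambda th: 2.0 * (th - 1.0)   # dLs/dtheta for Ls = (theta - 1)^2
h_s = 2.0                           # d2Ls/dtheta2 (constant for a quadratic)
g_q = lambda th: 2.0 * (th - 2.0)   # dLq/dtheta for Lq = (theta - 2)^2

theta_p = theta - alpha * g_s(theta)             # inner step
full = (1.0 - alpha * h_s) * g_q(theta_p)        # exact meta-gradient
fomaml = g_q(theta_p)                            # first-order approximation

eps = 1e-6
f = lambda th: ((th - alpha * g_s(th)) - 2.0) ** 2   # Lq(theta'(theta))
fd = (f(theta + eps) - f(theta - eps)) / (2 * eps)   # numeric meta-gradient
print(full, fomaml, fd)  # full matches fd; fomaml does not
```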
Suppose tasks are 1D linear regression with parameter a (task-specific slope). For each task 𝒯, data is y = a x with small noise. The model is f_θ(x) = θ x. Support set has a few (x, y) pairs. Show how one gradient step moves θ toward a.
Define mean-squared error on support set Dₛ:
ℒₛ(θ) = (1/|Dₛ|) ∑(θ xᵢ − yᵢ)²
Use yᵢ = a xᵢ (ignore noise for clarity):
θ xᵢ − yᵢ = θ xᵢ − a xᵢ = (θ − a) xᵢ
Rewrite the loss:
ℒₛ(θ) = (1/|Dₛ|) ∑ ((θ − a) xᵢ)²
= (θ − a)² · (1/|Dₛ|) ∑ xᵢ²
Compute the gradient:
dℒₛ/dθ = 2(θ − a) · (1/|Dₛ|) ∑ xᵢ²
One inner gradient step:
θ′ = θ − α · 2(θ − a) · (1/|Dₛ|) ∑ xᵢ²
Factor the update:
θ′ = θ − c(θ − a) where c = 2α(1/|Dₛ|) ∑ xᵢ²
So:
θ′ = (1 − c) θ + c a
Insight: For this family, one step is a convex combination of θ and the task slope a (if 0 < c < 1). Meta-learning θ amounts to choosing an initialization that is close (on average) to task-specific optima so that one step lands near a.
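This derivation is easy to verify numerically; the slope, step size, and inputs below are arbitrary assumed values:

```python
import numpy as np

rng = np.random.default_rng(2)
a, alpha, theta = 1.5, 0.05, 0.0     # assumed task slope, step size, init
xs = rng.normal(size=10)
ys = a * xs                          # noiseless task data y = a x

grad = np.mean(2.0 * (theta * xs - ys) * xs)   # dLs/dtheta on the support set
theta_p = theta - alpha * grad                 # one inner step

c = 2.0 * alpha * np.mean(xs**2)               # coefficient from the derivation
print(theta_p, (1 - c) * theta + c * a)        # the two values agree
```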
For one inner step: θ′ = θ − α ∇_θ ℒₛ(θ). Full MAML uses ∇_θ ℒ_q(θ′(θ)). FOMAML approximates this gradient. Write both explicitly to see the difference.
Full MAML meta-gradient:
∇_θ ℒ_q(θ′) = (∂θ′/∂θ)ᵀ ∇_{θ′} ℒ_q(θ′)
Compute the Jacobian:
∂θ′/∂θ = I − α ∇²_θ ℒₛ(θ)
So full MAML is:
∇_θ ℒ_q(θ′) = (I − α ∇²_θ ℒₛ(θ))ᵀ ∇_{θ′} ℒ_q(θ′)
FOMAML approximation sets:
∂θ′/∂θ ≈ I
Therefore:
∇_θ ℒ_q(θ′) ≈ ∇_{θ′} ℒ_q(θ′)
Insight: FOMAML treats θ′ as if it were independent of θ when computing the gradient. You keep the benefit of adapting in the inner loop, but you ignore how changing θ changes the adaptation trajectory.
Meta-learning trains over a distribution of tasks p(𝒯), not a single dataset.
Each task splits into support Dₛ (for adaptation) and query D_q (for meta-objective), rewarding generalization after adaptation.
In MAML, θ are meta-parameters (often an initialization), and θ′ are task-adapted parameters after inner-loop updates.
The outer objective minimizes expected query loss after adaptation: 𝔼_{𝒯}[ℒ_𝒯(θ′; D_q)].
Full MAML differentiates through the inner update, introducing second-order terms involving ∇²_θ ℒₛ.
FOMAML and Reptile are popular first-order alternatives that reduce computation and memory cost.
Meta-learning can meta-overfit: you must validate on held-out tasks and control inner-loop capacity and step sizes.
Meta-learning is most effective when tasks share structure such that fast adaptation from a shared θ is possible.
Using the same data for adaptation and meta-evaluation (no support/query split), which rewards memorization rather than adaptation.
Assuming meta-learning will help when tasks are unrelated; without shared structure in p(𝒯), transfer cannot work.
Treating α (inner learning rate) and the number of inner steps as minor details—these can make MAML unstable or ineffective.
Reporting only meta-training performance; the real test is performance on unseen meta-test tasks after adaptation.
You have tasks 𝒯 ∼ p(𝒯). For each task you compute θ′ = θ − α ∇_θ ℒ_𝒯(θ; Dₛ). Write the meta-objective using a query set D_q and describe (in one sentence) what it encourages.
Hint: Use an expectation over tasks and evaluate loss at θ′ on D_q.
MetaLoss(θ) = 𝔼_{𝒯 ∼ p(𝒯)} [ ℒ_𝒯(θ′; D_q(𝒯)) ], where θ′ = θ − α ∇_θ ℒ_𝒯(θ; Dₛ(𝒯)). It encourages choosing θ so that a small gradient-based adaptation using Dₛ yields parameters that generalize well to new data D_q from the same task.
Derive ∂θ′/∂θ for one inner step θ′ = θ − α ∇_θ ℒₛ(θ), and identify where the Hessian appears.
Hint: Differentiate both sides w.r.t. θ; derivative of a gradient is a Hessian.
Differentiate: ∂θ′/∂θ = ∂/∂θ [θ − α ∇_θ ℒₛ(θ)] = I − α ∂(∇_θ ℒₛ(θ))/∂θ = I − α ∇²_θ ℒₛ(θ). The Hessian appears as the Jacobian of the gradient.
In the linear regression family y = a x (task slope a), suppose one inner step yields θ′ = (1 − c)θ + c a for some 0 < c < 1 (as derived in the lesson). If tasks have slopes a with mean 𝔼[a] = μ, what initialization θ minimizes expected squared error 𝔼[(θ − a)²] before adaptation? What does that suggest about a reasonable meta-initialization when only one small step is allowed?
Hint: Minimizing 𝔼[(θ − a)²] over θ gives θ = 𝔼[a].
We minimize J(θ) = 𝔼[(θ − a)²]. Differentiate: dJ/dθ = 𝔼[2(θ − a)] = 2(θ − 𝔼[a]). Setting to 0 gives θ* = 𝔼[a] = μ. This suggests that when only a limited adaptation is possible, a good meta-initialization is near the average task optimum; the inner step then nudges θ toward each specific a.
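A quick Monte Carlo check of this answer, with a hypothetical slope distribution whose mean is 1.0:

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.normal(1.0, 0.5, size=10_000)    # task slopes with mean mu = 1.0

thetas = np.linspace(-2.0, 4.0, 601)     # candidate initializations
J = np.array([np.mean((t - a) ** 2) for t in thetas])
best = thetas[np.argmin(J)]
print(best, a.mean())  # the grid minimizer sits at the empirical mean of a
```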