Global optimization with probabilistic surrogate models. Acquisition functions.
When evaluating f(x) is expensive (a lab experiment, a simulation, training a model), you can’t afford thousands of trials. Bayesian optimization turns each evaluation into maximal information: it builds a probabilistic surrogate for f, uses uncertainty to decide where to sample next, and repeats—often finding near-optimal solutions in tens of evaluations rather than thousands.
Bayesian optimization (BO) is a global, sample-efficient strategy for optimizing an unknown, expensive black-box function f(x). It maintains a probabilistic surrogate model that provides a predictive mean μ(x) and predictive variance σ²(x), then chooses the next query by maximizing an acquisition function that balances exploitation (favorable μ: low when minimizing, high when maximizing) and exploration (high σ). The loop is: fit surrogate → optimize acquisition → evaluate f at selected x → update surrogate.
Many real optimization problems don’t match the assumptions that make classical optimization fast: evaluations are expensive, gradients are unavailable, the objective is a black box, and observations may be noisy.
If each evaluation is precious, the core question becomes:
Which x should I evaluate next to learn the most about where the optimum might be?
Bayesian optimization answers this using probability: it represents uncertainty about f and uses that uncertainty to guide where to sample.
Let f(x) be an unknown objective you want to minimize or maximize over a domain 𝒳 ⊂ ℝᵈ.
Bayesian optimization is a sequential decision procedure: at each step it fits a surrogate to all data collected so far, then selects the next query by maximizing an acquisition function a(x):
xₙ₊₁ ∈ argmaxₓ∈𝒳 a(x)
The surrogate gives, at any x: a predictive mean μₙ(x) and a predictive variance σ²ₙ(x).
This pair (μ, σ²) is the “sufficient interface” most BO algorithms need.
If you only chase the best current μ(x), you might get stuck exploiting a wrong region.
If you only chase high σ(x), you waste evaluations in uncertain but unpromising regions.
Acquisition functions formalize a trade-off: they reward points with promising predicted values (via μ) and points with high uncertainty (via σ).
It’s not “Bayesian” in the sense that your final answer must be a full posterior over the optimum; often we just want a good x after few evaluations.
It’s not restricted to Gaussian processes (GPs), though GPs are the canonical choice.
It’s not guaranteed to find the global optimum in finite time—but it is often extremely effective when evaluations are expensive.
Assume we have data 𝒟ₙ = {(xᵢ, yᵢ)}ᵢ₌₁ⁿ.
A surrogate produces a predictive distribution for f(x) given 𝒟ₙ.
We summarize that as:
f(x) | 𝒟ₙ ≈ 𝒩( μₙ(x), σ²ₙ(x) )
Then a common pattern is:
aₙ(x) = Acquisition( μₙ(x), σₙ(x) )
and choose xₙ₊₁ maximizing aₙ.
This single loop—model → acquisition → evaluate → update—is the heart of Bayesian optimization.
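This loop can be sketched generically in Python. The helper names `fit_surrogate` and `acquisition` are hypothetical user-supplied callables, and the candidate set is kept finite for simplicity (a sketch, not a production implementation):

```python
import numpy as np

def bayes_opt_loop(f, fit_surrogate, acquisition, candidates, n_init, n_iter, rng):
    """Generic BO loop over a finite candidate set: model -> acquisition -> evaluate -> update.

    fit_surrogate(X, y, candidates) must return (mu, sigma) arrays over the candidates;
    acquisition(mu, sigma, best_y) must return a score to maximize at each candidate.
    """
    X = list(rng.choice(candidates, size=n_init, replace=False))  # initial design
    y = [f(x) for x in X]
    for _ in range(n_iter):
        mu, sigma = fit_surrogate(np.array(X), np.array(y), candidates)
        x_next = candidates[int(np.argmax(acquisition(mu, sigma, min(y))))]
        X.append(x_next)
        y.append(f(x_next))                                       # the expensive evaluation
    i = int(np.argmin(y))
    return X[i], y[i]                                             # incumbent (minimization)
```

Any surrogate exposing (μ, σ) and any score built from them can be plugged in; the loop itself never changes.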
If evaluations are expensive, you want to extract maximum value from each one. A surrogate is a cheaper function you can evaluate anywhere, with uncertainty that reflects limited data.
A good surrogate should: fit the observed data, generalize plausibly between observations, and report calibrated uncertainty where data is scarce.
That uncertainty is what enables principled exploration.
A GP is a distribution over functions. Informally:
f(·) ∼ GP(m(·), k(·, ·))
If observations are noisy:
yᵢ = f(xᵢ) + εᵢ, εᵢ ∼ 𝒩(0, σ²_noise)
Given data, the GP posterior at a test point x is Gaussian:
f(x) | 𝒟ₙ ∼ 𝒩( μₙ(x), σ²ₙ(x) )
Let X ∈ ℝⁿˣᵈ be the matrix of inputs, y ∈ ℝⁿ the vector of outputs, and define K ∈ ℝⁿˣⁿ with Kᵢⱼ = k(xᵢ, xⱼ) and k(x) = ( k(x, x₁), …, k(x, xₙ) )ᵀ.
With Gaussian noise, the effective covariance is K + σ²_noise I.
Then:
μₙ(x) = m(x) + k(x)ᵀ ( K + σ²_noise I )⁻¹ ( y − m(X) )
σ²ₙ(x) = k(x, x) − k(x)ᵀ ( K + σ²_noise I )⁻¹ k(x)
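These two formulas translate directly into numpy. A minimal sketch, assuming a zero prior mean m ≡ 0 and an RBF kernel; a Cholesky solve stands in for the explicit inverse:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel: variance * exp(-||a - b||^2 / (2 l^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(X, y, Xstar, noise=1e-4, lengthscale=1.0, variance=1.0):
    """Predictive mean and variance at test points Xstar, with prior mean m(x) = 0."""
    K = rbf_kernel(X, X, lengthscale, variance) + noise * np.eye(len(X))
    Ks = rbf_kernel(X, Xstar, lengthscale, variance)      # columns are k(x) at each test point
    Kss = rbf_kernel(Xstar, Xstar, lengthscale, variance)
    L = np.linalg.cholesky(K)                             # stable alternative to inverting K
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks.T @ alpha                                     # mu_n(x) = k(x)^T (K + s^2 I)^-1 y
    v = np.linalg.solve(L, Ks)
    var = np.diag(Kss) - (v ** 2).sum(0)                  # sigma^2_n(x)
    return mu, var
```

At observed points the posterior mean nearly interpolates y (for small noise), and far from all data μ reverts to the prior mean and σ² to the prior variance.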
A useful mental model: μₙ interpolates the data (exactly, in the noise-free case), while σₙ shrinks near observations and grows with distance from them.
The kernel encodes your belief about f.
Common kernels:
| Kernel | Typical assumption about f | Notes |
|---|---|---|
| RBF / squared exponential | Very smooth | Strong inductive bias; often works in low d |
| Matérn(ν=3/2, 5/2) | Rougher than RBF | Common in BO; more realistic for many objectives |
| ARD versions | Different lengthscales per dimension | Helps identify irrelevant dimensions |
Hyperparameters (lengthscales, variance, noise) are often learned by maximizing marginal likelihood.
GPs scale as 𝒪(n³) with n evaluations due to matrix inversion (though there are approximations). In high d or with many evaluations, you might use:
| Surrogate | Strength | Weakness |
|---|---|---|
| Random forest / TPE-style models | Works for mixed categorical/continuous; robust | Uncertainty is heuristic; less smooth |
| Bayesian neural nets | Flexible in high d | Harder to calibrate and fit; heavier compute |
| Linear models with Bayesian treatment | Fast | Too simple unless features are strong |
BO doesn’t require GPs; it requires predictive distributions (or at least μ and σ).
Many objectives are noisy: y = f(x) + ε.
Noise affects: the surrogate fit (σ²_noise must be estimated or set), the meaning of the “best observed value,” and how acquisitions behave near sampled points.
With high noise, the surrogate remains uncertain even near observed x, which can lead the acquisition to re-sample.
For stationary kernels (like RBF/Matérn), correlation decays with distance. So σ(x) tends to increase as x moves away from observed points.
This gives you a helpful picture: confident predictions near data, honest ignorance far from it.
This is precisely what BO needs: a map of promising regions and unknown regions.
Even with a perfect surrogate posterior, you still need a rule to decide the next experiment.
We want a(x) to be: cheap to evaluate, informed by both μ(x) and σ(x), and tractable to optimize over 𝒳.
The acquisition is a utility function computed from the predictive distribution at x.
Assume for simplicity (common in GPs):
f(x) | 𝒟ₙ ∼ 𝒩( μ(x), σ²(x) )
(We’ll omit subscript n for readability.)
If you are minimizing f, define the best observed value so far:
f_best = minᵢ yᵢ
Improvement at x is:
I(x) = max(0, f_best − f(x))
Because f(x) is random under the surrogate, I(x) is random. EI takes its expectation:
EI(x) = 𝔼[ I(x) ]
Let Z = (f_best − μ(x)) / σ(x) (for σ(x) > 0).
Let φ and Φ be the standard normal PDF and CDF.
Then:
EI(x) = (f_best − μ(x)) Φ(Z) + σ(x) φ(Z)
If σ → 0, EI tends to max(0, f_best − μ): pure exploitation.
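The closed form, including the σ → 0 limit, is a few lines of code. A sketch assuming scipy for Φ and φ:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """EI for minimization: (f_best - mu) * Phi(Z) + sigma * phi(Z), Z = (f_best - mu) / sigma."""
    mu = np.atleast_1d(np.asarray(mu, dtype=float))
    sigma = np.atleast_1d(np.asarray(sigma, dtype=float))
    ei = np.maximum(f_best - mu, 0.0)          # sigma -> 0 limit: pure exploitation
    m = sigma > 0
    z = (f_best - mu[m]) / sigma[m]
    ei[m] = (f_best - mu[m]) * norm.cdf(z) + sigma[m] * norm.pdf(z)
    return ei
```

Note that EI is always non-negative: a point with μ far above f_best but large σ still gets credit through the σ φ(Z) term.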
Prefer x where f(x) beats f_best with high probability:
PI(x) = 𝔼[ 1{ f(x) ≤ f_best − ξ } ]
(For minimization; ξ ≥ 0 is a margin encouraging exploration.)
With Gaussian predictive distribution:
PI(x) = Φ( (f_best − ξ − μ(x)) / σ(x) )
PI is simple but can over-exploit: it ignores how much improvement you get.
Construct an optimistic bound using uncertainty.
For maximization:
UCB(x) = μ(x) + κ σ(x)
For minimization (lower confidence bound):
LCB(x) = μ(x) − κ σ(x)
Choose x that maximizes UCB (or minimizes LCB). κ controls exploration: larger κ rewards uncertainty more (explore); smaller κ trusts the mean (exploit).
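Both PI and the confidence-bound rule are essentially one-liners. A sketch in the minimization convention, assuming scipy:

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best, xi=0.0):
    """PI(x) = Phi((f_best - xi - mu) / sigma) for minimization, with margin xi."""
    return norm.cdf((f_best - xi - np.asarray(mu)) / np.asarray(sigma))

def lcb(mu, sigma, kappa=2.0):
    """Lower confidence bound mu - kappa * sigma; pick the candidate minimizing it."""
    return np.asarray(mu) - kappa * np.asarray(sigma)
```

Because both depend only on (μ, σ), they work with any surrogate that exposes a Gaussian-style predictive summary.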
Instead of building a deterministic a(x), sample a function from the posterior and optimize it.
This naturally balances exploration and exploitation through posterior randomness.
| Acquisition | Strength | Weakness | Typical use |
|---|---|---|---|
| EI | Good default; intuitive trade-off | Sensitive to noise and best-so-far definition | General BO with moderate noise |
| PI | Simple, stable | Over-exploitation; needs ξ tuning | When you only care about probability of beating a target |
| UCB/LCB | Easy; strong theory | Needs κ schedule; may over-explore | When you want theoretical regret control |
| Thompson sampling | Simple decision rule; parallel-friendly | Requires posterior sampling; can be noisy | Batch BO and large-scale approximations |
Key subtlety: the acquisition a(x) is nonconvex even if the surrogate is smooth.
So BO typically nests two optimizations: an outer loop over expensive evaluations of f, and an inner loop that maximizes the cheap acquisition a(x).
Common inner-loop strategies: multi-start gradient ascent (e.g., L-BFGS from random restarts), evolutionary methods, or scoring a dense set of random/Sobol candidates.
Because evaluating a(x) is cheap, you can afford many restarts.
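One common inner-loop pattern is multi-start local optimization. A sketch using scipy’s L-BFGS-B; the helper name `maximize_acquisition` is illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def maximize_acquisition(acq, bounds, n_restarts=20, rng=None):
    """Multi-start L-BFGS-B on a (generally nonconvex) acquisition.

    acq maps a 1-D point array to a scalar; bounds is a list of (low, high) pairs.
    """
    if rng is None:
        rng = np.random.default_rng()
    lo = np.array([b[0] for b in bounds], dtype=float)
    hi = np.array([b[1] for b in bounds], dtype=float)
    best_x, best_val = None, -np.inf
    for _ in range(n_restarts):                  # a(x) is cheap, so restarts are affordable
        x0 = rng.uniform(lo, hi)
        res = minimize(lambda x: -acq(x), x0, method="L-BFGS-B", bounds=bounds)
        if -res.fun > best_val:
            best_x, best_val = res.x, -res.fun
    return best_x, best_val
```

Each restart only finds a local maximum; the restarts together approximate the global argmax of a(x).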
Sometimes you optimize f(x) subject to constraints cⱼ(x) ≤ 0.
A common idea: model constraints with surrogates too, and modify acquisition:
a_constrained(x) = a(x) · P(feasible | x)
This encourages points that are both promising and likely feasible.
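If the constraint c(x) is itself modeled with a Gaussian surrogate, the feasibility weight has a closed form. A sketch with EI as the base acquisition (minimization):

```python
import numpy as np
from scipy.stats import norm

def constrained_ei(mu_f, sigma_f, f_best, mu_c, sigma_c):
    """EI weighted by P(c(x) <= 0), assuming c(x) | D ~ N(mu_c, sigma_c^2)."""
    z = (f_best - mu_f) / sigma_f
    ei = (f_best - mu_f) * norm.cdf(z) + sigma_f * norm.pdf(z)
    p_feasible = norm.cdf(-mu_c / sigma_c)       # P(c(x) <= 0)
    return ei * p_feasible
```

A point predicted to violate the constraint (μ_c ≫ 0 with small σ_c) gets its acquisition driven toward zero regardless of how promising f looks there.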
Surrogates provide beliefs (μ, σ²). Acquisition functions convert beliefs into actions (which x to try next). The “Bayesian” part is not just uncertainty estimation—it’s decision-making under uncertainty.
Bayesian optimization is fundamentally adaptive experimentation.
A non-adaptive baseline would pre-select a grid or random set of points and evaluate all of them. BO instead uses each new y to redirect future evaluations. This adaptivity is what yields sample efficiency.
We’ll write it for minimization.
Choose n₀ initial points, often via: Latin hypercube sampling, Sobol sequences, or uniform random sampling.
You want broad coverage so early uncertainty is meaningful.
Given 𝒟ₙ, fit GP hyperparameters (or update posterior).
Compute μₙ(x), σ²ₙ(x) as needed.
Common defaults: a Matérn-5/2 kernel with ARD lengthscales, with hyperparameters learned by maximizing the marginal likelihood.
Compute:
xₙ₊₁ ∈ argmaxₓ aₙ(x)
In practice, you approximate this argmax with multi-start optimization.
Observe yₙ₊₁ = f(xₙ₊₁) + ε.
Update 𝒟ₙ₊₁ = 𝒟ₙ ∪ {(xₙ₊₁, yₙ₊₁)}.
Stop when budget is exhausted or improvement saturates.
Return:
x* = argmin_{(xᵢ, yᵢ) ∈ 𝒟} yᵢ
or return argmin μₙ(x) if you prefer posterior mean minimizer.
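Putting the whole algorithm together: a self-contained 1-D demo with a zero-mean RBF GP surrogate, output standardization, and an LCB acquisition scored on a dense grid (grid search stands in for the inner optimizer; all names and settings here are illustrative choices, not the only ones):

```python
import numpy as np

def rbf(a, b, ls=0.2):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_predict(X, y, Xs, noise=1e-6, ls=0.2):
    """Zero-mean GP posterior mean and standard deviation on a 1-D grid."""
    K = rbf(X, X, ls) + noise * np.eye(len(X))
    Ks = rbf(X, Xs, ls)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum("ij,ij->j", Ks, np.linalg.solve(K, Ks))
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def bo_minimize(f, n_init=3, n_iter=15, kappa=2.0, seed=0):
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, 201)
    X = rng.uniform(0.0, 1.0, n_init)                   # initial design
    y = np.array([f(x) for x in X])
    for _ in range(n_iter):
        ym, ys = y.mean(), y.std() + 1e-12              # standardize outputs for the GP
        mu, sd = gp_predict(X, (y - ym) / ys, grid)
        x_next = grid[int(np.argmin(mu - kappa * sd))]  # LCB rule
        X = np.append(X, x_next)
        y = np.append(y, f(x_next))
    i = int(np.argmin(y))
    return X[i], y[i]                                   # incumbent
```

Early iterations mostly explore (large σ away from data); later iterations cluster near the emerging minimum.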
BO is sensitive to geometry. If one coordinate spans [0, 10⁶] and another [0, 1], kernels can behave poorly.
Standard practice: rescale inputs to [0, 1]ᵈ (or standardize them) and standardize outputs to zero mean and unit variance.
Hyperparameter tuning often has: mixed continuous, integer, and categorical parameters, sometimes with conditional structure.
GPs struggle with pure categorical spaces unless encoded carefully.
Alternatives: one-hot or embedding encodings with suitable kernels, or surrogates that handle categories natively (random forests, TPE).
If you can evaluate B points in parallel, you want {x₁,…,x_B}.
Approaches: “fantasizing” outcomes of pending points, joint batch acquisitions (e.g., q-EI), or drawing B Thompson samples.
Batch BO is harder because points should be diverse, not redundant.
BO with standard GPs can degrade when d is large (say d ≥ 20–50).
Reasons: distances concentrate, so stationary kernels lose contrast; uncertainty stays high almost everywhere; and the inner acquisition optimization itself becomes hard.
Common mitigations:
| Strategy | Idea |
|---|---|
| ARD lengthscales | Detect irrelevant dimensions |
| Additive kernels | f(x) ≈ ∑ fⱼ(x_{Sⱼ}) |
| Random embeddings | Optimize in a low-dimensional subspace |
| Trust-region BO | Restrict search locally and expand/shrink |
Sometimes you can evaluate cheaper approximations: fewer training epochs, smaller datasets, or coarser simulation grids.
Multi-fidelity BO models f(x, s) where s is fidelity (cost). The acquisition trades off information vs cost.
You already know Bayesian inference: prior → posterior.
BO adds an action layer:
posterior → decision rule (acquisition) → new data → posterior
This is a small instance of Bayesian experimental design.
Use BO when: evaluations are expensive, gradients are unavailable or unreliable, the dimension is moderate, and the budget is tens to hundreds of evaluations.
Avoid or rethink BO when: evaluations are cheap (random or grid search may suffice), the dimension is very high without exploitable structure, or trustworthy gradients make gradient-based methods a better fit.
Think of BO as building a “map” of f: observations are landmarks, μ sketches the terrain, and σ is fog that thickens away from the landmarks.
With that model, most design decisions become: how should the fog behave, and what scouting strategy do you want?
You are minimizing f. You have observed best value so far f_best = 1.20.
Your surrogate gives predictive distributions at two candidate points: A with μ(A) = 1.10, σ(A) = 0.05, and B with μ(B) = 1.25, σ(B) = 0.30.
Assume Gaussian predictive distributions. Compute EI(A) and EI(B) and decide which point EI prefers.
For minimization, EI(x) = (f_best − μ(x)) Φ(Z) + σ(x) φ(Z), where Z = (f_best − μ(x)) / σ(x).
Point A:
Z_A = (1.20 − 1.10) / 0.05
= 0.10 / 0.05
= 2.0
Look up (or approximate): Φ(2.0) ≈ 0.97725 and φ(2.0) ≈ 0.05399.
Compute EI(A):
EI(A) = (1.20 − 1.10) Φ(2.0) + 0.05 φ(2.0)
= 0.10 · 0.97725 + 0.05 · 0.05399
= 0.097725 + 0.0026995
≈ 0.1004245
Point B:
Z_B = (1.20 − 1.25) / 0.30
= (−0.05) / 0.30
= −0.166666…
Approximate: Φ(−0.1667) ≈ 0.4338. Also φ(−0.1667) = φ(0.1667) ≈ 0.393 (since φ is symmetric).
Compute EI(B):
EI(B) = (1.20 − 1.25) Φ(−0.1667) + 0.30 φ(−0.1667)
= (−0.05) · 0.4338 + 0.30 · 0.393
= −0.02169 + 0.1179
≈ 0.09621
Insight: Even though B has worse predicted mean (μ(B) > f_best), its large uncertainty gives it a sizable exploration term σ φ(Z). EI still slightly prefers A here because A already looks better than f_best and is fairly certain. EI often picks points that are either clearly good (low μ) or highly uncertain (high σ), and the balance depends on Z.
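The hand calculation can be verified with scipy; the tiny difference in EI(B) (≈ 0.0962 above vs ≈ 0.0963 here) comes from the rounded φ ≈ 0.393 used in the hand computation:

```python
from scipy.stats import norm

f_best = 1.20

def ei_min(mu, sigma):
    # EI(x) = (f_best - mu) * Phi(Z) + sigma * phi(Z), for minimization
    z = (f_best - mu) / sigma
    return (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

ei_a = ei_min(1.10, 0.05)   # point A: ~0.10042
ei_b = ei_min(1.25, 0.30)   # point B: ~0.09634
```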
Assume you are maximizing f. Your surrogate gives f(x) | 𝒟 ∼ 𝒩( μ(x), σ²(x) ). Show how choosing x that maximizes μ(x) + κ σ(x) can be interpreted as optimizing an optimistic bound (a high quantile) of the predictive distribution.
For a Gaussian random variable, a (1 − α) quantile is:
q_{1−α}(x) = μ(x) + z_{1−α} σ(x)
where z_{1−α} is the standard normal quantile (e.g., z_{0.95} ≈ 1.645).
If we choose κ = z_{1−α}, then:
q_{1−α}(x) = μ(x) + κ σ(x).
Selecting x by maximizing UCB(x) = μ(x) + κ σ(x) is equivalent to:
x_{next} ∈ argmaxₓ q_{1−α}(x).
This is an explicit optimism strategy:
As κ increases, you act more optimistic about uncertain regions; as κ → 0, you recover pure greedy exploitation x = argmax μ(x).
Insight: UCB can be seen as “optimize a high-confidence optimistic scenario” of the function. This interpretation also explains why κ is a natural exploration knob: it literally selects which predictive quantile you optimize.
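The quantile reading can be checked directly; a short sketch assuming scipy:

```python
from scipy.stats import norm

alpha = 0.05
kappa = norm.ppf(1 - alpha)          # z_{0.95}, about 1.645

mu, sigma = 2.0, 0.5                 # an arbitrary Gaussian predictive distribution
ucb = mu + kappa * sigma             # optimistic bound with this kappa
quantile = norm.ppf(1 - alpha, loc=mu, scale=sigma)  # 95% predictive quantile
```

With κ = z_{1−α}, the two quantities coincide: UCB literally evaluates the (1 − α) quantile of the predictive distribution.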
You are minimizing f(x) over 𝒳 = [0, 1]. You currently have 3 observations:
(x, y) = (0.1, 0.80), (0.5, 0.30), (0.9, 0.60)
Your surrogate (already fit) provides the following summaries at candidate points: μ(0.2) = 0.55, σ(0.2) = 0.20; μ(0.4) = 0.35, σ(0.4) = 0.05; μ(0.7) = 0.40, σ(0.7) = 0.15.
Use LCB with κ = 2 to pick the next point.
For minimization, define:
LCB(x) = μ(x) − κ σ(x)
with κ = 2.
Compute LCB(0.2):
LCB = 0.55 − 2 · 0.20
= 0.55 − 0.40
= 0.15
Compute LCB(0.4):
LCB = 0.35 − 2 · 0.05
= 0.35 − 0.10
= 0.25
Compute LCB(0.7):
LCB = 0.40 − 2 · 0.15
= 0.40 − 0.30
= 0.10
Pick x that minimizes LCB:
min{0.15, 0.25, 0.10} occurs at x = 0.7.
So choose x_next = 0.7 to evaluate f.
After evaluating and observing y_next = f(0.7), you append (0.7, y_next) to 𝒟 and refit/update the surrogate, which will typically reduce σ near 0.7 and adjust μ nearby.
Insight: Even though x = 0.4 has the best predicted mean (μ = 0.35), its uncertainty is tiny, so LCB doesn’t see much chance of discovering something dramatically better. x = 0.7 has slightly worse μ but much larger σ, so the uncertainty bonus makes it attractive as a potentially better region.
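The same arithmetic in plain Python, with the values copied from the example:

```python
mus = {0.2: 0.55, 0.4: 0.35, 0.7: 0.40}
sds = {0.2: 0.20, 0.4: 0.05, 0.7: 0.15}
kappa = 2.0

# LCB(x) = mu(x) - kappa * sigma(x); pick the candidate minimizing it
lcb = {x: mus[x] - kappa * sds[x] for x in mus}
x_next = min(lcb, key=lcb.get)
```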
Bayesian optimization is designed for expensive, black-box objectives where you must be sample-efficient.
The surrogate’s predictive mean μ(x) estimates performance; predictive variance σ²(x) encodes epistemic uncertainty about f(x).
Acquisition functions are decision rules that trade off exploitation (good μ) and exploration (large σ).
Expected Improvement (EI) quantifies expected gain over the current best and has a closed form under Gaussian predictions.
UCB/LCB adds an uncertainty bonus/penalty μ ± κσ and can be interpreted as optimizing a predictive quantile.
BO is a closed loop: fit surrogate → optimize acquisition → evaluate f → update surrogate; the inner acquisition optimization is itself nonconvex.
Initialization, scaling, noise modeling, and acquisition optimization quality often matter as much as the choice of acquisition function.
In high dimensions or mixed discrete/continuous spaces, standard GP-based BO may need structural assumptions (ARD/additivity/trust regions) or different surrogates.
Treating σ(x) as observation noise instead of model uncertainty: σ reflects what you don’t know about f, not just randomness in y.
Poor acquisition optimization (too few restarts, bad bounds) leading to selecting suboptimal xₙ₊₁ and undercutting BO’s benefits.
Ignoring input scaling/warping: kernels and lengthscales become ill-conditioned, making μ and σ unreliable.
Using EI/PI with noisy observations without adjusting the “best so far” concept, causing the acquisition to chase noise artifacts.
You are minimizing f. Current best observed value is f_best = 0.50. For a candidate x, your surrogate gives μ(x) = 0.55 and σ(x) = 0.10. Compute PI(x) with margin ξ = 0.02, i.e., PI = Φ((f_best − ξ − μ)/σ).
Hint: Compute Z = (0.50 − 0.02 − 0.55)/0.10, then evaluate Φ(Z).
Z = (0.48 − 0.55)/0.10 = (−0.07)/0.10 = −0.7.
So PI = Φ(−0.7) ≈ 0.242 (since Φ(0.7) ≈ 0.758).
For minimization using LCB with κ = 1.5, compare two candidates:
A: μ = 0.20, σ = 0.01
B: μ = 0.23, σ = 0.05
Which does LCB choose, and why?
Hint: Compute LCB = μ − κσ for each; smaller is preferred.
LCB(A) = 0.20 − 1.5·0.01 = 0.20 − 0.015 = 0.185.
LCB(B) = 0.23 − 1.5·0.05 = 0.23 − 0.075 = 0.155.
LCB chooses B because its larger uncertainty creates a bigger exploration benefit, yielding a lower optimistic bound despite a worse mean.
Suppose a GP surrogate uses an RBF kernel with very small lengthscale ℓ in every dimension. Qualitatively, what happens to μ(x) and σ(x) away from observed points, and how might that affect BO behavior?
Hint: Think about k(x, x′) decaying with distance relative to ℓ; small ℓ means correlation drops quickly.
With very small ℓ, k(x, x′) decays extremely fast, so points are only correlated in tiny neighborhoods. Away from observed points, k(x) ≈ 0, so μ(x) reverts toward the prior mean m(x) and σ²(x) approaches the prior variance k(x, x). As a result, much of the domain looks highly uncertain, and acquisitions like UCB/EI may over-explore broadly rather than leveraging smooth generalization. BO may behave like near-random exploration unless enough data densely covers the space.
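This effect is easy to demonstrate numerically; a sketch with a zero-mean, unit-prior-variance RBF GP, comparing posterior variance at a point between observations for a tiny vs a broad lengthscale:

```python
import numpy as np

def rbf(a, b, ls):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_var_at(x, X, ls, noise=1e-8):
    """Posterior variance at a scalar test point x (prior variance 1, zero mean)."""
    K = rbf(X, X, ls) + noise * np.eye(len(X))
    k = rbf(X, np.array([x]), ls)
    return float(1.0 - k.T @ np.linalg.solve(K, k))

X = np.array([0.2, 0.5, 0.8])                 # observed inputs
var_small_ls = gp_var_at(0.35, X, ls=0.01)    # tiny lengthscale: near prior variance
var_large_ls = gp_var_at(0.35, X, ls=0.5)     # broad lengthscale: variance stays low
```

With ℓ = 0.01 the test point is effectively uncorrelated with every observation, so the posterior variance reverts to the prior; with ℓ = 0.5 the observations generalize and the variance collapses.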
Prerequisites: Bayesian Inference, Convex Optimization
Related next nodes and supporting ideas: