Finding parameters that maximize probability of observed data.
You’ve collected data. You believe it came from a distribution with some unknown parameter θ. Maximum Likelihood Estimation (MLE) is the idea of choosing θ so that, under your model, the data you actually saw would be as probable as possible.
Fix the observed data and view the joint pmf/pdf as a function of θ: L(θ). The MLE is θ̂ = argmaxθ L(θ). In practice we maximize the log-likelihood ℓ(θ) = log L(θ), and interior optima satisfy the score equation ∇θ ℓ(θ) = 0 (plus a second-order/endpoint check).
In statistics and machine learning, we often start with a model family—a distribution we think could plausibly generate our data, but with unknown parameter(s). Examples:
- Coin flips: Bernoulli(p) with unknown success probability p
- Event counts: Poisson(λ) with unknown rate λ
- Measurements: Normal(μ, σ²) with unknown mean and/or variance
You then observe data: x₁, x₂, …, xₙ. The central question is:
Which parameter value θ makes these observations most plausible under the model?
MLE answers: choose the θ that maximizes the probability (for discrete) or density (for continuous) of what you observed.
This is the most important mental switch: the data are now fixed, and the parameter is the variable.
Formally, suppose the data are generated i.i.d. from a distribution with pmf/pdf f(x | θ). After you observe x₁, …, xₙ, define the likelihood function
L(θ) = ∏ᵢ f(xᵢ | θ)
This is not “the probability of θ”. It is a function that scores different θ values by how well they explain the observed data.
Imagine coin flips: xᵢ ∈ {0, 1}, where 1 = heads with probability p. If you see 9 heads out of 10 flips, then:

L(p) = p⁹(1 − p)
The MLE picks the p that yields the highest L(p).
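This can be checked numerically. A minimal sketch of the coin example (the grid search is just for illustration):

```python
# Hypothetical coin data: 9 heads in 10 flips.
# L(p) = p^9 * (1 - p)^1; scan a grid of p values and pick the best.
def likelihood(p, heads=9, flips=10):
    return p ** heads * (1 - p) ** (flips - heads)

grid = [i / 100 for i in range(1, 100)]   # p = 0.01, 0.02, ..., 0.99
best = max(grid, key=likelihood)
print(best)  # 0.9: the observed frequency of heads
```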
The maximum likelihood estimator (MLE) is
θ̂ = argmaxθ L(θ)
Often θ is a scalar, but in many ML models θ is a vector of parameters (e.g., weights). Then we write
θ̂ = argmax_{θ} L(θ)
L(θ) is a product of many terms. Products can be:
- numerically unstable (for large n they underflow to 0 in floating point)
- awkward to differentiate (the product rule compounds across terms)
Because log is strictly increasing, maximizing L(θ) is equivalent to maximizing the log-likelihood:
ℓ(θ) = log L(θ) = log(∏ᵢ f(xᵢ | θ))
Use log rules:
ℓ(θ)
= log(∏ᵢ f(xᵢ | θ))
= ∑ᵢ log f(xᵢ | θ)
So MLE becomes:
θ̂ = argmaxθ ℓ(θ)
This “sum of per-example contributions” is one reason likelihood-based methods scale well to large datasets.
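The numerical side of this is easy to demonstrate: raw products of per-example likelihood terms underflow, while sums of logs stay well-scaled. A small sketch (the per-example values are assumed toy numbers):

```python
import math

# 200 per-example likelihood terms, each 0.01 (an assumed toy value)
probs = [0.01] * 200

product = 1.0
for p in probs:
    product *= p          # 0.01**200 = 1e-400 underflows float64
print(product)            # 0.0

log_lik = sum(math.log(p) for p in probs)
print(log_lik)            # about -921.03, perfectly representable
```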
MLE is only as good as the model family f(x | θ). If the model is wrong (e.g., assuming normality for heavy-tailed data), the MLE still returns the best-fitting θ within that family—but it may not be a good description of reality. This is not a flaw of calculus; it’s the consequence of the assumptions.
To write down a likelihood, you need a story for how the data are generated. The story is the distribution f(x | θ).
Typical assumptions:
- Independence: observations do not influence one another.
- Identical distribution: every xᵢ is drawn from the same f(x | θ).
These are simplifying assumptions, but they give the clean factorization:
L(θ) = f(x₁, …, xₙ | θ) = ∏ᵢ f(xᵢ | θ)
If independence is not justified (time series, spatial data), the likelihood changes form—but the MLE principle is the same.
A common surprise: densities can exceed 1, so L(θ) can exceed 1. That’s fine. Only probabilities must be ≤ 1.
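A quick illustration with a narrow Normal pdf (the numbers are arbitrary):

```python
import math

# A Normal pdf with small sigma exceeds 1 at its mode:
# densities integrate to 1 but need not be <= 1 pointwise.
def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

d = normal_pdf(0.0, 0.0, 0.1)
print(d)  # about 3.99, a density greater than 1
```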
With i.i.d. data:
ℓ(θ) = ∑ᵢ log f(xᵢ | θ)
This gives you a sum of per-example terms: easy to differentiate term by term, and cheap to evaluate in a single pass over the data.
In machine learning, we often minimize negative log-likelihood (NLL):
NLL(θ) = −ℓ(θ) = −∑ᵢ log f(xᵢ | θ)
| Object | Notation | Data treated as | Parameter treated as | Typical use |
|---|---|---|---|---|
| pmf/pdf | f(x \| θ) | random | fixed | forward probabilities/densities |
| likelihood | L(θ) = ∏ᵢ f(xᵢ \| θ) | fixed (observed) | variable | estimation |
| log-likelihood | ℓ(θ) = ∑ᵢ log f(xᵢ \| θ) | fixed | variable | calculus/optimization |
| negative log-likelihood | −ℓ(θ) | fixed | variable | ML loss minimization |
If θ is a vector in ℝᵈ, then the score is the gradient ∇θ ℓ(θ) ∈ ℝᵈ, and the second-order check involves the Hessian matrix.
The geometry matters: you’re maximizing a surface over ℝᵈ, not a curve over ℝ.
Once you have ℓ(θ), estimation becomes an optimization problem. For many classical distributions, you can solve it analytically by taking derivatives and setting them to zero.
The key idea: an interior maximum of a differentiable function has derivative 0.
Define the score as the derivative (or gradient) of the log-likelihood:

s(θ) = dℓ/dθ (scalar case), s(θ) = ∇θ ℓ(θ) (vector case)
First-order condition (FOC) for an interior optimum:
s(θ̂) = 0
(or ∇θ ℓ(θ̂) = 0 in the vector case)
This gives candidate solutions.
Setting the derivative to zero finds critical points: maxima, minima, or saddle points.
Sometimes the maximum occurs at the boundary of the parameter space.
Example: Bernoulli p must satisfy 0 ≤ p ≤ 1. If all outcomes are 1, the likelihood increases as p → 1, so the MLE is p̂ = 1 (a boundary point). In that case, the derivative-based interior condition may not apply.
Because log turns products into sums, differentiation typically yields sums you can simplify.
A pattern you’ll see repeatedly: log turns powers θᵃ into multiples a log θ, whose derivative a/θ produces simple ratio equations that are easy to solve.
In many ML models (logistic regression, neural nets), ℓ(θ) is differentiable but does not yield an algebraic closed-form solution.
Then MLE becomes numerical optimization:
- gradient ascent on ℓ(θ) (equivalently, gradient descent on the NLL), often in stochastic/minibatch form for large datasets
- second-order methods (Newton, quasi-Newton) that also use curvature
Even when we can’t solve the score equation analytically, the score still guides algorithms.
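As an illustration of score-guided iteration, here is gradient ascent on a Poisson log-likelihood with assumed toy counts. (This case actually has a closed form, the sample mean, derived later in this node, which lets us check the iterative answer.)

```python
# Gradient ascent on a Poisson log-likelihood (toy counts are assumptions).
# Score: dl/dlambda = -n + (sum of x_i) / lambda
data = [2, 3, 1, 4, 2, 3]
n, s = len(data), sum(data)

lam = 1.0        # starting guess
lr = 0.05        # step size
for _ in range(2000):
    score = -n + s / lam
    lam += lr * score     # move uphill along the score

print(lam)       # converges to the sample mean, 2.5
```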
Maximizing log-likelihood is equivalent to minimizing average surprise:
(1/n)NLL(θ) = −(1/n)∑ᵢ log f(xᵢ | θ)
So MLE chooses the parameter that makes the observed data as unsurprising as possible under the model.
A large fraction of “standard losses” in machine learning are just negative log-likelihoods for some probabilistic model.
So MLE isn’t just a statistics technique—it’s a unifying design principle for objective functions.
Assume:
yᵢ = μ(xᵢ; w) + εᵢ, εᵢ ∼ Normal(0, σ²)
Then:
f(yᵢ | xᵢ, w) = (1/√(2πσ²)) exp(−(yᵢ − μ(xᵢ; w))² / (2σ²))
Log-likelihood (dropping constants not depending on w):
ℓ(w)
= ∑ᵢ [ −(yᵢ − μ(xᵢ; w))² / (2σ²) ] + const
Maximizing ℓ(w) ⇔ minimizing ∑ᵢ (yᵢ − μ(xᵢ; w))²
That is least squares.
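A numeric check with assumed toy data: the Gaussian NLL and the sum of squared errors differ only by constants and a positive scale, so they pick the same w:

```python
import math

# Toy 1-D regression data (assumptions) for the model y = w * x + noise
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 1.1, 1.9, 3.2]
sigma2 = 1.0

def sse(w):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))

def nll(w):  # Gaussian negative log-likelihood, constants included
    return sum(0.5 * math.log(2 * math.pi * sigma2)
               + (y - w * x) ** 2 / (2 * sigma2) for x, y in zip(xs, ys))

grid = [i / 1000 for i in range(500, 1500)]
w_sse = min(grid, key=sse)
w_nll = min(grid, key=nll)
print(w_sse == w_nll)  # True: identical minimizer
```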
If yᵢ ∈ {0,1} and model predicts pᵢ = σ(wᵀxᵢ), then
f(yᵢ | xᵢ, w) = pᵢ^{yᵢ} (1 − pᵢ)^{1−yᵢ}
NLL is
−ℓ(w) = −∑ᵢ [ yᵢ log pᵢ + (1−yᵢ) log(1−pᵢ) ]
That is binary cross-entropy.
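Computed directly (the labels and predicted probabilities below are made up):

```python
import math

# Binary cross-entropy as the Bernoulli negative log-likelihood
ys = [1, 0, 1, 1]           # assumed labels
ps = [0.9, 0.2, 0.7, 0.6]   # assumed model probabilities p_i

nll = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
           for y, p in zip(ys, ps))
print(nll)  # about 1.196
```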
MLE is popular because, under regularity conditions and for large n:
- it is consistent (θ̂ converges to the true θ)
- it is asymptotically normal (enabling confidence intervals)
- it is asymptotically efficient (it attains the Cramér–Rao lower bound)
You don’t need these proofs yet, but they explain why MLE is often the default.
MLE chooses a single best θ.
Bayesian inference treats θ as random and updates a prior p(θ) to a posterior:
p(θ | data) ∝ p(data | θ) p(θ)
Notice p(data | θ) is exactly the likelihood (up to notation). So MLE and Bayes share the same core ingredient; Bayes adds a prior.
In many settings, maximizing expected log-likelihood is equivalent to minimizing KL divergence between the true data-generating distribution and your model family. This is one reason KL divergence shows up everywhere in ML.
Once you have θ̂, you often want uncertainty. Many confidence interval methods start from the curvature of ℓ(θ) near θ̂ (observed Fisher information / Hessian). So MLE is a gateway to inference, not just point estimation.
Let x₁,…,xₙ be i.i.d. Bernoulli(p), where xᵢ ∈ {0,1}. We observe k = ∑ᵢ xᵢ heads (1s). Find the MLE p̂.
Write the pmf for one observation:
f(xᵢ | p) = p^{xᵢ}(1−p)^{1−xᵢ}
Write the likelihood (independence ⇒ product):
L(p) = ∏ᵢ p^{xᵢ}(1−p)^{1−xᵢ}
Collect exponents using ∑ᵢ xᵢ = k and ∑ᵢ (1−xᵢ) = n−k:
L(p) = p^k (1−p)^{n−k}
Take logs to simplify:
ℓ(p) = log L(p)
= log(p^k (1−p)^{n−k})
= k log p + (n−k) log(1−p)
Differentiate (score) and set to zero (interior solution):
dℓ/dp = k·(1/p) + (n−k)·(−1/(1−p))
= k/p − (n−k)/(1−p)
Set dℓ/dp = 0:
k/p = (n−k)/(1−p)
Solve for p:
k(1−p) = p(n−k)
k − kp = pn − pk
k = pn
p̂ = k/n
Check it is a maximum (second derivative):
d²ℓ/dp² = −k/p² − (n−k)/(1−p)² < 0 for p ∈ (0,1)
So the critical point is a (strict) local maximum.
Insight: For Bernoulli data, the MLE equals the sample mean: p̂ = (1/n)∑ᵢ xᵢ. This is a recurring theme: MLE often matches intuitive “frequency” estimators.
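A quick numeric confirmation of the derivation, with assumed counts:

```python
import math

# Maximize l(p) = k log p + (n - k) log(1 - p) on a grid; compare to k/n.
k, n = 37, 100   # assumed data: 37 successes in 100 trials

def loglik(p):
    return k * math.log(p) + (n - k) * math.log(1 - p)

grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=loglik)
print(p_hat)  # 0.37, which is exactly k/n
```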
Let x₁,…,xₙ be i.i.d. Poisson(λ), with λ > 0. Find the MLE λ̂.
Write the pmf for one observation:
f(xᵢ | λ) = e^{−λ} λ^{xᵢ} / xᵢ!
Likelihood:
L(λ) = ∏ᵢ [ e^{−λ} λ^{xᵢ} / xᵢ! ]
Simplify the product:
L(λ) = (∏ᵢ e^{−λ}) (∏ᵢ λ^{xᵢ}) / (∏ᵢ xᵢ!)
= e^{−nλ} λ^{∑ᵢ xᵢ} / (∏ᵢ xᵢ!)
Log-likelihood (dropping constants that do not depend on λ):
ℓ(λ) = log L(λ)
= (−nλ) + (∑ᵢ xᵢ) log λ − ∑ᵢ log(xᵢ!)
Differentiate and set to zero:
dℓ/dλ = −n + (∑ᵢ xᵢ)/λ
Set dℓ/dλ = 0:
−n + (∑ᵢ xᵢ)/λ = 0
(∑ᵢ xᵢ)/λ = n
Solve:
λ̂ = (1/n)∑ᵢ xᵢ
Second derivative check:
d²ℓ/dλ² = −(∑ᵢ xᵢ)/λ² < 0 for λ > 0 (assuming not all xᵢ are 0)
So it’s a maximum.
Insight: Again, the MLE matches the sample mean. For Poisson, the mean equals λ, so the MLE is the natural plug-in estimator.
Let x₁,…,xₙ be i.i.d. Normal(μ, σ²). Assume μ is known. Find the MLE for σ².
Write the pdf:
f(xᵢ | σ²) = (1/√(2πσ²)) exp(−(xᵢ−μ)² / (2σ²))
Likelihood:
L(σ²) = ∏ᵢ (1/√(2πσ²)) exp(−(xᵢ−μ)² / (2σ²))
Log-likelihood:
ℓ(σ²)
= ∑ᵢ [ −(1/2)log(2πσ²) − (xᵢ−μ)²/(2σ²) ]
= −(n/2)log(2πσ²) − (1/(2σ²))∑ᵢ (xᵢ−μ)²
Differentiate w.r.t. σ²:
dℓ/d(σ²)
= −(n/2)·(1/σ²) + (1/2)(∑ᵢ (xᵢ−μ)²)·(1/(σ²)²)
Explanation: derivative of −(1/(2σ²))S is +(1/2)S·(1/(σ²)²), where S = ∑ᵢ (xᵢ−μ)²
Set derivative to zero:
−(n/2)(1/σ²) + (1/2)S(1/(σ²)²) = 0
Multiply both sides by 2(σ²)² to clear fractions:
−nσ² + S = 0
Solve:
σ̂²_MLE = S/n = (1/n)∑ᵢ (xᵢ−μ)²
Second derivative check (sketch): curvature is negative at the solution for σ² > 0, giving a maximum.
Insight: The MLE for σ² uses 1/n, not 1/(n−1). The 1/(n−1) version is the unbiased sample variance; MLE prioritizes likelihood maximization, not unbiasedness.
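The 1/n vs 1/(n−1) distinction in code, with assumed data and known μ:

```python
# MLE for sigma^2 with known mu divides by n, not n - 1 (data are assumptions).
xs = [4.1, 5.0, 3.8, 5.2, 4.9]
mu = 4.5

S = sum((x - mu) ** 2 for x in xs)
sigma2_mle = S / len(xs)          # 1/n version: the MLE
sigma2_unb = S / (len(xs) - 1)    # 1/(n-1) version: unbiased, not the MLE
print(sigma2_mle, sigma2_unb)     # the MLE is the smaller of the two
```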
The likelihood L(θ) is the joint pmf/pdf of the observed data, viewed as a function of θ with the data fixed.
The MLE is θ̂ = argmaxθ L(θ); in practice we maximize ℓ(θ) = log L(θ) because it turns products into sums.
For i.i.d. data, ℓ(θ) = ∑ᵢ log f(xᵢ | θ), which is computationally and conceptually convenient.
Interior optima satisfy the score equation: ∇θ ℓ(θ̂) = 0; then you must verify it’s a maximum (curvature) or check boundaries.
Many familiar estimators are MLEs (e.g., Bernoulli p̂ = sample mean; Poisson λ̂ = sample mean).
Many ML loss functions are negative log-likelihoods (cross-entropy, squared error under Gaussian noise).
MLE depends on the assumed model family; it returns the best fit within that family, even if the family is misspecified.
Treating L(θ) as a probability distribution over θ (it is not); only in Bayesian inference do we form p(θ | data).
Forgetting parameter constraints (e.g., p ∈ [0,1], σ² > 0) and missing boundary maxima.
Setting the score to zero and stopping—without checking whether the critical point is a maximum (second derivative/Hessian) or whether multiple maxima exist.
Confusing the MLE variance formula (divide by n) with the unbiased sample variance (divide by n−1).
Uniform(0, θ) MLE: Suppose x₁,…,xₙ are i.i.d. Uniform(0, θ) with θ > 0. Derive the MLE θ̂.
Hint: Write f(x|θ) = 1/θ for 0 ≤ x ≤ θ, and 0 otherwise. The likelihood is zero if θ is smaller than any observed value.
For one observation: f(xᵢ|θ) = 1/θ if 0 ≤ xᵢ ≤ θ, else 0.
Likelihood:
L(θ) = ∏ᵢ (1/θ) · 𝟙{xᵢ ≤ θ}
= θ^{−n} · 𝟙{maxᵢ xᵢ ≤ θ}.
If θ < max xᵢ, then L(θ)=0. For θ ≥ m where m = max xᵢ, L(θ)=θ^{−n}, which decreases as θ increases.
So the maximum occurs at the smallest feasible θ, i.e. θ̂ = maxᵢ xᵢ.
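A grid check of the boundary argument, with assumed observations:

```python
# Uniform(0, theta): L(theta) = theta**(-n) for theta >= max(x), else 0.
xs = [0.8, 2.3, 1.1, 1.9]   # assumed observations
n = len(xs)

def likelihood(theta):
    return 0.0 if theta < max(xs) else theta ** (-n)

grid = [i / 100 for i in range(1, 501)]   # theta = 0.01 ... 5.00
theta_hat = max(grid, key=likelihood)
print(theta_hat)  # 2.3, the sample maximum
```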
Normal(μ, σ²) MLE for μ when σ² is known: Given x₁,…,xₙ i.i.d. Normal(μ, σ²) with σ² known, derive μ̂.
Hint: Write ℓ(μ) and differentiate. The exponent contains ∑ᵢ (xᵢ−μ)².
Log-likelihood (dropping constants not involving μ):
ℓ(μ) = −(1/(2σ²))∑ᵢ (xᵢ−μ)².
Differentiate:
dℓ/dμ = −(1/(2σ²))∑ᵢ 2(xᵢ−μ)(−1)
= (1/σ²)∑ᵢ (xᵢ−μ)
Set to zero:
∑ᵢ (xᵢ−μ) = 0
∑ᵢ xᵢ − nμ = 0
μ̂ = (1/n)∑ᵢ xᵢ.
Second derivative is −n/σ² < 0, so it’s a maximum.
Boundary case for Bernoulli: You observe x₁,…,xₙ all equal to 1 (all successes). What is the MLE for p? Explain why the score equation approach can be misleading here.
Hint: Write ℓ(p) = n log p when k = n, and remember p must be in [0,1].
If all xᵢ = 1, then k = n.
Likelihood: L(p)=p^n.
Log-likelihood: ℓ(p)=n log p, which increases as p increases on (0,1].
Thus the MLE is the boundary point p̂ = 1.
Why score can mislead: dℓ/dp = n/p, which never equals 0 for p ∈ (0,1]. The maximum is not an interior critical point; it occurs at the boundary, so the score=0 condition does not apply.
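The monotone climb toward the boundary is easy to see numerically (n below is an assumed sample size):

```python
import math

# With k = n (all successes), l(p) = n log p increases all the way to p = 1:
# the maximum sits on the boundary, never where the score equals zero.
n = 10
ps = [0.5, 0.9, 0.99, 0.999]
vals = [n * math.log(p) for p in ps]
print(vals)  # strictly increasing toward 0 as p -> 1
```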