Finding parameters that maximize probability of observed data.
You’ve collected data. You believe it came from a distribution with some unknown parameter θ. Maximum Likelihood Estimation (MLE) is the idea of choosing θ so that, under your model, the data you actually saw would be as probable as possible.
Fix the observed data and view the joint pmf/pdf as a function of θ: L(θ). The MLE is θ̂ = argmaxθ L(θ). In practice we maximize the log-likelihood ℓ(θ) = log L(θ), and interior optima satisfy the score equation ∇θ ℓ(θ) = 0 (plus a second-order/endpoint check).
In statistics and machine learning, we often start with a model family—a distribution we think could plausibly generate our data, but with unknown parameter(s). Examples:
- Coin flips: Bernoulli(p) with unknown success probability p
- Event counts: Poisson(λ) with unknown rate λ
- Measurements: Normal(μ, σ²) with unknown mean and/or variance
You then observe data: x₁, x₂, …, xₙ. The central question is:
Which parameter value θ makes these observations most plausible under the model?
MLE answers: choose the θ that maximizes the probability (for discrete) or density (for continuous) of what you observed.
This is the most important mental switch: the data are now fixed, and the parameter is the variable.
Formally, suppose the data are generated i.i.d. from a distribution with pmf/pdf f(x | θ). After you observe x₁, …, xₙ, define the likelihood function
L(θ) = ∏ᵢ f(xᵢ | θ)
This is not “the probability of θ”. It is a function that scores different θ values by how well they explain the observed data.
Imagine coin flips: xᵢ ∈ {0, 1}, where 1 = heads with probability p. If you see 9 heads out of 10 flips, then:

L(p) = p⁹(1 − p)
The MLE picks the p that yields the highest L(p).
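This can be checked numerically. A minimal sketch of the coin example (the grid search is just for illustration):

```python
# Hypothetical coin data: 9 heads in 10 flips.
# L(p) = p^9 * (1 - p)^1; scan a grid of p values and pick the best.
def likelihood(p, heads=9, flips=10):
    return p ** heads * (1 - p) ** (flips - heads)

grid = [i / 100 for i in range(1, 100)]   # p = 0.01, 0.02, ..., 0.99
best = max(grid, key=likelihood)
print(best)  # 0.9: the observed frequency of heads
```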
The maximum likelihood estimator (MLE) is
θ̂ = argmaxθ L(θ)
Often θ is a scalar, but in many ML models θ is a vector of parameters (e.g., weights). Then we write
θ̂ = argmax_{θ} L(θ)
L(θ) is a product of many terms. Products can be:
- numerically unstable (for large n they underflow to 0 in floating point)
- awkward to differentiate (the product rule compounds across terms)
Because log is strictly increasing, maximizing L(θ) is equivalent to maximizing the log-likelihood:
ℓ(θ) = log L(θ) = log(∏ᵢ f(xᵢ | θ))
Use log rules:
ℓ(θ)
= log(∏ᵢ f(xᵢ | θ))
= ∑ᵢ log f(xᵢ | θ)
So MLE becomes:
θ̂ = argmaxθ ℓ(θ)
This “sum of per-example contributions” is one reason likelihood-based methods scale well to large datasets.
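The numerical side of this is easy to demonstrate: raw products of per-example likelihood terms underflow, while sums of logs stay well-scaled. A small sketch (the per-example values are assumed toy numbers):

```python
import math

# 200 per-example likelihood terms, each 0.01 (an assumed toy value)
probs = [0.01] * 200

product = 1.0
for p in probs:
    product *= p          # 0.01**200 = 1e-400 underflows float64
print(product)            # 0.0

log_lik = sum(math.log(p) for p in probs)
print(log_lik)            # about -921.03, perfectly representable
```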
MLE is only as good as the model family f(x | θ). If the model is wrong (e.g., assuming normality for heavy-tailed data), the MLE still returns the best-fitting θ within that family—but it may not be a good description of reality. This is not a flaw of calculus; it’s the consequence of the assumptions.
To write down a likelihood, you need a story for how the data are generated. The story is the distribution f(x | θ).
Typical assumptions:
- Independence: observations do not influence one another.
- Identical distribution: every xᵢ is drawn from the same f(x | θ).
These are simplifying assumptions, but they give the clean factorization:
L(θ) = f(x₁, …, xₙ | θ) = ∏ᵢ f(xᵢ | θ)
If independence is not justified (time series, spatial data), the likelihood changes form—but the MLE principle is the same.
A common surprise: densities can exceed 1, so L(θ) can exceed 1. That’s fine. Only probabilities must be ≤ 1.
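A quick illustration with a narrow Normal pdf (the numbers are arbitrary):

```python
import math

# A Normal pdf with small sigma exceeds 1 at its mode:
# densities integrate to 1 but need not be <= 1 pointwise.
def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

d = normal_pdf(0.0, 0.0, 0.1)
print(d)  # about 3.99, a density greater than 1
```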
With i.i.d. data:
ℓ(θ) = ∑ᵢ log f(xᵢ | θ)
This gives you a sum of per-example terms: easy to differentiate term by term, and cheap to evaluate in a single pass over the data.
In machine learning, we often minimize negative log-likelihood (NLL):
NLL(θ) = −ℓ(θ) = −∑ᵢ log f(xᵢ | θ)
| Object | Notation | Data treated as | Parameter treated as | Typical use |
|---|---|---|---|---|
| pmf/pdf | f(x \| θ) | random | fixed | forward probabilities/densities |
| likelihood | L(θ) = ∏ᵢ f(xᵢ \| θ) | fixed (observed) | variable | estimation |
| log-likelihood | ℓ(θ) = ∑ᵢ log f(xᵢ \| θ) | fixed | variable | calculus/optimization |
| negative log-likelihood | −ℓ(θ) | fixed | variable | ML loss minimization |
If θ is a vector in ℝᵈ, then the score is the gradient ∇θ ℓ(θ) ∈ ℝᵈ, and the second-order check involves the Hessian matrix.
The geometry matters: you’re maximizing a surface over ℝᵈ, not a curve over ℝ.
Once you have ℓ(θ), estimation becomes an optimization problem. For many classical distributions, you can solve it analytically by taking derivatives and setting them to zero.
The key idea: an interior maximum of a differentiable function has derivative 0.
Define the score as the derivative (or gradient) of the log-likelihood:

s(θ) = dℓ/dθ (scalar case), s(θ) = ∇θ ℓ(θ) (vector case)
First-order condition (FOC) for an interior optimum:
s(θ̂) = 0
(or ∇θ ℓ(θ̂) = 0 in the vector case)
This gives candidate solutions.
Setting the derivative to zero finds critical points: maxima, minima, or saddle points.
Sometimes the maximum occurs at the boundary of the parameter space.
Example: Bernoulli p must satisfy 0 ≤ p ≤ 1. If all outcomes are 1, the likelihood increases as p → 1, so the MLE is p̂ = 1 (a boundary point). In that case, the derivative-based interior condition may not apply.
Because log turns products into sums, differentiation typically yields sums you can simplify.
A pattern you’ll see repeatedly: log turns powers θᵃ into multiples a log θ, whose derivative a/θ produces simple ratio equations that are easy to solve.
In many ML models (logistic regression, neural nets), ℓ(θ) is differentiable but does not yield an algebraic closed-form solution.
Then MLE becomes numerical optimization:
- gradient ascent on ℓ(θ) (equivalently, gradient descent on the NLL), often in stochastic/minibatch form for large datasets
- second-order methods (Newton, quasi-Newton) that also use curvature
Even when we can’t solve the score equation analytically, the score still guides algorithms.
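As an illustration of score-guided iteration, here is gradient ascent on a Poisson log-likelihood with assumed toy counts. (This case actually has a closed form, the sample mean, derived later in this node, which lets us check the iterative answer.)

```python
# Gradient ascent on a Poisson log-likelihood (toy counts are assumptions).
# Score: dl/dlambda = -n + (sum of x_i) / lambda
data = [2, 3, 1, 4, 2, 3]
n, s = len(data), sum(data)

lam = 1.0        # starting guess
lr = 0.05        # step size
for _ in range(2000):
    score = -n + s / lam
    lam += lr * score     # move uphill along the score

print(lam)       # converges to the sample mean, 2.5
```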
Maximizing log-likelihood is equivalent to minimizing average surprise:
(1/n)NLL(θ) = −(1/n)∑ᵢ log f(xᵢ | θ)
So MLE chooses the parameter that makes the observed data as unsurprising as possible under the model.
A large fraction of “standard losses” in machine learning are just negative log-likelihoods for some probabilistic model.
So MLE isn’t just a statistics technique—it’s a unifying design principle for objective functions.
Assume:
yᵢ = μ(xᵢ; w) + εᵢ, εᵢ ∼ Normal(0, σ²)
Then:
f(yᵢ | xᵢ, w) = (1/√(2πσ²)) exp(−(yᵢ − μ(xᵢ; w))² / (2σ²))
Log-likelihood (dropping constants not depending on w):
ℓ(w)
= ∑ᵢ [ −(yᵢ − μ(xᵢ; w))² / (2σ²) ] + const
Maximizing ℓ(w) ⇔ minimizing ∑ᵢ (yᵢ − μ(xᵢ; w))²
That is least squares.
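A numeric check with assumed toy data: the Gaussian NLL and the sum of squared errors differ only by constants and a positive scale, so they pick the same w:

```python
import math

# Toy 1-D regression data (assumptions) for the model y = w * x + noise
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 1.1, 1.9, 3.2]
sigma2 = 1.0

def sse(w):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))

def nll(w):  # Gaussian negative log-likelihood, constants included
    return sum(0.5 * math.log(2 * math.pi * sigma2)
               + (y - w * x) ** 2 / (2 * sigma2) for x, y in zip(xs, ys))

grid = [i / 1000 for i in range(500, 1500)]
w_sse = min(grid, key=sse)
w_nll = min(grid, key=nll)
print(w_sse == w_nll)  # True: identical minimizer
```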
If yᵢ ∈ {0,1} and model predicts pᵢ = σ(wᵀxᵢ), then
f(yᵢ | xᵢ, w) = pᵢ^{yᵢ} (1 − pᵢ)^{1−yᵢ}
NLL is
−ℓ(w) = −∑ᵢ [ yᵢ log pᵢ + (1−yᵢ) log(1−pᵢ) ]
That is binary cross-entropy.
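Computed directly (the labels and predicted probabilities below are made up):

```python
import math

# Binary cross-entropy as the Bernoulli negative log-likelihood
ys = [1, 0, 1, 1]           # assumed labels
ps = [0.9, 0.2, 0.7, 0.6]   # assumed model probabilities p_i

nll = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
           for y, p in zip(ys, ps))
print(nll)  # about 1.196
```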
MLE is popular because, under regularity conditions and for large n:
- it is consistent (θ̂ converges to the true θ)
- it is asymptotically normal (enabling confidence intervals)
- it is asymptotically efficient (it attains the Cramér–Rao lower bound)
You don’t need these proofs yet, but they explain why MLE is often the default.
MLE chooses a single best θ.
Bayesian inference treats θ as random and updates a prior p(θ) to a posterior:
p(θ | data) ∝ p(data | θ) p(θ)
Notice p(data | θ) is exactly the likelihood (up to notation). So MLE and Bayes share the same core ingredient; Bayes adds a prior.
In many settings, maximizing expected log-likelihood is equivalent to minimizing KL divergence between the true data-generating distribution and your model family. This is one reason KL divergence shows up everywhere in ML.
Once you have θ̂, you often want uncertainty. Many confidence interval methods start from the curvature of ℓ(θ) near θ̂ (observed Fisher information / Hessian). So MLE is a gateway to inference, not just point estimation.
Let x₁,…,xₙ be i.i.d. Bernoulli(p), where xᵢ ∈ {0,1}. We observe k = ∑ᵢ xᵢ heads (1s). Find the MLE p̂.
Write the pmf for one observation:
f(xᵢ | p) = p^{xᵢ}(1−p)^{1−xᵢ}
Write the likelihood (independence ⇒ product):
L(p) = ∏ᵢ p^{xᵢ}(1−p)^{1−xᵢ}
Collect exponents using ∑ᵢ xᵢ = k and ∑ᵢ (1−xᵢ) = n−k:
L(p) = p^k (1−p)^{n−k}
Take logs to simplify:
ℓ(p) = log L(p)
= log(p^k (1−p)^{n−k})
= k log p + (n−k) log(1−p)
Differentiate (score) and set to zero (interior solution):
dℓ/dp = k·(1/p) + (n−k)·(−1/(1−p))
= k/p − (n−k)/(1−p)
Set dℓ/dp = 0:
k/p = (n−k)/(1−p)
Solve for p:
k(1−p) = p(n−k)
k − kp = pn − pk
k = pn
p̂ = k/n
Check it is a maximum (second derivative):
d²ℓ/dp² = −k/p² − (n−k)/(1−p)² < 0 for p ∈ (0,1)
So the critical point is a (strict) local maximum.
Insight: For Bernoulli data, the MLE equals the sample mean: p̂ = (1/n)∑ᵢ xᵢ. This is a recurring theme: MLE often matches intuitive “frequency” estimators.
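A quick numeric confirmation of the derivation, with assumed counts:

```python
import math

# Maximize l(p) = k log p + (n - k) log(1 - p) on a grid; compare to k/n.
k, n = 37, 100   # assumed data: 37 successes in 100 trials

def loglik(p):
    return k * math.log(p) + (n - k) * math.log(1 - p)

grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=loglik)
print(p_hat)  # 0.37, which is exactly k/n
```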
Let x₁,…,xₙ be i.i.d. Poisson(λ), with λ > 0. Find the MLE λ̂.
Write the pmf for one observation:
f(xᵢ | λ) = e^{−λ} λ^{xᵢ} / xᵢ!
Likelihood:
L(λ) = ∏ᵢ [ e^{−λ} λ^{xᵢ} / xᵢ! ]
Simplify the product:
L(λ) = (∏ᵢ e^{−λ}) (∏ᵢ λ^{xᵢ}) / (∏ᵢ xᵢ!)
= e^{−nλ} λ^{∑ᵢ xᵢ} / (∏ᵢ xᵢ!)
Log-likelihood (dropping constants that do not depend on λ):
ℓ(λ) = log L(λ)
= (−nλ) + (∑ᵢ xᵢ) log λ − ∑ᵢ log(xᵢ!)
Differentiate and set to zero:
dℓ/dλ = −n + (∑ᵢ xᵢ)/λ
Set dℓ/dλ = 0:
−n + (∑ᵢ xᵢ)/λ = 0
(∑ᵢ xᵢ)/λ = n
Solve:
λ̂ = (1/n)∑ᵢ xᵢ
Second derivative check:
d²ℓ/dλ² = −(∑ᵢ xᵢ)/λ² < 0 for λ > 0 (assuming not all xᵢ are 0)
So it’s a maximum.
Insight: Again, the MLE matches the sample mean. For Poisson, the mean equals λ, so the MLE is the natural plug-in estimator.
Let x₁,…,xₙ be i.i.d. Normal(μ, σ²). Assume μ is known. Find the MLE for σ².
Write the pdf:
f(xᵢ | σ²) = (1/√(2πσ²)) exp(−(xᵢ−μ)² / (2σ²))
Likelihood:
L(σ²) = ∏ᵢ (1/√(2πσ²)) exp(−(xᵢ−μ)² / (2σ²))
Log-likelihood:
ℓ(σ²)
= ∑ᵢ [ −(1/2)log(2πσ²) − (xᵢ−μ)²/(2σ²) ]
= −(n/2)log(2πσ²) − (1/(2σ²))∑ᵢ (xᵢ−μ)²
Differentiate w.r.t. σ²:
dℓ/d(σ²)
= −(n/2)·(1/σ²) + (1/2)(∑ᵢ (xᵢ−μ)²)·(1/(σ²)²)
Explanation: derivative of −(1/(2σ²))S is +(1/2)S·(1/(σ²)²), where S = ∑ᵢ (xᵢ−μ)²
Set derivative to zero:
−(n/2)(1/σ²) + (1/2)S(1/(σ²)²) = 0
Multiply both sides by 2(σ²)² to clear fractions:
−nσ² + S = 0
Solve:
σ̂²_MLE = S/n = (1/n)∑ᵢ (xᵢ−μ)²
Second derivative check (sketch): curvature is negative at the solution for σ² > 0, giving a maximum.
Insight: The MLE for σ² uses 1/n, not 1/(n−1). The 1/(n−1) version is the unbiased sample variance; MLE prioritizes likelihood maximization, not unbiasedness.
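The 1/n vs 1/(n−1) distinction in code, with assumed data and known μ:

```python
# MLE for sigma^2 with known mu divides by n, not n - 1 (data are assumptions).
xs = [4.1, 5.0, 3.8, 5.2, 4.9]
mu = 4.5

S = sum((x - mu) ** 2 for x in xs)
sigma2_mle = S / len(xs)          # 1/n version: the MLE
sigma2_unb = S / (len(xs) - 1)    # 1/(n-1) version: unbiased, not the MLE
print(sigma2_mle, sigma2_unb)     # the MLE is the smaller of the two
```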
The likelihood L(θ) is the joint pmf/pdf of the observed data, viewed as a function of θ with the data fixed.
The MLE is θ̂ = argmaxθ L(θ); in practice we maximize ℓ(θ) = log L(θ) because it turns products into sums.
For i.i.d. data, ℓ(θ) = ∑ᵢ log f(xᵢ | θ), which is computationally and conceptually convenient.
Interior optima satisfy the score equation: ∇θ ℓ(θ̂) = 0; then you must verify it’s a maximum (curvature) or check boundaries.
Many familiar estimators are MLEs (e.g., Bernoulli p̂ = sample mean; Poisson λ̂ = sample mean).
Many ML loss functions are negative log-likelihoods (cross-entropy, squared error under Gaussian noise).
MLE depends on the assumed model family; it returns the best fit within that family, even if the family is misspecified.
Treating L(θ) as a probability distribution over θ (it is not); only in Bayesian inference do we form p(θ | data).
Forgetting parameter constraints (e.g., p ∈ [0,1], σ² > 0) and missing boundary maxima.
Setting the score to zero and stopping—without checking whether the critical point is a maximum (second derivative/Hessian) or whether multiple maxima exist.
Confusing the MLE variance formula (divide by n) with the unbiased sample variance (divide by n−1).
Uniform(0, θ) MLE: Suppose x₁,…,xₙ are i.i.d. Uniform(0, θ) with θ > 0. Derive the MLE θ̂.
Hint: Write f(x|θ) = 1/θ for 0 ≤ x ≤ θ, and 0 otherwise. The likelihood is zero if θ is smaller than any observed value.
For one observation: f(xᵢ|θ) = 1/θ if 0 ≤ xᵢ ≤ θ, else 0.
Likelihood:
L(θ) = ∏ᵢ (1/θ) · 𝟙{xᵢ ≤ θ}
= θ^{−n} · 𝟙{maxᵢ xᵢ ≤ θ}.
If θ < max xᵢ, then L(θ)=0. For θ ≥ m where m = max xᵢ, L(θ)=θ^{−n}, which decreases as θ increases.
So the maximum occurs at the smallest feasible θ, i.e. θ̂ = maxᵢ xᵢ.
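A grid check of the boundary argument, with assumed observations:

```python
# Uniform(0, theta): L(theta) = theta**(-n) for theta >= max(x), else 0.
xs = [0.8, 2.3, 1.1, 1.9]   # assumed observations
n = len(xs)

def likelihood(theta):
    return 0.0 if theta < max(xs) else theta ** (-n)

grid = [i / 100 for i in range(1, 501)]   # theta = 0.01 ... 5.00
theta_hat = max(grid, key=likelihood)
print(theta_hat)  # 2.3, the sample maximum
```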
Normal(μ, σ²) MLE for μ when σ² is known: Given x₁,…,xₙ i.i.d. Normal(μ, σ²) with σ² known, derive μ̂.
Hint: Write ℓ(μ) and differentiate. The exponent contains ∑ᵢ (xᵢ−μ)².
Log-likelihood (dropping constants not involving μ):
ℓ(μ) = −(1/(2σ²))∑ᵢ (xᵢ−μ)².
Differentiate:
dℓ/dμ = −(1/(2σ²))∑ᵢ 2(xᵢ−μ)(−1)
= (1/σ²)∑ᵢ (xᵢ−μ)
Set to zero:
∑ᵢ (xᵢ−μ) = 0
∑ᵢ xᵢ − nμ = 0
μ̂ = (1/n)∑ᵢ xᵢ.
Second derivative is −n/σ² < 0, so it’s a maximum.
Boundary case for Bernoulli: You observe x₁,…,xₙ all equal to 1 (all successes). What is the MLE for p? Explain why the score equation approach can be misleading here.
Hint: Write ℓ(p) = n log p when k = n, and remember p must be in [0,1].
If all xᵢ = 1, then k = n.
Likelihood: L(p)=p^n.
Log-likelihood: ℓ(p)=n log p, which increases as p increases on (0,1].
Thus the MLE is the boundary point p̂ = 1.
Why score can mislead: dℓ/dp = n/p, which never equals 0 for p ∈ (0,1]. The maximum is not an interior critical point; it occurs at the boundary, so the score=0 condition does not apply.
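The monotone climb toward the boundary is easy to see numerically (n below is an assumed sample size):

```python
import math

# With k = n (all successes), l(p) = n log p increases all the way to p = 1:
# the maximum sits on the boundary, never where the score equals zero.
n = 10
ps = [0.5, 0.9, 0.99, 0.999]
vals = [n * math.log(p) for p in ps]
print(vals)  # strictly increasing toward 0 as p -> 1
```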