Learning patterns from data. Supervised vs unsupervised learning.
Machine learning is the idea that instead of hand-writing rules (“if the email contains ‘free money’, mark as spam”), we let a program learn a pattern from examples. The central move is surprisingly simple: pick a flexible function f(x; θ), define what it means to be wrong with a loss, and use data to find θ that makes the loss small.
Machine learning = choosing a parameterized model f(x; θ) and fitting parameters θ by minimizing an empirical risk (average loss on a dataset). In supervised learning, data includes labels y so the loss measures prediction error; in unsupervised learning, there are no labels, so the objective captures structure (e.g., clustering, density). Training is optimization (often gradient descent), and generalization is the goal (performing well on new data).
In many problems, writing explicit rules is either too hard or too brittle.
Rules struggle because the mapping from inputs to outputs is complex and full of edge cases. Machine learning changes the workflow: instead of enumerating rules by hand, you collect examples and let optimization discover the mapping.
A machine learning model is a parameterized function:
We write this as:
f(x; θ)
Examples of what f might represent: a linear score, a decision tree's output, or a neural network's prediction.
The key is that f is not a single function until you choose θ. The learning problem is: find θ from data.
Typically we have n examples. In supervised learning we write:
D = { (xᵢ, yᵢ) }ᵢ₌₁ⁿ
In unsupervised learning we usually have only inputs:
D = { xᵢ }ᵢ₌₁ⁿ
A common confusion: if we minimize error on the training set, aren’t we done?
Not quite. We care about performance on new data from the same underlying process. So we distinguish:
A model can fit training data extremely well yet generalize poorly (overfitting). This lesson sets up the objects we need to discuss that later (e.g., in Bias-Variance Tradeoff).
Machine learning problems often look different on the surface (regression vs classification vs clustering), but underneath they share a template: choose a model family f(x; θ), define a scalar objective J(θ), optimize it, and evaluate on held-out data.
The rest of the field is largely about better choices for f, better objectives, and better optimization/evaluation procedures.
We need a way to turn “patterns in data” into a concrete computational object. A parameterized model does exactly that: it defines a space of candidate functions indexed by θ.
Think of θ as coordinates in a “function space.” Each θ picks a specific function f(·; θ).
In many ML settings, the inputs and parameters are vectors. We often use boldface for vectors: x, w, θ.
A classic example is a linear score function:
s(x; w, b) = wᵀx + b
You can fold b into w by augmenting the input:
Let x′ = [x; 1] and w′ = [w; b]
Then:
s(x; w, b) = (w′)ᵀ x′
This “feature augmentation” is simple, but it’s a good reminder: the choice of representation can simplify everything.
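As a quick check, the augmented form can be verified numerically; here is a minimal sketch (the specific numbers are illustrative):

```python
import numpy as np

# Illustrative 3-dimensional input, weights, and bias.
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -1.0, 2.0])
b = 0.25

# Original affine score: wᵀx + b
s_affine = w @ x + b

# Augmented form: fold b into the weight vector, append 1 to the input.
x_aug = np.append(x, 1.0)   # x′ = [x; 1]
w_aug = np.append(w, b)     # w′ = [w; b]
s_aug = w_aug @ x_aug       # (w′)ᵀ x′

assert np.isclose(s_affine, s_aug)
```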
Different tasks interpret f(x; θ) differently.
| Task type | Output f(x; θ) | Typical interpretation |
|---|---|---|
| Regression | real number ŷ ∈ ℝ | predicted value |
| Binary classification | probability p ∈ [0, 1] | P(y = 1 \| x) |
| Multi-class classification | vector p ∈ [0,1]ᵏ, ∑ pⱼ = 1 | class distribution |
| Representation learning | embedding z ∈ ℝᵐ | learned features |
The same “score” s(x; θ) might be post-processed: for example, passed through a sigmoid to yield a probability, or through a softmax/argmax to yield a class.
A parameterized model family can be small and rigid (few parameters, like a linear model) or large and flexible (many parameters, like a deep network).
You don’t need a formal definition of capacity yet, but you should feel the tradeoff: richer f can fit more, but may generalize worse without regularization or enough data.
For linear models, learning often becomes geometry.
Even for non-linear models, optimization still adjusts θ to reshape decision boundaries or function curves.
A subtle but crucial distinction: the model is the function family f(x; θ), while the training algorithm is the procedure that searches for θ. Different training algorithms can fit the same model family, and the same training algorithm can be used for different model families.
This separation is what makes ML modular.
To “learn,” we need a quantitative notion of goodness. That means a scalar objective function we can minimize.
In supervised learning, the objective is usually based on a loss function ℓ(ŷ, y) that measures how bad a prediction is for one example.
Then we aggregate loss over the dataset.
Given D = { (xᵢ, yᵢ) }ᵢ₌₁ⁿ and predictions ŷᵢ = f(xᵢ; θ), define empirical risk:
R̂(θ) = (1/n) ∑ᵢ₌₁ⁿ ℓ(f(xᵢ; θ), yᵢ)
Training typically means:
θ̂ = argmin_θ R̂(θ)
This is called Empirical Risk Minimization (ERM).
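As a concrete sketch of ERM, the following brute-force search minimizes R̂(θ) over a grid for a 1D linear model (the data and grid are illustrative; real training would use gradient-based optimization):

```python
import numpy as np

# Minimal ERM sketch for a 1D linear model f(x; θ) = θ·x with squared loss.
def empirical_risk(theta, xs, ys):
    preds = theta * xs
    losses = (preds - ys) ** 2   # ℓ(ŷ, y) = (ŷ − y)²
    return losses.mean()         # R̂(θ) = (1/n) ∑ ℓ

# Illustrative dataset.
xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.1, 3.9, 6.2])

# Evaluate the objective on a grid and pick the best θ.
grid = np.linspace(0.0, 4.0, 401)
risks = np.array([empirical_risk(t, xs, ys) for t in grid])
theta_hat = grid[risks.argmin()]
```

For this model the exact minimizer is ∑xᵢyᵢ / ∑xᵢ², and the grid search lands within one grid step of it.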
The risk is called “empirical” because it is computed on the empirical sample you observed, not on the full (unknown) data-generating distribution.
If the true distribution of (x, y) is P, then the population (true) risk is:
R(θ) = E₍x,y₎∼P[ ℓ(f(x; θ), y) ]
But we cannot compute R(θ) exactly because P is unknown. We approximate it with R̂(θ).
This gap between R̂(θ) and R(θ) is the bridge between data and generalization: minimizing empirical risk is useful only insofar as it tracks the population risk.
You already know MLE. Many ML objectives are MLE in disguise.
Suppose your model defines a conditional distribution p(y | x; θ). The likelihood on data is:
L(θ) = ∏ᵢ₌₁ⁿ p(yᵢ | xᵢ; θ)
MLE chooses:
θ̂ = argmax_θ ∏ᵢ p(yᵢ | xᵢ; θ)
It is more convenient to maximize log-likelihood:
log L(θ) = ∑ᵢ log p(yᵢ | xᵢ; θ)
Maximizing log-likelihood is equivalent to minimizing negative log-likelihood (NLL):
θ̂ = argmin_θ −∑ᵢ log p(yᵢ | xᵢ; θ)
Now compare with ERM:
R̂(θ) = (1/n) ∑ᵢ ℓ(f(xᵢ; θ), yᵢ)
If we set:
ℓᵢ(θ) = −log p(yᵢ | xᵢ; θ)
then minimizing empirical risk equals MLE (up to the constant factor 1/n).
So one major story of ML is:
Choose a probabilistic model p(y|x;θ) ⇒ derive a loss (NLL) ⇒ minimize it.
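For instance, a Bernoulli model p(y = 1 | x; θ) = σ(θ·x) turns this recipe into the familiar binary cross-entropy loss. A minimal sketch (the data is illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(theta, xs, ys):
    p = sigmoid(theta * xs)  # model's P(y = 1 | x)
    # Average negative log-likelihood = binary cross-entropy loss.
    return -np.mean(ys * np.log(p) + (1 - ys) * np.log(1 - p))

# Illustrative, linearly separable 1D data.
xs = np.array([-2.0, -1.0, 1.0, 2.0])
ys = np.array([0.0, 0.0, 1.0, 1.0])

# A larger θ separates these points more confidently, so the NLL decreases.
assert nll(2.0, xs, ys) < nll(0.5, xs, ys)
```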
Often we train not just by fitting data, but also by discouraging overly complex parameter settings.
A common regularized objective is:
J(θ) = (1/n) ∑ᵢ ℓ(f(xᵢ; θ), yᵢ) + λ Ω(θ)
Where ℓ is the per-example loss, Ω(θ) is a complexity penalty (e.g., Ω(θ) = ‖θ‖² for L2 regularization), and λ ≥ 0 controls the strength of the penalty.
From a probabilistic view, regularization often corresponds to a prior on θ (MAP estimation), but you don’t need that machinery to use it.
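A sketch of the regularized objective with an L2 penalty Ω(θ) = ‖θ‖² (the model, data, and λ values are illustrative):

```python
import numpy as np

# J(θ) = R̂(θ) + λ·Ω(θ) for a linear model with squared loss.
def objective(theta, X, y, lam):
    preds = X @ theta
    data_fit = np.mean((preds - y) ** 2)  # empirical risk
    penalty = np.sum(theta ** 2)          # Ω(θ) = ‖θ‖²
    return data_fit + lam * penalty

# Illustrative noiseless data generated by a known θ.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true

# At the data-fitting optimum, a larger λ strictly raises the objective:
assert objective(theta_true, X, y, 0.0) < objective(theta_true, X, y, 0.1)
```

This illustrates the tradeoff: the penalty pulls the minimizer of J away from the pure data-fit optimum toward smaller parameters.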
Given J(θ), gradient descent performs updates:
θ ← θ − α ∇θ J(θ)
Because you already know gradient descent, the new idea here is mostly: what is J? ML is largely the craft of choosing J so that minimizing it produces useful behavior.
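A minimal sketch of gradient descent on such a J, here the squared-loss objective for f(x; θ) = θ·x (the data is chosen to match the worked example later in this lesson, where the minimizer is θ = 8/5):

```python
import numpy as np

def grad(theta, xs, ys):
    # ∇θ of (1/n) ∑ (θ·xᵢ − yᵢ)² is (2/n) ∑ (θ·xᵢ − yᵢ)·xᵢ
    return 2.0 * np.mean((theta * xs - ys) * xs)

xs = np.array([1.0, 2.0])
ys = np.array([2.0, 3.0])

theta, alpha = 0.0, 0.05   # initial guess and step size (illustrative)
for _ in range(200):
    theta -= alpha * grad(theta, xs, ys)   # θ ← θ − α ∇θ J(θ)

# theta converges to the closed-form minimizer 8/5 = 1.6
```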
To estimate generalization, we split data: a training set to fit θ, a validation set to guide modeling choices (like λ), and a test set held out for a final, unbiased estimate.
The key is to avoid letting the test set influence choices, otherwise your “estimate” of generalization becomes biased.
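A sketch of a random three-way split (the 60/20/20 proportions and dataset size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
idx = rng.permutation(n)       # shuffle indices so the split is random

train_idx = idx[:60]           # fit θ on these examples
val_idx = idx[60:80]           # tune hyperparameters (e.g., λ) on these
test_idx = idx[80:]            # touch only once, for the final estimate
```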
“Supervised vs unsupervised” is not just a label—it determines what objective you can write down.
This changes evaluation, failure modes, and what “success” even means.
Data: D = { (xᵢ, yᵢ) }
Goal: learn a mapping x → y.
Typical objective (ERM):
R̂(θ) = (1/n) ∑ᵢ ℓ(f(xᵢ; θ), yᵢ)
Common tasks: regression (predict a real value) and classification (predict a category).
Evaluation is comparatively straightforward: compare predictions ŷ against held-out labels y.
Data: D = { xᵢ }
Goal: discover structure in x.
Because there is no y, the objective can take different forms. Clustering (k-means) minimizes within-cluster distortion:
min_{μ₁,…,μ_k, c₁,…,cₙ} ∑ᵢ ‖xᵢ − μ_{cᵢ}‖²
Density estimation fits a distribution to the inputs by MLE:
θ̂ = argmax_θ ∏ᵢ p(xᵢ; θ)
Evaluation is trickier because “ground truth” labels may not exist. You often evaluate indirectly: via downstream task performance, held-out likelihood, or human inspection of the discovered structure.
Many modern systems mix these ideas, combining supervised and unsupervised signals (see the table below):
These still follow the same template: define f and define an objective J.
| Learning setting | Data available | Typical objective | What success looks like |
|---|---|---|---|
| Supervised | (x, y) pairs | minimize prediction loss | low test error on y |
| Unsupervised | x only | clustering / likelihood / reconstruction | meaningful structure; useful representations |
| Semi-supervised | few (x, y) + many x | combined supervised + unsupervised | better than supervised-only with few labels |
| Self-supervised | x only (pseudo y) | predict withheld parts / contrast views | transferable representations |
This intro node enables specific, concrete models and deeper theory (for example, the Bias-Variance Tradeoff mentioned earlier).
The through-line: every one of these is still “choose f(x; θ), choose J(θ), optimize.”
You have labeled data D = { (xᵢ, yᵢ) }ᵢ₌₁ⁿ. Your model specifies a conditional probability p(y | x; θ). Show how maximizing likelihood becomes minimizing an empirical risk with a per-example loss.
Start with the likelihood of the dataset (conditional independence assumption):
L(θ) = ∏ᵢ₌₁ⁿ p(yᵢ | xᵢ; θ)
Take logs to make products into sums:
log L(θ) = log(∏ᵢ p(yᵢ | xᵢ; θ))
= ∑ᵢ log p(yᵢ | xᵢ; θ)
MLE chooses θ that maximizes log-likelihood:
θ̂ = argmax_θ ∑ᵢ log p(yᵢ | xᵢ; θ)
Convert to a minimization by multiplying by −1:
θ̂ = argmin_θ −∑ᵢ log p(yᵢ | xᵢ; θ)
Recognize this as a sum of per-example losses. Define:
ℓᵢ(θ) = −log p(yᵢ | xᵢ; θ)
Then the objective is:
J(θ) = ∑ᵢ ℓᵢ(θ)
Optionally average over n (does not change the argmin):
R̂(θ) = (1/n) ∑ᵢ₌₁ⁿ ℓᵢ(θ)
This matches ERM with loss ℓ(f(xᵢ; θ), yᵢ) when f induces p(y|x;θ).
Insight: A large fraction of supervised ML is just MLE with a convenient parameterization. The “loss function” is often simply the negative log-likelihood of your probabilistic model.
Suppose f(x; θ) = θx (a 1D linear model with one parameter θ). Data: (x₁, y₁) = (1, 2), (x₂, y₂) = (2, 3). Use squared loss ℓ(ŷ, y) = (ŷ − y)². Write the empirical risk and find the minimizing θ by calculus.
Write predictions:
ŷ₁ = f(1; θ) = θ
ŷ₂ = f(2; θ) = 2θ
Write empirical risk (average squared loss):
R̂(θ) = (1/2)[(θ − 2)² + (2θ − 3)²]
Expand the squares:
(θ − 2)² = θ² − 4θ + 4
(2θ − 3)² = 4θ² − 12θ + 9
Sum and scale:
R̂(θ) = (1/2)[(θ² − 4θ + 4) + (4θ² − 12θ + 9)]
= (1/2)[5θ² − 16θ + 13]
= (5/2)θ² − 8θ + 13/2
Differentiate and set to zero:
∂R̂/∂θ = 5θ − 8
Set 5θ − 8 = 0 ⇒ θ = 8/5 = 1.6
Check (optional) that it is a minimum:
∂²R̂/∂θ² = 5 > 0 ⇒ minimum.
Insight: Even this toy problem follows the full ML pattern: define f(x; θ), choose a loss, average it into R̂(θ), then optimize. In higher dimensions you typically use vector calculus and gradient descent rather than solving in closed form.
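The calculus answer can be double-checked numerically by scanning R̂(θ) over a fine grid:

```python
import numpy as np

# Numerical check of the worked example: minimize
# R̂(θ) = (1/2)[(θ − 2)² + (2θ − 3)²] over a fine grid.
thetas = np.linspace(0.0, 3.0, 3001)   # step size 0.001
risk = 0.5 * ((thetas - 2.0) ** 2 + (2.0 * thetas - 3.0) ** 2)
theta_star = thetas[risk.argmin()]     # matches the calculus answer 8/5 = 1.6
```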
You have unlabeled points x₁,…,xₙ in ℝᵈ. You want to group them into k clusters. Write the k-means objective and interpret it as empirical risk minimization over latent assignments.
Introduce cluster centers μ₁,…,μ_k where each μⱼ ∈ ℝᵈ.
Introduce an assignment cᵢ ∈ {1,…,k} for each data point xᵢ (a latent/hidden variable).
Define the distortion (squared distance) for a point to its assigned center:
lossᵢ = ‖xᵢ − μ_{cᵢ}‖²
Sum over the dataset to get an empirical objective:
J(μ₁,…,μ_k, c₁,…,cₙ) = ∑ᵢ₌₁ⁿ ‖xᵢ − μ_{cᵢ}‖²
Interpretation: the assignments cᵢ act like latent “labels” the algorithm must infer, and ‖xᵢ − μ_{cᵢ}‖² plays the role of a per-example loss, so J is an empirical risk over both centers and assignments.
Training alternates (informally): assign each point to its nearest center, then recompute each center as the mean of its assigned points.
This decreases J at each step.
Insight: Unsupervised learning still optimizes a scalar objective, but the objective is about structure (compact clusters) rather than matching given targets y. The absence of labels changes both the objective and how you validate results.
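A minimal sketch of this alternating scheme (the data, k, and iteration count are illustrative; real use would add convergence checks and a better initialization such as k-means++):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two illustrative, well-separated blobs in ℝ².
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(3.0, 0.3, size=(20, 2))])
k = 2
# Initialize centers at randomly chosen data points.
centers = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(10):
    # Assignment step: each point goes to its nearest center.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    c = dists.argmin(axis=1)
    # Update step: each center becomes the mean of its assigned points
    # (kept unchanged if a cluster happens to be empty).
    centers = np.array([X[c == j].mean(axis=0) if np.any(c == j) else centers[j]
                        for j in range(k)])

# Final distortion J — the k-means objective from above.
J = sum(np.linalg.norm(X[i] - centers[c[i]]) ** 2 for i in range(len(X)))
```

Each pass through the loop performs one assignment step and one update step, and J never increases across passes.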
Machine learning formalizes “learning from data” as choosing parameters θ of a model f(x; θ).
A parameterized model defines a whole family of functions; training selects one member of that family by optimizing θ.
Empirical risk R̂(θ) = (1/n) ∑ ℓ(f(xᵢ; θ), yᵢ) is the standard supervised training objective (ERM).
Population risk R(θ) = E[ℓ(f(x; θ), y)] is what we actually care about; R̂(θ) is a sample-based approximation.
Many common losses are negative log-likelihoods; minimizing them is equivalent to MLE (up to constants).
Regularization augments the objective with a term like λΩ(θ) to prefer simpler parameter settings and improve generalization.
Supervised learning uses labeled (x, y) data; unsupervised learning uses x only and must define objectives that capture structure (clustering, density, reconstruction).
Confusing the model with the training algorithm: f(x; θ) is the model; gradient descent is just one way to find θ.
Thinking low training loss implies a good model: generalization depends on how well the learned θ performs on unseen data.
Mixing up objectives across settings: using supervised metrics (accuracy) when you don’t have labels, or assuming unsupervised structure guarantees downstream performance.
Treating “loss” and “metric” as the same thing: you might train with cross-entropy but report accuracy; they serve different roles.
You have supervised data D = { (xᵢ, yᵢ) }ᵢ₌₁ⁿ and a model f(x; θ). Write the empirical risk for absolute error loss ℓ(ŷ, y) = |ŷ − y|. Is the objective differentiable everywhere?
Hint: Write R̂(θ) = (1/n)∑|f(xᵢ;θ) − yᵢ|. Consider what happens when f(xᵢ;θ) = yᵢ.
The empirical risk is:
R̂(θ) = (1/n) ∑ᵢ₌₁ⁿ |f(xᵢ; θ) − yᵢ|.
It is not differentiable at points where f(xᵢ; θ) − yᵢ = 0 for some i (the absolute value has a kink at 0). It is subdifferentiable there, and differentiable elsewhere.
Show that minimizing the average loss (1/n)∑ᵢ ℓᵢ(θ) has the same minimizer as minimizing the sum ∑ᵢ ℓᵢ(θ).
Hint: Multiplying an objective by a positive constant does not change the argmin.
Let J(θ) = ∑ᵢ₌₁ⁿ ℓᵢ(θ) and R̂(θ) = (1/n)J(θ).
For any θ₁, θ₂:
J(θ₁) ≤ J(θ₂) ⇔ (1/n)J(θ₁) ≤ (1/n)J(θ₂)
because 1/n > 0.
Therefore θ that minimizes J also minimizes R̂, so:
argmin_θ J(θ) = argmin_θ R̂(θ).
Unsupervised vs supervised identification: For each dataset type below, say whether it is supervised or unsupervised and propose a plausible objective.
(A) 10,000 images each tagged with {cat, dog}
(B) 1,000,000 unlabeled user click sequences
(C) 50,000 house listings with final sale price
Hint: Look for whether y is present. If not, think clustering, likelihood, reconstruction, or contrastive objectives.
(A) Supervised (classification). Objective: minimize cross-entropy / NLL for p(y|x;θ).
(B) Unsupervised (no labels). Objective examples: maximize likelihood of sequences p(x;θ), or self-supervised next-item prediction / masked prediction loss.
(C) Supervised (regression). Objective: minimize squared loss (ŷ − y)² (possibly with regularization).
Next nodes you can study: