Reducing feature dimensions. PCA, t-SNE, autoencoders.
Quick unlock: significant prerequisite investment, but a simple final step. Verify prerequisites first.
Modern datasets often live in spaces with hundreds, thousands, or millions of features—but the structure we care about is frequently much simpler. Dimensionality reduction is the art of finding that simpler structure without throwing away what matters.
Dimensionality reduction learns a mapping f: ℝᴰ → ℝᵈ (with d ≪ D) that embeds high-dimensional points into a lower-dimensional space while preserving a chosen property (variance, distances, neighborhoods, or reconstruction). Linear methods (PCA) preserve global variance; nonlinear methods like t-SNE preserve local neighborhoods for visualization; autoencoders learn task-driven nonlinear embeddings by compressing and reconstructing data.
High-dimensional feature spaces create three recurring problems:
1) Computation and storage: Many ML algorithms scale poorly with D. Pairwise distances, covariance matrices, nearest-neighbor search, and kernel methods can become expensive.
2) Generalization and sample efficiency: With limited data, many degrees of freedom invite overfitting. In broad strokes, if your model can wiggle in D directions, you need enough data to constrain those directions.
3) Geometry gets weird in high dimensions (“curse of dimensionality”): Distances concentrate, neighborhoods become sparse, and local density estimation becomes hard. Many intuitive operations (like “find the nearest neighbors”) degrade unless the data truly lies on a lower-dimensional structure.
Dimensionality reduction aims to address these issues by learning an embedding that is lower-dimensional but still useful.
You are given data points x ∈ ℝᴰ. Dimensionality reduction chooses a smaller dimension d (typically 2, 3, 10, 50, 256, …) and learns a mapping
f: ℝᴰ → ℝᵈ with d ≪ D
so that the embedding z = f(x) retains something you care about.
The key phrase is “retain something you care about.” Dimensionality reduction is not one algorithm; it is a design pattern:
1) Low-dimensional embedding
2) Preservation criterion
3) Mapping mechanism
Dimensionality reduction methods often differ along two axes:
| Axis | Option A | Option B |
|---|---|---|
| Mapping | Parametric (explicit f; can embed new points) | Nonparametric (embeds training set; out-of-sample is nontrivial) |
| Geometry preserved | Global (variance, long-range distances) | Local (neighbors, manifold structure) |
And a third practical axis:
| Goal | Typical method |
|---|---|
| Visualization (2D/3D plots) | t-SNE, UMAP, PCA (as baseline) |
| Compression for downstream models | PCA, autoencoders |
| Denoising / representation learning | Denoising autoencoders, contractive autoencoders |
| Preprocessing for clustering / kNN | PCA, UMAP (careful), sometimes random projections |
The central skill: match the preservation criterion to your end use.
If you only remember one thing: dimensionality reduction is defined by what it preserves. Two embeddings with the same d can behave radically differently depending on the criterion.
A useful way to think: every method minimizes some loss L that scores how badly an embedding violates its preservation criterion. Different algorithms pick different L.
Even though you already know PCA, it’s worth anchoring it as the prototypical global criterion.
Let centered data be x̃ = x − μ.
Pick an orthonormal basis W = [w₁ … wᵈ] ∈ ℝᴰ×ᵈ with WᵀW = I.
Projection: zᵢ = Wᵀ x̃ᵢ
PCA chooses W to maximize captured variance:
maximize ∑ᵢ ‖Wᵀ x̃ᵢ‖²
Equivalently, minimize reconstruction error:
minimize ∑ᵢ ‖x̃ᵢ − W Wᵀ x̃ᵢ‖²
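To make the equivalence concrete, here is a minimal NumPy sketch (toy data; all names are illustrative, not from the text): it fits the top-d basis from the covariance eigendecomposition and checks that captured variance and reconstruction error split the total.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 points in D=5 whose variance lies mostly in a 2D subspace.
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5)) \
    + 0.05 * rng.normal(size=(200, 5))

mu = X.mean(axis=0)
Xc = X - mu                                    # centered data x̃

# Principal directions = eigenvectors of the covariance matrix.
C = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(C)           # ascending eigenvalues
W = eigvecs[:, ::-1][:, :2]                    # top-d orthonormal basis (d=2)

captured = np.sum((Xc @ W) ** 2)               # ∑ᵢ ‖Wᵀx̃ᵢ‖² (variance objective)
residual = np.sum((Xc - Xc @ W @ W.T) ** 2)    # ∑ᵢ ‖x̃ᵢ − WWᵀx̃ᵢ‖²
total = np.sum(Xc ** 2)                        # fixed, independent of W
assert np.isclose(total, captured + residual)  # maximize one ⇔ minimize the other
```

Because total is fixed by the data, maximizing `captured` and minimizing `residual` select the same W.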
Why this matters: the criterion determines what the embedding can be trusted for. Suppose you care about distances between points. Let δᵢⱼ be distances in the original space (often Euclidean):
δᵢⱼ = ‖xᵢ − xⱼ‖
You want:
‖zᵢ − zⱼ‖ ≈ δᵢⱼ
A classic objective (stress) is:
Stress = ∑ᵢ<ⱼ (‖zᵢ − zⱼ‖ − δᵢⱼ)²
This is global: it tries to match all pairwise distances. In high dimensions, however, global distance preservation in very low d can be impossible, so tradeoffs are inevitable.
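A direct quadratic-time sketch of the stress objective (the function name is ours, not a library API):

```python
import numpy as np

def stress(X, Z):
    """Sum over pairs of (low-D distance − high-D distance)²."""
    n = len(X)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d_hi = np.linalg.norm(X[i] - X[j])   # δᵢⱼ
            d_lo = np.linalg.norm(Z[i] - Z[j])   # ‖zᵢ − zⱼ‖
            total += (d_lo - d_hi) ** 2
    return total

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
perfect = stress(X, X.copy())      # all pairwise distances preserved
shrunk = stress(X, 0.5 * X)        # uniform shrinking distorts every pair
```

Stress is zero exactly when all pairwise distances match; any isometry (rotation, translation, reflection) of X also achieves zero.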
Visualization methods often care more about “who is near whom” than exact distances.
t-SNE does this via probability distributions.
1) Convert high-D distances into conditional neighbor probabilities:
pⱼ|ᵢ ∝ exp(−‖xᵢ − xⱼ‖² / (2σᵢ²))
σᵢ is chosen so that each point has a roughly fixed effective number of neighbors (the perplexity parameter controls this).
2) Define low-D similarities with a heavy-tailed distribution (Student-t):
qᵢⱼ ∝ (1 + ‖zᵢ − zⱼ‖²)⁻¹
3) Minimize KL divergence so that neighbors in high-D stay neighbors in low-D:
KL(P ‖ Q) = ∑ᵢ ∑ⱼ pᵢⱼ log(pᵢⱼ / qᵢⱼ)
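The three steps above can be sketched in NumPy for toy data. One simplification versus real t-SNE: a single fixed bandwidth σ is used for every point instead of the per-point σᵢ chosen by perplexity.

```python
import numpy as np

def tsne_affinities(X, Z, sigma=1.0):
    """Toy P (high-D), Q (low-D), and KL(P ‖ Q). Fixed sigma, unlike t-SNE."""
    n = len(X)
    # Step 1: Gaussian conditional probabilities p_{j|i}, rows normalized.
    d2 = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    P_cond = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(P_cond, 0.0)
    P_cond /= P_cond.sum(axis=1, keepdims=True)
    P = (P_cond + P_cond.T) / (2.0 * n)          # symmetrized joint, sums to 1
    # Step 2: heavy-tailed Student-t similarities q_ij, globally normalized.
    dz2 = np.sum((Z[:, None] - Z[None, :]) ** 2, axis=-1)
    Q = 1.0 / (1.0 + dz2)
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum()
    # Step 3: KL divergence (the quantity t-SNE's gradient descent reduces).
    mask = P > 0
    kl = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
    return P, Q, kl

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))
P, Q, kl = tsne_affinities(X, Z=X[:, :2])
```

Real t-SNE then moves the zᵢ along the gradient of this KL; here we only evaluate it.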
What this criterion buys you:
- Points that are neighbors in high dimensions tend to stay neighbors in the plot, so clusters become visually salient.
What it does not guarantee:
- Distances between well-separated clusters, relative cluster sizes, or any faithful global geometry.
UMAP has a similar “local connectivity” spirit but uses a graph/fuzzy simplicial set objective; practically, it often preserves more global structure than t-SNE, but still primarily optimizes neighborhood relations.
If your downstream tasks require information beyond neighborhoods—say, you need a compressed representation that can regenerate the input—then reconstruction is a natural criterion.
An autoencoder learns an encoder f_θ and a decoder g_φ, with embedding z = f_θ(x) and reconstruction x̂ = g_φ(z).
Typical loss:
L(θ, φ) = ∑ᵢ ‖xᵢ − g_φ(f_θ(xᵢ))‖²
or for images, sometimes cross-entropy / perceptual losses.
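To see the bottleneck at work, a deliberately simple sketch: a linear autoencoder (no activations) trained by hand-written gradient descent in NumPy. All sizes and hyperparameters are illustrative choices, not prescriptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data with intrinsic dimension 2 embedded in D=8.
X = rng.normal(size=(256, 2)) @ rng.normal(size=(2, 8))

D, d = 8, 2
We = 0.1 * rng.normal(size=(D, d))     # encoder: z = x @ We
Wd = 0.1 * rng.normal(size=(d, D))     # decoder: x̂ = z @ Wd
lr = 0.01

def loss(We, Wd):
    return np.mean((X - X @ We @ Wd) ** 2)

loss_start = loss(We, Wd)
for _ in range(2000):
    Z = X @ We
    G = 2.0 * (Z @ Wd - X) / X.size    # dL/dx̂
    Wd -= lr * Z.T @ G                 # decoder gradient step
    We -= lr * X.T @ (G @ Wd.T)        # encoder gradient step
loss_end = loss(We, Wd)
```

Because the data is exactly rank 2 and d = 2, reconstruction loss can be driven close to zero; with d smaller than the intrinsic dimension, an irreducible floor appears.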
This criterion is task-agnostic but not “semantics-agnostic”: the learned z depends strongly on network architecture, bottleneck size d, regularization, and data.
| If you need… | Preserve… | Typical choice | Notes |
|---|---|---|---|
| Fast linear compression, denoising, baseline | Variance / reconstruction (linear) | PCA | Strong baseline; interpretable components |
| 2D/3D visualization of clusters | Local neighborhoods | t-SNE, UMAP | Do not over-interpret global geometry |
| Approximate distance geometry | Pairwise distances | MDS / Isomap | Hard in very low d; can distort heavily |
| Nonlinear features for downstream models | Reconstruction + regularization | Autoencoder / VAE | Parametric; can embed new data |
A practical workflow is often: PCA → (optional) t-SNE/UMAP for visualization; or PCA → autoencoder for nonlinear compression.
Two methods can preserve “local neighborhoods,” but one may:
- provide an explicit map f for embedding new points, while the other only places the training set;
- scale to millions of points, while the other becomes prohibitively slow;
- produce deterministic coordinates, while the other depends on initialization and seed.
This matters whenever you want to deploy an embedding in production, feed it into a classifier, or update it as new data arrives.
The simplest parametric mapping:
f(x) = Wᵀ(x − μ)
Properties:
- Linear, deterministic, and cheap to apply: embedding a new point is one centered matrix multiply.
- Fitting reduces to an SVD/eigendecomposition; there is no iterative optimization and no random seed.
When it shines:
- The dominant structure is approximately linear, you need fast preprocessing, or you want a strong baseline before anything nonlinear.
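An end-to-end sketch of the parametric property in plain NumPy (in practice, a library implementation such as sklearn.decomposition.PCA wraps the same steps):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 4))
X_new = rng.normal(size=(3, 4))       # unseen points

mu = X_train.mean(axis=0)             # fit on training data only
_, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
W = Vt[:2].T                          # top-2 principal directions, shape D×d

def f(x):
    """Parametric map f(x) = Wᵀ(x − μ); applies to any point, seen or not."""
    return (x - mu) @ W

Z_new = f(X_new)                      # out-of-sample embedding, no refitting
```

Contrast this with classic t-SNE, where no such f exists after training.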
Kernel PCA (and related methods) implicitly map data to a high-dimensional feature space via a kernel k(x, x′) and then do PCA there.
It enables nonlinear embeddings while keeping an eigenvector-based formulation.
Tradeoffs:
- The kernel matrix is n × n, so fitting scales poorly in memory and time with dataset size.
- The kernel is a modeling choice, and embedding a new point requires kernel evaluations against the training set.
This node focuses on PCA/t-SNE/autoencoders, but kernel ideas explain a common theme: nonlinear geometry without explicit neural networks.
Classic t-SNE learns {zᵢ} directly for the dataset. There is no explicit f.
Consequences:
- There is no built-in way to embed a new point without re-optimizing.
- Different seeds and hyperparameters can produce visibly different layouts.
- The embedding is tied to the exact dataset it was fit on.
So, think of t-SNE as: optimize coordinates, not a function.
An autoencoder defines an explicit parametric mapping f_θ.
A typical structure:
- Encoder: a stack of layers that progressively shrinks the representation down to the bottleneck z ∈ ℝᵈ.
- Decoder: a mirrored stack that expands z back to ℝᴰ.
Because f_θ is explicit:
- New points embed with a single forward pass, and the encoder deploys like any other trained model.
But because it is flexible:
- It needs tuning (architecture, bottleneck size, optimizer, regularization), and the learned latent space is not guaranteed to be smooth or semantically organized.
A pure reconstruction objective can yield embeddings that are hard to use for clustering or interpolation. Common fixes:
1) Denoising autoencoder
Corrupt each input, x̃ᵢ = C(xᵢ) (e.g., added Gaussian noise or masking), and train to reconstruct the clean point.
Loss:
L = ∑ᵢ ‖xᵢ − g_φ(f_θ(x̃ᵢ))‖²
Effect: forces z to capture robust structure.
2) Contractive penalty (encourage local smoothness)
Penalize Jacobian magnitude:
L = reconstruction + λ ‖∂f_θ(x)/∂x‖_F²
Interpretation: small input changes shouldn’t cause wild latent changes.
3) Variational autoencoder (VAE) (probabilistic latent space)
Instead of a point z, encoder outputs a distribution q_θ(z|x) and regularizes it toward a prior p(z) (often 𝒩(0, I)).
Objective (ELBO):
maximize ∑ᵢ 𝔼_{q_θ(z|xᵢ)}[log p_φ(xᵢ|z)] − KL(q_θ(z|xᵢ) ‖ p(z))
Effect: more continuous, “organized” latent spaces—useful for generation and interpolation.
| Method | f: ℝᴰ → ℝᵈ explicit? | Preserves | Strength | Weakness |
|---|---|---|---|---|
| PCA | Yes (linear) | Global variance / linear reconstruction | Fast, stable, interpretable | Misses nonlinear structure |
| t-SNE | No (classic) | Local neighborhoods | Excellent 2D/3D cluster visualization | Global geometry unreliable; hard out-of-sample |
| Autoencoder | Yes (nonlinear) | Reconstruction (plus whatever regularization encourages) | Parametric, flexible, scalable | Needs tuning; can learn unhelpful latents |
The high-level skill: choose the simplest mapping mechanism that can satisfy your preservation criterion.
A 2D embedding is often used to answer questions like:
- Do known classes form separate clusters?
- Are there unexpected groups, outliers, or possible labeling errors?
Best practice:
- Reduce with PCA first (e.g., to 30–50 dimensions), then run t-SNE/UMAP on that output, trying several perplexity values and seeds.
Why the PCA pre-step helps:
- It strips low-variance noise directions, speeds up the pairwise computations, and tends to stabilize the layout.
Interpretation warnings (especially t-SNE):
- Inter-cluster distances, apparent cluster sizes, and point densities in the plot are not faithful to the original space.
Using z as input to a classifier/regressor can:
- cut training cost and reduce overfitting when D is large relative to n,
- but also discard label-relevant information, since the criterion knows nothing about your labels.
PCA is common when: features are high-dimensional and correlated and you want a fast, deterministic, leakage-safe transform.
Autoencoders are common when: the structure is nonlinear and there is enough data to train the encoder reliably.
Many retrieval systems use embeddings: queries and items are mapped into the same ℝᵈ, and (approximate) nearest-neighbor search in that space returns candidates.
Note: This often becomes metric learning, where the preservation criterion is label- or relevance-driven (contrastive/triplet losses). Dimensionality reduction is the geometric core: you still learn f: ℝᴰ → ℝᵈ.
If you learn a low-dimensional model of “normal” data, anomalies often reconstruct poorly.
Autoencoder anomaly score:
score(x) = ‖x − g_φ(f_θ(x))‖
Caveat: Powerful autoencoders can sometimes reconstruct anomalies too well. Regularization and capacity control matter.
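A toy sketch of the scoring idea, with PCA reconstruction standing in for the autoencoder (linear f and g, same score formula; the data and the outlier are fabricated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# "Normal" data lies near a 1D line in 3D.
t = rng.normal(size=(500, 1))
X = t @ np.array([[1.0, 2.0, -1.0]]) + 0.05 * rng.normal(size=(500, 3))

mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
W = Vt[:1].T                          # 1D model of "normal" structure

def score(x):
    z = (x - mu) @ W                  # encode
    x_hat = mu + z @ W.T              # decode
    return np.linalg.norm(x - x_hat)  # reconstruction residual

inlier = X[0]                                    # lies on the learned structure
outlier = mu + 3.0 * np.array([2.0, -1.0, 0.0])  # orthogonal to it
```

Points off the learned subspace reconstruct poorly and therefore score high.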
A guiding intuition: real-world data often lies near a manifold of much smaller intrinsic dimension than D.
But “manifold” is a hypothesis, not a guarantee. Dimensionality reduction is how you test whether a low-dimensional structure exists and is useful.
Validation depends on your purpose: reconstruction error for compression, downstream task metrics for learned features, and neighborhood-preservation or stability checks for visualization.
A simple sanity checklist:
1) Standardize or whiten features when appropriate.
2) Fit on train set only; apply transform to validation/test.
3) Don’t select d using test-set performance.
4) For t-SNE: avoid drawing conclusions from global geometry.
Dimensionality reduction is not a one-time trick; it is a model choice that should be evaluated like any other.
You have centered data points {x̃ᵢ} in ℝᴰ and choose an orthonormal projection matrix W ∈ ℝᴰ×ᵈ (so WᵀW = I). Show that maximizing captured variance ∑ᵢ ‖Wᵀx̃ᵢ‖² is equivalent to minimizing reconstruction error ∑ᵢ ‖x̃ᵢ − WWᵀx̃ᵢ‖².
Define the projection and reconstruction:
zᵢ = Wᵀ x̃ᵢ
x̂̃ᵢ = W zᵢ = WWᵀ x̃ᵢ
Write reconstruction error for one point:
‖x̃ᵢ − x̂̃ᵢ‖² = ‖x̃ᵢ − WWᵀ x̃ᵢ‖²
Use orthogonal decomposition: since WWᵀ is the projector onto the span of columns of W,
x̃ᵢ = WWᵀx̃ᵢ + (I − WWᵀ)x̃ᵢ
The two components are orthogonal.
Apply Pythagoras:
‖x̃ᵢ‖² = ‖WWᵀx̃ᵢ‖² + ‖( I − WWᵀ )x̃ᵢ‖²
Notice the reconstruction residual is exactly the orthogonal component:
‖x̃ᵢ − WWᵀx̃ᵢ‖² = ‖( I − WWᵀ )x̃ᵢ‖²
So rearrange:
‖x̃ᵢ − WWᵀx̃ᵢ‖² = ‖x̃ᵢ‖² − ‖WWᵀx̃ᵢ‖²
Sum over i:
∑ᵢ ‖x̃ᵢ − WWᵀx̃ᵢ‖² = ∑ᵢ ‖x̃ᵢ‖² − ∑ᵢ ‖WWᵀx̃ᵢ‖²
Use WᵀW = I to simplify the retained term:
‖WWᵀx̃ᵢ‖² = ‖Wᵀx̃ᵢ‖²
(because projecting then measuring norm equals norm of coordinates in an orthonormal basis).
Final equivalence:
∑ᵢ ‖x̃ᵢ − WWᵀx̃ᵢ‖² = constant − ∑ᵢ ‖Wᵀx̃ᵢ‖²
So minimizing reconstruction error is equivalent to maximizing captured variance.
Insight: PCA’s “maximize variance” and “minimize squared reconstruction error” are the same problem because orthogonal projection splits each point into a kept part plus an orthogonal residual, and squared norms add cleanly.
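A quick numeric check of the derivation; note the identity holds for any orthonormal W (here one drawn from a QR factorization), not only the PCA-optimal basis:

```python
import numpy as np

rng = np.random.default_rng(0)
Xc = rng.normal(size=(50, 6))
Xc -= Xc.mean(axis=0)                         # centered data x̃

# Any orthonormal W works; get one from a QR factorization.
W, _ = np.linalg.qr(rng.normal(size=(6, 3)))

recon_err = np.sum((Xc - Xc @ W @ W.T) ** 2)  # ∑ᵢ ‖x̃ᵢ − WWᵀx̃ᵢ‖²
variance = np.sum((Xc @ W) ** 2)              # ∑ᵢ ‖Wᵀx̃ᵢ‖²
total = np.sum(Xc ** 2)                       # the constant term

# Pythagoras, summed over points: total = variance + recon_err.
assert np.allclose(W.T @ W, np.eye(3))
assert np.isclose(total, variance + recon_err)
```

PCA is then just the choice of W that pushes `variance` as high as possible.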
You have points xᵢ in ℝᴰ and want a 2D embedding zᵢ in ℝ² for visualization. Explain how t-SNE turns distances into probabilities and why the KL(P ‖ Q) objective emphasizes preserving local neighbors.
Start from the design goal: keep local neighborhoods.
Instead of trying to match all distances, define “neighborliness” as a probability distribution.
Define conditional probabilities in high-D:
pⱼ|ᵢ = exp(−‖xᵢ − xⱼ‖² / (2σᵢ²)) / ∑_{k≠i} exp(−‖xᵢ − x_k‖² / (2σᵢ²))
Interpretation: points closer to i get higher probability mass.
Choose σᵢ to match a target perplexity.
Perplexity is 2^{H(P_i)} where H is Shannon entropy; practically, it sets an effective neighbor count so each point adapts to local density.
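The choice of σᵢ can be sketched as a bisection in log-space (the shift by the minimum squared distance is a standard numerical-stability trick; the function name and tolerances are ours):

```python
import numpy as np

def sigma_for_perplexity(d2_i, target, iters=50):
    """Find sigma_i so the conditional p_{j|i} has perplexity ≈ target.

    d2_i: squared distances from point i to every other point.
    Perplexity 2^H grows monotonically with sigma, so bisection works.
    """
    lo, hi = 1e-10, 1e10
    for _ in range(iters):
        sigma = np.sqrt(lo * hi)              # geometric midpoint
        # Shift by the min distance for stability (the shift cancels out).
        p = np.exp(-(d2_i - d2_i.min()) / (2.0 * sigma ** 2))
        p /= p.sum()
        H = -np.sum(p * np.log2(p + 1e-12))   # entropy in bits
        if 2.0 ** H > target:
            hi = sigma                        # too many effective neighbors
        else:
            lo = sigma                        # too few
    return sigma

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
d2 = np.sum((X[0] - X[1:]) ** 2, axis=1)
s = sigma_for_perplexity(d2, target=10.0)
```

Each point gets its own σᵢ this way, so dense and sparse regions end up with comparable effective neighbor counts.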
Symmetrize to get joint probabilities:
pᵢⱼ = (pⱼ|ᵢ + pᵢ|ⱼ) / (2n)
Now P = {pᵢⱼ} describes neighbor affinities in high-D.
Define low-D affinities with a heavy tail:
qᵢⱼ = (1 + ‖zᵢ − zⱼ‖²)⁻¹ / ∑_{a≠b} (1 + ‖z_a − z_b‖²)⁻¹
The Student-t tail prevents distant points from all collapsing together (it combats the “crowding problem”).
Optimize with KL divergence:
KL(P ‖ Q) = ∑ᵢ ∑ⱼ pᵢⱼ log(pᵢⱼ / qᵢⱼ)
Only pairs with large pᵢⱼ (true neighbors) contribute strongly to the loss.
Reason about asymmetry:
Because it is KL(P ‖ Q), if pᵢⱼ is large but qᵢⱼ is small, the penalty is severe.
If pᵢⱼ is tiny (non-neighbors) but qᵢⱼ is somewhat large, the penalty is mild.
So the optimization prioritizes: “don’t lose real neighbors,” rather than “perfectly separate all non-neighbors.”
Conclusion:
t-SNE is engineered to make local structure visually salient (clusters/neighborhoods) even if global distances become distorted.
Insight: t-SNE’s choice of KL(P ‖ Q) (not KL(Q ‖ P)) is deliberate: it heavily punishes missing true neighbors, which is exactly what you want for cluster visualization—but it also explains why global geometry is not reliable.
You want f_θ: ℝᴰ → ℝᵈ with d ≪ D to compress vectors x while keeping enough information for downstream tasks. You train an autoencoder with squared reconstruction loss. Explain what the bottleneck forces the model to do, and why a denoising variant often yields better embeddings.
Define the model:
Encoder: z = f_θ(x) ∈ ℝᵈ
Decoder: x̂ = g_φ(z) ∈ ℝᴰ
Train with reconstruction loss:
L(θ, φ) = ∑ᵢ ‖xᵢ − g_φ(f_θ(xᵢ))‖²
Interpret the bottleneck:
Because d ≪ D, z cannot store arbitrary details of x.
The network must learn a compressed code capturing the most predictable/shared factors across the dataset.
Identify a failure mode:
If the network has too much capacity (especially the decoder), it may learn a representation that reconstructs well but is not smooth or meaningful for similarity (e.g., it may encode idiosyncratic details in a brittle way).
Apply denoising:
Sample a corruption process C (Gaussian noise, masking, dropout) and form x̃ = C(x).
Train:
L = ∑ᵢ ‖xᵢ − g_φ(f_θ(x̃ᵢ))‖²
Explain why denoising helps:
To reconstruct clean x from x̃, the encoder must capture stable structure and ignore noise.
This tends to produce embeddings z that align better with semantic factors and are more robust for downstream tasks.
Insight: Autoencoders reduce dimension by forcing information through a bottleneck. Regularization (like denoising) changes what gets preserved: instead of memorizing details, the embedding must capture robust, repeatable structure.
Dimensionality reduction learns an embedding z = f(x) with f: ℝᴰ → ℝᵈ and d ≪ D; the entire problem is defined by what you choose to preserve.
Preservation criteria differ: PCA preserves global variance (and minimizes squared reconstruction error), while t-SNE prioritizes local neighborhood structure for visualization.
Mapping mechanisms matter operationally: PCA and autoencoders are parametric (easy out-of-sample), while classic t-SNE is nonparametric (great plots, awkward deployment).
Neighborhood-preserving embeddings can distort global geometry; avoid interpreting inter-cluster distances in t-SNE plots as meaningful metrics.
Autoencoders preserve reconstructability, but without regularization they can learn brittle or unhelpful latent spaces; denoising/contractive/VAE-style regularization often improves usefulness.
A strong workflow is to start with simple baselines (standardization + PCA), then escalate to nonlinear methods only if the objective demands it.
Validation must match the purpose: reconstruction error for compression, downstream metrics for features, and neighborhood metrics or stability checks for visualization.
Over-interpreting t-SNE: treating distances between clusters as meaningful, or assuming a 2D layout reflects true global geometry.
Fitting the dimensionality reducer on the full dataset (train + test), which leaks information; always fit on training data and transform held-out data.
Skipping scaling/centering when using PCA on features with incompatible units, causing components to reflect measurement scale rather than structure.
Using an autoencoder with excessive capacity and no regularization, then assuming the latent space is semantically meaningful just because reconstruction loss is low.
You need a 2D plot to inspect whether your labeled dataset has separable classes. Which method would you try first: PCA or t-SNE? Explain what each will preserve and what could mislead you in the visualization.
Hint: Think: global variance vs local neighborhoods; and what you want to learn from a plot.
Start with PCA as a baseline because it is fast, deterministic, and preserves global variance (large-scale directions). If classes separate in the top PCs, that’s strong evidence of linear separability in the original space.
Then try t-SNE if PCA does not show separation but you suspect nonlinear/local structure. t-SNE preserves local neighborhoods, often making clusters visible even when PCA looks mixed.
Misleading aspects: PCA can hide nonlinear separability; t-SNE can invent visually separated clusters and its inter-cluster distances (and relative cluster sizes) are not reliable measures of true global separation.
Suppose you trained a classic (nonparametric) t-SNE embedding on n points and now receive a new point x_new. Why is it nontrivial to compute z_new? Name two practical strategies to handle this situation.
Hint: Ask: does t-SNE learn f(x) or only {zᵢ}? What does “out-of-sample” mean here?
Classic t-SNE optimizes the coordinates {zᵢ} directly; it does not learn an explicit mapping f: ℝᴰ → ℝᵈ. Therefore, there is no built-in way to embed x_new without changing the optimization problem.
Two practical strategies:
1) Rerun (or continue) t-SNE including x_new and all prior points (expensive; may change existing positions).
2) Learn an approximate parametric map after the fact, e.g., train a regression model or neural network to predict z from x using the existing pairs (xᵢ, zᵢ), or use parametric t-SNE variants that learn an explicit encoder.
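Strategy 2 can be sketched with an ordinary least-squares map from inputs to stored coordinates (a linear stand-in for the regression model or neural network mentioned above; here the stored Z is fabricated to be linear in X, so the fit is exact):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # original training points
Z = X @ rng.normal(size=(10, 2))      # stand-in for stored t-SNE coordinates

# Fit M minimizing ‖X M − Z‖²; a neural net is the usual nonlinear upgrade.
M, *_ = np.linalg.lstsq(X, Z, rcond=None)

x_new = rng.normal(size=(1, 10))
z_new = x_new @ M                     # approximate out-of-sample embedding
```

For real t-SNE coordinates the relationship is nonlinear, so the linear fit is only a rough approximation; the pattern (fit a map on existing (xᵢ, zᵢ) pairs) is the same.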
You want a nonlinear embedding z for downstream classification. You train an autoencoder with very low reconstruction error, but a classifier trained on z performs poorly compared to training on the original features. Give two reasons this can happen and one modification to your training that could improve z.
Hint: Reconstruction ≠ discriminative usefulness. Think about what information the autoencoder is incentivized to keep and how it might ignore label-relevant structure.
Two reasons:
1) The autoencoder optimizes reconstruction, not class separation. It may dedicate capacity to reconstructing high-variance nuisance factors (background, lighting, frequent but label-irrelevant details) and compress away subtle label-relevant cues.
2) With high-capacity encoder/decoder, the model can reconstruct well using a latent code that is not smooth or not aligned with semantic similarity; good reconstruction can coexist with a latent space where classes are entangled.
One modification:
Use a regularized or task-informed objective, e.g. a denoising autoencoder (train to reconstruct clean x from corrupted x̃) to encourage robust features, or add a supervised/contrastive term so that embeddings of same-class points are closer while different-class points are farther. Either changes the preservation criterion from “just reconstruct” toward “reconstruct robustly” or “preserve label structure.”
Prerequisites you can lean on:
Natural next nodes this unlocks conceptually: