The paradigm of mapping input sequences to output sequences (e.g., translation or summarization), including encoder-decoder architectures and alignment concepts; attention mechanisms are often introduced to improve information flow between encoder and decoder. Familiarity with seq2seq setups clarifies why and how attention is applied.
Self-serve tutorial - low prerequisites, straightforward concepts.
Mini-scenario (we’ll keep using it):
You’re building a tiny English→French translator for a five-word sentence.
Source (English, length S=5): I eat green apples today
Target (French, length T=5): Je mange des pommes vertes
If you translate word-by-word, you’ll get stuck: in French, “green” (vertes) often comes after “apples” (pommes). So when the decoder is producing the last word “vertes”, it must “look back” to the English word “green” at position 3—even though it already produced “pommes” at position 4.
This lesson is about the modeling paradigm that makes that possible: sequence-to-sequence (seq2seq) modeling. We’ll build up from the probability objective P(y₁..y_T | x₁..x_S), then see why plain encoder→decoder has an information bottleneck, and how alignment/attention uses αₜ to let the decoder read from the encoder states H=(h₁,…,h_S) at every output step.
Quick schematic (we’ll refer back to it):
```
x₁   x₂   x₃   x₄   x₅
 ↓    ↓    ↓    ↓    ↓
[h₁] [h₂] [h₃] [h₄] [h₅]   = H (encoder states)
    \    \    |    /    /
      αₜ (softmax over 1..S)
              ↓
         cₜ (context)
              ↓
     decoder state sₜ → yₜ
```
Seq2seq models learn a conditional distribution over output sequences given an input sequence: P(y₁..y_T | x₁..x_S). An encoder maps the source tokens to hidden states H=(h₁,…,h_S). A decoder generates tokens autoregressively, using previously generated tokens and (optionally) a context vector. Attention/alignment computes a per-step weight vector αₜ over encoder positions, forming a context cₜ=∑ᵢ αₜ,ᵢ hᵢ so the decoder can dynamically “read” the right parts of the input while generating each output token.
Many real problems are not “predict a single label” but “produce a whole sequence whose length may differ from the input.” Examples: machine translation, summarization, speech recognition, and image captioning.
The defining feature is variable-length in, variable-length out, and output tokens depend on each other.
In our running example: x = (I, eat, green, apples, today) and y = (Je, mange, des, pommes, vertes).
The goal is to model the conditional probability of the entire output sequence given the input sequence: P(y₁..y_T | x₁..x_S).
We don’t predict the whole sequence at once. We factorize it using the chain rule:
P(y₁..y_T | x) = ∏ₜ₌₁..T P(yₜ | y_<t, x).
Here y_<t means (y₁, …, yₜ₋₁), the tokens generated so far.
This single equation captures the seq2seq contract: score any candidate output one token at a time, conditioned on the input and on everything generated so far.
To implement this, we split responsibilities:
Historically, the encoder and decoder were RNNs (LSTM/GRU). Today, they are often Transformers—but the encoder/decoder roles persist.
A classic mental model: the encoder “reads” the source and compresses it into a representation; the decoder “writes” the output conditioned on that representation.
Early seq2seq used a single vector (often the final encoder hidden state) as a summary of the entire source. This works for very short sentences but degrades on longer ones: the decoder must squeeze everything about x₁..x_S through a fixed-size bottleneck.
Our five-word example is short, but it already hints at a deeper need: when the decoder generates “vertes” (green), it should access information from source position 3. If all information is compressed into a single vector, the decoder has to remember precise token-level details across steps.
This is why alignment/attention matters: it provides a direct path from each output step back to the relevant encoder states.
Attention can feel like “extra machinery.” But it’s easier to appreciate once you’ve seen what the encoder–decoder is trying to do on its own—and where it struggles.
We’ll describe a standard formulation using hidden states and probability distributions. Even if you later use a Transformer, the ideas map cleanly:
Tokens are discrete symbols. Models convert them into continuous vectors.
Let the encoder produce hidden states hᵢ (one per source position):
In an RNN encoder, a typical recurrence is: hᵢ = f_enc(hᵢ₋₁, e(xᵢ)), where e(xᵢ) is the embedding of token xᵢ.
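As a concrete toy sketch of this recurrence, here is a minimal tanh-RNN encoder in NumPy. The sizes, random weights, and the helper name `encode` are illustrative assumptions, not part of the lesson’s model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_h, S = 4, 3, 5   # toy sizes (assumptions)

# Hypothetical parameters of the recurrence hᵢ = tanh(W_h hᵢ₋₁ + W_x e(xᵢ)).
W_h = 0.1 * rng.normal(size=(d_h, d_h))
W_x = 0.1 * rng.normal(size=(d_h, d_emb))
embeddings = rng.normal(size=(S, d_emb))   # stands in for e(x₁)..e(x₅)

def encode(embs):
    """Run the recurrence left to right, keeping one state per position."""
    h = np.zeros(d_h)
    states = []
    for e_xi in embs:
        h = np.tanh(W_h @ h + W_x @ e_xi)
        states.append(h)
    return np.stack(states)   # shape (S, d_h): the encoder states H

H = encode(embeddings)
```

Note that the output is one vector per source position, not a single summary; that distinction is exactly what the bottleneck discussion below turns on.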
Intuition: each hᵢ summarizes token xᵢ in the context of everything read so far, so the encoder keeps one position-specific vector per source token.
Even in a Transformer encoder, you still end up with one vector per token position; you can still call them h₁..h_S.
A classic early approach defines a single vector c summarizing the source, for example the last hidden state: c = h_S.
(There are other choices: pooling over h₁..h_S, or concatenating the last states of a bidirectional RNN. But the core idea is a fixed-size summary.)
The decoder maintains a hidden state sₜ and emits a distribution over the next token:
1) Update decoder state: sₜ = f_dec(sₜ₋₁, e(yₜ₋₁), c).
2) Produce logits and probabilities: zₜ = W_o sₜ + b, then P(yₜ | y_<t, x) = softmax(zₜ).
The softmax is your prerequisite: it turns logits into a probability distribution over the vocabulary.
During training, we usually feed the true previous token rather than the model’s sampled token (teacher forcing). The loss is the negative log-likelihood:
L = −∑ₜ₌₁..T log P(yₜ | y_<t, x),
where yₜ is the ground-truth token at step t.
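A minimal sketch of this teacher-forced loss, assuming a made-up 4-token vocabulary and made-up logits (none of these numbers come from a trained model):

```python
import numpy as np

# Toy vocabulary and ground-truth target indices (assumptions for illustration).
vocab = ["Je", "mange", "pommes", "vertes"]
targets = [0, 1]   # y₁ = Je, y₂ = mange

# Made-up per-step decoder logits; a real model would produce these.
logits = np.array([[2.0, 0.5, 0.1, -1.0],
                   [0.3, 0.8, 0.2, 0.1]])

def step_nll(z, y):
    """Cross-entropy at one step: -log softmax(z)[y], computed stably."""
    z = z - z.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[y]

# Teacher forcing: each step conditions on the *true* previous token,
# so the total loss is simply the sum of per-step terms.
loss = sum(step_nll(z, y) for z, y in zip(logits, targets))
```

The per-step decomposition is what lets you see which output positions the model finds hard.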
At inference time: the decoder feeds its own previous prediction back in, generating step by step until an end-of-sequence token.
Decoding choices:
| Strategy | How it chooses | Pros | Cons |
|---|---|---|---|
| Greedy | argmax over softmax | Fast | Can miss better global sequences |
| Beam search | Keep top-K partial sequences | Better translations often | Slower; can prefer generic outputs |
| Sampling | Sample from distribution | Diverse outputs | Can be unstable without controls |
If c is fixed, then every output token must be generated from the same compressed summary.
Imagine generating “vertes” at the end. The decoder must recover, from that single summary, the fact that source position 3 contained “green” and that it only becomes relevant now, at the final step.
With a single c, the model can learn this sometimes, but it scales poorly with longer sequences and complex reorderings.
Think of the encoder summary vector as trying to be a whole paragraph’s worth of meaning squeezed into one sticky note.
You can write a good sticky note.
But if the decoder could instead flip through the original paragraph whenever needed, it would make fewer mistakes.
That “flipping through” is exactly what attention adds.
When producing yₜ, the decoder doesn’t need the entire source equally.
So instead of a single fixed c, we want a different context vector cₜ at each decoding step.
You are given: the decoder’s state from the previous step, sₜ₋₁, and the encoder states H=(h₁,…,h_S); you want a weight vector αₜ = (αₜ,₁, …, αₜ,S).
Constraints: αₜ,ᵢ ≥ 0 for every i, and ∑ᵢ αₜ,ᵢ = 1, so αₜ is a probability distribution over source positions.
Interpretation: αₜ,ᵢ measures how relevant source position i is when producing output token t.
We first compute an unnormalized score for each source position i.
A common pattern:
eₜ,ᵢ = score(sₜ₋₁, hᵢ),
where score(·,·) compares what the decoder currently needs (sₜ₋₁) with what each encoder state offers (hᵢ).
Typical scoring functions (you’ll see these in literature):
| Name | Score function | Notes |
|---|---|---|
| Dot product | eₜ,ᵢ = sₜ₋₁ᵀ hᵢ | Simple; requires same dimension |
| General | eₜ,ᵢ = sₜ₋₁ᵀ W hᵢ | Learnable linear map |
| Additive (Bahdanau) | eₜ,ᵢ = vᵀ tanh(W₁ sₜ₋₁ + W₂ hᵢ) | Works well with different dims |
You don’t need to memorize them all right now—the pattern is what matters: compare the decoder’s needs to each encoder state.
Convert scores into a distribution over positions:
αₜ,ᵢ = exp(eₜ,ᵢ) / ∑ⱼ₌₁..S exp(eₜ,ⱼ).
This is exactly where your softmax prerequisite shows up.
Numerical stability reminder (important in real implementations): subtract the maximum score m = maxᵢ eₜ,ᵢ before exponentiating. Softmax is unchanged by this shift, but the shifted version avoids overflow in exp.
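A stabilized softmax sketch; subtracting the max is mathematically a no-op but keeps exp from overflowing:

```python
import numpy as np

def softmax(e):
    """Softmax with the max-subtraction trick: softmax(e) == softmax(e - m)
    for any constant m, so we shift by max(e) to avoid overflow in exp()."""
    e = np.asarray(e, dtype=float)
    w = np.exp(e - e.max())
    return w / w.sum()

a = softmax([1.0, 2.0, 3.0])
b = softmax([1001.0, 1002.0, 1003.0])   # naive exp(1003.0) would overflow
```

Both calls return the same distribution, which is the point of the trick.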
Now we compute:
cₜ = ∑ᵢ αₜ,ᵢ hᵢ.
Interpretation: cₜ is a weighted average of the encoder states; positions with large αₜ,ᵢ dominate.
This makes attention feel like a soft pointer into the source.
A common formulation:
sₜ = f_dec(sₜ₋₁, e(yₜ₋₁), cₜ).
Then output distribution:
P(yₜ | y_<t, x) = softmax(W_o sₜ + b).
Sometimes the context is also fed directly into the output layer, e.g. by concatenating [sₜ; cₜ] before the softmax.
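Putting the scoring, softmax, and context steps together, here is a sketch of one attention read using the dot-product score. The helper name `attention_step` is an assumption, and real models typically add learned projections:

```python
import numpy as np

def attention_step(s_prev, H):
    """One cross-attention read with the dot-product score eₜ,ᵢ = sₜ₋₁·hᵢ:
    softmax the scores into weights αₜ, then build cₜ = Σᵢ αₜ,ᵢ hᵢ."""
    e = H @ s_prev                  # (S,) unnormalized scores
    w = np.exp(e - e.max())         # stabilized exponentials
    alpha = w / w.sum()             # attention weights over source positions
    c = alpha @ H                   # context vector
    return alpha, c

# Toy usage: this decoder state "asks for" the first feature dimension.
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
s_prev = np.array([2.0, 0.0])
alpha, c = attention_step(s_prev, H)
```

Positions whose states align with s_prev (here positions 1 and 3) get most of the weight, so cₜ leans toward those states.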
Let’s label source positions:
1:I, 2:eat, 3:green, 4:apples, 5:today
A plausible alignment pattern: when producing “vertes” (step 5), the model should attend mostly to “green” (position 3).
So α₅ might heavily weight position 3.
Below is a rough alignment matrix (rows are target steps t, columns are source positions i). Darker means larger α.
| t / i | 1 (I) | 2 (eat) | 3 (green) | 4 (apples) | 5 (today) |
|---|---|---|---|---|---|
| 1 (Je) | ████ | ░ | ░ | ░ | ░ |
| 2 (mange) | ░ | ████ | ░ | ░ | ░ |
| 3 (des) | ░ | ░ | ░ | ███ | ░ |
| 4 (pommes) | ░ | ░ | ░ | ████ | ░ |
| 5 (vertes) | ░ | ░ | ████ | ░ | ░ |
This is the alignment concept: for each output step, the model forms a distribution over input positions.
Without attention, the only path from input token xᵢ to the output decision at time t is:
encoder computations → single summary c → decoder state sₜ.
With attention: xᵢ → hᵢ → (weighted by αₜ,ᵢ) → cₜ → output decision at step t.
Now each output step has a direct, learnable channel to any encoder state.
It’s tempting to say “αₜ tells you exactly which input word caused the output.” Often it correlates with alignment, but:
So treat attention as: a useful, inspectable alignment signal rather than a guaranteed causal explanation.
The attention we described is often called encoder–decoder attention or cross-attention:
Even if you haven’t learned Transformer math yet, the conceptual mapping is straightforward: the decoder state acts as a query, the encoder states provide keys and values, and αₜ comes from comparing the query against each key.
This is exactly why understanding seq2seq clarifies attention: it tells you what problem attention is solving.
Seq2seq setups vary mainly in:
1) What counts as the “sequence” on the input side
2) What output vocabulary/tokenization looks like
3) How decoding is constrained
Examples:
| Task | Input sequence x | Output sequence y | Special concerns |
|---|---|---|---|
| Translation | tokens | tokens | reordering, morphology |
| Summarization | long tokens | shorter tokens | content selection, hallucination |
| Speech recognition | audio frames | tokens | long inputs, monotonic alignment |
| Image captioning | region features | tokens | attention over image regions |
The encoder can be anything that outputs a sequence of vectors h₁..h_S: an RNN, a Transformer encoder, or a feature extractor over audio frames or image regions.
As long as you have H=(h₁,…,h_S), the decoder can attend over it.
Even with a good model, the way you decode changes behavior:
A typical beam objective may look like:
score(y) = (1 / T^β) ∑ₜ₌₁..T log P(yₜ | y_<t, x),
where β controls the length penalty (β = 0 means no normalization).
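A sketch of such a length-normalized score, assuming a simple power-law penalty T**beta (the exact penalty form varies between systems):

```python
import math

def beam_score(log_probs, beta):
    """Length-normalized hypothesis score: (Σₜ log P) / T**beta.
    beta and the power-law form are illustrative assumptions."""
    return sum(log_probs) / (len(log_probs) ** beta)

# Without normalization, raw log-prob sums favor shorter hypotheses:
short = [math.log(0.5)] * 2   # 2 tokens, each with probability 0.5
long_ = [math.log(0.6)] * 4   # 4 tokens, each with probability 0.6
```

With beta = 1, the longer hypothesis with better per-token probability wins even though its raw log-prob sum is lower.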
Training uses teacher forcing: the decoder sees the true previous token yₜ₋₁.
Inference uses its own previous prediction ŷₜ₋₁.
So the model may drift if it makes an early mistake. This issue is called exposure bias.
Mitigations you may hear about: scheduled sampling, sequence-level training objectives, and minimum-risk training.
You don’t need to solve exposure bias now, but you should recognize it as a recurring theme in seq2seq.
In Transformers: the same read operation is implemented with learned query, key, and value projections, and softmax over query–key similarities gives the attention weights.
So the conceptual flow becomes:
1) Encode source into H=(h₁,…,h_S)
2) For each decoding step t, compute attention distribution αₜ over H
3) Use the resulting context cₜ to predict yₜ
If you understand that, the jump to “multi-head attention” is mostly an engineering/generalization step: do it several times in parallel with different learned projections.
Our translator succeeded not because it memorized the entire input in one vector, but because at each step it can ask: “which source positions matter for this output token?”
When producing “vertes”, it can attend back to “green” even though that occurred earlier and has already influenced other outputs.
That dynamic reading behavior is the essence of seq2seq alignment.
Suppose the encoder produced S=5 hidden states h₁..h₅ (each 2D for simplicity):
h₁ = [1, 0]
h₂ = [0, 1]
h₃ = [1, 1]
h₄ = [2, 0]
h₅ = [0, 2]
At decoder step t=5 (trying to produce “vertes”), assume the alignment scores (unnormalized) are:
e₅ = [e₅,₁..e₅,₅] = [-1, 0, 2, 0, -2].
Compute α₅ via softmax and then compute the context vector c₅ = ∑ᵢ α₅,ᵢ hᵢ.
Compute stabilized softmax.
Let m = max(e₅) = 2.
Compute shifted scores:
e' = e₅ - m = [-3, -2, 0, -2, -4].
Exponentiate the shifted scores:
exp(e') = [exp(-3), exp(-2), exp(0), exp(-2), exp(-4)].
Numerically (approx):
exp(-3)≈0.0498
exp(-2)≈0.1353
exp(0)=1
exp(-4)≈0.0183
So:
exp(e') ≈ [0.0498, 0.1353, 1, 0.1353, 0.0183].
Normalize to get α₅.
Sum Z = 0.0498 + 0.1353 + 1 + 0.1353 + 0.0183 = 1.3387 (approx).
Therefore:
α₅,₁ ≈ 0.0498 / 1.3387 = 0.0372
α₅,₂ ≈ 0.1353 / 1.3387 = 0.1011
α₅,₃ ≈ 1 / 1.3387 = 0.7470
α₅,₄ ≈ 0.1353 / 1.3387 = 0.1011
α₅,₅ ≈ 0.0183 / 1.3387 = 0.0137
Check: they sum to ~1.0001 (rounding error).
Compute context c₅ = ∑ᵢ α₅,ᵢ hᵢ.
Write it out:
c₅ = α₅,₁ h₁ + α₅,₂ h₂ + α₅,₃ h₃ + α₅,₄ h₄ + α₅,₅ h₅
= 0.0372[1,0] + 0.1011[0,1] + 0.7470[1,1] + 0.1011[2,0] + 0.0137[0,2].
Add component-wise.
x-component:
= 0.0372(1) + 0.1011(0) + 0.7470(1) + 0.1011(2) + 0.0137(0)
= 0.0372 + 0 + 0.7470 + 0.2022 + 0
= 0.9864.
y-component:
= 0.0372(0) + 0.1011(1) + 0.7470(1) + 0.1011(0) + 0.0137(2)
= 0 + 0.1011 + 0.7470 + 0 + 0.0274
= 0.8755.
So c₅ ≈ [0.986, 0.876].
Insight: Because e₅,₃ is much larger than the rest, α₅ puts ~75% of its mass on source position 3. The context vector c₅ becomes close to h₃ (with smaller contributions from other positions). This is the “soft pointer” idea in numbers: attention is a learned weighted average over encoder states.
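You can reproduce the worked example numerically; this sketch recomputes α₅ and c₅ with NumPy:

```python
import numpy as np

# Encoder states and alignment scores from the worked example above.
H = np.array([[1, 0], [0, 1], [1, 1], [2, 0], [0, 2]], dtype=float)
e5 = np.array([-1.0, 0.0, 2.0, 0.0, -2.0])

# Stabilized softmax: subtract the max score before exponentiating.
shifted = e5 - e5.max()
alpha5 = np.exp(shifted) / np.exp(shifted).sum()

# Context vector: weighted average of the encoder states.
c5 = alpha5 @ H
```

The weights match the hand computation (≈75% of the mass on position 3), and c5 comes out close to h₃.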
Assume a tiny vocabulary of 4 tokens for the decoder: {Je, mange, pommes, vertes}. For one training example, the target sequence is y = [Je, mange].
Suppose the model outputs these probabilities:
At t=1 (predicting y₁):
P(y₁=Je | x) = 0.70
At t=2 (predicting y₂ with teacher forcing on y₁=Je):
P(y₂=mange | y₁=Je, x) = 0.20
Compute the negative log-likelihood loss for this example (natural log).
Write the seq2seq factorization for this short target:
P(y₁,y₂ | x) = P(y₁|x) · P(y₂|y₁,x).
Plug in the given probabilities:
P(y|x) = 0.70 · 0.20 = 0.14.
Negative log-likelihood (NLL) is:
L = -log P(y|x) = -log(0.14).
Compute using log rules:
-log(0.14) = -(log(14) - log(100))
= log(100) - log(14).
Numerically:
log(100)=4.6052
log(14)=2.6391
So L ≈ 4.6052 - 2.6391 = 1.9661.
Equivalently, sum token-level losses:
L = -log 0.70 + -log 0.20
≈ 0.3567 + 1.6094
= 1.9661.
Insight: Training loss decomposes across time steps: you can see exactly which step is hurting you. Here, the second token is much less probable (0.20), dominating the loss. This per-step view is also how gradients flow back through the decoder and (with attention) into the encoder states that were used to build cₜ.
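The same arithmetic in a few lines of Python, showing that the sequence-level NLL equals the sum of the token-level losses:

```python
import math

p1 = 0.70   # P(y₁ = Je | x)
p2 = 0.20   # P(y₂ = mange | y₁ = Je, x)

sequence_nll = -math.log(p1 * p2)            # loss on the whole sequence
per_step = [-math.log(p1), -math.log(p2)]    # token-level losses
```

Both routes give ≈1.9661, with the second token contributing most of it.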
Consider two different English inputs that share the same last word:
A: [I, eat, green, apples, today]
B: [I, eat, red, apples, today]
Suppose a no-attention encoder summarizes the entire source as c = h_S (the final state). The decoder must produce the correct last French adjective: vertes (green) vs rouges (red).
Explain, at a mechanistic level, why relying only on c makes this harder than using attention over H=(h₁..h₅).
In the no-attention setup, every decision y_t depends on the same fixed context c.
Formally, s_t = f_dec(s_{t-1}, e(y_{t-1}), c).
The difference between inputs A and B occurs at position 3 (green vs red).
But by the time the encoder reaches position 5, the final state h_S must contain the identity of every token, their order, and in particular the color word at position 3—
all in one vector of fixed dimension d_h.
During decoding, when generating the final adjective (vertes/rouges), the model must extract from c the specific attribute that was present at input position 3, potentially many steps earlier in the encoder computation.
With attention, the decoder at the adjective step can compute αₜ that peaks at i=3.
Then c_t = ∑ᵢ αₜ,ᵢ hᵢ is dominated by h₃, which is directly tied to the color token’s representation.
So the decision boundary between “vertes” and “rouges” can depend more directly on h₃ rather than on whatever compressed trace of “green vs red” survived into h_S.
Insight: Attention reduces the need for perfect long-range compression. Instead of hoping the final encoder state retains every detail, the decoder can retrieve the relevant detail from the specific encoder state where it was encoded. This is especially important when the decisive information is not near the end of the source, or when the output needs to revisit earlier source tokens later in decoding.
Seq2seq models learn conditional sequence distributions: P(y₁..y_T | x₁..x_S).
Autoregressive decoding uses the chain rule: P(y|x) = ∏ₜ P(yₜ | y_<t, x).
An encoder produces token-level representations h₁..h_S; a decoder generates outputs step-by-step using a hidden state sₜ.
A fixed-size encoder summary vector creates an information bottleneck, especially for long inputs and reordering.
Attention/alignment computes scores eₜ,ᵢ, softmaxes them into weights αₜ, and forms a context cₜ = ∑ᵢ αₜ,ᵢ hᵢ.
The attention vector αₜ is a distribution over source positions and can be read as a soft alignment for each output step.
Training typically uses teacher forcing with cross-entropy loss; inference uses greedy/beam/sampling and can suffer from exposure bias.
Encoder–decoder attention is conceptually the same idea as cross-attention in Transformers, which is why this node unlocks modern attention mechanisms.
Forgetting the conditional structure and writing P(y) instead of P(y|x) when describing translation/summarization.
Mixing up attention scores (unnormalized) with attention weights (softmax-normalized and summing to 1).
Treating attention weights as guaranteed causal explanations for model outputs; they are often alignment-like but not definitive explanations.
Ignoring the training–inference mismatch: teacher forcing during training does not reflect autoregressive generation at inference, leading to exposure bias.
You have S=3 encoder states h₁,h₂,h₃ (vectors). At some decoder step t, the attention scores are eₜ = [0, 0, 0]. What are the attention weights αₜ? What is cₜ in terms of h₁,h₂,h₃?
Hint: Softmax of equal numbers is uniform. Then cₜ is the average of the vectors.
If eₜ = [0,0,0], then softmax gives αₜ = [1/3, 1/3, 1/3].
So:
cₜ = (1/3)(h₁ + h₂ + h₃).
This is the simple mean of the encoder states.
Show the chain-rule factorization for P(y₁,y₂,y₃ | x₁..x_S). Then write the negative log-likelihood loss for a single training pair (x, y) using teacher forcing.
Hint: Use ∏ over t for the probability and ∑ over t for the loss; each term conditions on y_<t and x.
Factorization:
P(y₁,y₂,y₃ | x) = P(y₁|x) · P(y₂|y₁,x) · P(y₃|y₁,y₂,x).
Teacher-forcing NLL loss for one pair (x,y) of length T=3:
L = −∑ₜ₌₁..₃ log P(yₜ | y_<t, x).
At decoder step t, you have attention weights αₜ = [0.1, 0.2, 0.7] over three encoder states h₁=[1,0], h₂=[0,1], h₃=[2,2]. Compute cₜ.
Hint: Compute cₜ = 0.1h₁ + 0.2h₂ + 0.7h₃ component-wise.
Compute the weighted sum:
cₜ = 0.1[1,0] + 0.2[0,1] + 0.7[2,2]
First component:
= 0.1·1 + 0.2·0 + 0.7·2 = 0.1 + 0 + 1.4 = 1.5
Second component:
= 0.1·0 + 0.2·1 + 0.7·2 = 0 + 0.2 + 1.4 = 1.6
So cₜ = [1.5, 1.6].
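As a quick check, the same weighted sum in NumPy:

```python
import numpy as np

alpha = np.array([0.1, 0.2, 0.7])                   # attention weights
H = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])  # encoder states h₁..h₃
c = alpha @ H                                       # weighted sum = cₜ
```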
Unlocks and next steps:
Related conceptual neighbors you may want in a tech tree: