Reinforcement Learning from Human Feedback. Reward modeling, PPO.
Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.
RLHF is the engineering bridge between “a powerful language model” and “a model that reliably behaves the way humans want.” It does this by (1) learning a reward function from human preference comparisons, then (2) fine-tuning the policy to maximize that learned reward while staying close to a trusted reference policy.
RLHF (Reinforcement Learning from Human Feedback) typically has two big stages: (1) train a reward model r_ϕ from human preference data over model outputs, and (2) optimize a policy π to maximize expected reward r_ϕ while applying a KL penalty to keep π close to a reference policy π_ref. In practice, the policy step is often done with PPO on a token-level sequence model, using r_ϕ as the terminal reward (plus optional shaping) and adding −β·KL(π‖π_ref) for stability and alignment.
RLHF (Reinforcement Learning from Human Feedback) is a training recipe for turning a pretrained generative model (the “policy”) into one that better matches human preferences.
Why this exists: supervised fine-tuning (SFT) teaches a model to imitate demonstrations. But in many real tasks—helpfulness, harmlessness, style, honesty, instruction-following—there isn’t a single “correct” output. Instead, there are outputs humans prefer over others. Preferences are comparative and subjective, and they can be inconsistent across people and contexts.
RLHF treats “what humans like” as a reward signal. The catch is that humans don’t provide a numeric reward for every output; they can more reliably answer comparative questions such as “which of these two responses is better?”
So RLHF usually proceeds in two phases:
1) Reward modeling: learn a scalar reward function r_ϕ that predicts human preference.
2) Policy optimization: fine-tune the policy π_θ to maximize expected reward under r_ϕ.
A third ingredient is crucial in large language models: keep the optimized policy close to a reference policy π_ref (typically the pretrained or SFT model). If you optimize purely for the learned reward, the policy can drift out of distribution, exploit weaknesses in r_ϕ (reward hacking), or collapse into repetitive high-reward patterns. The KL term is the stabilizer.
A useful mental picture: the reward model is a learned judge that scores outputs, and the KL term is a leash that keeps the policy close to the reference model while it chases those scores.
Key symbols you’ll see throughout: π_θ is the policy being trained, π_ref is the frozen reference policy (usually the SFT model), r_ϕ is the learned reward model, and β is the KL-penalty coefficient.
Although RLHF is often described with “reinforcement learning,” in language modeling it’s sequence-level RL: you sample tokens step-by-step, but the reward might be computed at the end of the sequence from r_ϕ.
RLHF is not magic. It is a pragmatic approach that works when: human preferences are reasonably consistent and labelable, the reward model generalizes to the outputs the policy actually produces, and the KL constraint keeps optimization inside the region where r_ϕ is trustworthy.
When those fail, RLHF can produce confident misalignment: behavior that looks good to r_ϕ but not to humans.
Why reward modeling first: humans can’t score every output on a consistent numeric scale. But pairwise comparisons are easier, faster, and often more reliable. Reward modeling turns those comparisons into a scalar function you can optimize.
A typical dataset contains tuples like (x, y⁺, y⁻): a prompt, the response labelers preferred, and the response they rejected.
You can think of this as teaching a model to assign higher reward to preferred responses.
A common approach assumes the probability that y₁ is preferred over y₂ is a logistic function of the reward difference:
P(y₁ ≻ y₂ | x) = σ(r_ϕ(x, y₁) − r_ϕ(x, y₂))
where σ(t) = 1 / (1 + e^(−t)).
This has a nice interpretation: only relative differences matter. If you add a constant c to all rewards, preferences don’t change.
Given a labeled pair (x, y⁺, y⁻), maximize log-likelihood:
ℓ(ϕ) = log σ(r_ϕ(x, y⁺) − r_ϕ(x, y⁻))
Equivalently, minimize the negative log-likelihood:
L_RM(ϕ) = − E[(x,y⁺,y⁻)] [ log σ(r_ϕ(x, y⁺) − r_ϕ(x, y⁻)) ]
Let Δ = r_ϕ(x, y⁺) − r_ϕ(x, y⁻). Then:
L_RM = −log σ(Δ)
and the gradient pushes Δ upward.
In LLM RLHF, r_ϕ is usually a copy of the base transformer with an added scalar “reward head” on top of the final hidden state (often at the end-of-sequence token). Conceptually:
r_ϕ(x, y) = wᵀ h_T(x, y)
where h_T is the final hidden state at the last token and w is the learned reward-head weight vector. Vectors are bold: h, w.
Because only differences matter, reward values are only defined up to an additive constant. In practice: rewards are often normalized (e.g., shifted to zero mean and scaled to roughly unit variance over a batch) or anchored so that reference responses score near zero.
This isn’t just cosmetic: PPO stability depends heavily on reward scale.
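A minimal sketch of one common normalization scheme (batch z-scoring of reward-model outputs; the helper name and the sample values are hypothetical):

```python
def normalize(rewards, eps=1e-8):
    """Shift to zero mean and scale to unit variance using batch statistics."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]

raw = [2.0, 4.0, 6.0]   # hypothetical reward-model scores for one batch
print(normalize(raw))    # zero-mean, unit-variance versions of the same scores
```

Because only reward differences matter for preferences, this rescaling changes nothing about the learned ordering, but it keeps the magnitudes PPO sees in a predictable range.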
Reward modeling is supervised learning deployed under distribution shift: r_ϕ is trained on samples from one policy, then queried on samples from an ever-changing optimized policy.
This mismatch creates “reward hacking”: the policy finds outputs that exploit blind spots in r_ϕ.
Common failure modes: rewarding longer or more elaborate responses regardless of quality, rewarding confident or sycophantic phrasing, and rewarding surface formatting (lists, headers) that labelers happened to prefer.
This is why the next mechanic (KL to π_ref) is not optional: it’s a containment strategy.
Start with the loss for one example:
L = −log σ(r⁺ − r⁻)
Use σ(t) = 1/(1+e^(−t)):
L = −log( 1/(1+e^(−(r⁺−r⁻))) )
= log(1 + e^(−(r⁺−r⁻)))
= softplus(−(r⁺−r⁻))
So it behaves like a smooth hinge: if r⁺ is already much larger than r⁻, the loss is small; otherwise it pushes them apart.
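The softplus identity above can be checked numerically; a small sketch (the function names are illustrative, not from a particular library):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def rm_pair_loss(r_pos, r_neg):
    """Pairwise reward-model loss: -log sigma(r+ - r-) == softplus(-(r+ - r-))."""
    delta = r_pos - r_neg
    return math.log1p(math.exp(-delta))  # log(1 + e^(-delta))

# Large correct margin -> tiny loss; tie -> log 2; wrong ordering -> ~linear growth.
print(rm_pair_loss(5.0, 0.0))   # small
print(rm_pair_loss(0.0, 0.0))   # log 2
print(rm_pair_loss(0.0, 5.0))   # large
```

The three calls show the “smooth hinge” behavior: the loss flattens once the preferred response is confidently ranked higher, and grows roughly linearly in the margin when the ordering is wrong.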
Human preference datasets are typically built by: sampling prompts from a realistic distribution, generating several candidate responses per prompt (often from the SFT model), and asking human labelers to compare pairs or rank the candidates.
Ranking can be reduced to pairwise comparisons (all pairs, or tournament-style). Pairwise is easiest to train with and scales well.
Why RL at all: once you have r_ϕ, you want to change the policy π_θ so that it produces higher-reward outputs. This is not just supervised learning because the “label” depends on what the policy generates.
But vanilla policy gradients are high variance and can take destabilizing, oversized steps, especially with giant language models and imperfect rewards. PPO (Proximal Policy Optimization) is widely used because it constrains updates to be conservative.
A common RLHF objective is a KL-regularized expected reward:
J(θ) = E_{x∼D, y∼π_θ(·|x)} [ r_ϕ(x, y) − β · KL(π_θ(·|x) ‖ π_ref(·|x)) ]
Interpretation: the first term pushes the policy toward outputs the reward model scores highly; the second term penalizes drifting away from π_ref.
This can be implemented either as a soft penalty (subtract β·KL from the reward, with β fixed or adapted online) or as a hard constraint on the allowed KL.
In constrained form:
maximize E[r_ϕ]
subject to E[KL(π_θ‖π_ref)] ≤ δ
The penalty form is the Lagrangian relaxation with β as a Lagrange multiplier. Many systems adapt β online to hit a target KL.
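One simple online rule for adapting β (modeled on the PPO paper’s adaptive-KL variant; the exact thresholds and multipliers here are illustrative choices, not prescribed by the text):

```python
def update_beta(beta, observed_kl, target_kl):
    """Raise beta when KL overshoots the target, lower it when KL undershoots."""
    if observed_kl > 1.5 * target_kl:
        beta *= 2.0   # policy drifting too fast from pi_ref: penalize harder
    elif observed_kl < target_kl / 1.5:
        beta /= 2.0   # penalty stronger than needed: relax it
    return beta

beta = 0.1
print(update_beta(beta, observed_kl=0.9, target_kl=0.5))  # overshoot: beta doubles
print(update_beta(beta, observed_kl=0.1, target_kl=0.5))  # undershoot: beta halves
```

The doubling/halving with a dead band around the target keeps β from oscillating on small measurement noise while still reacting quickly to real drift.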
For an autoregressive policy:
π_θ(y|x) = ∏_{t=1}^T π_θ(y_t | x, y_{<t})
Then the log-prob decomposes:
log π_θ(y|x) = ∑_{t=1}^T log π_θ(y_t | x, y_{<t})
Token-level KL between two autoregressive policies over a sampled trajectory y is often estimated by:
KL(π_θ‖π_ref) ≈ ∑_{t=1}^T ( log π_θ(y_t|s_t) − log π_ref(y_t|s_t) )
where s_t = (x, y_{<t}). This is an on-policy sample estimate, not an exact KL over all y, but it’s practical.
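The sample estimate can be computed directly from the per-token log-probs of the sampled sequence; a sketch with hypothetical numbers:

```python
import math

# Hypothetical per-token log-probs of the sampled tokens y_t under each policy.
logp_theta = [math.log(0.5), math.log(0.2), math.log(0.9)]
logp_ref   = [math.log(0.4), math.log(0.3), math.log(0.8)]

# Single-trajectory estimate of KL(pi_theta || pi_ref): sum of log-ratio terms.
kl_est = sum(lt - lr for lt, lr in zip(logp_theta, logp_ref))
print(kl_est)
```

Note that this particular sample comes out negative: individual trajectory estimates can be negative even though the true KL is nonnegative, which is exactly the “sample estimate, not exact KL” caveat above.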
Often you define a per-trajectory shaped reward:
R(x, y) = r_ϕ(x, y) − β · ∑_{t=1}^T ( log π_θ(y_t|s_t) − log π_ref(y_t|s_t) )
Then you treat R as the return for policy gradient.
Notice something subtle: the penalty contains log π_θ, which depends on θ. This means you must be careful about what is treated as “reward” vs what is part of the optimization objective. PPO implementations handle this by incorporating the KL term explicitly or by adding it as an extra loss.
Vanilla policy gradient would update θ proportional to:
∇_θ E[ R ] = E[ ∇_θ log π_θ(y|x) · (R − b(x)) ]
where b(x) is a baseline (value function) to reduce variance.
But if R is large or noisy, the update can be too big, changing π drastically, which:
PPO addresses this with a clipped surrogate objective.
Let r_t(θ) = π_θ(y_t|s_t) / π_old(y_t|s_t) be the probability ratio between the new and old policies for the sampled token, and let Â_t be an advantage estimate.
Then PPO maximizes:
L_PPO(θ) = E_t [ min( r_t(θ) · Â_t , clip(r_t(θ), 1−ε, 1+ε) · Â_t ) ]
Clipping removes any incentive for r_t to move outside [1−ε, 1+ε]. ε is typically ~0.1–0.2.
In language modeling, “actions” are tokens y_t, and trajectories are sequences.
A standard actor-critic setup learns a value function V_ψ(s_t). You compute:
Â_t = G_t − V_ψ(s_t)
where G_t is a return estimate.
If the reward is terminal only (the reward model is evaluated once, at the end of the sequence), then a simple return is:
G_t = r_ϕ(x, y)
for all t along the trajectory (or discounted versions).
In practice, you often include per-token KL penalties, making rewards dense:
r_t = −β ( log π_θ(y_t|s_t) − log π_ref(y_t|s_t) )
and at the final step, r_T += r_ϕ(x, y).
Then you can compute G_t via discounted sums:
G_t = ∑_{k=t}^T γ^(k−t) r_k
Often γ is near 1; sometimes γ = 1 is used for episodic tasks.
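The discounted sum above is usually computed backwards in one pass; a sketch combining the per-token KL penalties with a terminal reward-model score (all values hypothetical):

```python
def returns(per_token_rewards, gamma=1.0):
    """Discounted return G_t = sum_{k>=t} gamma^(k-t) r_k, computed backwards."""
    G = 0.0
    out = []
    for r in reversed(per_token_rewards):
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))

# Hypothetical 3-token episode: per-token KL penalties, RM score added at the end.
beta, rm_score = 0.1, 2.0
kl_per_token = [0.3, -0.1, 0.2]        # log pi_theta - log pi_ref at each step
r = [-beta * k for k in kl_per_token]
r[-1] += rm_score                       # terminal r_T += r_phi(x, y)
print(returns(r, gamma=1.0))            # ~ [1.96, 1.99, 1.98]
```

With γ = 1 every token’s return is dominated by the terminal reward-model score, which is why the value function V_ψ is needed to tell tokens apart.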
A typical implementation minimizes a weighted sum:
L_total = L_actor + c_v · L_value + c_e · L_entropy + L_KL(optional)
Where: L_actor is the PPO clipped surrogate loss, L_value is the regression loss for the critic V_ψ, L_entropy is an entropy bonus that discourages premature collapse, and L_KL is an optional explicit penalty on drift from π_ref.
You might ask: “Isn’t PPO already conservative?” It is, relative to π_old. But RLHF needs conservatism relative to a trusted reference distribution π_ref for two reasons:
1) Reward model validity: r_ϕ is trained on samples near π_ref/SFT. Staying close reduces out-of-distribution exploitation.
2) Capability retention: π_ref encodes broad language competence. Without KL, optimizing narrow preferences can degrade general performance.
A helpful lens is: KL is a regularizer on the function space of policies, not just step size.
Consider maximizing:
E_{y∼π} [ r_ϕ(x,y) ] − β KL(π‖π_ref)
Expand KL:
KL(π‖π_ref) = E_{y∼π} [ log π(y|x) − log π_ref(y|x) ]
So the objective becomes:
J = E_{y∼π} [ r_ϕ(x,y) ] − β E_{y∼π}[ log π(y|x) − log π_ref(y|x) ]
= E_{y∼π} [ r_ϕ(x,y) − β log π(y|x) + β log π_ref(y|x) ]
This shows two things: the −β log π(y|x) term acts like an entropy bonus that discourages the policy from concentrating all its mass, and the +β log π_ref(y|x) term acts like a prior that rewards outputs the reference model already finds plausible.
This is one reason RLHF often behaves like: “Prefer what humans like, but among those, choose something plausible under the base model.”
This section connects the mechanics into an end-to-end pipeline and highlights the engineering decisions that matter at scale.
A common three-stage setup is:
1) Pretrain a language model on next-token prediction (not RLHF itself).
2) Supervised fine-tuning (SFT) on instruction-following demonstrations.
3) RLHF: train a reward model r_ϕ on human preference data, then optimize the policy with PPO against r_ϕ under a KL penalty to π_ref.
π_ref is often the SFT model (or a snapshot of the current policy before RL). The choice affects how conservative you are.
| Decision | Options | Why it matters | Typical choice |
|---|---|---|---|
| Preference labels | pairwise, ranking, scalar ratings | label noise + ease for humans | pairwise or ranking → pairwise |
| Reward model | separate model, shared backbone, ensemble | generalization + reward hacking | separate RM, sometimes ensemble |
| Policy optimizer | PPO, A2C, DPO/IPO-style direct methods | stability + complexity | PPO for classic RLHF |
| KL control | fixed β, adaptive β, hard constraint | prevents drift | adaptive β targeting KL |
| Reward shaping | terminal only, +per-token KL, +length penalties | variance + behavior | terminal RM + token KL |
Each iteration: sample a batch of prompts, generate responses from the current policy, score them with r_ϕ, compute per-token KL penalties against π_ref, estimate advantages with the value function, and take a PPO update step.
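The data flow of one iteration can be sketched as follows; every helper here is a hypothetical stub standing in for real model calls, shown only to fix the shape of the loop:

```python
# All helpers are stubs: real systems replace these with model inference.
def sample_prompts():            return ["x1", "x2"]
def generate(policy, x):         return "y for " + x      # sample y ~ pi_theta
def rm_score(x, y):              return float(len(y))     # stand-in for r_phi
def logprobs(policy, x, y):      return [-1.0, -2.0]      # per-token log-probs
def ppo_update(policy, batch):   return policy            # clipped-surrogate step

def rlhf_iteration(policy, ref_policy, beta=0.1):
    batch = []
    for x in sample_prompts():
        y = generate(policy, x)                            # rollout
        kl = [a - b for a, b in zip(logprobs(policy, x, y),
                                    logprobs(ref_policy, x, y))]
        rewards = [-beta * k for k in kl]                  # dense KL penalties
        rewards[-1] += rm_score(x, y)                      # terminal RM reward
        batch.append((x, y, rewards))
    return ppo_update(policy, batch)
```

The point of the sketch is the ordering: rollouts and reward shaping happen before the PPO step, and the reference policy is only ever queried for log-probs, never updated.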
Even if you conceptually think “sequence reward,” implementations are token-based because: log-probabilities and KL decompose into per-token sums, advantages and value targets are assigned per token, and per-token credit assignment is what the PPO machinery operates on.
You typically track: mean reward under r_ϕ, KL to π_ref, policy entropy, value-function loss, and response length.
A healthy run often shows: reward rising steadily, KL staying near its target, entropy declining slowly rather than collapsing, and response lengths remaining stable.
Symptoms: reward under r_ϕ climbs while human-judged quality stalls or degrades; the policy has found outputs that exploit r_ϕ.
Root cause: r_ϕ is a learned proxy; it generalizes imperfectly. The optimization is strong and adversarial.
Mitigations: keep the KL penalty strong enough, use reward-model ensembles, periodically collect fresh preference labels on the current policy’s outputs, and spot-check with human evaluation.
Symptoms: repetitive outputs, generic safe responses, refusal everywhere.
Root cause: the learned reward landscape may have narrow peaks; PPO can push into them.
Mitigations: add or increase the entropy bonus, tighten the KL target, lower the learning rate, and diversify the prompt distribution.
Symptoms: worse factuality, worse reasoning, lower performance on standard NLP benchmarks.
Root cause: RL step optimizes a narrow preference signal; without enough KL or with biased data, the model drifts away from broadly useful behaviors.
Mitigations: strengthen the KL penalty, mix in supervised or pretraining loss during the RL phase, and monitor standard benchmarks throughout training.
Symptoms: confident wrong answers; refusal patterns.
Root cause: labelers prefer confident style, or comparisons don’t penalize confident errors enough.
Mitigations: give labelers guidelines that penalize confident errors, include comparisons designed to expose overconfidence, and evaluate calibration and refusal behavior separately from preference win rate.
RLHF (RM + PPO) is not the only way. There are “direct” methods (e.g., DPO-style) that skip explicit RL and use preference data to directly update π.
Why you still learn RLHF: it makes explicit the objective that direct methods approximate, its reward-model and KL machinery reappear across alignment techniques, and diagnosing failures of any preference-optimization method requires understanding reward hacking and distribution shift.
If r_ϕ is even slightly misspecified, the unconstrained optimum can be far from human intent. KL provides a trust region around π_ref where r_ϕ is more reliable.
Think of it like this: r_ϕ is a map of a city drawn from a small neighborhood. KL keeps you from driving into areas where the map is wrong.
You have a prompt x and two responses y⁺ (preferred) and y⁻ (rejected). The reward model outputs r⁺ = r_ϕ(x,y⁺) and r⁻ = r_ϕ(x,y⁻). Use the standard Bradley–Terry likelihood: P(y⁺ ≻ y⁻|x) = σ(r⁺ − r⁻).
Define Δ = r⁺ − r⁻.
Negative log-likelihood loss:
L = −log σ(Δ).
Rewrite using σ(Δ) = 1/(1+e^(−Δ)):
L = −log(1/(1+e^(−Δ)))
= log(1 + e^(−Δ))
= softplus(−Δ).
Differentiate w.r.t. Δ:
∂L/∂Δ = ∂/∂Δ log(1+e^(−Δ))
= (1/(1+e^(−Δ))) · (−e^(−Δ))
= − 1/(1+e^(Δ))
= −σ(−Δ)
= σ(Δ) − 1.
Chain rule to see direction on r⁺ and r⁻:
∂L/∂r⁺ = ∂L/∂Δ · ∂Δ/∂r⁺ = (σ(Δ) − 1) · 1
∂L/∂r⁻ = ∂L/∂Δ · ∂Δ/∂r⁻ = (σ(Δ) − 1) · (−1) = 1 − σ(Δ).
Interpretation: ∂L/∂r⁺ = σ(Δ) − 1 ≤ 0, so gradient descent increases r⁺; ∂L/∂r⁻ = 1 − σ(Δ) ≥ 0, so it decreases r⁻. Both gradients shrink toward zero as σ(Δ) → 1, i.e., once the model is already confident in the correct ordering.
Insight: The preference loss only cares about relative reward. Training pushes the preferred output’s reward above the rejected output’s reward, with diminishing pressure once the ordering is confidently correct.
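The gradient formula ∂L/∂Δ = σ(Δ) − 1 can be verified against a finite-difference estimate; a small sketch (the value of Δ is arbitrary):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def loss(delta):
    return math.log1p(math.exp(-delta))   # -log sigma(delta)

delta, h = 0.7, 1e-6
numeric = (loss(delta + h) - loss(delta - h)) / (2 * h)   # central difference
analytic = sigmoid(delta) - 1.0
print(numeric, analytic)   # the two should agree to ~5-6 decimal places
```

Agreement between the numeric and analytic values confirms the chain-rule derivation above.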
You want to optimize: J(θ) = E_{y∼π_θ(·|x)}[ r_ϕ(x,y) − β KL(π_θ(·|x)‖π_ref(·|x)) ]. Expand KL to see what signal the policy gradient is getting.
Start with the definition:
KL(π_θ‖π_ref) = E_{y∼π_θ}[ log π_θ(y|x) − log π_ref(y|x) ].
Substitute into J:
J = E_{y∼π_θ}[ r_ϕ(x,y) ] − β E_{y∼π_θ}[ log π_θ(y|x) − log π_ref(y|x) ].
Combine expectations (same sampling distribution π_θ):
J = E_{y∼π_θ}[ r_ϕ(x,y) − β log π_θ(y|x) + β log π_ref(y|x) ].
Interpret each term: r_ϕ(x,y) rewards outputs humans prefer; −β log π_θ(y|x) is an entropy-like bonus penalizing over-concentration; +β log π_ref(y|x) is a prior rewarding outputs the reference model finds plausible.
Autoregressive decomposition:
log π_θ(y|x) = ∑_{t=1}^T log π_θ(y_t|s_t)
log π_ref(y|x) = ∑_{t=1}^T log π_ref(y_t|s_t)
So the KL-related terms naturally become token-level sums.
Insight: The KL term is not just a ‘penalty’; it turns π_ref into a probabilistic prior and adds an entropy-like term. This is why RLHF often improves preference without completely destroying fluency—when β is tuned correctly.
At some token position t with state s_t, the old policy assigns probability 0.02 to token a, and the new policy assigns probability 0.06. Suppose the advantage estimate is Â_t = +4 and PPO ε = 0.2.
Compute the probability ratio:
r_t = π_new(a|s_t) / π_old(a|s_t) = 0.06 / 0.02 = 3.
Unclipped surrogate contribution:
r_t · Â_t = 3 · 4 = 12.
Compute clipped ratio:
clip(r_t, 1−ε, 1+ε) = clip(3, 0.8, 1.2) = 1.2.
Clipped surrogate contribution:
clip(r_t,...) · Â_t = 1.2 · 4 = 4.8.
PPO takes the min for positive advantage:
min(12, 4.8) = 4.8.
So the objective gain for this token is capped.
Insight: Even if the optimizer tries to triple a token probability in one update, PPO’s clipping limits how much that move can improve the objective—reducing destructive, reward-chasing jumps.
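The worked example maps directly onto a per-token helper; a minimal sketch of the clipped surrogate (the function name is illustrative):

```python
def ppo_term(ratio, adv, eps=0.2):
    """Per-token clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * adv, clipped * adv)

print(ppo_term(3.0, 4.0))    # capped at 4.8, as in the worked example
print(ppo_term(0.9, 4.0))    # ratio inside the band: no clipping, 3.6
print(ppo_term(1.3, -3.0))   # negative advantage: min keeps the unclipped -3.9
```

The third call previews the negative-advantage case: when the probability of a bad action increases, the pessimistic min keeps the worse (unclipped) term, so the gradient still pushes that probability back down.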
RLHF = reward modeling from human preferences + policy optimization to maximize the learned reward.
Reward models r_ϕ are typically trained on pairwise comparisons using a logistic loss: −log σ(r⁺ − r⁻).
Policy optimization commonly maximizes E[r_ϕ] while penalizing KL drift from a reference policy π_ref: −β·KL(π‖π_ref).
For autoregressive LMs, log-prob and KL decompose as token sums, making per-token PPO updates practical.
PPO’s clipped objective provides conservative policy updates relative to π_old, improving stability under noisy learned rewards.
KL-to-π_ref is a containment strategy against reward hacking and capability regression by staying in-distribution.
Reward scale and KL coefficient β strongly affect training dynamics; many systems adapt β to target a desired KL.
Human evaluation remains essential: r_ϕ is only a proxy and can be exploited or miscalibrated.
Treating r_ϕ as ground truth: optimizing a learned proxy without strong KL control invites reward hacking.
Ignoring reward/advantage normalization: poor scaling can make PPO unstable or cause collapse.
Confusing KL(π‖π_ref) with KL(π_ref‖π): direction matters; RLHF typically penalizes the policy drifting away from the reference.
Collecting preference data from a distribution that doesn’t match deployment prompts: the reward model generalizes poorly and alignment degrades.
You observe preference data where (x, y₁) is preferred to (x, y₂). Your reward model currently predicts r_ϕ(x,y₁)=1.0 and r_ϕ(x,y₂)=1.5. Compute the probability P(y₁ ≻ y₂|x)=σ(r₁−r₂) and the loss L=−log σ(r₁−r₂).
Hint: Compute Δ = r₁ − r₂ and apply σ(Δ)=1/(1+e^(−Δ)).
Δ = 1.0 − 1.5 = −0.5.
P = σ(−0.5) = 1/(1+e^(0.5)) ≈ 1/(1+1.6487) ≈ 0.3775.
L = −log(0.3775) ≈ 0.9741.
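The arithmetic can be reproduced in a few lines (values from the exercise statement):

```python
import math

r1, r2 = 1.0, 1.5
delta = r1 - r2                     # -0.5
p = 1.0 / (1.0 + math.exp(-delta))  # sigma(-0.5)
L = -math.log(p)                    # -log sigma(-0.5)
print(round(p, 4), round(L, 4))     # 0.3775 0.9741
```
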
Show that maximizing E[r_ϕ(x,y) − β·KL(π_θ(·|x)‖π_ref(·|x))] is equivalent to maximizing E[r_ϕ(x,y) + β log π_ref(y|x) − β log π_θ(y|x)] under y∼π_θ. What conceptual role does β log π_ref play?
Hint: Expand KL(π‖π_ref) as an expectation of log ratios under π.
KL(π_θ‖π_ref) = E_{y∼π_θ}[log π_θ(y|x) − log π_ref(y|x)].
So:
E[r_ϕ − β KL]
= E[r_ϕ] − β E[log π_θ − log π_ref]
= E[r_ϕ − β log π_θ + β log π_ref].
Conceptually, β log π_ref acts like a prior term that biases the optimized policy toward outputs that the reference policy already considers likely (helping preserve fluency/capabilities and reducing out-of-distribution drift).
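The algebraic equivalence can be checked on a tiny categorical example (the distributions and rewards below are made up for illustration):

```python
import math

# Tiny categorical "policies" over three possible outputs y.
pi_theta = [0.5, 0.3, 0.2]
pi_ref   = [0.3, 0.3, 0.4]
r_phi    = [1.0, 0.0, -1.0]   # hypothetical reward per output
beta     = 0.5

kl = sum(p * (math.log(p) - math.log(q)) for p, q in zip(pi_theta, pi_ref))
lhs = sum(p * r for p, r in zip(pi_theta, r_phi)) - beta * kl
rhs = sum(p * (r - beta * math.log(p) + beta * math.log(q))
          for p, r, q in zip(pi_theta, r_phi, pi_ref))
print(lhs, rhs)   # identical, as the expansion predicts
```
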
In a PPO step for one token, suppose π_old(a|s)=0.10, π_new(a|s)=0.13, ε=0.2, and advantage Â=−3. Compute the unclipped term r·Â and the clipped term clip(r,1−ε,1+ε)·Â, then the PPO contribution min(r·Â, clipped·Â). Which term survives the min, and why?
Hint: PPO always takes min(r·Â, clip(r,...)·Â). With Â < 0 this equals Â·max(r, clip(r)), so the clip can only bind when r falls below 1−ε.
r = 0.13/0.10 = 1.3.
Unclipped: r·Â = 1.3·(−3) = −3.9.
Clipped ratio: clip(1.3, 0.8, 1.2) = 1.2.
Clipped: 1.2·(−3) = −3.6.
PPO takes min(−3.9, −3.6) = −3.9: the unclipped term survives.
Here the update increased the probability of an action with negative advantage, which hurts the objective. The pessimistic min deliberately leaves that loss unclipped, so the gradient retains full strength to push the probability back down. Clipping would bind instead if r dropped below 1−ε, which is what prevents the update from over-decreasing the likelihood of the sampled action.