June 29, 2026 · ai-engineering

When a Free Model Can Replace the Frontier (and How to Prove It)

A coding agent delegates the same small typed subtask a few hundred times an hour: classify this support ticket, tag this feedback, route this request. The frontier model it calls does the job well and bills you for every call. A 4B model running on the same laptop would do it for nothing. The obvious move is to let the frontier handle the task for a while, then hand the job to the local model once the local model is good enough to take over.

The whole plan turns on the phrase “good enough,” and that phrase hides a measurement problem most people skip. In production you have no answer key. The only reference you have for whether the local model is right is the frontier’s own output, and the frontier is sometimes wrong. So “the local model agrees with the frontier 87% of the time” is not “the local model is 87% accurate.” The two numbers can differ by as much as the teacher’s own error rate, and the error can run in the wrong direction.

This post is about closing that gap. The correction is a piece of textbook Bayesian machinery applied to a cost decision, and the interesting part is not that it works but where it works: I ran a pre-registered trial, held real gold labels out as an oracle the system never sees, and had an independent two-auditor panel try to break the result. The correction changed which model got shipped, it survived a regime where the theory said it should fail, and it went correctly silent on a task where the teacher made no mistakes. The whole experiment cost five cents in API calls.

Scoring a student against a fallible teacher measures agreement, not competence. To recover competence you have to model the teacher’s errors and subtract them.


The honest handoff loop: a frontier teacher labels a task, a correction recovers each local tier's true accuracy, and a cost-based descent ships the cheapest model that clears the bar


The 87% that is not what it looks like

Take a real ticket from the trial: “My promo discount disappeared after I reset my password.” The correct route is Billing. The cheap frontier teacher routed it to Account, fixating on the password reset. The 8B local model routed it to Billing, correctly.

Naive scoring treats the teacher’s label as truth, so it records this case as the local model being wrong. It is the opposite. The local model was right and the teacher was wrong. One ticket does not matter; the pattern does. Across a test set, the naive estimate of a local model’s accuracy is biased by precisely the teacher’s error pattern, and the bias does not average out, because a fixed teacher makes the same kinds of mistakes every time.

State it as an estimator and the problem is plain. Let \(m\) be the local model’s label, \(f\) the frontier teacher’s label, and \(y\) the true class you do not get to see. Naive readiness is \(P(m = f)\), agreement with the teacher. What you actually want is \(P(m = y)\), agreement with truth. These are guaranteed to be equal only when the teacher never errs. The moment \(f\) and \(y\) diverge, the naive estimator absorbs the teacher’s errors into the student’s score, and a cost decision built on that score inherits the bias.

The gap has an exact form. Writing the two accuracies side by side,

\[
\hat{A}_{\text{naive}} = P(m = f), \qquad A_{\text{true}} = P(m = y),
\]

their difference splits cleanly into two error channels:

\[
\hat{A}_{\text{naive}} - A_{\text{true}} = \underbrace{P(m = f,\; m \ne y)}_{\text{student copies the teacher's mistake}} \;-\; \underbrace{P(m = y,\; m \ne f)}_{\text{student right, teacher wrong}}.
\]

Naive scoring rewards the first channel and penalizes the second, which is exactly backwards: it pays the student for parroting a wrong teacher and docks it for outperforming one.
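The decomposition is easy to check by counting. The sketch below uses seven made-up tickets (the labels and the class names other than Billing and Account are illustrative, not trial data) and confirms that the naive-minus-true gap equals the copied-mistake channel minus the outperformed-teacher channel.

```python
def agreement_gap(m, f, y):
    """Return (naive accuracy, true accuracy, copied channel, outperformed channel)."""
    n = len(y)
    rows = list(zip(m, f, y))
    naive = sum(mi == fi for mi, fi, yi in rows) / n               # P(m = f)
    true_acc = sum(mi == yi for mi, fi, yi in rows) / n            # P(m = y)
    copied = sum(mi == fi and mi != yi for mi, fi, yi in rows) / n        # P(m = f, m != y)
    outperformed = sum(mi == yi and mi != fi for mi, fi, yi in rows) / n  # P(m = y, m != f)
    return naive, true_acc, copied, outperformed


# Toy labels (hypothetical): y = truth, f = teacher, m = local model.
y = ["Billing", "Billing", "Account", "Shipping", "Account", "Billing", "Technical"]
f = ["Account", "Billing", "Account", "Shipping", "Billing", "Billing", "Account"]
m = ["Billing", "Billing", "Account", "Shipping", "Account", "Account", "Account"]

naive, true_acc, copied, outperformed = agreement_gap(m, f, y)
print(f"naive={naive:.3f} true={true_acc:.3f}")
print(f"gap={naive - true_acc:.3f} channels={copied - outperformed:.3f}")
```

In this toy set the student is right on five of seven tickets but agrees with the teacher on only four, so naive scoring undersells it.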

Recovering the truth the teacher hides

Here is the move. If you know how often the teacher errs and in which directions, you can subtract that error pattern out of the observed agreement and recover what the student would have scored against truth.

The intuition first. You observe the joint distribution of (local label, teacher label) over many items: how often the local model says Billing while the teacher says Account, and so on for every pair. That joint is a blurred convolution of two things, the student’s true error pattern and the teacher’s true error pattern. If you can pin down the teacher’s pattern independently, you can deconvolve the joint and read off the student’s pattern.

The precise version is a latent-class model, the kind Dawid and Skene wrote down in 1979 for exactly this situation, noisy raters and no gold. The unknown is the local model’s confusion matrix, the full \(P(m = a \mid y = j)\) for every predicted class \(a\) and true class \(j\). Under the model’s one assumption, that the two labelers err independently given the true class, the observed joint over (local label, teacher label) factors through the hidden class:

\[
P(m = a,\; f = b) = \sum_j \pi_j \, S_{aj} \, T_{bj}, \qquad \pi_j = P(y = j),\quad S_{aj} = P(m = a \mid y = j),\quad T_{bj} = P(f = b \mid y = j).
\]

You hold the teacher’s matrix \(T\) and the class prior \(\pi\) fixed, let expectation-maximization recover the student’s matrix \(S\) over the probability simplex, and read the recovered accuracy off its diagonal as \(\sum_j \pi_j S_{jj}\). EM returns a stochastic matrix by construction (each conditional distribution \(P(m = \cdot \mid y = j)\) sums to one), so that accuracy is always a real number between zero and one. That same independence assumption is the one the hard task will later break.
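A minimal sketch of that recovery, assuming a plain EM loop in pure Python. The function names, the starting guess, the iteration count, and the three-class toy matrices are all illustrative choices, not the trial's code or data. The toy joint is built from a known student matrix so the recovered accuracy can be compared with the truth.

```python
def recover_student(counts, T, prior, iters=500):
    """EM for the student's confusion matrix S[a][j] = P(m=a | y=j).

    counts[a][b]: how often the student said a while the teacher said b.
    T[b][j] = P(f=b | y=j) and prior[j] = P(y=j) are held fixed.
    """
    k = len(prior)
    # Start from a diagonally dominant guess: mostly right, errors spread evenly.
    S = [[0.7 if a == j else 0.3 / (k - 1) for j in range(k)] for a in range(k)]
    for _ in range(iters):
        expected = [[0.0] * k for _ in range(k)]
        # E-step: split each observed (a, b) cell across the hidden true classes.
        for a in range(k):
            for b in range(k):
                w = [prior[j] * S[a][j] * T[b][j] for j in range(k)]
                z = sum(w)
                if counts[a][b] == 0 or z == 0:
                    continue
                for j in range(k):
                    expected[a][j] += counts[a][b] * w[j] / z
        # M-step: renormalize each true-class column so it sums to one.
        for j in range(k):
            col = sum(expected[a][j] for a in range(k))
            if col > 0:
                for a in range(k):
                    S[a][j] = expected[a][j] / col
    return S


def recovered_accuracy(S, prior):
    """Accuracy read off the diagonal: sum_j prior[j] * S[j][j]."""
    return sum(prior[j] * S[j][j] for j in range(len(prior)))


# Toy check with made-up numbers: build the exact joint from a known S and T.
k = 3
prior = [0.5, 0.3, 0.2]
T = [[0.90, 0.05, 0.05],
     [0.05, 0.90, 0.05],
     [0.05, 0.05, 0.90]]       # T[b][j], each column sums to one
S_true = [[0.80, 0.10, 0.20],
          [0.10, 0.70, 0.20],
          [0.10, 0.20, 0.60]]  # S_true[a][j], each column sums to one
joint = [[sum(prior[j] * S_true[a][j] * T[b][j] for j in range(k))
          for b in range(k)] for a in range(k)]

S_hat = recover_student(joint, T, prior)
naive = sum(joint[c][c] for c in range(k))
print("naive agreement:   ", round(naive, 3))
print("recovered accuracy:", round(recovered_accuracy(S_hat, prior), 3))
print("true accuracy:     ", round(recovered_accuracy(S_true, prior), 3))
```

The toy joint satisfies the independence assumption exactly, which is the favorable case. It says nothing about how the recovery behaves when student and teacher errors are correlated.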

This last property matters more than it sounds, and it is where the first version of this work was wrong. My initial cut reached for the classic reliability correction, Spearman’s disattenuation formula, the one that divides an observed correlation by the geometric mean of two reliabilities. That formula is type-unsound for categorical labels. It can return a “correlation” above one, which is not a correlation at all. An adversarial design review flagged it before a dollar was spent, and a quick simulation confirmed the raw linear inversion produced impossible accuracies, outside [0,1], in 11% of trials. The categorical-correct form is the latent-class deconvolution, and it does not have that failure mode. I will keep calling the correction rho, after the reliability term it generalizes, but the working object is the Dawid-Skene recovery, not the scalar formula. (This trial validates one slice of a larger design, the auto-distillation readiness kernel; the slice is the readiness verdict itself.)
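The out-of-range failure is easy to reproduce with the scalar formula itself. The inputs below are made-up values chosen only to show the arithmetic; they are not numbers from the design review or the simulation.

```python
import math


def disattenuate(r_xy, r_xx, r_yy):
    """Spearman's correction: observed correlation over the geometric mean of two reliabilities."""
    return r_xy / math.sqrt(r_xx * r_yy)


# Hypothetical inputs: an observed correlation of 0.70 with reliabilities 0.60 and 0.70.
corrected = disattenuate(0.70, 0.60, 0.70)
print(corrected > 1.0)  # True: the "corrected" value is not a valid correlation
```

Nothing in the formula constrains its output, which is why a recovery that works on the probability simplex is the safer tool for categorical labels.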

The recovered accuracy is one of three estimators worth tracking side by side:

| Estimator | Measures | Models teacher error? | Touches gold at runtime? | Failure mode |
| --- | --- | --- | --- | --- |
| Naive | \(P(m = f)\) | No | No | Charges the teacher’s mistakes to the student |
| Corrected (rho) | recovered \(P(m = y)\) | Yes | No | Biased when student and teacher errors correlate at the margin |
| Oracle | \(P(m = y)\) on held-out gold | n/a | Yes, experiment only | Not deployable; it is the answer key, not a method |

Naive is what most pipelines actually use; corrected is the Dawid-Skene recovery, what the student would score against truth if its errors were independent of the teacher’s; oracle is the target rho approximates without ever looking at gold.

The oracle exists only inside the experiment. It is never fed to the system under test. Its single job is to grade whether rho recovered the truth.

The trial

Everything below was pre-registered: the hypothesis, the metrics, the cost model, and the falsification conditions were written down and committed before any data was collected. Pre-registration is the cheapest insurance against the failure mode where you slide a threshold after seeing the numbers and call it a win.

The task is routing. Two callouts, each four classes, 48 items: one deliberately clean and low-ambiguity, one deliberately hard and ambiguous. Running both is what separates “the method is broken” from “this task’s structure defeats it.” The local candidates are a ladder of qwen3 models, 0.6B through 30B, in ascending inference cost. The teacher is a cheap frontier model at minimal effort, drawn three times per item so its own randomness is visible.

| Component | Value |
| --- | --- |
| Task | Support-ticket routing |
| Callouts | One clean (low-ambiguity), one hard (ambiguous) |
| Classes per callout | 4 |
| Items per callout | 48 |
| Local ladder | qwen3, 0.6B → 30B, ascending cost |
| Teacher | Cheap frontier model, minimal effort |
| Teacher draws | 3 per item |
| Misroute cost | $3.00 |
| Readiness budget | $0.50 expected error cost / item |
| Pass condition | accuracy \(> 83.3\%\) |
| Primary metric | M2: selected-tier distance from the oracle’s choice |

The decision rule is dollarized. A misroute costs $3 to clean up; the readiness bar is $0.50 of expected error cost per item. A tier passes iff its expected misroute rate keeps cost under the bar. With misroute cost \(c\) and per-item budget \(B\), that is a single inequality:

\[
c\,(1 - \hat{A}) \le B \quad\Longleftrightarrow\quad \hat{A} \ge 1 - \frac{B}{c} = 1 - \frac{0.50}{3.00} = 0.833\ldots
\]

so any tier at or above 83.3% accuracy clears. The descent walks the ladder cheapest first and ships the first tier that clears the bar; if none clears, it escalates to the paid frontier. This is the artifact that actually gets deployed, so it is what the experiment should be judged on.
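The rule fits in a few lines. This is a sketch under stated assumptions: the tier names and the accuracy estimates in the example ladder are hypothetical placeholders, not trial results, and `descend` is an illustrative name.

```python
def accuracy_bar(misroute_cost, budget):
    """Minimum accuracy at which expected error cost per item stays within budget."""
    return 1.0 - budget / misroute_cost


def descend(ladder, misroute_cost=3.00, budget=0.50, fallback="frontier"):
    """Ship the cheapest tier that clears the bar.

    ladder: list of (tier_name, estimated_accuracy), ordered cheapest first.
    Returns the fallback (the paid frontier) if no tier clears.
    """
    for name, acc in ladder:
        if misroute_cost * (1.0 - acc) <= budget:
            return name
    return fallback


# Hypothetical accuracy estimates, cheapest tier first.
ladder = [("tier-0.6B", 0.71), ("tier-4B", 0.81), ("tier-8B", 0.88), ("tier-30B", 0.94)]
print(accuracy_bar(3.00, 0.50))  # about 0.833
print(descend(ladder))           # tier-8B
```

Which estimator fills in `estimated_accuracy` is the whole question: feed the same descent a naive, corrected, or oracle estimate and it can ship a different tier.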

That choice drove the primary metric. There were two candidates:

  1. M1, does rho de-bias the accuracy estimate, measured as recovered-accuracy error against the oracle.
  2. M2, does rho change which tier the descent selects, and does it move that selection toward the tier the oracle confirms is correct.

A $0 synthetic study run before any spend predicted that M1 would be bias-limited under realistic error correlation and would not improve with sample size, while M2 would stay robust. So M2 was promoted to primary and M1 demoted to a diagnostic, on paper, before the real data existed. That ordering turned out to matter.

What the data said

The clean callout produced a clean null, and it is the right null. The cheap teacher scored 100% on the clean task, 48 out of 48, identical across all three draws. A perfect teacher means naive agreement equals oracle agreement exactly, so there is no bias for rho to remove. The recovery confirmed it: rho changed nothing, the descent shipped the free 0.6B tier under every estimator, and every clean-task query routed to a local model at zero inference cost. This is the first boundary, and it is a feature. Rho only earns its keep against a fallible teacher. When the teacher is already an oracle, the honest answer is to do nothing, and the correction does nothing.

The hard callout supplied the fallible teacher and the real test. Here the teacher scored 93.8%, missing three of the ambiguous boundary tickets. Now naive and oracle diverge, and the question has teeth. The result, measured as the bootstrap Hamming distance between the descent’s selected tier and the oracle’s selected tier, lower being closer to the gold-confirmed choice:

| Gold definition | Teacher accuracy | Naive miss-distance | Rho miss-distance |
| --- | --- | --- | --- |
| Hand-authored gold | 93.8% | 0.177 | 0.135 |
| Audited consensus gold | 95.8% | 0.163 | 0.083 |
| Unanimous subset | 95.7% | 0.133 | 0.082 |

Rho lands closer to the oracle’s choice than naive under every definition of the gold. Across 2,000 bootstrap resamples on the audited gold, rho’s selection matched the oracle’s where naive’s did not on 297 resamples, against 137 the other way, a consistent lean rather than a coin flip.
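One reading of the miss-distance, assumed here rather than taken from the trial code, is the fraction of bootstrap resamples in which an estimator's selected tier differs from the oracle's. That reading is consistent with the counts above: (297 − 137) / 2000 = 0.080, the gap between 0.163 and 0.083 in the audited row. The sketch below implements it with a pluggable estimator; the default is naive agreement, and the toy labels are made up.

```python
import random


def select_tier(accs, bar):
    """Index of the cheapest tier whose accuracy clears the bar; len(accs) means escalate."""
    for i, acc in enumerate(accs):
        if acc >= bar:
            return i
    return len(accs)


def agreement(pred, ref, idx):
    """Fraction of resampled items on which pred matches ref."""
    return sum(pred[i] == ref[i] for i in idx) / len(idx)


def miss_distance(tier_preds, teacher, gold, bar, estimate=agreement,
                  resamples=2000, seed=0):
    """Fraction of resamples where the estimator's pick differs from the oracle's pick."""
    rng = random.Random(seed)
    n = len(gold)
    misses = 0
    for _ in range(resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample items with replacement
        picked = select_tier([estimate(p, teacher, idx) for p in tier_preds], bar)
        oracle = select_tier([agreement(p, gold, idx) for p in tier_preds], bar)
        misses += picked != oracle
    return misses / resamples


# Toy data (hypothetical): a weak tier, a perfect tier, and a teacher that never errs.
gold = ["A", "B", "C", "D"] * 3
teacher = list(gold)
tier_preds = [["A"] * 12, list(gold)]
print(miss_distance(tier_preds, teacher, gold, bar=5 / 6))  # 0.0: a perfect teacher leaves nothing to miss
```

A corrected estimator would be passed in as `estimate`, with the same signature, so both estimators are scored on identical resamples.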

The gold audit is why I trust the top row, and it makes the result stronger, not weaker. Two independent frontier annotators relabeled all 48 items blind: a stronger model from the teacher’s own family at high effort, and a model from a different family entirely. They agreed with each other on every single item, a Cohen’s kappa of 1.000. They disagreed with my hand-authored gold on exactly one ticket, and on that ticket the cheap teacher had actually been right and my gold label wrong. Correcting that one label is what moves the table from row one to row two. Rho’s advantage roughly doubled. The win is not an artifact of sloppy labels; cleaning the labels sharpened it.
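For reference, Cohen's kappa compares observed agreement with the agreement expected by chance from each rater's label frequencies. The sketch below is a plain implementation with made-up labels; two raters who agree on every item score exactly 1.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same class independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical class throughout; kappa is undefined.
        # Returning 1.0 is a convention of this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels.
auditor_1 = ["Billing", "Account", "Billing", "Shipping", "Account", "Technical"]
auditor_2 = list(auditor_1)
print(cohens_kappa(auditor_1, auditor_2))  # 1.0
```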

Why the correction survives a regime it should not

The deconvolution rests on an assumption: the student’s errors and the teacher’s errors are independent given the true class, the two models failing for unrelated reasons. On the hard task that assumption breaks, and breaks in a specific, intelligible way.

On a genuinely ambiguous ticket, capable models converge on the same defensible-but-not-gold reading. “Update my shipping address and the card on file before the next renewal” is Account and Billing at once; the teacher and the 30B local model both chose Account, the same way, against a gold of Billing. That is a correlated error, the exact thing the independence assumption forbids. I measured the correlation, and it rose with model capability, reaching 0.81 on the strongest tier, far past the 0.13 threshold where the synthetic study said the de-biasing should stop helping.
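The statistic behind that correlation is not spelled out here. One plausible choice, assumed for this sketch, is the phi coefficient (Pearson correlation) between the two models' per-item error indicators. It needs gold, so it is an experiment-only diagnostic, and it measures marginal correlation rather than correlation conditional on the true class. The labels are made up.

```python
import math


def error_correlation(m, f, y):
    """Phi coefficient between student and teacher error indicators."""
    err_m = [int(mi != yi) for mi, yi in zip(m, y)]
    err_f = [int(fi != yi) for fi, yi in zip(f, y)]
    n = len(y)
    mean_m, mean_f = sum(err_m) / n, sum(err_f) / n
    cov = sum((a - mean_m) * (b - mean_f) for a, b in zip(err_m, err_f)) / n
    var_m = mean_m * (1.0 - mean_m)
    var_f = mean_f * (1.0 - mean_f)
    if var_m == 0 or var_f == 0:
        # A labeler that never (or always) errs has no error variance;
        # the correlation is undefined, and this sketch reports 0.0.
        return 0.0
    return cov / math.sqrt(var_m * var_f)


# Hypothetical labels: student and teacher miss the same two tickets.
y = ["Billing"] * 8
f = ["Account", "Account"] + ["Billing"] * 6
m = ["Account", "Account"] + ["Billing"] * 6
print(round(error_correlation(m, f, y), 3))  # 1.0: fully shared errors
```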

And it did stop helping the accuracy estimate. M1 failed everywhere, exactly as the pre-registration predicted. But M2, the selection, survived. The reason is structural and worth holding onto. The correlated errors live on the strongest tiers, which sit far above the cost bar, where the selection is never in doubt. The decision is made at the marginal tier, the one straddling the bar, and there the correlation is mild and rho’s de-biasing is sharp: it cut the marginal tier’s accuracy bias from 0.021 down to 0.001. You do not need every estimate honest. You need the estimate at the margin honest, and that is where the correction was cleanest.

A correction can be wrong everywhere that does not matter and right where it does. Selection only needs the margin honest.

What this is, and what it is not

It is a proof of mechanism, not a powered effect-size estimate. The sample is 48 items, one task family, four classes. The margins are small, distances of 0.04 to 0.08, and they are real because they hold across three independent definitions of the gold and across thousands of resamples in a consistent direction.

I report no p-value, on purpose. Bootstrap iterations are not independent observations; you could drive any p-value you like by cranking the resample count, so a significance test over resamples would be fake precision dressed as rigor. The honest evidence is the robustness across gold definitions, not a decorated number.

M1 fails, and I am telling you it fails rather than quietly dropping it. The corrected accuracy point-estimate stays optimistically biased under correlation. The only thing I claim is the selection. And one of the two auditors shared the teacher’s model family, a partial circularity; it is mitigated by the genuinely independent auditor agreeing at kappa 1.000, and by the audit’s net effect being to correct a label against the cheap teacher’s bias rather than toward it.

What is next

This proves the selection half of the loop: given a fixed ladder of local models, pick the cheapest one that is genuinely ready, using a noisy teacher and never touching gold. The other half is what to do when no stock tier clears the bar. Then you mint a tier, a small LoRA distillation trained on the teacher’s labels, and the same question returns in a sharper form, because now the teacher’s errors can be baked into the student you are training. Keeping that loop honest is the next pre-registered experiment.

The discipline underneath all of this is the point as much as the result. The hypothesis and the bar were fixed before the data. The one estimator that failed is reported next to the one that worked. The gold was audited by models that had every chance to embarrass it. Every result is a node in a content-addressed proof graph, hashed to the code and data that produced it, so the claim and its evidence travel together. A free model can replace the frontier on the right tasks. The part worth getting right is the “can,” and the part worth publishing is how you would know.