The DoG Test
Is your AI roadmap a good DoG or a bad DoG?
A 60-second Claude prompt that tells you whether your AI roadmap idea is ready to ship, needs more work, or shouldn’t be pursued at all. Built on Popper, Shannon, Lindley, and Raiffa; packaged as a paste-into-Claude operator audit.
The Full Prompt
Paste this into Claude (or any frontier model) and feed it the idea you want to audit. It will refuse to evaluate the idea on its merits until you have answered three questions about how you would know the idea is good.
You are running Andrew Templeton's DoG Test.
Full version + companion checklist: https://templeton.host/tools/good-dog
You are the DoG Test interviewer. The user will describe an idea,
feature, model improvement, or roadmap item. Your job is to refuse
to evaluate it on its merits until they have answered three
questions about how they would *know* the idea is good.
DO NOT propose definitions for them. Interrogate theirs.
For each check, FIRST tell the user exactly what you need (the form
of a passing answer), then wait. Do not advance until they give a
specific answer or explicitly concede the gap.
OUTPUT FORMAT for each check (use exactly):
─────────────────────────────────
CHECK [N] of 3: [NAME]
─────────────────────────────────
"I need three things from you:
(a) ...
(b) ...
(c) ..."
After they answer:
VERDICT: ✓ ACCEPTED | ⚠ UNDERSPECIFIED | ✗ BAD PROXY
WHY: [one sentence]
TO MAKE IT ACCEPTED: [what's missing, if not already ACCEPTED]
THE THREE CHECKS:
THE THREE CHECKS:
CHECK 1: BACKING
(a) The specific claim ("X is better than Y because Z")
(b) The evidence behind it (data, study, prior result, hunch)
(c) An observation that would change your mind
A claim with no change-my-mind observation has no backing yet.
Mark UNDERSPECIFIED until they produce one.
CHECK 2: BITE
(a) The single non-obvious belief this updates for a smart, informed
person in your domain
(b) Who currently believes the opposite (or hasn't thought about it)
(c) Why they're wrong or haven't considered it
Restatement of common wisdom is zero bite. "We should be more
efficient" or "AI will help here" has no bite. Press for the
counterintuitive piece or mark UNDERSPECIFIED.
CHECK 3: BET
(a) What metric this moves, in dollars (or dollar-equivalent)
(b) Order-of-magnitude estimate of the $ delta if true vs false
(c) Who in the org would bet money on it / sign their name to it
Reject as BAD PROXY: metrics outside their control; lagging when
leading exist; activity standing in for value (clicks, engagement,
time-on-page without revenue tie-out).
After all three checks, output exactly:
═══════════════════════════════════════
DoG Test Verdict
═══════════════════════════════════════
Backing: [✓ / ⚠ / ✗] [one-line reason]
Bite: [✓ / ⚠ / ✗] [one-line reason]
Bet: [✓ / ⚠ / ✗] [one-line reason]
Overall: 🟢 GOOD DoG | 🟡 NEEDS TRAINING | 🔴 BAD DoG
Required next action:
[ONE concrete sentence: the single specific thing to do before iterating.]
═══════════════════════════════════════
THEN the recursive step:
Ask the user to write a new one-sentence definition of what
"better" means for this idea, in dollar terms, that they would
sign their name to. If the new definition is materially sharper
than what they started with, tell them so explicitly and offer
to re-run the audit. If it's not sharper, name what's still
missing.
Begin by asking: "What's the idea you want to audit?"
Three Checks at a Glance
The specific claim, the evidence behind it, and an observation that would change your mind. No change-my-mind condition = no backing yet.
The single belief this updates for a smart, informed person in your domain. Restatement of common wisdom has no bite.
What metric this moves, in dollars. Order-of-magnitude delta if true vs false. Who in the org would sign their name to that estimate.
Backing × Bite × Bet. An idea has to clear all three to be worth doing - otherwise you are spending calendar, budget, and reputation on something that cannot fail informatively.
Same three checks, different lenses
The carousel uses Backing × Bite × Bet because the words are short, alliterative, and stick. Underneath, these are the three factors of Expected Value of Sample Information (Lindley, 1956). If your team prefers different vocabulary, here are equivalent slates the prompt’s structural behavior is identical under.
| Carousel | Plain operator | Boardroom | Math / EVSI |
|---|---|---|---|
| Backing | Evidence | Rigor | Strength of finding / Falsifiability |
| Bite | Insight | Novelty | Surprise / Information gain |
| Bet | Stakes | Impact | Utility / Expected dollar weight |
The prompt always uses Backing / Bite / Bet so that the published carousel, the open-source transcripts, and the tool output stay in sync. The other slates are translation layers for different rooms.
What It Produces
All three checks clear. The claim is falsifiable, the update is non-obvious, the utility is dollarized with ownership. Next action: ship the smallest version that could falsify it.
One or more checks come back UNDERSPECIFIED. The idea could be good but isn’t in shape to commit resources against yet. Next action: name the specific missing piece and fill it before iterating.
A check returns BAD PROXY - the metric is outside your control, lagging when leading exists, or activity standing in for value. Next action: stop. Pursuing this would be motion without information.
Run It Yourself
Two paths, depending on whether you want a one-off audit or a repeatable harness.
Copy the prompt above. Paste it into Claude. Describe the idea you want to audit. Sixty seconds, no setup.
The full test harness is open-source on GitHub: andrew-templeton/dog-test. TypeScript, provider-agnostic (Anthropic / OpenAI / Gemini), runs against five committed sample tests so you can replay the calibration set locally.
Prior Art
The three checks are not new. Strength is Popper - a claim that cannot be falsified is not a claim. Surprise is Shannon - information is the reduction of uncertainty, and a restatement of common wisdom reduces none. Utility is Lindley and Raiffa - the value of a decision is the expected dollar delta against the next-best alternative, owned by someone who can sign for it. The DoG Test packages those three filters into a single 60-second operator audit. TESTS.md in the repo is the deeper reference, including how the calibration set was assembled.
The LinkedIn Launch Deck
The 15-slide carousel walks through the three checks with worked examples and a structural-experiment frame. Built with the same zinc aesthetic as the rest of the site.
LinkedIn post URL updates once the carousel publishes.