Business Finance

Backtesting

Risk & Decision Science · Difficulty: ★★☆☆☆

The scientist derives the formula, backtests the protocol, and publishes the result so others can verify it.

You built a Scoring Model that predicts which deals in your Pipeline will close. It uses deal size, days in stage, and Buyer engagement. Your team wants to deploy it next quarter to set Hiring Targets for the GTM Teams. Before you commit $400K in headcount to this formula, you need to know: would it have actually worked on the deals you already closed - or lost - over the last 18 months?

TL;DR:

Backtesting runs a decision rule against historical data where outcomes are already known - measuring whether it would have produced the Returns you expect before you risk real Budget deploying it.

What It Is

Backtesting takes a formula, decision rule, or protocol and applies it to past data where you already know the outcomes. You pretend you are standing at each historical moment, make the decision your rule prescribes, and then check whether the result matches what your model predicted.

The key discipline: you cannot let future information leak into past decisions. If your Scoring Model uses Close Rate data from Q4 to make a prediction about a Q2 deal, you have contaminated the test. The rule must only see what it would have seen at the time.

This is how the gap between a plausible Expected Value calculation and a validated one gets closed. Expected Value gives you the math. Backtesting gives you the evidence that the math holds up in your specific operating environment.

Why Operators Care

Every P&L decision is a bet. You allocate Budget to Marketing Spend because you believe the Expected Return justifies the cost. You set Pricing at a certain level because your Unit Economics model says it maximizes Profit. You build a Pipeline Velocity target because you think faster deals mean more Revenue.

Without backtesting, these are beliefs. With backtesting, they are hypotheses that have survived contact with your own data.

The P&L impact is direct:

  • Capital Allocation: Before committing capital on the strength of a formula, backtesting tells you whether that formula would have allocated well in the past.
  • Error Cost reduction: A decision rule that looks good on paper but fails on historical data will fail on future data too - catching it early avoids real losses.
  • Verifiability: When you publish a backtested result - the data, the rule, and the results - anyone on your team can verify it independently. This builds trust in the decision rule and reduces the Execution Risk of people quietly ignoring a model they do not believe in.

How It Works

Step 1: Define the decision rule precisely. Not 'we think big deals close faster' but 'deals over $50K with contract review started by day 15 have a Close Rate above 40%.' The rule must be specific enough that a skeptical colleague could apply it without judgment calls.

Step 2: Gather historical data where outcomes are known. You need enough periods to be meaningful. Testing a quarterly Budget Allocation model against 3 quarters is barely a test. Testing it against 12 quarters starts to mean something.

Step 3: Simulate decisions at each historical point. Walk through the data chronologically. At each decision point, apply your rule using only information available at that moment. Record what the rule would have decided.

Step 4: Compare predicted outcomes to actual outcomes. Calculate the Expected Value your rule predicted, then compare to what actually happened. Measure the Variance between predicted and actual Returns.

Step 5: Stress the edge cases. Your rule might work well on average but fail badly during a Market Downturn or when Pipeline Volume drops. Check performance across different conditions - not just the aggregate.

The output is not 'the model is good' or 'the model is bad.' It is a distribution: under what conditions does the rule work, how large is the gap between predicted and actual, and is the Variance acceptable given your risk appetite?
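
A minimal sketch of that loop in Python, assuming a simplified deal record with hypothetical fields (quarter, size, contract-review day, outcome) and illustrative thresholds; the point is the shape of the chronological walk, not the specific rule:

```python
from dataclasses import dataclass
from statistics import mean, pvariance

# Hypothetical record: one historical decision point with a known outcome.
@dataclass
class Deal:
    quarter: int           # when the deal was scored
    size: float            # deal size in dollars
    review_start_day: int  # day contract review started
    closed: bool           # known outcome

def rule(deal: Deal) -> bool:
    """Step 1: a precisely defined decision rule (illustrative thresholds)."""
    return deal.size > 50_000 and deal.review_start_day <= 15

def backtest(deals: list[Deal], predicted_close_rate: float) -> dict:
    """Steps 2-4: walk history in order, apply the rule using only per-deal
    information known at scoring time, then compare predicted vs actual."""
    deals = sorted(deals, key=lambda d: d.quarter)   # chronological walk
    flagged = [d for d in deals if rule(d)]          # what the rule would have picked
    actual_rate = mean(d.closed for d in flagged) if flagged else 0.0

    # Step 5: stress the edge cases by checking stability per quarter, not just the aggregate.
    by_quarter: dict[int, list[bool]] = {}
    for d in flagged:
        by_quarter.setdefault(d.quarter, []).append(d.closed)
    quarterly = {q: mean(outcomes) for q, outcomes in by_quarter.items()}

    return {
        "predicted_close_rate": predicted_close_rate,
        "actual_close_rate": actual_rate,
        "gap": actual_rate - predicted_close_rate,
        "close_rate_by_quarter": quarterly,
        "variance_across_quarters": pvariance(quarterly.values()) if len(quarterly) > 1 else 0.0,
    }
```

Because this toy rule only reads fields that are fixed at scoring time, it has no look-ahead risk; the point-in-time filtering discussed under Common Mistakes becomes essential as soon as the rule uses aggregates computed from other deals.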

When to Use It

Backtest when:

  • You are about to commit significant Budget to a strategy driven by a formula or model. The larger the Implementation Cost, the more a backtest is worth the effort.
  • You have enough historical data with known outcomes. If you only have 3 months of data and you are testing a quarterly model, the sample is too small to mean anything.
  • The decision is repeatable. Backtesting a one-time Capital Investment decision is less useful than backtesting a Pricing rule you will apply to thousands of transactions.
  • Stakeholders need to trust the result. A backtested protocol with transparent data, rules, and results can be verified independently - the discipline described in the section above.

Do not backtest when:

  • You have no historical data (you are entering a new market with no base case)
  • The operating environment has fundamentally changed and past data is not representative
  • The cost of running a small live test is lower than the cost of building the backtest

Worked Examples (2)

Backtesting a Pipeline Scoring Model Before Setting Hiring Targets

You manage a SaaS P&L with $2.4M ARR. Your Pipeline has 180 deals per quarter. You built a Scoring Model that flags deals as 'high confidence' or 'low confidence' based on three signals: deal size above $30K, Buyer engaged within 7 days, and contract review started by day 20. You want to use this model to set Hiring Targets - if it reliably identifies winners, you can staff to the high-confidence volume instead of total Pipeline Volume. You have 18 months (6 quarters) of historical deal data with outcomes.

  1. Walk through all 1,080 historical deals (180 per quarter times 6 quarters). For each deal, apply the three-signal rule using only data available at the scoring moment. Tag each deal as 'high confidence' or 'low confidence.'

  2. Count outcomes. Of 1,080 deals, 390 were tagged high confidence. Of those 390, 195 closed (50% Close Rate). Of the 690 tagged low confidence, 69 closed (10% Close Rate). Total actual closes: 264.

  3. Calculate the Revenue the model would have predicted. Staffing to the 390 high-confidence deals at a 50% Close Rate and $35K average deal size: 390 × 0.50 × $35K = $6.83M over 6 quarters, or roughly $1.14M per quarter.

  4. Compare to actual Revenue. Your actual Revenue over those 6 quarters was $9.24M ($1.54M per quarter), because some low-confidence deals did close. The model captures 74% of actual closes (195 of 264).

  5. Check Variance across quarters. In Q1 and Q2 of year one, the model's Close Rate on high-confidence deals was only 38% - well below the 50% average. The model performed worst when Pipeline Volume was below 150 deals per quarter. This is a failure mode worth flagging.

  6. Decision: The model works but is not safe to use as the sole input for Hiring Targets. It misses 26% of closes and underperforms in low-volume quarters. You could use it to set a floor for staffing, not a ceiling.

Insight: Backtesting did not just say 'the model works.' It revealed the specific conditions where it breaks down (low Pipeline Volume quarters) and quantified the opportunity cost of relying on it exclusively (26% of closes missed). This is information you cannot get from the formula alone.
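
A compact sketch of this walk-through, assuming hypothetical field names for the three signals; the deal list is a stub standing in for the 1,080 historical records:

```python
from statistics import mean

# Hypothetical historical deals with known outcomes; field names are illustrative.
deals = [
    {"quarter": 1, "size": 42_000, "engaged_by_day": 5, "review_by_day": 18, "closed": True},
    {"quarter": 1, "size": 18_000, "engaged_by_day": 12, "review_by_day": 30, "closed": False},
    # ... the remaining historical records
]

def high_confidence(deal: dict) -> bool:
    """The three-signal rule, using only data available at the scoring moment."""
    return (deal["size"] > 30_000
            and deal["engaged_by_day"] <= 7
            and deal["review_by_day"] <= 20)

def backtest_scoring_model(deals: list[dict], avg_deal_size: float = 35_000) -> dict:
    high = [d for d in deals if high_confidence(d)]
    low = [d for d in deals if not high_confidence(d)]
    high_closes = sum(d["closed"] for d in high)
    total_closes = high_closes + sum(d["closed"] for d in low)

    # Per-quarter close rates expose the low-volume failure mode.
    by_quarter: dict[int, list[bool]] = {}
    for d in high:
        by_quarter.setdefault(d["quarter"], []).append(d["closed"])

    return {
        "high_confidence_close_rate": high_closes / len(high) if high else 0.0,
        "share_of_closes_captured": high_closes / total_closes if total_closes else 0.0,
        "revenue_if_staffed_to_high_only": high_closes * avg_deal_size,
        "high_confidence_close_rate_by_quarter": {q: mean(v) for q, v in by_quarter.items()},
    }
```

Run against the six quarters described above, this kind of tally is what produces the 50% close rate on flagged deals, the 74% share of closes captured, and the weak early quarters that argue against using the model as the sole staffing input.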

Backtesting an Inventory Reorder Rule

You manage inventory for a retail operation with 30 product lines. Reorder decisions are currently made by managers using experience and judgment. You want to test whether a formula-based rule would reduce stockouts without increasing the capital tied up in inventory. You have 18 months of daily sales and inventory data with complete records of every stockout event.

  1. Define the rule: Reorder when on-hand units fall below 12 days of average daily sales (calculated from the prior 30 days of sales data). Order quantity: 21 days of that same average. Deliveries arrive 5 business days after the order is placed.

  2. Walk through the 18 months day by day for each of the 30 product lines. At each day, calculate the prior-30-day average daily sales using only past data. Simulate inventory levels starting from actual inventory on day one, applying the rule forward.

  3. Count stockout-days. Under actual manager decisions, the 30 products accumulated 247 stockout-days over 18 months. Under the formula rule, the simulation produces 68 stockout-days - a 72% reduction.

  4. Calculate the Error Cost reduction. Average daily Revenue lost per product-stockout is $210. Fewer stockout-days: (247 − 68) × $210 = $37,590 in preserved Revenue over 18 months.

  5. Check inventory carrying levels. Under the manager approach, average inventory was 28 days of supply. Under the rule, average inventory drops to 19 days of supply - 32% less capital tied up in inventory at any given time.

  6. Identify the failure mode. The rule produced 52 of its 68 stockout-days in November and December. The prior-30-day average lagged behind holiday demand spikes - by the time the average reflected higher sales, inventory had already hit the reorder point too late for the 5-day delivery window. Run a Sensitivity Analysis on delivery time: if it increases from 5 to 8 business days, the rule produces 112 stockout-days. That is still 55% fewer than 247, but the margin narrows. The rule depends on both stable demand patterns and reliable delivery times.

Insight: The backtest confirmed the rule outperforms judgment under normal conditions: $37,590 in preserved Revenue and 32% less capital in inventory. But it exposed a specific failure mode - demand spikes during peak months outrun the trailing average - and a dependency on delivery times. You would deploy the rule for 10 months of the year and override it with a higher reorder threshold during the peak season.
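
A simulation sketch for one product line, assuming the inputs are a list of daily unit sales and a starting inventory level, and simplifying the 5 business-day lead time to calendar days:

```python
from statistics import mean

def simulate_reorder(daily_sales: list[float],
                     starting_inventory: float,
                     trailing_window: int = 30,
                     reorder_days_of_supply: int = 12,
                     order_days_of_supply: int = 21,
                     lead_time_days: int = 5) -> int:
    """Replay the reorder rule against one product's historical daily sales and
    count the stockout-days it would have produced.
    Simplification: lead time is treated as calendar days, not business days."""
    inventory = starting_inventory
    pending: dict[int, float] = {}   # arrival day -> units on order
    stockout_days = 0

    for day, sold in enumerate(daily_sales):
        inventory += pending.pop(day, 0.0)            # receive any delivery due today

        # Trailing average uses only past data: no look-ahead.
        history = daily_sales[max(0, day - trailing_window):day]
        avg_daily = mean(history) if history else sold

        if sold > inventory:
            stockout_days += 1                        # demand exceeded what was on hand
        inventory -= min(inventory, sold)             # can only sell what is on hand

        # Reorder when on-hand drops below the threshold and nothing is already on order.
        if inventory < reorder_days_of_supply * avg_daily and not pending:
            pending[day + lead_time_days] = order_days_of_supply * avg_daily

    return stockout_days
```

Summing stockout-days across the 30 product lines gives the figure to compare against the managers' 247; rerunning with lead_time_days=8 is the delivery-time Sensitivity Analysis from step 6.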

Key Takeaways

  • A backtest that shows near-perfect accuracy is more suspicious than one with known failure modes. Perfect results usually mean future information leaked into past decisions or the sample was too small for the rule to encounter conditions where it breaks. The useful output is a map of where the rule works and where it does not.

  • The most valuable backtest outcome is often a partial failure. Discovering that your Scoring Model underperforms when Pipeline Volume drops below 150 deals per quarter is more actionable than learning it works 'on average' - because it tells you exactly when to override the rule with judgment and when to trust it.

  • Minimum data requirements scale with the complexity of your rule. A single-condition rule (deal size above $50K) might need 50 decisions with known outcomes to test meaningfully. A three-condition rule needs at least 200. If you run a backtest without enough data, the rule will appear validated simply because it never encountered enough variety to fail.

Common Mistakes

  • Using future information in past decisions. Your Scoring Model uses 'average days to close' calculated from the full 18-month dataset, including deals that had not closed yet at the time of each simulated decision. At Q2, the model should only know close-time data from Q1 and earlier. If it sees Q3 through Q6 data when scoring a Q2 deal, the backtest looks artificially accurate because the rule had access to answers it would never have in production. The fix is strict: at each simulated decision point, filter the dataset to include only records dated before that moment.

  • Tuning until the rule perfectly explains history. If your decision rule has 8 adjustable parameters and you test it against 10 data points, you have enough flexibility to fit random noise perfectly. The rule will fail on new data because it memorized specific historical outcomes rather than capturing a real pattern. The test: build the rule on the first 60% of your data and evaluate it against the remaining 40%. If accuracy drops substantially, the rule was memorizing rather than predicting.
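
Both mistakes come down to the same discipline: control what data the rule is allowed to see. A minimal sketch of the two guards, assuming each record carries a comparable date field:

```python
def visible_as_of(records: list[dict], decision_date) -> list[dict]:
    """Guard against look-ahead: only records dated strictly before the simulated
    decision moment are visible when the rule computes any aggregate."""
    return [r for r in records if r["date"] < decision_date]

def build_test_split(records: list[dict], build_fraction: float = 0.6):
    """Guard against overfitting: derive the rule's parameters on the earliest 60%
    of history, then evaluate on the most recent 40% the rule has never seen."""
    ordered = sorted(records, key=lambda r: r["date"])
    cut = int(len(ordered) * build_fraction)
    return ordered[:cut], ordered[cut:]   # (build set, held-out test set)
```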

Practice

Difficulty: medium

You have a marketing Budget Allocation rule: spend 60% on the channel with the highest ROI last quarter. You have 8 quarters of data showing ROI by channel (paid search, content, events). Design a backtest. What is your decision rule, what data do you need at each point, and what metric tells you if the rule works?

Hint: Start at Q3 (the first quarter where you have a 'last quarter' to reference). At each quarter, your rule can only see the previous quarter's ROI. Compare the rule's allocation to what a perfect allocator with hindsight would have chosen.

Solution

Decision rule: At the start of each quarter, allocate 60% of Marketing Spend to whichever channel had the highest ROI in the prior quarter. Remaining 40% split evenly. Walk Q3 through Q8 (6 test periods). At each quarter, record the rule's allocation, then compare actual Revenue generated per channel. Metric: cumulative ROI of the rule's allocation versus cumulative ROI of equal-split allocation (your base case) and versus perfect-hindsight allocation (your ceiling). If the rule beats equal-split by more than 10% cumulative, it is adding value. If it captures less than 60% of the perfect-hindsight ROI, the signal is too noisy and last-quarter ROI is not predictive enough to drive allocation.
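
A sketch of that comparison, assuming hypothetical quarterly ROI figures (revenue generated per dollar of Marketing Spend) and a fixed quarterly budget; the loop starts at Q3 to mirror the solution above:

```python
CHANNELS = ["paid_search", "content", "events"]

# Hypothetical ROI by channel for Q1..Q8 (revenue per dollar of spend).
roi_history = [
    {"paid_search": 2.1, "content": 1.6, "events": 1.2},
    {"paid_search": 1.8, "content": 2.0, "events": 1.1},
    # ... six more quarters
]

def backtest_allocation(roi_history: list[dict], budget: float = 100_000) -> dict:
    rule_total = equal_total = hindsight_total = 0.0
    for q in range(2, len(roi_history)):                    # index 2 = Q3, first test quarter
        prior, current = roi_history[q - 1], roi_history[q]
        winner = max(prior, key=prior.get)                  # rule only sees last quarter's ROI
        for ch in CHANNELS:
            weight = 0.6 if ch == winner else 0.4 / (len(CHANNELS) - 1)
            rule_total += weight * budget * current[ch]
            equal_total += (budget / len(CHANNELS)) * current[ch]
        hindsight_total += budget * max(current.values())   # perfect-foresight ceiling
    return {"rule": rule_total, "equal_split": equal_total, "hindsight": hindsight_total}
```

With a fixed budget per quarter, comparing these cumulative revenue totals is equivalent to comparing cumulative ROI: the rule adds value if it beats the equal split by more than 10% and captures a healthy share of the hindsight ceiling.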

Difficulty: easy

A colleague says: 'I backtested our new Hiring Targets model and it predicts headcount needs perfectly for the last 2 years.' Name two reasons you should be skeptical before approving Budget based on this claim.

Hint: Think about how many parameters the model has relative to how many data points (quarters) exist in 2 years, and whether the model was built using the same data it was tested on.

Solution

First, 2 years is only 8 quarters. If the model has more than 2 or 3 parameters, it may have memorized a tiny sample rather than captured a real pattern - see the second common mistake above. Second, ask whether the model was built on the same data it was tested on. If your colleague used all 8 quarters to derive the formula AND to test it, the backtest is circular. A valid test requires splitting the data: build the model on quarters 1 through 5, then test on quarters 6 through 8. A model that 'perfectly' predicts 8 data points is almost certainly capturing noise, not signal.

Connections

Backtesting sits directly on top of Expected Value and Returns. Expected Value gives you the formula for weighting uncertain outcomes. Returns tells you what those outcomes actually look like across time. Backtesting asks: does the formula, applied to real historical Returns data, produce Expected Values that match reality? Without backtesting, your Expected Value calculations are theoretical. With it, they become empirical. Downstream, backtesting feeds into Sensitivity Analysis (what happens to backtest results when assumptions shift - as in the inventory example where delivery time moved from 5 to 8 days?), Variance and Standard Deviation (how stable are the backtested Returns across different periods and conditions?), and Risk-Adjusted Return (is the rule producing genuine Alpha or tracking noise?). Any time you build a decision rule for Capital Allocation, Pricing, Hiring Targets, or Budget, backtesting is how you earn the right to deploy it.

Disclaimer: This content is for educational and informational purposes only and does not constitute financial, investment, tax, or legal advice. It is not a recommendation to buy, sell, or hold any security or financial product. You should consult a qualified financial advisor, tax professional, or attorney before making financial decisions. Past performance is not indicative of future results. The author is not a registered investment advisor, broker-dealer, or financial planner.