Do not build custom models unless the data moat justifies it
Your ML engineer pitches a custom fraud detection model. Six months of build time, $380K fully loaded. Meanwhile, a vendor offers a plug-and-play API at $3K/month. Your engineer insists the custom model will be better. She's probably right - but that's not the question. The question is whether you own data that the vendor can never get, and whether that data gets better every month you operate.
A Data Moat exists when your proprietary data compounds over time into an advantage that off-the-shelf solutions cannot replicate. Only build custom models when you have one - otherwise you are lighting Capital Investment on fire for a temporary edge that erodes the moment a vendor catches up.
A Data Moat is a specific type of competitive moat where the barrier to replication is a dataset that only you possess, that improves as you operate, and that directly drives a measurable Competitive Advantage.
Three properties must all be true:
Proprietary: the data exists only inside your operation - no competitor can buy, license, download, or scrape it.
Compounding: a Feedback Loop makes the dataset meaningfully better every month you operate.
Decision-relevant: the data drives a prediction that moves a real P&L line item.
If any one of these is missing, you do not have a Data Moat. You have a dataset.
When you have P&L ownership, every dollar of Capital Investment needs to earn its place. Custom model development is one of the most expensive bets an Operator can make - it consumes Engineering Labor, stretches Time-to-Fill for specialized roles, and carries high Execution Risk because you won't know if the model works until months after you start spending.
The Build, Buy, or Hire framework gives you the decision structure. The Data Moat concept gives you the input to that decision: build only when the data underneath is itself the moat.
Here is why the distinction matters to your P&L:
The asymmetry is brutal. Getting it wrong means you burned Budget on what turns out to be a Wasting Asset instead of a Compounder.
Start with the Build, Buy, or Hire decision, but add a data audit before you choose "Build."
Step 1: Identify what data the model needs
List every input feature the model requires. For each one, ask: can a competitor acquire this data by signing a contract with a vendor, downloading a public dataset, or scraping the web?
Step 2: Score proprietary vs. commodity
Split your features into two buckets:
Proprietary: data generated by your own operations that a competitor cannot obtain at any price.
Commodity: data anyone can acquire by signing a vendor contract, downloading a public dataset, or scraping the web.
If 80%+ of the model's predictive power comes from commodity data, you almost certainly do not have a Data Moat.
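The Step 2 audit can be sketched as a simple tally: tag each feature's share of predictive power (e.g. from permutation importance) as proprietary or commodity, then check whether commodity features dominate. The feature names and importance shares below are hypothetical placeholders, not from the source.

```python
# Hypothetical feature audit: name -> (importance share, is_proprietary)
features = {
    "merchant_history_internal": (0.35, True),
    "chargeback_labels":         (0.20, True),
    "transaction_amount":        (0.25, False),
    "geo_ip_risk_score":         (0.12, False),
    "time_of_day":               (0.08, False),
}

# Total share of predictive power coming from commodity features
commodity_share = sum(w for w, proprietary in features.values() if not proprietary)

# Rule of thumb from the text: 80%+ commodity => almost certainly no Data Moat
has_moat_signal = commodity_share < 0.80
print(f"commodity share: {commodity_share:.0%}, moat signal: {has_moat_signal}")
```

In practice the importance shares would come from a trained model (permutation importance or similar), but the bucketing logic is the same.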
Step 3: Test for Compounding
Ask: does the dataset get meaningfully better with scale and time? A Feedback Loop is the engine here. The model makes a prediction, the business takes an action, you observe the outcome, and that outcome becomes training data. Each cycle makes the next prediction more accurate.
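The predict-act-observe-retrain cycle above can be sketched structurally. This is a minimal toy, not a real classifier: the "model" is just a running base-rate estimate, and all names are hypothetical.

```python
# Structural sketch of the Feedback Loop: predict -> act -> observe -> label -> retrain.
class NaiveFraudModel:
    def __init__(self):
        self.labels = []  # accumulated (features, outcome) training pairs

    def predict(self, txn):
        # predict the base-rate fraud probability from outcomes seen so far
        if not self.labels:
            return 0.5
        return sum(y for _, y in self.labels) / len(self.labels)

    def observe_and_retrain(self, txn, outcome):
        # the observed outcome becomes a training example for the next cycle
        self.labels.append((txn, outcome))

model = NaiveFraudModel()
for txn, outcome in [({"amt": 120}, 0), ({"amt": 9000}, 1), ({"amt": 40}, 0)]:
    p = model.predict(txn)               # model makes a prediction
    model.observe_and_retrain(txn, outcome)  # business acts, outcome is labeled
```

The point is the shape of the loop: every cycle converts an operational outcome into proprietary training data, which is exactly what a vendor cannot replicate.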
Classic Compounding patterns: outcome labels generated by your own operations - a flagged transaction investigated and confirmed as fraud, a delivery completed or failed within a predicted window, a ticket routed and resolved. Each outcome becomes a training example no one else possesses.
Step 4: Quantify the advantage
Estimate the Expected Value of the custom model's edge over the vendor alternative, across your Time Horizon. If the edge compounds, the gap between custom and vendor widens each year. If it does not compound, the vendor catches up and your Implementation Cost was wasted, along with the opportunity cost of everything that Budget could have funded instead.
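Step 4 can be sketched as a simple horizon sum, assuming the edge either erodes at a fixed annual rate or holds flat (a genuinely compounding edge would need a growth term instead). The inputs below reuse the fraud example's figures purely for illustration.

```python
# Hedged sketch of Step 4: expected value of a custom model vs. a vendor.
# `erosion` is the fraction of the edge the vendor closes per year (0 = edge holds).
def edge_ev(annual_edge_value, build_cost, annual_cost_delta, years, erosion):
    """Sum the (eroding) edge plus recurring cost difference, minus build cost."""
    total = -build_cost
    edge = annual_edge_value
    for _ in range(years):
        total += edge + annual_cost_delta
        edge *= (1 - erosion)  # vendor closes part of the gap each year
    return total

# Illustrative: $101K/yr edge, $380K build, vendor fees exceed maintenance
# by $384K/yr, 30% annual erosion, 3-year horizon.
print(round(edge_ev(101_000, 380_000, 384_000, 3, 0.30)))
```

Even with heavy erosion, a large recurring fee difference can carry the decision; with a small fee difference, everything rides on whether the edge compounds or decays.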
Apply this framework every time someone proposes building a custom model, algorithm, or data pipeline that could instead be purchased.
Build when all three are true:
The data is genuinely proprietary - no vendor contract, public dataset, or scrape can reproduce it.
A Feedback Loop compounds your data faster than vendors improve.
The quantified edge over your Time Horizon exceeds the full build and maintenance cost.
Buy when any of these are true:
Commodity data drives most of the model's predictive power.
No Feedback Loop makes your data structurally better over time.
The vendor's aggregate data compounds faster than yours, so your edge erodes before the build pays back.
The hardest case: you have good proprietary data but in a domain where vendors are rapidly improving. Run the numbers on Competitive Erosion rate. If the vendor closes 30% of your accuracy gap each year, your custom model's edge has a half-life of about 2 years. That $380K build needs to generate enough Profit in those 2 years to justify itself - and it probably won't.
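The half-life figure above follows from simple decay arithmetic: if the vendor closes a fraction r of your gap per year, the remaining gap after t years is (1 - r)^t, so the half-life is ln(0.5) / ln(1 - r).

```python
import math

# Half-life of a custom model's accuracy edge under constant annual erosion.
def edge_half_life(annual_closure_rate):
    return math.log(0.5) / math.log(1 - annual_closure_rate)

print(f"{edge_half_life(0.30):.2f} years")  # ~2 years at 30%/year, as in the text
```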
An e-commerce company processes 2M transactions/month. Over 3 years, they have accumulated 72M labeled transactions including 180K confirmed fraud cases with internal investigation notes. A vendor API costs $0.02 per transaction ($40K/month). A custom model costs $380K to build and $8K/month to maintain (one ML engineer at 20% time). The custom model catches 94% of fraud vs. the vendor's 87%, on $120M annual transaction volume with a 1.2% fraud rate.
Annual fraud exposure: $120M x 1.2% = $1.44M in fraudulent transactions
Vendor catches 87%: $1.44M x 0.87 = $1.253M caught, $187K leaks through
Custom model catches 94%: $1.44M x 0.94 = $1.354M caught, $86K leaks through
Annual savings from custom model: $187K - $86K = $101K in reduced fraud losses
Annual cost comparison - Vendor: $40K/month x 12 = $480K. Custom: $8K/month x 12 = $96K maintenance
Year 1 net: custom saves ($480K - $96K) = $384K in vendor fees, plus $101K in fraud reduction = $485K benefit, minus $380K build cost = $105K net positive
Year 2+: the Feedback Loop kicks in. Each month adds 2M new labeled transactions. The custom model improves to 96% catch rate by end of Year 2, while the vendor stays at 87-89% because they don't have your investigation labels. The gap widens.
3-year Expected Value of custom: $105K (Y1) + $500K (Y2) + $520K (Y3) = $1.125M net positive
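The Year 1 arithmetic above can be checked directly (all figures from the example; the text rounds $104.8K to $105K):

```python
# Fraud example, Year 1: exposure, leakage under each model, and net benefit.
volume, fraud_rate = 120_000_000, 0.012
exposure = volume * fraud_rate                     # $1.44M fraudulent volume

leak_vendor = exposure * (1 - 0.87)                # ~$187K leaks past the vendor
leak_custom = exposure * (1 - 0.94)                # ~$86K leaks past the custom model
fraud_savings = leak_vendor - leak_custom          # ~$101K/year in reduced losses

fee_savings = (40_000 - 8_000) * 12                # $384K/year in vendor fees avoided
year1_net = fee_savings + fraud_savings - 380_000  # minus the one-time build cost

print(f"Year 1 net: ${year1_net:,.0f}")
```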
Insight: The Data Moat here is the 72M labeled transactions with internal investigation notes - data no vendor can acquire. The Feedback Loop (flag -> investigate -> label -> retrain) means the moat deepens every month. This is a Compounder: build it.
A mid-market retailer wants to build a custom demand forecasting model. They have 3 years of sales data across 4,000 SKUs, plus weather and holiday calendars. A vendor offers forecasting-as-a-service at $5K/month trained on data from 200+ retailers. Building custom costs $280K with $6K/month maintenance.
Audit the data: sales history (proprietary but shallow at only 4K SKUs), weather (commodity - anyone can buy it), holiday calendars (public), promotional schedules (proprietary but simple)
Score proprietary vs. commodity: ~60% of predictive power comes from commodity features (seasonality, weather, day-of-week patterns). The vendor, trained on 200+ retailers, has seen every pattern your 4K SKUs can show - and thousands more.
Test for Compounding: your sales data grows linearly with time, but so does every other retailer's. No Feedback Loop exists that makes your data uniquely better - you just accumulate more of the same signal the vendor already has at 50x your scale.
3-year cost: Custom = $280K + ($6K x 36) = $496K. Vendor = $5K x 36 = $180K. Custom must deliver $316K more value to break even.
Accuracy gap: custom model is 2-3% more accurate on your specific SKUs today, worth roughly $40K/year in better Inventory Control. Over 3 years that is $120K - well short of the $316K gap.
Competitive Erosion: the vendor improves quarterly from 200+ customer datasets. Your 2-3% edge likely closes within 18 months.
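The retailer case reduces to a break-even comparison (figures from the example):

```python
# Retailer demand-forecasting example: 3-year cost gap vs. value of the edge.
custom_3yr = 280_000 + 6_000 * 36       # build + maintenance = $496K
vendor_3yr = 5_000 * 36                 # $180K
breakeven_gap = custom_3yr - vendor_3yr # custom must deliver $316K extra value

edge_value_3yr = 40_000 * 3             # ~$40K/yr of better Inventory Control
shortfall = breakeven_gap - edge_value_3yr

print(f"gap ${breakeven_gap:,}, edge ${edge_value_3yr:,}, shortfall ${shortfall:,}")
```

And that shortfall assumes the 2-3% edge holds for all three years, which the erosion estimate says it will not.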
Insight: No Data Moat. The proprietary data (your sales history) is not structurally different from what the vendor aggregates at scale. The vendor's data Compounding rate exceeds yours. Buy the service and reallocate that $280K to something where you have a real Informational Advantage.
A Data Moat requires all three properties: proprietary data, a Compounding Feedback Loop, and direct P&L relevance. Missing any one means you have a dataset, not a moat.
The Build, Buy, or Hire decision for models hinges on whether the data underneath is a Capital Asset that appreciates or a Wasting Asset that erodes as vendors improve.
The most dangerous mistake is confusing 'we have data' with 'we have a Data Moat.' Every company has data. Few have data that compounds into a competitive moat competitors cannot purchase.
Building because 'we have the talent' - Having ML engineers is a necessary condition, not a sufficient one. The question is never 'can we build it?' but 'does the data moat justify the Capital Investment?' Smart engineers are a scarce resource with high opportunity cost. Deploying them on a model without a Data Moat means they are not working on a problem where one actually exists.
Ignoring maintenance cost - Custom models are not Capital Investments you make once. They require ongoing retraining, monitoring, data pipeline maintenance, and infrastructure. A model that costs $300K to build often costs $80-120K/year to keep running. If the Data Moat is not real, you have created a recurring Cost Center that delivers declining value as vendors catch up.
Your SaaS company has 50K customer support tickets with resolution data. An engineer proposes building a custom ticket-routing model to replace a vendor charging $2K/month. The custom build is estimated at $150K with $4K/month maintenance. Evaluate whether a Data Moat exists and recommend Build or Buy.
Hint: Apply the three-property test. Is the ticket data proprietary? Does it compound via a Feedback Loop? What happens to the vendor's product over the next 3 years as they serve hundreds of other SaaS companies?
Proprietary: Partially. Your tickets contain company-specific product terminology and resolution patterns. But ticket routing is a well-understood NLP problem - the vendor trains on tickets from hundreds of companies.
Compounding: Weak. New tickets add volume but the routing categories don't get fundamentally richer. The Feedback Loop (route -> resolve -> measure accuracy -> retrain) exists but yields diminishing returns after ~20K examples.
Decision-relevant: Yes, faster routing improves CSAT and reduces resolution time.
Verdict: 1 of 3 properties is strong, 1 is partial, 1 is weak. No Data Moat.
Cost math: 3-year custom = $150K + ($4K x 36) = $294K. 3-year vendor = $2K x 36 = $72K. Custom must deliver $222K in incremental value. At 50K tickets/year, that is 150K tickets over 3 years - roughly $1.48 of extra value per ticket, implausible for a routing improvement.
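The exercise math, assuming 50K tickets per year over the 3-year horizon:

```python
# Ticket-routing exercise: incremental value the custom build must deliver.
custom_3yr = 150_000 + 4_000 * 36              # $294K build + maintenance
vendor_3yr = 2_000 * 36                        # $72K
required_extra_value = custom_3yr - vendor_3yr # $222K over 3 years

tickets_3yr = 50_000 * 3
per_ticket = required_extra_value / tickets_3yr

print(f"custom must add ${per_ticket:.2f} of value per routed ticket")
```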
Recommendation: Buy. Reallocate the $150K and the engineer's time to a problem where your data is truly proprietary.
A logistics company has 5 years of delivery data across 12 cities - exact routes, traffic patterns at specific times, and customer availability windows collected from 8M deliveries. No vendor has this combination of route-level and customer-level data. Estimate whether this qualifies as a Data Moat for a route optimization model, and sketch the ROI framework.
Hint: Think about what makes this data structurally different from what a mapping API provides. Consider the Feedback Loop: does each delivery make the next delivery's prediction better? What is the P&L line item this model would move?
Proprietary: Strong. The combination of route-level timing, customer availability windows, and actual delivery outcomes across 8M deliveries is data no vendor or mapping API possesses. Google Maps knows traffic. You know traffic plus 'Mrs. Rodriguez is never home before 3pm on Tuesdays.'
Compounding: Strong. Each delivery generates a new training example. The Feedback Loop is: predict optimal route -> driver follows route -> measure actual delivery time and success rate -> retrain. More deliveries = more granular time-of-day and customer-level predictions. This is a Compounder.
Decision-relevant: Direct P&L impact. Route optimization reduces fuel costs (variable cost), increases deliveries per driver per day (Throughput), and reduces failed deliveries (Error Cost).
ROI framework:
Value levers: annual fuel savings (variable cost), added deliveries per driver per day (Throughput), and reduced failed-delivery costs (Error Cost), estimated per city.
Costs: build cost plus ongoing maintenance, retraining, and data pipeline upkeep.
Horizon: because the Feedback Loop compounds, model the edge widening over the Time Horizon rather than eroding.
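A hedged sketch of that framework as arithmetic - every dollar figure below is a hypothetical placeholder, not from the source:

```python
# Route-optimization ROI sketch: three annual P&L levers netted against
# build and maintenance cost over the horizon. All inputs are assumptions.
def route_model_roi(fuel_savings, extra_deliveries_value, failed_delivery_savings,
                    build_cost, annual_maintenance, years):
    annual_value = fuel_savings + extra_deliveries_value + failed_delivery_savings
    return annual_value * years - build_cost - annual_maintenance * years

# e.g. $200K fuel, $300K throughput, $100K fewer failed deliveries per year,
# against a $500K build and $100K/yr maintenance, over 3 years
print(f"${route_model_roi(200_000, 300_000, 100_000, 500_000, 100_000, 3):,}")
```

A compounding version would grow the annual value each year as the dataset deepens; the flat version here is the conservative floor.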
Verdict: All three properties are strong. This is a textbook Data Moat. Build.
Data Moat sits at the intersection of two concepts you already know. From competitive moat, you learned that durable advantages come from slow-to-build assets that compound - not from where the spending is most visible. A Data Moat is the specific case where that slow-to-build asset is a proprietary dataset with a Feedback Loop. From Build, Buy, or Hire, you learned the framework for make-or-buy decisions. Data Moat gives you the critical input to that framework when the decision involves models or algorithms: if the data underneath is proprietary and compounding, build. If it is commodity data at a different scale, buy. Downstream, this concept connects to Informational Advantage - because a Data Moat is an Informational Advantage that scales with your operations. It also connects to how you think about Capital Investment and ROI for technical projects: the moat is what turns a depreciating software project into an appreciating Knowledge Asset.
Disclaimer: This content is for educational and informational purposes only and does not constitute financial, investment, tax, or legal advice. It is not a recommendation to buy, sell, or hold any security or financial product. You should consult a qualified financial advisor, tax professional, or attorney before making financial decisions. Past performance is not indicative of future results. The author is not a registered investment advisor, broker-dealer, or financial planner.