Breaking continuous tasks into discrete, measurable subtasks for LLM systems.
LLM systems often fail not because they can’t “think,” but because we ask them to complete a continuous, messy, real-world task without agreeing on the discrete checkpoints that count as progress. Task discretization is the craft of turning “do the thing” into an ordered list of verifiable, measurable subtasks that a system can plan, execute, evaluate, and learn from.
Task discretization defines (1) atomic subtasks τ with explicit input/output and a completion test, (2) a mapping f that converts a continuous trajectory into an ordered sequence of τ identifiers, and (3) observable measurements (metrics/rewards) that certify completion and enable optimization. Done well, it turns ambiguous objectives into DAGs you can topologically sort, instrument, and improve with loss functions and inference.
Real tasks are continuous in at least three senses: they unfold over continuous time, they operate on rich, unbounded state, and their quality of completion is a matter of degree rather than a binary.
LLM-based systems struggle here because their execution model is discrete: they emit tokens, call tools, and produce artifacts. If we don’t discretize the task, we can’t reliably plan it, verify progress, or learn from outcomes.
Task discretization is the process of breaking a continuous task into discrete, measurable subtasks—small enough to verify, but structured enough to recombine into the full task.
You will use three central ideas.
An atomic subtask τ is the smallest unit you treat as indivisible for the purposes of orchestration and verification.
Operationally, τ must have an explicit input contract, an explicit output contract, a decidable completion test, and at least one observable measurement.
Think of τ as a function-like interface: declared inputs go in, a verifiable output artifact comes out.
A key subtlety: atomic does not mean “simple.” It means you choose not to further subdivide it because verification would not improve, or because the cost of decomposition outweighs the benefit.
Let a task unfold as a continuous trajectory. You can model this as a sequence of world states or observations over time: s(t) for t ∈ [0, T].
Task discretization defines a mapping f from the trajectory {s(t) : t ∈ [0, T]} to an ordered sequence (τ₁, τ₂, …, τₙ).
Here, f segments the continuous process into an ordered list of subtask identifiers.
In practice, f is not a single formula; it is a design artifact: rules, schemas, and policies for how you label segments as subtasks.
Each τ should produce an observable measurement—a signal that can be used for verification, scoring, and learning.
This measurement might be a binary pass/fail, a rubric score, test results, or a cost/latency counter.
At difficulty 5, the challenge is not “make a checklist.” It is designing subtasks that are verifiable, recombinable, and resistant to gaming.
This is where your prerequisites connect.
Task discretization is designing a bridge between continuous intent and discrete, verifiable execution.
If you can’t measure progress, you can’t manage it. If you can’t define atomic units, you can’t orchestrate it. If you can’t define f, you can’t consistently map messy reality into actionable steps.
LLM systems are brittle when they operate on large, ambiguous scopes. The typical failure mode is unverifiable drift: the model quietly redefines the task, changes assumptions, and reports progress no one can check.
Atomic subtasks create bounded responsibility: one τ owns one verifiable change.
A high-quality τ can be specified as a quadruple: τ = (I, O, C, M),
where I is the input contract, O is the output contract, C is the completion criterion, and M is the set of measurements.
You can think of C as a predicate: C(output, context) ∈ {true, false},
and M as a vector of measurements: m = (m₁, …, m_k).
Even if you ultimately need a single scalar reward r, keeping m as a vector prevents premature scalarization.
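As a sketch (all names are hypothetical, not a fixed API), the quadruple (I, O, C, M) can be modeled directly, with C as a predicate and M as a vector of measurements:

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

@dataclass
class Subtask:
    """One atomic subtask tau = (I, O, C, M); field names are illustrative."""
    name: str
    inputs: Dict[str, Any]                       # I: explicit input contract
    run: Callable[[Dict[str, Any]], Any]         # produces O, the output artifact
    completed: Callable[[Any], bool]             # C: predicate over the output
    measure: Callable[[Any], Dict[str, float]]   # M: vector of measurements

    def execute(self) -> Tuple[Any, bool, Dict[str, float]]:
        output = self.run(self.inputs)
        return output, self.completed(output), self.measure(output)

# A toy extraction subtask whose completion test is a required-keys check.
tau = Subtask(
    name="extract_fields",
    inputs={"text": "name: Ada, role: engineer"},
    run=lambda i: dict(kv.split(": ") for kv in i["text"].split(", ")),
    completed=lambda o: {"name", "role"} <= set(o),
    measure=lambda o: {"q": 1.0, "c": 0.01},
)
output, done, m = tau.execute()
```

Keeping `measure` separate from `completed` mirrors the point above: the vector m stays available for learning even after the binary completion decision is made.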
If you do not constrain inputs, your system will “solve” tasks by changing assumptions.
Common input constraints: an explicit allowlist of files or data sources, a fixed tool set, a frozen context snapshot, and declared assumptions.
An effective pattern is: minimize the input surface so the model cannot roam.
The output should be something you can store, diff, test, or validate.
Examples of output contracts: a JSON document matching a schema, a patch that applies cleanly, a markdown file with required sections, or a pass/fail report.
The output contract should also specify format stability: if downstream τ expects a field, it must exist.
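A minimal sketch of such a format-stability check; the required fields mirror the metrics table from the worked example later in this document, and the helper name is hypothetical:

```python
def validate_contract(artifact: dict, required: dict) -> list:
    """Return contract violations (empty list means the contract holds).

    `required` maps field name -> expected type.
    """
    violations = []
    for field_name, expected_type in required.items():
        if field_name not in artifact:
            violations.append(f"missing field: {field_name}")
        elif not isinstance(artifact[field_name], expected_type):
            violations.append(f"bad type for {field_name}")
    return violations

# Contract for a metrics row that a downstream tau expects to consume.
contract = {"metric": str, "period": str, "value": float, "source_ref": str}
ok_row = {"metric": "ARR", "period": "Q3", "value": 1200000.0, "source_ref": "metrics.csv"}
bad_row = {"metric": "ARR", "value": "1.2M"}  # missing fields, wrong type
```

Running the check at the τ boundary turns a downstream parse failure into a local, diagnosable completion failure.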
Completion criteria are where discretization becomes engineering.
Good criteria are observable, cheap to evaluate, decisive, and correlated with real success.
Bad criteria are vague (“looks good”), expensive or subjective to check, or easy to game.
A concrete completion criterion might be: “the output JSON validates against the schema AND every row carries a source_ref.”
When you can’t fully decide correctness (common in language tasks), you define proxy validators: rubrics, spot checks, sampled review.
Completion then becomes a probabilistic claim about latent correctness rather than a hard binary test.
Measurements should include at least a quality signal, a safety signal, and a cost signal.
Let m = (q, s, c), where q measures quality/correctness, s measures safety/policy compliance, and c measures cost (time, tokens, tool calls).
Then you might define a scalar reward: r = w_q q + w_s s − w_c c.
But notice the mechanism design problem: if you set w_c too high, the system may avoid necessary work.
You can always split further. The question is: does splitting improve reliability?
A useful test: if this τ fails, can you diagnose why and retry it alone without redoing the rest?
If the answer is “no,” τ is likely too large.
Subtasks rarely form a simple list. Most real work has dependencies.
Represent subtasks as nodes in a DAG: each node is a τ, and a directed edge τᵢ → τⱼ means τⱼ depends on τᵢ’s output.
Then your execution order can be found via topological sort.
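With Python's standard-library graphlib, the idea can be sketched using subtask names borrowed from the worked QBR example later in this document:

```python
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# Each subtask maps to the set of subtasks it depends on.
dag = {
    "tau_extract_metrics":   {"tau_inventory"},
    "tau_extract_narrative": {"tau_inventory"},
    "tau_outline":    {"tau_extract_metrics", "tau_extract_narrative"},
    "tau_fact_check": {"tau_outline"},
    "tau_finalize":   {"tau_fact_check"},
}
order = list(TopologicalSorter(dag).static_order())
# Every dependency appears before its dependents in `order`.
```

Note that the two extraction subtasks are unordered relative to each other; any linear extension of the partial order is valid, which is exactly the scheduling freedom the DAG representation buys you.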
This representation matters because it cleanly separates dependency logic from scheduling.
In MDP terms, an atomic subtask can be treated like an option (a temporally extended action): it has an initiation condition (its inputs are available), an internal policy (the executor), and a termination condition (its completion criterion C).
This helps when you want hierarchical control: a high-level policy selects which τ to run while lower-level policies execute each one.
| Design choice | Too coarse | Too fine | Target |
|---|---|---|---|
| τ scope | hard to verify; ambiguous failure | orchestration overhead; many handoffs | failures are diagnosable and recoverable |
| C strictness | false positives (“done” but wrong) | never completes | cheap, decisive, correlated |
| Measurements m | uninformative learning | noisy + overfitting | minimal set that predicts success |
| Output contract | hard to parse downstream | brittle format | stable schema + clear ownership |
The central skill is to balance verification power against coordination cost.
If you build one checklist per task, you get a brittle system that fails when the task deviates slightly. The mapping f is a generalization: a policy for turning observations into a subtask sequence.
Formally, imagine a trajectory of observations o(t) and states s(t). Discretization produces an indexed sequence: f(o(·), s(·)) = (τ₁, …, τₙ).
But in practice, the trajectory is only partially observed. You see tool outputs, artifacts, logs, and validator results, not the full underlying state.
So f is really a mapping from available evidence to a plan of τ.
Segmentation is deciding boundaries.
A boundary is justified when something verifiable changes: an artifact is produced, a decision is emitted, or a validator can run.
A useful heuristic: boundaries should align with observables. If you can’t observe the boundary, you can’t enforce it.
Many tasks can be partially ordered.
Define a dependency relation ≺ such that: τᵢ ≺ τⱼ whenever τⱼ requires τᵢ’s output (or side effects) to start.
Then a valid execution sequence is any linear extension of this partial order, obtainable by topological sort.
This yields two benefits:
In LLM systems, f often has two layers: a planner that proposes a subtask sequence up front, and a router that picks the next τ from current evidence.
You can implement these with rules, schemas, classifier prompts, or learned policies.
Continuous tasks have branch points: moments where missing information, a failed check, or an unexpected result changes what should happen next.
If f does not model branch points, the system “hallucinates progress.”
Represent branch points as explicit decision subtasks whose output is a discrete label.
Example output: {"decision": "ask", "reason": "metrics export missing for one period"}.
Then downstream dependencies are conditioned on that output.
Measurements are noisy. For many subtasks, you cannot directly observe correctness—only proxies.
Let D be observed evidence (rubric scores, test results, lint warnings, reviewer ratings).
You care about a latent variable: Z = 1 if the subtask outcome is truly correct, else 0.
Bayesian framing: maintain the posterior P(Z = 1 | D) ∝ P(D | Z = 1) P(Z = 1).
This is useful when validators are noisy, coverage is incomplete, or evidence arrives incrementally.
Instead of a hard completion decision, you can maintain a posterior and only advance when P(Z = 1 | D) ≥ 1 − δ
for a chosen risk tolerance δ.
If you define subtasks τᵢ with measurements mᵢ, you can define an overall objective as a sum: G = Σᵢ R(τᵢ),
where R(τᵢ) could be a scalar reward or negative loss.
But beware: additivity is an assumption. Some subtasks interact (coupling). Mechanism design tells you that optimizing local rewards can harm global success.
A safer pattern: keep local rewards, but gate progress on global constraints and end-to-end validators.
Example: a τ rewarded purely for low cost can degrade quality that only a downstream τ observes, so a global check must backstop the local metrics.
When you turn a task into subtasks with metrics, you create a “game” the system plays.
If the metric is gameable, it will be gamed.
Common Goodhart failure: the system maximizes the proxy (claims linked, CI green) by avoiding the underlying work the proxy was meant to measure.
Mechanism design response: couple metrics together, impose minimum-work constraints, and attach vetoes to known shortcuts.
In other words: design measurements so that the cheapest path to a high score is aligned with real success.
A workable discretization mapping often looks like:
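For instance, a rule-based sketch (the rules and τ names are illustrative, not a fixed API):

```python
def f_next_subtask(evidence: dict) -> str:
    """Rule-based discretization mapping: observed evidence -> next subtask id.

    `evidence` aggregates validator outputs from subtasks completed so far.
    """
    if not evidence.get("inventory_done"):
        return "tau_inventory"
    if evidence.get("missing_periods"):            # explicit branch point
        return "tau_ask_user"
    if not evidence.get("metrics_extracted"):
        return "tau_extract_metrics"
    if evidence.get("linked_claim_rate", 0.0) < 1.0:
        return "tau_fact_check"
    return "tau_finalize"
```

Because f reads only validator outputs, every routing decision is reproducible from the trace, which keeps the mapping auditable.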
The mapping f is thus an executable artifact: it turns the continuous process into discrete, monitorable units.
| Approach | Strengths | Weaknesses | When to use |
|---|---|---|---|
| Rule-based f | predictable; auditable; cheap | brittle; coverage gaps | stable domains; compliance-heavy workflows |
| Learned f | adapts; handles variety | harder to debug; needs data | broad task diversity; many examples available |
| Hybrid | best of both if done well | integration complexity | production LLM systems with safety needs |
At difficulty 5, the key is not picking one, but designing interfaces so both can cooperate.
In production, you care about at least four things: reliability, debuggability, optimizability, and scalability.
Task discretization supports all four by making the system observable and testable.
Suppose the continuous task is: “make the requested change and ship it.”
A discretized plan might include τ like extract_spec, generate_patch, run_tests, and summarize_changes.
Each subtask has a validator: a schema check for the spec, test and lint results for the patch, a coverage checklist for the summary.
This turns “ship it” into a pipeline with explicit checkpoints.
If each τ emits its inputs, its output artifact, its completion verdict, and its measurements,
then you can build a trace: an ordered, inspectable record of what ran, what it produced, and why it passed or failed.
This is the foundation for iterating on your system like any other software.
Once you have discrete units, you can learn at multiple levels: which τ to run next (a policy), how good a given output is (a scorer), and how the decomposition itself should change.
Because you have measurements, you can define losses.
Example: for a classifier that predicts the next τ, cross-entropy loss: L = −Σᵢ Σₖ yᵢₖ log pᵢₖ,
where yᵢk is the one-hot target for the chosen τ_k.
Example: for quality scoring regression, MSE: L = (1/n) Σᵢ (ŷᵢ − yᵢ)².
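Both losses can be sketched in plain Python (no framework assumed; the values are illustrative):

```python
import math

def cross_entropy(y_onehot, probs):
    """L = -sum_k y_k log p_k for a single prediction of the next tau."""
    return -sum(y * math.log(max(p, 1e-12)) for y, p in zip(y_onehot, probs))

def mse(y_true, y_pred):
    """Mean squared error for a quality-score regressor."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Target is the second of three candidate subtasks; the model puts 0.7 on it.
loss_ce = cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])   # = -log 0.7
loss_q = mse([0.9, 0.5], [0.8, 0.5])
```

The `max(p, 1e-12)` clamp is a standard numerical guard against log(0) when a predicted probability underflows.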
With subtasks, you can define a higher-level MDP where actions are τ.
Let S be the state summary (repo status, which τ done, key metrics). Let A be available subtasks.
Then policy π chooses the next subtask: τ = π(S), with τ ∈ A.
The transition dynamics depend on executor behavior and environment, but your validators provide reward signals.
This is how “agentic workflows” become amenable to reinforcement learning or bandit optimization.
Many safety failures come from unconstrained action spaces.
Discretization helps by bounding each τ’s input surface, tool access, and blast radius.
Example: a τ that can modify production settings must have a tightly restricted input contract, an explicit approval gate, and a deterministic validator before its output takes effect.
Below are common τ patterns used in LLM systems.
| Pattern | Output | Validator | Purpose |
|---|---|---|---|
| Extract | JSON fields | schema validation | turn text into structured spec |
| Decide | discrete label | consistency checks | branch control |
| Generate | artifact (code/text) | lint/tests/rubric | produce work product |
| Verify | pass/fail report | deterministic checks | guardrails |
| Summarize | structured summary | coverage checklist | state compression |
| Ask | question set | user response present | resolve uncertainty |
This is the core engineering discipline behind scalable agent workflows.
Continuous task: produce a QBR deck from a folder of notes, metrics exports, and email threads. Constraint: must be accurate, cite sources, and be delivered as a structured markdown outline for conversion to slides.
Goal: define atomic subtasks τ, measurements m, and a mapping f that sequences them with branch points.
1) Identify observables and risks
2) Propose atomic subtasks τ with explicit contracts
Let each τ have (I, O, C, M).
τ₁ = τ_inventory
I: folder path; allowed tools: file listing
O: JSON list of files with type, date, guessed relevance
C: JSON validates; every file has {name,type,hash}
M: (coverage = files_listed/total_files)
τ₂ = τ_extract_metrics
I: selected CSV/XLSX files; allowed tools: dataframe parser
O: canonical metrics table with columns {metric, period, value, source_ref}
C: schema validates AND every row has source_ref
M: (q = %rows with source_ref, c = runtime)
τ₃ = τ_extract_narrative
I: email/PDF notes; allowed tools: text extraction
O: bullet list of themes with citations (doc id + snippet)
C: ≥ N themes and each has ≥ 1 citation
M: (q = citation_rate)
τ₄ = τ_outline
I: metrics table + themes
O: markdown outline with required sections
C: headings match template; all charts referenced exist in metrics
M: (q = template_coverage)
τ₅ = τ_fact_check
I: outline + sources
O: report of every numeric claim with its source_ref
C: all numeric claims linked OR flagged
M: (q = linked_claim_rate)
τ₆ = τ_finalize
I: corrected outline + fact check report resolved
O: final markdown
C: linked_claim_rate = 1 AND no critical flags
M: (q = 1, s = policy_ok, c = cost)
3) Build dependency DAG and topological order
Edges: τ_inventory → τ_extract_metrics; τ_inventory → τ_extract_narrative; τ_extract_metrics → τ_outline; τ_extract_narrative → τ_outline; τ_outline → τ_fact_check; τ_fact_check → τ_finalize.
A valid topological sequence is:
(τ_inventory, τ_extract_metrics, τ_extract_narrative, τ_outline, τ_fact_check, τ_finalize)
4) Define mapping f with branch points
If τ_extract_metrics finds missing periods, branch:
If "ask": run τ_ask_user (completion: user provides missing export)
If "estimate": run τ_estimate_with_uncertainty (must output CI and mark as estimate)
So f is conditional: the subtask sequence after τ_extract_metrics depends on the branch decision it emits.
5) Add Bayesian completion for fact check
Let Z = “all numeric claims are correct.” Evidence D includes linked_claim_rate and random spot-check results.
If linked_claim_rate = 1 but spot-check fails, posterior P(Z=1|D) drops.
Advance only when P(Z=1|D) ≥ 1 − δ (choose δ = 0.05 for high-stakes reporting).
6) Mechanism design check (avoid gaming)
If you reward linked_claim_rate alone, the system might avoid numbers.
Mitigation: require minimum number of metrics/charts.
Completion criterion includes: charts_count ≥ K and all linked.
Insight: The discretization succeeds because it makes accuracy measurable (source_ref per claim), adds explicit branch handling for missing data, and prevents reward hacking (avoiding numbers) by coupling “must include metrics” with “must cite metrics.”
Continuous task: CI occasionally fails due to flaky tests. Objective: reduce flake rate while preserving coverage and not masking real failures.
We will define τ, measurements m, and a robust f that routes based on evidence.
1) Define success in observables
We cannot directly observe “true flakiness,” only failure patterns.
Observables: per-test run counts, failure counts, and rerun outcomes from the CI logs.
Define a metric: a flake_score per test, estimated from runs, fails, and rerun_fails.
2) Create atomic subtasks τ
τ₁ = τ_collect_history
I: CI logs datastore
O: table {test_id, runs, fails, rerun_fails}
C: table complete for last N days
M: (q = coverage_of_runs)
τ₂ = τ_rank_suspects
I: history table
O: ranked list with flake_score and confidence
C: list length ≥ 10 and confidence intervals present
M: (q = calibration proxy)
τ₃ = τ_reproduce_locally
I: top suspect test; fixed seed policy
O: reproduction report + environment fingerprint
C: either reproduces OR documents non-repro with ≥ R attempts
M: (c = time, q = attempts)
τ₄ = τ_root_cause_hypothesis
I: reproduction artifacts
O: hypothesis label {race, time, network, order, randomness} + evidence links
C: evidence links ≥ 2
M: (q = evidence_count)
τ₅ = τ_fix
I: hypothesis + code access (restricted to test module)
O: patch
C: patch applies cleanly; unit tests for module pass
M: (q = tests_pass)
τ₆ = τ_validate_in_ci
I: patch + CI runner
O: CI run results across M reruns
C: failure rate ≤ ε AND runtime increase ≤ Δ
M: (q = 1 − failure_rate, c = runtime)
τ₇ = τ_guard_against_masking
I: patch
O: analysis: does it reduce assertions / skip?
C: no new skips; assertion count not decreased beyond threshold
M: (s = masking_risk)
3) Define dependency structure
τ_collect_history → τ_rank_suspects → τ_reproduce_locally → τ_root_cause_hypothesis → τ_fix
Then τ_fix → {τ_validate_in_ci, τ_guard_against_masking} and both must pass.
4) Define Bayesian inference for “is it truly fixed?”
Let Z = “test is non-flaky under distribution of CI conditions.”
Evidence D: M reruns with k failures.
Assume a Beta prior for the failure probability p: p ~ Beta(α, β).
Posterior: p | D ~ Beta(α + k, β + M − k).
We accept the fix if: P(p ≤ ε | D) ≥ 1 − δ.
This is computable from the Beta CDF.
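A sketch of this acceptance rule; it assumes a Beta(1, 1) prior by default and approximates the Beta CDF by midpoint integration (in practice, scipy.stats.beta.cdf would replace the helper):

```python
import math

def beta_cdf(x, a, b, steps=100_000):
    """P(p <= x) for p ~ Beta(a, b), via midpoint-rule integration of the pdf."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    h = x / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h  # midpoints avoid the pdf's endpoints
        total += math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t))
    return total * h

def accept_fix(k_failures, m_reruns, eps=0.05, delta=0.05, alpha=1.0, beta=1.0):
    """Accept when P(p <= eps | D) >= 1 - delta under the Beta posterior."""
    return beta_cdf(eps, alpha + k_failures, beta + m_reruns - k_failures) >= 1 - delta
```

With 0 failures in 100 reruns the posterior is Beta(1, 101) and the fix is accepted at ε = δ = 0.05; with 5 failures in 100 reruns it is rejected, because the posterior mass below ε is small.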
5) Mechanism design: prevent trivial fixes
If the reward is “CI passes,” the system might mark tests as xfail or skip.
So we add τ_guard_against_masking with strict completion criteria.
We also define a global constraint: total test and assertion counts must not decrease across the patch.
6) Define mapping f (router behavior)
If τ_reproduce_locally fails to reproduce after R attempts, f branches, for example back to τ_rank_suspects with the non-repro evidence, or to collecting more CI history.
So f is a conditional policy over τ based on evidence quality.
Insight: This discretization prevents the most common failure mode—‘fixing’ flakes by weakening tests—by explicitly measuring masking risk and treating “fixed” as a Bayesian claim about future failure probability, not a single green CI run.
You have a subtask τ that produces measurements m = (q, s, c) with q,s ∈ [0,1] and c ≥ 0. You want a scalar reward r that (a) prioritizes correctness, (b) strongly penalizes safety violations, (c) mildly penalizes cost, and (d) is bounded to stabilize learning.
1) Start with a linear reward (unbounded in cost)
r₀ = w_q q + w_s s − w_c c
Problem: as c grows, r₀ → −∞, which can destabilize optimization.
2) Bound the cost penalty with a saturating transform
Use log(1 + c): grows slowly.
Define:
r₁ = w_q q + w_s s − w_c log(1 + c)
Now the penalty increases sublinearly.
3) Add a hard safety veto (mechanism design)
If safety is violated, we want r to collapse regardless of q.
Let v be a binary safety violation indicator (from a classifier/validator):
Define:
r₂ = (1 − v) · r₁ − v · P
where P > 0 is a large penalty.
4) Bound the total reward to [−1, 1] for stability
Apply tanh:
r = tanh(r₂)
Now r ∈ (−1, 1).
5) Summarize final form
r = tanh((1 − v)(w_q q + w_s s − w_c log(1 + c)) − vP)
Interpretation: quality and safety raise the reward, cost drags it down only logarithmically, and any safety violation (v = 1) forces the reward toward −1 regardless of q.
Insight: Designing r is not just math; it is mechanism design. The ‘veto’ term prevents the agent from trading safety for quality, which linear weights often allow.
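The final form above can be sketched as follows; the weights and penalty size are illustrative choices, not prescribed values:

```python
import math

def reward(q, s, c, v, w_q=0.6, w_s=0.3, w_c=0.1, penalty=5.0):
    """Bounded reward with a hard safety veto.

    q, s in [0, 1]; c >= 0 is cost; v in {0, 1} flags a safety violation.
    """
    r1 = w_q * q + w_s * s - w_c * math.log1p(c)  # saturating cost penalty
    r2 = (1 - v) * r1 - v * penalty               # veto: violations collapse the reward
    return math.tanh(r2)                          # bound to (-1, 1) for stable learning
```

A perfect, safe, zero-cost run scores tanh(0.9) ≈ 0.72, while any safety violation drives the reward below −0.99 regardless of quality, which is exactly the veto behavior the derivation calls for.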
Task discretization converts continuous intent into discrete, verifiable subtasks τ connected by an ordering (often a DAG).
An atomic subtask τ must have explicit input/output contracts and a completion criterion C that is observable and reasonably decidable.
Each τ should emit measurements m (often a vector) so you can verify, score, and learn without collapsing everything into one fragile metric too early.
The discretization mapping f is a policy/artifact that segments trajectories into (τ₁, …, τₙ) and must explicitly represent branch points and uncertainty.
Topological sorting of a τ-DAG separates dependency logic from scheduling and enables parallelism and robust recovery.
Bayesian inference helps you treat completion as probabilistic when validators are noisy or coverage is incomplete.
Mechanism design is central: any metric you introduce becomes an incentive, so you must anticipate and block reward hacking (Goodhart’s law).
Well-discretized tasks become easier to instrument, debug, optimize with losses/rewards, and scale across agents/tools.
Defining τ without a real validator: if completion cannot be checked, you have not discretized—only renamed the ambiguity.
Over-splitting into tiny τ that create orchestration overhead and brittle handoffs, increasing failure probability at interfaces.
Using a single proxy metric as the objective (e.g., “CI green” or “#tests”) and getting gamed; failing to add constraints/vetoes.
Treating the plan as a fixed list: ignoring branch points (missing info, failed checks) and thereby forcing the system to hallucinate progress.
You are building an LLM system to “review a pull request for security issues.” Propose 6–8 atomic subtasks τ with (I, O, C) and at least one measurement each. Include at least one branch point and explain how f routes at that point.
Hint: Think in layers: inventory → threat model → scanning → manual reasoning → report → fix suggestions. Make sure each τ outputs an artifact that can be validated (schema, scan result thresholds, checklist completeness).
One possible discretization:
τ₁ inventory_files
I: repo path + PR diff; tools: git diff
O: JSON list of changed files with language + risk tag
C: schema valid; all diff files listed
M: coverage
τ₂ identify_attack_surface
I: inventory
O: structured list of entrypoints (HTTP handlers, auth, crypto)
C: at least one entrypoint or explicit “none found” with justification
M: rubric score
τ₃ run_static_scan
I: repo; tool: SAST
O: scan report with severities
C: report parsed; no tool errors
M: count(P0), count(P1)
τ₄ dependency_audit
I: lockfiles
O: CVE list with package + version + severity
C: parsed; sources linked
M: count(high)
τ₅ deep_review_high_risk
I: files tagged high risk
O: findings list with code references
C: each finding includes file:line and exploit narrative
M: evidence_count
Branch point τ₆ decide_blocking
I: findings + scan counts
O: decision {block, warn, approve} + rationale
C: decision ∈ set; rationale present
M: consistency check with policy
Routing f: if count(P0) > 0 or a high-severity CVE is present, route to block; if only lower-severity findings remain, route to warn; otherwise approve.
Validators: schema checks, policy thresholds (e.g., P0 must block), and completeness rubrics.
Consider a subtask τ where the only available validator is a noisy human rating (0–5). Design a Bayesian completion rule using a prior and repeated ratings. State clearly what latent variable you infer and what threshold you use to declare completion.
Hint: Map the 0–5 rating to a binary ‘acceptable’ event or to a probabilistic model. Use a Beta-Binomial model for acceptability, or a Gaussian model for scores if you prefer. Then define P(Z=1|D) ≥ 1−δ.
One solution (binary acceptability):
Latent variable: p, the probability that a random rater marks the output acceptable.
Observation model: map each 0–5 rating to a binary “acceptable” event (e.g., rating ≥ 4), treated as an independent Bernoulli(p) draw.
Prior: p ~ Beta(α, β), e.g., Beta(1, 1) when you have no prior information.
Collect n ratings with k acceptables.
Posterior: p | D ~ Beta(α + k, β + n − k).
Completion rule: declare τ complete when P(p ≥ p_min | D) ≥ 1 − δ.
Example: p_min=0.8, δ=0.05.
Interpretation: completion is declared only when the posterior gives at least 1 − δ confidence that the acceptability rate exceeds p_min; noisier ratings simply require more evidence before advancing.
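A sketch under stated assumptions: ratings of 4 or 5 count as “acceptable”, the prior is Beta(1, 1), and the Beta CDF is approximated by midpoint integration (scipy.stats.beta.cdf would replace the helper in practice):

```python
import math

def beta_cdf(x, a, b, steps=100_000):
    """P(p <= x) for p ~ Beta(a, b), via midpoint-rule integration of the pdf."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    h = x / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t))
    return total * h

def is_complete(ratings, p_min=0.8, delta=0.05, accept_at=4):
    """Declare tau complete when P(p >= p_min | ratings) >= 1 - delta."""
    n = len(ratings)
    k = sum(1 for r in ratings if r >= accept_at)   # binary "acceptable" events
    posterior_mass = 1 - beta_cdf(p_min, 1 + k, 1 + n - k)
    return posterior_mass >= 1 - delta
```

For example, 19 acceptable ratings out of 20 give P(p ≥ 0.8 | D) ≈ 0.94, still short of the 0.95 bar, while 29 out of 30 clears it: more evidence, not a better single score, is what advances the subtask.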
You discretize a task into τ₁…τₙ and define local rewards rᵢ. Give an example where maximizing ∑ᵢ rᵢ harms the global objective G. Then propose a constraint or redesign of measurements to fix it.
Hint: Use a Goodhart example: optimizing speed/cost per τ reduces quality; optimizing ‘#issues found’ creates false positives; optimizing ‘short answers’ harms completeness. Add a global veto, couple metrics, or redesign validators.
Example: Code review agent.
Local reward rᵢ for τ_find_issues is proportional to number of issues reported.
Maximizing ∑ rᵢ causes the agent to flood the PR with trivial nitpicks (global objective G = developer usefulness drops).
Fix: reward only findings confirmed as high-signal (e.g., accepted by reviewers or meeting a severity rubric), and penalize rejected nitpicks.
This aligns incentives so high reward requires high-signal findings.