Most agent workflows fail in the same boring way: the UI says Running, the trace viewer shows something else, and your team is stuck guessing whether anything is actually happening. That’s not an “observability” problem—it’s a workflow run state problem.
If you’re building with Claude Skills (or any tool-using LLM agent), you’re probably shipping multi-step automations that can pause for OAuth, wait on humans, retry tools, and survive refreshes. The minute your product depends on that kind of long-lived execution, status stops being a label and becomes a contract.
This post lays out a practical run-state contract for agentic workflows—what “running” must mean, which states you need, what belongs in your database vs traces, and how to kill the “two sources of truth” anti-pattern for good.
The real problem: why agent workflows lie about status
Classic workflow engines had it easy: steps are deterministic, workers are predictable, and “running” mostly means “a worker is executing.” Agent workflows aren’t like that.
Agentic workflows commonly include:
- Tool calls with variable latency (search, CRM APIs, EHR portals, Google Drive)
- OAuth reconnects and expiring tokens
- Human-in-the-loop gates (approval, “send this email?”, “confirm this lead list?”)
- Retries (rate limits, transient 500s) and idempotency concerns
- Refresh/reconnect (the user reloads the page mid-run, or closes their laptop)
- Async “wait” states that might last minutes, hours, or days
In that environment, teams often end up with two status systems:
- DB state (fast, product-owned, but often too coarse)
- Tracing/observability state (rich, but slow, eventual, and not product-authoritative)
The result is status drift:
- DB says `running` because you never transitioned out.
- Traces show no spans for 10 minutes because ingestion is delayed.
- Queue shows the job was picked up, then the worker crashed.
- The UI keeps spinning because it doesn’t know what else to do.
A healthy system makes a stronger promise:
The product has exactly one authoritative answer to: “What is the workflow doing right now, and why?”
That promise is the run-state contract.
Define the workflow run state contract (before you write code)
A run-state contract is not “some enum values.” It’s a set of guarantees that every layer of your system relies on: UI, workers, support tooling, alerting, and (eventually) customers.
Who consumes workflow run state?
Be explicit. The consumers have different needs:
- End users / operators: “Can I trust this automation? What do I do next?”
- Support / on-call: “Is it stuck? Is it safe to retry? What broke?”
- Engineers: “Which step failed? What inputs caused it? Can we reproduce?”
- Sales / RevOps owners: “What’s the SLA? How often does auth block runs?”
A single “RUNNING” label cannot satisfy all of them.
What your workflow run state must guarantee
Here’s a contract we’ve found to be both strict and workable for agent workflows:
- Authoritative: There is one system-of-record for “current state.”
- Monotonic attempts: Attempts can increase; an attempt never rewrites history.
- Timestamped transitions: Every transition has an `at` time (and ideally an actor).
- Explainable blocking: If the run is not progressing, it must say why.
- Recoverable: States must support resume after refresh/reconnect.
- Validatable: Invalid transitions are rejected by code (not “handled later”).
- Composable: The contract works for single-step, multi-step, and nested runs.
What your workflow run state must not do
Don’t try to encode everything.
- Don’t mirror your traces. Traces are for high-cardinality debugging details.
- Don’t embed infrastructure implementation. “Kafka partition 12 lagging” is not a product state.
- Don’t pack UI copy into the state enum. Store `blocking_reason` + evidence; generate copy in the UI.
Treat your run-state model as a product API: stable, documented, testable.
A reference state machine for workflow run state (agentic edition)
You want states that are:
- User-meaningful (operators can take action)
- Worker-actionable (workers know what to do)
- Support-debuggable (on-call can triage)
Here’s a practical baseline.
Core states
Non-terminal:
- `queued`: accepted, not yet executing
- `running`: actively executing a step under a worker lease
- `waiting_on_tool`: paused for an async tool response or external callback
- `waiting_on_auth`: blocked on OAuth/token reconnect
- `waiting_on_approval`: blocked on human input
- `retry_scheduled`: paused until `next_retry_at`
- `stalled`: expected progress did not occur within a freshness window
- `cancel_requested`: user/system requested cancel; the worker must comply
Terminal:
- `succeeded`
- `failed`
- `canceled`
- `completed_with_warnings`
A note on stalled: it’s not “failed.” It’s “we can’t prove liveness.” It’s where you route humans and automation to recover.
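The state list above maps naturally onto a small enum. Here is a minimal Python sketch (state names come from this post; the `is_terminal` helper is an illustrative addition, not part of the contract):

```python
from enum import Enum

class RunState(str, Enum):
    # Non-terminal
    QUEUED = "queued"
    RUNNING = "running"
    WAITING_ON_TOOL = "waiting_on_tool"
    WAITING_ON_AUTH = "waiting_on_auth"
    WAITING_ON_APPROVAL = "waiting_on_approval"
    RETRY_SCHEDULED = "retry_scheduled"
    STALLED = "stalled"
    CANCEL_REQUESTED = "cancel_requested"
    # Terminal
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELED = "canceled"
    COMPLETED_WITH_WARNINGS = "completed_with_warnings"

TERMINAL = {
    RunState.SUCCEEDED,
    RunState.FAILED,
    RunState.CANCELED,
    RunState.COMPLETED_WITH_WARNINGS,
}

def is_terminal(state: RunState) -> bool:
    """Terminal runs never transition again."""
    return state in TERMINAL
```

Using `str` as a mixin keeps the enum directly serializable to the string values stored in your database.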
Required fields (minimum viable contract)
At minimum, store these on your run record:
- `run_id` (stable)
- `workflow_id` / `workflow_version`
- `state` (current)
- `attempt` (int)
- `step_id` (where you are)
- `updated_at` (last state change)
- `last_heartbeat_at` (liveness)
- `lease_owner` (which worker owns execution)
- `blocking_reason` (structured)
- `next_retry_at` (nullable)
- `evidence_links` (array of URLs/IDs to traces/logs)
That seems like a lot until you’ve debugged “running” at 2am.
Example: run record schema (JSON)
{
"run_id": "run_01J...",
"workflow_id": "wf_healthcare_news_scan",
"workflow_version": 7,
"state": "waiting_on_auth",
"attempt": 2,
"step_id": "google_drive:list_folder",
"blocking_reason": {
"type": "oauth_reconnect_required",
"tool": "google_drive",
"message": "Reconnect Google Drive to continue."
},
"lease_owner": null,
"last_heartbeat_at": "2026-02-22T18:41:03Z",
"next_retry_at": null,
"updated_at": "2026-02-22T18:41:04Z",
"evidence_links": [
{"type": "trace", "ref": "trace_9c2b..."},
{"type": "log", "ref": "logrun://run_01J..."}
]
}
Notice what’s not there: token counts, prompt text, per-span timing. That belongs in traces.
Single source of truth: what belongs in the DB vs what belongs in traces
If you want a single source of truth for workflow run state, you must separate concerns.
The DB is authoritative for lifecycle and blocking
Your database should own:
- Current state (and why)
- Transition history (append-only)
- Attempt number + retry schedule
- Heartbeat/lease ownership
- Pointers to evidence (trace IDs, log IDs)
This makes the DB fast, queryable, and stable.
Traces are authoritative for high-cardinality debugging
Tracing/observability tooling should own:
- Span timing and nested call graphs
- Tool call inputs/outputs (with redaction)
- Token usage, model calls, prompt templates
- Detailed errors and stack traces
- “Breadcrumbs” for engineers
Traces are great for explanations, but terrible as product truth because they’re:
- Eventual (ingestion delay)
- Incomplete (sampling, dropped spans)
- Not normalized for product queries
- Not designed for “what should the user do now?”
Link them with stable IDs, not duplicated logic
The key rule:
The UI renders the DB state. The DB state links to traces.
Don’t compute state by parsing trace spans. That’s how you rebuild “two sources of truth,” just with extra latency.
Eliminating “two sources of truth” with an event model
A reliable pattern is:
- Append-only transition log (source of truth)
- Materialized current state (fast reads)
- Worker leases + heartbeats (liveness)
- Traces as evidence (debugging)
Table design: run_events + runs
You can implement this with any database. Here’s a concrete Postgres-ish sketch.
-- Current state (fast reads)
create table runs (
run_id text primary key,
workflow_id text not null,
workflow_version int not null,
state text not null,
attempt int not null default 1,
step_id text,
blocking_reason jsonb,
lease_owner text,
lease_expires_at timestamptz,
last_heartbeat_at timestamptz,
next_retry_at timestamptz,
updated_at timestamptz not null default now()
);
-- Append-only state transitions (audit + rebuild)
create table run_events (
event_id bigserial primary key,
run_id text not null references runs(run_id),
at timestamptz not null default now(),
actor_type text not null, -- worker|system|user
actor_id text,
from_state text,
to_state text not null,
step_id text,
attempt int not null,
payload jsonb -- error codes, tool names, etc.
);
create index on run_events(run_id, event_id);
The contract is enforced by transitions, not by “whatever the UI last saw.”
Why append-only matters for agent workflows
Agent workflows have messy realities:
- A step partially executed before a crash.
- A tool call succeeded, but the confirmation write failed.
- A worker lease expired and another worker took over.
When your only record is “current status,” you lose the story. With an append-only log, you can:
- Explain runs to users (“blocked on approval since 14:32”)
- Debug retries safely (“attempt 3 started after rate limit backoff”)
- Rebuild materialized state if you change your model
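That last point, rebuilding materialized state, can be a left fold over the ordered log. A minimal in-memory sketch (event fields follow the `run_events` table above; real code would stream rows from the database ordered by `event_id`):

```python
def rebuild_run(events: list[dict]) -> dict:
    """Fold an ordered append-only event log into the current run row."""
    run = {"state": None, "attempt": 0, "step_id": None, "updated_at": None}
    for ev in events:  # must be ordered by event_id
        run["state"] = ev["to_state"]
        run["attempt"] = ev["attempt"]
        run["step_id"] = ev.get("step_id", run["step_id"])
        run["updated_at"] = ev["at"]
    return run
```

Because the log is the source of truth, you can change the materialized schema and replay the same events to repopulate it.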
Workflow run state and worker liveness: leases, heartbeats, and stalls
A run can be “running” only if you can prove a worker currently owns it.
Use a lease to prove ownership
A lease is a time-bound claim on the right to mutate state.
- Worker tries to acquire lease: `lease_owner = worker_123`, `lease_expires_at = now()+30s`
- Worker periodically heartbeats: extends lease + updates `last_heartbeat_at`
- If worker dies: lease expires; another worker can acquire it
Heartbeats prevent “forever running”
Without heartbeats, running is a lie. With them, you can implement a crisp rule:
- If `state == running` and `last_heartbeat_at < now() - freshness_window`, transition to `stalled`.
That creates a product action point:
- auto-retry if safe
- page on-call if needed
- prompt user to reconnect/auth/approve
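A background sweeper can apply that freshness rule periodically. Here is a sketch (the five-minute window, field names, and `sweep` helper are illustrative assumptions; tune the window per workflow):

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_WINDOW = timedelta(minutes=5)  # illustrative; tune per workflow

def detect_stall(run: dict, now: datetime) -> bool:
    """True if a 'running' run has not heartbeated within the freshness window."""
    if run["state"] != "running":
        return False  # waiting states are not stalls; they have a blocking_reason
    last = run["last_heartbeat_at"]
    return last is None or (now - last) > FRESHNESS_WINDOW

def sweep(runs: list[dict], now: datetime) -> list[dict]:
    """Return the runs that should transition running -> stalled."""
    return [r for r in runs if detect_stall(r, now)]
```

The sweeper only selects candidates; the actual `running -> stalled` write should still go through your validated transition path.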
Pseudocode: acquiring a lease safely
// TypeScript-ish pseudocode
async function tryAcquireLease(runId: string, workerId: string) {
const now = new Date();
const leaseMs = 30_000;
// Single atomic update: acquire only if no lease or expired
const updated = await db.exec(`
update runs
set lease_owner = $2,
lease_expires_at = $3,
last_heartbeat_at = $1,
updated_at = $1
where run_id = $4
and state in ('queued','running','retry_scheduled')
and (lease_expires_at is null or lease_expires_at < $1)
returning run_id;
`, [now, workerId, new Date(now.getTime() + leaseMs), runId]);
return updated.rowCount === 1;
}
If you can’t enforce leases atomically, your “single source of truth” collapses under contention.
Transition rules: make invalid states impossible
The fastest path to status chaos is letting any component set any state.
Instead:
- Centralize transitions in one module/service
- Validate `from_state -> to_state` rules
- Require fields for certain transitions (e.g., `next_retry_at`)
Example: transition validation
# Python-ish pseudocode
ALLOWED = {
"queued": {"running", "canceled"},
"running": {
"waiting_on_tool",
"waiting_on_auth",
"waiting_on_approval",
"retry_scheduled",
"succeeded",
"failed",
"cancel_requested",
"completed_with_warnings",
"stalled",
},
"waiting_on_auth": {"queued", "running", "canceled"},
"waiting_on_approval": {"running", "canceled"},
"waiting_on_tool": {"running", "retry_scheduled", "failed"},
"retry_scheduled": {"queued", "running", "canceled"},
"stalled": {"queued", "running", "failed", "canceled"},
"cancel_requested": {"canceled", "failed"},
"succeeded": set(),
"failed": set(),
"canceled": set(),
"completed_with_warnings": set(),
}
def transition(run, to_state, *, payload=None):
if to_state not in ALLOWED[run.state]:
raise ValueError(f"Invalid transition {run.state} -> {to_state}")
if to_state == "retry_scheduled" and not run.next_retry_at:
raise ValueError("retry_scheduled requires next_retry_at")
# write event + update materialized state in one transaction
This looks strict, but it’s kinder than letting UI + workers “figure it out.”
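The final comment in the snippet above (write the event and update the materialized state in one transaction) can be sketched with stdlib `sqlite3`; columns are simplified here, and a production version would target the Postgres tables defined earlier and include the transition validation:

```python
import sqlite3

def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute("create table runs (run_id text primary key, state text, attempt int)")
    db.execute("""create table run_events (
        event_id integer primary key autoincrement,
        run_id text, from_state text, to_state text, attempt int)""")
    return db

def transition(db: sqlite3.Connection, run_id: str, to_state: str) -> None:
    """Append the event and update the materialized row atomically."""
    with db:  # one transaction: both writes commit, or neither does
        from_state, attempt = db.execute(
            "select state, attempt from runs where run_id = ?", (run_id,)
        ).fetchone()
        db.execute(
            "insert into run_events (run_id, from_state, to_state, attempt) values (?,?,?,?)",
            (run_id, from_state, to_state, attempt),
        )
        db.execute("update runs set state = ? where run_id = ?", (to_state, run_id))
```

If the event insert succeeds but the `runs` update fails, the whole transaction rolls back, so the log and the materialized row can never disagree.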
UI semantics: show state + reason + freshness (not just state)
Even with perfect states, UX fails if you don’t show why.
A robust UI model is:
- State: `waiting_on_auth`
- Reason: `oauth_reconnect_required` + tool name
- Freshness: `last_heartbeat_at` / `updated_at`
So instead of a spinner, you can show:
- “Waiting on auth: Reconnect Google Drive to continue.”
- “Retry scheduled: Next attempt at 14:32.”
- “Stalled: No heartbeat for 5 minutes. Retry now?”
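Generating that copy from `state` + `blocking_reason` in the UI layer (rather than baking strings into the enum) can look like this sketch; the message strings and fallbacks are illustrative:

```python
def render_status(run: dict) -> str:
    """Map structured run state to user-facing copy, in the UI layer."""
    state = run["state"]
    reason = run.get("blocking_reason") or {}
    if state == "waiting_on_auth":
        return f"Waiting on auth: {reason.get('message', 'Reconnect to continue.')}"
    if state == "retry_scheduled":
        return f"Retry scheduled: next attempt at {run['next_retry_at']}."
    if state == "stalled":
        return "Stalled: no recent heartbeat. Retry now?"
    # Default: humanize the state name
    return state.replace("_", " ").capitalize()
```

Changing the wording is now a UI deploy, not a state-model migration.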
Example: API response shape
{
"run_id": "run_01J...",
"state": "retry_scheduled",
"step_id": "news:fetch_feeds",
"attempt": 3,
"updated_at": "2026-02-22T18:55:12Z",
"last_heartbeat_at": "2026-02-22T18:55:11Z",
"next_retry_at": "2026-02-22T18:57:12Z",
"blocking_reason": {
"type": "rate_limited",
"tool": "web",
"message": "Backoff after 429 from publisher feed."
},
"evidence_links": [{"type": "trace", "ref": "trace_..."}]
}
This is also where Claude Skills builders win: your “skill” becomes operational software, not a demo.
Edge cases that break naive workflow run state designs (and how to handle them)
1) Refresh mid-run (the “it reset” problem)
If refresh changes anything besides what you’re viewing, you’ve coupled UI to execution.
Design rule:
- Refresh must only re-fetch run state (`GET /runs/{id}`)
- The workflow continues independently under leases
If a refresh accidentally re-triggers steps, you have an idempotency failure, not a UI problem.
2) OAuth expires mid-run
Token expiry is not a generic error—it’s a blocking state.
Transition pattern:
- tool call fails with auth error
- transition `running -> waiting_on_auth`
- store `blocking_reason = oauth_reconnect_required`
- pause execution until the reconnect event arrives
- on reconnect: transition `waiting_on_auth -> queued` (or `running` if a lease exists)
This is one of the biggest differences between agent workflows and “single request” skills.
3) Human approval with timeouts
Don’t represent this as “running.” It’s waiting.
- `running -> waiting_on_approval`
- store who needs to approve and the deadline
- if the deadline passes: transition to `failed` or `completed_with_warnings`, depending on semantics
4) Retries without double side effects (idempotency keys)
If a step has side effects (send email, create CRM lead, write to Drive), retries must be safe.
Practical pattern:
- Every side-effecting step uses an idempotency key derived from `(run_id, step_id, attempt_group)`
- Tool adapters store an `idempotency_key -> external_id` mapping
- On retry, the adapter checks and returns the existing external result
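This pattern can be sketched as follows; the `ToolAdapter` class and in-memory mapping are hypothetical stand-ins for your real adapter and a durable store:

```python
import hashlib

def idempotency_key(run_id: str, step_id: str, attempt_group: int) -> str:
    """Stable key: the same run/step/attempt-group always yields the same key."""
    raw = f"{run_id}:{step_id}:{attempt_group}"
    return hashlib.sha256(raw.encode()).hexdigest()[:32]

class ToolAdapter:
    """Wraps a side-effecting tool call with an idempotency check."""
    def __init__(self, send):
        self._send = send                 # the real side effect (email, CRM write, ...)
        self._seen: dict[str, str] = {}   # idempotency_key -> external_id (durable in production)

    def call(self, key: str, payload: dict) -> str:
        if key in self._seen:             # retry: return prior result, no duplicate side effect
            return self._seen[key]
        external_id = self._send(payload)
        self._seen[key] = external_id
        return external_id
```

On retry the worker derives the same key, so the adapter short-circuits and the email is sent exactly once.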
5) Partial reruns and checkpoints
You will eventually want “rerun from step 5.” That’s where many systems accidentally rewrite history.
Treat reruns as:
- new attempt (or new run) with explicit linkage
- append events that record checkpoint selection
- keep the old run immutable for audit
Observability without status drift: make traces evidence, not truth
A good compromise that avoids “two sources of truth” is:
- DB state drives UI and alerts
- traces provide evidence for that state
Minimal “evidence links” contract
Every time you:
- start a step
- finish a step
- schedule a retry
- enter a waiting state
- hit a terminal state
…attach an evidence_link to:
- a trace id
- a log correlation id
- a support bundle id
That gives support a one-click path from “Waiting on tool” to the exact tool call spans—without asking the trace system to decide state.
Operational playbook: the metrics that actually matter
Once your workflow run state is a contract, you can measure reliability in a way that maps to customer pain.
Track (by workflow, by tenant, by tool):
- Time-in-state percentiles (especially `waiting_on_auth`, `waiting_on_tool`, `stalled`)
- Auth-blocked rate (% of runs that enter `waiting_on_auth`)
- Retry rate and retry success rate
- Stall rate (runs entering `stalled` per 1k runs)
- Completed-with-warnings rate (your “it worked but…” reality)
And—critically—treat “stalled” as a first-class incident funnel.
“Stalled” triage checklist
When a run is stalled, you should be able to answer these quickly:
- Did the worker lose its lease (crash, deploy, network)?
- Is the workflow actually waiting on an external system (auth/tool/human) but misclassified?
- Did we schedule a retry but fail to persist `next_retry_at`?
- Are we blocked on a queue backlog or worker capacity issue?
- Is there a poison-pill input causing repeated failure?
A strong contract makes these questions queryable.
Implementation checklist (copy/paste)
Use this as a punch-list when you implement or refactor.
Workflow run state model
- Enumerate non-terminal and terminal states
- Define transition rules (and unit tests)
- Define required fields per state (`next_retry_at`, `blocking_reason`, etc.)
- Define liveness rules (`last_heartbeat_at` freshness windows)
Persistence + concurrency
- Store current state in `runs`
- Store append-only transitions in `run_events`
- Update both in one transaction
- Implement leases with atomic compare-and-set
APIs
- `GET /runs/{id}` (current state)
- `GET /runs/{id}/events` (history)
- `POST /runs/{id}/cancel` (sets `cancel_requested`)
- `POST /runs/{id}/approve` (consumes approval payload)
- `POST /runs/{id}/reconnect` (auth reconnect callback)
UI semantics
- Render `state + blocking_reason + updated_at`
- Show clear “next action” buttons for waiting states
- Show “freshness” (last heartbeat) for running/stalled
Observability
- Attach `run_id` and `step_id` to every trace span
- Store `evidence_links` on transitions
- Never derive product state by parsing spans
Why this matters for Claude Skills users (and where nNode fits)
Claude Skills make it easy to build a tool-using capability. The hard part is turning that capability into something teams can operate:
- It should survive refresh.
- It should pause for OAuth and resume cleanly.
- It should explain what it’s doing and what it needs.
- It should retry safely without duplicate side effects.
That’s exactly the “workflow-first” mindset we’re building toward at nNode: durable, multi-step agentic automation where run-state semantics are a product feature—not an afterthought.
If you’re building agent workflows and you’re tired of “it says running but nothing is happening,” you’re already feeling the need for a run-state contract and a real single source of truth.
If you want to see how we think about dependable agentic execution (and where we’re heading with the “no-parsing” mission), take a look at nnode.ai.