Most agent workflows fail in the same boring way: the UI says Running, the trace viewer shows something else, and your team is stuck guessing whether anything is actually happening. That’s not an “observability” problem—it’s a workflow run state problem.
If you’re building with Claude Skills (or any tool-using LLM agent), you’re probably shipping multi-step automations that can pause for OAuth, wait on humans, retry tools, and survive refreshes. The minute your product depends on that kind of long-lived execution, status stops being a label and becomes a contract.
This post lays out a practical run-state contract for agentic workflows—what “running” must mean, which states you need, what belongs in your database vs traces, and how to kill the “two sources of truth” anti-pattern for good.
The real problem: why agent workflows lie about status
Classic workflow engines had it easy: steps are deterministic, workers are predictable, and “running” mostly means “a worker is executing.” Agent workflows aren’t like that.
Agentic workflows commonly include:
- Tool calls with variable latency (search, CRM APIs, EHR portals, Google Drive)
- OAuth reconnects and expiring tokens
- Human-in-the-loop gates (approval, “send this email?”, “confirm this lead list?”)
- Retries (rate limits, transient 500s) and idempotency concerns
- Refresh/reconnect (the user reloads the page mid-run, or closes their laptop)
- Async “wait” states that might last minutes, hours, or days
In that environment, teams often end up with two status systems:
- DB state (fast, product-owned, but often too coarse)
- Tracing/observability state (rich, but slow, eventual, and not product-authoritative)
The result is status drift:
- DB says `running` because you never transitioned out.
- Traces show no spans for 10 minutes because ingestion is delayed.
- Queue shows the job was picked up, then the worker crashed.
- The UI keeps spinning because it doesn’t know what else to do.
A healthy system makes a stronger promise:
The product has exactly one authoritative answer to: “What is the workflow doing right now, and why?”
That promise is the run-state contract.
Define the workflow run state contract (before you write code)
A run-state contract is not “some enum values.” It’s a set of guarantees that every layer of your system relies on: UI, workers, support tooling, alerting, and (eventually) customers.
Who consumes workflow run state?
Be explicit. The consumers have different needs:
- End users / operators: “Can I trust this automation? What do I do next?”
- Support / on-call: “Is it stuck? Is it safe to retry? What broke?”
- Engineers: “Which step failed? What inputs caused it? Can we reproduce?”
- Sales / RevOps owners: “What’s the SLA? How often does auth block runs?”
A single “RUNNING” label cannot satisfy all of them.
What your workflow run state must guarantee
Here’s a contract we’ve found to be both strict and workable for agent workflows:
- Authoritative: There is one system-of-record for “current state.”
- Monotonic attempts: Attempts can increase; an attempt never rewrites history.
- Timestamped transitions: Every transition has an `at` time (and ideally an actor).
- Explainable blocking: If the run is not progressing, it must say why.
- Recoverable: States must support resume after refresh/reconnect.
- Validatable: Invalid transitions are rejected by code (not “handled later”).
- Composable: The contract works for single-step, multi-step, and nested runs.
What your workflow run state must not do
Don’t try to encode everything.
- Don’t mirror your traces. Traces are for high-cardinality debugging details.
- Don’t embed infrastructure implementation. “Kafka partition 12 lagging” is not a product state.
- Don’t pack UI copy into the state enum. Store `blocking_reason` + evidence; generate copy in the UI.
Treat your run-state model as a product API: stable, documented, testable.
A reference state machine for workflow run state (agentic edition)
You want states that are:
- User-meaningful (operators can take action)
- Worker-actionable (workers know what to do)
- Support-debuggable (on-call can triage)
Here’s a practical baseline.
Core states
Non-terminal:
- `queued`: accepted, not yet executing
- `running`: actively executing a step under a worker lease
- `waiting_on_tool`: paused for an async tool response or external callback
- `waiting_on_auth`: blocked on OAuth/token reconnect
- `waiting_on_approval`: blocked on human input
- `retry_scheduled`: paused until `next_retry_at`
- `stalled`: expected progress did not occur within a freshness window
- `cancel_requested`: user/system requested cancel; the worker must comply
Terminal:
- `succeeded`
- `failed`
- `canceled`
- `completed_with_warnings`
A note on stalled: it’s not “failed.” It’s “we can’t prove liveness.” It’s where you route humans and automation to recover.
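The state list above maps naturally onto a small enum. Here is a minimal Python sketch (state names come from this post; the `is_terminal` helper is an illustrative addition, not part of the contract):

```python
from enum import Enum

class RunState(str, Enum):
    # Non-terminal
    QUEUED = "queued"
    RUNNING = "running"
    WAITING_ON_TOOL = "waiting_on_tool"
    WAITING_ON_AUTH = "waiting_on_auth"
    WAITING_ON_APPROVAL = "waiting_on_approval"
    RETRY_SCHEDULED = "retry_scheduled"
    STALLED = "stalled"
    CANCEL_REQUESTED = "cancel_requested"
    # Terminal
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELED = "canceled"
    COMPLETED_WITH_WARNINGS = "completed_with_warnings"

TERMINAL = {
    RunState.SUCCEEDED,
    RunState.FAILED,
    RunState.CANCELED,
    RunState.COMPLETED_WITH_WARNINGS,
}

def is_terminal(state: RunState) -> bool:
    """Terminal runs never transition again."""
    return state in TERMINAL
```

Using `str` as a mixin keeps the enum directly serializable to the string values stored in your database.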
Required fields (minimum viable contract)
At minimum, store these on your run record:
- `run_id` (stable)
- `workflow_id` / `workflow_version`
- `state` (current)
- `attempt` (int)
- `step_id` (where you are)
- `updated_at` (last state change)
- `last_heartbeat_at` (liveness)
- `lease_owner` (which worker owns execution)
- `blocking_reason` (structured)
- `next_retry_at` (nullable)
- `evidence_links` (array of URLs/IDs to traces/logs)
That seems like a lot until you’ve debugged “running” at 2am.
Example: run record schema (JSON)
{
"run_id": "run_01J...",
"workflow_id": "wf_healthcare_news_scan",
"workflow_version": 7,
"state": "waiting_on_auth",
"attempt": 2,
"step_id": "google_drive:list_folder",
"blocking_reason": {
"type": "oauth_reconnect_required",
"tool": "google_drive",
"message": "Reconnect Google Drive to continue."
},
"lease_owner": null,
"last_heartbeat_at": "2026-02-22T18:41:03Z",
"next_retry_at": null,
"updated_at": "2026-02-22T18:41:04Z",
"evidence_links": [
{"type": "trace", "ref": "trace_9c2b..."},
{"type": "log", "ref": "logrun://run_01J..."}
]
}
Notice what’s not there: token counts, prompt text, per-span timing. That belongs in traces.
Single source of truth: what belongs in the DB vs what belongs in traces
If you want a single source of truth for workflow run state, you must separate concerns.
The DB is authoritative for lifecycle and blocking
Your database should own:
- Current state (and why)
- Transition history (append-only)
- Attempt number + retry schedule
- Heartbeat/lease ownership
- Pointers to evidence (trace IDs, log IDs)
This makes the DB fast, queryable, and stable.
Traces are authoritative for high-cardinality debugging
Tracing/observability tooling should own:
- Span timing and nested call graphs
- Tool call inputs/outputs (with redaction)
- Token usage, model calls, prompt templates
- Detailed errors and stack traces
- “Breadcrumbs” for engineers
Traces are great for explanations, but terrible as product truth because they’re:
- Eventual (ingestion delay)
- Incomplete (sampling, dropped spans)
- Not normalized for product queries
- Not designed for “what should the user do now?”
Link them with stable IDs, not duplicated logic
The key rule:
The UI renders the DB state. The DB state links to traces.
Don’t compute state by parsing trace spans. That’s how you rebuild “two sources of truth,” just with extra latency.
Eliminating “two sources of truth” with an event model
A reliable pattern is:
- Append-only transition log (source of truth)
- Materialized current state (fast reads)
- Worker leases + heartbeats (liveness)
- Traces as evidence (debugging)
Table design: run_events + runs
You can implement this with any database. Here’s a concrete Postgres-ish sketch.
-- Current state (fast reads)
create table runs (
run_id text primary key,
workflow_id text not null,
workflow_version int not null,
state text not null,
attempt int not null default 1,
step_id text,
blocking_reason jsonb,
lease_owner text,
lease_expires_at timestamptz,
last_heartbeat_at timestamptz,
next_retry_at timestamptz,
updated_at timestamptz not null default now()
);
-- Append-only state transitions (audit + rebuild)
create table run_events (
event_id bigserial primary key,
run_id text not null references runs(run_id),
at timestamptz not null default now(),
actor_type text not null, -- worker|system|user
actor_id text,
from_state text,
to_state text not null,
step_id text,
attempt int not null,
payload jsonb -- error codes, tool names, etc.
);
create index on run_events(run_id, event_id);
The contract is enforced by transitions, not by “whatever the UI last saw.”
Why append-only matters for agent workflows
Agent workflows have messy realities:
- A step partially executed before a crash.
- A tool call succeeded, but the confirmation write failed.
- A worker lease expired and another worker took over.
When your only record is “current status,” you lose the story. With an append-only log, you can:
- Explain runs to users (“blocked on approval since 14:32”)
- Debug retries safely (“attempt 3 started after rate limit backoff”)
- Rebuild materialized state if you change your model
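That last point, rebuilding materialized state, can be a left fold over the ordered log. A minimal in-memory sketch (event fields follow the `run_events` table above; real code would stream rows from the database ordered by `event_id`):

```python
def rebuild_run(events: list[dict]) -> dict:
    """Fold an ordered append-only event log into the current run row."""
    run = {"state": None, "attempt": 0, "step_id": None, "updated_at": None}
    for ev in events:  # must be ordered by event_id
        run["state"] = ev["to_state"]
        run["attempt"] = ev["attempt"]
        run["step_id"] = ev.get("step_id", run["step_id"])
        run["updated_at"] = ev["at"]
    return run
```

Because the log is the source of truth, you can change the materialized schema and replay the same events to repopulate it.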
Workflow run state and worker liveness: leases, heartbeats, and stalls
A run can be “running” only if you can prove a worker currently owns it.
Use a lease to prove ownership
A lease is a time-bound claim on the right to mutate state.
- Worker tries to acquire lease: `lease_owner = worker_123`, `lease_expires_at = now()+30s`
- Worker periodically heartbeats: extends lease + updates `last_heartbeat_at`
- If worker dies: lease expires; another worker can acquire it
Heartbeats prevent “forever running”
Without heartbeats, running is a lie. With them, you can implement a crisp rule:
- If `state == running` and `last_heartbeat_at < now() - freshness_window`, transition to `stalled`.
That creates a product action point:
- auto-retry if safe
- page on-call if needed
- prompt user to reconnect/auth/approve
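A background sweeper can apply that freshness rule periodically. Here is a sketch (the five-minute window, field names, and `sweep` helper are illustrative assumptions; tune the window per workflow):

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_WINDOW = timedelta(minutes=5)  # illustrative; tune per workflow

def detect_stall(run: dict, now: datetime) -> bool:
    """True if a 'running' run has not heartbeated within the freshness window."""
    if run["state"] != "running":
        return False  # waiting states are not stalls; they have a blocking_reason
    last = run["last_heartbeat_at"]
    return last is None or (now - last) > FRESHNESS_WINDOW

def sweep(runs: list[dict], now: datetime) -> list[dict]:
    """Return the runs that should transition running -> stalled."""
    return [r for r in runs if detect_stall(r, now)]
```

The sweeper only selects candidates; the actual `running -> stalled` write should still go through your validated transition path.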
Pseudocode: acquiring a lease safely
// TypeScript-ish pseudocode
async function tryAcquireLease(runId: string, workerId: string) {
const now = new Date();
const leaseMs = 30_000;
// Single atomic update: acquire only if no lease or expired
const updated = await db.exec(`
update runs
set lease_owner = $2,
lease_expires_at = $3,
last_heartbeat_at = $1,
updated_at = $1
where run_id = $4
and state in ('queued','running','retry_scheduled')
and (lease_expires_at is null or lease_expires_at < $1)
returning run_id;
`, [now, workerId, new Date(now.getTime() + leaseMs), runId]);
return updated.rowCount === 1;
}
If you can’t enforce leases atomically, your “single source of truth” collapses under contention.
Transition rules: make invalid states impossible
The fastest path to status chaos is letting any component set any state.
Instead:
- Centralize transitions in one module/service
- Validate `from_state -> to_state` rules
- Require fields for certain transitions (e.g., `next_retry_at`)
Example: transition validation
# Python-ish pseudocode
ALLOWED = {
"queued": {"running", "canceled"},
"running": {
"waiting_on_tool",
"waiting_on_auth",
"waiting_on_approval",
"retry_scheduled",
"succeeded",
"failed",
"cancel_requested",
"completed_with_warnings",
"stalled",
},
"waiting_on_auth": {"queued", "running", "canceled"},
"waiting_on_approval": {"running", "canceled"},
"waiting_on_tool": {"running", "retry_scheduled", "failed"},
"retry_scheduled": {"queued", "running", "canceled"},
"stalled": {"queued", "running", "failed", "canceled"},
"cancel_requested": {"canceled", "failed"},
"succeeded": set(),
"failed": set(),
"canceled": set(),
"completed_with_warnings": set(),
}
def transition(run, to_state, *, payload=None):
if to_state not in ALLOWED[run.state]:
raise ValueError(f"Invalid transition {run.state} -> {to_state}")
if to_state == "retry_scheduled" and not run.next_retry_at:
raise ValueError("retry_scheduled requires next_retry_at")
# write event + update materialized state in one transaction
This looks strict, but it’s kinder than letting UI + workers “figure it out.”
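The final comment in the snippet above (write the event and update the materialized state in one transaction) can be sketched with stdlib `sqlite3`; columns are simplified here, and a production version would target the Postgres tables defined earlier and include the transition validation:

```python
import sqlite3

def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute("create table runs (run_id text primary key, state text, attempt int)")
    db.execute("""create table run_events (
        event_id integer primary key autoincrement,
        run_id text, from_state text, to_state text, attempt int)""")
    return db

def transition(db: sqlite3.Connection, run_id: str, to_state: str) -> None:
    """Append the event and update the materialized row atomically."""
    with db:  # one transaction: both writes commit, or neither does
        from_state, attempt = db.execute(
            "select state, attempt from runs where run_id = ?", (run_id,)
        ).fetchone()
        db.execute(
            "insert into run_events (run_id, from_state, to_state, attempt) values (?,?,?,?)",
            (run_id, from_state, to_state, attempt),
        )
        db.execute("update runs set state = ? where run_id = ?", (to_state, run_id))
```

If the event insert succeeds but the `runs` update fails, the whole transaction rolls back, so the log and the materialized row can never disagree.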
UI semantics: show state + reason + freshness (not just state)
Even with perfect states, UX fails if you don’t show why.
A robust UI model is:
- State: `waiting_on_auth`
- Reason: `oauth_reconnect_required` + tool name
- Freshness: `last_heartbeat_at` / `updated_at`
So instead of a spinner, you can show:
- “Waiting on auth: Reconnect Google Drive to continue.”
- “Retry scheduled: Next attempt at 14:32.”
- “Stalled: No heartbeat for 5 minutes. Retry now?”
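Generating that copy from `state` + `blocking_reason` in the UI layer (rather than baking strings into the enum) can look like this sketch; the message strings and fallbacks are illustrative:

```python
def render_status(run: dict) -> str:
    """Map structured run state to user-facing copy, in the UI layer."""
    state = run["state"]
    reason = run.get("blocking_reason") or {}
    if state == "waiting_on_auth":
        return f"Waiting on auth: {reason.get('message', 'Reconnect to continue.')}"
    if state == "retry_scheduled":
        return f"Retry scheduled: next attempt at {run['next_retry_at']}."
    if state == "stalled":
        return "Stalled: no recent heartbeat. Retry now?"
    # Default: humanize the state name
    return state.replace("_", " ").capitalize()
```

Changing the wording is now a UI deploy, not a state-model migration.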
Example: API response shape
{
"run_id": "run_01J...",
"state": "retry_scheduled",
"step_id": "news:fetch_feeds",
"attempt": 3,
"updated_at": "2026-02-22T18:55:12Z",
"last_heartbeat_at": "2026-02-22T18:55:11Z",
"next_retry_at": "2026-02-22T18:57:12Z",
"blocking_reason": {
"type": "rate_limited",
"tool": "web",
"message": "Backoff after 429 from publisher feed."
},
"evidence_links": [{"type": "trace", "ref": "trace_..."}]
}
This is also where Claude Skills builders win: your “skill” becomes operational software, not a demo.
Edge cases that break naive workflow run state designs (and how to handle them)
1) Refresh mid-run (the “it reset” problem)
If refresh changes anything besides what you’re viewing, you’ve coupled UI to execution.
Design rule:
- Refresh must only re-fetch run state (`GET /runs/{id}`)
- The workflow continues independently under leases
If a refresh accidentally re-triggers steps, you have an idempotency failure, not a UI problem.
2) OAuth expires mid-run
Token expiry is not a generic error—it’s a blocking state.
Transition pattern:
- tool call fails with auth error
- transition `running -> waiting_on_auth`
- store `blocking_reason = oauth_reconnect_required`
- pause execution until the reconnect event arrives
- on reconnect: transition `waiting_on_auth -> queued` (or `running` if a lease exists)
This is one of the biggest differences between agent workflows and “single request” skills.
3) Human approval with timeouts
Don’t represent this as “running.” It’s waiting.
- `running -> waiting_on_approval`
- store who needs to approve and the deadline
- if the deadline passes: transition to `failed` or `completed_with_warnings`, depending on semantics
4) Retries without double side effects (idempotency keys)
If a step has side effects (send email, create CRM lead, write to Drive), retries must be safe.
Practical pattern:
- Every side-effecting step uses an idempotency key derived from `(run_id, step_id, attempt_group)`
- Tool adapters store an `idempotency_key -> external_id` mapping
- On retry, the adapter checks and returns the existing external result
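This pattern can be sketched as follows; the `ToolAdapter` class and in-memory mapping are hypothetical stand-ins for your real adapter and a durable store:

```python
import hashlib

def idempotency_key(run_id: str, step_id: str, attempt_group: int) -> str:
    """Stable key: the same run/step/attempt-group always yields the same key."""
    raw = f"{run_id}:{step_id}:{attempt_group}"
    return hashlib.sha256(raw.encode()).hexdigest()[:32]

class ToolAdapter:
    """Wraps a side-effecting tool call with an idempotency check."""
    def __init__(self, send):
        self._send = send                 # the real side effect (email, CRM write, ...)
        self._seen: dict[str, str] = {}   # idempotency_key -> external_id (durable in production)

    def call(self, key: str, payload: dict) -> str:
        if key in self._seen:             # retry: return prior result, no duplicate side effect
            return self._seen[key]
        external_id = self._send(payload)
        self._seen[key] = external_id
        return external_id
```

On retry the worker derives the same key, so the adapter short-circuits and the email is sent exactly once.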
5) Partial reruns and checkpoints
You will eventually want “rerun from step 5.” That’s where many systems accidentally rewrite history.
Treat reruns as:
- new attempt (or new run) with explicit linkage
- append events that record checkpoint selection
- keep the old run immutable for audit
Observability without status drift: make traces evidence, not truth
A good compromise that avoids “two sources of truth” is:
- DB state drives UI and alerts
- traces provide evidence for that state
Minimal “evidence links” contract
Every time you:
- start a step
- finish a step
- schedule a retry
- enter a waiting state
- hit a terminal state
…attach an evidence_link to:
- a trace id
- a log correlation id
- a support bundle id
That gives support a one-click path from “Waiting on tool” to the exact tool call spans—without asking the trace system to decide state.
Operational playbook: the metrics that actually matter
Once your workflow run state is a contract, you can measure reliability in a way that maps to customer pain.
Track (by workflow, by tenant, by tool):
- Time-in-state percentiles (especially `waiting_on_auth`, `waiting_on_tool`, `stalled`)
- Auth-blocked rate (% of runs that enter `waiting_on_auth`)
- Retry rate and retry success rate
- Stall rate (runs entering `stalled` per 1k runs)
- Completed-with-warnings rate (your “it worked but…” reality)
And—critically—treat “stalled” as a first-class incident funnel.
“Stalled” triage checklist
When a run is stalled, you should be able to answer these quickly:
- Did the worker lose its lease (crash, deploy, network)?
- Is the workflow actually waiting on an external system (auth/tool/human) but misclassified?
- Did we schedule a retry but fail to persist `next_retry_at`?
- Are we blocked on a queue backlog or worker capacity issue?
- Is there a poison-pill input causing repeated failure?
A strong contract makes these questions queryable.
Implementation checklist (copy/paste)
Use this as a punch-list when you implement or refactor.
Workflow run state model
- Enumerate non-terminal and terminal states
- Define transition rules (and unit tests)
- Define required fields per state (`next_retry_at`, `blocking_reason`, etc.)
- Define liveness rules (`last_heartbeat_at` freshness windows)
Persistence + concurrency
- Store current state in `runs`
- Store append-only transitions in `run_events`
- Update both in one transaction
- Implement leases with atomic compare-and-set
APIs
- `GET /runs/{id}` (current state)
- `GET /runs/{id}/events` (history)
- `POST /runs/{id}/cancel` (sets `cancel_requested`)
- `POST /runs/{id}/approve` (consumes approval payload)
- `POST /runs/{id}/reconnect` (auth reconnect callback)
UI semantics
- Render `state + blocking_reason + updated_at`
- Show clear “next action” buttons for waiting states
- Show “freshness” (last heartbeat) for running/stalled
Observability
- Attach `run_id` and `step_id` to every trace span
- Store `evidence_links` on transitions
- Never derive product state by parsing spans
Why this matters for Claude Skills users (and where nNode fits)
Claude Skills make it easy to build a tool-using capability. The hard part is turning that capability into something teams can operate:
- It should survive refresh.
- It should pause for OAuth and resume cleanly.
- It should explain what it’s doing and what it needs.
- It should retry safely without duplicate side effects.
That’s exactly the “workflow-first” mindset we’re building toward at nNode: durable, multi-step agentic automation where run-state semantics are a product feature—not an afterthought.
If you’re building agent workflows and you’re tired of “it says running but nothing is happening,” you’re already feeling the need for a run-state contract and a real single source of truth.
If you want to see how we think about dependable agentic execution (and where we’re heading with the “no-parsing” mission), take a look at nnode.ai.