Agent workflow concurrency in production: Run IDs, locks, and idempotency (so two chats don’t collide)

If your agent platform can’t run two chats at once, you don’t have a “UX limitation”—you have an execution model problem. The moment you ship concurrent runs, you’ll meet duplicate webhooks, double-sent Slack messages, two Google Docs with the same name, and the classic: “Why did it run twice?”

This guide is a production-focused playbook for agent workflow concurrency: how to structure workflow runs so they’re safe to retry, safe to replay, and safe to run simultaneously—without turning your system into a pile of global state and best-effort side effects.

nNode’s angle: we build inspectable (“white box”) workflows—not black-box “do the thing” actions. Concurrency gets dramatically easier when every run has a clear contract, a step timeline, and a side-effect ledger.

Quick definitions (so we don’t argue in circles)

Workflow: a reusable multi-step automation (e.g., “Inbound lead → enrich → create CRM record → notify Slack”).
Run: one execution of that workflow (immutable identity, bounded inputs).
Step: one unit of work inside a run (read, transform, write, call tool).
Side effect: an irreversible external write (send message, create doc, charge card).
Retry: re-attempt the same step after a transient failure.
Replay/Resume: continue a run later (after approval, crash recovery, operator action).
Idempotency: repeating the same request has the same effect as making it once (same outcome, no extra side effects).

The 6 concurrency failure modes you’ll actually see

1) Duplicate triggers (webhooks, polling, “at least once” delivery)

Your system receives the same event twice. If your workflow “creates a thing,” you have now created two things.

2) Shared mutable context (global memory, shared files, shared workspace state)

Two runs write to the same key/value memory or the same “current draft” document, and the last writer silently overwrites the other.

3) Non-idempotent side effects (double send, double create)

Retries and replays happen. If you can’t make external writes idempotent, concurrency will hurt you.

4) Retry storms (timeouts → repeated writes)

A tool call times out after the provider processed it. Your worker retries. Now you have duplicates.

5) Approval races (human-in-the-loop resumes twice)

Two tabs, two operators, or an auto-resume + a manual resume collide.

6) Cross-run token/session confusion

The run uses the wrong OAuth context (wrong Google account, wrong CRM portal) because identity wasn’t bound to the run.
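
Failure mode 5 (approval races) usually comes down to a status change that is not atomic. As a rough sketch, a resume can be modeled as a conditional transition that only one caller can win. `RunStore` and `resumeAfterApproval` are hypothetical names, and the in-memory map stands in for a conditional UPDATE in a real database.

```typescript
type RunStatus = "running" | "waiting_approval" | "completed";

// Hypothetical in-memory run store. In a real system this would be a
// conditional UPDATE (... WHERE status = 'waiting_approval') in your database.
class RunStore {
  private status = new Map<string, RunStatus>();

  set(runId: string, status: RunStatus): void {
    this.status.set(runId, status);
  }

  // Move a run from `from` to `to`. Returns true only for the caller that
  // actually performed the transition.
  transition(runId: string, from: RunStatus, to: RunStatus): boolean {
    if (this.status.get(runId) !== from) return false;
    this.status.set(runId, to);
    return true;
  }
}

// A second tab, a second operator, or an auto-resume gets "ignored"
// instead of starting the same run a second time.
function resumeAfterApproval(store: RunStore, runId: string): "resumed" | "ignored" {
  return store.transition(runId, "waiting_approval", "running") ? "resumed" : "ignored";
}
```

The caller that receives "ignored" should treat it as success and show the run's current state rather than raising an error.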

The “Run Contract” checklist (copy/paste this into your design docs)

To ship agent workflow concurrency safely, define a contract every run must satisfy:

Run ID: globally unique, immutable (e.g., run_01J...).
Correlation ID: ties the run to the triggering event (e.g., webhook event ID).
Actor: user/workspace/tenant identity bound to the run.
Environment: prod vs staging (never guess).
Step ID: stable per step within the run.
Attempt number: increments on retry (attempt=1,2,3).
Artifact boundaries:
- Per-run artifacts (default): memory, temp files, drafts.
- Shared artifacts (explicit): only via named shared stores + locking/version checks.

In nNode terms: this is what makes a workflow “white box.” When a run is inspectable, you can see collisions instead of debugging vibes.
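
One way to make the checklist concrete is to encode it as types, so a step cannot be constructed without its run identity. The sketch below is illustrative only: `RunContract`, `StepAttempt`, `firstAttempt`, `retryOf`, and `perRunArtifactKey` are hypothetical names, not part of any nNode API.

```typescript
type Environment = "prod" | "staging";

interface RunContract {
  readonly runId: string;          // globally unique, immutable
  readonly correlationId: string;  // e.g. the webhook event ID
  readonly actorId: string;        // user/workspace/tenant bound to the run
  readonly environment: Environment;
}

interface StepAttempt {
  readonly run: RunContract;
  readonly stepId: string;   // stable per step within the run
  readonly attempt: number;  // 1 on the first try, increments on retry
}

function firstAttempt(run: RunContract, stepId: string): StepAttempt {
  return { run, stepId, attempt: 1 };
}

// A retry keeps the run and step identity and only bumps the attempt counter.
function retryOf(prev: StepAttempt): StepAttempt {
  return { ...prev, attempt: prev.attempt + 1 };
}

// Per-run artifacts are namespaced by run ID, so two runs never share a key
// by accident. Shared stores would need their own explicit naming.
function perRunArtifactKey(step: StepAttempt, name: string): string {
  return `runs/${step.run.runId}/${name}`;
}
```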

Pattern 1: Idempotency keys everywhere (the cheapest concurrency win)

Any step that performs a side effect should have an idempotency key. The simplest rule:

Idempotency key = Run ID + stable Step ID (not a timestamp, not the attempt number).

Example: using a Step ID as an idempotency key (TypeScript-ish)

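type RunContext = {
  runId: string;
  stepId: string;      // stable across retries
  attempt: number;     // changes on retry; never part of the key
  actorId: string;
};

// Assumes `db` is a Prisma-style client whose slack_dedupe model has a unique
// idempotencyKey column, and `slack` is a Slack Web API client.
async function postToSlack(ctx: RunContext, channel: string, text: string) {
  const idempotencyKey = `${ctx.runId}:${ctx.stepId}`;

  // If Slack doesn't support idempotency keys natively, track them yourself:
  // store a mapping from idempotencyKey -> message_ts.
  const existing = await db.slack_dedupe.findUnique({ where: { idempotencyKey } });
  if (existing) return existing.messageTs;

  const res = await slack.chat.postMessage({ channel, text });
  if (!res.ts) throw new Error(`Slack returned no ts for ${idempotencyKey}`);

  // Caveat: lookup-then-post is not atomic. Two workers can both miss the
  // lookup and both post; the unique constraint only fails the second insert.
  // Serialize attempts per step, or claim the key before posting.
  await db.slack_dedupe.create({
    data: { idempotencyKey, messageTs: res.ts, channel }
  });

  return res.ts;
}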

Why this matters: retries become safe, replays become safe, and concurrency stops creating duplicate side effects.
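
A lookup-then-post flow like the one above still leaves a window where two workers both miss the dedupe row. One common way to narrow it is to claim the key before calling the provider. The sketch below assumes an in-memory `DedupeStore` standing in for a table with a unique constraint on the key. `DedupeStore`, `sendOnce`, and the outcome names are hypothetical.

```typescript
type DedupeRow =
  | { status: "pending" }
  | { status: "done"; externalRef: string };

type SendResult =
  | { outcome: "sent"; externalRef: string }
  | { outcome: "duplicate"; externalRef: string }
  | { outcome: "in_flight" };

class DedupeStore {
  private rows = new Map<string, DedupeRow>();

  // Mimics INSERT ... ON CONFLICT DO NOTHING: true only for the first caller.
  claim(key: string): boolean {
    if (this.rows.has(key)) return false;
    this.rows.set(key, { status: "pending" });
    return true;
  }

  complete(key: string, externalRef: string): void {
    this.rows.set(key, { status: "done", externalRef });
  }

  release(key: string): void {
    this.rows.delete(key);
  }

  get(key: string): DedupeRow | undefined {
    return this.rows.get(key);
  }
}

async function sendOnce(
  store: DedupeStore,
  key: string,
  send: () => Promise<string>,
): Promise<SendResult> {
  if (!store.claim(key)) {
    const row = store.get(key);
    if (row !== undefined && row.status === "done") {
      return { outcome: "duplicate", externalRef: row.externalRef };
    }
    return { outcome: "in_flight" }; // another attempt holds the claim
  }
  try {
    const externalRef = await send();
    store.complete(key, externalRef);
    return { outcome: "sent", externalRef };
  } catch (err) {
    // Releasing is only safe when the failure is known not to have reached
    // the provider. After an ambiguous timeout, keep the claim and reconcile.
    store.release(key);
    throw err;
  }
}
```

The trade-off is that a crash between the claim and the send leaves a pending row behind, so a real system needs a way to find and resolve stale claims.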

Pattern 2: Deduplicate at ingress (before you even start a run)

Most “it ran twice” incidents start at the top of the funnel.

Do this at your workflow trigger boundary:

Persist trigger_event_id with a uniqueness constraint.
If you see the same event ID again, return the existing run_id.

Example: webhook dedupe table (SQL)

create table workflow_triggers (
  trigger_event_id text primary key,
  workflow_key text not null,
  run_id text not null,
  received_at timestamptz not null default now()
);

This turns “duplicate delivery” into “idempotent start.”
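
In application code, the idempotent start might look roughly like the following. This is a sketch under assumptions: `TriggerTable` is an in-memory stand-in for the workflow_triggers table, and `startRunOnce` and `newRunId` are hypothetical names.

```typescript
interface TriggerRow {
  triggerEventId: string;
  workflowKey: string;
  runId: string;
  receivedAt: Date;
}

class TriggerTable {
  private rows = new Map<string, TriggerRow>();

  // Insert-if-absent keyed on trigger_event_id, like the primary key above.
  insertOrGet(row: TriggerRow): { row: TriggerRow; inserted: boolean } {
    const existing = this.rows.get(row.triggerEventId);
    if (existing) return { row: existing, inserted: false };
    this.rows.set(row.triggerEventId, row);
    return { row, inserted: true };
  }
}

// Returns the run for this event, creating one only on first delivery.
function startRunOnce(
  table: TriggerTable,
  triggerEventId: string,
  workflowKey: string,
  newRunId: () => string,
): { runId: string; started: boolean } {
  // The candidate run ID is simply discarded if the event was already seen.
  const candidate: TriggerRow = {
    triggerEventId,
    workflowKey,
    runId: newRunId(),
    receivedAt: new Date(),
  };
  const { row, inserted } = table.insertOrGet(candidate);
  return { runId: row.runId, started: inserted };
}
```

Only the caller that sees `started: true` should enqueue the run. A redelivery gets the original run ID back and can respond with it.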

Pattern 3: Distributed locks only where needed (lock the object, not the world)

Locks are useful—until you lock everything and kill throughput.

Use locking only for shared mutable resources:

A specific CRM contact
A specific Google Doc
A specific “client workspace” state bucket

Example: lock per resource key (Redis)

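import { randomUUID } from "node:crypto";

// Delete the lock only if we still own it. GET followed by DEL would race with
// TTL expiry, so the compare-and-delete runs atomically in a Lua script.
const UNLOCK_SCRIPT = `
if redis.call("get", KEYS[1]) == ARGV[1] then
  return redis.call("del", KEYS[1])
end
return 0`;

// Assumes `redis` is a connected node-redis v4 client.
async function withLock(resourceKey: string, ttlMs: number, fn: () => Promise<void>) {
  const lockKey = `lock:${resourceKey}`;
  const token = randomUUID();

  const acquired = await redis.set(lockKey, token, { NX: true, PX: ttlMs });
  if (!acquired) throw new Error(`LockBusy: ${resourceKey}`);

  try {
    await fn(); // must finish within ttlMs, or the lock can expire mid-work
  } finally {
    await redis.eval(UNLOCK_SCRIPT, { keys: [lockKey], arguments: [token] });
  }
}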

Rule of thumb: default to no lock + idempotency + version checks. Add locks only for the small set of resources that truly can’t tolerate concurrent writes.

Pattern 4: Optimistic concurrency with version checks (ETags, updatedAt, revision IDs)

When an API supports “update if version matches,” you get concurrency safety without locks.

Read resource + capture version/revision.
Update only if the version is unchanged.
If it changed, re-read and re-apply a deterministic transform.

This is especially useful for “append a line,” “update status,” “merge fields” operations.
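
The read, conditional write, and re-apply loop can be sketched as follows. `VersionedStore` is a hypothetical in-memory stand-in for any API that supports "update if version matches," and `updateWithRetry` is an illustrative helper, not a library function.

```typescript
interface Versioned<T> {
  value: T;
  version: number;
}

class VersionedStore<T> {
  private items = new Map<string, Versioned<T>>();

  // Seed a resource at version 1.
  put(key: string, value: T): void {
    this.items.set(key, { value, version: 1 });
  }

  read(key: string): Versioned<T> | undefined {
    return this.items.get(key);
  }

  // Write only if the stored version still equals expectedVersion.
  compareAndSet(key: string, expectedVersion: number, value: T): boolean {
    const current = this.items.get(key);
    if (!current || current.version !== expectedVersion) return false;
    this.items.set(key, { value, version: current.version + 1 });
    return true;
  }
}

// Re-read and re-apply the transform until the conditional write succeeds.
function updateWithRetry<T>(
  store: VersionedStore<T>,
  key: string,
  transform: (value: T) => T,
  maxAttempts = 5,
): T {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const current = store.read(key);
    if (!current) throw new Error(`NotFound: ${key}`);
    const next = transform(current.value);
    if (store.compareAndSet(key, current.version, next)) return next;
  }
  throw new Error(`VersionConflict: ${key}`);
}
```

Because the transform may run more than once, it has to be a pure function of the value it is given. Any side effect inside it would repeat on every conflict.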

Pattern 5: Side-effect ledger (a tiny write-ahead log for reality)

Agents don’t fail like regular code. They fail mid-flight, after partial progress, and often after they already touched external systems.

A pragmatic approach: record intent before you perform the side effect, then mark it committed once the external system confirms.

side_effects table stores: (run_id, step_id, effect_type, idempotency_key, status, external_ref)
Steps check the ledger first.

Minimal schema (conceptual)

create table side_effects (
  run_id text not null,
  step_id text not null,
  effect_type text not null,
  idempotency_key text not null,
  status text not null, -- proposed|committed|failed
  external_ref text,
  primary key (run_id, step_id, effect_type)
);

In nNode, this maps cleanly to “white box” debugging: you can show a run timeline plus a side-effect ledger so operators can answer “did we actually send it?” in seconds.
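
A rough sketch of how steps might use such a ledger is below. `SideEffectLedger` is a hypothetical in-memory version of the side_effects table, and building the idempotency key from run, step, and effect type is an assumption made for this example.

```typescript
type EffectStatus = "proposed" | "committed" | "failed";

interface SideEffect {
  runId: string;
  stepId: string;
  effectType: string;
  idempotencyKey: string;
  status: EffectStatus;
  externalRef?: string;
}

class SideEffectLedger {
  private rows = new Map<string, SideEffect>();

  private pk(runId: string, stepId: string, effectType: string): string {
    return `${runId}|${stepId}|${effectType}`;
  }

  // Record intent before touching the external system. If a row already
  // exists, return it so the caller can see whether the effect was committed.
  propose(runId: string, stepId: string, effectType: string): SideEffect {
    const pk = this.pk(runId, stepId, effectType);
    const existing = this.rows.get(pk);
    if (existing) return { ...existing };
    const row: SideEffect = {
      runId,
      stepId,
      effectType,
      idempotencyKey: `${runId}:${stepId}:${effectType}`,
      status: "proposed",
    };
    this.rows.set(pk, row);
    return { ...row };
  }

  commit(runId: string, stepId: string, effectType: string, externalRef: string): void {
    const row = this.rows.get(this.pk(runId, stepId, effectType));
    if (!row) throw new Error("commit without a proposed row");
    row.status = "committed";
    row.externalRef = externalRef;
  }

  // Proposed but never resolved: the external call may or may not have
  // happened, so these rows need reconciliation rather than a blind retry.
  unresolved(): SideEffect[] {
    const out: SideEffect[] = [];
    this.rows.forEach((row) => {
      if (row.status === "proposed") out.push({ ...row });
    });
    return out;
  }
}
```

After a crash, `unresolved()` gives an operator or recovery job the exact list of effects whose outcome is unknown.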

Pattern 6: Outbox for notifications (so runs don’t double-post)

Many workflows end with “notify Slack / email / webhook.” That’s the easiest place to accidentally duplicate.

Instead:

Write an outbox record in the same transaction as your workflow step completion.
A separate publisher drains the outbox with dedupe + retry.

This decouples workflow execution from delivery and makes concurrency survivable.
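
The two halves of the outbox can be sketched like this. `WorkflowDb` is a hypothetical in-memory stand-in in which one method call plays the role of a database transaction, and `deliver` is kept synchronous only to keep the example short.

```typescript
interface OutboxMessage {
  id: string;
  channel: string;
  text: string;
  delivered: boolean;
}

class WorkflowDb {
  readonly completedSteps = new Set<string>();
  readonly outbox = new Map<string, OutboxMessage>();

  // Stands in for one transaction: the step completion and the outbox row
  // are written together, or not at all.
  completeStepWithNotification(runId: string, stepId: string, channel: string, text: string): void {
    const id = `${runId}:${stepId}`;
    if (this.completedSteps.has(id)) return; // already completed: no second row
    this.completedSteps.add(id);
    this.outbox.set(id, { id, channel, text, delivered: false });
  }
}

// Publisher: attempt every undelivered message and mark the ones that succeed.
// `deliver` returns true on success. Failed messages stay for the next drain.
function drainOutbox(db: WorkflowDb, deliver: (msg: OutboxMessage) => boolean): number {
  let sent = 0;
  db.outbox.forEach((msg) => {
    if (msg.delivered) return;
    if (deliver(msg)) {
      msg.delivered = true;
      sent += 1;
    }
  });
  return sent;
}
```

A publisher that crashes after delivering but before marking the row will deliver again, so the delivery call itself still benefits from an idempotency key such as the message `id`.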

Minimal implementation blueprint (tool-agnostic)

If you’re building Claude Skills or agentic automations and you want concurrency without chaos, start with four tables/collections:

runs: (run_id, workflow_key, actor_id, correlation_id, status)
steps: (run_id, step_id, name, status, attempt, input_hash, output_ref)
side_effects: ledger of external writes
idempotency: (scope, key) -> result/ref with TTL where appropriate

And adopt three invariants:

A step’s inputs are deterministic (freeze them; don’t re-fetch “latest” mid-retry).
Every side effect has an idempotency key.
Shared state is explicit (and guarded by version checks or resource locks).
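
For the first invariant, one way to freeze inputs is to snapshot them in a canonical form when the step is first scheduled, then have every retry read the snapshot. The sketch below shows only the canonical form. `canonicalize` is a hypothetical helper, and in practice you would probably hash its output (for example with SHA-256) to fill input_hash.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys sorted, so logically equal inputs always
// produce the same string regardless of property order.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) {
    return `[${value.map((item) => canonicalize(item)).join(",")}]`;
  }
  const keys = Object.keys(value).sort();
  const fields = keys.map((k) => `${JSON.stringify(k)}:${canonicalize(value[k])}`);
  return `{${fields.join(",")}}`;
}

// A retry can compare its inputs against the stored snapshot to detect drift.
function inputsUnchanged(snapshot: string, inputs: Json): boolean {
  return snapshot === canonicalize(inputs);
}
```

Array order is preserved on purpose: reordering a list is treated as a real change to the inputs.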

Integration-specific quick hits (Drive/Slack/HubSpot)

Google Drive / Docs

Dedupe on (run_id, step_id): store the resulting file_id against that key and reuse it on retry instead of creating a new file.
Avoid naming collisions by including a run prefix in draft names, then rename once committed.

Slack

If you post to a thread, store and reuse thread_ts.
Treat “already posted” as success, not error.

HubSpot / CRMs

Prefer upserts using an external ID (e.g., source=workflow + run_id).
If you must create, create once and persist the CRM object ID as the step output.

What to log so debugging is fast (the “white box” advantage)

At minimum, log these fields on every step/tool call:

run_id, step_id, attempt, actor_id, correlation_id
tool name + request fingerprint (not secrets)
idempotency key + dedupe hit/miss
external refs (message timestamp, file ID, CRM ID)

When concurrency incidents happen (and they will), this turns a 2-hour incident into a 2-minute query.

Closing: two chats isn’t a UX feature—it’s an execution model

Before you ship true multi-chat or multi-run execution, ask:

What’s my idempotency strategy for every side effect?
What state is shared across runs—and how is it protected?
Can an operator explain “what happened” from the run timeline alone?

If you’re building reusable, client-facing automations (or Claude Skills that touch real systems), these patterns are the difference between “cool demo” and “production-ready.”

nNode is built around this idea: workflows you can inspect, replay, and run concurrently without guessing—with the integrations that matter (Drive/Slack/CRMs) and the run/step primitives that make safe concurrency achievable.

If you’re working on multi-step agent automations and want a more reliable way to run them (including two chats safely), take a look at nnode.ai.