If you’ve shipped a tool-using agent beyond a toy environment, you’ve seen the pattern:
- It passes the demo.
- It fails on Tuesday.
- Nobody can answer, with confidence:
  - What exactly happened?
  - Which tool call caused the damage (or the delay)?
  - Was it a model mistake, a schema drift, a permissions issue, or an integration outage?
  - What did it cost?
  - Who approved it (if anyone)?
Classic observability (logs + metrics + traces) already solved this for distributed systems. Agentic systems are distributed systems—with an extra twist:
- The “planner” is probabilistic.
- The agent can take actions.
- “It seemed reasonable” is not an acceptable postmortem.
This guide is a production blueprint for agent observability using OpenTelemetry (OTel) as the spine:
- A tracing model that makes tool-call failures debuggable
- An audit-log model that makes agent actions defensible
- A minimal schema you can actually implement
- Cost attribution and reliability KPIs that matter
- A practical loop to turn traces into repeatable reliability improvements
At nNode, we’re obsessed with the harness/supervisor layer—the part that makes agents trustworthy (and improvable) rather than just impressive. Observability is where that harness either exists… or it doesn’t.
What “agent observability” must answer (in production)
A good observability system for tool-using agents should answer these questions quickly:
- What happened? (timeline of planning → tool calls → approvals → side effects)
- Why did it happen? (inputs, intermediate reasoning signals, tool arguments, tool results)
- What changed? (model version, prompt/policy version, tool schema version)
- What did it cost? (tokens, latency, dollars, downstream vendor costs)
- Is it safe? (PII controls, redaction, approval gates, least privilege)
- Can we reproduce it? (trace → labeled trajectory → eval test → regression prevention)
The mistake teams make: they treat “LLM tracing” as an app-level add-on.
In production, observability is not “LLM-only.” It’s supervisor-first:
- planning
- tool selection
- tool execution
- verification
- retries
- circuit breakers
- approvals
The 3 layers of truth: traces, audit logs, and outcomes
You need three different things, and you should not mash them into one datastore.
| Layer | Purpose | Sampling allowed? | PII policy | Who reads it? |
|---|---|---|---|---|
| Traces | Debugging, latency, dependency mapping | Yes (carefully) | Redact aggressively | Engineers/on-call |
| Audit log / action ledger | Compliance, non-repudiation, “what changed in the world” | No | Strict, immutable, minimal | Security, compliance, leadership |
| Outcomes / business KPIs | ROI, reliability, product health | N/A | Usually aggregated | Product, leadership |
Why separate traces from audit logs?
Traces are optimized for debug truth: timelines, spans, attributes, sampling.
Audit logs are optimized for compliance truth: append-only, immutable, complete coverage.
If you sample the only record of an irreversible action, you will eventually regret it.
Reference architecture: OpenTelemetry-first telemetry pipeline
Here’s a practical architecture that scales from “one workflow” to a fleet.
            ┌───────────────────────────┐
            │ Agent Runtime / Harness   │
            │ (planner + tools + verify)│
            └─────────────┬─────────────┘
                          │
           ┌──────────────┴──────────────┐
           │                             │
(1) Traces/metrics/logs          (2) Audit Ledger
    OpenTelemetry SDK                Append-only store
    OTLP exporter                    (WORM/immutable)
           │                             │
           ▼                             ▼
┌─────────────────────┐       ┌─────────────────────┐
│ OTel Collector      │       │ Audit pipeline      │
│ - batching          │       │ - hash chain        │
│ - tail sampling     │       │ - redaction policy  │
│ - enrichment        │       │ - retention         │
└──────────┬──────────┘       └──────────┬──────────┘
           │                             │
           ▼                             ▼
┌─────────────────────┐       ┌─────────────────────┐
│ APM backend         │       │ Audit storage       │
│ (Tempo/Jaeger/etc)  │       │ (DB + object store) │
└──────────┬──────────┘       └─────────────────────┘
           │
           ▼
┌───────────────────────────────┐
│ Dashboards + alerts + IR loop │
└───────────────────────────────┘
Context propagation: your run_id is your primary key
Make every tool call and every downstream HTTP request carry:
- traceparent (W3C Trace Context)
- baggage containing stable identifiers (e.g. run_id, tenant_id, workflow_id)
This is how you connect:
- agent span → tool span → vendor API span (Gmail, Drive, Slack, database, etc.)
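To make the two headers concrete, here is a stdlib-only sketch of what they look like on the wire. In a real service the OpenTelemetry propagators produce these for you; the helper names build_traceparent, build_baggage, and outbound_headers are hypothetical and exist only for illustration.

```python
import re
import secrets
from urllib.parse import quote


def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a W3C Trace Context header: version-traceid-parentid-flags."""
    if not re.fullmatch(r"[0-9a-f]{32}", trace_id):
        raise ValueError("trace_id must be 32 lowercase hex characters")
    if not re.fullmatch(r"[0-9a-f]{16}", span_id):
        raise ValueError("span_id must be 16 lowercase hex characters")
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def build_baggage(items: dict) -> str:
    """Format a W3C Baggage header: comma-separated key=value pairs, values percent-encoded."""
    return ",".join(f"{key}={quote(str(value), safe='')}" for key, value in items.items())


def outbound_headers(run_id: str, tenant_id: str, workflow_id: str) -> dict:
    """Headers to attach to every tool call and downstream HTTP request."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    return {
        "traceparent": build_traceparent(trace_id, span_id),
        "baggage": build_baggage(
            {"run_id": run_id, "tenant_id": tenant_id, "workflow_id": workflow_id}
        ),
    }
```

Baggage travels to every downstream service, so keep it to opaque identifiers like these and never put payload data in it.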
What to log: a minimal event schema that actually works
You don’t need 200 fields. You need the right fields.
Below is a minimal schema that supports:
- debugging tool-call failures
- cost attribution
- approval gate analysis
- safe replay / eval generation
Required identifiers
- run_id: unique per agent run
- workflow_id: stable workflow identifier (e.g. invoice_followup_v3)
- tenant_id: multi-tenant boundary
- actor_user_id: who initiated
- actor_is_agent: bool
- environment: prod/staging
Tool call fields (the observability heart)
- tool_name
- tool_args_hash (hash of canonicalized args)
- tool_result_hash (hash of canonicalized result)
- side_effect_level: none | read | write | irreversible
- idempotency_key: stable key to prevent double-writes
Safety + governance fields
- approval_state: not_required | requested | approved | rejected | timed_out
- approver_id
- approval_latency_ms
- policy_version
Reliability + cost fields
- retry_count
- error_class (normalized)
- circuit_breaker_tripped
- tokens_in, tokens_out
- latency_ms
- cost_usd
Minimal JSON example
{
  "timestamp": "2026-05-06T06:00:00.000Z",
  "environment": "prod",
  "tenant_id": "t_9f3d",
  "workflow_id": "customer_support_triage_v7",
  "run_id": "run_01J3K6...",
  "actor_user_id": "u_123",
  "actor_is_agent": true,
  "event_type": "tool.call",
  "tool_name": "gmail.send",
  "side_effect_level": "write",
  "idempotency_key": "gmail.send:thread_18b4f5e2:msg_02",
  "tool_args_hash": "sha256:3b4d...",
  "tool_result_hash": "sha256:91ad...",
  "approval_state": "approved",
  "approver_id": "u_901",
  "approval_latency_ms": 18421,
  "retry_count": 1,
  "error_class": null,
  "circuit_breaker_tripped": false,
  "tokens_in": 2231,
  "tokens_out": 412,
  "latency_ms": 3920,
  "cost_usd": 0.0184,
  "model": {
    "provider": "openai",
    "name": "gpt-4.1",
    "version": "2026-04-xx"
  },
  "prompt_version": "support_triage_prompt@2026-05-01",
  "policy_version": "support_policy@2026-05-03"
}
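One way to keep events from drifting away from this schema is to validate them at construction time. The sketch below covers only a subset of the fields and assumes a hypothetical ToolCallEvent dataclass. The rule that write and irreversible calls must carry an idempotency key is this sketch's own assumption, not something the schema itself mandates.

```python
from dataclasses import asdict, dataclass
from typing import Optional

SIDE_EFFECT_LEVELS = {"none", "read", "write", "irreversible"}
APPROVAL_STATES = {"not_required", "requested", "approved", "rejected", "timed_out"}


@dataclass(frozen=True)
class ToolCallEvent:
    run_id: str
    workflow_id: str
    tenant_id: str
    actor_user_id: str
    actor_is_agent: bool
    environment: str
    tool_name: str
    tool_args_hash: str
    side_effect_level: str
    approval_state: str = "not_required"
    tool_result_hash: Optional[str] = None
    idempotency_key: Optional[str] = None
    retry_count: int = 0
    error_class: Optional[str] = None

    def __post_init__(self) -> None:
        if self.side_effect_level not in SIDE_EFFECT_LEVELS:
            raise ValueError(f"unknown side_effect_level: {self.side_effect_level!r}")
        if self.approval_state not in APPROVAL_STATES:
            raise ValueError(f"unknown approval_state: {self.approval_state!r}")
        # Assumption of this sketch: anything that changes the world needs a dedupe key.
        if self.side_effect_level in {"write", "irreversible"} and not self.idempotency_key:
            raise ValueError("write/irreversible tool calls need an idempotency_key")

    def to_record(self) -> dict:
        """Plain dict, ready to serialize as a JSON event."""
        return asdict(self)
```

Rejecting a malformed event in the harness is cheaper than discovering during an incident that half the events are missing the field you need to filter on.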
Redaction + the “secure payload vault” pattern (do this early)
The best practice for production agent telemetry is:
- Do not dump raw emails, documents, or tool payloads into traces/logs.
- Store hashes + pointers in traces.
- Keep sensitive payloads in a separate, access-controlled vault.
Pattern
- Canonicalize payload (stable JSON, sorted keys)
- Compute sha256
- Store payload in a vault (encrypted; strict ACL; short retention)
- Emit only:
  - payload_hash
  - payload_ref (opaque ID)
  - size metadata
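// Pseudocode: canonicalizeJson, sha256, and payloadVault stand in for your own helpers
const canonical = canonicalizeJson(toolArgs)
const hash = sha256(canonical)
const ref = await payloadVault.put({ hash, canonical, ttlHours: 24 })
span.setAttribute("tool.args.hash", `sha256:${hash}`)
span.setAttribute("tool.args.ref", ref) // opaque, access-controlled
// Use the UTF-8 byte length; canonical.length counts UTF-16 code units, not bytes
span.setAttribute("tool.args.bytes", Buffer.byteLength(canonical, "utf8"))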
This one design decision prevents “your observability stack became your biggest data leak.”
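The same pattern can be sketched in Python with only the standard library. InMemoryPayloadVault is a hypothetical stand-in for a real encrypted, access-controlled store, and sorted-key JSON is a simple approximation of canonicalization rather than a full canonical-JSON scheme.

```python
import hashlib
import json
import uuid


def canonicalize_json(payload) -> str:
    """Stable serialization: sorted keys, no insignificant whitespace."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


class InMemoryPayloadVault:
    """Hypothetical stand-in; a real vault would encrypt, enforce ACLs, and expire entries."""

    def __init__(self) -> None:
        self._items = {}

    def put(self, payload_hash: str, canonical: str, ttl_hours: int) -> str:
        ref = f"vault_{uuid.uuid4().hex}"  # opaque ID, reveals nothing about the payload
        self._items[ref] = {"hash": payload_hash, "body": canonical, "ttl_hours": ttl_hours}
        return ref


def telemetry_fields(payload, vault: InMemoryPayloadVault, ttl_hours: int = 24) -> dict:
    """Store the raw payload in the vault and return only what is safe to emit."""
    canonical = canonicalize_json(payload)
    data = canonical.encode("utf-8")
    digest = hashlib.sha256(data).hexdigest()
    ref = vault.put(digest, canonical, ttl_hours)
    return {
        "payload_hash": f"sha256:{digest}",
        "payload_ref": ref,
        "payload_bytes": len(data),
    }
```

Because the keys are sorted before hashing, two calls with the same arguments in a different key order produce the same hash, which is what makes the hash useful for deduplication and replay comparisons.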
Spans that matter: a practical tracing model for tool-using agents
Model your run with a single parent span and a small number of meaningful child spans.
Recommended span tree
- agent.run (root)
  - agent.plan
  - agent.tool.select
  - agent.tool.call (repeat)
  - agent.verify
  - agent.approval.wait (optional)
  - agent.finish
Suggested attributes
Use attributes (tags) to make traces searchable:
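- agent.run_id
- agent.workflow_id
- agent.tenant_id
- agent.step_index
- gen_ai.request.model / gen_ai.system (if you follow the emerging GenAI semantic conventions; the attribute names are still changing, so check the current spec)
- tool.name
- tool.side_effect_level
- tool.args.hash
- tool.result.hash
- approval.state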
Even if you adopt GenAI semantic conventions, keep your own stable “agent/tool” namespace too—because you’ll want invariants that outlive changing conventions.
Python implementation: instrument an agent run with OpenTelemetry
This example uses the OpenTelemetry Python SDK with an OTLP exporter.
Goal: generate a trace that shows planning + each tool call + verification, with cost and safety metadata.
Install
pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp
Setup tracing
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

resource = Resource.create({
    "service.name": "agent-harness",
    "service.version": "1.7.0",
    "deployment.environment": "prod",
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent.harness")
Emit spans for plan → tool → verify
import hashlib
import time
import uuid

from opentelemetry.trace import Status, StatusCode


def sha256_hex(s: str) -> str:
    return hashlib.sha256(s.encode("utf-8")).hexdigest()


class ToolError(Exception):
    pass


def run_agent(workflow_id: str, tenant_id: str, actor_user_id: str):
    run_id = f"run_{uuid.uuid4().hex}"
    with tracer.start_as_current_span("agent.run") as run_span:
        run_span.set_attribute("agent.run_id", run_id)
        run_span.set_attribute("agent.workflow_id", workflow_id)
        run_span.set_attribute("agent.tenant_id", tenant_id)
        run_span.set_attribute("actor.user_id", actor_user_id)
        run_span.set_attribute("actor.is_agent", True)

        # 1) Plan
        with tracer.start_as_current_span("agent.plan") as plan_span:
            t0 = time.time()
            plan = "Search recent emails, draft reply, request approval"
            # Don’t log full prompt/plan if sensitive; log hashes + versions
            plan_span.set_attribute("agent.plan.hash", f"sha256:{sha256_hex(plan)}")
            plan_span.set_attribute("agent.prompt_version", "support_triage_prompt@2026-05-01")
            plan_span.set_attribute("latency_ms", int((time.time() - t0) * 1000))

        # 2) Tool call
        tool_name = "gmail.search"
        tool_args = '{"q":"from:vip newer_than:7d","maxResults":10}'
        with tracer.start_as_current_span("agent.tool.call") as tool_span:
            tool_span.set_attribute("tool.name", tool_name)
            tool_span.set_attribute("tool.side_effect_level", "read")
            tool_span.set_attribute("tool.args.hash", f"sha256:{sha256_hex(tool_args)}")
            tool_span.set_attribute("agent.step_index", 1)
            try:
                # call_tool(...) is your integration wrapper
                # result = call_tool(tool_name, tool_args)
                result = '{"threads":[{"id":"18b4f5e2a8c"}]}'
                tool_span.set_attribute("tool.result.hash", f"sha256:{sha256_hex(result)}")
            except ToolError as e:
                # Mark the span as failed so error filters and alerts pick it up
                tool_span.record_exception(e)
                tool_span.set_status(Status(StatusCode.ERROR, str(e)))
                tool_span.set_attribute("error_class", "tool_error")
                raise

        # 3) Approval wait (if needed); values are hard-coded here for illustration
        with tracer.start_as_current_span("agent.approval.wait") as approval_span:
            approval_span.set_attribute("approval.state", "approved")
            approval_span.set_attribute("approval.approver_id", "u_901")
            approval_span.set_attribute("approval.latency_ms", 18421)

        # 4) Verify
        with tracer.start_as_current_span("agent.verify") as verify_span:
            verify_span.set_attribute("verify.strategy", "postcondition_check")
            verify_span.set_attribute("verify.passed", True)

        run_span.set_attribute("agent.outcome", "success")
    return run_id
What this gives you immediately
- A per-run timeline with the exact tool calls involved
- Searchable attributes for workflow_id, tool.name, approval.state
- A place to add cost and token attribution
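For the cost and token attribution piece, a minimal sketch might compute per-step cost from token counts and a price table, then attach the result as span attributes. The model name and prices below are made up for illustration, and ModelPrice and cost_attributes are hypothetical names; load real prices from your provider's price sheet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    usd_per_million_input: float
    usd_per_million_output: float


# Hypothetical prices, for illustration only.
PRICES = {"example-model": ModelPrice(usd_per_million_input=2.00, usd_per_million_output=8.00)}


def step_cost_usd(model: str, tokens_in: int, tokens_out: int) -> float:
    """Dollar cost of one model call, given token counts and the price table."""
    price = PRICES[model]
    total = tokens_in * price.usd_per_million_input + tokens_out * price.usd_per_million_output
    return total / 1_000_000


def cost_attributes(model: str, tokens_in: int, tokens_out: int) -> dict:
    """Attributes to set on the step span (and to sum onto the run span)."""
    return {
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "cost_usd": step_cost_usd(model, tokens_in, tokens_out),
    }
```

Each key in the returned dict can be passed to span.set_attribute on the step span. Summing the same values onto the agent.run span gives you cost at both levels.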
JavaScript/TypeScript: capture trace context through HTTP tool wrappers
If your tools are ultimately HTTP calls, instrument the wrapper once and every tool gets trace coverage.
import { createHash } from "node:crypto";
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("agent.harness");

type ToolCall = {
  runId: string;
  workflowId: string;
  toolName: string;
  sideEffectLevel: "none" | "read" | "write" | "irreversible";
  args: unknown;
};

// Placeholder for your own tool router; it is not part of OpenTelemetry.
declare function routeToolCall<T>(toolName: string, args: unknown): Promise<T>;

export async function callTool<T>(call: ToolCall): Promise<T> {
  return await tracer.startActiveSpan("agent.tool.call", async (span) => {
    span.setAttribute("agent.run_id", call.runId);
    span.setAttribute("agent.workflow_id", call.workflowId);
    span.setAttribute("tool.name", call.toolName);
    span.setAttribute("tool.side_effect_level", call.sideEffectLevel);

    // Hash args (store raw elsewhere if needed).
    // JSON.stringify keeps insertion order, so canonicalize first if you need stable hashes.
    const argsJson = JSON.stringify(call.args) ?? "null";
    const argsHash = createHash("sha256").update(argsJson).digest("hex");
    span.setAttribute("tool.args.hash", `sha256:${argsHash}`);
    span.setAttribute("tool.args.bytes", Buffer.byteLength(argsJson, "utf8"));

    try {
      // Your tool router might call vendor APIs here.
      // Make sure fetch/axios is instrumented so downstream spans join this trace.
      return await routeToolCall<T>(call.toolName, call.args);
    } catch (err: any) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err?.message ?? "tool failed" });
      span.setAttribute("error_class", "tool_error");
      throw err;
    } finally {
      span.end();
    }
  });
}
Key point: propagate OTel context so the tool wrapper span is the parent of the actual HTTP span.
Don’t forget metrics: the dashboards operators and execs both use
Traces tell you why something failed. Metrics tell you how often and how bad it is.
Minimum viable metrics for tool-using agents:
Reliability metrics
- agent_runs_total{workflow_id, outcome}
- agent_tool_calls_total{tool_name, outcome}
- agent_step_retries_total{workflow_id, tool_name}
- agent_verification_fail_total{workflow_id}
- approval_requests_total{workflow_id, state}
- approval_queue_age_seconds (gauge/histogram)
Cost metrics
- agent_tokens_total{workflow_id, model}
- agent_cost_usd_total{workflow_id, model}
- cost_per_success_usd{workflow_id} (derived)
Latency metrics
- agent_run_latency_ms{workflow_id}
- agent_tool_latency_ms{tool_name}
If you only build one exec-facing chart, make it:
Cost per successful outcome (by workflow), over time.
It keeps everyone honest.
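As a sketch of how that derived number could be computed from per-run records, assume each record is a dict with workflow_id, outcome, and cost_usd keys (field names follow the event schema above; the function name is hypothetical). Failed runs count toward the cost but not toward the successes, which is why the metric penalizes wasted spend.

```python
from collections import defaultdict


def cost_per_success(runs: list) -> dict:
    """Total spend per workflow divided by its successful runs (None if there were none)."""
    total_cost = defaultdict(float)
    successes = defaultdict(int)
    for run in runs:
        workflow_id = run["workflow_id"]
        total_cost[workflow_id] += run["cost_usd"]  # failures still cost money
        if run["outcome"] == "success":
            successes[workflow_id] += 1
    return {
        workflow_id: (total_cost[workflow_id] / successes[workflow_id]
                      if successes[workflow_id] else None)
        for workflow_id in total_cost
    }
```

A workflow that never succeeds reports None rather than zero, so it shows up as a gap on the chart instead of looking free.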
Failure modes → instrumentation you need to detect them
Most “agent failures” are not mystical. They cluster.
1) Wrong tool / wrong scope (permissions)
Signals to capture:
- tool.name
- permission errors normalized (error_class=authz_denied)
- actor_user_id vs service account identity
- tool.side_effect_level mismatches (agent attempted write when only read allowed)
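Normalizing errors could look something like the sketch below, which maps an HTTP status code or a Python exception onto a small error_class taxonomy. Only authz_denied comes from the list above; the other class names and the classify_error function are assumptions made for illustration.

```python
from typing import Optional


def classify_error(status_code: Optional[int] = None,
                   exc: Optional[BaseException] = None) -> str:
    """Collapse vendor-specific failures into a handful of searchable classes."""
    if isinstance(exc, TimeoutError):
        return "timeout"
    if isinstance(exc, ConnectionError):
        return "integration_unavailable"
    if status_code == 401:
        return "authn_failed"
    if status_code == 403:
        return "authz_denied"
    if status_code in (400, 422):
        return "invalid_args"
    if status_code == 429:
        return "rate_limited"
    if status_code is not None and status_code >= 500:
        return "integration_unavailable"
    return "unknown"
```

Keeping the taxonomy this small is deliberate: a dashboard grouped by six classes is readable, while one grouped by raw vendor error strings is not.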
2) Right tool / wrong args (schema drift)
Signals:
- tool.args.hash
- tool.schema_version
- structured error code from the tool adapter
3) Partial completion (idempotency gaps)
Signals:
- idempotency_key
- tool.result.hash
- “completed steps” list for the run
If a run retries after a timeout and you don’t have idempotency keys, you’ll ship duplicate emails, duplicate invoices, or duplicate tickets.
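A minimal sketch of the idea: derive a key that is identical across retries of the same logical step, and skip the side effect when that key has already completed. The key recipe and the IdempotentExecutor class are hypothetical, and the in-memory dict stands in for a durable, shared store.

```python
import hashlib
from typing import Callable


def make_idempotency_key(run_id: str, step_index: int, tool_name: str, args_hash: str) -> str:
    """Same run, step, tool, and args always produce the same key, so retries collide."""
    raw = f"{run_id}:{step_index}:{tool_name}:{args_hash}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class IdempotentExecutor:
    """Runs a side effect at most once per key (in-memory stand-in for a durable store)."""

    def __init__(self) -> None:
        self._completed = {}

    def execute(self, key: str, side_effect: Callable[[], object]) -> object:
        if key in self._completed:
            return self._completed[key]  # retry: return the recorded result, do nothing
        result = side_effect()
        self._completed[key] = result
        return result
```

A local record like this does not cover the case where the vendor applied the write but the response was lost before it could be recorded. For that, pass the key to the vendor API as well, where the API supports one.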
4) Infinite loops / cascades
Signals:
- agent.step_count
- budget fields: max_steps, max_cost_usd, max_latency_ms
- circuit_breaker_tripped=true
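Those budget fields can be enforced with a small per-run guard that the harness checks before each step. This is a sketch under assumed semantics (a limit trips once it is reached); the RunBudget class and its method names are hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunBudget:
    max_steps: int
    max_cost_usd: float
    max_latency_ms: int
    step_count: int = 0
    cost_usd: float = 0.0
    started_at: float = field(default_factory=time.monotonic)

    def record_step(self, cost_usd: float) -> None:
        """Call after every completed step."""
        self.step_count += 1
        self.cost_usd += cost_usd

    def tripped_reason(self, now: Optional[float] = None) -> Optional[str]:
        """Return which budget is exhausted, or None if the run may continue."""
        now = time.monotonic() if now is None else now
        if self.step_count >= self.max_steps:
            return "max_steps"
        if self.cost_usd >= self.max_cost_usd:
            return "max_cost_usd"
        if (now - self.started_at) * 1000 >= self.max_latency_ms:
            return "max_latency_ms"
        return None
```

When tripped_reason returns a value, the harness would stop the run and set circuit_breaker_tripped=true along with the reason, so loops show up in dashboards instead of in the invoice.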
5) Approval fatigue vs risk leaks
Signals:
- approval.state
- approval.latency_ms
- approval rate by workflow step
You want approvals where the side-effect level is high—not everywhere.
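One way to encode that rule is a policy function keyed on side_effect_level. The thresholds below, the preapproved-tool allowlist, and the requires_approval name are all assumptions for illustration; your own policy will differ.

```python
# Assumed policy: reads run freely, anything that changes the world asks first.
APPROVAL_REQUIRED = {"none": False, "read": False, "write": True, "irreversible": True}


def requires_approval(side_effect_level: str,
                      tool_name: str = "",
                      preapproved_tools: frozenset = frozenset()) -> bool:
    """Decide whether a tool call must wait at an approval gate."""
    if side_effect_level == "irreversible":
        return True  # never skippable, even for allowlisted tools
    if tool_name in preapproved_tools:
        return False  # low-risk writes an operator has chosen to auto-approve
    # Unknown levels fail closed: ask for approval rather than guess.
    return APPROVAL_REQUIRED.get(side_effect_level, True)
```

Tracking how often each branch fires per workflow step is what tells you whether the gate is causing fatigue or letting risk through.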
Audit log blueprint: make agent actions defensible
A good audit log entry is:
- minimal
- immutable
- attributable
- tamper-evident
Suggested audit log record
{
  "timestamp": "2026-05-06T06:00:08.000Z",
  "tenant_id": "t_9f3d",
  "workflow_id": "customer_support_triage_v7",
  "run_id": "run_01J3K6...",
  "action": {
    "type": "tool.call",
    "tool_name": "gmail.send",
    "side_effect_level": "write",
    "target": "thread_18b4f5e2a8c"
  },
  "approval": {
    "required": true,
    "state": "approved",
    "approver_id": "u_901"
  },
  "payload": {
    "args_hash": "sha256:3b4d...",
    "result_hash": "sha256:91ad...",
    "payload_ref": "vault_7Qw..."
  },
  "integrity": {
    "prev_hash": "sha256:...",
    "this_hash": "sha256:..."
  }
}
That prev_hash → this_hash chaining gives you a tamper-evident ledger without requiring a blockchain.
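A minimal sketch of that chaining, assuming each entry's hash covers the previous hash plus the entry's own sorted-key JSON body. The all-zero genesis value and the function names are assumptions of this sketch.

```python
import hashlib
import json

GENESIS_HASH = "sha256:" + "0" * 64  # assumed starting value for an empty ledger


def entry_hash(record: dict, prev_hash: str) -> str:
    """Hash the previous hash together with this record's canonical body."""
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(ledger: list, record: dict) -> dict:
    """Append a record, linking it to the current head of the ledger."""
    prev_hash = ledger[-1]["integrity"]["this_hash"] if ledger else GENESIS_HASH
    entry = {**record, "integrity": {"prev_hash": prev_hash,
                                     "this_hash": entry_hash(record, prev_hash)}}
    ledger.append(entry)
    return entry


def verify_chain(ledger: list) -> bool:
    """Recompute every hash; any edited, removed, or reordered entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in ledger:
        record = {k: v for k, v in entry.items() if k != "integrity"}
        integrity = entry["integrity"]
        if integrity["prev_hash"] != prev_hash:
            return False
        if integrity["this_hash"] != entry_hash(record, prev_hash):
            return False
        prev_hash = integrity["this_hash"]
    return True
```

The chain only proves integrity relative to its latest hash, so periodically copy that head hash somewhere the ledger's writers cannot modify.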
Turning traces into an improvement loop (the part most teams skip)
Observability isn’t the goal. Behavior-driven improvement is.
Here’s the loop that works:
- Observe: collect traces + audit entries for every run
- Label: tag failures by class (permissions, bad args, hallucinated entity, timeout, partial completion)
- Extract trajectories: for each incident, store a “trajectory bundle” (plan hash, tool sequence, tool hashes, policy versions)
- Build evals: turn real failed trajectories into a regression suite
- Tighten harness policies: fix with guardrails, verifiers, schema constraints—not just prompt edits
- Redeploy + watch: confirm failure rate drops and cost per success improves
This is exactly where a supervisor/harness layer becomes a moat. Without it, you’re stuck in an endless cycle of:
“try a new prompt” → “hope it works” → “repeat.”
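To make the extract and build-evals steps of the loop concrete, here is a rough sketch that turns the spans of one failed run into a trajectory bundle and then into a regression case. The span shape (dicts with name and attributes), the TrajectoryBundle fields, and the eval-case layout are all assumptions; adapt them to whatever your trace backend exports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrajectoryBundle:
    run_id: str
    failure_label: str
    plan_hash: str
    tool_sequence: tuple  # ((tool_name, args_hash), ...) in call order
    policy_version: str


def extract_bundle(run_id: str, failure_label: str, spans: list) -> TrajectoryBundle:
    """Collect the hashes and versions needed to replay a failed run, without raw payloads."""
    plan_hash = ""
    policy_version = ""
    tool_sequence = []
    for span in spans:
        attrs = span.get("attributes", {})
        if span["name"] == "agent.plan":
            plan_hash = attrs.get("agent.plan.hash", "")
        elif span["name"] == "agent.tool.call":
            tool_sequence.append((attrs.get("tool.name", ""), attrs.get("tool.args.hash", "")))
        policy_version = attrs.get("policy_version", policy_version)
    return TrajectoryBundle(run_id, failure_label, plan_hash, tuple(tool_sequence), policy_version)


def to_eval_case(bundle: TrajectoryBundle) -> dict:
    """Shape a bundle as one entry in a regression suite."""
    return {
        "case_id": f"regression_{bundle.run_id}",
        "label": bundle.failure_label,
        "plan_hash": bundle.plan_hash,
        "tools": [name for name, _ in bundle.tool_sequence],
        "policy_version": bundle.policy_version,
    }
```

Because the bundle holds hashes and versions rather than raw payloads, it can live in the eval repository without inheriting the vault's access controls.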
Implementation checklist (copy/paste)
Day 0 (today): get debuggable runs
- Generate a stable run_id at the start of every agent run
- Emit an agent.run root span with workflow_id, tenant_id, versions
- Emit an agent.tool.call span for every tool call
- Capture tool.name, side_effect_level, tool_args_hash, tool_result_hash
- Capture tokens_in/out, latency_ms, and cost_usd at run + step levels
- Add redaction + payload vault (hashes in traces, raw in vault)
Day 7: make autonomy safe
- Add approval telemetry (approval.state, latency, approver id)
- Add idempotency keys for write/irreversible actions
- Add circuit breakers (max steps, max cost, max time)
- Normalize errors into a small error_class taxonomy
- Add SLOs: success rate and cost-per-success per workflow
Day 30: turn production into training data (without the chaos)
- Auto-extract “failed trajectories” into a weekly regression eval set
- Add verifier spans and track verifier pass/fail rates
- Build dashboards that map tool errors by vendor endpoint
- Add policy diffing: correlate failure spikes with policy/model/schema versions
Where nNode fits in this picture
Most teams instrument the model and call it a day.
But the real production failures happen at the seams:
- tool selection vs tool execution
- retries vs idempotency
- approvals vs velocity
- verification vs silent corruption
nNode is built around the harness/supervisor layer: the part that makes agentic workflows observable, controllable, and steadily improvable. If you’re serious about shipping tool-using agents that operators can trust (and on-call can debug), that layer is where you win.
If you want to see what this looks like in a real workflow engine—where traces, auditability, approvals, and reliability loops are first-class—take a look at nnode.ai.
If you’re building agents in production and keep asking “what actually happened?”, nNode is worth a conversation.