OAuth scopes in long-running workflows break differently than in normal apps
If you’re building long-running workflows on top of OAuth scopes (AI agents that run for minutes, retry, and touch multiple SaaS tools), OAuth will fail in ways that feel “random”: it worked in the demo, then a background run hits a 401, or a Notion database suddenly becomes “not found.”
This isn’t because OAuth is “bad.” It’s because agent workflows behave like distributed systems: long-lived state, multiple tool calls, concurrency, and partial failure. If you treat auth like a one-time login step, you’ll ship something that disconnects at the worst possible moment.
This tutorial is a production debugging playbook you can copy into your team wiki. It’s written for teams building hosted, reusable automations (the “white-box workflow” style) where reliability is the product.
Why OAuth fails more in agent workflows than in request/response apps
Traditional web apps:
- The user is present.
- The session is short.
- You can redirect to re-auth quickly.
Agent/workflow apps:
- Runs can be long-running (minutes to hours).
- Steps can run in the background (no user present).
- You have retries (sometimes with exponential backoff).
- You have multi-tool chains (Drive → Docs → Notion → Slack).
- You may have multiple workflows running concurrently for the same user.
That combination changes your auth requirements:
- You need offline access (refresh tokens) for anything that must continue without the user in the browser.
- You need durable token storage (so restarts don’t force re-consent).
- You need run-context binding (so the right token is used for the right user/workspace/workflow).
The 6 OAuth failure modes you’ll actually see (and how they present)
1) Missing or incorrect scopes
Symptoms
- Google: API calls fail with 403 / “insufficient permissions”.
- Notion: endpoints fail even though the integration looks “connected.”
Root causes
- You requested a scope set that doesn’t include the operation you’re doing.
- You changed features over time, but never forced a re-consent.
- You’re calling a different API than you think (e.g., Drive vs Docs).
Fix
- Maintain a scope-to-action map (per connector, per capability).
- Add “scope diffing” to your logs (requested vs granted).
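A scope-to-action map and a scope diff can be very small. The sketch below is illustrative only: the action names (`drive.list_files` and so on) are hypothetical, and the scope sets mirror the table later in this post rather than an exhaustive list of what each API accepts.

```python
# Hypothetical scope-to-action map; action names are made up for illustration.
GOOGLE = "https://www.googleapis.com/auth/"

# Each action lists alternative scopes; any ONE of them is treated as sufficient.
REQUIRED_SCOPES = {
    "drive.list_files": {GOOGLE + "drive.metadata.readonly", GOOGLE + "drive"},
    "drive.create_file": {GOOGLE + "drive.file", GOOGLE + "drive"},
    "docs.update_doc": {GOOGLE + "documents"},
}


def scope_diff(requested, granted):
    """Compare requested vs granted scopes so the difference can be logged."""
    requested, granted = set(requested), set(granted)
    return {
        "missing": sorted(requested - granted),  # asked for, not granted
        "extra": sorted(granted - requested),    # granted, never asked for
    }


def can_perform(action, granted):
    """True if at least one acceptable scope for the action was granted."""
    return bool(REQUIRED_SCOPES[action] & set(granted))
```

Logging the `scope_diff` result next to the attempted action usually makes a 403 self-explanatory.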
2) Refresh token not issued
Symptoms
- Works for ~1 hour, then fails when access token expires.
- You never see a refresh token stored.
Root causes (Google)
- You didn’t request offline access (access_type=offline).
- You did request it, but Google only returns the refresh token on the first consent for that client-user combination. If you lost it, you can’t get it back without forcing re-consent.
Fix
- Ensure you request offline access.
- Store the refresh token in long-lived storage immediately.
- If you lost it, force a re-consent with prompt=consent.
3) Refresh token invalidation / rotation / “invalid_grant”
Symptoms
- Token refresh fails with invalid_grant.
- A subset of users are affected; repro is inconsistent.
Common root causes
- The user revoked access.
- You exceeded Google’s refresh token issuance limits (older refresh tokens stop working).
- Clock skew on your server breaks token exchange (more common than you’d like).
Fix
- Treat invalid_grant as a “re-auth required” state, not a transient retry.
- Sync server time (NTP).
- Don’t request new refresh tokens repeatedly; reuse the stored one.
4) Session cookie vs OAuth token confusion
Symptoms
- UI says “connected,” but API calls fail.
- Reconnecting in the UI “fixes it” temporarily.
Root causes
- Your UI session is alive, but the underlying OAuth credential is missing/expired/revoked.
Fix
- In the product, distinguish:
- “Signed into the app” (session)
- “Connector authorized” (OAuth)
5) Workspace/admin policy restrictions
Symptoms
- Google Workspace users fail; personal Gmail works (or vice versa).
- Failures happen after IT changes policies.
Fix
- Detect policy errors and surface a clear message (“Your admin must allow this app/scope”).
- Log the user’s account type (consumer vs Workspace) and domain.
6) Wrong principal / wrong tenant (credential mapping bugs)
Symptoms
- Files show up in the wrong Drive.
- Notion writes to the wrong workspace.
- “It works for me” but fails for a teammate.
Root causes
- Tokens aren’t strongly bound to: user_id + workspace/domain + connector + workflow/run.
Fix
- Implement strict credential isolation and run-context binding (see pattern below).
Step-by-step OAuth debugging runbook (copy/paste checklist)
Use this when a workflow breaks in production.
Step 0 — Reproduce with a minimal workflow
Before you touch code, make the failure small.
- One connector
- One API call
- No branching
Example minimal checks:
- Google Drive: list 1 file in the target folder
- Google Docs: create a doc, then read it back
- Notion: list databases or read a known page
This matters because complex workflows hide the first auth failure behind retries and downstream errors.
Step 1 — Confirm identity (account + workspace)
Log and verify:
- Google sub (OpenID subject) or user email (if you collect it)
- Workspace domain (if Google Workspace)
- Notion workspace ID
- The connector account label the user selected (if you allow multiple accounts)
If you can’t answer “which account is this credential for?” you will keep chasing ghosts.
Step 2 — Confirm granted scopes vs required scopes
Maintain a small internal table:
| Action | Required scopes |
|---|---|
| Drive: list files | drive.metadata.readonly (or broader) |
| Drive: create file | drive.file or drive |
| Docs: create/update doc | Docs scope(s) + Drive if you move/share |
Then log:
- scopes you requested
- scopes you believe were granted
- the action you attempted
Step 3 — Inspect token lifecycle events
Track these events with timestamps:
- token_issued
- refresh_succeeded
- refresh_failed
- revoked_detected
- reauth_required_set
If you don’t track them, you’ll miss patterns like “refresh fails after 55 minutes” or “revoked 3 days after last run.”
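One way to surface patterns like these is to keep the events as plain records and query them. This is a minimal sketch; `TokenEvent` and `time_to_first_failure` are hypothetical names, and a real system would read events from a log store rather than a list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TokenEvent:
    credential_id: str
    name: str      # e.g. "token_issued", "refresh_failed"
    at: datetime   # timezone-aware timestamp


def time_to_first_failure(events, credential_id):
    """Time between a token_issued event and the first refresh_failed after it.

    Measured from the most recent token_issued that precedes the failure.
    Returns None if the credential has no such failure.
    """
    mine = sorted(
        (e for e in events if e.credential_id == credential_id),
        key=lambda e: e.at,
    )
    issued_at = None
    for event in mine:
        if event.name == "token_issued":
            issued_at = event.at
        elif event.name == "refresh_failed" and issued_at is not None:
            return event.at - issued_at
    return None
```

Grouping that value across credentials shows whether failures cluster around token expiry or arrive days later, which points at revocation instead.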
Step 4 — Verify storage and retrieval are deterministic
Most “random disconnects” are storage bugs:
- refresh token stored in the wrong row
- credential overwritten by a second consent
- encryption key mismatch across environments
You want to prove that:
- the refresh token is persisted once
- every workflow run loads the same credential record
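The “overwritten by a second consent” bug is easy to reproduce and to guard against. Here is a rough sketch using an in-memory SQLite table (it needs SQLite 3.24+ for upsert support). The table and function names are hypothetical, and tokens are stored in plaintext only to keep the example short.

```python
import sqlite3


def open_store():
    """Create an in-memory store with one row per (user, provider, workspace)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE creds (
               user_id TEXT NOT NULL,
               provider TEXT NOT NULL,
               workspace_id TEXT NOT NULL,
               access_token TEXT NOT NULL,   -- plaintext for the sketch only
               refresh_token TEXT,
               UNIQUE (user_id, provider, workspace_id)
           )"""
    )
    return db


def save_consent(db, user_id, provider, workspace_id, access_token, refresh_token=None):
    """Upsert the credential; never replace a stored refresh token with NULL."""
    db.execute(
        """INSERT INTO creds (user_id, provider, workspace_id, access_token, refresh_token)
           VALUES (?, ?, ?, ?, ?)
           ON CONFLICT (user_id, provider, workspace_id) DO UPDATE SET
               access_token = excluded.access_token,
               refresh_token = COALESCE(excluded.refresh_token, creds.refresh_token)""",
        (user_id, provider, workspace_id, access_token, refresh_token),
    )
    db.commit()


def load(db, user_id, provider, workspace_id):
    """Return (access_token, refresh_token) or None if no record exists."""
    return db.execute(
        "SELECT access_token, refresh_token FROM creds "
        "WHERE user_id = ? AND provider = ? AND workspace_id = ?",
        (user_id, provider, workspace_id),
    ).fetchone()
```

The `COALESCE` matters because a repeat consent often comes back without a refresh token, and a naive overwrite would erase the one you already had.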
Step 5 — Add structured logs (with redaction)
Log enough to debug without leaking secrets.
A good rule: log token fingerprints, never tokens.
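// TypeScript sketch (Node.js)
import { createHash } from "node:crypto";

// Return a short, non-reversible identifier that is safe to log
function fingerprint(secret: string): string {
  return createHash("sha256").update(secret).digest("hex").slice(0, 10);
}

// Assumes logger, err, and the ID variables are in scope at the call site
logger.info("oauth.refresh_failed", {
  provider: "google",
  user_id,
  credential_id,
  refresh_token_fp: fingerprint(refresh_token),
  error: err.error,
  error_description: err.error_description,
  workflow_run_id,
});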
Also log the HTTP status and Google/Notion request IDs if available.
Implementation pattern: auth as a first-class workflow dependency
The core shift is this:
Don’t treat auth as “the setup screen.” Treat auth as a dependency that each workflow run must validate.
Pattern 1 — Preflight auth checks at workflow start
At the beginning of every run:
- Load credential record (by user + connector + workspace)
- Verify it is not marked reauth_required
- Attempt a cheap API call (or a token refresh if near expiry)
- Only then proceed
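# Python sketch; `vault` and `oauth` stand in for your own storage and OAuth client
from datetime import datetime, timedelta, timezone

class ReauthRequired(Exception):
    """The user must reconnect the connector before the run can continue."""

class OAuthInvalidGrant(Exception):
    """The token endpoint returned invalid_grant."""

def preflight_auth(run, vault, oauth):
    cred = vault.load(run.user_id, run.connector, run.workspace_id)
    if cred.status == "reauth_required":
        raise ReauthRequired("User must reconnect")

    # If the access token is expired (or close), refresh before doing any work.
    # Assumes access_token_expires_at is a timezone-aware datetime.
    now = datetime.now(timezone.utc)
    if cred.access_token_expires_at < now + timedelta(minutes=5):
        try:
            cred = oauth.refresh(cred)
            vault.save(cred)
        except OAuthInvalidGrant:
            vault.mark_reauth_required(cred.id)
            raise ReauthRequired("Token revoked or expired")
    return cred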
Pattern 2 — Mid-run reauth gates (only when needed)
Long workflows can cross token boundaries. Instead of refreshing “whenever,” refresh:
- when you’re within a safety window (e.g., 5 minutes)
- before expensive steps (bulk writes, file creation)
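A refresh gate like this can be a single pure function. The sketch below interprets “before expensive steps” as using a wider safety margin for those steps; that interpretation, the 15-minute value, and the function name are all assumptions.

```python
from datetime import datetime, timedelta, timezone

SAFETY_WINDOW = timedelta(minutes=5)
# Wider margin before steps that are costly to interrupt (assumed value).
EXPENSIVE_STEP_WINDOW = timedelta(minutes=15)


def should_refresh(expires_at, now, expensive_step=False):
    """True if the access token expires within the applicable safety window."""
    window = EXPENSIVE_STEP_WINDOW if expensive_step else SAFETY_WINDOW
    return expires_at - now <= window
```

Passing `now` in explicitly keeps the gate easy to test and avoids each step reading the clock slightly differently.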
Pattern 3 — Safe failure: halt + ask, don’t partially complete
If a workflow can’t authenticate midway:
- stop
- mark run as blocked_on_auth
- provide a “Reconnect Google / Notion” action
- resume from the last idempotent checkpoint
This is where “white-box workflows” shine: users can see exactly where the run stopped and why.
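The halt-and-resume behavior can be sketched as a loop that checkpoints after every completed step. This assumes each step is idempotent (safe to re-run) and that `state` is persisted somewhere durable between attempts. `run_workflow` and the `state` keys are hypothetical.

```python
class ReauthRequired(Exception):
    """Raised by a step when the connector needs to be reconnected."""


def run_workflow(steps, state):
    """Run steps in order from state['next_step']; pause on auth failure.

    `steps` is a list of zero-argument callables. `state` is a dict that the
    caller is assumed to persist between attempts.
    """
    state["status"] = "running"
    for index in range(state.get("next_step", 0), len(steps)):
        try:
            steps[index]()
        except ReauthRequired as exc:
            state["status"] = "blocked_on_auth"
            state["blocked_reason"] = str(exc)
            state["next_step"] = index  # resume at the step that failed
            return state
        state["next_step"] = index + 1  # checkpoint after each completed step
    state["status"] = "completed"
    return state
```

After the user reconnects, calling `run_workflow` again with the saved `state` picks up at the blocked step without repeating the earlier ones.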
Google-specific quick wins (Drive/Docs)
1) Always request offline access (and store refresh tokens immediately)
Google’s web-server flow supports offline access via access_type=offline. In practice:
- you only reliably get a refresh token when you request offline access
- you may need to force a re-consent if the user already authorized without offline access
Node.js example:
// google-auth-library style
const authorizationUrl = oauth2Client.generateAuthUrl({
  access_type: "offline", // required for refresh_token
  include_granted_scopes: true, // incremental auth
  prompt: "consent", // force refresh_token if you lost it
  scope: [
    "https://www.googleapis.com/auth/drive.file",
    "https://www.googleapis.com/auth/documents",
  ],
});
Important workflow detail: if you run multiple re-auth flows, you can invalidate older refresh tokens when you hit issuance limits. Make re-auth rare and intentional.
2) Handle invalid_grant as a state transition, not a retry
Retrying invalid_grant usually just burns time.
A practical policy:
- Refresh fails once with invalid_grant → mark credential reauth_required
- Surface a reconnect prompt
- Don’t keep the workflow running in a degraded state
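That policy can be expressed as a small classifier over the token endpoint's response. The sketch below relies only on the standard OAuth 2.0 error code `invalid_grant` and the HTTP status. The three outcome labels are hypothetical, and treating 429 and 5xx as retryable is an assumption about your retry policy.

```python
def classify_refresh_failure(http_status, error_code):
    """Map a failed token refresh to one of: reauth_required, retry, fatal."""
    if error_code == "invalid_grant":
        # Revoked, expired, or superseded refresh token: retrying cannot help.
        return "reauth_required"
    if http_status == 429 or 500 <= http_status < 600:
        # Rate limiting or provider-side trouble: back off and try again.
        return "retry"
    # Anything else (e.g. invalid_client) points at configuration, not the user.
    return "fatal"
```

Keeping `fatal` separate from `reauth_required` avoids asking users to reconnect when the real problem is a misconfigured client.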
3) Prevent “created a .txt instead of a Google Doc” class of bugs
This one is less about OAuth and more about production hardening:
- Validate you’re calling the correct endpoint (Drive create vs Docs create)
- Validate MIME types before execution
- Add a “create → read → update” minimal test to your demo script
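A pre-execution MIME check might look like the sketch below. `application/vnd.google-apps.document` is the Drive MIME type for a Google Doc. The function name and the file-extension heuristic are assumptions, and the heuristic may need tuning for your naming conventions.

```python
GOOGLE_DOC_MIME = "application/vnd.google-apps.document"


def validate_doc_create_request(metadata):
    """Return a list of problems with a Drive file-metadata dict meant to create a Google Doc."""
    problems = []
    mime = metadata.get("mimeType")
    if mime != GOOGLE_DOC_MIME:
        problems.append(f"mimeType is {mime!r}, expected {GOOGLE_DOC_MIME!r}")
    name = metadata.get("name", "")
    # Heuristic: a file extension in the name often means a plain file was intended.
    if name.lower().endswith((".txt", ".docx")):
        problems.append(f"name {name!r} ends with a file extension")
    return problems
```

An empty list means the request looks like a Google Doc creation; anything else is worth failing on before the API call.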
Notion-specific quick wins (permissions are the real ‘scope’)
Notion errors often look like auth problems but are actually sharing problems.
The #1 Notion gotcha: the integration must be shared with the page/database
Even with a valid token, Notion will return errors if the target page/database isn’t shared with your integration.
Operationally, add a Notion preflight step:
- “Can I read the target database/page?”
- If not, show instructions to share it via Add connections in the Notion UI
Minimal read preflight (HTTP):
curl -sS https://api.notion.com/v1/users \
  -H "Authorization: Bearer $NOTION_TOKEN" \
  -H "Notion-Version: 2022-06-28" \
  -H "Content-Type: application/json"
If /v1/users works but a database read fails with “not found,” it’s usually sharing.
Detect permission errors early
In workflows, fail fast:
- verify the page/database is accessible at the start
- don’t wait until step 12 to discover the integration wasn’t added
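To fail fast with a useful message, the preflight can translate Notion's error response into a next action. This sketch assumes the error body is a JSON object with a `code` field such as `object_not_found` or `unauthorized`. The returned action labels are hypothetical.

```python
def classify_notion_error(status, body):
    """Suggest a next action from a Notion API error (HTTP status + parsed JSON body)."""
    code = body.get("code")
    if status == 401 or code == "unauthorized":
        return "reconnect"               # token is invalid: run OAuth again
    if code == "object_not_found":
        return "share_with_integration"  # usually the page/database is not shared
    if code == "restricted_resource":
        return "missing_capability"      # integration lacks the needed capability
    if code == "rate_limited":
        return "retry"
    return "unknown"
```

Mapping `object_not_found` to a sharing prompt instead of a reconnect prompt saves users from re-authorizing when the token was never the problem.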
Token vault + credential isolation (the architecture that stops ‘random’ failures)
If you’re building an agent platform (especially multi-tenant), you want a credential model like:
- Credential record is per user + provider + workspace
- Workflow runs reference a credential by ID (never “grab latest token for user”)
- All secrets are encrypted at rest
Example schema:
-- Pseudo-SQL
CREATE TABLE oauth_credentials (
  id UUID PRIMARY KEY,
  user_id UUID NOT NULL,
  provider TEXT NOT NULL, -- google, notion
  workspace_id TEXT NULL, -- domain/workspace identifier
  scopes TEXT NOT NULL, -- stored as space-delimited or JSON array
  access_token_ciphertext TEXT NOT NULL,
  access_token_expires_at TIMESTAMP NOT NULL,
  refresh_token_ciphertext TEXT NULL,
  status TEXT NOT NULL DEFAULT 'active', -- active | reauth_required | revoked
  created_at TIMESTAMP NOT NULL,
  updated_at TIMESTAMP NOT NULL
);
-- A plain unique index treats NULL workspace_id values as distinct in many engines
-- (PostgreSQL included), so coalesce them to keep one row per user + provider
CREATE UNIQUE INDEX oauth_cred_unique
  ON oauth_credentials (user_id, provider, COALESCE(workspace_id, ''));
This is the backbone of “durable workflow authentication.” Without it, concurrency and long runs will eventually mix credentials.
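Run-context binding can be enforced at load time: the run carries a credential ID, and the loader refuses any credential whose owner does not match the run. The sketch below uses a plain dict as a stand-in vault. All class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Credential:
    id: str
    user_id: str
    provider: str
    workspace_id: str


@dataclass(frozen=True)
class RunContext:
    run_id: str
    user_id: str
    provider: str
    workspace_id: str
    credential_id: str  # fixed when the run is created, never "latest for user"


class CredentialMismatch(Exception):
    """The referenced credential does not belong to this run's context."""


def credential_for_run(vault, run):
    """Load the run's credential by ID and verify it matches the run context."""
    cred = vault[run.credential_id]  # KeyError if the record is missing
    expected = (run.user_id, run.provider, run.workspace_id)
    actual = (cred.user_id, cred.provider, cred.workspace_id)
    if expected != actual:
        raise CredentialMismatch(f"run {run.run_id}: expected {expected}, got {actual}")
    return cred
```

The check is redundant when everything works, which is the point: a mapping bug becomes a loud exception instead of a write to the wrong workspace.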
The 10-minute “Fresh Sign-In Demo” checklist (agencies + sales)
If you demo agent workflows, the fastest way to lose trust is: “Hang on, Google disconnected again.”
Run this script before every demo and before onboarding a client.
Google (Drive/Docs)
- Disconnect the connector (so you test the real flow)
- Reconnect and confirm the account email/domain
- Verify required scopes are present
- Run a create → read → update loop:
- create a Google Doc in the target folder
- read it back
- append a line
- Wait 65 minutes or force-refresh to confirm the refresh token works
Notion
- Reconnect integration
- Confirm workspace
- Share the target database/page with the integration
- Run read → write:
- list databases or read a page
- create a new row/page
Workflow platform sanity
- Confirm only one run uses the credential at a time (or confirm your refresh logic is concurrency-safe)
- Confirm logs show a clean preflight and no hidden retries
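For the concurrency item, one common approach is a per-credential lock with a re-check inside it, so concurrent runs trigger a single refresh. This sketch only covers threads in one process; multiple workers would need a shared lock, for example in the database. `RefreshCoordinator` and `refresh_fn` are hypothetical names.

```python
import threading
import time


class RefreshCoordinator:
    """Serialize token refreshes per credential within one process."""

    def __init__(self, refresh_fn, safety_window_s=300):
        # refresh_fn(credential_id) -> (access_token, expires_at_epoch_seconds)
        self._refresh_fn = refresh_fn
        self._safety_window_s = safety_window_s
        self._locks = {}
        self._locks_guard = threading.Lock()
        self._tokens = {}  # credential_id -> (access_token, expires_at)

    def _lock_for(self, credential_id):
        with self._locks_guard:
            return self._locks.setdefault(credential_id, threading.Lock())

    def get_token(self, credential_id):
        """Return a usable access token, refreshing at most once at a time."""
        with self._lock_for(credential_id):
            # Re-check inside the lock: another thread may have just refreshed.
            entry = self._tokens.get(credential_id)
            if entry and entry[1] - time.time() > self._safety_window_s:
                return entry[0]
            access_token, expires_at = self._refresh_fn(credential_id)
            self._tokens[credential_id] = (access_token, expires_at)
            return access_token
```

Without the re-check, every waiting run would refresh in turn, which is exactly the repeated-refresh pattern that makes failures look random.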
Where nNode fits (without making OAuth your full-time job)
nNode (Endnode) is built for hosted, reusable, white-box workflows—the kind you deliver to clients or run repeatedly inside your business. In that world, OAuth reliability isn’t a checkbox; it’s a moat.
The patterns in this post map to what strong workflow platforms should give you by default:
- Preflight auth steps you can add to any workflow
- Clear “reauth required” checkpoints that pause runs safely
- Run logs that make scope/permission issues obvious
- Connector hygiene (Google Drive/Docs + Notion) that survives long-running execution
If you’re currently stitching together one-off scripts or “black box tasks,” you can adopt the runbook above today. And if you want these reliability primitives baked into a hosted workflow system, nNode is designed for exactly that: Claude Code-style power, but with integrations and durable execution.
Soft CTA
If you’re building agent automations that touch Google Drive/Docs and Notion—and you want them to keep working after the demo—take a look at nnode.ai. You’ll get a workflow-first approach where auth, retries, and long-running runs are treated as first-class concerns, not afterthoughts.