If you’ve tried to bolt “AI search” onto a real company’s Google Workspace, you’ve probably felt the gap:
- Semantic search can find relevant documents, but it struggles with “what’s the latest?”
- It can summarize a doc, but it struggles with “who owns this?”
- It can surface a paragraph, but it struggles with “what changed since last month, and which email thread approved it?”
That’s not because embeddings are bad. It’s because most operational questions are relationship questions.
This post is a practical playbook for building a business knowledge graph from Google Drive + Gmail—incrementally, permission-aware, and without migrating your whole company into a new “knowledge base.” Internally at nNode we call this a “Karpathy graph”: a lightweight, evolving context layer that helps an AI operator behave less like a chatbot and more like a teammate.
We’ll cover:
- Why vector RAG alone breaks on operator queries
- A minimum viable work-graph: the nodes/edges that matter first
- A build pipeline: Drive/Gmail → extraction → graph → retrieval
- GraphRAG vs RAG: a decision framework
- Permission-aware retrieval and approval-first execution
- Two starter workflows you can ship fast
1) The problem with “AI search” in real companies
In clean demos, RAG looks like magic: chunk docs, embed, retrieve, summarize.
In actual operations, the failure modes are predictable:
- Version chaos: `final.pdf`, `final_v3.pdf`, `final_final_revised.pdf` in three folders.
- Ambiguous names: “Pricing Sheet” means something different per region or customer segment.
- Email-only truth: the real terms are in a Gmail thread, not the doc.
- Ownership drift: the doc exists, but nobody knows who can approve a change.
- Permission realities: “just index everything” becomes “accidentally leak everything.”
The core issue: most operator questions require recency + relationships + permissions—not just similarity.
2) What a “Karpathy graph” means (in plain English)
This is not “let’s do a 6-month ontology project.”
A Karpathy graph (operator definition) is:
A private, evolving map of your business: people, customers, vendors, projects, documents, email threads, decisions, and tasks—plus the links between them.
It’s “lightweight” because:
- You start with metadata + pointers (not perfect document understanding).
- You accept uncertainty (confidence scores, “maybe” edges).
- You build it incrementally, driven by the workflows you actually run.
In nNode terms, this graph is the assistant’s context layer—the thing that makes “blackbox mode” (works on messy Drive) possible without forcing a migration.
3) Minimum viable schema: 12 nodes that unlock ops value
You don’t need hundreds of entity types. You need just enough structure to answer:
- “What’s the latest?”
- “Who owns it?”
- “What’s connected to this customer/vendor/project?”
- “Where did this decision come from?”
Here’s a minimum viable schema that works surprisingly well.
Core entity nodes
| Node | What it represents | Example properties |
|---|---|---|
Person | Employee / collaborator | email, name, role |
Org | Company entity | domain, legal name |
Customer | Customer account | domain, external IDs |
Vendor | Supplier/provider | domain, category |
Project | Ongoing initiative | status, owner |
Product | SKU / service | SKU, category |
Artifact nodes
| Node | What it represents | Example properties |
|---|---|---|
DriveFile | Doc/Sheet/PDF/etc. | fileId, mimeType, modifiedTime, webViewLink |
EmailThread | A Gmail thread | threadId, subject, lastMessageAt, participants |
Meeting | A calendar event + transcript | eventId, startTime, attendees |
Commitment & execution nodes
| Node | What it represents | Example properties |
|---|---|---|
Decision | “We agreed to X” | decidedAt, confidence |
ActionItem | Task someone owes | dueDate, status |
Approval | Explicit approval gate | approver, approvedAt, scope |
This schema is intentionally operator-first: it models what the business does, not just what it knows.
4) The edges that actually matter (and how to infer them safely)
If nodes are nouns, edges are the verbs.
Start with edges you can infer cheaply and safely:
- `OWNS` (Person → Project/DriveFile)
- `MENTIONED_IN` (Entity → DriveFile/EmailThread/Meeting)
- `RELATED_TO` (Customer ↔ DriveFile/EmailThread)
- `LATEST_VERSION_OF` (DriveFile → DriveFile)
- `DECISION_FROM` (Decision → Meeting/EmailThread)
- `ASSIGNED_TO` (ActionItem → Person)
- `REQUIRES_APPROVAL` (ActionItem/Change → Approval)
How to infer edges without overfitting
Use a layered approach:
1. Deterministic signals first
   - Drive file owners, sharing ACLs
   - Gmail headers (From/To/CC), thread IDs
   - Calendar attendees
2. Lightweight extraction second
   - Domains → likely `Org`/`Customer`/`Vendor`
   - Invoice/PO numbers → linking anchors
   - Explicit phrases (“approved”, “LGTM”, “go ahead”) → candidate `Approval`
3. LLM extraction last (and always scored)
   - “What decision was made?”
   - “Who owns the next step?”
   - “Which customer is this thread about?”
If it can’t be inferred confidently, store it as a low-confidence edge and let workflows upgrade it over time.
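One way to make “low-confidence edge, upgraded over time” concrete is to store a score per edge and only let stronger evidence overwrite weaker evidence. A minimal in-memory sketch (the `EDGES` store and `upsert_edge` helper are illustrative, not a real API):

```python
# Minimal sketch of confidence-scored edges: a deterministic signal
# (confidence 1.0) can upgrade an LLM guess (confidence 0.4), never the reverse.
EDGES: dict[tuple, float] = {}  # (edge_type, src, dst) -> confidence

def upsert_edge(edge_type: str, src: str, dst: str, confidence: float) -> bool:
    """Insert or upgrade an edge; returns True if the store changed."""
    key = (edge_type, src, dst)
    if confidence > EDGES.get(key, 0.0):
        EDGES[key] = confidence
        return True
    return False

# An LLM guess lands first...
upsert_edge("RELATED_TO", "customer:acme.com", "thread:t1", 0.4)
# ...then a deterministic signal (e.g., a Gmail From: header) upgrades it.
upsert_edge("RELATED_TO", "customer:acme.com", "thread:t1", 1.0)
```

The same monotonic rule lets a later workflow (say, a human confirming a link during an approval) pin an edge at confidence 1.0 so heuristics can never downgrade it.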
5) Build pipeline: Drive/Gmail → extraction → graph (incremental, not brittle)
A good pipeline has two rules:
- Don’t try to parse everything on day 1.
- Keep the graph anchored to sources-of-truth (file IDs, thread IDs, event IDs) so it doesn’t drift.
Here’s a practical architecture:
```mermaid
graph LR
  A[Google Drive API] --> M[Metadata Ingest]
  B[Gmail API] --> M
  C[Calendar/Meet Transcript] --> E[Extraction]
  M --> G[(Graph Store)]
  E --> G
  G --> R[Retrieval Layer]
  R --> L[LLM / Agent]
  L -->|proposes actions| H[Approval Gates]
  H -->|executes| T[Tools: Gmail/Drive/Sheets]
```
Step 0: choose a graph store (don’t overthink it)
For an MVP:
- Postgres tables (`nodes`, `edges`) are fine.
- Neo4j / Memgraph is great if you want Cypher and multi-hop queries.
The critical thing is not the DB—it’s stable IDs and permission-aware retrieval.
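If you go the Postgres route, two tables really do suffice for a while. A sketch of the layout (shown with `sqlite3` for portability; the SQL is the same shape in Postgres, where `props` would be JSONB):

```python
import sqlite3

# Two-table MVP: nodes keyed by a stable source ID (Drive fileId, Gmail
# threadId), edges as typed (src, dst) pairs with a confidence score.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (
    id    TEXT PRIMARY KEY,   -- e.g. 'drivefile:1AbC...', 'thread:18f...'
    type  TEXT NOT NULL,      -- DriveFile, EmailThread, Person, ...
    props TEXT                -- JSON blob (JSONB in Postgres)
);
CREATE TABLE edges (
    type       TEXT NOT NULL, -- OWNS, RELATED_TO, LATEST_VERSION_OF, ...
    src        TEXT NOT NULL REFERENCES nodes(id),
    dst        TEXT NOT NULL REFERENCES nodes(id),
    confidence REAL NOT NULL DEFAULT 1.0,
    PRIMARY KEY (type, src, dst)
);
""")
conn.execute("INSERT INTO nodes VALUES ('person:ana@example.com', 'Person', '{}')")
conn.execute("INSERT INTO nodes VALUES ('drivefile:abc', 'DriveFile', '{}')")
conn.execute(
    "INSERT INTO edges VALUES ('OWNS', 'person:ana@example.com', 'drivefile:abc', 1.0)"
)
```

The composite primary key on `(type, src, dst)` makes edge writes naturally idempotent, which matters once you re-run ingestion.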
Step 1: ingest Drive metadata + permissions (the cheat code)
Drive metadata is high-signal and low-risk because you’re not reading content yet.
```python
# pseudo-code (Google Drive API)
def ingest_drive_files(drive_service, since_rfc3339: str):
    fields = (
        "files(id, name, mimeType, modifiedTime, owners(emailAddress), "
        "permissions(emailAddress, domain, type, role), webViewLink, parents)"
    )
    page_token = None
    while True:
        resp = drive_service.files().list(
            q=f"modifiedTime > '{since_rfc3339}' and trashed=false",
            fields=f"nextPageToken,{fields}",
            pageToken=page_token,
            pageSize=1000,
            supportsAllDrives=True,
            includeItemsFromAllDrives=True,
        ).execute()
        for f in resp.get("files", []):
            upsert_node("DriveFile", {
                "id": f["id"],
                "name": f.get("name"),
                "mimeType": f.get("mimeType"),
                "modifiedTime": f.get("modifiedTime"),
                "webViewLink": f.get("webViewLink"),
            })
            attach_acl("DriveFile", f["id"], f.get("permissions", []))
            link_owner_edges(f)
        page_token = resp.get("nextPageToken")
        if not page_token:
            break
```
Why this matters: “permission-aware retrieval” is much easier if you model permissions from the start, even before you do any semantic indexing.
Step 2: ingest Gmail threads (threads > messages)
Gmail messages are messy: quoted history duplicates content and inflates token counts. Threads are a better unit.
```python
# pseudo-code (Gmail API)
def ingest_gmail_threads(gmail_service, query: str = "newer_than:90d"):
    # List threads matching the query
    threads = gmail_service.users().threads().list(
        userId="me",
        q=query,
        maxResults=200,
    ).execute().get("threads", [])

    for t in threads:
        full = gmail_service.users().threads().get(userId="me", id=t["id"]).execute()
        subject = guess_subject(full)
        participants = extract_participants(full)
        last_ts = max(int(m.get("internalDate", 0)) for m in full.get("messages", []))
        upsert_node("EmailThread", {
            "id": full["id"],
            "subject": subject,
            "lastMessageAt": last_ts,
            "participantEmails": participants,
        })
        # Link to Customers/Vendors by domain heuristics (later upgraded by extraction)
        link_domains(participants, thread_id=full["id"])
```
Operator tip: store both the raw thread reference and a “deduped, reasoning-ready” thread text view (built by stripping quoted history and signatures).
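A rough sketch of building that “reasoning-ready” view: drop quoted lines, common “On … wrote:” reply headers, and everything after a signature delimiter. The patterns here are heuristics, not a complete parser:

```python
import re

# Heuristic cleanup for a "reasoning-ready" thread view: strip quoted
# history (lines starting with '>'), "On ... wrote:" reply headers, and
# everything after a signature delimiter ("-- ").
def clean_message(body: str) -> str:
    lines = []
    for line in body.splitlines():
        if line.startswith(">"):                  # quoted history
            continue
        if re.match(r"^On .+ wrote:\s*$", line):  # reply header
            continue
        if line.strip() == "--":                  # signature delimiter
            break
        lines.append(line)
    return "\n".join(lines).strip()

raw = "Terms look good.\n\nOn Mon, Jan 6, Ana wrote:\n> Attaching v3.\n--\nSam"
print(clean_message(raw))  # -> "Terms look good."
```

Real-world reply formats vary by locale and client, so keep the raw thread around and treat this view as a lossy cache you can regenerate.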
Step 3: add lightweight extraction (entities + relationships)
Start with cheap extractors:
- email/domain parsing
- regex for PO/invoice numbers
- fuzzy matching on customer names
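For example, PO/invoice numbers and participant domains can be pulled with a few lines of regex and string splitting (the `PO-`/`INV-` shape is an assumption; adapt it to your own numbering scheme):

```python
import re

# Cheap deterministic extractors: regex anchors and email-domain parsing.
PO_RE = re.compile(r"\b(?:PO|INV)-\d{3,}\b")  # assumed numbering scheme

def extract_anchors(text: str) -> list[str]:
    """PO/invoice numbers make excellent linking anchors across artifacts."""
    return PO_RE.findall(text)

def extract_domains(participants: list[str]) -> set[str]:
    """Email domains are the cheapest Customer/Vendor signal available."""
    return {p.split("@", 1)[1].lower() for p in participants if "@" in p}

print(extract_anchors("Re: PO-10421 approved, see INV-99812"))
# -> ['PO-10421', 'INV-99812']
print(extract_domains(["ana@acme.com", "sam@nnode.ai"]))
```

These run on every artifact for free, which is why they come before any model call in the layered approach.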
Then add model extraction only where it pays.
```python
# pseudo-code: extraction with confidence
def extract_decisions_and_actions(meeting_transcript: str) -> dict:
    prompt = """
    Extract:
    1) Decisions (clear agreements)
    2) Action items (owner + due date if present)
    Return JSON with confidence 0-1.
    """
    result = call_llm(prompt, meeting_transcript)
    return result

def write_to_graph(meeting_id: str, extraction: dict):
    for d in extraction.get("decisions", []):
        decision_id = upsert_node("Decision", d)
        upsert_edge("DECISION_FROM", decision_id, meeting_id)
    for a in extraction.get("action_items", []):
        action_id = upsert_node("ActionItem", a)
        upsert_edge("ACTION_FROM", action_id, meeting_id)
        if a.get("owner_email"):
            upsert_edge("ASSIGNED_TO", action_id, f"person:{a['owner_email']}")
```
Step 4: build retrieval that uses both vectors and graph
The sweet spot for many teams is hybrid:
- Vector index for “find relevant passages.”
- Graph for “what’s connected / latest / owned by / approved by.”
Practical pattern:
- Use the graph to filter and structure the candidate set (e.g., latest versions, same customer, same project).
- Use vectors to rank within that constrained set.
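In code, “graph-first, vectors-second” can be as small as: filter candidate IDs by graph constraints, then rank only those by similarity. The toy embeddings, the in-memory `vector_index`, and cosine similarity are illustrative stand-ins for your real index:

```python
import math

# Graph-first filtering, vector ranking second: the graph constrains the
# candidate set (same customer, latest versions), vectors order it.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def hybrid_retrieve(query_vec, graph_candidates, vector_index, top_k=3):
    """graph_candidates: node IDs allowed by graph constraints.
    vector_index: node_id -> embedding."""
    scored = [(cosine(query_vec, vector_index[nid]), nid)
              for nid in graph_candidates if nid in vector_index]
    return [nid for _, nid in sorted(scored, reverse=True)[:top_k]]

index = {"doc:a": [1.0, 0.0], "doc:b": [0.7, 0.7], "doc:c": [0.0, 1.0]}
# The graph says only doc:b and doc:c belong to this customer,
# so doc:a never competes, no matter how similar it is.
print(hybrid_retrieve([0.0, 1.0], ["doc:b", "doc:c"], index, top_k=1))
# -> ['doc:c']
```

The important property is that graph constraints are hard filters, not score boosts: a stale version or wrong-customer doc can never “win” on similarity alone.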
6) GraphRAG vs plain RAG: a decision framework
Here’s the simplest decision rule:
- Use plain RAG when the user question is basically: “Find and summarize a document.”
- Use GraphRAG / a workspace knowledge graph when the question is: “Resolve ambiguity, recency, or responsibility across multiple artifacts.”
RAG is usually enough for:
- “Summarize our Q1 onboarding doc.”
- “What does this contract clause say?”
GraphRAG wins when you need:
- Latest: “What’s the latest price sheet for Customer X?”
- Ownership: “Who can approve changes to the vendor terms?”
- Traceability: “Which email thread approved the discount?”
- Multi-hop: “Which projects depend on Vendor Y and have renewals next quarter?”
Concrete example (operator query)
Question: “Send me the latest vendor terms and the last negotiated exception.”
- Vector RAG might retrieve the terms PDF.
- The graph can connect:
  - `Vendor → DriveFile` (terms)
  - `Vendor → EmailThread` (negotiation)
  - `EmailThread → Decision` (exception)
  - `Decision → Person` (approver)
Then the assistant can answer with:
- the latest terms link
- the negotiation summary
- who approved what
- and what action is safe to take next
7) Permission-aware by design (no “God mode” search)
A Google Workspace context layer has to assume:
- different users see different files
- ACLs change
- auditability matters
Practical implementation notes:
1. Propagate ACLs into the graph
   - For `DriveFile`, store the file’s permissions (users/domains/groups) as an ACL blob.
   - For derived nodes (e.g., `Decision` extracted from a doc), inherit the ACL from the source artifact.
2. Enforce retrieval by the requesting user
   - The retriever should accept `(user_id, query)` and only return artifacts the user can access.
3. Explain “why you’re seeing this”
   - “You can see this because it’s shared with your org” is a trust multiplier.
4. Maintain an audit log
   - Which sources were retrieved
   - Which actions were proposed
   - Which approvals were obtained
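A minimal sketch of the enforcement step, checking a user against a stored ACL blob before an artifact can be returned (the ACL entry shapes are assumptions; real Drive permissions also include groups, which need membership expansion):

```python
# Minimal permission filter: a retrieved artifact is visible only if the
# requesting user matches its ACL (by email, by domain, or "anyone").
def can_access(user_email: str, acl: list[dict]) -> bool:
    domain = user_email.split("@", 1)[1]
    for entry in acl:
        if entry["type"] == "user" and entry.get("emailAddress") == user_email:
            return True
        if entry["type"] == "domain" and entry.get("domain") == domain:
            return True
        if entry["type"] == "anyone":
            return True
    return False  # note: group ACLs would need membership expansion

def filter_results(user_email: str, candidates: list[dict]) -> list[dict]:
    return [c for c in candidates if can_access(user_email, c["acl"])]

docs = [
    {"id": "doc:a", "acl": [{"type": "domain", "domain": "acme.com"}]},
    {"id": "doc:b", "acl": [{"type": "user", "emailAddress": "sam@nnode.ai"}]},
]
print([d["id"] for d in filter_results("ana@acme.com", docs)])  # -> ['doc:a']
```

Run this filter after graph traversal but before anything reaches the LLM, so inaccessible content never enters the context window.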
This is one reason nNode focuses on working inside the tools teams already use: the assistant should honor your existing permissions model, not bypass it.
8) How the graph powers multi-agent execution (not just answers)
The real payoff isn’t “better Q&A.” It’s reliable execution.
In multi-agent systems, a supervisor agent often needs stable state:
- “What customer is this about?”
- “What’s the current canonical doc?”
- “What’s already been done?” (idempotency)
The graph becomes the shared memory:
- The supervisor agent looks up `Project`/`Customer` context.
- Specialist agents do targeted work (research, costing, drafting).
- Outputs are written back as new artifacts linked to the same entities.
That’s the architecture nNode is building toward: an AI operations assistant (“Sam”) that routes work through specialist agents, but stays grounded in a private business context layer.
9) Approval-first execution: where the graph reduces risk
If you let an assistant take actions, the safest default is:
- draft first
- ask for approval
- execute only after explicit confirmation
The graph helps because approvals should be grounded in context:
- who the email is going to
- which customer/project it belongs to
- what sources were used
- what changed vs the last version
Practical approval gates:
- Sending an external email
- Changing Drive sharing permissions
- Updating a “system-of-record” Sheet
- Generating a quote or invoice
Think “preview diff” for operations.
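The “preview diff” pattern can be modeled as a pending-action queue: nothing external runs until an explicit approval flips the action’s status. A sketch under assumed names (`propose`, `approve`, and the `PENDING` store are all illustrative):

```python
import uuid

# Approval-first execution: actions are drafted into a pending queue and
# executed only after an explicit approve() call.
PENDING: dict[str, dict] = {}

def propose(action_type: str, payload: dict) -> str:
    """Draft an action (e.g., an outbound email) without executing it."""
    action_id = str(uuid.uuid4())
    PENDING[action_id] = {"type": action_type, "payload": payload,
                          "status": "pending_approval"}
    return action_id

def approve(action_id: str, approver: str) -> dict:
    action = PENDING[action_id]
    if action["status"] != "pending_approval":
        raise ValueError("already resolved")
    action["status"] = "approved"
    action["approver"] = approver  # audit trail: who unblocked execution
    return action  # executor only runs actions with status == "approved"

aid = propose("send_email", {"to": "ana@acme.com", "draft": "..."})
print(PENDING[aid]["status"])  # -> pending_approval
approve(aid, "sam@nnode.ai")
print(PENDING[aid]["status"])  # -> approved
```

Because each approval records the approver, the queue doubles as the audit log described in the permissions section.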
10) Start small: two starter workflows that justify the graph immediately
You don’t need a platform-wide rollout. Start with two workflows that are painful today.
Workflow A: “Find the latest vendor terms + last email thread”
Trigger: user asks “what are our terms with Vendor Y?”
Graph steps:
1. Resolve `Vendor` by domain/name.
2. Traverse to related `DriveFile` nodes.
3. Pick the candidate “latest” file using `modifiedTime` + `LATEST_VERSION_OF` edges.
4. Traverse to related `EmailThread` nodes; pick the most recent negotiation thread.
Output: a short brief + links, permission-aware.
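The “pick latest” step can combine both signals: follow `LATEST_VERSION_OF` chains to their heads, then break ties among heads by `modifiedTime`. A sketch over in-memory dicts (the data shapes are illustrative):

```python
# Picking the "latest" file: exclude anything superseded by a
# LATEST_VERSION_OF edge, then fall back to modifiedTime among the heads.
def pick_latest(files: dict[str, str], version_edges: list[tuple[str, str]]) -> str:
    """files: fileId -> RFC3339 modifiedTime.
    version_edges: (newer, older) LATEST_VERSION_OF pairs."""
    superseded = {older for _, older in version_edges}
    heads = [fid for fid in files if fid not in superseded]
    # Same-format RFC3339 UTC timestamps sort lexicographically, so max() works.
    return max(heads, key=lambda fid: files[fid])

files = {
    "f1": "2024-01-10T09:00:00Z",  # superseded by f2
    "f2": "2024-03-02T14:30:00Z",
    "f3": "2024-02-20T08:00:00Z",  # unrelated doc
}
print(pick_latest(files, [("f2", "f1")]))  # -> f2
```

Explicit version edges matter because `modifiedTime` alone would happily pick `final_v3.pdf` over a newer file someone renamed.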
Workflow B: “Meeting → decisions → tasks → follow-up email draft (approval required)”
Trigger: meeting ends, transcript arrives.
Graph steps:
1. Extract `Decision` + `ActionItem`.
2. Link to attendees and the project/customer.
3. Draft a follow-up email with bullets: decisions + owners + dates.
4. Present the draft; the user approves; the assistant sends.
This is exactly the kind of “meeting to action” loop where a context layer turns a demo into something durable.
11) Common failure modes (and how to avoid them)
- Overbuilding schema: if it doesn’t power a workflow this month, defer it.
- Ignoring recency: “latest” is a first-class concept; model it explicitly.
- Entity collisions: two “Acme” customers? Use domains, external IDs, and confidence.
- Permission drift: reconcile ACLs regularly; don’t assume static sharing.
- Stale graph: build incremental sync (`modifiedTime`/`historyId`), not batch rebuilds.
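Incremental Drive sync can be as simple as persisting a `modifiedTime` cursor and querying only past it (Gmail’s equivalent is tracking `historyId`). A sketch of the cursor logic, with an assumed in-memory cursor store:

```python
# Incremental sync cursor: remember the max modifiedTime seen and only ask
# for files changed after it on the next run (same q shape as the Drive
# ingest pseudo-code earlier in this post).
CURSOR = {"drive_modified_after": "1970-01-01T00:00:00Z"}

def build_drive_query(cursor: dict) -> str:
    return f"modifiedTime > '{cursor['drive_modified_after']}' and trashed=false"

def advance_cursor(cursor: dict, batch: list[dict]) -> None:
    """Move the cursor forward after a successful batch; never backward."""
    if batch:
        cursor["drive_modified_after"] = max(f["modifiedTime"] for f in batch)

batch = [{"id": "f1", "modifiedTime": "2024-03-02T14:30:00Z"},
         {"id": "f2", "modifiedTime": "2024-03-05T09:00:00Z"}]
advance_cursor(CURSOR, batch)
print(build_drive_query(CURSOR))
# -> modifiedTime > '2024-03-05T09:00:00Z' and trashed=false
```

Persist the cursor only after the batch is fully written to the graph, so a crash mid-run re-fetches rather than silently drops changes.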
12) Practical next steps (this week vs this quarter)
This week (MVP)
- Ingest Drive metadata + ACLs
- Create `DriveFile`, `Person`, `Org` nodes
- Ingest Gmail threads (store thread IDs + participants)
- Add domain-based linking (low confidence is fine)
- Implement permission-aware retrieval filter
This quarter (operator-grade)
- Thread dedupe + “reasoning-ready” email views
- Meeting transcript extraction → `Decision`/`ActionItem`
- `LATEST_VERSION_OF` modeling for common doc types
- Hybrid retrieval: graph-first constraints + vector ranking
- Audit log + approval gates for actions
FAQ (for operators and IT/admin stakeholders)
**Do I need a “real” knowledge graph database?** Not at first. What you need is a stable representation of nodes/edges and permission-aware retrieval. Many teams ship an MVP in Postgres and migrate later if needed.

**Will this force us to reorganize Drive?** No. The point of a Google Workspace context layer is to work with messy reality. You’ll likely improve hygiene over time, but the graph should not depend on a perfect folder structure.

**Is GraphRAG always better than RAG?** No. Plain RAG is simpler and often faster for “find and summarize.” GraphRAG pays off when questions require recency, ownership, and multi-hop relationships.
Where nNode fits
nNode is building an AI operations assistant (“Sam”) that works where your business already lives—especially Google Drive and Gmail—so you can get value without migrating into a new system.
The core thesis is exactly what this post described:
- a lightweight private context layer (the “Karpathy graph”)
- permission-aware retrieval
- multi-agent orchestration behind a simple chat interface
- approval-first execution
If you’re trying to make an assistant reliably answer operator questions like “what’s the latest, who owns it, what’s connected, what do we do next?”—take a look at what we’re building at https://nnode.ai.