A Visual Guide

Understanding Atomic Agents

AI agents that live as plain markdown files in a folder you own — not in someone else's database. This guide explains what that means, how the framework works, and why the shape matters from one agent on your laptop to a fleet behind a load balancer.

8 / 12: Backend protocols shipped

28 + 3: Spec docs + in-flight (the product)

2,383: Tests gating the contract

Each section opens with the load-bearing idea. Click any Deep dive for the operator-level detail. Skim the bold sentences for a 5-minute read; expand everything for the full ~25-minute picture.

Section 1

The agent IS a folder

Every atomic agent is a directory of plain markdown files. Persona, tools, memory, journal, and audit log all live as text you can read in any editor and version-control with git. There is no hidden database. The folder is the agent.

Inside an agent folder
~/agents/scout/
├── persona/
│   ├── IDENTITY.md      who I am, my mission, my scope
│   ├── SOUL.md          personality, voice, how I evolve
│   └── USER.md          about the operator, what they care about
├── tools.md             what I can read, write, and call
├── model.md             which LLM, token budgets, cost guardrails
├── judges.md            (optional) safety/approval policy
├── mandates.md          (optional) durable cumulative caps
├── memory/              typed atomic notes
│   ├── INDEX.md         always-loaded routing layer
│   └── *.md             one file per note
├── wiki/                distilled corpus (optional)
├── journal/             narrative episodic log
│   └── YYYY-MM/YYYY-MM-DD.md
└── log/                 audit trail (JSONL, one line per run)
    └── YYYY-MM/YYYY-MM-DD.jsonl

An agent's whole shape is on disk. Open persona/IDENTITY.md in any text editor and you're looking at the agent's mission statement. Open today's journal/YYYY-MM/YYYY-MM-DD.md and you're reading what the agent did today.

Most AI agents store their state somewhere you don't own — an app database, a vector store, a hosted trace system. If the company goes away, your agent goes away. If you switch laptops, you migrate through their interface. If you want a feature they don't offer, you wait.

Atomic Agents takes the opposite shape. Persona in markdown files. Memory as notes you can read with a text editor. Tools defined in a markdown table. Run logs as plain text. Move computers? Copy the folder. Want to see what the agent remembered yesterday? Open the file.

The throughline

A home user with one agent and a company with a fleet of fifty experience the same framework — graceful and self-explanatory at every scale. The agent's brain stays portable. The storage underneath can scale (more on that in Section 5).

Deep dive: the persona layer

The persona splits across three files because they answer different questions and evolve at different cadences:

IDENTITY.md
Stable. The mission, the scope, the non-goals. What this agent does and refuses to do. Rarely edited.
SOUL.md
Slow-moving. Personality, voice, ethical posture, how the agent learns from its journal. Edited when the agent's character itself needs to shift.
USER.md
Living. Who the operator is, what they care about right now, what context the agent should always carry. Edited when life changes.

The three together become the system prompt's stable preamble. Every agent call assembles them in canonical order.
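A minimal sketch of that assembly step follows. It assumes the canonical order is IDENTITY, SOUL, USER (the order the files are listed in here; the guide does not spell the order out), and `assemble_preamble` is a hypothetical name rather than the framework's API.

```python
from pathlib import Path

# Assumed order: the guide lists the files this way but does not state
# the canonical order explicitly, so treat this tuple as illustrative.
PERSONA_ORDER = ("IDENTITY.md", "SOUL.md", "USER.md")


def assemble_preamble(agent_root: Path) -> str:
    """Concatenate the persona files into one stable preamble string."""
    parts = []
    for name in PERSONA_ORDER:
        path = agent_root / "persona" / name
        if path.exists():  # skipping missing files is an assumption of this sketch
            parts.append(path.read_text(encoding="utf-8").strip())
    return "\n\n".join(parts)
```

Because the input is just three text files read in a fixed order, the same folder always yields the same preamble.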

Deep dive: the memory layer

Memory is structured into atomic notes (typed, one file each — feedback / decision / project / reference / user) and an INDEX.md that routes the agent to relevant notes without loading everything.

If the agent has 200 notes, loading them all into the prompt would be expensive and noisy. Instead, memory/INDEX.md contains a compressed map: short titles, types, and one-line summaries of every note. The model reads the INDEX (cheap), decides which 1–3 notes it actually needs, then loads those specific files. This is INDEX-driven recall — the pattern that lets memory scale to thousands of entries without scaling prompt cost.

An optional wiki/ directory holds the distilled corpus — articles, references, long-form material — that the agent can search but doesn't auto-load.

Deep dive: the journal and the log

Two parallel records, different audiences:

  • Journal — narrative, written by the agent in its own voice. "Today I helped Dan triage three issues; the second one surfaced a deeper pattern about…" Daily entries in markdown, organized by month. You read these like a diary.
  • Log — structured, machine-readable. One JSONL line per run with run_id, model called, tokens used, cost, tools invoked, captures written. You query these for dashboards, cost analysis, and debugging.

Same events, two formats. Humans get the prose; machines get the structured trail.
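As an illustration of querying the structured trail, here is a small sketch that totals cost per model for one month of logs. The directory layout (log/YYYY-MM/YYYY-MM-DD.jsonl) comes from the folder tree above; the JSON key names `model` and `cost` are assumptions, since the guide names the fields only informally.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict


def cost_by_model(log_root: Path, month: str) -> Dict[str, float]:
    """Sum the cost of every run logged in one month, grouped by model.

    `month` is "YYYY-MM", matching the log/YYYY-MM/ directory layout.
    The "model" and "cost" keys are assumed field names.
    """
    month_dir = log_root / month
    if not month_dir.is_dir():
        return {}
    totals: Dict[str, float] = defaultdict(float)
    for day_file in sorted(month_dir.glob("*.jsonl")):
        with day_file.open(encoding="utf-8") as fh:
            for line in fh:
                if not line.strip():
                    continue  # tolerate blank lines
                event = json.loads(line)
                totals[event.get("model", "unknown")] += float(event.get("cost", 0.0))
    return dict(totals)
```

A call such as `cost_by_model(Path("~/agents/scout/log").expanduser(), "2026-05")` would then return a plain dictionary you can print or chart.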

Section 2

A day in the life

An agent run is one stateless invocation that reads from the folder, calls a language model, executes any tools, and writes back what it learned. The runtime holds no state between runs. The folder is the agent's only memory.

The shape of agent.call()
flowchart TD
    Trigger["Operator triggers a run<br/>(cron, CLI, embedded Python, Claude Code skill)"]
    Load["Load folder<br/>persona + tools + model + memory INDEX + recent journal"]
    Assemble["Assemble system prompt"]
    Cost1{"Cost guardrail check<br/>(daily/monthly cap)"}
    LLM["Call the language model"]
    Tools["Execute any tool calls<br/>(read files, search memory, call MCP)"]
    Judge{"Judge layer<br/>(if judges.md is present)"}
    Capture["Extract capture markers<br/>(new memories, journal additions)"]
    Write["Write to vault<br/>atomic notes + journal entry + JSONL log line"]
    Done["Return Response"]
    Trigger --> Load
    Load --> Assemble
    Assemble --> Cost1
    Cost1 -->|under cap| LLM
    Cost1 -->|over cap| Done
    LLM --> Tools
    Tools --> Judge
    Judge -->|allow| Capture
    Judge -->|block| Done
    Judge -->|escalate| Done
    Capture --> Write
    Write --> Done

The runtime is stateless. Every box that touches the agent's persistent state reads from or writes to the vault. The judge layer is opt-in — without a judges.md, the diamond is skipped.

Concretely: the operator triggers a run (could be cron at 7am, could be a uv run atomic-agents run scout from the terminal, could be embedded Python in their own app). The runtime reads the agent folder, assembles the system prompt in a canonical order, checks cost guardrails, calls the model, runs whatever tools the model invoked, lets the judge layer gate any risky actions, extracts capture markers from the model's response, writes new memory entries and a journal addition, appends one JSONL line to the audit log, and returns.

If the runtime crashes mid-run, the next invocation picks up where the folder left off — there's no in-memory state to recover. Every write goes through temp file + fsync + rename + parent-dir fsync, so a power loss never leaves a half-written note. You get either the old version or the new version. Never both, never neither.
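The write sequence named above (temp file, fsync, rename, parent-directory fsync) can be sketched with the Python standard library. This is a generic POSIX illustration of the technique, not the framework's own code; `atomic_write_text` is a hypothetical name.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> None:
    """Replace `path` so readers see the old file or the new one, never a partial one."""
    parent = path.parent
    # 1. Write to a temp file in the same directory, so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(parent), prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())  # 2. force the file contents to disk
        os.replace(tmp_name, path)  # 3. atomic rename over the target
    except BaseException:
        # Remove the temp file if anything failed before the rename completed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    # 4. fsync the parent directory so the rename itself survives a power loss (POSIX only).
    dir_fd = os.open(str(parent), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The rename is the commit point: before it the target still holds the old content, after it the target holds the new content, and the temp file is the only thing a crash can leave behind.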

Why stateless matters

Stateless means the same agent runs unchanged from cron, launchd, a Claude Code skill, or embedded Python. The runtime just points at the folder. Switch hosts? Same agent. Switch from a CLI invocation to a scheduled job? Same agent. The folder is the agent.

Deep dive: how memory is recalled

If Scout has 200 atomic notes, loading them all into the prompt would be expensive and noisy. Instead, memory/INDEX.md contains a compressed map: short titles, types, and one-line summaries of every note. The language model reads the INDEX (cheap), decides which 1–3 notes it actually needs for the current task, then asks the runtime to load those specific files (also cheap).

This is INDEX-driven recall. The model pays context tokens for capability awareness, not capability content. The pattern scales to thousands of notes without scaling prompt cost.
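A rough sketch of the two-step recall follows. The INDEX.md line shape used here (`- slug | type | summary`) is invented for illustration, and the `choose` callable stands in for the language model's decision about which notes it needs; neither is taken from the spec.

```python
import re
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical INDEX.md line shape: "- slug | type | one-line summary"
INDEX_LINE = re.compile(r"^- (?P<slug>[\w-]+) \| (?P<type>\w+) \| (?P<summary>.+)$")


def read_index(memory_dir: Path) -> List[Dict[str, str]]:
    """Step 1 (cheap): parse the compressed map, ignoring non-entry lines."""
    text = (memory_dir / "INDEX.md").read_text(encoding="utf-8")
    return [m.groupdict() for m in map(INDEX_LINE.match, text.splitlines()) if m]


def recall(
    memory_dir: Path,
    choose: Callable[[List[Dict[str, str]]], List[str]],
    limit: int = 3,
) -> Dict[str, str]:
    """Step 2: load only the notes the chooser asked for, capped at `limit`."""
    entries = read_index(memory_dir)
    known = {e["slug"] for e in entries}
    wanted = [slug for slug in choose(entries) if slug in known][:limit]
    return {slug: (memory_dir / f"{slug}.md").read_text(encoding="utf-8") for slug in wanted}
```

Prompt cost here grows with the size of the INDEX plus at most `limit` note bodies, not with the total size of the memory folder.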

Deep dive: capture markers

When the model writes its response, it can include structured capture markers — fenced JSON blocks or tool calls — that the runtime extracts and turns into new atomic notes, journal additions, or memory updates. The model never directly writes to the file system; the runtime mediates every write, validates the shape, and applies the project's WritePolicy rules.

Two paths: Path 1 uses native tool calls (the model explicitly invokes a capture tool). Path 2 uses fenced-JSON markers in the response prose. Both produce identical artifacts on disk.
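To make Path 2 concrete, here is a sketch of pulling fenced-JSON markers out of a response. The marker shape (a JSON object with a `capture` key) is an assumption made for this example; the real marker schema is defined by the spec, not by this snippet.

```python
import json
import re
from typing import List

TICKS = "`" * 3  # a literal code-fence delimiter, built at runtime
FENCE = re.compile(TICKS + r"json\s*\n(.*?)\n" + TICKS, re.DOTALL)


def extract_captures(response_text: str) -> List[dict]:
    """Return every well-formed capture marker found in the model's prose."""
    captures = []
    for match in FENCE.finditer(response_text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # malformed marker: skip it rather than write a bad note
        # Hypothetical shape check: a marker is an object with a "capture" key.
        if isinstance(payload, dict) and "capture" in payload:
            captures.append(payload)
    return captures
```

The function only extracts and validates; turning a marker into a file on disk stays with the runtime, which is the point of mediating every write.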

Deep dive: cost guardrails

Every agent.call() path runs _check_cost_guardrails before the first LLM call and re-checks on each iteration of the multi-turn loop. Daily and monthly caps are set in model.md. When spend reaches 50% of the daily cap, a warning lands in the log. At 80%, a stronger warning. At 100%, the next call is refused unless it is explicitly marked critical=True.
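The threshold ladder can be sketched as a pure function. The 50% / 80% / 100% thresholds and the critical override come from the text above; the function name and the returned labels are hypothetical and do not describe what _check_cost_guardrails actually returns.

```python
def guardrail_decision(spent: float, daily_cap: float, critical: bool = False) -> str:
    """Classify the next call against the daily cap (labels are illustrative)."""
    if daily_cap <= 0:
        raise ValueError("daily_cap must be positive")
    ratio = spent / daily_cap
    if ratio >= 1.0:
        # At or over the cap: refuse unless the call is explicitly critical.
        return "allow_critical" if critical else "refuse"
    if ratio >= 0.8:
        return "strong_warn"
    if ratio >= 0.5:
        return "warn"
    return "ok"
```

For example, with a $10 daily cap, $5 spent yields `"warn"`, $8 yields `"strong_warn"`, and $10 yields `"refuse"` for an ordinary call.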

Helper batches (cheap sub-LLM calls used for transformations) reserve their worst-case cost before dispatch. Delegated calls (one agent calling a specialist) clamp the child's budget to the minimum of (child's own cap) and (parent's remaining budget for this run). This tree-cap is what keeps "run a fleet of agents" from turning into bankruptcy.
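Both rules fit in a few lines of bookkeeping. The sketch below is illustrative only: `RunBudget` and its method names are hypothetical, and the real runtime's accounting may differ.

```python
from dataclasses import dataclass


@dataclass
class RunBudget:
    cap: float
    spent: float = 0.0
    reserved: float = 0.0

    @property
    def remaining(self) -> float:
        return max(0.0, self.cap - self.spent - self.reserved)

    def reserve(self, worst_case: float) -> bool:
        """Hold a helper batch's worst-case cost before dispatch."""
        if worst_case > self.remaining:
            return False  # the batch cannot be dispatched within budget
        self.reserved += worst_case
        return True

    def settle(self, worst_case: float, actual: float) -> None:
        """Release the reservation and record what was actually spent."""
        self.reserved -= worst_case
        self.spent += actual

    def child(self, child_cap: float) -> "RunBudget":
        """Tree-cap: a delegate gets min(its own cap, the parent's remaining budget)."""
        return RunBudget(cap=min(child_cap, self.remaining))
```

Because `child()` takes a minimum, a delegate's ceiling can match the parent's remaining budget but never exceed it.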

Long-running pipelines outside call() — dream, eval, tuning — have their own cost gates. The discipline: every code path that calls an LLM has a cost gate, even when it's not the same gate.

Section 3

Safety + judgment

Before any side-effectful action executes, a separate judge evaluates it. The judge can approve, block, request a revision, or escalate to a human. Cumulative-cap mandates defend against the slow-burn failure mode no per-call check can catch.

How the judge layer gates actions
flowchart TD
    A["Tool call from actor agent"]
    B{"What class of action?"}
    R["read-only (load a note)"]
    W["reversible write (save a note)"]
    E["external side-effect (send email, post to API)"]
    H["high-risk (delete files, spend money)"]
    OK["Allow — runs immediately"]
    OK2["Allow with audit trail"]
    J{"Judge evaluates"}
    ALLOW["Allow → execute"]
    BLOCK["Block — execution refused"]
    REVISE["Revise — amend the action, re-judge once"]
    ESC["Escalate — write PENDING file"]
    OP["You review in your text editor:<br/>Approve / Deny / Redact / Revise"]
    EXEC["Execute approved action"]
    A --> B
    B --> R
    B --> W
    B --> E
    B --> H
    R --> OK
    W --> OK2
    E --> J
    H --> ESC
    J -->|ALLOW| ALLOW
    J -->|BLOCK| BLOCK
    J -->|REVISE| REVISE
    J -->|ESCALATE| ESC
    REVISE --> EXEC
    ESC --> OP
    OP --> EXEC

Every side-effectful action is classified, then routed through the appropriate gate before execution. The judge layer is opt-in — without a judges.md in the agent folder, none of this fires and the framework behaves as if the judge layer didn't exist.

Most agent frameworks let the model call tools directly. Atomic Agents inserts a checkpoint: before any side-effectful tool runs, a separate JudgeBackend inspects a structured proposal of what's about to happen and returns one of ALLOW, BLOCK, REVISE, or ESCALATE. Every judgment writes a JSONL audit event carrying the proposal's hashes, the outcome, the policy version, and the judge's reason.

The layer is fully opt-in. Existing deployments see no judge invocation until they drop a judges.md file into the agent folder (or set an environment variable for the hardcoded defaults). The minimum viable config is a handful of YAML lines specifying which action classes to gate.

Mandates: durable cumulative authorization

The judge layer handles single-action approval. Mandates handle the longer arc: an operator authoring mandates.md grants the agent durable scoped authority — for instance, "spend up to $6,000 cumulatively on the Q2 ad campaign, only through these tools, only on these targets, only this month." The framework defends that cap against concurrent runs and crash-restart, and surfaces audit signals when an action's executed target diverged from what was authorized at proposal time.

Why this matters for real money

A per-call cost guardrail protects against one expensive turn. A mandate protects against the slow-burn failure mode where 47 individually-cheap turns add up to a procurement disaster. Mandates are how the framework handles actors that touch real money or real external side effects without requiring per-turn re-authorization.

Deep dive: the four class policies

Every tool call is classified into one of four action classes in tools.md. The class_policy block in judges.md says what the framework does with each class:

bypass
Skip the judge entirely. No proposal, no judgment event, no audit line.
allow_with_audit
Run the judge; always allow; write the full judgment event to the audit trail. Use when you want to see the judge's opinion without it gating actions yet.
judge_required
Run the judge; its outcome is enforced. BLOCK refuses; ALLOW executes; REVISE/ESCALATE follow the state-machine semantics.
escalate
Pause for operator approval. Framework writes a PENDING file; agent.call() returns deferred; operator resolves by editing the file in any text editor.

Strictness ordering: bypass < allow_with_audit < judge_required < escalate. Unspecified classes default-fill to safe values.
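The ordering maps naturally onto an integer enum. In this sketch the enum and function names are hypothetical; the second function illustrates the non-relaxable project floor described in Section 4, where a delegate may strengthen a policy but not weaken it.

```python
from enum import IntEnum


class ClassPolicy(IntEnum):
    # Values encode strictness: bypass < allow_with_audit < judge_required < escalate
    BYPASS = 0
    ALLOW_WITH_AUDIT = 1
    JUDGE_REQUIRED = 2
    ESCALATE = 3


def effective_policy(project_floor: ClassPolicy, agent_policy: ClassPolicy) -> ClassPolicy:
    """The stricter of the two wins, so an agent can never drop below the floor."""
    return max(project_floor, agent_policy)
```

With a project floor of `JUDGE_REQUIRED`, an agent that asks for `BYPASS` still gets `JUDGE_REQUIRED`, while one that asks for `ESCALATE` keeps `ESCALATE`.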

Deep dive: the escalation queue

When the judge ensemble returns ESCALATE, the action is paused for operator review. The framework writes a PENDING file to <agent_root>/vault/escalations/<action_class>/<proposal_id>.md (atomic, with full proposal serialized as YAML and the judge's reason inline), and agent.call() returns with deferred=True.

The operator resolves a PENDING by editing the file in any text editor (Obsidian, vim, VS Code) and writing one resolution block at the bottom. Header grammar is strict — h3 + exact-case verb + literal word "by" + operator name:

### Approved by alice
resolved_at: 2026-05-13T09:14:22Z
note: Reviewed — sender list is correct, attachment is the public report.

The five resolution verbs are Approved, Denied, Redacted, Revised, and Auto-decided (the framework writes this itself when a timeout elapses). On the next agent.call(), the framework reads the decision, re-verifies the proposal's body hash hasn't been tampered with, and executes (or doesn't) accordingly.
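The strict header grammar is easy to express as a regular expression. The sketch below is a hypothetical parser, not the framework's: it assumes the operator name is a single whitespace-free token and that the lines after the header are simple `key: value` pairs, as in the example above.

```python
import re
from typing import Dict, Optional, Tuple

VERBS = ("Approved", "Denied", "Redacted", "Revised", "Auto-decided")
# h3 + exact-case verb + literal "by" + operator name (single token: an assumption)
HEADER = re.compile(
    r"^### (?P<verb>" + "|".join(re.escape(v) for v in VERBS) + r") by (?P<operator>\S+)$"
)


def parse_resolution(pending_text: str) -> Optional[Tuple[str, str, Dict[str, str]]]:
    """Return (verb, operator, fields) for the last resolution block, or None."""
    lines = pending_text.splitlines()
    for i in range(len(lines) - 1, -1, -1):
        match = HEADER.match(lines[i])
        if match:
            fields: Dict[str, str] = {}
            for line in lines[i + 1:]:
                key, sep, value = line.partition(":")  # split on the first colon only
                if sep and key.strip():
                    fields[key.strip()] = value.strip()
            return match["verb"], match["operator"], fields
    return None
```

A header such as `### approved by alice` (wrong case) or `## Approved by alice` (wrong heading level) does not match, so the file stays unresolved rather than being misread.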

Deep dive: the Revise state machine

REVISE is the most interesting outcome. The judge might say: "Send this email, but strip the attachment" or "Open the PR as draft, not for merge." The framework merges the amendment into the original proposal (recomputing the classification from the possibly-new tool name), re-runs the judge once against the amended proposal, and if the second judgment ALLOWs, executes the amended version.

Bounded at one revise iteration per spec/28 — the second judgment must return ALLOW to execute; REVISE again produces revise_loop_exhausted_blocked. Operators see this path as a sequence of audit events. No PENDING file is written for judge-driven revises.
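The one-iteration bound can be sketched as follows. Here `judge` is a stand-in callable returning an outcome and an optional amendment; the merge is a plain dictionary update, and the sketch omits the reclassification step and the audit events. How a second-pass BLOCK or ESCALATE is handled is an assumption (it is passed through unchanged).

```python
from typing import Callable, Optional, Tuple

JudgeFn = Callable[[dict], Tuple[str, Optional[dict]]]  # -> (outcome, amendment)


def judge_with_revise(proposal: dict, judge: JudgeFn) -> Tuple[str, Optional[dict]]:
    """Return (status, proposal_to_execute); the proposal is None unless allowed."""
    outcome, amendment = judge(proposal)
    if outcome == "ALLOW":
        return "ALLOW", proposal
    if outcome != "REVISE":
        return outcome, None  # BLOCK / ESCALATE follow their own paths
    # Merge the amendment into the original proposal, then re-judge exactly once.
    amended = {**proposal, **(amendment or {})}
    second, _ = judge(amended)
    if second == "ALLOW":
        return "ALLOW", amended
    if second == "REVISE":
        return "revise_loop_exhausted_blocked", None
    return second, None  # assumption: other second-pass outcomes pass through
```

The judge is called at most twice per proposal, which is what keeps a disagreeable judge from looping forever.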

Operator-driven revises (via the ### Revised by <op> resolution block with an embedded amendment YAML) follow the same primitives but with one extra gate: if the amendment upgrades the action class (e.g., reversible_write → high_risk), the framework re-runs a fresh ensemble judgment on the amended proposal. An operator can't slip a riskier action past the judge by amending a proposal that was reviewed under a lower class.

Deep dive: failure_policy — fail-closed by default

When a judge errors (timeout, budget exhausted, malformed proposal), the framework consults failure_policy to decide the enforcement outcome. Default is block for every exception type — operators must explicitly opt into looser behavior.

Two shapes are accepted. Flat (one fallback for every class):

failure_policy:
  JudgeUnavailable: block
  JudgeBudgetExhausted: block

Nested per-class (different fallback per (action_class, exception) pair):

failure_policy:
  read_only:
    JudgeUnavailable: allow      # read_only actions tolerate judge outages
  high_risk:
    JudgeUnavailable: escalate   # high_risk actions escalate to operator

The parser auto-detects which shape you used.
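One plausible way to auto-detect the shape is to look at the value types after the YAML has been parsed into a dictionary. This is a sketch under that assumption, with a hypothetical function name; it falls back to block whenever nothing more specific is configured, matching the fail-closed default.

```python
def resolve_failure(policy: dict, action_class: str, exception_name: str) -> str:
    """Pick the enforcement outcome for a judge error, defaulting to "block"."""
    values = list(policy.values())
    if values and all(isinstance(v, dict) for v in values):
        # Nested shape: {action_class: {exception_name: outcome}}
        return policy.get(action_class, {}).get(exception_name, "block")
    if all(isinstance(v, str) for v in values):
        # Flat shape (or empty config): {exception_name: outcome}
        return policy.get(exception_name, "block")
    raise ValueError("failure_policy mixes flat and nested entries")
```

Using the nested example above, a `JudgeUnavailable` error resolves to `allow` for `read_only`, to `escalate` for `high_risk`, and to `block` for any class the config does not mention.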

Section 4

Growing + collaborating

One agent becomes a team through three primitives: delegation (one agent calls a specialist), dreaming (the agent consolidates its own memory overnight), and cascade (a project layer of shared persona that multiple agents inherit). Same folder shape, same runtime, same audit trail — just composed.

Role × project cascade
flowchart TB
    P["Project folder<br/>(shared persona, tools, judges, mandates)"]
    R1["Writer agent<br/>(role-specific persona)"]
    R2["Editor agent<br/>(role-specific persona)"]
    R3["Researcher agent<br/>(role-specific persona)"]
    D["Director agent<br/>(orchestrates the team)"]
    P -.inherited by.-> R1
    P -.inherited by.-> R2
    P -.inherited by.-> R3
    P -.inherited by.-> D
    D --> R1
    D --> R2
    D --> R3

A project is a folder that holds shared configuration (persona, tools, judges, mandates) that multiple agent subfolders inherit. The Director coordinates; the Writer / Editor / Researcher each have role-specific persona on top of the shared project layer.

Delegation is the simplest move. A coordinator agent has a specialist it can call: director.delegate("writer", task="Draft section 3"). The runtime constructs the specialist agent fresh, runs it as a sub-agent, and returns the result. The specialist's run writes its own JSONL log line with parent_run_id linking back to the coordinator's run, so the audit trail rolls up naturally.

Delegation is bounded at one level — a coordinator delegates to specialists; specialists don't delegate further. This is a deliberate guardrail. Two-level delegation looks like flexibility; in practice it's how systems become unauditable.

Dreaming is the consolidation pass. The agent runs against its own journal and memory during a quiet window (overnight, between work sessions) and asks: what patterns have I noticed? what notes should be merged? what's now stale? The dream produces structured proposals that go through the same capture-and-write pipeline as any other run — including the judge layer if one is configured. Memory gets sharper over time without the operator needing to curate manually.

Cascade is the multi-agent shape. A project folder holds shared configuration (persona, tools, judges, mandates) that multiple agent subfolders inherit. The Writer, Editor, and Researcher each have role-specific persona on top of the project's shared layer. The judge layer's project floor is non-relaxable — a delegate's own judges.md may strengthen per-class policy but cannot weaken it below the floor.

Tree-cap: the cost discipline that makes fleets safe

When the coordinator delegates to a specialist, the specialist's budget is capped to the minimum of (the specialist's own daily cap) and (the coordinator's remaining budget for this run). The tree-cap is what keeps "running an army of agents" from turning into bankruptcy: every delegation tightens the cost ceiling and never loosens it.

Deep dive: dream pipelines

Dreams run as a separate pipeline (atomic_agents.dream) with their own state machine: scan (collect journal entries and memory candidates from the relevant time window), propose (LLM-driven analysis producing structured suggestions), review (optional judge pass), commit (write the approved changes to memory and journal).

Dream proposals are not auto-applied — they flow through the same capture pipeline as any other run. If the agent has a judge configured, the judge sees the dream's proposed changes the same way it sees any other side-effectful action. An operator can review and approve at the same surface they review any other agent decision.

Dreams have their own cost gate (_check_cap) — they're typically the most expensive single pass an agent runs, because consolidation requires the agent to consider its entire recent history.

Deep dive: helper provenance preservation

Helpers are cheap sub-LLM calls used for transformations: "summarize this", "extract entities from that", "classify this into one of three buckets." They're typically Haiku-class (fast, cheap) and run in parallel batches. When a helper produces an output that informs the main agent's reasoning, the runtime preserves the helper provenance — which helper produced which output, with which prompt, at which cost — and rolls it up into the parent run's audit record.

This matters because debugging an agent's reasoning sometimes requires tracing back through what its helpers told it. The provenance trail lets you reconstruct any agent decision down to the helper-prompt level.

Deep dive: eval and tuning

Eval (atomic_agents.eval) runs the agent against curated test cases organized by category: happy path, edge cases, adversarial inputs, decline cases (things the agent should refuse). Each category produces pass/fail metrics and qualitative judgments from an LLM-as-judge that scores along configurable rubrics.

Tuning (atomic_agents.tuning) takes the eval results and proposes persona refinements — "the SOUL.md draft is producing too-formal output on edge cases; consider this adjustment." The proposals are not auto-applied. The operator reviews them as standard markdown diffs.

Both run as long-lived pipelines with their own cost gates, separate from the per-call cost guardrails on agent.call().

Section 5

Scaling: the protocol-pattern story

The same agent definition runs unchanged when the operator swaps the storage underneath. A home user with one agent runs filesystem-everything. An org with a fleet swaps in SQLite for logs, Redis for locks, Postgres for fleet config — same agent, different substrate. The protocol seams are what make this real.

Protocols — the moat
flowchart TB
    Agent["Your agent<br/>(unchanged at every scale)"]
    P1["MemoryBackend"]
    P2["LLMBackend"]
    P3["JudgeBackend"]
    P4["LockBackend"]
    P5["LogBackend"]
    P6["AgentProfileBackend"]
    P7["ToolRegistryBackend"]
    P8["MandateBackend"]
    Disk["Filesystem (default)"]
    SQLite["SQLite (indexed query)"]
    Redis["Redis (distributed locks)"]
    PG["Postgres (planned, tracked at #258)"]
    Agent --> P1
    Agent --> P2
    Agent --> P3
    Agent --> P4
    Agent --> P5
    Agent --> P6
    Agent --> P7
    Agent --> P8
    P1 --> Disk
    P5 --> Disk
    P5 --> SQLite
    P4 --> Disk
    P4 --> Redis
    P6 --> Disk
    P6 --> SQLite
    P7 --> Disk
    P7 --> SQLite
    P8 --> Disk
    P1 -.planned.-> PG
    P5 -.planned.-> PG
    P6 -.planned.-> PG
    P7 -.planned.-> PG
    P8 -.planned.-> PG

The agent calls protocols. Each protocol can be backed by any storage that implements the contract. Swap underneath, not on top.

Eight shipped protocols, four remaining for v1.0

The framework defines a Protocol (a precisely-specified interface) for each primitive that touches storage. Each protocol has a filesystem reference impl that ships from day one, capability advertisement so the framework knows what each backend can and can't do, and a parametrized conformance test suite that every implementation must pass.

Today's shipped backends and what they unlock:

MemoryBackend — atomic notes, INDEX, wiki, capture
Default: filesystem. Future: Postgres + pgvector for semantic search at scale.
LLMBackend — the language-model call itself
Three reference impls register at framework import: Anthropic, OpenAI (direct), Moonshot. Third-party Gemini / Bedrock / Vertex / vLLM-local backends can register without forking core.
JudgeBackend — the safety layer from Section 3
PolicyJudge (rule engine, microseconds, always-on) + LLMJudgeBackend (slower, smarter, lazy-skipped if no key resolves).
LockBackend — prevents two copies of the same agent from clobbering each other
Default: POSIX flock (single-host). Alternative: Redis advisory locks (multi-host).
LogBackend — where the audit trail goes
Default: JSONL files. Alternative: SQLite with indexed queries (so dashboards answer "show me all high-cost runs this week" in milliseconds instead of scanning thousands of JSONL lines).
AgentProfileBackend — the bootstrap (persona, tools, model, judges, MCP, goal)
Default: walks the folder. Alternative: SQLite-backed, required for SaaS deployments where the persona is editable through a web UI.
ToolRegistryBackend — the catalog of installable tools
Default: walks <agent>/tools/. Alternative: SQLite catalog with cross-tenant isolation, used for plugin-marketplace shapes.
MandateBackend — durable revocable authority for actions that touch real money
Default: filesystem. Operators author mandates.md; the backend enforces cumulative caps + crash recovery + post-action verification.

Four remaining for v1.0: PolicyBackend (org-level policy that supersedes per-agent caps and allowlists — in progress, multi-PR arc), PersonaBackend (UI-editable identity/soul/user), CorpusBackend (GB-scale wiki + semantic search), MCPServerRegistryBackend (catalog for MCP servers). v1.0 closes when those four ship + their conformance suites pin the contract.

Home shape vs company shape

Here's the same agent — same folder, same persona, same tools, same judge config — running in two very different deployments. Notice what changes and what doesn't.

🏠 Home shape

  • Folder on your laptop
  • Filesystem MemoryBackend
  • Filesystem LogBackend (JSONL)
  • POSIX-lock LockBackend
  • Filesystem ProfileBackend
  • Filesystem ToolRegistry
  • Filesystem MandateBackend (off by default — opt-in via mandates.md)
  • Anthropic LLMBackend
  • Rule-engine JudgeBackend
  • One agent, one folder, zero servers

🏢 Company shape

  • Same folder layout — possibly on shared storage
  • Filesystem MemoryBackend (same)
  • SQLite LogBackend (indexed dashboards)
  • Redis LockBackend (multi-host coordination)
  • SQLite ProfileBackend (UI-editable)
  • SQLite ToolRegistry (audit + install approval)
  • Filesystem MandateBackend (same)
  • Anthropic LLMBackend (same)
  • LLM JudgeBackend (smarter review)
  • Same agent — different substrate underneath

The home user opens the folder in Obsidian and reads the journal. The company runs the same agent across multiple Cloud Run instances behind a load balancer, with logs flowing to a SQLite-backed dashboard that answers "show me all escalations from this team this quarter" in milliseconds. The agent's persona doesn't know which world it's in. The agent's tools don't know. The agent's memory doesn't know. The only difference is what's registered as the backend for each protocol.

Two shipped protocols (MemoryBackend and MandateBackend) are filesystem-only in v1, which is why both appear as "(same)" in the company-shape list above. The Protocol seams are in place from day one so future adapters (vector-database memory, SaaS / mobile / Slack-bot mandates) register via register_memory_backend() / register_mandate_backend() without forking core. The four remaining v1.0 protocols on the roadmap (Persona, Corpus, Policy, MCPServerRegistry) close the rest of the substrate-swap surface.

Postgres adapters — the next substrate

SQLite covers single-host org-shape deployments. Postgres is the next step up — multi-host shared-substrate org deployments where SQLite's single-writer model isn't enough. The umbrella tracking this work is issue #258 with sub-tasks for LogBackend / ProfileBackend / ToolRegistry / Memory+pgvector / Mandate / Lock Postgres adapters in priority order. The Implementer-contract sections in spec/20, spec/22, spec/24, spec/25, and spec/29 already document the 8-MUST shape every adapter has to satisfy.

Deep dive: how the swap actually happens

Concretely, switching from filesystem-everything to SQLite-for-logs takes one environment variable; the example here also sets the optional URL to choose where the database lives:

export ATOMIC_AGENTS_LOG_BACKEND=sqlite
export ATOMIC_AGENTS_LOG_BACKEND_URL=sqlite:///shared/logs.db
uv run atomic-agents run scout --work-item "Daily morning brief"

That's it. The agent's folder doesn't change. The agent's persona doesn't change. The runtime sees the env var, dispatches log writes to SQLite, and everything else continues as before. Tomorrow you can flip it back. Next month you can flip Profile to SQLite too. The home user has none of these env vars set; they get the filesystem defaults and never need to think about it.

SQLite-shape backends need just the _BACKEND=sqlite flip — the DB path is derived from a sensible default. Redis-shape backends need both _BACKEND=redis AND _BACKEND_URL=redis://... because there's no derivable default for a Redis instance address.
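The rule above can be sketched as a small resolver. Only ATOMIC_AGENTS_LOG_BACKEND and ATOMIC_AGENTS_LOG_BACKEND_URL appear in this guide; the generalized variable pattern for other protocols (for example ATOMIC_AGENTS_LOCK_BACKEND) and the function itself are assumptions made for illustration.

```python
from typing import Mapping, Optional, Tuple

# Backend kinds with no derivable default address (per the rule above).
NEEDS_URL = {"redis"}


def resolve_backend(env: Mapping[str, str], protocol: str) -> Tuple[str, Optional[str]]:
    """Return (kind, url); url=None means "let the backend derive its default"."""
    base = f"ATOMIC_AGENTS_{protocol.upper()}_BACKEND"  # assumed naming pattern
    kind = env.get(base, "filesystem")  # nothing set: filesystem default
    url = env.get(base + "_URL")
    if kind in NEEDS_URL and not url:
        raise ValueError(f"{base}_URL is required when {base}={kind}")
    return kind, url
```

Calling `resolve_backend(os.environ, "log")` on a machine with no variables set returns `("filesystem", None)`, which is the home-user path.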

Deep dive: PolicyBackend (in progress)

The eighth-and-a-half protocol. Where MandateBackend is per-agent durable authorization, PolicyBackend is the fleet-wide policy layer that supersedes per-agent caps and allowlists. A company can author a policy.md at the project root that says "no agent in this tenant may spend more than $50/day, regardless of what their model.md says" — and the framework enforces that cap on every run.

Cost-cap consumption has shipped: the daily and monthly caps in policy.md are now enforced against real spend. The non-cap surfaces (tool allowlists, model allowlists, MCP allowlists) ship next in log-only mode behind a feature flag. The final PR locks the spec.

The semantic shift: per-agent caps in model.md set what the agent can spend; PolicyBackend sets what the operator permits it to spend. Operator-level always wins.

Deep dive: the conformance suite is the actual moat

Anyone — a third party, a company integrating internally, you in a year — can write a new backend. They drop their implementation into the conformance test harness. If the tests pass, their backend is fully interoperable with everything else. If the tests fail, they know exactly which guarantees they're violating.

The framework isn't trying to ship every possible backend itself; it's trying to be a contract that other backends can plug into. The 8-MUST Implementer contract documented in each protocol's spec doc is the rulebook every implementation has to follow.
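To show the idea in miniature, here is a toy protocol with a conformance check that reports which guarantee a backend violates. The two-method `LockLike` interface and the three guarantees are invented for this example; the real LockBackend contract and its 8-MUST list live in the spec docs.

```python
from typing import List, Protocol, runtime_checkable


@runtime_checkable
class LockLike(Protocol):
    """Toy stand-in for a lock protocol (hypothetical, not the spec's interface)."""

    def acquire(self, key: str) -> bool: ...
    def release(self, key: str) -> None: ...


class InMemoryLock:
    """A reference implementation that should pass the checks."""

    def __init__(self) -> None:
        self._held = set()

    def acquire(self, key: str) -> bool:
        if key in self._held:
            return False
        self._held.add(key)
        return True

    def release(self, key: str) -> None:
        self._held.discard(key)


def conformance_failures(backend: object) -> List[str]:
    """Run the same checks against any backend; an empty list means it conforms."""
    if not isinstance(backend, LockLike):
        return ["backend must provide acquire() and release()"]
    failures = []
    if backend.acquire("agent:scout") is not True:
        failures.append("first acquire of a free key must succeed")
    if backend.acquire("agent:scout") is not False:
        failures.append("second acquire of a held key must fail")
    backend.release("agent:scout")
    if backend.acquire("agent:scout") is not True:
        failures.append("acquire after release must succeed")
    backend.release("agent:scout")
    return failures
```

The check function knows nothing about how a backend stores its state, so the same suite can be pointed at a filesystem, SQLite, or Redis implementation without modification.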

Section 6

The spec is the product

The framework's central artifact isn't the Python code — it's the spec. 28 locked spec documents (plus 3 in-flight — 1 DRAFT, 2 RFC) describe every load-bearing primitive in precise detail. The Python package is one conforming reference implementation. Anyone can build a different one in any language.

This matters because it means the framework does not lock you in to a vendor. The spec documents are public. The reference implementations are MIT-licensed. If this Python project disappeared tomorrow, your atomic agent in your folder would still work, because what your folder depends on is the spec, not this particular implementation of it.

The spec docs are organized by primitive:

  • 01-19 — the foundations: anatomy, atomic memory, file formats, runtime assembly, capture rules, multi-agent projects, research foundations, evaluation, cost/observability, helpers, tuning, goals and intent, research integrity, outcomes, delegation, dreams, tools, skills, MCP
  • 20-25 — backend protocols: MemoryBackend, LockBackend, LogBackend, (23 reserved), AgentProfileBackend, ToolRegistryBackend
  • 26-29 — recent additions: cascade bundle (DRAFT), doctor preflight, judge layer, mandates
  • 30-32 — most-recent: responsibility audit (RFC), LLMBackend (locked at 31), policy backend (RFC)

A spec doc is locked once the implementation matches it and the tests pass. Spec changes that imply implementation changes get filed as GitHub issues. Spec docs separate shipped behavior from explicit future / deferred boundaries: sections describing behavior not yet implemented are explicitly marked, not silently aspirational.

Why this matters for a non-developer

You probably won't write a new backend. You probably won't deploy across a fleet of Cloud Run instances. So why care about the spec?

Two reasons. First, your agent is portable forever. If a year from now you want to move from Claude to a different language model, the agent doesn't change, just the LLMBackend. If you want to move from filesystem logs to a cloud monitoring tool, the agent doesn't change, just the LogBackend. The shape you depend on (the folder, the markdown files, the persona, the memory) is the layer that never moves.

Second, the framework is not a vendor. The spec is public. The implementations are MIT-licensed. Anyone can build a conforming runtime in any language. The atomic agent in your folder doesn't depend on this Python project surviving; it depends on the spec surviving. If the maintainer disappears, your agent still works.