Doubleword Agent Swarm

Models keep getting smarter, with longer context windows — so the reflex is to point one big, long-context agent at everything. For wide, shardable work, that reflex gets expensive: an agent re-sends its whole growing transcript on every turn, so reading the codebase is cheap but re-reading it is the bill. Auditing our open-source control-layer (512 source files, ~2.4M tokens of unique source) with a single long-context agent projects to ~300M tokens — 125× the corpus — or roughly $300 even with prompt caching.

The same audit as a swarm — an LLM orchestrator that designs its own team of bounded-context workers, an adversarial verifier that challenges each finding, and a synthesizer that writes the report — measured 5.6M tokens and ~$6.70. That's ~53× fewer tokens and ~45× cheaper, and it shipped 18 confirmed vulnerabilities.

The Doubleword Agent Swarm is our open-source implementation of Moonshot's Kimi agent swarm, rebuilt from scratch on open weights against the Open Responses API. The model brings the orchestration skill; the harness brings the runtime.

To run this yourself, install the dw CLI and run dw login, or sign up at app.doubleword.ai.

Why This Matters

A single smart agent is a fine hammer, but broad, parallel work wants more tools in the belt. The swarm is not a replacement for a long-context agent — it's a different tool for a different problem shape: work that is wide (hundreds of files, sources, or items) and shardable (each slice can be judged mostly on its own).

We ran the same task both ways over the same corpus:

	Solo agent (Claude Opus, projected)	Swarm (Kimi K2.6, measured)
Tokens	~300M — 125× the corpus	5.6M (5.2M in / ~450K out, 348 requests) — 2.3× the corpus
Cost	~$300 with prompt caching, ~$1,800 without	~$6.70

The solo figure is projected from a real metered run: at ~7% coverage the agent had already burned 27.7M tokens, 95% of which were cache reads from re-sending the transcript. The swarm number is measured, at Kimi K2.6's Doubleword rates ($0.95/M input, $4/M output). The swarm's 5.6M total is only 2.3× the read-once floor of the corpus — each file goes to exactly one worker, and workers never echo their research back.
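The arithmetic behind these figures can be checked with a few lines of Python. This is only a sanity check on the numbers quoted above. The variable and function names are illustrative and are not part of the swarm codebase.

```python
# Figures quoted in the text above; the names are illustrative only.
CORPUS_TOKENS = 2_400_000        # unique source tokens in the corpus
SWARM_INPUT_TOKENS = 5_200_000   # measured input tokens
SWARM_OUTPUT_TOKENS = 450_000    # measured output tokens (quoted as ~450K)
SWARM_TOTAL_TOKENS = 5_600_000   # quoted swarm total
SOLO_TOKENS = 300_000_000        # projected solo total
SOLO_COST_USD = 300.0            # projected solo cost with prompt caching
INPUT_RATE_USD_PER_M = 0.95      # Kimi K2.6 input rate, USD per million tokens
OUTPUT_RATE_USD_PER_M = 4.00     # Kimi K2.6 output rate, USD per million tokens


def cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Price a run from its token counts at the per-million-token rates."""
    input_cost = input_tokens / 1e6 * INPUT_RATE_USD_PER_M
    output_cost = output_tokens / 1e6 * OUTPUT_RATE_USD_PER_M
    return input_cost + output_cost


swarm_cost = cost_usd(SWARM_INPUT_TOKENS, SWARM_OUTPUT_TOKENS)

print(f"swarm cost:             ${swarm_cost:.2f}")
print(f"swarm tokens / corpus:  {SWARM_TOTAL_TOKENS / CORPUS_TOKENS:.1f}x")
print(f"solo tokens / corpus:   {SOLO_TOKENS / CORPUS_TOKENS:.0f}x")
print(f"solo / swarm tokens:    {SOLO_TOKENS / SWARM_TOTAL_TOKENS:.1f}x")
print(f"solo / swarm cost:      {SOLO_COST_USD / swarm_cost:.1f}x")
```

With the rounded token counts the swarm cost comes out at about $6.74, in line with the quoted ~$6.70. The small gap is what you would expect from the output count being quoted as "~450K".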

The pattern comes from the Kimi K2.5 report (Moonshot, Feb 2026): a PARL-trained orchestrator spawns specialised sub-agents and runs them in parallel — scale out, not just up — reporting 4.5× lower latency and +17.8 points on BrowseComp versus a single agent. The training helps, but the pattern is portable: the harness runs any tool-calling model.

How It Works

A swarm, reduced to four principles:

Self-designing orchestrator — the model decides the team and the decomposition, not you.
Bounded local context — each worker sees only its slice and returns only results. This is the paper's context sharding, and it's the whole cost story: no context overflow and low per-agent tokens are the same lever, which is why hundreds of agents stay cheap.
Structural anti-groupthink — an independent verifier tries to refute each finding before it counts.
Synthesis — one final tool-free pass reconciles everything into the report.
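A toy model makes the bounded-context principle concrete. It assumes one file is read per turn and that a solo agent re-sends everything read so far on each turn. It deliberately ignores prompt-caching discounts, model output, and per-worker prompt overhead, and the file sizes are made up, so treat it as an illustration of the growth rate rather than a cost estimate.

```python
def solo_billed_tokens(file_sizes: list[int]) -> int:
    """One agent: every turn re-sends the whole transcript so far."""
    transcript = 0
    billed = 0
    for size in file_sizes:
        transcript += size     # the new file joins the transcript
        billed += transcript   # and the full transcript is sent again
    return billed


def sharded_billed_tokens(file_sizes: list[int]) -> int:
    """Bounded workers: each file is sent once, to exactly one worker."""
    return sum(file_sizes)


# Hypothetical corpus: 100 files of 1,000 tokens each.
files = [1_000] * 100
solo = solo_billed_tokens(files)
sharded = sharded_billed_tokens(files)

print(f"solo:    {solo:,} tokens")
print(f"sharded: {sharded:,} tokens")
print(f"ratio:   {solo / sharded:.1f}x")
```

In this model the solo total grows with the square of the number of files, while the sharded total grows linearly, which is why adding workers does not multiply the bill.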

End to end:

  repo
  │
  ▼
  1 · orchestrator       designs the team from a repo map
  │
  ├─ worker              2 · a parallel wave; each worker
  ├─ worker                  sees only its own slice and
  └─ worker                  returns only results
  │
  ▼
  3 · verifiers          a skeptic refutes each finding
  │
  ▼
  4 · synthesizer        one pass writes the report
  │
  ▼
  report.md + findings.json

The orchestrator clones and maps the repo, then designs its own team and dispatches workers in parallel waves. Each worker's status, along with any files it did not report on, is routed back so that a follow-up wave can fill the gaps. Findings are deduped before they reach the verifiers, and the verifier stage is optional per brief.
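The wave-then-dedupe flow described above might look roughly like the following. This is a minimal sketch, not the harness's actual code: `run_worker` is a hypothetical stand-in for a real model call, and the dedupe key of (file, line, title) is an assumption.

```python
import asyncio


async def run_worker(task: str) -> dict:
    """Hypothetical stand-in for one worker's model call."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {
        "task": task,
        "status": "ok",
        "findings": [{"file": "app.py", "line": 10, "title": "shell injection"}],
    }


async def run_wave(tasks: list[str], max_agents: int = 4) -> list[dict]:
    """Run one wave concurrently with at most max_agents workers in flight."""
    gate = asyncio.Semaphore(max_agents)

    async def bounded(task: str) -> dict:
        async with gate:
            return await run_worker(task)

    # gather returns results in the same order as the tasks were given.
    return await asyncio.gather(*(bounded(t) for t in tasks))


def dedupe(findings: list[dict]) -> list[dict]:
    """Keep the first finding per (file, line, title); the key is an assumption."""
    seen = set()
    unique = []
    for finding in findings:
        key = (finding["file"], finding["line"], finding["title"])
        if key not in seen:
            seen.add(key)
            unique.append(finding)
    return unique


results = asyncio.run(run_wave(["auth", "filesystem", "network"]))
all_findings = [f for r in results for f in r["findings"]]
unique_findings = dedupe(all_findings)
print(len(all_findings), "raw findings,", len(unique_findings), "after dedupe")
```

Deduping before verification means three workers that stumble on the same issue cost one verifier call, not three.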

Workers are ephemeral: a persona is registered once, but every task spawns a fresh agent with no memory between tasks. A worker's scratch context — everything it read and grepped — stays local and is discarded; only structured, schema-valid results return (invalid items are dropped, not trusted). In a real audit run the orchestrator invented personas like injection-filesystem unprompted, wrote their system prompts itself, and fanned tasks out across them.
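The "drop, don't trust" rule for worker output can be sketched as a plain filter. The field names below follow the audit brief's description (severity, file and line, fix), but the exact schema, the allowed severity values, and the function names are assumptions made for illustration.

```python
# Assumed severity vocabulary; the real brief may use different values.
ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}


def is_valid_finding(item: object) -> bool:
    """Return True only if the item matches the assumed finding schema."""
    if not isinstance(item, dict):
        return False
    severity = item.get("severity")
    file = item.get("file")
    line = item.get("line")
    fix = item.get("fix")
    return (
        isinstance(severity, str)
        and severity in ALLOWED_SEVERITIES
        and isinstance(file, str)
        and file != ""
        and isinstance(line, int)
        and not isinstance(line, bool)  # bool is a subclass of int
        and line > 0
        and isinstance(fix, str)
    )


def accept_results(items: list) -> tuple[list, int]:
    """Keep schema-valid items; count the rest as dropped instead of repairing them."""
    valid = [item for item in items if is_valid_finding(item)]
    return valid, len(items) - len(valid)


submitted = [
    {"severity": "high", "file": "api/upload.py", "line": 42, "fix": "validate path"},
    {"severity": "urgent", "file": "api/upload.py", "line": 42, "fix": "n/a"},
    {"severity": "low", "file": "cli.py", "line": "7", "fix": "quote argument"},
    "free-text note from the worker",
]
kept, dropped = accept_results(submitted)
print(f"kept {len(kept)}, dropped {dropped}")
```

Only the first item survives: the second has an unknown severity, the third has a string where a line number is expected, and the fourth is not a structured result at all.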

Failure is loud where it matters: a dead orchestrator call or a run that dispatches zero workers raises an error and exits non-zero — it never ships a vacuous report. Failed workers warn and continue with partial coverage (reported in summary.json).
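One way to express this policy is to raise on the fatal conditions and only log the recoverable ones. The sketch below is an assumption about the shape of such a policy, not the harness's implementation. `SwarmRunError`, `summarize`, and the result fields are hypothetical names.

```python
import logging
import sys

log = logging.getLogger("swarm-sketch")


class SwarmRunError(RuntimeError):
    """Raised when a run cannot produce a meaningful report."""


def summarize(worker_results: list[dict]) -> dict:
    """Fail loudly on an empty run; warn and continue on failed workers."""
    if not worker_results:
        raise SwarmRunError("run dispatched zero workers")
    failed = [r for r in worker_results if r["status"] != "ok"]
    for result in failed:
        log.warning("worker %s failed; continuing with partial coverage", result["name"])
    return {"workers": len(worker_results), "failed": len(failed)}


def main(worker_results: list[dict]) -> int:
    """Return a process exit code: non-zero when no report should ship."""
    try:
        summary = summarize(worker_results)
    except SwarmRunError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1
    print(summary)
    return 0


empty_run_code = main([])
partial_run_code = main(
    [{"name": "auth", "status": "ok"}, {"name": "network", "status": "failed"}]
)
```

A real entry point would end with `sys.exit(main(results))` so that CI jobs and shell scripts see the failure.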

Briefs: pointing the engine at a task

The engine is generic — the loop is byte-identical for every task. A brief (~50 lines of data) makes it task-specific with three levers: the prompts each role gets, the result schema workers must emit (enforced at submit_results), and the tools each role is granted plus dedupe/verify hooks. Two ship in the box:

Brief	Point it at a repo, get…	Verifier
`audit`	a triaged bug/security report — `findings.json` (severity, `file:line`, fix) + `report.md`	yes (adversarial)
`onboarding`	an architecture/onboarding guide — `sections.json` (purpose, components, deps) + `report.md`	no

The onboarding brief required zero engine changes — proof the loop is task-agnostic. Other briefs you could write: a dependency review (one worker per dependency: drift, advisories, upgrade risk), a refactor plan (workers map usage per module, the synthesizer sequences the steps), or wide research (one worker per source, verifiers refute unsupported claims).
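To give a feel for "a brief is data", here is what the dependency-review idea could look like as a plain data structure. The real brief format is not shown in this document, so the `Brief` class, its field names, and the schema below are all hypothetical. Only the three levers (prompts, result schema, tools plus dedupe/verify hooks) and the tool names come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Brief:
    """Hypothetical container for the three levers a brief controls."""

    name: str
    prompts: dict[str, str]             # lever 1: one prompt per role
    result_schema: dict                 # lever 2: what workers must emit
    tools: dict[str, tuple[str, ...]]   # lever 3: tools granted per role
    dedupe_key: tuple[str, ...] = ()    # fields that identify a duplicate
    verify: bool = True                 # whether the verifier stage runs


dependency_review = Brief(
    name="dependency-review",
    prompts={
        "orchestrator": "Assign one worker per dependency in the manifest.",
        "worker": "Assess drift, advisories, and upgrade risk for your dependency.",
        "synthesizer": "Rank the dependencies by upgrade urgency.",
    },
    result_schema={
        "type": "object",
        "required": ["package", "current", "latest", "risk"],
        "properties": {
            "package": {"type": "string"},
            "current": {"type": "string"},
            "latest": {"type": "string"},
            "risk": {"enum": ["low", "medium", "high"]},
        },
    },
    tools={"worker": ("read_file", "grep", "check_advisory")},
    dedupe_key=("package",),
    verify=False,
)

print(dependency_review.name, "grants", dependency_review.tools["worker"])
```

Because everything task-specific lives in the data, the engine loop would have nothing to special-case when a new brief is added.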

Run modes

Two orchestration interfaces, plus a single-agent baseline:

--interface kimi (default) — the literal tool surface Kimi K2.5/K2.6 were RL-trained on: create_subagent(name, system_prompt) lets the orchestrator author each specialist persona, then assign_task(agent, prompt) dispatches free-text tasks in parallel. Decomposes by task; each sub-agent self-gathers its own context with read_file/grep, as in the paper.
--interface structured — the orchestrator instead calls dispatch_workers([{role, focus, paths}]) and the harness preloads each worker's files. Decomposes by scope; fast and deterministic on large repos.
--solo — one agent, no orchestration, the whole repo in one large context: the paper's single-agent baseline. Same verify/synthesize tail, so the outputs are shape-identical to a swarm run — point both at the same repo and compare findings, latency, and tokens.
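The kimi interface's two tools could be declared and handled along these lines. The tool names and parameters come from the description above. The JSON layout assumes the flat function-tool shape used by Responses-style APIs, and the descriptions, the `Registry` class, and its return strings are illustrative assumptions.

```python
import json


def function_tool(name: str, description: str, params: dict[str, str]) -> dict:
    """Build a function-tool definition with all-string, all-required parameters."""
    return {
        "type": "function",
        "name": name,
        "description": description,
        "parameters": {
            "type": "object",
            "properties": {p: {"type": "string", "description": d} for p, d in params.items()},
            "required": list(params),
            "additionalProperties": False,
        },
    }


KIMI_TOOLS = [
    function_tool(
        "create_subagent",
        "Register a specialist persona.",
        {"name": "Persona name", "system_prompt": "Persona system prompt"},
    ),
    function_tool(
        "assign_task",
        "Dispatch a free-text task to a registered persona.",
        {"agent": "Name of a registered persona", "prompt": "Task description"},
    ),
]


class Registry:
    """Personas are registered once; each assigned task gets its own fresh agent."""

    def __init__(self) -> None:
        self.personas: dict[str, str] = {}
        self.queue: list[tuple[str, str]] = []

    def handle(self, tool_name: str, arguments_json: str) -> str:
        args = json.loads(arguments_json)
        if tool_name == "create_subagent":
            self.personas[args["name"]] = args["system_prompt"]
            return "registered"
        if tool_name == "assign_task":
            if args["agent"] not in self.personas:
                return f"error: unknown agent {args['agent']!r}"
            self.queue.append((args["agent"], args["prompt"]))
            return "queued"
        raise ValueError(f"unknown tool: {tool_name}")


registry = Registry()
created = registry.handle(
    "create_subagent",
    json.dumps({"name": "injection-filesystem", "system_prompt": "Hunt path traversal."}),
)
queued = registry.handle(
    "assign_task",
    json.dumps({"agent": "injection-filesystem", "prompt": "Audit the upload handlers."}),
)
rejected = registry.handle("assign_task", json.dumps({"agent": "ghost", "prompt": "x"}))
```

Returning an error string for an unknown agent, instead of raising, gives the orchestrator a chance to correct itself on the next turn.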

Tools (read-only v1)

Every tool is non-mutating — the swarm reads and analyses, never changes the target, so it's safe to point at any repo. Beyond the engine's own tools (dispatch_workers or create_subagent+assign_task, submit_results, submit_verdict), a brief grants its workers capability tools:

Tool	Description
`read_file`	Read a repo file to follow an import/definition
`grep`	Regex-search the repo to trace a value to its sink
`run_sast`	Run static analysers (bandit/semgrep/…) — read-only
`check_advisory`	Look up a dependency's CVEs on OSV (keyless)
`web_search` / `read_page`	Ground a finding against docs/advisories (opt-in)
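As an illustration of what "non-mutating" can mean in practice, here is a minimal sketch of `read_file` and `grep` that only ever open files for reading and refuse paths outside the repo. The signatures, size limits, and output format are assumptions, not the harness's actual tools.

```python
import re
import tempfile
from pathlib import Path


def read_file(repo_root: str, rel_path: str, max_chars: int = 100_000) -> str:
    """Read a file inside the repo; reject paths that escape the repo root."""
    root = Path(repo_root).resolve()
    target = (root / rel_path).resolve()
    if not target.is_relative_to(root):
        raise ValueError(f"path escapes repo: {rel_path}")
    return target.read_text(encoding="utf-8", errors="replace")[:max_chars]


def grep(repo_root: str, pattern: str, max_hits: int = 50) -> list[str]:
    """Regex-search text files in the repo, returning 'path:line: text' hits."""
    root = Path(repo_root).resolve()
    regex = re.compile(pattern)
    hits: list[str] = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append(f"{path.relative_to(root).as_posix()}:{number}: {line.strip()}")
                if len(hits) >= max_hits:
                    return hits
    return hits


# Demo against a throwaway directory standing in for a cloned repo.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "app.py").write_text("import os\nos.system(cmd)\n", encoding="utf-8")
    hits = grep(tmp, r"os\.system")
    content = read_file(tmp, "app.py")
    try:
        read_file(tmp, "../outside.txt")
        escape_blocked = False
    except ValueError:
        escape_blocked = True

print(hits, escape_blocked)
```

Resolving both the root and the target before comparing them is what stops a `../` path, or a symlink, from reaching files outside the repo.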

Running It

Using the Doubleword CLI

Install the dw CLI and log in:

dw login

Clone, set up, and see the full workflow:

dw examples clone swarm
cd swarm
dw project setup
dw project info

Run a brief:

# Audit a GitHub repo (shallow-cloned automatically)
dw project run audit -- --repo psf/requests --max-files 20

# Document a local directory instead — no remote needed
dw project run onboarding -- --path ./my-service

# Plan only: print the repo map and dispatch plan, no API calls
dw project run audit -- --repo psf/requests --dry-run

Print the latest run's report:

dw project run report

The model is a runtime parameter (default moonshotai/Kimi-K2.6, alias k2.6):

# Different model, or cheap workers under a strong orchestrator/synthesizer
dw project run audit -- --repo psf/requests -m k2.5
dw project run audit -- --repo psf/requests --worker-model <cheaper-model>

# The scope-partition interface (the default is the paper's task-based kimi)
dw project run audit -- --repo psf/requests --interface structured

# The single-agent baseline, for comparison
dw project run audit -- --repo psf/requests --solo

Useful knobs while iterating: -v prints a per-call line (role, agent, elapsed, tokens) and the dispatch plan, -vv adds each agent's tool calls; --verify-votes 3 runs a verifier panel with majority vote, --no-verify skips verification; --max-files, --max-agents, and --max-waves cap the run's budget. Setting SERPER_API_KEY (or passing --enable-search) lets workers ground findings against the web.
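The panel behaviour behind --verify-votes can be pictured as a simple majority count. This sketch assumes each verifier returns either "confirmed" or "refuted" and that a finding needs a strict majority to survive. The verdict labels and the tie-breaking rule are assumptions, not documented behaviour.

```python
from collections import Counter


def majority_confirms(votes: list[str]) -> bool:
    """True when strictly more than half of the verifiers confirmed the finding."""
    if not votes:
        return False
    return Counter(votes)["confirmed"] > len(votes) / 2


panel_a = majority_confirms(["confirmed", "refuted", "confirmed"])
panel_b = majority_confirms(["refuted", "refuted", "confirmed"])
tie = majority_confirms(["confirmed", "refuted"])

print(panel_a, panel_b, tie)
```

An odd panel size such as 3 avoids ties altogether. In this sketch a tie counts as refuted, which keeps the bias toward dropping doubtful findings.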

Inside the cloned project you can also call the CLI directly — uv run swarm run audit --repo …, uv run swarm briefs — and running outside the dw CLI entirely just needs DOUBLEWORD_API_KEY set.

What a run produces

Each run writes results/<brief>-<slug>/:

File	Contents
`report.md`	the synthesized, human-readable report
`findings.json` / `sections.json`	the structured results (machine-readable)
`swarm-tree.json`	the agents the orchestrator spawned — roles, scopes, status
`summary.json`	model, tokens, cost, coverage, and critical/total step counts

The step counts mirror the paper's critical-steps metric: total is every agent turn the run made, critical is the longest dependent path. Their ratio is the swarm's own parallelism score — how much the orchestrator actually spread the work, scored the way the paper scores it.
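The two counts can be reproduced on a small example by treating the run as a dependency graph. The sketch below follows the description above (total is every turn, critical is the longest dependent path). The paper's exact definition may differ in detail, and the agent names and turn counts are invented for illustration.

```python
from functools import lru_cache


def step_counts(turns: dict[str, int], waits_on: dict[str, list[str]]) -> tuple[int, int]:
    """Return (total, critical) steps for an acyclic agent dependency graph."""

    @lru_cache(maxsize=None)
    def finish(agent: str) -> int:
        # An agent starts once everything it waits on has finished.
        start = max((finish(dep) for dep in waits_on.get(agent, [])), default=0)
        return start + turns[agent]

    total = sum(turns.values())
    critical = max(finish(agent) for agent in turns)
    return total, critical


# Hypothetical run: one orchestrator, three parallel workers, a verifier, a synthesizer.
turns = {"orchestrator": 2, "w1": 5, "w2": 4, "w3": 6, "verifier": 3, "synthesizer": 1}
waits_on = {
    "w1": ["orchestrator"],
    "w2": ["orchestrator"],
    "w3": ["orchestrator"],
    "verifier": ["w1", "w2", "w3"],
    "synthesizer": ["verifier"],
}

total, critical = step_counts(turns, waits_on)
print(f"total={total} critical={critical} parallelism={total / critical:.2f}")
```

Here the critical path runs through the slowest worker (2 + 6 + 3 + 1 = 12 steps) while the run made 21 steps in total, a ratio of 1.75. A fully serial run would score 1.0, and wider waves push the ratio up.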

Cost Comparison

The control-layer audit, side by side:

	Tokens	Cost	Notes
Solo agent (Claude Opus, 1M window)	~300M	~$300	projected from a metered run at ~7% coverage; 95% cache reads
Swarm (Kimi K2.6 on Doubleword)	5.6M	~$6.70	measured · 348 requests · 18 confirmed vulnerabilities

There's one more flag worth knowing. This workload is throughput-bound, not latency-bound — you're running hundreds of agents at once, so what matters is when the whole wave lands, not any single call. Switching to the flex tier makes individual calls slower but holds global throughput, so end-to-end wall-clock stays roughly the same at ~30% lower cost:

dw project run audit -- --repo psf/requests --service-tier flex --background

To measure the trade-off yourself, the compare command runs the identical workload in both tiers and writes a wall-clock / token / cost table to results/<slug>/analysis.md:

dw project run compare -- --repo psf/requests --max-files 20

In-tool cost is computed from the API's reported token usage; treat dw usage as the source of truth for actual spend:

dw usage --since $(date +%Y-%m-%d)