AI coding agents spend most of their tokens navigating, not thinking. When you ask Claude, Copilot, or Cursor about your codebase, the agent runs a chain of grep searches, glob matches, and file reads — often 10+ tool calls — before it has enough context to answer. Each call adds tokens. Each turn re-sends the full conversation. The meter runs.
We wanted to know: what happens if you replace that search chain with a single semantic retrieval call? Not in theory — on a real codebase, with a real question, measuring real tokens and real dollars.
The experiment
We ran the same question through two search strategies on the same LLM (Claude Opus 4), against the same production codebase.
The codebase: A production Python application — FastAPI backend, Celery workers, PostgreSQL, multi-layer processing pipeline, ~850 files. A real codebase with real complexity, not a tutorial project.
The question: “How does the [core processing] pipeline work?”
This is a targeted architecture question. Answering it well requires finding the pipeline entry point, understanding a multi-stage processing flow, deduplication logic, feature flag gating, and post-processing conversion. The answer lives across multiple files — source code, tests, configuration, and operational documentation.
We cleared context between runs so neither approach had prior knowledge. Both started from scratch.
Approach 1: Standard search (grep + glob + read)
The agent used its normal toolkit — regex grep to find relevant files, glob patterns to discover pipeline-related paths, then sequential file reads. It ran 12 tool calls across 6 LLM roundtrips, organized into 4 parallel batches (a code sketch of the chain follows the list):
- Initial scan: 2 grep patterns + 2 glob patterns + 1 timestamp (5 parallel calls)
- Deep reads: 4 file reads in parallel (pipeline source, legacy pipeline, runbook, tests)
- Targeted follow-up: grep for the main function signature
- Chunked reads: 2 more reads to cover the 700+ line pipeline file
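To make the shape of that chain concrete, here is a rough Python sketch of the four batches. The patterns, paths, and candidate cap are illustrative stand-ins, not the run's actual queries:

```python
# Rough sketch of the four-batch chain above. Patterns, paths, and the
# candidate cap are illustrative stand-ins, not the run's actual queries.
from pathlib import Path

ROOT = Path(".")

def grep(pattern: str) -> list[Path]:
    # Stand-in for the agent's grep tool: every matched file becomes
    # tokens the LLM must read, then re-send on every later turn.
    return [p for p in ROOT.rglob("*.py")
            if pattern in p.read_text(errors="ignore")]

# Batch 1: broad discovery (2 greps + 2 globs, plus a timestamp call).
candidates = set(grep("pipeline") + grep("run_pipeline")
                 + list(ROOT.glob("**/pipeline*.py"))
                 + list(ROOT.glob("**/test_pipeline*.py")))

# Batch 2: deep parallel reads of the most promising files.
for path in sorted(candidates)[:4]:
    _ = path.read_text(errors="ignore")  # full contents enter the context

# Batches 3-4: a targeted grep for the main function signature, then two
# chunked reads of the 700+ line pipeline file. Every batch is one more
# LLM roundtrip that re-sends all accumulated conversation context.
```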
Grep matched 40+ files. The agent read 6 deeply. It consumed approximately 30,000 tokens of raw file content, which ballooned to ~140,000 estimated billable tokens across turns as conversation context accumulated.
Result: 37.8 seconds. Accurate, complete answer covering all pipeline stages, the deduplication logic, feature flag system, and operational details from the runbook.
Approach 2: Mnemosyne (semantic retrieval)
One command:
```bash
mnemosyne query "how does the abc pipeline work" --budget 3000
```
Returned 33 ranked chunks from 7 files. 2,991 tokens. The agent read the retrieval output, synthesized the answer, done. Three total tool calls (start timer, query, stop timer). Three LLM roundtrips.
Result: 15.9 seconds. Accurate answer covering all pipeline stages, deduplication, fallback handling, internal metrics tracking, and data provenance. The retrieval engine surfaced both implementation code and test files, giving the LLM a cross-sectional view of the architecture.
The numbers
| Metric | Standard | Mnemosyne | Delta |
|---|---|---|---|
| Wall clock | 37.8s | 15.9s | 2.4x faster |
| Search tool calls | 10 | 1 | 10x fewer |
| Retrieval tokens | ~30,000 | 2,991 | 10x fewer |
| Est. billable tokens | ~140,000 | ~25,000 | 5.6x fewer |
| Est. cost (Opus 4) | $2.21 | $0.53 | 4.2x cheaper |
| Signal:noise ratio | 6/40 = 0.15 | 7/7 = 1.00 | 6.7x better |
| LLM roundtrips | 6 | 3 | 2x fewer |
| Answer quality | Complete | Complete | Equivalent |
Answer quality
Both approaches produced complete, architecturally accurate answers. We verified each answer against the source code and documentation. Both correctly identified:
- The multi-stage pipeline flow and entry point
- Each processing stage’s role and outputs
- Deduplication and fallback logic
- Feature flag gating and configuration
- Post-processing conversion and output contract
The difference was in what supplementary details each approach surfaced. Standard search, having read the operational runbook directly, included more configuration specifics. Mnemosyne, having retrieved test files alongside source code, surfaced internal metrics and data provenance details that the standard approach missed. Different strengths, same core accuracy.
The signal:noise advantage
The most striking difference wasn’t speed or cost — it was precision. Standard search matched 40+ files but only read 6 of them. That means 85% of the file matches were noise: irrelevant results the agent had to evaluate and discard. Every discarded match still consumed tokens and time.
Mnemosyne surfaced 7 files. All 7 were directly relevant. A signal:noise ratio of 1.00 vs 0.15 — 6.7x better targeting. The LLM spent its entire token budget on useful content instead of filtering through directory listings, vendored libraries, and tangentially related test files.
This is where semantic retrieval fundamentally changes the economics: the agent doesn’t need to search at all. It receives pre-ranked, compressed context and goes straight to synthesis.
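In agent terms, that looks like a single subprocess call in place of the whole search loop. A minimal sketch, assuming the `mnemosyne` CLI from above is installed; the prompt wrapper and variable names are illustrative, and the output format may vary by version:

```python
# One retrieval call replaces the grep/glob/read chain. The CLI invocation
# mirrors the `mnemosyne query` command shown earlier; everything else here
# (prompt shape, names) is illustrative.
import subprocess

def retrieve(question: str, budget: int = 3000) -> str:
    result = subprocess.run(
        ["mnemosyne", "query", question, "--budget", str(budget)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout  # pre-ranked chunks, capped near the token budget

question = "how does the abc pipeline work"
context = retrieve(question)

# No search turns: the agent goes straight to synthesis with one prompt.
prompt = f"Using only the context below, answer: {question}\n\n{context}"
```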
The cost argument
At Opus 4 pricing ($15/MTok input, $75/MTok output), a single question costs $2.21 with standard search and $0.53 with Mnemosyne. Over a 20-question coding session, that’s $44.20 vs $10.60 — a $33.60 difference per session.
For a team of 5 engineers running 3 sessions per day, that's roughly $504 per day, or about $10,500 per month (at ~21 working days) in token savings — with equivalent answer quality.
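The arithmetic is easy to check. A small sketch using the measured per-question costs, with the 21-working-day month as the one explicit assumption:

```python
# Savings arithmetic from the per-question costs measured above.
STANDARD_COST = 2.21   # $/question, grep + glob + read (estimated)
MNEMOSYNE_COST = 0.53  # $/question, semantic retrieval (estimated)

per_session = 20 * (STANDARD_COST - MNEMOSYNE_COST)  # 20 questions -> $33.60
per_day = 5 * 3 * per_session                        # 5 engineers x 3 sessions -> $504.00
per_month = 21 * per_day                             # ~21 working days -> ~$10,584

print(f"session ${per_session:.2f}, day ${per_day:.2f}, month ${per_month:,.0f}")
```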
How this scales
The gap widens as codebases grow. Standard search cost scales with navigation difficulty — more files, more tool calls, more tokens spent finding the right content. Mnemosyne’s cost scales only with answer size — the retrieval engine handles navigation locally, outside the LLM’s token budget.
In our previous benchmark, we saw the same pattern: as query complexity increased, standard tool costs doubled while Mnemosyne costs grew only with the amount of relevant content. The harder the question and the larger the codebase, the more Mnemosyne saves.
For teams that want maximum depth on critical queries, the optimal strategy is straightforward: use Mnemosyne for the first pass, then run a targeted file read if the answer needs a specific detail. Two tool calls instead of twelve. $0.80 per query instead of $2.21.
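Sketched below using the `retrieve` helper from the earlier snippet; choosing the follow-up file is left as a parameter, since in practice the agent would pick it from the retrieved chunks:

```python
# Two-call hybrid: semantic retrieval first, then at most one exact read.
from pathlib import Path

def hybrid_context(question: str, follow_up: Path | None = None) -> str:
    context = retrieve(question)    # call 1: ranked, budget-capped chunks
    if follow_up is not None:       # call 2 (optional): one targeted file read
        context += "\n\n" + follow_up.read_text()
    return context                  # two tool calls instead of twelve
```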
Mnemosyne is open source. Zero dependencies, AGPL-3.0 (commercial licenses available). Works with any LLM or agent framework. The benchmark data and methodology described here are reproducible.
GitHub: castnettech/mnemosyne-engine · PyPI: mnemosyne-engine
```bash
pip install mnemosyne-engine
mnemosyne init && mnemosyne ingest
mnemosyne query "how does the auth system work?"
```
Benchmark disclaimer
Results reflect a single query against a specific codebase (production Python, ~850 files). Performance varies by project size, language, and query complexity. Token estimates are approximate — Claude Code does not expose per-request token counters. Cost calculated at published Opus 4 API pricing. Accuracy scored by point-by-point fact-checking against source code. Mnemosyne is provided as-is under the AGPL-3.0 license.