We were building internal tools — AI agents that read codebases and answer questions about them. The agents worked, but the bills didn't make sense. We traced it to one problem: context waste. Every time an agent needed to understand a module, it read entire files, directory listings, grep results — 40–70% of the tokens it consumed contributed nothing to the answer. We were paying for the AI to read boilerplate.
So we built a tool to fix it. Then we realized it was useful enough to open-source.
Benchmark 1: Simple query
We started with a straightforward test. Same question, same codebase, two approaches.
The target: a production Python codebase — 844 files, FastAPI backend, React frontend, PostgreSQL, Celery workers, multiple service layers. A real project, not a toy.
The question: "Tell me the purpose of this project and how it works."
Approach 1: Standard AI agent tooling
The agent used its normal workflow: file search to find relevant files, content search to locate key functions, then reads of specific files. Six tool calls. Navigated four files manually. Consumed roughly 9,200 tokens of context, of which approximately 5,000 were file path listings from directory searches that provided zero answer content. Pure noise.
Approach 2: Mnemosyne
One command:
mnemosyne query "Tell me the purpose of this project and how it works"
Returned 28 ranked chunks, 2,409 tokens. Every token was content — no file listings, no navigation overhead. The engine found relevant sections across project documentation, technical specifications, architecture notes, and sprint plans — several of which the standard approach never reached because the agent stopped after finding "enough" in the first few files.
The results
The numbers across every metric:
| Metric | Standard Tools | Mnemosyne | Change |
|---|---|---|---|
| Tool calls | 6 | 1 | −83% |
| Context tokens | 9,200 | 2,409 | −74% |
| Noise tokens | 5,000 | 0 | −100% |
| Wall clock | ~27s | ~0.2s | −99% |
| Answer quality | Complete | Complete | Equivalent |
The answer quality was equivalent. Both approaches produced complete, accurate descriptions of the project's purpose and architecture. The difference was cost: one approach consumed almost four times the tokens and took over a hundred times longer to deliver the same result.
But this was an easy question. What happens when the query gets hard?
Benchmark 2: Complex domain-specific query
For the second benchmark, we chose a query that requires deep codebase understanding: "How can I improve the accuracy of the detection and classification pipeline?"
This is not a "what does the project do" question. Answering it requires finding a specific detection module buried in a 1,933-line file, understanding a dual regex/ML pipeline architecture, tracing pattern classification logic, and connecting implementation details to architecture documentation. The kind of question a developer asks when they're about to make changes.
Approach 1: Standard AI agent tooling
The agent needed 13 sequential tool calls across 72 seconds. It searched for detection-related symbols across the source tree (32 files matched), narrowed to core files, then read the primary detector module in six separate passes because the file was too large for a single read. It traced pattern definitions, discovered the ML pipeline branch, read negation and gating logic, and finally pulled in the architecture documentation for context.
Total: 13 tool calls, ~18,500 tokens consumed, 1,046 lines of source code read across 6 files.
Approach 2: Mnemosyne
One command:
mnemosyne query "How can I improve the accuracy of the detection and classification pipeline?"
7 ranked chunks. 4,131 tokens. 1.2 seconds.
The retrieval engine found the detection patterns, the ML pipeline, the classification logic, the architecture overview, the technical spec, and the data models — all in a single query. It surfaced both implementation code and architectural documentation, providing the "what it does" and "how to improve it" context together.
Results comparison
| Metric | Standard Tools | Mnemosyne | Change |
|---|---|---|---|
| Tool calls | 13 | 1 | −92% |
| Context tokens | 18,500 | 4,131 | −78% |
| Wall clock | ~72s | ~1.2s | −98% |
| Answer quality | Deep | Deep | Equivalent |
The pattern: complexity scales cost for standard tools, not for Mnemosyne
| Metric | Simple Query | Complex Query | Increase |
|---|---|---|---|
| Baseline tool calls | 6 | 13 | +117% |
| Baseline tokens | 9,200 | 18,500 | +101% |
| Baseline time | ~27s | ~72s | +167% |
| Mnemosyne tool calls | 1 | 1 | +0% |
| Mnemosyne tokens | 2,409 | 4,131 | +71% |
| Token savings | 74% | 78% | +4 pts |
Standard tool costs doubled as query complexity increased. Mnemosyne's cost grew only with answer size, not navigation difficulty. The harder the question, the more Mnemosyne saves.
Projected over a 10-query complex coding session: standard tools consume ~185,000 context tokens. Mnemosyne consumes ~41,000. That is 144,000 tokens saved per session — real money for teams paying per-token for AI services.
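For readers who want the arithmetic spelled out, here is the projection as a short calculation. The per-query figures come from the complex-query benchmark above; the dollar rate in the comment is an illustrative assumption, not a quoted price.

```python
# Per-query context-token costs measured in Benchmark 2 (complex query).
STANDARD_TOKENS_PER_QUERY = 18_500
MNEMOSYNE_TOKENS_PER_QUERY = 4_131
QUERIES_PER_SESSION = 10

standard_total = STANDARD_TOKENS_PER_QUERY * QUERIES_PER_SESSION    # 185,000
mnemosyne_total = MNEMOSYNE_TOKENS_PER_QUERY * QUERIES_PER_SESSION  # 41,310 (~41,000)
saved = standard_total - mnemosyne_total                            # 143,690 (~144,000)

# At an assumed $3 per million input tokens, that is roughly $0.43 per
# session per agent -- small alone, large across a team running many sessions.
print(f"{saved:,} tokens saved per session (~${saved / 1_000_000 * 3:.2f})")
```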
How it works
Mnemosyne indexes a codebase into a searchable knowledge base. When queried, it runs a hybrid retrieval pipeline — combining full-text search (BM25), term-frequency analysis (TF-IDF), symbol name matching, usage history, and four other signals — then fuses them using Reciprocal Rank Fusion and re-ranks by value-per-token. Chunks that carry more information per token are preferred over large, dilute blocks.
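To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion over a few ranked lists. This illustrates the general technique, not Mnemosyne's internals; the signal names, chunk IDs, and the k constant are assumptions for the example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several best-first ranked lists of chunk IDs into one ranking.

    A chunk's fused score is the sum of 1 / (k + rank) across every list
    it appears in. k=60 is the constant from the original RRF paper; the
    value Mnemosyne uses is not documented here.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical output from three of the signals described above.
bm25    = ["detector.py#L40", "pipeline.py#L12", "models.py#L88"]
tfidf   = ["pipeline.py#L12", "detector.py#L40", "docs/arch.md#L3"]
symbols = ["detector.py#L40", "docs/arch.md#L3"]

print(reciprocal_rank_fusion([bm25, tfidf, symbols]))
# detector.py#L40 ranks first: it scores well in all three lists.
```

The value-per-token re-ranking described above would then divide each fused score by the chunk's token count before final selection; that step is omitted here for brevity.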
A four-stage compression pipeline strips boilerplate while preserving function signatures, control flow, and documentation. The result is a context payload where every token earns its place.
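As a rough illustration of what signature-preserving compression can look like (a single-stage toy, not the actual four-stage pipeline), here is a sketch using Python's ast module that keeps function and class signatures plus the first docstring line while dropping implementation bodies:

```python
import ast
import textwrap

def compress_module(source: str) -> str:
    """Keep def/class headers and docstring summaries, drop the bodies."""
    src_lines = source.splitlines()
    out = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            out.append(src_lines[node.lineno - 1].strip())  # the signature line
            doc = ast.get_docstring(node)
            if doc:
                out.append(f'    """{doc.splitlines()[0]}"""')
            out.append("    ...")
    return "\n".join(out)

sample = textwrap.dedent('''
    def classify(pattern, use_ml=True):
        """Route a detection through the regex or ML branch."""
        if use_ml:
            return run_ml_pipeline(pattern)
        return run_regex_pipeline(pattern)
''')
print(compress_module(sample))
# def classify(pattern, use_ml=True):
#     """Route a detection through the regex or ML branch."""
#     ...
```

The body tokens disappear while the signature and the documentation, the parts a model needs to reason about call sites, survive.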
The interface is three commands:
mnemosyne init
mnemosyne ingest
mnemosyne query "Tell me about the authentication flow"
init creates the index structure. ingest processes your codebase into ranked, searchable chunks. query returns the most relevant content, compressed and ready to feed into any LLM context window. No configuration files, no YAML, no setup wizards.
The engine is written in pure Python with zero external dependencies. No vector databases, no embedding services, no API keys. It runs on a laptop, in a CI pipeline, or inside any tool that can call a Python function. The retrieval pipeline is fast enough that it adds negligible latency — the 0.2-second query time in our benchmark includes the full retrieval, ranking, and compression cycle.
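Since the documented interface is the CLI, the simplest way to wire it into an agent or script is to shell out to the query command. A minimal sketch; the function name and prompt format here are our own, not part of the project:

```python
import subprocess

def retrieve_context(question: str) -> str:
    """Run a Mnemosyne query and return the compressed context payload."""
    result = subprocess.run(
        ["mnemosyne", "query", question],
        capture_output=True,
        text=True,
        check=True,  # raise if the index is missing or the query fails
    )
    return result.stdout

# Feed the ranked, compressed chunks straight into an LLM prompt.
question = "How does the auth system work?"
context = retrieve_context(question)
prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
```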
Mnemosyne is available now. Zero dependencies, AGPL-3.0 licensed (commercial licenses available), works with any LLM or agent framework.
GitHub: castnettech/mnemosyne-engine
Get started in four commands:
pip install mnemosyne-engine
mnemosyne init
mnemosyne ingest
mnemosyne query "How does the auth system work?"
Benchmark disclaimer
Benchmark results reflect a specific codebase and query. Performance varies by project size, language distribution, and query type. Mnemosyne is a developer tool provided as-is under the AGPL-3.0 license (commercial licenses available). See the project LICENSE for full terms.