The Problem We Set Out to Solve
LLM coding agents on large repositories burn 40–70% of their tokens on irrelevant context. We measured this across our own production codebases — FastAPI backends, React frontends, Celery workers, multi-layer processing pipelines — and the pattern was consistent. The agent spends most of its budget navigating, not thinking.
The standard approaches each have a fundamental limitation:
- Grep / ripgrep returns keyword matches with no understanding of relevance. A search for `validate` in a large codebase returns hundreds of results — mostly noise. The agent burns tokens evaluating and discarding irrelevant matches.
- Vector-only search (dense embeddings) captures semantic similarity but misses exact keyword matches. If you search for a specific function name, embeddings may rank a semantically similar but unrelated function higher than the exact match.
- AST-only search finds structural elements — function definitions, class hierarchies — but loses semantic meaning. It cannot answer "how does the authentication flow work?" because that question spans documentation, configuration, and implementation code.
No single signal type is sufficient. We needed a system that combined all of them — keyword relevance, statistical importance, structural awareness, usage patterns, predictive prefetch, and semantic similarity — into a single retrieval call that returns ranked, compressed, token-budget-aware context.
That system is Mnemosyne.
Timeline: What We Published and When
Every claim below is verifiable through immutable timestamps on PyPI, GitHub, and DEV.to. We are documenting the timeline here because prior art matters — especially in a space where multiple teams are building similar tools simultaneously.
| Date | Milestone | Verifiable At |
|---|---|---|
| March 29, 2026 | First PyPI release (mnemosyne-engine v0.3.0). GitHub repository created at github.com/castnettech/mnemosyne. | PyPI release history |
| March 30, 2026 | v1.0.0 stable release on PyPI — production-ready, full six-signal pipeline, AST-aware compression, configurable token budgets. | PyPI v1.0.0 |
| March 31, 2026 | Public technical writeup on DEV.to documenting 73% token reduction on an 829-file production codebase. | DEV.to article |
| April 2, 2026 | v1.0.4 with verified publishing. Head-to-head benchmark published showing 2.4x speed improvement and 5.6x fewer tokens vs standard retrieval. | PyPI v1.0.4 |
All release timestamps are immutable on PyPI and verifiable by anyone. The GitHub commit history, PyPI upload timestamps, and DEV.to publication dates form a complete, independently auditable record of when each component of this architecture was published.
The Six-Signal Architecture
This is the core technical innovation in Mnemosyne. Rather than relying on a single retrieval method, the engine runs six independent signal extractors on every query and fuses the results into a single ranked list.
The six signals
- BM25 full-text search — Keyword relevance with term frequency normalization and document length adjustment. BM25 is the standard in information retrieval for a reason: it handles exact keyword matches with well-understood statistical properties. When a developer searches for `parse_config`, BM25 finds it.
- TF-IDF scoring — Statistical importance across the corpus. A term that appears in 3 out of 800 files is more informative than one that appears in 400. TF-IDF surfaces the distinctive terms — the ones that actually differentiate one module from another.
- Symbol name matching — Direct function, class, and method name search. This is the fastest path to structural elements. When you need `AuthenticationMiddleware`, symbol search finds the definition immediately, regardless of how it is described in comments or documentation.
- Usage frequency — How often a symbol is referenced across the codebase. A function called from 47 locations is more architecturally significant than one called from 2. Usage frequency surfaces the code that actually matters to the system — the load-bearing functions, the shared utilities, the central abstractions.
- Predictive prefetch — Anticipating what the agent will need next based on query patterns and co-occurrence statistics. If a developer asks about the validation pipeline, they will likely need the error handling and the data model next. Prefetch reduces round-trips by including probable follow-up context in the first retrieval.
- Optional dense embeddings — Vector similarity for semantic queries. When the question is conceptual rather than keyword-specific — "how does error recovery work?" — embeddings capture semantic relationships that keyword-based methods miss. This signal is optional because the other five handle the majority of queries without requiring an embedding model.
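The first of those signals is easy to make concrete. Below is a minimal, self-contained sketch of Okapi BM25 scoring — not Mnemosyne's implementation, just the standard formula, with the common `k1=1.5`, `b=0.75` defaults:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against query terms with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency: how many docs contain each query term
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        s = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            num = tf[t] * (k1 + 1)
            den = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            s += idf * num / den
        scores.append(s)
    return scores

docs = [
    "def parse_config path return yaml safe_load open path".split(),
    "def main run the app".split(),
]
print(bm25_scores(["parse_config"], docs))  # first doc scores > 0, second scores 0
```

The length normalization term (`b`) is what keeps a sprawling file from outranking a focused one just by repeating the query term more often.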
Reciprocal Rank Fusion
The six signals are fused through Reciprocal Rank Fusion (RRF) — a well-studied technique from the information retrieval research literature (Cormack, Clarke, & Büttcher, 2009). RRF combines ranked lists from multiple retrieval systems without requiring score calibration. Each system produces a ranked list; RRF assigns a score based on rank position, then merges the lists.
This is not a simple weighted average of raw scores. Raw scores from different retrieval methods are not comparable — a BM25 score of 4.2 and an embedding cosine similarity of 0.87 exist on entirely different scales. Weighted averaging of incompatible scores produces unreliable results. RRF avoids this problem by operating on ranks, not scores. A document ranked #1 by BM25 and #3 by embeddings receives a predictable, calibration-free combined rank.
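The whole technique fits in a few lines. This is the textbook RRF formula, score(d) = Σ 1/(k + rank), with the k=60 constant from the original paper; the doc ids are hypothetical:

```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse best-first ranked lists of doc ids via Reciprocal Rank Fusion.

    Each document's fused score is the sum of 1/(k + rank) over every
    list it appears in; k=60 follows Cormack et al. (2009).
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["auth.py", "utils.py", "models.py"]
embed_ranking = ["models.py", "auth.py", "docs.md"]
print(rrf_fuse([bm25_ranking, embed_ranking]))
# → ['auth.py', 'models.py', 'utils.py', 'docs.md']
```

Note that only rank positions enter the computation: the raw BM25 scores and cosine similarities never need to live on the same scale.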
The principled fusion of multiple retrieval signals is what separates Mnemosyne from tools that bolt a single embedding index onto a code parser. Each signal compensates for the others' blind spots. BM25 catches exact keywords that embeddings miss. Symbol search finds the function you named. Usage frequency surfaces the code that actually matters. Prefetch reduces round-trips. Together, they cover the full retrieval surface.
AST-aware compression
After ranking, Mnemosyne applies AST-aware compression to the selected chunks. The compression pipeline strips boilerplate — import blocks, blank lines, redundant comments, getter/setter patterns — while preserving the elements that carry information:
- Function and method signatures (name, parameters, return types)
- Control flow structure (conditionals, loops, exception handling)
- Documentation strings and inline comments that explain why, not what
- Class hierarchy and interface definitions
The result stays within a configurable token budget. The developer (or the agent framework) specifies how many tokens are available, and Mnemosyne fills that budget with the highest-value content it can find — ranked, compressed, and ready for the LLM context window.
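As a rough illustration of the idea (not Mnemosyne's actual compressor, which also preserves control flow), Python's standard `ast` module can reduce a source file to signatures and docstrings, and a greedy pass can then fill a token budget; the whitespace tokenizer here is a stand-in for a real one:

```python
import ast

def compress_source(source, indent="    "):
    """Reduce a Python module to class/function signatures and first
    docstring lines — a simplified sketch of signature-preserving
    compression."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"def {node.name}({args}): ...")
            doc = ast.get_docstring(node)
            if doc:
                lines.append(indent + f'"""{doc.splitlines()[0]}"""')
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}: ...")
    return "\n".join(lines)

def pack_to_budget(chunks, budget_tokens):
    """Greedily fill a token budget with chunks assumed ranked best-first
    (whitespace split as a crude token count)."""
    out, used = [], 0
    for chunk in chunks:
        cost = len(chunk.split())
        if used + cost > budget_tokens:
            continue
        out.append(chunk)
        used += cost
    return "\n\n".join(out)

src = '''
class RateLimiter:
    def check(self, key):
        """Return True if key is under its limit."""
        return True
'''
print(compress_source(src))
```

Even this crude version cuts most implementation bodies while keeping exactly what an LLM needs to reason about call sites: names, parameters, and intent.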
How This Compares
Multiple teams are building LLM code retrieval tools. The following table compares publicly documented features as of April 2026. Where documentation is unclear or unavailable, that is noted.
| Feature | Mnemosyne | cocoindex-code | CodeGrok MCP | claude-context (Zilliz) | Augment Context Engine |
|---|---|---|---|---|---|
| Retrieval signals | 6 (BM25 + TF-IDF + symbol + usage + prefetch + embeddings) | 2 (AST + embeddings) | 2 (AST + embeddings) | 1 (embeddings) | Unknown (proprietary) |
| Fusion method | Reciprocal Rank Fusion | None documented | None documented | None documented | Proprietary |
| AST-aware compression | Yes — preserves signatures, control flow | Partial (tree-sitter chunks) | Partial (tree-sitter chunks) | No | Unknown |
| Token budget control | Configurable per-query | No | No | No | No |
| Dependencies | Zero (pure Python + SQLite) | Requires sentence-transformers | Requires ChromaDB + nomic-ai model | Requires Milvus/Zilliz | Requires Augment account |
| Runs offline | Yes | Yes (after model download) | Yes (after model download) | No (cloud service) | No (cloud service) |
| License | AGPL-3.0 (commercial available) | Apache-2.0 | MIT | Apache-2.0 | Proprietary |
| First public release | March 29, 2026 | March 12, 2026 | January 2026 | ~2026 | February 2026 |
Feature data sourced from public documentation and repository READMEs as of April 2, 2026. "None documented" means we found no public description of a fusion method — the tool may use one internally. We welcome corrections.
Why Architecture Matters More Than Hype
Single-signal retrieval systems have fundamental blind spots. An embedding-only system will always struggle with exact keyword matches. An AST-only system will always miss semantic context. A grep-based system will always return noise proportional to codebase size. These are not implementation bugs — they are architectural limitations.
Consider a concrete example. A developer asks: "How does the rate limiter work?" In a large codebase:
- BM25 finds every file containing "rate limiter" or "rate limit" — including configuration files, test fixtures, changelog entries, and the actual implementation.
- TF-IDF identifies which of those files are distinctively about rate limiting versus merely mentioning it. The implementation file scores high; the changelog entry scores low.
- Symbol search finds `RateLimiter`, `check_rate_limit()`, `RateLimitMiddleware` — the structural definitions, regardless of how they are described in prose.
- Usage frequency reveals that `check_rate_limit()` is called from 23 endpoints — making it architecturally central — while `_reset_bucket()` is an internal helper called from one place.
- Predictive prefetch includes the Redis configuration and the middleware registration, because rate limiting queries are statistically followed by questions about storage backend and integration points.
- Embeddings (if enabled) catch conceptually related code that uses different terminology — "throttle," "backpressure," "circuit breaker."
No single signal produces that complete picture. The combination does.
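The prefetch behavior in that walkthrough can be approximated with simple co-occurrence statistics: track which files are retrieved together in a session, then suggest the strongest neighbors of whatever was just retrieved. This is a hypothetical sketch, with made-up file names, not Mnemosyne's internals:

```python
from collections import defaultdict
from itertools import combinations

class CooccurrencePrefetch:
    """Suggest likely follow-up files from session co-occurrence counts
    (a hypothetical sketch of a prefetch signal)."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def record_session(self, files):
        # every unordered pair retrieved in the same session co-occurs once
        for a, b in combinations(set(files), 2):
            self.counts[a][b] += 1
            self.counts[b][a] += 1

    def prefetch(self, file, top_n=2):
        ranked = sorted(self.counts[file].items(), key=lambda kv: -kv[1])
        return [f for f, _ in ranked[:top_n]]

p = CooccurrencePrefetch()
p.record_session(["rate_limiter.py", "redis_config.py", "middleware.py"])
p.record_session(["rate_limiter.py", "redis_config.py"])
print(p.prefetch("rate_limiter.py"))
# → ['redis_config.py', 'middleware.py']
```

A production version would decay old sessions and condition on query text, but even raw pair counts capture the "rate limiter, then Redis config" pattern described above.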
We didn't build Mnemosyne to win a benchmark. We built it because our own agents were burning tokens. The architecture exists because we needed every signal type to get retrieval right.
The benchmark results — 73% token reduction, 2.4x speed improvement, 5.6x fewer tokens in head-to-head testing — are a consequence of the architecture, not the goal. A system that retrieves the right content on the first call, compressed to fit the token budget, will always outperform a system that navigates iteratively through trial and error.
What's Next
- MCP server wrapper LIVE — `pip install mnemosyne-mcp`, then `claude mcp add mnemosyne -- mnemosyne-mcp` to integrate directly with Claude Code and other MCP-compatible agent frameworks. Zero config, stdio transport, lazy engine initialization. PyPI: mnemosyne-mcp
- Head-to-head benchmarks against competing tools on standardized codebases — reproducible, published methodology, shared test fixtures.
- Language coverage expansion — deeper AST support for additional languages beyond the current Python, JavaScript, TypeScript, and Go coverage.
Mnemosyne is open source. Zero dependencies, AGPL-3.0 (commercial licenses available). Works with any LLM or agent framework. The architecture and benchmark data described here are reproducible.
GitHub: castnettech/mnemosyne-engine · PyPI: mnemosyne-engine · PyPI: mnemosyne-mcp
```bash
# Core engine
pip install mnemosyne-engine
mnemosyne init && mnemosyne ingest
mnemosyne query "how does the auth system work?"

# Claude Code integration (MCP)
pip install mnemosyne-mcp
claude mcp add mnemosyne -- mnemosyne-mcp
```
Disclaimer
Comparison data reflects publicly available documentation as of April 2, 2026. Feature sets evolve; this table is a snapshot, not a permanent verdict. Benchmark results reflect specific codebases and queries — performance varies by project size, language, and query complexity. Mnemosyne is provided as-is under the AGPL-3.0 license. See the project LICENSE for full terms.