The Problem We Set Out to Solve

LLM coding agents on large repositories burn 40–70% of their tokens on irrelevant context. We measured this across our own production codebases — FastAPI backends, React frontends, Celery workers, multi-layer processing pipelines — and the pattern was consistent. The agent spends most of its budget navigating, not thinking.

The standard approaches each have a fundamental limitation:

  • Grep / ripgrep returns keyword matches with no understanding of relevance. A search for validate in a large codebase returns hundreds of results — mostly noise. The agent burns tokens evaluating and discarding irrelevant matches.
  • Vector-only search (dense embeddings) captures semantic similarity but misses exact keyword matches. If you search for a specific function name, embeddings may rank a semantically similar but unrelated function higher than the exact match.
  • AST-only search finds structural elements — function definitions, class hierarchies — but loses semantic meaning. It cannot answer "how does the authentication flow work?" because that question spans documentation, configuration, and implementation code.

No single signal type is sufficient. We needed a system that combined all of them — keyword relevance, statistical importance, structural awareness, usage patterns, predictive prefetch, and semantic similarity — into a single retrieval call that returns ranked, compressed, token-budget-aware context.

That system is Mnemosyne.

Timeline: What We Published and When

Every claim below is verifiable through immutable timestamps on PyPI, GitHub, and DEV.to. We are documenting the timeline here because prior art matters — especially in a space where multiple teams are building similar tools simultaneously.

Date | Milestone | Verifiable At
March 29, 2026 | First PyPI release (mnemosyne-engine v0.3.0). GitHub repository created at github.com/castnettech/mnemosyne. | PyPI release history
March 30, 2026 | v1.0.0 stable release on PyPI: production-ready, full six-signal pipeline, AST-aware compression, configurable token budgets. | PyPI v1.0.0
March 31, 2026 | Public technical writeup on DEV.to documenting 73% token reduction on an 829-file production codebase. | DEV.to article
April 2, 2026 | v1.0.4 with verified publishing. Head-to-head benchmark published showing 2.4x speed improvement and 5.6x fewer tokens vs standard retrieval. | PyPI v1.0.4

All release timestamps are immutable on PyPI and verifiable by anyone. The GitHub commit history, PyPI upload timestamps, and DEV.to publication dates form a complete, independently auditable record of when each component of this architecture was published.

The Six-Signal Architecture

This is the core technical innovation in Mnemosyne. Rather than relying on a single retrieval method, the engine runs six independent signal extractors on every query and fuses the results into a single ranked list.

The six signals

  1. BM25 full-text search — Keyword relevance with term frequency normalization and document length adjustment. BM25 is the standard in information retrieval for a reason: it handles exact keyword matches with well-understood statistical properties. When a developer searches for parse_config, BM25 finds it. (A minimal scoring sketch follows this list.)
  2. TF-IDF scoring — Statistical importance across the corpus. A term that appears in 3 out of 800 files is more informative than one that appears in 400. TF-IDF surfaces the distinctive terms — the ones that actually differentiate one module from another.
  3. Symbol name matching — Direct function, class, and method name search. This is the fastest path to structural elements. When you need AuthenticationMiddleware, symbol search finds the definition immediately, regardless of how it is described in comments or documentation.
  4. Usage frequency — How often a symbol is referenced across the codebase. A function called from 47 locations is more architecturally significant than one called from 2. Usage frequency surfaces the code that actually matters to the system — the load-bearing functions, the shared utilities, the central abstractions.
  5. Predictive prefetch — Anticipating what the agent will need next based on query patterns and co-occurrence statistics. If a developer asks about the validation pipeline, they will likely need the error handling and the data model next. Prefetch reduces round-trips by including probable follow-up context in the first retrieval.
  6. Optional dense embeddings — Vector similarity for semantic queries. When the question is conceptual rather than keyword-specific — "how does error recovery work?" — embeddings capture semantic relationships that keyword-based methods miss. This signal is optional because the other five handle the majority of queries without requiring an embedding model.
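To make the first two signals concrete, here is a minimal, self-contained Okapi BM25 scorer over a toy corpus. This is the textbook formula with common defaults (k1 = 1.5, b = 0.75), not Mnemosyne's internal implementation; the corpus and function names are illustrative only.

import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against query_terms with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency: how many documents contain each query term.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores

# The rare identifier dominates: only the first document scores above zero.
docs = [
    "def parse_config path return yaml safe_load path".split(),
    "def main run the app loop".split(),
    "config docs mention config config config".split(),
]
print(bm25_scores(["parse_config"], docs))

The document-frequency term inside the IDF is the same intuition behind signal 2: a token that appears in one file out of three (or 3 out of 800) carries far more information than one that appears everywhere.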

Reciprocal Rank Fusion

The six signals are fused through Reciprocal Rank Fusion (RRF) — a well-studied technique from the information retrieval research literature (Cormack, Clarke, & Büttcher, 2009). RRF combines ranked lists from multiple retrieval systems without requiring score calibration. Each system produces a ranked list; RRF assigns a score based on rank position, then merges the lists.

This is not a simple weighted average of raw scores. Raw scores from different retrieval methods are not comparable — a BM25 score of 4.2 and an embedding cosine similarity of 0.87 exist on entirely different scales. Weighted averaging of incompatible scores produces unreliable results. RRF avoids this problem by operating on ranks, not scores. A document ranked #1 by BM25 and #3 by embeddings receives a predictable, calibration-free combined rank.
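The formula is small enough to show in full: each signal contributes 1 / (k + rank) for every document it ranks, with k = 60 as in the original paper. A minimal sketch follows; the function and variable names are ours, not Mnemosyne's API.

from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Merge ranked lists of doc IDs with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) for every document it ranks;
    documents absent from a list contribute nothing for that list.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["auth.py", "utils.py", "config.py"]
symbol_hits = ["auth.py", "models.py"]
embed_hits  = ["models.py", "auth.py", "docs.md"]
print(rrf_fuse([bm25_hits, symbol_hits, embed_hits]))
# ['auth.py', 'models.py', 'utils.py', ...]: auth.py ranks high in every list

Because only rank positions enter the sum, a new signal can be added or dropped without recalibrating anything, which is part of what makes a six-signal pipeline practical.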

The principled fusion of multiple retrieval signals is what separates Mnemosyne from tools that bolt a single embedding index onto a code parser. Each signal compensates for the others' blind spots. BM25 catches exact keywords that embeddings miss. Symbol search finds the function you named. Usage frequency surfaces the code that actually matters. Prefetch reduces round-trips. Together, they cover the full retrieval surface.

AST-aware compression

After ranking, Mnemosyne applies AST-aware compression to the selected chunks. The compression pipeline strips boilerplate — import blocks, blank lines, redundant comments, getter/setter patterns — while preserving the elements that carry information:

  • Function and method signatures (name, parameters, return types)
  • Control flow structure (conditionals, loops, exception handling)
  • Documentation strings and inline comments that explain why, not what
  • Class hierarchy and interface definitions

The result stays within a configurable token budget. The developer (or the agent framework) specifies how many tokens are available, and Mnemosyne fills that budget with the highest-value content it can find — ranked, compressed, and ready for the LLM context window.
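As a sketch of the idea (not Mnemosyne's actual compression pipeline), Python's standard ast module is enough to keep signatures and docstring first lines while dropping function bodies, then cut the result to a budget. The four-characters-per-token estimate and all names here are our assumptions, and ast.unparse requires Python 3.9+.

import ast

def compress_python(source: str, token_budget: int = 200) -> str:
    """Keep signatures and docstrings, drop bodies, stay under a token budget."""
    kept = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            ret = f" -> {ast.unparse(node.returns)}" if node.returns else ""
            kept.append(f"def {node.name}({ast.unparse(node.args)}){ret}:")
        elif isinstance(node, ast.ClassDef):
            kept.append(f"class {node.name}:")
        else:
            continue
        doc = ast.get_docstring(node)
        if doc:
            kept.append(f'    """{doc.splitlines()[0]}"""')
    out, used = [], 0
    for line in kept:
        cost = max(1, len(line) // 4)  # crude chars-per-token assumption
        if used + cost > token_budget:
            break
        out.append(line)
        used += cost
    return "\n".join(out)

sample = '''
class RateLimiter:
    """Token-bucket rate limiter."""
    def check(self, key: str) -> bool:
        """Return True if the call is allowed."""
        return self._buckets[key].take()
'''
print(compress_python(sample, token_budget=100))

A real pipeline also has to preserve control flow and cover more languages, but the shape is the same: parse, keep the information-bearing nodes, and spend the budget top-down on the highest-ranked material.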

How This Compares

Multiple teams are building LLM code retrieval tools. The following table compares publicly documented features as of April 2026. Where documentation is unclear or unavailable, that is noted.

Feature | Mnemosyne | cocoindex-code | CodeGrok MCP | claude-context (Zilliz) | Augment Context Engine
Retrieval signals | 6 (BM25 + TF-IDF + symbol + usage + prefetch + embeddings) | 2 (AST + embeddings) | 2 (AST + embeddings) | 1 (embeddings) | Unknown (proprietary)
Fusion method | Reciprocal Rank Fusion | None documented | None documented | None documented | Proprietary
AST-aware compression | Yes (preserves signatures, control flow) | Partial (tree-sitter chunks) | Partial (tree-sitter chunks) | No | Unknown
Token budget control | Configurable per-query | No | No | No | No
Dependencies | Zero (pure Python + SQLite) | Requires sentence-transformers | Requires ChromaDB + nomic-ai model | Requires Milvus/Zilliz | Requires Augment account
Runs offline | Yes | Yes (after model download) | Yes (after model download) | No (cloud service) | No (cloud service)
License | AGPL-3.0 (commercial licenses available) | Apache-2.0 | MIT | Apache-2.0 | Proprietary
First public release | March 29, 2026 | March 12, 2026 | January 2026 | ~2026 | February 2026

Feature data sourced from public documentation and repository READMEs as of April 2, 2026. "None documented" means we found no public description of a fusion method — the tool may use one internally. We welcome corrections.

Why Architecture Matters More Than Hype

Single-signal retrieval systems have fundamental blind spots. An embedding-only system will always struggle with exact keyword matches. An AST-only system will always miss semantic context. A grep-based system will always return noise proportional to codebase size. These are not implementation bugs — they are architectural limitations.

Consider a concrete example. A developer asks: "How does the rate limiter work?" In a large codebase:

  • BM25 finds every file containing "rate limiter" or "rate limit" — including configuration files, test fixtures, changelog entries, and the actual implementation.
  • TF-IDF identifies which of those files are distinctively about rate limiting versus merely mentioning it. The implementation file scores high; the changelog entry scores low.
  • Symbol search finds RateLimiter, check_rate_limit(), RateLimitMiddleware — the structural definitions, regardless of how they are described in prose.
  • Usage frequency reveals that check_rate_limit() is called from 23 endpoints — making it architecturally central — while _reset_bucket() is an internal helper called from one place.
  • Predictive prefetch includes the Redis configuration and the middleware registration, because rate limiting queries are statistically followed by questions about storage backend and integration points (this co-occurrence mechanism is sketched below).
  • Embeddings (if enabled) catch conceptually related code that uses different terminology — "throttle," "backpressure," "circuit breaker."

No single signal produces that complete picture. The combination does.
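Of the six signals, predictive prefetch is the least standard, so here is one way its co-occurrence statistics could work in principle. This is a hypothetical illustration: the class, the method names, and the update rule are ours, not Mnemosyne's documented internals.

from collections import Counter, defaultdict
from itertools import combinations

class PrefetchModel:
    """Learn which chunks get used together, then suggest likely follow-ups."""

    def __init__(self):
        # chunk -> Counter of chunks retrieved in the same session
        self.cooccur = defaultdict(Counter)

    def observe_session(self, retrieved_chunks):
        """Record that these chunks were all used within one agent session."""
        for a, b in combinations(set(retrieved_chunks), 2):
            self.cooccur[a][b] += 1
            self.cooccur[b][a] += 1

    def prefetch(self, hits, top_n=3):
        """Given the chunks a query matched, vote for probable next chunks."""
        votes = Counter()
        for chunk in hits:
            votes.update(self.cooccur[chunk])
        for chunk in hits:  # never re-suggest what the query already found
            votes.pop(chunk, None)
        return [chunk for chunk, _ in votes.most_common(top_n)]

model = PrefetchModel()
model.observe_session(["rate_limiter.py", "redis_config.py", "middleware.py"])
model.observe_session(["rate_limiter.py", "redis_config.py"])
print(model.prefetch(["rate_limiter.py"]))  # ['redis_config.py', 'middleware.py']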

We didn't build Mnemosyne to win a benchmark. We built it because our own agents were burning tokens. The architecture exists because we needed every signal type to get retrieval right.

The benchmark results — 73% token reduction, 2.4x speed improvement, 5.6x fewer tokens in head-to-head testing — are a consequence of the architecture, not the goal. A system that retrieves the right content on the first call, compressed to fit the token budget, will always outperform a system that navigates iteratively through trial and error.

What's Next

  • MCP server wrapper (live now): pip install mnemosyne-mcp, then claude mcp add mnemosyne -- mnemosyne-mcp to integrate directly with Claude Code and other MCP-compatible agent frameworks. Zero config, stdio transport, lazy engine initialization. Published on PyPI as mnemosyne-mcp.
  • Head-to-head benchmarks against competing tools on standardized codebases — reproducible, published methodology, shared test fixtures.
  • Language coverage expansion — deeper AST support for additional languages beyond the current Python, JavaScript, TypeScript, and Go coverage.

Mnemosyne is open source. Zero dependencies, AGPL-3.0 (commercial licenses available). Works with any LLM or agent framework. The architecture and benchmark data described here are reproducible.

GitHub: castnettech/mnemosyne-engine
PyPI: mnemosyne-engine
PyPI: mnemosyne-mcp

# Core engine
pip install mnemosyne-engine
mnemosyne init && mnemosyne ingest
mnemosyne query "how does the auth system work?"

# Claude Code integration (MCP)
pip install mnemosyne-mcp
claude mcp add mnemosyne -- mnemosyne-mcp

Disclaimer

Comparison data reflects publicly available documentation as of April 2, 2026. Feature sets evolve; this table is a snapshot, not a permanent verdict. Benchmark results reflect specific codebases and queries — performance varies by project size, language, and query complexity. Mnemosyne is provided as-is under the AGPL-3.0 license. See the project LICENSE for full terms.