Capabilities

Where the work tends to land.

We take the code seriously regardless of the domain. These are the areas where we currently do most of our work — but the list grows, and we're always happy to discuss problems that don't fit any of the boxes below.

Web and API systems

Backends, admin tools, operator consoles, public APIs, internal dashboards. Python, TypeScript, Go. Clean architecture, thorough tests, clear deployment paths.

Data pipelines and storage

Ingestion, transformation, search, retrieval. Postgres, SQLite, full-text search, vector stores. Batch and streaming. The boring infrastructure that everything else depends on.

AI and retrieval systems

RAG pipelines, context engines, LLM orchestration, MCP servers, local-model integration. We've built enough of these to know where the sharp edges are.

Document intelligence

PDF ingestion, OCR, layout analysis, structured extraction with provenance. Moving unstructured data into systems that can act on it.

Developer tooling

CLIs, libraries, SDKs, build infrastructure, test harnesses. Some of it open source, most of it private. Tools that other engineers actually want to use.

On-prem and private deployments

Packaging, installation, operations runbooks. When data can't leave the network, we ship software that runs inside it — single container, full appliance, whatever fits.

Selected work

A few projects we can talk about publicly.

Below are writeups of projects where the client and the context let us share the details. They are a representative sample of how we work — not the sum of what we do. Most of our engagements are covered by NDAs and never end up on this page. If you want to hear about work that isn't shown here, ask us directly.

[Animated diagram: the Mnemosyne ecosystem. The core retrieval engine feeds three delivery scenarios (standalone CLI, Claude Code via Model Context Protocol, and Ollama for local LLMs), connected by a shared tool-call loop.]
"The context window is the most expensive resource in any LLM application. Most tools treat it like it's unlimited. We built one that treats every token like it costs money — then made it work everywhere developers actually write code."
Developer Infrastructure — Open Source

A retrieval engine, an MCP server, and a local LLM bridge. Three packages. One idea.

We were building internal AI agents that read codebases and answered questions about them. The agents worked, but the token bills didn't make sense. We traced it to one problem: context waste. Every time an agent needed to understand a module, it consumed entire files, directory listings, and search results — 40 to 70 percent of the tokens contributed nothing to the answer.

So we built mnemosyne-engine: a retrieval engine that indexes a codebase and returns only the chunks that matter, ranked by value-per-token, within a configurable budget. The engine runs a hybrid pipeline — BM25, TF-IDF, symbol matching, usage frequency, and four additional signals — all fused through Reciprocal Rank Fusion and re-ranked by a cost model that penalizes low-information content. A four-stage compression pipeline strips boilerplate while preserving function signatures, control flow, and documentation. We benchmarked it against a real 844-file production codebase: 74–78% fewer tokens, 92–98% faster queries, equivalent answer quality.
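
For a feel of the fusion step, here is the standard Reciprocal Rank Fusion formula as a minimal Python sketch. This is illustrative, not mnemosyne-engine source; the k constant is the conventional default, and the budget cut below is a toy stand-in for what the cost model actually does.

```python
# Illustrative sketch, not mnemosyne-engine source. Standard Reciprocal Rank
# Fusion: each retriever's ranking contributes 1 / (k + rank) per chunk.
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:  # one ranking per signal (BM25, TF-IDF, symbols, ...)
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)

# Toy version of the budget step: keep the fused order greedily until the
# configured token budget is spent (token_counts is a hypothetical lookup).
def within_budget(fused: list[str], token_counts: dict[str, int], budget: int) -> list[str]:
    kept, spent = [], 0
    for chunk_id in fused:
        if spent + token_counts[chunk_id] <= budget:
            kept.append(chunk_id)
            spent += token_counts[chunk_id]
    return kept
```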

Then we built the hands. mnemosyne-mcp is a Model Context Protocol server that plugs the engine straight into Claude Code, Cursor, Zed, and any other MCP-compatible host. mnemosyne-ollama is a lightweight MCP host that runs the engine fully local against any tool-capable Ollama model — Llama, Qwen, Gemma, Mistral, Phi — with one command and zero configuration. Same engine, three delivery vectors, no cloud required in any of them.
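
The "shared tool-call loop" is easy to picture in miniature. The sketch below uses the public ollama Python client (v0.4+, which accepts plain functions as tools); search_codebase is a hypothetical stand-in for the engine, not mnemosyne-ollama's actual interface.

```python
# Hedged sketch of a tool-call loop against Ollama; not mnemosyne-ollama's code.
# `search_codebase` is a hypothetical stand-in for the engine's retrieval call.
import ollama

def search_codebase(query: str) -> str:
    """Stand-in: would return ranked, compressed chunks within a token budget."""
    return "def authenticate(request): ...  # src/middleware/auth.py, lines 12-40"

messages = [{"role": "user", "content": "Where is the auth middleware wired up?"}]
while True:
    response = ollama.chat(model="qwen2.5", messages=messages, tools=[search_codebase])
    messages.append(response.message)
    if not response.message.tool_calls:
        break  # no more tool requests; this message is the final answer
    for call in response.message.tool_calls:
        # Run the requested tool and feed the result back to the model.
        result = search_codebase(**call.function.arguments)
        messages.append({"role": "tool", "name": call.function.name, "content": result})

print(response.message.content)
```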

All three are open source under AGPL-3.0 (commercial licenses available), pure Python, zero runtime dependencies beyond their direct protocol needs. 293 tests in the engine alone. Published with OIDC trusted-publisher attestation on PyPI. No telemetry, no analytics, no phone-home. The developers building the next generation of AI tools shouldn't have to solve context waste from scratch — and they shouldn't have to send their code to a third party to get the benefit.

GitHub → Read the integration post → Read the benchmark →
Healthcare Compliance

Giving chart review teams the clarity they actually need.

Medicare Advantage organizations deal with enormous volumes of clinical documentation. Coding teams work through scanned PDFs — hundreds of pages per chart — looking for diagnosis codes that support risk adjustment. It's meticulous, slow, and the stakes are real. A missed code means lost revenue. An incorrect code means audit risk.

The existing workflow was almost entirely manual. Coders scrolled through poor-quality scans, mentally tracking conditions across dozens of encounter notes. There was no structured way to surface evidence for negated or historically mentioned conditions. And sending any of this data to a cloud service was off the table — protected health information stays on-prem, full stop.

We built an on-premise pipeline that ingests chart PDFs, applies text extraction with OCR fallback, assesses page quality, and runs clinical code detection with explicit negation modeling. Every detected code maps to the relevant risk adjustment category using the organization's selected payment year model. Evidence is extracted with page-level and character-offset provenance — so a reviewer can trace any finding back to exactly where it appears in the original document.

The key design choice was conservatism. When a binding is ambiguous, the system flags it rather than asserting it. Reviewers work from a structured report of candidates and flags, not a blank page. Their time shifts from searching to validating — and every finding is independently auditable.
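
The shape of that conservatism is easy to show. Below is a schematic of what a single finding might carry; the field names are illustrative assumptions, not the production schema.

```python
# Schematic only; the production schema is the client's. A finding either
# binds to a risk adjustment category or carries an explicit reviewer flag,
# and it always keeps enough provenance to trace back to the source page.
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    code: str               # detected clinical code
    category: str | None    # bound risk adjustment category, or None if ambiguous
    negated: bool           # "no evidence of X" is modeled, not silently dropped
    flag: str | None        # why a reviewer must look, required when unbound
    page: int               # page-level provenance
    char_start: int         # character-offset provenance within the page text
    char_end: int

    def __post_init__(self) -> None:
        # Conservative binding: ambiguous means flagged, never asserted.
        assert (self.category is None) == (self.flag is not None)
```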

Design choices, in the clear

Conservative binding (ambiguous = flag, not assertion).
Explicit negation modeling.
Page-level and character-offset provenance.
Synthetic evaluation packs for regression testing.
Operator-configurable confidence thresholds.
On-premise only — zero data egress by default.
Read the full healthcare evidence framework →

[Image: serene morning sky over a calm lake]
"The shift isn't from manual to automated — it's from searching to validating. The reviewer's expertise stays central. The system just makes sure nothing gets buried in a 400-page chart."
[Image: deep teal ocean with gentle waves and sunlight]
"Regulatory change doesn't pause while you're busy building. The question isn't whether something changed — it's whether you noticed in time."
Regulatory Intelligence

Keeping pace with a landscape that never slows down.

Compliance teams at regulated organizations — healthcare operators, financial services firms, technology companies handling personal data — are responsible for tracking a fragmented landscape of federal guidance, state-level privacy laws, and enforcement actions. It's a lot. Sources are scattered, formats vary, and by the time a material change shows up in informal channels, the window for response may already be closing.

Manual monitoring works until it doesn't. Someone misses a Federal Register update. A state AG issues guidance that nobody catches for three weeks. An enforcement action establishes a new precedent that affects your deployment model. The cost of these misses isn't theoretical — it's operational.

We built a system that monitors over 50 federal and state regulatory sources daily. Changes are classified by urgency, filtered to the subscriber's configured industry and state profile, and delivered as structured weekly briefs. Each item includes a plain-language summary, a color-coded action checklist, and the original source reference. Urgent items surface immediately as priority alerts.
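
In miniature, the classification and filtering step looks something like the sketch below. The field names and urgency tiers are assumptions for illustration, not the product's schema.

```python
# Illustrative only; hypothetical field names and urgency tiers.
from dataclasses import dataclass
from enum import Enum

class Urgency(Enum):
    URGENT = "urgent"    # surfaces immediately as a priority alert
    ACTION = "action"    # goes into the weekly brief's action checklist
    MONITOR = "monitor"  # tracked, no response required yet

@dataclass
class RegItem:
    summary: str          # plain-language summary
    source_url: str       # original source reference
    urgency: Urgency
    industries: set[str]  # e.g. {"healthcare", "fintech"}
    states: set[str]      # e.g. {"CA", "CO"}, or {"federal"}

@dataclass
class Profile:
    industries: set[str]
    states: set[str]

def items_for(profile: Profile, items: list[RegItem]) -> list[RegItem]:
    """Keep only items matching the subscriber's configured profile, urgent first."""
    matched = [i for i in items
               if i.industries & profile.industries
               and (i.states & profile.states or "federal" in i.states)]
    return sorted(matched, key=lambda i: i.urgency is not Urgency.URGENT)
```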

The result is that compliance and legal teams work from a structured action list rather than raw documents. High-urgency changes — new conformity requirements, final rules, enforcement guidance — surface immediately. Teams respond faster because the signal-to-noise ratio is dramatically better. And every item has full source attribution, so nothing is taken on faith.

Source attribution on every item.
Urgency classification with explicit criteria.
Operator-configurable industry and state filters.
No assertion without plain-language explanation.
Structured action checklist format that enforces human review before action.

Habits across all our work

Different codebases, one set of habits.

The domains and tech stacks change from project to project, but sit in the room while we build any of them and you will hear the same arguments over and over.

Show your work

Every output points back to where it came from. An extracted finding lands on a page and a line. A retrieval result lands on a file and a range. A ranked decision lands on its scoring inputs. If we can't show it, we don't ship it.

Flag the weird stuff

Ambiguous input gets a flag, not a confident answer. This is unfashionable — modern AI tools love to sound sure — but the cost of a wrong answer in the places we work is much higher than the cost of "please check this one."

Same input, same output

Determinism is a feature. You should be able to rerun the same query tomorrow and get the same result. Non-deterministic pieces (model outputs, ranking randomness) are isolated, labeled, and tested against regression packs.
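
A regression pack, in its simplest form, looks something like this; the paths and the run_pipeline entry point are hypothetical placeholders, not any product's real API.

```python
# Sketch of a regression pack: golden input/output pairs live in version
# control, and the deterministic pipeline is replayed against them in CI.
import json
from pathlib import Path

import pytest

PACK = Path("tests/packs/retrieval")  # hypothetical pack location

def run_pipeline(payload):
    """Stand-in for the real deterministic entry point."""
    return {"echo": payload}

@pytest.mark.parametrize("case", sorted(PACK.glob("*.json")), ids=lambda p: p.stem)
def test_same_input_same_output(case):
    data = json.loads(case.read_text())
    assert run_pipeline(data["input"]) == data["expected"]  # exact match, run after run
```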

Your hardware, your rules

All three products can run fully on customer infrastructure. Mnemosyne is open source, so it's trivially local. CodaFend runs on the hospital's servers. RuleBrief runs wherever the compliance team wants it. Cloud is a deployment choice we don't make for you.

Humans stay in the loop

Every tool we build is decision support, not decision replacement. The coder still codes, the compliance officer still approves, the developer still reads the code. We try to remove the boring parts while keeping the hard judgment where it belongs.

Boring, auditable code

We'd rather write three clear lines than one clever one. We keep test coverage high, dependencies low, and commit histories readable. Most of what we ship is ordinary software with a bit of ML attached — and we think that's the right ratio.

"There's more work behind us than shows up on this page. If you have a project that sounds like any of the above — or nothing like any of the above — we'd like to hear about it."
— The Cast Net Technology team