[WIP] Zero-Leak AI: An Email Advisor That Never Phones Home
Documenting the work on an on-premise RAG system using LEANN, Ollama, and hybrid search — where no email ever leaves the building.
Work in progress — documenting the build.
Choosing a vector database#
The two hard constraints were: everything runs on-premise, and the hardware is modest — a Proxmox VM with no discrete GPU, just integrated graphics. Cloud-hosted vector databases were out by definition. That left self-hosted options.
I looked at a few, ChromaDB being the most serious contender. It’s well-documented, easy to set up, and the go-to recommendation for self-hosted RAG. But the more I dug into resource requirements, the more I worried about disk usage. A friend had roughly 38,000 emails spanning years of operations. Traditional vector databases pre-compute and store every embedding — and those embeddings add up. On a VM where disk, RAM, and CPU are all shared with other workloads, I wanted something leaner.
Then I stumbled onto LEANN.
LEANN#
LEANN takes a fundamentally different approach to vector storage. Instead of pre-computing and storing every embedding, it builds a graph structure and computes embeddings on-demand during search. It calls this “graph-based selective recomputation.” The trade-off is latency — searches take ~500ms instead of ~10ms — but the storage savings are dramatic. LEANN’s own benchmarks claim 60 million text chunks fit in 6GB instead of 201GB with a traditional approach.
What sold me wasn’t just the disk savings — it was that the whole design seemed built for exactly the kind of low-grade hardware I was working with. No GPU required. CPU-only by design. Modest memory footprint. My friend’s Proxmox setup had no GPU passthrough, and I wasn’t about to propose a hardware upgrade for what was essentially a knowledge retrieval tool.
I won’t pretend I ran a rigorous benchmark between ChromaDB and LEANN on this specific dataset. The decision was more pragmatic than that: LEANN had a novel approach to the exact constraints I was dealing with, and it was worth a shot. If it didn’t work out, ChromaDB was the fallback.
The architecture#
Here’s what the production system looks like:
┌─────────────────────────────────────┐
│  Email Sources                      │
│   • Office 365 via IMAP             │
│   • Outlook PST archives            │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│  Ubuntu 24.04 VM (Proxmox)              │
│                                         │
│  Host Tools (Python)                    │
│   • IMAP fetcher (live emails)          │
│   • PST loader (historical archive)     │
│   • FTS5 index builder (keywords)       │
│                                         │
│  LEANN (installed directly)             │
│   • intfloat/multilingual-e5-base       │
│   • Graph-based vector index            │
│   • On-demand embedding computation     │
│                                         │
│  Ollama (installed directly)            │
│   • Mistral 7B                          │
│   • Swedish-language response gen       │
│                                         │
│  Systemd Services                       │
│   • email-watcher (polls every 2min)    │
│   • telegram-bot (interactive Q&A)      │
│                                         │
│  Deployed & managed via Ansible         │
└─────────────────────────────────────────┘
Everything runs directly on the VM — no Docker in production. (Docker Compose exists in the repo for local development, but the production deployment installs LEANN and Ollama natively via Ansible playbooks and runs them as systemd services.) The advisor calls LEANN’s CLI via subprocess for searches, though LEANN has since added leann serve — a FastAPI-based HTTP server — as a cleaner integration path.
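For reference, the subprocess integration is roughly this shape (a minimal sketch; the exact leann CLI arguments are an assumption and should be checked against your installed version):

```python
import subprocess

def leann_search(index_name: str, query: str, top_k: int = 15) -> str:
    """Run a LEANN search through the CLI and return its raw stdout.

    The `leann search <index> <query> --top-k N` shape is an assumption;
    adjust the flags (e.g. --embedding-prompt-template) to match your setup.
    """
    cmd = ["leann", "search", index_name, query, "--top-k", str(top_k)]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True, timeout=60)
    return result.stdout
```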
The flow is straightforward: emails come in, get indexed both as vectors (LEANN) and keywords (SQLite FTS5), and when a staff member asks a question — either through Telegram or triggered automatically by a new incoming email — the system retrieves relevant historical emails and generates advice using a local LLM.
Every component runs on their hardware. The only external network call is to the Telegram API, and that’s sending the advice, not the source emails.
Ingestion: teaching the system to read email#
The first challenge was getting 38,000 emails into a format LEANN could work with. Emails came from two sources: live IMAP (Office 365) and historical Outlook PST archives.
PST parsing#
TODO: Detail the PST ingestion — library used, edge cases encountered, how historical archives were handled differently from live IMAP.
Email normalization#
Both get normalized into the same format:
[EMAIL | From: user@example.com | Subject: Issue with order #4821 | Date: 2024-03-15]
Hi,
We placed an order last week but haven't received a confirmation.
Can you look into it?
That metadata prefix turned out to be critical — more on that later.
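For illustration, building that normalized text from parsed fields is trivial (a sketch; the Email dataclass is hypothetical and just mirrors the fields shown above):

```python
from dataclasses import dataclass

@dataclass
class Email:
    sender: str
    subject: str
    date: str   # ISO date, e.g. "2024-03-15"
    body: str

def normalize(email: Email) -> str:
    """Render an email as flat text with the metadata header used for indexing."""
    header = f"[EMAIL | From: {email.sender} | Subject: {email.subject} | Date: {email.date}]"
    return f"{header}\n{email.body.strip()}"
```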
The E5 prefix trap#
The embedding model I chose, intfloat/multilingual-e5-base, has excellent Swedish support. But it has a quirk that cost me a day of debugging: it requires specific prefixes on input text. Documents need "passage: " prepended, and queries need "query: ". Without these prefixes, retrieval quality silently degrades. No errors, just bad results.
LEANN handles this through CLI flags (--embedding-prompt-template "query: "), but it’s the kind of thing that makes you question your sanity before you find the answer.
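The same rule applies if you ever embed with the model directly instead of through LEANN. A minimal sketch using sentence-transformers (the example texts are made up):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")

# E5 models expect role prefixes: "passage: " for documents, "query: " for queries.
# Leaving them out produces no error, just quietly worse retrieval.
documents = [
    "[EMAIL | From: user@example.com | Subject: Issue with order #4821 | Date: 2024-03-15]\n"
    "We placed an order last week but haven't received a confirmation."
]
doc_vecs = model.encode(["passage: " + d for d in documents], normalize_embeddings=True)
query_vec = model.encode("query: kund har inte fått orderbekräftelse", normalize_embeddings=True)
```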
Search: from 10% to 100% accuracy#
Baseline: pure vector search (10% accuracy)#
The initial setup used LEANN’s vector search exclusively. On my test suite of 10 queries covering common scenarios, it found the right reference email once out of ten attempts.
The failure mode was telling: a query for a specific domain term that appeared literally in email subjects returned zero results. Vector search was looking for semantic similarity but missing exact keyword matches. The embedding space was too fuzzy for domain-specific terminology.
Adding BM25: hybrid search (80% accuracy)#
The fix was adding a parallel keyword search using SQLite FTS5 with BM25 scoring. Fast (~10ms), reliable for exact terms, and trivial to set up alongside the existing SQLite database.
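The keyword side is plain SQLite (a sketch; the table and column names are illustrative, not the project's actual schema):

```python
import sqlite3

conn = sqlite3.connect("data/emails.db")
conn.execute(
    "CREATE VIRTUAL TABLE IF NOT EXISTS emails_fts USING fts5(sender, subject, body)"
)

def keyword_search(match_query: str, limit: int = 15):
    """BM25-ranked keyword search; FTS5's `rank` puts the best matches first."""
    return conn.execute(
        "SELECT rowid, subject, rank FROM emails_fts "
        "WHERE emails_fts MATCH ? ORDER BY rank LIMIT ?",
        (match_query, limit),
    ).fetchall()
```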
I combined the two result sets using Reciprocal Rank Fusion (RRF):
def reciprocal_rank_fusion(bm25_results, vector_results, k=60):
    scores = {}
    for rank, doc in enumerate(bm25_results):
        scores[doc.id] = scores.get(doc.id, 0) + 2.0 / (k + rank + 1)  # 2x weight
    for rank, doc in enumerate(vector_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1.0 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
BM25 gets double weight because for this domain, keyword matches are more reliable signals than semantic similarity. A user searching for a specific term plus a name wants emails containing those exact words.
One subtlety: stopword filtering matters. Without it, common words dilute the keyword signal. I built a stopword list and excluded those terms from the FTS5 index.
Another: FTS5’s default AND matching is too strict. "term1 AND term2 AND term3" finds nothing if those terms aren’t in the same document. Switching to OR-based queries with relevance ranking fixed this.
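Putting both fixes together, the string handed to FTS5's MATCH ends up looking something like this (a sketch; here the stopwords are filtered at query time and the list is a small stand-in):

```python
import re

STOPWORDS = {"och", "att", "det", "som", "en", "the", "and", "for", "with"}  # stand-in list

def build_match_query(user_question: str) -> str:
    """Turn a free-text question into an OR-joined FTS5 MATCH expression."""
    terms = [t for t in re.findall(r"\w+", user_question.lower()) if t not in STOPWORDS]
    # OR instead of FTS5's implicit AND: any matching term contributes a hit,
    # and BM25 ranking decides which results are actually relevant.
    return " OR ".join(terms)

# build_match_query("Vad hände med order 4821 från Acme?")
# -> "vad OR hände OR med OR order OR 4821 OR från OR acme"
```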
The metadata trick (100% accuracy)#
The jump from 80% to 100% came from something I initially dismissed as cosmetic: injecting the email metadata (sender, subject, date) directly into the document text before embedding.
[EMAIL | Score: 0.87 | From: Support Desk | Subject: Re: Issue with order #4821 | Date: 2024-03-15]
This does two things: it gives the vector model more signal to work with (subject lines are information-dense), and it gives the LLM source attribution for free in the context window. The improvement was immediate.
| Metric | Baseline | +Hybrid | +Metadata |
|---|---|---|---|
| Found reference | 10% | 80% | 100% |
| Partial match | 30% | 20% | 0% |
| Miss | 60% | 0% | 0% |
Note: Since building this, LEANN has added native BM25 hybrid search on main (not yet in a tagged release as of v0.3.6). If you’re starting fresh, you may not need to roll your own — the approach described above is now available as a built-in feature.
The advisor: turning retrieval into advice#
Retrieval is only half the job. The other half is generating useful advice in Swedish from the retrieved emails.
The system uses Mistral 7B running through Ollama — entirely local. The system prompt enforces a specific structure:
- Summary of the incoming issue
- Historical handling — how similar issues were resolved before
- Sources — which specific emails informed the advice
TODO: Show the actual system prompt.
Temperature is set to 0.3 — low enough for consistent, focused responses but not so low that it becomes robotic.
The system responds in the same language as the query. If staff write in Swedish, they get Swedish back. If someone writes in English, the response comes in English. This is handled purely through the system prompt, no language detection libraries needed.
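The generation call itself is one request to the local Ollama API (a sketch; the system prompt below is a simplified stand-in for the real one, which is still a TODO above):

```python
import requests

SYSTEM_PROMPT = (
    "Du är en rådgivare. Svara på samma språk som frågan. "
    "Strukturera svaret som: Sammanfattning, Historisk hantering, Källor."
)

def generate_advice(question: str, context: str) -> str:
    """Ask the local Mistral 7B (via Ollama) for structured advice."""
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "mistral",
            "stream": False,
            "options": {"temperature": 0.3},
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": f"{context}\n\nFråga: {question}"},
            ],
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["message"]["content"]
```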
Over-retrieval#
I deliberately over-retrieve: top_k=15 with a relevance threshold of 0.50. The LLM filters the noise. A few false positives in the context window is less damaging than missing the one email that actually matters.
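In code it is just a filter on the merged results before they are rendered into the context window (a sketch; the hit fields and score semantics here are illustrative):

```python
TOP_K = 15
RELEVANCE_THRESHOLD = 0.50

def select_context(hits: list[dict]) -> str:
    """Keep generously many hits above the threshold; the LLM sorts out the noise."""
    kept = [h for h in hits if h["score"] >= RELEVANCE_THRESHOLD][:TOP_K]
    return "\n\n".join(
        f"[EMAIL | Score: {h['score']:.2f} | From: {h['sender']} "
        f"| Subject: {h['subject']} | Date: {h['date']}]\n{h['body']}"
        for h in kept
    )
```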
For queries that need multiple retrieval rounds, LEANN has since added a leann react command — a ReAct agent that iteratively searches and reasons before answering.
The Telegram bot#
TODO: How does the bot work? What does a conversation look like? Show the UX — a sample exchange between a staff member and the bot.
Deduplication: don’t process the same email twice#
A subtle but important detail: the email watcher polls the inbox every two minutes, so it needs to remember which emails it has already processed. It tracks IMAP UIDs between polling cycles, ensuring each incoming email triggers the advisor exactly once. Without this, every poll would re-advise on the same inbox contents — a quick way to fill up the Telegram chat with duplicate recommendations.
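A minimal version of that bookkeeping (a sketch; host, mailbox, and the state-file path are placeholders):

```python
import imaplib
import json
from pathlib import Path

STATE_FILE = Path("data/seen_uids.json")  # placeholder path

def fetch_new_uids(host: str, user: str, password: str) -> list[bytes]:
    """Return only the IMAP UIDs not seen in earlier polling cycles."""
    seen = set(json.loads(STATE_FILE.read_text())) if STATE_FILE.exists() else set()

    imap = imaplib.IMAP4_SSL(host)
    imap.login(user, password)
    imap.select("INBOX")
    _, data = imap.uid("SEARCH", None, "ALL")
    imap.logout()

    uids = data[0].split()
    new = [u for u in uids if u.decode() not in seen]

    # Persist the full set so the next poll skips everything processed so far.
    STATE_FILE.write_text(json.dumps(sorted(seen | {u.decode() for u in uids})))
    return new
```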
Production deployment#
The whole system runs on a Proxmox VM:
- Ubuntu 24.04 LTS
- 4-8 vCPU (adjustable)
- 8-16GB RAM
- 100-200GB disk
- Two systemd services: email watcher and Telegram bot
- Deployed and managed via Ansible
TODO: Detail the Ansible playbook structure — how services are configured, how deploys work.
The OpenMP headache#
LEANN v0.3.6 has an OpenMP threading bug on multi-core Linux that causes hangs. The workaround is blunt but effective:
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export FAISS_NUM_THREADS=1
export OMP_WAIT_POLICY=PASSIVE
These are set in the Ansible playbooks, systemd unit files, and the Docker Compose setup used for local dev. Every entry point. Because you only need to forget it in one place to spend a fun afternoon debugging why nothing responds.
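In the systemd units that looks like this (a trimmed, illustrative excerpt; the service name and paths are assumptions):

```ini
# /etc/systemd/system/email-watcher.service (excerpt)
[Service]
Environment="OMP_NUM_THREADS=1"
Environment="MKL_NUM_THREADS=1"
Environment="FAISS_NUM_THREADS=1"
Environment="OMP_WAIT_POLICY=PASSIVE"
ExecStart=/opt/email-advisor/venv/bin/python -m email_watcher
Restart=on-failure
```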
Note: This has been fixed upstream on main (post-v0.3.6). LEANN now auto-sets these thread limits on Linux when the user hasn’t configured them. If you’re running a newer build, the manual workaround above is no longer needed.
Index build times#
- FTS5 keyword index: ~3 minutes for 38,000 emails
- LEANN vector index: Several hours for the same set (CPU-bound)
The vector index is a one-time cost per data refresh. Incremental updates aren’t supported yet, so a full rebuild is needed when new email batches are added. Not ideal, but acceptable for a system that processes historical data in bulk.
Data privacy#
- No external API calls with email content. The embedding model runs locally via LEANN. The LLM runs locally via Ollama. There is no moment where email text is sent to a remote server.
- On-premise storage only. Emails live in data/emails/, the vector index in data/index/, the keyword index in data/emails.db. All on the VM’s local disk.
- Git isolation. The .gitignore excludes all data directories, credentials, and indexes. The repository contains code, not data.
- Credential separation. IMAP passwords in .env, SSH keys in .credentials/, Proxmox credentials in Ansible vault. None in source code.
- Controlled output channel. The only data that leaves the system is advisor recommendations sent via Telegram — never the source emails themselves. And Telegram is optional.
What I’d do differently#
Incremental indexing. Rebuilding the entire LEANN index for every batch of new emails is wasteful. A proper incremental pipeline would make the system more practical for daily email processing. LEANN has since added Merkle tree-based file change detection on main — not full incremental re-indexing yet, but the foundation for knowing which files actually changed between builds.
Better evaluation. My test suite has 10 synthetic queries covering 5 scenarios. That’s enough to validate the approach but not enough for real confidence. A proper evaluation framework with actual staff feedback would be the next step.
TODO: Show the test scenarios and evaluation script.
Streaming responses. The 6-16 second query latency is dominated by LLM inference on CPU. Streaming the response token-by-token would make it feel faster even if wall-clock time stays the same.
Model upgrades. Mistral 7B works well for Swedish, but the local LLM landscape moves fast. A newer model with better instruction following could improve advice quality without changing the architecture.
LEANN since this build#
This project was built against LEANN v0.3.6. Since then, 49 commits have landed on main (no tagged release yet) that are worth knowing about if you’re considering a similar setup:
- Native hybrid search — BM25 + vector fusion is now built in
- HTTP API server — leann serve exposes search over REST, replacing the need for subprocess calls
- ReAct agent — multi-turn retrieval with iterative reasoning
- Merkle tree file syncing — change detection for smarter rebuilds
- Linux OpenMP fix — the threading workaround is now automatic
- Comprehensive ingestion — extractors for Word, Excel, PowerPoint, mindmaps, and a multi-layer PDF fallback chain
- Gemini & Qwen Code readers — ingest AI chat history into the RAG pipeline
Several of these address pain points described in this article. The project is actively maintained and moving fast — check the LEANN repo for the latest.