The Architecture of Local-First AI Memory: No Cloud, No Keys, No Read-Time LLMs

A field guide to the storage layer, the write path, and the retrieval design behind a memory that lives in one SQLite file on your disk.

TL;DR

AI coding agents forget everything between sessions. The common fix is a hosted memory service: your project history goes to someone else's cloud, gets embedded by their model, and comes back through an API key you pay for per token.

I built the opposite. PMB keeps an agent's memory in one SQLite file on your machine, with vectors in a LanceDB file next to it. No account, no key, no network call from the engine.

The whole design falls out of one rule I refused to break: no LLM call on the read path. Reads happen constantly, so they have to be free, local, and fast enough to be invisible.

Result: reads land in tens of milliseconds (35 ms p50 warm), writes return in under a millisecond, 94.5% LoCoMo recall@10, zero telemetry, Apache 2.0.

This is a follow-up to my first post on how I built it. That one was about the techniques and the numbers. This one is about the architecture: how the pieces fit, and why each one is shaped the way it is.

The one decision that shaped everything

An AI agent is brilliant for exactly one session. Close the window and it forgets the bug you fixed on Tuesday, the lesson you taught it about your pricing rules, the fact that you moved to a new city last month. Next session it guesses again, and you re-explain.

The popular answer is a memory product in the cloud. It works, but it puts your project history on a server you do not own, and it puts a model call in the middle of every read. I wanted memory that behaved like a local index: something you copy with cp, that runs with no key, that never phones home.

So I started from a single constraint and let it dictate the rest:

No LLM call on the read path.

Memory gets read all the time. Every message can trigger a lookup, sometimes several per turn. If a read costs a model call, four bad things happen at once.
It gets slow (hundreds of milliseconds, not tens). It costs money on every keystroke-sized interaction. It leaks your context to a third party. And it becomes non-deterministic, so the same query can return different memories on different days.

Reads had to be free, offline, and quick enough that nobody notices them. That rule cascades into every layer below.

Storage: two files, and both of them are yours

Every event PMB stores lives as a row in SQLite. That is the source of truth: facts, decisions, lessons, goals, indexed code symbols, PDF chunks, the lot. SQLite gives you transactions, full-text search, and a single portable file, with no server to run.

Vectors live next to it in a LanceDB file. When the engine needs semantic similarity it queries LanceDB; when it needs the canonical record it reads SQLite. Splitting them keeps each store doing the one thing it is good at, and keeps the hot path off the embedding store when a lexical match is enough.

Both are plain files on disk under ~/.pmb/<workspace>/. You can copy a workspace to a USB drive, push it to git, or drop it in Dropbox. Two agents can point at the same workspace: SQLite runs in WAL mode with a busy-timeout set automatically, so Claude Code and Cursor can read and write the same memory without stepping on each other.

There is no migration story to a cloud later, because there is no cloud. The portability is the architecture, not a feature bolted on top.

The write path returns before the work is finished

Here is the tension. Embedding a piece of text means loading a roughly 90 MB model and running it. That is tens of milliseconds at best, and you cannot make an agent wait on it mid-turn. But the agent also should not lose the write if it moves on.
The answer is an asynchronous write path with a durable hand-off:

    record_batch(items=[...])        returns in < 1 ms
            |
            v
    SQLite INSERT                    synchronous, durable, the source of truth
            |
            v
    embed queue (SQLite-backed)      survives process death
            |
            v
    background thread: embed -> LanceDB upsert

The MCP tool writes the row to SQLite and returns immediately, in under a millisecond. The expensive part (embedding plus the vector insert) happens later on a background thread. The queue itself is backed by SQLite rather than living only in memory, so if the process dies between the row write and the embedding, the item is still there to be picked up. The record is never lost, and the agent never blocks.

This is the write-side mirror of the read-path rule: keep the model off the critical path, push the slow work to where nobody is waiting on it.

The read path: four signals, fused, zero model calls

This is where the "no LLM on read" rule earns its keep. A single retrieval method is never enough for memory:

- Lexical search nails exact tokens (a function name, an error string, a ticket id) but misses paraphrases.
- Dense vectors catch meaning ("where do I live" finding user.city = Tampa) but drift on rare literals.
- The entity graph knows that a file, a project, and a person are connected even when no single query term joins them.

So PMB runs them together and fuses the results, with no language model anywhere in the loop:

    recall("the pricing bug from last Tuesday")
            |
            +-- BM25 lexical              (SQLite full-text)
            +-- dense vector              (LanceDB, cosine similarity)
            +-- entity-graph diffusion    (Personalized PageRank, gated by query intent)
            +-- optional cross-encoder rerank
            |
            v
    Reciprocal Rank Fusion + importance weight + recency decay
            |
            v
    top_k results                    35 ms p50 warm, no API call

Each method produces a ranked list. Reciprocal Rank Fusion blends those lists into one, which is robust precisely because it cares about an item's rank in each method rather than trying to make incompatible scores comparable.
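To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion in its textbook form. It is not PMB's actual code: the function name, the memory ids, and the constant k = 60 (a commonly used default for RRF) are assumptions for illustration, and the importance and recency adjustments are left out.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of ids into one list, best first.

    Each list contributes 1 / (k + rank) for every id it contains, so only
    the position in each list matters, never the raw score a method produced.
    k = 60 is an assumed, conventional default, not a value taken from PMB.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so output is deterministic.
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


# Hypothetical ranked results from three retrieval methods.
bm25_hits = ["m7", "m2", "m9"]
vector_hits = ["m2", "m5", "m7"]
graph_hits = ["m5", "m2"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits, graph_hits])
print(fused)  # m2 comes first: it is the only id ranked by all three methods
```

Because the inputs are only ranks, a BM25 score and a cosine similarity never have to be put on the same scale, and the same query over the same data always yields the same order.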
On top of fusion, two cheap signals nudge the order: an importance weight (a health fact outranks a passing opinion) and a recency decay (older memories fade unless they keep getting reinforced).

The graph step is Personalized PageRank, a multi-hop diffusion that pulls in neighbors of a strong hit. It is powerful and a little expensive, so it is gated by the detected intent of the query: questions that actually need multi-hop reasoning get it, simple lookups skip it. A cross-encoder reranker is available but off by default, because in measurement it regressed the benchmark. The honest default beat the fancier one.

The whole thing is deadline-bounded. Recall starts local and fast, and if confidence is low it escalates only with cheap local steps (a wider candidate pool, a local rerank when that model is already warm). It never blocks on a model, and it always returns within its time budget, handing back the best result found so far if it runs out of time. A read can be fast or it can be thorough, but it can never hang. That guarantee is only possible because there is no network call hiding inside it.

One call instead of five: prepare()

The cleanest retrieval engine in the world is useless if the agent does not call it. The real failure mode of agent memory is not storage, it is that the agent skips the lookup. Every bit of friction (one more tool to choose, one more round trip) is a reason for the model to wing it instead.
So the primary entry point collapses what would be five calls into one:

    prepare(message="fix the LoadGuard pricing bug")     returns in 4-16 ms
      -> project_context    facts, lessons (rules to follow), decisions, open goals
      -> lessons            procedural rules matching this message, each with an id
      -> recent_activity    the last 24h, for session continuity
      -> open_goals         what you are currently pursuing
      -> active_arcs        the narrative threads the project is living in

One call, a handful of milliseconds, and the agent shows up to the task already informed instead of asking you to repeat yourself. The cost of using memory has to be lower than the cost of guessing, or the memory goes unused. Designing for that is as much a part of the architecture as the index underneath it.

Facts that change: keyed memory and time travel

Most "memory" is append-only, which breaks the moment a fact changes. You move cities. Your project switches its database. If memory just piles up, an old truth and a new truth sit side by side and the agent cannot tell which one holds now.

PMB models changeable attributes explicitly:

    record_keyed_fact("user", "city", "Warsaw")
    # months later
    record_keyed_fact("user", "city", "Tampa")

The Warsaw row is not overwritten and not deleted. It gets a valid_to timestamp, and Tampa becomes the current value. The history stays queryable, so keyed_fact_as_of("2026-01-01") still answers "Warsaw." Negation works the same way: telling the agent a fact is no longer true closes the value with valid_to rather than erasing it.

This is a small schema decision with a large payoff. The agent gets a single current answer to "where do I live," and you never lose the timeline of how things got here.

Forgetting is a feature, not an afterthought

A memory that only grows becomes noise. The same decision recorded five different ways is worse than recording it once, because retrieval starts returning near-duplicates and the signal blurs.
So forgetting is built in at four layers of strictness:

- Exact text match is merged on sight.
- High cosine similarity (about 0.92 and up) is auto-merged as the same memory.
- The borderline band (roughly 0.80 to 0.92) is flagged for a cheap verification pass later, never on the hot path.
- Whatever is left surfaces in the dashboard for a human to merge with one click.

Alongside dedup, an importance decay quietly lowers the weight of memories that stop being reinforced, so the index ages gracefully instead of treating a year-old aside as breaking news. None of this hard-deletes anything by default: archived items drop out of recall but stay on disk, so a wrong call is always reversible.

Two integration layers: MCP for capability, hooks for reliability

The last architectural idea is that one integration surface is not enough.

MCP gives the agent the tools: prepare, recall, record_batch, keyed facts, indexing, and the rest. That is the capability layer. But capability alone still depends on the model choosing to use it, and soft instructions in a rules file get skipped under pressure.

So there is a second layer of lifecycle hooks that force-feed memory at the protocol level, with no model cooperation required:

- On every user message, a fast, multilingual classifier (cheap, sub-millisecond) fetches the matching memory and injects it before the model thinks. The agent never has to decide to call recall.
- On every tool the agent runs, a one-line action journal records what happened (a single SQLite insert, no model, no vectors), filtering out noise like reads and ls.
- On session start, after the context window compacts, a restore step rebuilds "where you left off" from what the session recorded, so the agent picks the thread back up instead of re-asking you.
- On turn end, two things run: a deterministic check of which surfaced rules actually showed up in the agent's work, and an ambient write that journals the turn if the agent forgot to, but only when the work clears a quality bar driven by results.

MCP is what the agent can do. Hooks are what happens regardless. Reliability lives in the gap between the two, and you need both layers to close it.

The principle I would defend in a review

Here is the design opinion underneath all of it, the one I would argue for in any review.

Fast, bounded jobs can use cheap, deterministic tools. The sub-millisecond intent classifier on the hook path is regex, and that is the right call: the set of intents is small and known, and a regex is instant and free.

But the moment you are trying to enumerate something open-ended, like every way a human might state a current fact, or every synonym for an attribute, across every language your users speak, hardcoded keyword and pattern lists are the wrong tool. They are brittle, they do not scale, and they turn into multilingual whack-a-mole where every fix reveals two more gaps.

The architecture leans on that distinction everywhere. Bounded gate? A list is fine. Open-ended understanding? Lean on embeddings and structure, not on a longer list of words. Knowing which side of that line a problem sits on is most of the design.

Local-first is the architecture, not the limitation

It is tempting to read "runs on your machine, no API key" as a list of things this memory gives up. It is the opposite. Refusing the cloud is what forced the read path to be genuinely fast, the writes to be genuinely async, and the storage to be genuinely portable. The constraint did the design work.

Your agent's memory is some of the most personal data you have: what you are building, how you think, what you keep getting wrong. It belongs in a file you own, not a row in someone else's table.

PMB is open source under Apache 2.0.
You can read every line, reproduce the numbers, and fork it if you disagree with a single decision in this post.

Repo: github.com/oleksiijko/pmb
First post (the techniques and the benchmarks): read it on HackerNoon

Reproduce the headline numbers yourself:

    pip install pmb-ai
    python scripts/benchmarks/benchmark_locomo.py --n-conversations 10
    python scripts/benchmarks/mega_stress_test.py \

View original source — Hacker Noon ↗
