Don't Build RAG for Your AI's Memory. Build a Forgetting Machine.

In 1942, Borges wrote about Ireneo Funes, a young man who, after falling from a horse, lost the ability to forget. Funes could reconstruct every day of his life in perfect detail, but reconstructing a single day took him an entire day. Borges's point was that Funes could not actually think. To think is to forget a difference, to generalize, to abstract, and a mind that keeps everything has nothing left over to think with.

Funes was fiction. Then the Soviet neuropsychologist Alexander Luria spent thirty years studying a man who came close to being a real one. Solomon Shereshevsky could memorize almost without limit, but he struggled with abstraction and metaphor; figurative language slipped past him, and his wife sometimes had to explain what a word like "nothing" meant. The boundless memory and the trouble with abstraction were the same trait seen from two sides.

I work on an app where AI characters carry an ongoing, open-ended story with the reader. The characters are supposed to remember and react, so that the story responds to you and the people in it know what you have done.

Last quarter, I improved their memory. I made them remember more, more faithfully, for longer. And the characters got worse. I had built Funes by accident.

This piece is how I unbuilt him: the version that failed, the summary tree that replaced it, the race conditions that nearly sank it in production, and the memory science that, in hindsight, predicted the whole arc.

The shape of the problem

The app is an interactive fiction engine. You and a cast of characters write a story together, message by message, in a chat-like view. A single story can run for thousands of messages across months. The whole appeal is continuity. The character you argued with in chapter one should still be cold with you in chapter forty.

The constraint is the one everyone building on LLMs hits eventually. A context window is finite. A story is not.
By message two thousand you cannot put the story in the prompt, so you have to put something in the prompt, and the entire engineering problem is choosing what. "Give the characters memory" is really "decide what to throw away." I just didn't understand that yet.

Why I didn't build RAG

When most engineers hear "give the LLM memory," they reach for the same tool: embed the past, store the vectors, and at generation time, retrieve the top-k most semantically similar chunks. Retrieval-augmented generation. It is the default, and for a question-answering assistant it is the right default. If a user asks about your refund policy, you want the most relevant paragraph, wherever it lives.

A story is not a knowledge base, and this is the part that took me an embarrassingly long time to internalize. When a character needs to know what happened so far, the wrong answer is the set of past moments most similar to the present one. Narrative meaning lives in order and causation, not similarity. A line of dialogue lands in chapter forty because of a promise made in chapter one and a betrayal in chapter twenty, three events that are not semantically similar to each other or to the present moment. Retrieval would surface three different scenes where someone stood in the rain, because those are embedded near each other, and miss the causal spine entirely. Retrieval gives you relevant fragments pulled out of sequence. A story needs the arc.

So I did not build RAG. I built something that summarizes, and all the interesting failures came from how it summarizes.

Version one: the running summary

The first real version was the obvious one. Every hundred and fifty messages or so, take the previous summary, staple the new messages onto it, and ask a model to fold the whole thing back down into a single summary. A running recap. The prompt was essentially "here is the story so far, here is what just happened, give me the new story so far."

    // the v1 instinct: one summary blob, re-summarized forever
    const { summary } = await summarizeContext(ctx, {
      messages: newMessages,
      previousSummary: latestSummary?.summary,
    });
    await db.insertInto("summaries").values({
      chatId,
      anchorMessageId,
      summary, // a single ever-growing, ever-recompressed blob
    }).execute();

At generation time the prompt got that one summary blob plus the most recent hundred raw messages. Clean, bounded, and it worked for short stories.

Then a tester wrote in about a character who had forgotten something he should not have been able to forget. Early in their story, a few hundred messages back, the character had made a specific, load-bearing promise. It was the emotional engine of the whole arc. By the time the story crossed a thousand messages, the character behaved as though the promise had never happened. Worse than forgotten: smoothed over, as if it had been sanded out of his personality.

It had been. A running summary is a game of telephone played against yourself. Every cycle, a 150-to-250-word blob has to represent the entire story before the recent window. As the story grows, each early event gets re-summarized, then the summary of it gets re-summarized, each pass a little lossier. A promise from chapter one survives as "they grew close," then as "they have history," then as nothing. The blob has a fixed budget and an unbounded job, so the oldest and most foundational events decay first, because they have been through the compressor the most times.

That is the Funes problem flipped over. Funes could not forget anything. My characters forgot the wrong things: they kept a smooth, recent, plausible blur and dropped the load-bearing specifics. Both failures have the same root. Neither system had a policy for what to keep.

Version two: a forgetting pyramid

The fix was to stop keeping one summary and start keeping a tree. Instead of a single blob, the engine keeps a small ordered set of summaries at different depths.
Recent history is covered by shallow, detailed summaries. When the shallow summaries pile up past a limit, the two oldest adjacent ones get merged, or compacted, into one deeper summary. A deeper summary covers more story within the same word budget, which is another way of saying it is allowed to be vaguer. The further back an event is, the coarser its representation.

The merge is the whole idea, and it is small:

    // merge two adjacent summaries into one deeper one
    const compacted = await compactSummaries(ctx, {
      summaries: [older.summary, newer.summary],
    });
    await db.insertInto("summaries").values({
      chatId,
      depth: older.depth + newer.depth, // depth accumulates as we compress
      oldestMessageIdInclusive: older.oldestMessageIdInclusive,
      newestMessageIdExclusive: newer.newestMessageIdExclusive,
      summary: compacted.summary,
    }).execute();

A depth-one summary covers one block of story. Merge two, and you get depth-two covering twice the story in the same 250 words. Merge again, and you get depth-four. The pyramid only ever holds a handful of entries, so the prompt stays bounded no matter how long the story runs. At generation time, the engine walks the chain from the deepest, oldest summary down to the shallowest, newest one, then appends the recent raw messages. Coarse past, fine present, live now.

If that structure feels familiar, it should. It is close to how the brain is thought to store the past. The complementary learning systems theory, laid out by McClelland, McNaughton, and O'Reilly in 1995, describes two cooperating memory systems: a fast one in the hippocampus that captures specific, recent episodes in detail, and a slow one in the neocortex that integrates across many episodes into generalized, schematic knowledge. Old memories are gradually consolidated from the first into the second through repeated reinstatement. Recent and detailed, old and generalized, the distant past is rewritten a little each time it is revisited.
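
To make the mechanics concrete, here is a minimal in-memory sketch of the compaction policy and the prompt walk described above. It is an illustration under stated assumptions, not the engine's actual code: the names `SummaryEntry`, `addSummary`, `buildContext`, and `MAX_PER_DEPTH` are hypothetical, the trigger (merge when more than a fixed number of entries share a depth) is one plausible reading of "pile up past a limit," and a plain string-joining function stands in for the model call that `compactSummaries` makes.

```typescript
// Hypothetical in-memory model of the summary pyramid (illustrative names).
interface SummaryEntry {
  depth: number; // how many base blocks of story this entry covers
  oldestMessageIdInclusive: number;
  newestMessageIdExclusive: number;
  summary: string;
}

// Stand-in for the model call that folds two summaries into one.
type Merge = (older: string, newer: string) => string;

// Assumed limit: how many entries may share a depth before compaction.
const MAX_PER_DEPTH = 2;

// Append a fresh summary, then merge the two oldest entries of any
// over-full depth until every depth is back under the limit.
// Assumes the pyramid is ordered oldest -> newest and was built only
// through this function, so entries of equal depth sit next to each other.
function addSummary(
  pyramid: SummaryEntry[],
  fresh: SummaryEntry,
  merge: Merge,
  maxPerDepth: number = MAX_PER_DEPTH,
): SummaryEntry[] {
  const next = [...pyramid, fresh];
  for (;;) {
    const counts = new Map<number, number>();
    for (const e of next) counts.set(e.depth, (counts.get(e.depth) ?? 0) + 1);

    // Index of the oldest entry whose depth is over the limit, if any.
    const i = next.findIndex((e) => (counts.get(e.depth) ?? 0) > maxPerDepth);
    if (i === -1) return next;

    const older = next[i]!;
    const newer = next[i + 1]!;
    next.splice(i, 2, {
      depth: older.depth + newer.depth, // depth accumulates as we compress
      oldestMessageIdInclusive: older.oldestMessageIdInclusive,
      newestMessageIdExclusive: newer.newestMessageIdExclusive,
      summary: merge(older.summary, newer.summary),
    });
  }
}

// Walk the chain from the deepest, oldest summary to the shallowest,
// newest one, then append the recent raw messages.
function buildContext(pyramid: SummaryEntry[], recentMessages: string[]): string {
  return [...pyramid.map((e) => e.summary), ...recentMessages].join("\n\n");
}
```

Under these assumptions, feeding in seven depth-one blocks with a limit of two per depth leaves three entries of depth 4, 2, and 1: the oldest four blocks share one coarse summary, the next two share a finer one, and the newest block keeps its own. The number of entries grows only with the logarithm of the story's length, which is what keeps the prompt small.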
The pyramid is that consolidation idea with a database and a summarization prompt standing in for biology, and the resemblance is not a coincidence so much as convergence on the only design that fits the constraint.

The summarization prompt does a surprising amount of the work, and it is mostly instructions about what not to lose:

    Summarize what happened in this chunk of the story in 150-250 words.
    Your summary will be appended to the running summary.
    Summarize ONLY the new chunk. Do not repeat prior events.
    Keep specific details: exact names, dates, locations.
    Write so your summary reads as a natural continuation.

"Keep specific details: exact names, dates, locations" is there because the model's instinct, left to itself, is to abstract. It will write "they discussed their past" instead of "she admitted she had lied about the letter." Abstraction is the enemy near the present, and the friend in the deep past, and the pyramid exists to be deliberate about which is which. The few-shot examples in the prompt, for what it is worth, are plot summaries of Harry Potter and the Sorcerer's Stone and Pride and Prejudice, so the model learns "good summary" from the kind of recap a careful reader would write.

The twist: memory writes are a distributed system

Here is the part I did not see coming, and the part I would most want a past version of myself to know.

Summarizing is slow and expensive: it takes extra model calls, and the user should not have to wait on them. So summarization runs in the background, fired opportunistically whenever a generation notices the message count has crossed the threshold, guarded by a lock so two requests do not summarize the same story at once.

The moment your memory becomes an asynchronous background write, it inherits every problem a database has. One afternoon I shipped the hierarchical version, and within the hour two distinct bugs surfaced.
The first was duplicate-key crashes, because two workers were racing to write a summary anchor for the same point in the same story. The second was subtler: characters getting double memories, the same stretch of plot summarized twice and concatenated, because a worker was reading a stale cached copy of the summary chain that did not yet include a compaction another worker had just written.

The fixes are unglamorous, and they are exactly what you would write for any concurrent system. Make the write idempotent so a race produces an upsert instead of a crash:

    q
      .insertInto("chat_content_summaries_v2")
      .values({ /* ... */ })
      .onDuplicateKeyUpdate((eb) => ({ summary: values(eb.ref("summary")) }))
      .execute();

Read past the cache when you are about to make a decision based on what is already there:

    // If a user rewinds before an anchor, the cache would not include the
    // compacted summary so we would re-run the summary compaction.
    const summaries = await getSummaryEntries(ctx, {
      chatId,
      anchorMessageId,
      noCache: true,
    });

And give the lock enough room that a slow summarization does not expire its own lock mid-flight and let a second worker in. We bumped it from 120 to 180 seconds for exactly that reason.

None of this is novel distributed-systems work, and that is the point. "AI memory" sounds like a prompt-engineering problem, and the headline architecture is. But the thing you actually operate is a write-heavy little database with caching, locking, and idempotency requirements, sitting in the hot path of something users are watching in real time. Rewinds, where users undo the story and branch in a different way, turned out to be the single richest source of these bugs because they make the past mutable, and almost nothing in a summarization pipeline expects the past to change.

Forgetting is a feature, and the brain agrees

The reward for all of it is a kind of behavior that is hard to get any other way.
A reader brought back a character they had not spoken to in a couple of thousand messages. The character did not quote their old conversations back at them, because it could not; the transcript was long gone, compacted into a deep summary that knew only the gist. What it had was the shape: that things had ended badly, that there was an unpaid debt between them. It was wary in a way that felt earned, without reciting a single specific line.

That is the behavior you want, and perfect recall is what destroys it. A character that can quote you verbatim from three months ago does not feel like it remembers you. It feels like it has been surveilling you.

Memory research has a clean account of why the gist is the part worth keeping. Fuzzy-trace theory, developed by Valerie Reyna and Charles Brainerd, holds that we encode every experience as two parallel traces: a verbatim trace of surface detail and a gist trace of meaning. The two decay at different rates. Verbatim memory fades quickly; gist is far more durable, which is why you remember the point of an argument long after you have lost its exact words. My deep summaries are gist traces. The raw recent messages are the verbatim trace that has not faded yet. The pyramid is a forgetting curve with the verbatim layer deliberately allowed to fall away first.

And the larger claim, that forgetting is not a malfunction, has direct support. In a 2017 review in Neuron, Blake Richards and Paul Frankland argue that the goal of memory is not fidelity to the past but good decisions in the present, and that transience, the controlled loss of information, serves that goal in two ways. It reduces the influence of outdated details and prevents overfitting to specific episodes, which lets you generalize to new situations. They draw a direct parallel to machine learning: a system that memorizes its training data perfectly has overfit. Funes had overfit to his own life. My first memory system had overfit to the transcript.
The fix, in both the brain and the engine, is to forget on purpose and in the right order.

One aside on the model

People assume the prose comes from the largest, most advanced frontier model available. It does not. For the in-character writing, we use a small, open-weights model run hot, at high temperature. For this specific job, it beats the large, heavily aligned models, and not by a little. The big assistant-tuned models are sanded toward a helpful, agreeable neutrality that is poison for a character with edges. They break character to be nice, they hedge, they flatten. A smaller model with less of that polish, run hot, stays in voice and takes narrative risks the safe models will not.

Summarization is a different job and gets a different, cheaper, faster model, because compressing a story is a comprehension task rather than a creative one. Matching the model to the task, small-and-feral for voice and small-and-literal for memory, has been worth more than any single upgrade to a frontier model. Bigger is not the axis that matters here.

What I would tell someone starting

Two things I wish I had known on day one.

First, memory for an assistant and memory for a character are different problems wearing the same word. An assistant wants to recall: fetch the relevant fact, the more faithfully the better. A character wants continuity: preserve the arc, keep a few load-bearing specifics, and let the rest blur. Optimize a character for recall and you get Funes, technically remembering everything and behaviorally unbearable. The useful question is never "how do I store more," it is "what is my forgetting policy?" Cognitive science is encouraging here, because it suggests that a good forgetting policy is not a compromise. It is what memory is for.

Second, the architecture is the easy half. A summary tree is a weekend of design.
The other half, the half that pages you, is that you have quietly built a concurrent, cached, mutable-history datastore and dropped it into a real-time path, and it will fail in all the ordinary ways concurrent datastores fail. Budget for that half. It is where the time actually goes.

If you want the deeper version of the brain side of this, the two papers to read are Richards and Frankland on the persistence and transience of memory, and McClelland, McNaughton, and O'Reilly on complementary learning systems, which, between them, have quietly shaped how much of machine-learning memory gets designed.

Engineering keeps rediscovering what biology settled on: a memory worth having is mostly a well-tuned machine for forgetting.

View original source — Hacker Noon ↗
