Memory That Outlives the Context Window

An assistant that forgets everything between chats can’t be trusted with anything ongoing.

I run a fleet of AI coding agents in my homelab, and for a while they all had the same flaw: every session, they forgot everything. You’d tell an agent “it’s haven, not the NUC, the NUC was retired” and it would nod along, fix it, and then three days later a fresh session would confidently SSH into a machine that no longer existed. Context windows are big now, but they’re not forever, and the moment a long session compacts or a new one starts, all that context is gone.

Just write it down

Claude Code hands you a CLAUDE.md file out of the box. It sits at the root of your project, and Claude reads it at the start of every session. Mine started small and grew into a small constitution: who I am, which clusters live where, the IPs, the “never force push” rule, the “don’t chain bash commands or you’ll bypass auto-approval” rule. Anything I found myself explaining twice, I wrote down once.

I split the bulkier stuff out into rules/*.md files (dotfiles management, the permission system, how to talk to Google Docs) and Claude pulls those in too. Static instruction files are version-controlled, human-readable, and there’s no infrastructure to run. If you’ve got a fact that’s true every single session (my name, my network layout, the rule that I prefer terse answers), a static file is where it belongs.

The token tax, and the appeal of search

Stuff everything into static files and Claude reads all of it every session, relevant or not. The “how to recover a corrupted k3s etcd database” runbook gets loaded into context even when I’m asking about a Hugo blog post. Every fact you add is a fact that every future session pays for in tokens, forever. My CLAUDE.md plus rules had crept past the point where it was a meaningful chunk of the window before I’d even said hello.

And most of what I wanted to remember wasn’t constitution-grade anyway. It was the long tail. “Ollama needs FLASH_ATTENTION=1 or MoE prefill crawls at 10 tokens/sec.” “The agent-deck launch -m flag silently drops the prompt, launch then send instead.” “ZFS labelclear fails on whole-disk pools, target the partition.” Hundreds of little hard-won facts, any one of which I might need maybe once a month. Loading all of them into every session to cover the one I happen to need is absurd.

What I wanted was search. Store the facts somewhere, and at the relevant moment retrieve the handful that matter by meaning, not by keyword. That’s a vector store: embed each memory into a vector, embed the query into a vector, and pull back the nearest neighbours by cosine similarity. Ask “why is the GPU prefill slow” and you get the FLASH_ATTENTION fact back even if you never typed the word “flash”, because the embedding knows those concepts live near each other.

I got Claude to build it on memory-mcp-service SQLite (the sqlite-vec extension, which gives you vector search inside plain old SQLite, no separate database server to babysit). Internally (I don’t worry about this too much right now) - three tables have to stay in sync for an entry to be whole: memories for the text, memory_embeddings (a vec0 virtual table holding a FLOAT[768] per memory for the cosine search), and memory_content_fts, an FTS5 table for plain full-text matching when I want it. The embeddings come from a local model, nomic-embed-text running on Ollama, which spits out 768-dimensional vectors. These private memories stay on my own hardware, which I like.

Make it an MCP server so scripts can read it too

I exposed it to Claude as an MCP server. MCP is the protocol Claude Code uses to talk to external tools, so wrapping the memory store as one means Claude gets memory_store and memory_search as first-class tools it can call mid-conversation, same as it calls anything else.

I’ve got a lot of shell scripts and a scheduler doing autonomous work in the background, and I wanted those to hit the same memory. One store, many readers. A script triaging my task queue at 3am should be able to look up the same fact a Claude session learned last Tuesday. A standalone service makes that possible.

Wiring it into the agent’s lifecycle

Having the tools available is one thing. Getting the agent to use them, without me nagging, is another. Claude Code’s hooks (little scripts that fire at points in a session’s lifecycle) are where I got a bit too clever.

Three moments where memory matters:

  • Session start. Before the agent does anything, brief it. Pull a few relevant memories and inject them as context so it starts the session already knowing the lay of the land.
  • Mid-session. After I send a message, nudge the agent to consider whether it just learned something worth keeping, and to go search if my message smells like it references past work.
  • Just before compaction. When a long session is about to roll the context window over and lose everything, that’s the last chance to save what happened. So a precompact hook fires right before the window compacts.

The briefing and the nudge are read-side: they fetch and remind. The precompact hook, in my first design, was a write. It would reach straight into the SQLite-vec database and drop in a “session marker” memory recording that this session had hit compaction, so I’d have a breadcrumb trail of what happened when.

The precompact write was a mistake.

How it got noisy

It’s just a little marker, right? A hook can’t write a good memory. It doesn’t know what the session was about in any compressed, meaningful way, only that the session hit compaction. So what it wrote was the database equivalent of “something happened here.” “Session hit compaction.” Over and over, every time any long session rolled over.

Top-N semantic recall is the catch. The store can hold as many entries as it likes, but when I search, I only get back the few nearest neighbours. Those slots are precious. Every contentless session marker sitting in the database is a candidate that can shoulder a real memory out of the top results. When I finally went and counted, there were around 154 of these noise entries clogging things up.

Paring it back to the part that works

The fix was to flip who does the writing.

The hooks got demoted to read-only. They brief at session start and they nudge mid-session, and that’s all. No hook writes a memory to the store anymore. I deleted the precompact write entirely and cleared out the ~154 markers it had left behind.

The writing moved to where the judgment is: the agent itself, writing real memories. I landed on a set of write-ahead triggers, a fancy name for a simple habit. The agent scans my message for a small set of patterns and, if it spots one, writes the memory before it answers me:

  • A correction (“it’s X, not Y”): store the corrected fact.
  • A preference (“always do it this way”): store the preference and why.
  • A decision (“let’s go with the haven VM”): store the decision and the reasoning.
  • A specific value (a port, a path, a version, an IP): store it and where it applies.

The “write before responding” ordering matters more than it looks. In the moment, the fact feels obvious, it’s right there in the conversation, of course the agent knows it. But that’s the context that evaporates at compaction. Writing it down first, before the reply, catches it before it’s gone. The trigger is my input, not the agent’s memory of its own intentions, so there’s nothing for it to “remember to do.”

Static files for the always-true constitution, the vector store for the searchable long tail, hooks that only ever read and remind, and the agent itself as the only thing that writes, and only when there’s real content worth keeping. The noise problem hasn’t come back.

Summary

The agents remember things now, across sessions, across compactions, and mostly the right things. Next I want to look at whether the briefing should pull memories by recency as well as similarity, because some facts go stale and a six-month-old “current” IP is its own kind of lie.