Rethinking AI Memory: Beyond the Flat File

Nine months of building a context system for AI-assisted development had produced one clear lesson: the problem is not storage, it's retrieval. Every approach I had tried — flat markdown, structured markdown, a PostgreSQL schema with a JSONB blob — was a better or worse answer to the question "where do I put this?" The question I actually needed to answer was "how do I get the right context to the right model at the right time without thinking about it?"

Those are different questions, and the second one is harder.

The Memory Problem in AI Systems

Human memory works through association. You remember something not because you filed it under the right label, but because it's connected to things you're currently thinking about. Activation spreads through a network of related concepts. The relevant memories surface without deliberate retrieval effort — they're just there when the context makes them applicable.

The systems I had built were filing cabinets. Well-organized ones, but still filing cabinets. Retrieval required explicit queries. The system didn't know what I was currently thinking about unless I told it. The automatic surfacing I wanted — context that appeared because it was relevant, without me asking — was not a property of any storage system I had looked at.

Getting from "better filing cabinet" to "associative memory" required a retrieval architecture that was aware of working context: what files I had open, what I was asking about, what task I was in the middle of. That's more than a database with good indexing. That's a system that observes the current context and uses it as the query.

The Vector Search Foundation

The pgvector approach I had already implemented was pointing in the right direction. Vector similarity search is the mechanism that makes semantic retrieval work — queries that surface related content based on meaning rather than exact keyword overlap. But the retrieval was still manual in my current implementation. I ran a query. The system returned results. I decided what to do with them.

Making it automatic required closing the loop: observe what I'm working on, construct a query from that observation, run the retrieval, inject the results into the AI session. Each step was tractable engineering. Putting them together into something that ran without intervention was the design challenge.

The observation step was the interesting one. What signals indicate what I'm currently working on?

  • Currently open files in the editor (available via IDE APIs)
  • Recently modified files (available via git)
  • The current prompt or question I'm about to send to a model
  • The current directory and project root

A retrieval system that embedded these signals and used them to query the knowledge base would produce context that was relevant to the current task without me deciding what to ask for. That's the core idea behind what would eventually become the retrieval layer of the system I was building.

The Retrieval-Augmented Generation Pattern

The academic term for this pattern is retrieval-augmented generation, or RAG. The machine learning community had been publishing on it for a few years. The idea is straightforward: before sending a query to a language model, retrieve relevant documents from a knowledge base and include them in the prompt. The model generates its response with that additional context in view.

What made the academic implementations not directly useful for my situation was their assumption about the knowledge base: large, general, relatively static. Wikipedia articles. Product documentation. The kind of content where a standard embedding model produces reliable similarity scores.

My knowledge base was small, highly specific, rapidly changing, and contained a mix of structured facts, business rules, and technical domain knowledge that general embedding models hadn't been trained to handle well. The retrieval quality on general-purpose embeddings against specialized content was inconsistent. Some queries returned excellent results. Others missed relevant entries because the embedding space didn't represent the domain vocabulary accurately.

Fine-tuning the embedding model for my specific domain was on the table, but felt like too much investment for a system that was still being validated. The simpler path was hybrid retrieval — vector search combined with keyword search, with a merge step that combined the ranked results from both approaches.

The Injection Architecture

The injection layer — the piece I still hadn't built — needed to sit between my knowledge base and whatever model I was talking to. The inputs were: current working context (open files, recent changes, current prompt). The output was: an enriched prompt with relevant retrieved context prepended.

Doing this automatically, without interrupting the workflow, meant either hooking into the IDE extension system or running a local proxy that intercepted API calls and enriched them in transit. The proxy approach was more general — it didn't require a specific IDE integration — but it added latency and complexity. The IDE extension approach was faster but locked to a specific editor.

I started sketching the proxy architecture. A local HTTP server that accepted the same API contract as the model provider, enriched the request with retrieved context, forwarded to the real provider, and returned the response. Model-agnostic, editor-agnostic, and transparent to the client application.

That design decision — a local proxy as the injection layer — would shape everything that came after it. It was the right call, and it was also more work than I initially estimated. Month ten ended with the architecture clear and the implementation not yet started.

If you've implemented something similar — a local retrieval layer that enriches model prompts automatically — I'd like to compare notes on where the design got complicated. As always, I'm here to help.

Read more