Generative AI applications don't need bigger memory so much as smarter forgetting. When building LLM apps, start by shaping working memory.
You delete a dependency. ChatGPT acknowledges it. Five responses later, it hallucinates that same deprecated library into your code. You correct it again; it nods, apologizes, and does it once more.
This isn't just an annoying bug. It's a symptom of a deeper problem: LLM applications don't know what to forget.
Developers assume generative AI-powered tools are improving dynamically: learning from mistakes, refining their knowledge, adapting. But that's not how it works. Large language models (LLMs) are stateless by design. Each request is processed in isolation unless an external system supplies prior context.
That means "memory" isn't actually built into the model; it's layered on top, often imperfectly. If you've used ChatGPT for any length of time, you've probably noticed:
- It remembers some things between sessions but forgets others entirely.
- It fixates on outdated assumptions, even after you've corrected it multiple times.
- It sometimes "forgets" within a session, dropping key details.
These aren't failures of the model; they're failures of memory management.
How memory works in LLM applications
LLMs don't have persistent memory. What feels like "memory" is actually context reconstruction, where relevant history is manually reloaded into each request. In practice, an application like ChatGPT layers multiple memory components on top of the core model:
- Context window: Each session retains a rolling buffer of past messages. GPT-4o supports up to 128K tokens, while other models have their own limits (e.g. Claude supports 200K tokens).
- Long-term memory: Some high-level details persist across sessions, but retention is inconsistent.
- System messages: Invisible prompts shape the model's responses. Long-term memory is often passed into a session this way.
- Execution context: Temporary state, such as Python variables, exists only until the session resets.
Without external memory scaffolding, LLM applications remain stateless. Every API call is independent, meaning prior interactions must be explicitly reloaded for continuity.
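In code, that scaffolding is just a function that rebuilds the layers above on every request. Here is a minimal sketch; the shapes are illustrative, not part of any SDK, with userFacts standing in for whatever long-term memory you persist and recentTurns for the session's rolling buffer:

type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Reassemble the model's "memory" from scratch for each request.
function buildContext(userFacts: string[], recentTurns: ChatMessage[]): ChatMessage[] {
  return [
    // Long-term memory is injected as an invisible system message...
    { role: "system", content: `Known facts about this user:\n${userFacts.join("\n")}` },
    // ...followed by the rolling buffer of recent conversation.
    ...recentTurns,
  ];
}

Every request rebuilds this array from scratch; the model itself carries nothing forward.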
Why LLMs are stateless by default
In API-based LLM integrations, models don't retain any memory between requests. Unless you manually pass prior messages, each prompt is interpreted in isolation. Here's a simple example of an API call to OpenAI's GPT-4o:
import { OpenAI } from "openai";
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    { role: "system", content: "You are an expert Python developer helping the user debug." },
    { role: "user", content: "Why is my function throwing a TypeError?" },
    { role: "assistant", content: "Can you share the error message and your function code?" },
    { role: "user", content: "Sure, here it is..." },
  ],
});
Each request must explicitly include past messages if context continuity is required. If the conversation history grows too long, you must design a memory system to manage it, or risk responses that drop key details or cling to outdated context.
This is why memory in LLM applications often feels inconsistent. If past context isn't reconstructed properly, the model will either cling to irrelevant details or lose critical information.
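A common first step is to cap the reconstructed history at a token budget, dropping or summarizing the oldest turns first. Here is a minimal sketch, using a rough characters-to-tokens estimate rather than the real tokenizer a production system would use:

type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Very rough estimate; swap in a real tokenizer (e.g., tiktoken) for accuracy.
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

function trimHistory(history: ChatMessage[], budget: number): ChatMessage[] {
  const kept: ChatMessage[] = [];
  let used = 0;
  // Walk backwards so the most recent turns survive; older ones fall off first.
  for (let i = history.length - 1; i >= 0; i--) {
    const cost = estimateTokens(history[i].content);
    if (used + cost > budget) break;
    kept.unshift(history[i]);
    used += cost;
  }
  return kept;
}

A fuller system would summarize the dropped turns into a compact note rather than discarding them, but the core decision is the same: something outside the model has to choose what stays in the window.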
When LLM applications won't let go
Some LLM applications have the opposite problem: not forgetting too much, but remembering the wrong things. Have you ever told ChatGPT to "ignore that last part," only for it to bring it up later anyway? That's what I call "traumatic memory": when an LLM stubbornly holds onto outdated or irrelevant details, actively degrading its usefulness.
For example, I once tested a Python library for a project, found it wasn't useful, and told ChatGPT I had removed it. It acknowledged this, then continued suggesting code snippets using that same deprecated library. This isn't an AI hallucination issue. It's bad memory retrieval.
Anthropic's Claude, which offers prompt caching and persistent memory, moves in the right direction. Prompt caching lets developers mark large, stable prompt segments for reuse so they aren't reprocessed on every request, reducing repetition and making session structure more explicit.
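As a rough sketch of how that looks with Anthropic's TypeScript SDK (the model name and PROJECT_CONTEXT constant are placeholders, and exact caching parameters may vary by SDK version):

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

// A large, stable block (project docs, coding conventions) worth reusing across requests.
const PROJECT_CONTEXT = "Project documentation and coding conventions go here.";

const response = await anthropic.messages.create({
  model: "claude-3-5-sonnet-latest",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: PROJECT_CONTEXT,
      // Ask the API to cache everything up to this point for later requests.
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: "Why is my function throwing a TypeError?" }],
});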
But while caching improves continuity, it still leaves the broader challenge unsolved: Applications must manage what to keep active in working memory, what to demote to long-term storage, and what to discard entirely. Claude's tools help, but they're only part of the control system developers need to build.
The real challenge isn't just adding memory; it's designing better forgetting.
Smarter memory requires better forgetting
Human memory isn't just about remembering; it's selective. We filter details based on relevance, moving the right information into working memory while discarding noise. LLM applications lack this ability unless we explicitly design for it. Right now, memory systems for LLMs fall into two flawed categories:
- Stateless AI: Completely forgets past interactions unless manually reloaded.
- Memory-augmented AI: Retains some information but prunes the wrong details with no concept of priority.
To build better LLM memory, applications need:
- Contextual working memory: Actively managed session context with message summarization and selective recall to prevent token overflow.
- Persistent memory systems: Long-term storage that retrieves based on relevance, not raw transcripts. Many teams use vector-based search (e.g., semantic similarity on past messages), but relevance filtering is still weak.
- Attentional memory controls: A system that prioritizes useful information while fading outdated details. Without this, models will either cling to old data or forget essential corrections.
Example: A coding assistant should stop suggesting deprecated dependencies after multiple corrections (see the sketch below).
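To make that example concrete, here is one way to pin corrections so they outrank stale context; the CorrectionStore class is a hypothetical sketch, not an existing library:

// User corrections are pinned: they never decay, and anything the user has
// explicitly retracted is filtered out of recall before it reaches the prompt.
class CorrectionStore {
  private removedDeps = new Set<string>();

  recordRemoval(dep: string): void {
    this.removedDeps.add(dep);
  }

  // Pinned corrections travel with every request as a system message.
  asSystemMessage(): { role: "system"; content: string } {
    return {
      role: "system",
      content: `Never suggest these removed dependencies: ${[...this.removedDeps].join(", ")}`,
    };
  }

  // Drop retrieved memories that still mention retracted dependencies.
  filterRecall(snippets: string[]): string[] {
    return snippets.filter(
      (snippet) => ![...this.removedDeps].some((dep) => snippet.includes(dep))
    );
  }
}

The pinned system message rides along with every request, while filterRecall keeps stale snippets about retracted dependencies out of the prompt entirely.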
Current AI tools fail at this because they either:
- Forget everything, forcing users to re-provide context, or
- Retain too much, surfacing irrelevant or outdated information.
The missing piece isn't bigger memory; it's smarter forgetting.
GenAI memory must get smarter, not bigger
Simply increasing context window sizes won't fix the memory problem. LLM applications need:
- Selective retention: Store only high-relevance knowledge, not entire transcripts.
- Attentional retrieval: Prioritize important details while fading old, irrelevant ones.
- Forgetting mechanisms: Outdated or low-value details should decay over time, as in the sketch below.
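A minimal sketch of that decay, assuming each stored memory carries a relevance score and a last-used timestamp (the MemoryItem shape is illustrative):

interface MemoryItem {
  text: string;
  relevance: number; // 0..1, set when the memory is stored or reinforced
  lastUsed: number;  // epoch milliseconds
}

// Exponential decay: an item's effective score halves every `halfLifeDays`.
function effectiveScore(item: MemoryItem, halfLifeDays = 14, now = Date.now()): number {
  const ageDays = (now - item.lastUsed) / (1000 * 60 * 60 * 24);
  return item.relevance * Math.pow(0.5, ageDays / halfLifeDays);
}

// Keep only memories that still clear the bar; everything else is forgotten.
function pruneMemories(items: MemoryItem[], threshold = 0.1): MemoryItem[] {
  return items.filter((item) => effectiveScore(item) >= threshold);
}

In a fuller system, reinforcing a memory would refresh its relevance and timestamp, while everything else quietly fades below the threshold.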
The next generation of AI tools won't be the ones that remember everything. They'll be the ones that know what to forget. Developers building LLM applications should start by shaping working memory. Design for relevance at the contextual layer, even if persistent memory expands over time.


