How AI Remembers Across Hundreds of Turns

On the twentieth message, the character forgets its first promise. An hour ago the user mentioned that they "have a cat named Nabi," yet now the character asks who Nabi is. The flaw that a dazzling opening line conceals surfaces at exactly this point, once the conversation runs long.

In extended conversation, memory is not a nice-to-have feature. It is the product's reliability itself. And this problem—contrary to a common misconception—is not solved by a "bigger model" or a "longer context window." The LongMemEval benchmark (Wu et al., ICLR 2025) reports that feeding a multi-session conversation history directly into a model drops accuracy by roughly 30%, and that under harder settings the decline reaches 30–60%. Putting more in does not make the model remember more; it makes the model remember less.

This article is about how to close that gap. Rather than specific internal configuration values, it focuses on the constraints the design had to overcome and the ways it overcame them.

There is a reason we confronted this problem head-on while building Mellowz. Mellowz is a novel-style AI character chat platform. Users build stories across hundreds of turns—one-on-one with a single character, or with several characters together in one room. Each character has a lorebook and a scenario that capture its personality, voice, and worldview, and these settings must not waver no matter how long the conversation runs. In a novel-style experience, memory is the sense that the story never breaks, and that is the core of the product's value. The techniques discussed below are ones that academia and industry have openly established; on top of them, we will also note the lens through which Mellowz views this problem.

The Simplest Answer: Feed the Entire History Every Turn

The most intuitive way to implement memory is to drop the entire conversation so far into the prompt, verbatim, on every turn. It works well for short conversations, which is why many services start here.

The problem is that the advertised one-million-token window is an "architectural ceiling," not a "reliable working length." Fitting inside the window and the model actually using it well are two different things. The brute-force approach of injecting everything soon runs into three walls.

The Three Walls of Brute-Force Injection

First, cost grows roughly quadratically. Commercial chat APIs are stateless, so every turn resends the entire conversation history. The history grows linearly in length, but because the whole of it is resent each turn, the cumulative billed input tokens grow roughly quadratically. The cumulative bill for a 100-turn conversation reaches thousands of times that of a single turn.

Second, time to first token increases. A transformer's self-attention spends O(n²) time and memory per layer in the sequence length n (Vaswani et al., 2017). The longer the input, the more prefill dominates the time to the first token of the response (TTFT). In relationship-driven conversation, response latency is fatal.

Third, quality collapses even inside the window. "Lost in the Middle" by Liu et al. (TACL 2024) demonstrated a U-shaped curve: models make good use of the beginning and end of the context but miss information placed in the middle. Follow-up work measures this more precisely. In NoLiMa (Modarressi et al., ICML 2025), GPT-4o's effective accuracy fell from 99.3% to 69.7% at 32K tokens, and 10 of the 12 models tested dropped below half of their short-context baseline at 32K. Chroma's "Context Rot" report (2025) showed, across 18 models including GPT-4.1, Claude 4, and Gemini 2.5, that performance degrades as the input grows longer—well before reaching the window's limit.

These three are distinct problems, but they do not operate in isolation; they compound.

So Memory Is Not "More Memory" but "Retrieval"

Here we have to flip the perspective once. We redefine memory not as an endlessly growing chat log, but as a queryable store.

A reference librarian makes the intuition concrete. A character's every past conversation is like an entire library collection—you cannot lay all of it on the desk at once. A good librarian listens to the question and, by way of the index, brings only the few most relevant volumes. And a librarian who only grabs "the book just returned" is a poor one. Even an old book may be the most precisely fitting answer to the question at hand. Memory is the same. Something should surface because it is relevant, not because it is recent.

Once you accept this shift, the design problem splits in two: what to store and how, and what to retrieve each turn.

Remembering Like a Computer: Tiered Memory and Paging

The work that showed this structure most clearly is MemGPT (Packer et al., 2023, later renamed Letta in September 2024). MemGPT borrows its design from operating-system memory management.

[ Main context = RAM ]            fast but small
  - system instructions
  - recent messages (FIFO queue)
  - writable working memory
        ▲   ▼   page-in / page-out via function calls
[ External context = disk ]       large but slow
  - archival store (vector search)
  - searchable full conversation history

The fixed context window corresponds to physical memory (RAM), and the external store to disk. Just as an operating system makes memory larger than RAM appear "as if it were there," the agent behaves as if it had memory larger than the window. The decisive leap is that the agent becomes its own memory manager. The model decides directly what to keep inside and what to send out. The analogy stops here. From this point on, we speak in precise terms—main context, archive, paging.

What to Recall: Scoring the Retrieval

Retrieval is not simple semantic-similarity search. The case that designed this part most cognitively is Stanford's Generative Agents (Park et al., UIST 2023). This work assigns each memory a weighted sum of three scores.

Recency: decays exponentially over time (decay factor 0.995, measured in the simulation's internal time).
Importance: assigned by the model from 1 to 10 at the moment the memory is recorded. Brushing your teeth is a 1; a breakup or passing an exam is close to a 10.
Relevance: the cosine similarity between the embeddings of the current query and the memory.

All three weights are set to 1, and each score is normalized to between 0 and 1 before being summed. The point is that ranking is not done by recency alone. An old but important and relevant memory beats just-exchanged but irrelevant small talk. (As an aside, Generative Agents ran on a GPT-3.5-class model back in 2023. The contribution lies in this architecture, not in the base model.)

Retrieval in practice usually blends not one method but two. Keyword search (BM25), which captures lexical precision, and dense-embedding search (DPR, Karpukhin et al., EMNLP 2020—which leads BM25 by 9–19 percentage points in top-20 retrieval accuracy), which captures meaning broadly, are run together, and their two rankings are fused with RRF (Cormack et al., SIGIR 2009). The constant k commonly used sits near 60, but this is a default that hardened into implementation convention rather than a value the original paper advocated. Hybrid became the default because the two methods cover each other's failure cases. Keywords are weak on synonyms; embeddings are weak on exact proper nouns.

What to Discard: Summarization Compression and Kinds of Memory

The counterpart to retrieval is writing. Since you cannot pile up the conversation wholesale, you distill the flowing conversation into semantic units and compress it. This task, called rolling summarization, is essentially a trade-off between information loss and cost-and-scale. The more you compress, the cheaper and more scalable it becomes—but detail is lost.

The more important decision, however, is dividing memory into kinds. Cognitive science long ago distinguished episodic memory from semantic memory (Tulving, 1972), and agent design borrows the same split (CoALA, Sumers et al., TMLR 2024). The two kinds should be handled in exactly opposite ways.

Events are immutable and append-only. "I changed jobs last week" is something that happened, and it is not overwritten later.
Facts and preferences are subject to update and version control. "I like coffee" is a belief, and it may change next month.

Mixing the two accumulates contradictions. If an old instruction to "answer concisely" and a new instruction to "explain in detail" both remain with equal weight, the model wavers between them or ignores both. This is the core failure mode that produces hallucination and consistency breakdown in long conversations.

In Mellowz, too, this distinction governs character consistency. The events that happen to a character and the preferences that get updated about the user are memories of different natures, and they must be handled differently. What you preserve as immutable fact and what you leave as updatable belief determine whether the character remains the same person even after hundreds of turns.

The "Personality" That Collapses Independently of Memory: Persona Drift

We have to separate two failures that are often lumped together. Forgetting what the user said is a failure of recall. A character gradually growing different from itself is a drift of persona. They are different problems and demand different solutions.

"Measuring and Controlling Persona Drift" by Li et al. (COLM 2024) measured that the persona stability of LLaMA2-chat-70B drops sharply around the eighth conversational round, with the model increasingly coming to mimic the user's tone. Paradoxically, there are also reports that larger models drift more (Choi et al., 2024, across 9 LLMs). An important implication follows here. No matter how well you retrieve, you fix recall but not drift. Drift requires its own prescriptions—periodic re-injection of the character definition, attention re-weighting, consistency training. And no single technique eliminates this problem entirely. It only reduces it.

Context Engineering: Allocating a Finite "Attention Budget"

Every technique so far—tiered memory, score-based retrieval, summarization compression, type distinction—is bound together by a single discipline. In 2025, that discipline acquired a name: context engineering.

Two sentences defined the field. Tobi Lütke called it "the art of providing all the context for the task to be plausibly solvable by the LLM" (June 2025), and Andrej Karpathy framed it as "the delicate art and science of filling the context window with just the right information for the next step." Anthropic's engineering article (2025) treats context as a "finite resource with diminishing returns"—an attention budget—and proposes, as a core principle, choosing at each step "the smallest possible set of high-signal tokens."

Think of a suitcase. Even if you buy a bigger bag, cramming everything in means you cannot find the thing you actually need. A longer context window is likewise no substitute for curation.

One might object that "context engineering is, in the end, just RAG, memory, and agent orchestration rebundled under a new name." It is a fair point. The real value, however, lies in unifying scattered techniques into a single design discipline: the allocation of a finite attention budget.

Why a Character Appears to "Remember"

Caching is the last piece of this picture. So as not to reprocess the same prefix over and over, commercial APIs offer prompt caching (OpenAI gives a 50% discount on cached input for prefixes over 1024 tokens; Anthropic prices cache reads at 0.1× the base input). But change a single byte of the prefix and everything after it is invalidated. A subtle change in a timestamp, a UUID, or tool ordering silently breaks the cache. Choosing what to put in context affects not only quality but the unit price directly.

All of these layers are invisible to the user. They must remain so. A well-built memory system does not reveal its own existence; it leaves only the sense that "this character remembers me."

This sense is exactly what Mellowz aims for. Come back after a few days and the character knows the previous story; let the settings accumulate and the character holds its own world; let the conversation pass hundreds of turns and the response stays within manageable speed and cost. The result is to make the user feel that "this character knows me."

The conclusion is simple. AI does not recall everything. It appears to remember because it selects the right thing each turn. Memory is the result not of the model's size but of the system's design.

References

Liu et al. (2024), "Lost in the Middle," TACL 12 — aclanthology.org
Modarressi et al. (2025), "NoLiMa," ICML — arXiv:2502.05167
Wu et al. (2025), "LongMemEval," ICLR — arXiv:2410.10813
Packer et al. (2023), "MemGPT" (→ Letta) — arXiv:2310.08560
Park et al. (2023), "Generative Agents," UIST — arXiv:2304.03442
Li et al. (2024), "Measuring and Controlling Persona Drift," COLM — arXiv:2402.10962
Karpukhin et al. (2020), "Dense Passage Retrieval," EMNLP — aclanthology.org
Anthropic (2025), "Effective context engineering for AI agents" — anthropic.com

Mellowz is available at mellowz.ai. For inquiries about our technology and product, please write to ceo@vibecompany.work.

How AI Remembers Across Hundreds of Turns: Long-Term Memory and Context Engineering in Character Chat