The memory problem: what nobody tells you about AI agents in production

Article
March 9, 2026

The enterprise appetite for AI agents is real.

  • Gartner predicts 40% of enterprise apps will feature AI agents by 2026, up from less than 5% in 2025.
  • McKinsey's State of AI report says 78% of organizations now use AI in at least one business function.
  • Deloitte found that 25% of enterprises running GenAI deployed agents in 2025, and that number is expected to double by 2027.

But Gartner also predicts that over 40% of those agentic AI projects will be canceled by the end of 2027 due to escalating costs, unclear business value, and/or inadequate risk controls.

Three out of four technology leaders surveyed by BCG say they fear "silent failure" - spending real money on AI that doesn't deliver real impact, or worse, that introduces subtle errors nobody catches until the damage is done.

The root cause of a lot of this fear is memory.

Large language models are stateless. Every single interaction starts from scratch. The "memory" we hear vendors talk about is not something these models do natively. It's a complex, often brittle system bolted onto the side. And when it breaks, it tends to break quietly.

This article is about what that actually means in practice. We’ll discuss the engineering tradeoffs, the architectural options, and the security questions you need to be asking before you put an agent into production.

The illusion of the infinite context window

Model providers are in an arms race over context windows. The implicit promise is simple: just feed your agent the entire history and let it figure things out.

But reality is different. Anthropic published research on what they call "context rot". The gist: as the context window fills up, the model's ability to accurately recall specific information degrades.

This is because attention compares every token with every other token: n tokens create n² pairwise relationships that compete for the model's attention. Context, Anthropic argues, should be treated as a finite resource, with diminishing marginal returns.
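That quadratic growth is easy to see in a few lines. This is just the raw pair count, not a claim about any specific model's attention implementation:

```python
# Sketch: pairwise token relationships grow quadratically with context length.
# Doubling the number of tokens roughly quadruples the interactions competing
# for the model's attention.
def pairwise_relationships(n_tokens: int) -> int:
    """Number of ordered token pairs an attention layer must score (n^2)."""
    return n_tokens * n_tokens

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {pairwise_relationships(n):>18,} pairwise scores")
```

Going from 10,000 to 100,000 tokens is a 10x increase in content but a 100x increase in competing interactions.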

An analogy might be helpful. A context window isn't a filing cabinet where you can keep adding folders and retrieve any one on demand. It's more like a crowded room. The more people you cram in, the harder it gets to have a meaningful conversation with any one of them.

In fact, researchers found (in work presented at AAAI) that ChatGPT's effective working memory capacity is "strikingly similar" to a human's: roughly 7±2 items, regardless of how large the context window technically is.

So we have models with 128,000-token windows that functionally remember about as much as you do when someone reads you a phone number (an exaggeration, but you get the point).

RAG: the fragile chain everybody's building on

Most production agent memory today runs on some variant of Retrieval-Augmented Generation (RAG). The basic idea sounds reasonable. You store past interactions, retrieve the relevant ones when needed, and then inject them back into the prompt.

But as Dan Giannone laid out in a detailed breakdown, the actual chain of operations is long and fragile:

  • First, something has to get flagged as "important" enough to save. That's usually a heuristic or a classifier, and it's making judgment calls about what matters.
  • Then the snippet gets embedded as a vector and stored in a database.
  • Later, the agent has to decide it should query its memory at all. A similarity search returns what looks closest.
  • Those snippets get injected back into the prompt.

If any single link in that chain breaks, the whole mechanism fails. And the problem is, when it fails, it often looks like it's working. The agent doesn't say "I don't know." It produces a confidently wrong answer, and nobody can easily trace where the error came from.

Four less talked about memory problems

Beyond the fragility of the retrieval pipeline, there are deeper structural issues with how most agent memory systems work today.

  1. Vector databases store snippets, not understanding. They hold text fragments. They can tell you which fragments are semantically similar to a query. What they can't do is represent the relationships between people, projects, decisions, and timelines that make up real enterprise context. Knowing that "Q3 revenue was $14M" and "the CFO flagged margin pressure" are both in the database doesn't mean the agent understands the connection between them.
  2. Retrieval ≠ reasoning. The memory system and the agent's reasoning loop are two different things. The agent can't learn from its own retrieval failures. It can't notice that it keeps pulling the wrong context and adjust. There's no feedback loop, just a pipe that dumps snippets into a prompt and hopes for the best.
  3. There's no concept of time, decay, or forgetting. Every memory persists with equal weight, forever. A conversation from six months ago sits right next to one from yesterday. There's no consolidation, no natural process for stale information to fade. This creates an ever-growing pile of context that can pollute future prompts with outdated or irrelevant information.
  4. Long-running agents have amnesia between sessions. Anthropic described this well: running a long-term agent is like staffing a software project with engineers who work in shifts, except each new engineer arrives with absolutely no memory of what happened on the previous shift.
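The missing notion of decay (problem 3) has a well-known fix from other caching and ranking systems: weight similarity by age. A minimal sketch, assuming an exponential decay with a hypothetical 30-day half-life:

```python
from datetime import datetime, timedelta

# Sketch: time-decayed memory scoring. The half-life is an assumed
# hyperparameter; the point is that yesterday's memory outranks an equally
# similar one from six months ago, instead of everything persisting with
# equal weight forever.

def decayed_score(similarity: float, age: timedelta,
                  half_life_days: float = 30.0) -> float:
    """Exponential decay: the score halves every `half_life_days`."""
    return similarity * 0.5 ** (age.days / half_life_days)

now = datetime(2026, 3, 9)
fresh = decayed_score(0.80, now - datetime(2026, 3, 8))   # yesterday
stale = decayed_score(0.85, now - datetime(2025, 9, 9))   # ~6 months old
print(fresh > stale)  # the fresher memory wins despite lower raw similarity
```

Real systems would also need consolidation and explicit deletion, but even this one-liner prevents the ever-growing pile from dominating future prompts.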

Choosing a memory architecture: the real tradeoffs

There's no universal answer here. The right architecture depends on your use case, your data complexity, and what kinds of failures you can tolerate.

That said, it helps to understand what's actually available, because most of the guidance out there is either theoretical, only tested at demo scale, or ignores the realities of running this stuff in production.

A couple of the more common approaches:

Vector store (memory as retrieval)

This is the default. Store text as high-dimensional embeddings, retrieve by cosine similarity.

It's fast, scales well, and integrates natively with LLM pipelines. For document retrieval and Q&A over unstructured knowledge bases, it works fine.

The problems show up when you need relational understanding, temporal awareness, or complex queries that go beyond "find me something similar to this." Metadata filtering degrades performance. And you inherit all the context rot issues we discussed above.

Best for: Document retrieval, straightforward Q&A, simple conversational agents where losing some context isn't catastrophic.

Knowledge graph (memory as structure)

Store information as nodes and edges - who said what about whom, when, and why.

This gives you explicit relationship modeling, multi-hop reasoning, and temporal awareness. For domains with highly structured, interconnected data like fraud detection, supply chain analysis, or complex organizational knowledge, graphs are powerful.

The catch is upfront cost. Schema design is significant. Knowledge graphs don't handle unstructured content well, and they have no native semantic search. You're trading flexibility for precision.

Best for: Knowledge management, compliance, any domain where the relationships between entities matter more than the raw text.

Hybrid (vector + graph)

Combine a vector store for semantic search with a knowledge graph for structured reasoning.

In theory, you get the best of both. In practice, you're managing and synchronizing two complex database systems. Consistency between them is a real engineering challenge, and queries that span both can be slow.
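The synchronization problem shows up at the very first write. A toy sketch of the dual-write path (the store classes are stand-ins, not real databases):

```python
# Sketch: every memory must land in BOTH stores. If the second write fails,
# the stores silently diverge - the consistency problem described above.

class VectorStore:
    def __init__(self) -> None:
        self.rows: list[str] = []
    def add(self, text: str) -> None:
        self.rows.append(text)

class GraphStore:
    def __init__(self) -> None:
        self.edges: list[tuple[str, str, str]] = []
    def add(self, s: str, rel: str, o: str) -> None:
        self.edges.append((s, rel, o))

def remember(vs: VectorStore, gs: GraphStore,
             text: str, triple: tuple[str, str, str]) -> None:
    vs.add(text)
    try:
        gs.add(*triple)
    except Exception:
        vs.rows.pop()  # naive compensation; real systems need transactions,
        raise          # outbox patterns, or reconciliation jobs

vs, gs = VectorStore(), GraphStore()
remember(vs, gs, "CFO flagged margin pressure",
         ("CFO", "flagged", "margin_pressure"))
```

Multiply this by updates, deletes, and schema changes, and the "real engineering challenge" becomes clear.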

Best for: Mission-critical agents that need both semantic and relational understanding, assuming you have the team and budget to maintain the complexity.

Observational memory (the new kid)

This one is interesting.

Published in February 2026, the approach uses two background agents - an Observer and a Reflector - that continuously compress conversation history into a structured, dated log that stays in the context window. No external retrieval step at all.

VentureBeat reported this approach scored 94.87% on the LongMemEval benchmark, and can reduce costs by roughly 10x through compression and prompt caching.

By eliminating the retrieval step entirely, in theory you eliminate a major failure point. The downsides are that it's still new, requires running additional background agents, and the reflection process can invalidate the cache (reducing those cost savings).
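The shape of the idea can be sketched without any LLM at all. In a real system the Observer and Reflector would be model-driven; here the "observer" just truncates, and the compact dated log (not the raw history) is what stays in context:

```python
from datetime import date

def observe(turn: str, max_len: int = 60) -> str:
    """Stand-in for an LLM observer: compress a turn into a short note."""
    return turn if len(turn) <= max_len else turn[: max_len - 3] + "..."

log: list[str] = []

def record(turn: str, today: date) -> None:
    # Each exchange becomes one dated line in a structured log.
    log.append(f"[{today.isoformat()}] {observe(turn)}")

record("User asked for the Q3 revenue summary and flagged the deck deadline.",
       date(2026, 3, 8))
record("Agent drafted the summary; CFO comments still pending.",
       date(2026, 3, 9))

# The whole log is injected directly into the prompt - no retrieval step.
context_block = "\n".join(log)
```

The failure mode moves: instead of a bad retrieval, the risk is a bad compression, which is at least visible in the log.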

Best for: Long-running, multi-turn conversational agents where context accuracy is the priority and you're willing to be an early adopter.

File-based memory (a surprisingly effective option)

Simple text files, structured progress logs, timestamps. Think of it as a git history for your agent's state.

Letta's research found that simple file-based systems actually outperformed more specialized approaches in some benchmarks. Anthropic's own recommendation for long-running coding agents uses structured progress files. And the State and Memory paper argues that finite-state automata with explicit state tracking is often all you need for procedural tasks.

It's dead simple, human-readable, and easy to debug. It doesn't scale for complex multi-agent systems, and it won't give you semantic search. But for procedural workflows with clear sequential steps, don't overthink it.
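A sketch of what this looks like in practice, with an illustrative file name and entry format (the specifics aren't from any particular framework):

```python
import tempfile
from datetime import datetime
from pathlib import Path

# File-based agent memory: a human-readable, append-only progress log.
def log_progress(path: Path, step: str, now: datetime) -> None:
    with path.open("a", encoding="utf-8") as f:
        f.write(f"{now.isoformat(timespec='seconds')}  {step}\n")

progress = Path(tempfile.mkdtemp()) / "AGENT_PROGRESS.md"
log_progress(progress, "Parsed requirements; 3 open questions noted.",
             datetime(2026, 3, 9, 9, 0))
log_progress(progress, "Implemented exporter; tests failing on one edge case.",
             datetime(2026, 3, 9, 11, 30))

# The next "shift" recovers state by simply reading the file.
print(progress.read_text(encoding="utf-8"))
```

Debugging is `cat`-ing a file. That transparency is the whole point.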

Best for: Procedural automation, long-running coding tasks, predictable workflows where transparency and debuggability matter more than sophistication.

A practical crawl/walk/run approach for getting started

If you're an enterprise leader thinking about deploying agents, here's how we'd think about staging the work.

  • Start stateless (crawl). Begin with agents that don't need memory. Low-risk, self-contained tasks where every interaction is independent. Use this phase to build your internal competency around prompt engineering, observability, and cost management. Get good at measuring what agents actually do before you add the complexity of state.
  • Add simple state (walk). For internal, procedural tasks, experiment with file-based or state machine memory. The transparency and human-readability of these approaches make them the right place to learn how state management works in practice. You can actually inspect what the agent "remembers," which is more than you can say for most vector-based systems. Anthropic's own engineering team recommends this pattern for long-running agents for a reason.
  • Invest in robust architecture (run). For customer-facing, mission-critical, or complex reasoning agents, you need to invest in a hybrid or observational memory architecture. This isn't a weekend project. It requires a dedicated team and real investment in both technology and governance. But for the use cases that justify it, the difference between an agent with good memory and one with bad memory is the difference between a tool people trust and one they quietly stop using.
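The "walk" phase's state-machine memory can be this small. The states and transitions below are hypothetical; the point is that the agent's memory is an explicit, inspectable path through allowed states:

```python
# Allowed transitions for a procedural workflow (illustrative states).
TRANSITIONS: dict[str, set[str]] = {
    "received":  {"validated"},
    "validated": {"processed", "rejected"},
    "processed": {"archived"},
}

class ProceduralAgent:
    def __init__(self) -> None:
        self.state = "received"
        self.history = ["received"]  # inspectable memory: the full path so far

    def advance(self, next_state: str) -> None:
        if next_state not in TRANSITIONS.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {next_state}")
        self.state = next_state
        self.history.append(next_state)

agent = ProceduralAgent()
agent.advance("validated")
agent.advance("processed")
print(agent.history)  # ['received', 'validated', 'processed']
```

Unlike a vector store, an illegal "memory" here fails loudly at write time instead of silently polluting a future prompt.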

Memory is hard, but solvable.

Building production-grade AI agents with reliable memory is genuinely hard.

The research is moving fast - observational memory, Google's work on scaling agent systems across 180 configurations, the Mem0 framework showing 26% accuracy gains with 90% token savings - but we're still in the early chapters of figuring this out.

What we do know is that memory can't be an afterthought. It's not a feature you bolt on after the demo works. It's the foundation that determines whether your agent is a reliable tool or an expensive liability.

The memory problem is solvable. But solving it requires treating it with the same rigor you'd apply to any other piece of production infrastructure: architecture, security, governance, and a healthy respect for what can go wrong.
