Memory & Storage

Working Memory

The information present in an LLM's context window during a single inference call, representing the agent's immediate cognitive workspace.

Definition

Working memory in AI agents is the totality of information visible to the LLM during a single inference call—the content of the context window. It is the agent's "active mind": everything it can reason about is bounded by what fits within this space. Working memory is transient; once the inference call completes, none of its contents persist unless explicitly written to an external storage system. Unlike human working memory, which is limited by neural capacity, AI working memory is limited by a hard token budget determined by the model's maximum context length.

Engineering Context

Working memory is the most expensive and most constrained memory tier. Engineers must explicitly manage what enters the context: the system prompt (~500-2000 tokens), retrieved knowledge (~1000-4000 tokens), conversation history (~500-2000 tokens), and current task state (~200-500 tokens). Compress aggressively, summarizing conversation history rather than naively appending it; token budget allocation is a first-class engineering concern. Context overflow, where the input exceeds the model's maximum context length, causes either a hard failure or silent truncation depending on the serving framework. Design agent prompts with explicit per-component budget allocations, and use a token counting library (such as tiktoken for OpenAI-compatible models) to enforce limits programmatically before inference.
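The budget-enforcement approach above can be sketched as follows. The per-component budgets mirror the figures in the text; `count_tokens` is a hypothetical stand-in that approximates roughly four characters per token (in practice you would substitute a real tokenizer, e.g. tiktoken for OpenAI-compatible models), and `build_context` is an illustrative name, not an established API.

```python
# Sketch: enforce per-component token budgets before inference,
# so overflow is caught deliberately rather than causing a hard
# failure or silent truncation at the serving layer.

# Rough heuristic (~4 chars/token); swap in a real tokenizer
# such as tiktoken in production.
def count_tokens(text: str) -> int:
    return (len(text) + 3) // 4

# Per-component budgets, taken from the figures in the text above.
BUDGETS = {
    "system_prompt": 2000,
    "retrieved_knowledge": 4000,
    "conversation_history": 2000,
    "task_state": 500,
}

def build_context(components: dict, max_context: int = 8192) -> str:
    """Assemble a context string, truncating each component to its
    budget and rejecting the whole request if the total still exceeds
    the model's maximum context length."""
    parts, total = [], 0
    for name, budget in BUDGETS.items():
        text = components.get(name, "")
        tokens = count_tokens(text)
        if tokens > budget:
            # Truncate explicitly instead of letting the server do it.
            text = text[: budget * 4]
            tokens = budget
        total += tokens
        if text:
            parts.append(text)
    if total > max_context:
        raise ValueError(f"context overflow: {total} > {max_context} tokens")
    return "\n\n".join(parts)
```

In a real agent, the truncation step would typically be replaced by summarization (as the text recommends) rather than a hard character cut; the key point is that the check runs before the inference call, not after it fails.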
