Deployment & Infrastructure

Latency

The elapsed time from when an agent sends an LLM request to when it receives a complete response, measured as time-to-first-token (TTFT) and total generation time.

Definition

Latency in AI agent systems refers to the delay between issuing an LLM inference request and receiving a usable response. It is typically decomposed into two components: time-to-first-token (TTFT)—the delay before the first output token arrives, which determines perceived responsiveness—and total generation time, which scales with output length. In multi-step agentic pipelines, latency compounds across each LLM call, tool execution, and retrieval step, making it a critical system design parameter.
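The two components can be measured directly by timing a streaming response. The sketch below is a minimal illustration, assuming a generic token iterator as a stand-in for a streaming LLM client (no particular provider's API is implied):

```python
import time

def measure_latency(stream):
    """Measure TTFT and total generation time over a token stream.

    `stream` is any iterable yielding tokens as they are generated
    (a hypothetical stand-in for a streaming LLM client response).
    """
    start = time.perf_counter()
    ttft = None
    tokens = []
    for token in stream:
        if ttft is None:
            # Delay before the first token arrives: time-to-first-token.
            ttft = time.perf_counter() - start
        tokens.append(token)
    # Elapsed time for the full response: total generation time.
    total = time.perf_counter() - start
    return ttft, total, tokens

# Usage with a dummy stream that simulates per-token generation delay:
def dummy_stream():
    for t in ["Hello", " world"]:
        time.sleep(0.01)
        yield t

ttft, total, tokens = measure_latency(dummy_stream())
```

TTFT is always at most the total generation time; the gap between them grows with output length, which is why streaming improves perceived responsiveness even when total time is unchanged.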

Engineering Context

Latency is the primary user-experience metric for AI agents. Engineering targets should be set on P95 latency (the value below which 95% of requests complete; the slowest 5% exceed it) rather than the average, which hides tail behavior. For synchronous interactive workflows, a common target is P95 under 3 seconds. The main levers are model size (smaller models generate faster), quantization, caching, batching, and streaming (rendering output as tokens arrive rather than waiting for generation to complete). In multi-step agents, each tool call adds a round trip; parallelizing independent tool calls reduces total pipeline time. Prompt caching (reusing the KV-cache for a static system-prompt prefix) can reduce TTFT by 60-80% for long system prompts.
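Computing P95 from observed request latencies is straightforward. A minimal sketch using the nearest-rank method (the sample latencies are illustrative, not measurements from any real system):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample such that at least
    p% of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# Hypothetical per-request latencies in milliseconds.
latencies_ms = [120, 180, 200, 250, 300, 310, 400, 450, 900, 2800]

p50 = percentile(latencies_ms, 50)  # 300: typical request
p95 = percentile(latencies_ms, 95)  # 2800: the tail a user actually hits
```

Note how a single slow outlier dominates P95 while leaving the median nearly untouched; this is why averaging hides exactly the requests that degrade user experience.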
