Deployment & Infrastructure

Latency

The elapsed time from when an agent sends an LLM request to when it receives a complete response, measured as time-to-first-token (TTFT) and total generation time.

Definition

Latency in AI agent systems refers to the delay between issuing an LLM inference request and receiving a usable response. It is typically decomposed into two components: time-to-first-token (TTFT)—the delay before the first output token arrives, which determines perceived responsiveness—and total generation time, which scales with output length. In multi-step agentic pipelines, latency compounds across each LLM call, tool execution, and retrieval step, making it a critical system design parameter.
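The two components can be measured directly by timing a streaming response. The sketch below is a minimal illustration, assuming a generic token iterator as a stand-in for a streaming LLM client (no particular provider's API is implied):

```python
import time

def measure_latency(stream):
    """Measure TTFT and total generation time over a token stream.

    `stream` is any iterable yielding tokens as they are generated
    (a hypothetical stand-in for a streaming LLM client response).
    """
    start = time.perf_counter()
    ttft = None
    tokens = []
    for token in stream:
        if ttft is None:
            # Delay before the first token arrives: time-to-first-token.
            ttft = time.perf_counter() - start
        tokens.append(token)
    # Elapsed time for the full response: total generation time.
    total = time.perf_counter() - start
    return ttft, total, tokens

# Usage with a dummy stream that simulates per-token generation delay:
def dummy_stream():
    for t in ["Hello", " world"]:
        time.sleep(0.01)
        yield t

ttft, total, tokens = measure_latency(dummy_stream())
```

TTFT is always at most the total generation time; the gap between them grows with output length, which is why streaming improves perceived responsiveness even when total time is unchanged.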

Engineering Context

Latency is the primary user-experience metric for AI agents. Engineering targets should be set on P95 latency (the value below which 95% of requests complete; the slowest 5% exceed it) rather than the average, which hides tail behavior. For synchronous interactive workflows, a common target is P95 under 3 seconds. The main levers are model size (smaller models generate faster), quantization, caching, batching, and streaming (rendering output as tokens arrive rather than waiting for generation to complete). In multi-step agents, each tool call adds a round trip; parallelizing independent tool calls reduces total pipeline time. Prompt caching (reusing the KV-cache for a static system-prompt prefix) can reduce TTFT by 60-80% for long system prompts.
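Computing P95 from observed request latencies is straightforward. A minimal sketch using the nearest-rank method (the sample latencies are illustrative, not measurements from any real system):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample such that at least
    p% of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# Hypothetical per-request latencies in milliseconds.
latencies_ms = [120, 180, 200, 250, 300, 310, 400, 450, 900, 2800]

p50 = percentile(latencies_ms, 50)  # 300: typical request
p95 = percentile(latencies_ms, 95)  # 2800: the tail a user actually hits
```

Note how a single slow outlier dominates P95 while leaving the median nearly untouched; this is why averaging hides exactly the requests that degrade user experience.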
