Definition
Inference is the process of running a trained LLM to generate outputs from new inputs. Unlike training, which updates model weights via gradient descent over large datasets, inference uses the fixed weights of an already-trained model to produce predictions. Every LLM API call is an inference operation: the model receives a tokenized prompt, runs a forward pass through its transformer layers to produce a probability distribution over the vocabulary, samples the next token, and repeats until the response is complete. In live systems, inference is the production-time activity: essentially all of an AI agent's model computation happens here.
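The generation loop above can be sketched in a few lines of Python. The four-token vocabulary and the `forward` function below are hypothetical stand-ins for a real transformer forward pass over fixed weights; the loop structure (forward pass, pick next token, append, repeat until end-of-sequence) is the part being illustrated.

```python
import math

# Toy vocabulary; id 0 is the end-of-sequence token.
VOCAB = ["<eos>", "hello", "world", "!"]

def forward(tokens):
    """Stand-in for a transformer forward pass: returns logits over VOCAB.
    This toy version just favours the token after the last one seen."""
    last = tokens[-1] if tokens else 0
    logits = [0.0] * len(VOCAB)
    logits[(last + 1) % len(VOCAB)] = 5.0
    return logits

def softmax(logits):
    """Convert logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt_tokens, max_new_tokens=10):
    """Autoregressive decoding: one forward pass per generated token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = softmax(forward(tokens))    # forward pass -> token probabilities
        next_tok = probs.index(max(probs))  # greedy sampling (argmax)
        tokens.append(next_tok)
        if next_tok == 0:                   # stop at <eos>
            break
    return tokens
```

Note that each new token requires a fresh forward pass over the whole sequence, which is why KV-caching (reusing attention state for the already-processed prefix) matters so much in real serving stacks.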
Engineering Context
Inference is the dominant cost driver in production AI systems. Key metrics to monitor: time-to-first-token (TTFT, critical for streaming UX, since users perceive when the response starts rather than total generation time), tokens-per-second throughput (output speed), and cost per 1M tokens (input and output are priced separately). Inference optimization strategies include model quantization (reducing precision from fp16 to int8/int4 to cut memory use and increase throughput), KV-cache management (caching attention key/value pairs for repeated prefixes), request batching, and prompt caching (supported by OpenAI and Anthropic for system prompts reused across calls). Cloud inference offers operational simplicity at the expense of per-token cost and data control; on-premise inference requires GPU hardware investment but enables full data privacy.
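A minimal sketch of per-call cost accounting with separately priced input and output tokens; the prices used in the example are hypothetical, not any provider's actual rates:

```python
def inference_cost_usd(input_tokens, output_tokens,
                       input_price_per_m, output_price_per_m):
    """Cost of one inference call, given per-1M-token prices.
    Input and output tokens are priced separately, as in typical LLM APIs."""
    return (input_tokens / 1_000_000 * input_price_per_m
            + output_tokens / 1_000_000 * output_price_per_m)

# Hypothetical pricing: $3.00 per 1M input tokens, $15.00 per 1M output tokens.
# A call with a 2,000-token prompt and a 500-token response:
cost = inference_cost_usd(2_000, 500, 3.00, 15.00)  # 0.006 + 0.0075 = 0.0135
```

Because output tokens are usually several times more expensive than input tokens, capping response length and caching reused prompt prefixes are often the highest-leverage cost controls.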