Retrieval-Augmented Generation (RAG)

An architecture that dynamically retrieves relevant documents or data at inference time and injects them into the LLM context, grounding responses in up-to-date, verifiable sources.

Definition

Rather than relying solely on knowledge encoded in the model's weights during training, a RAG system queries an external knowledge base at runtime, retrieves the most relevant content, and provides it directly to the model as context. This makes RAG-powered agents capable of answering questions about private, proprietary, or recently updated information without retraining the model.
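The runtime step described above — injecting retrieved content into the model's context — can be sketched as simple prompt assembly. This is a minimal illustration; the function name, source labels, and prompt wording are all hypothetical, not a standard API.

```python
# Sketch of the injection step: retrieved passages are placed into the
# prompt so the model answers from sources rather than parametric memory.
def build_prompt(question, retrieved_passages):
    # Label each passage so the model (and the user) can attribute claims.
    context = "\n\n".join(
        f"[Source {i + 1}] {p}" for i, p in enumerate(retrieved_passages)
    )
    return (
        "Answer the question using only the sources below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_prompt(
    "When was the policy last updated?",
    ["The policy was last updated on 2024-03-01."],
)
print(prompt)
```

The assembled prompt, not the model's weights, now carries the up-to-date facts, which is what makes the response verifiable against its sources.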

Engineering Context

RAG is the primary pattern for giving LLMs access to private or frequently updated knowledge bases without fine-tuning. A RAG pipeline consists of: (1) an indexing phase where documents are chunked, embedded, and stored in a vector database, and (2) a retrieval phase where the user query is embedded, similar chunks are retrieved via approximate nearest neighbor search, and the top chunks are injected into the prompt. Key engineering decisions include chunking strategy (semantic vs. fixed-size), embedding model quality, similarity metric (cosine vs. dot product), reranking (cross-encoder reranking improves precision significantly), and how many chunks to inject within the context budget. Hybrid search combining dense embeddings with BM25 sparse retrieval often outperforms either method alone.
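The two phases above can be sketched end to end in a few lines. This toy version uses fixed-size word chunking, a bag-of-words vector in place of a learned embedding model, exact cosine similarity in place of approximate nearest neighbor search, and an in-memory list in place of a vector database — every one of those is a stand-in for the production component named in the text.

```python
import math
import re
from collections import Counter

def chunk(text, size=40):
    # Fixed-size chunking by word count; semantic chunking is the alternative.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    # Toy bag-of-words "embedding"; real pipelines use a trained model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Indexing phase: chunk each document, embed the chunks, store them.
docs = [
    "The refund policy allows returns within 30 days of purchase.",
    "Shipping is free for orders above 50 dollars in the EU.",
]
index = [(c, embed(c)) for d in docs for c in chunk(d)]

# Retrieval phase: embed the query, rank chunks by similarity, take top-k.
def retrieve(query, k=1):
    q = embed(query)
    ranked = sorted(index, key=lambda ce: cosine(q, ce[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

print(retrieve("What is the refund policy?"))
```

The retrieved chunks would then be injected into the prompt within the context budget; a reranking stage (e.g., a cross-encoder scoring each query-chunk pair) would sit between `retrieve` and injection.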
