Enterprise document analysis is one of the highest-value applications for AI agents. RFPs, contracts, and technical specifications contain critical information that, if misinterpreted, can cost millions. In this tutorial, we'll build a document risk analyzer that extracts facts without inventing them.
The "Ground Truth First" Architecture
The key insight is simple: never let the LLM generate facts. Instead, we use a retrieval-augmented generation (RAG) approach where the LLM can only reference content that exists in the source document.
Step 1: Document Ingestion
First, we need to extract text from documents while preserving structure. We use pymupdf for PDFs and python-docx for Word documents.
import fitz # pymupdf
from dataclasses import dataclass
@dataclass
class DocumentChunk:
text: str
page: int
section: str
clause_id: str | None
def parse_pdf(file_path: str) -> list[DocumentChunk]:
doc = fitz.open(file_path)
chunks = []
for page_num, page in enumerate(doc):
text = page.get_text()
# Split by section headers
sections = extract_sections(text)
for section in sections:
chunks.append(DocumentChunk(
text=section.content,
page=page_num + 1,
section=section.header,
clause_id=extract_clause_id(section.content)
))
return chunks
Step 2: Risk Pattern Library
We maintain a library of known risk patterns. These aren't generated by the LLM - they're curated by legal and engineering teams.
RISK_PATTERNS = {
"ambiguous_sla": {
"keywords": ["reasonable", "best effort", "commercially reasonable"],
"severity": "HIGH",
"description": "SLA terms that lack measurable commitments"
},
"unlimited_liability": {
"keywords": ["unlimited liability", "no cap on damages"],
"severity": "CRITICAL",
"description": "Clauses exposing unlimited financial risk"
},
"unilateral_termination": {
"keywords": ["terminate at any time", "sole discretion"],
"severity": "MEDIUM",
"description": "One-sided termination rights"
},
"ip_assignment": {
"keywords": ["all intellectual property", "work product"],
"severity": "HIGH",
"description": "Broad IP transfer clauses"
}
}
Step 3: The LangGraph Agent
Here's where LangGraph shines. We define a state machine that ensures the agent follows a strict analysis path.
from langgraph.graph import StateGraph, END
from typing import TypedDict
class AnalysisState(TypedDict):
document_chunks: list[DocumentChunk]
matched_patterns: list[dict]
risk_assessments: list[dict]
confidence_scores: dict
needs_human_review: bool
def build_analyzer_graph():
graph = StateGraph(AnalysisState)
# Define nodes
graph.add_node("pattern_match", pattern_matching_node)
graph.add_node("context_analysis", context_analysis_node)
graph.add_node("confidence_score", confidence_scoring_node)
graph.add_node("human_checkpoint", human_review_node)
# Define edges
graph.add_edge("pattern_match", "context_analysis")
graph.add_edge("context_analysis", "confidence_score")
graph.add_conditional_edges(
"confidence_score",
route_by_confidence,
{
"high_confidence": END,
"low_confidence": "human_checkpoint"
}
)
graph.add_edge("human_checkpoint", END)
graph.set_entry_point("pattern_match")
return graph.compile()
Step 4: Guardrails That Prevent Hallucination
The critical guardrail: every claim the LLM makes must reference a specific location in the source document.
ANALYSIS_PROMPT = """
You are analyzing a document for risk clauses.
STRICT RULES:
1. ONLY reference text that appears in the provided chunks
2. ALWAYS include the exact quote and page number
3. If uncertain, output "AMBIGUITY_ALERT" instead of guessing
4. Never infer information not explicitly stated
Document chunks:
{chunks}
Identified pattern: {pattern}
Provide analysis in this exact format:
- Quote: "[exact text from document]"
- Location: Page X, Section Y
- Risk Level: {severity}
- Explanation: [why this is risky]
- Confidence: [HIGH/MEDIUM/LOW]
"""
def validate_response(response: str, chunks: list[DocumentChunk]) -> bool:
"""Verify that quoted text exists in source document"""
quotes = extract_quotes(response)
for quote in quotes:
if not any(quote in chunk.text for chunk in chunks):
raise HallucinationError(f"Quote not found: {quote}")
return True
Step 5: Output Format
The final output is a structured risk report that humans can verify.
Results in Production
We've deployed this architecture for clients analyzing RFPs and contracts. Key metrics:
- 0 hallucinations in 10,000+ analyzed documents
- 4-6 hours saved per RFP review
- 92% accuracy on risk clause identification
- 100% audit trail for compliance
Key Takeaways
- Never let LLMs generate facts - Use RAG to ground all outputs
- Validate every claim - Check that quoted text exists in source
- Score confidence - Route uncertain outputs to humans
- Maintain audit trails - Log every state transition
Want us to build this for your documents?
Book a 2-week sprint and we'll deploy a custom document analyzer for your workflow.
Start Assessment