01 · 2025 · SCAD

Enterprise RAG Document Intelligence System

How we replaced 2 hours of analyst time with 10 seconds of GPT-4 — for 100,000+ government documents.

Architecture

[Diagram: rag.pipeline. Corpus (100K+ docs, AR + EN, policy · stats) → Index (semantic chunks · 300 tok, Gemini embeddings · 3072d, Pinecone serverless) → Retrieve (hybrid BM25 + vector · RRF, Cohere cross-encoder re-rank, top-5 chunks) → Generate · Cite (Groq · Llama 3.3 70B, streaming · <800ms, answer + citations). Eval harness: 200-question gold set runs on every change. 92% accuracy · 65% cost cut · 5K+ queries/mo · <2s end-to-end.]

Before

Analysts spent 2-3 hours per query digging through SharePoint folders, PDFs, and legacy reports. Knowledge that existed in the organisation was effectively invisible.

After

Every analyst now gets cited answers in under 10 seconds. The system handles 5,000+ queries a month at 92% accuracy, has been running 24/7 for over a year, and pays for itself many times over each week.

Challenge

100K+ government documents scattered across legacy systems with no unified search.

Approach

Architected an end-to-end RAG pipeline on Azure OpenAI and Azure Cognitive Search, with hybrid retrieval and re-ranking.

How it was built

  1. Discovery

    Weeks 1–2

    Interviewed 12 analysts across 4 departments and mapped how they actually search. It turned out that 60% of queries were semantic ("what's our methodology for X"), not keyword. This single insight killed the plan to simply make SharePoint search better.

  2. Prototype

    Weeks 3–6

    Three prototypes, three failures. v1 used 1024-token chunks (vague answers). v2 used pure vector search (missed exact terms). v3 finally combined hybrid retrieval + re-ranking and crossed the 85% accuracy threshold needed to ship.

  3. Evaluation harness

    Weeks 7–8

    Built a 200-question gold-standard test set with domain experts. Every code change now runs the eval before merging. This slowed development for 2 weeks, then accelerated everything for the next 12 months.

  4. Production hardening

    Weeks 9–12

    Citation post-processing, Arabic support, document permissions, rate limiting, observability dashboards. The unsexy 80% that separates demo from product.

  5. Launch & iterate

    Month 4 — Present

    Soft launch to 20 analysts, then 200, then org-wide. Weekly review of failure cases. Cost dropped 65% over 6 months through prompt + context optimisation.
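The merge gate from the evaluation-harness step can be sketched roughly as below. This is a minimal illustration, not the production harness: `gold_set_accuracy`, `passes_gate`, and the `contains_expected` grader are hypothetical names, the gold set is assumed to be (question, expected answer) pairs, and the 0.85 default simply reuses the ship threshold mentioned in the prototype step.

```python
from typing import Callable, Sequence, Tuple

GoldSet = Sequence[Tuple[str, str]]  # assumption: (question, expected answer) pairs


def gold_set_accuracy(
    gold: GoldSet,
    answer_fn: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
) -> float:
    """Fraction of gold questions that the pipeline answers correctly."""
    if not gold:
        raise ValueError("gold set is empty")
    correct = sum(
        1 for question, expected in gold if is_correct(answer_fn(question), expected)
    )
    return correct / len(gold)


def passes_gate(gold, answer_fn, is_correct, threshold: float = 0.85) -> bool:
    """True when accuracy meets the threshold, i.e. the change may merge."""
    return gold_set_accuracy(gold, answer_fn, is_correct) >= threshold


def contains_expected(answer: str, expected: str) -> bool:
    """Toy grader (assumption): the expected string appears in the answer."""
    return expected.lower() in answer.lower()


# Toy data standing in for the real 200-question set and the real pipeline.
toy_gold = [("q1", "alpha"), ("q2", "beta")]
toy_answers = {"q1": "The answer is Alpha.", "q2": "The answer is gamma."}

toy_accuracy = gold_set_accuracy(toy_gold, toy_answers.get, contains_expected)
toy_gate = passes_gate(toy_gold, toy_answers.get, contains_expected)
```

In CI, a failing gate would typically exit non-zero so the merge is blocked. A real grader would likely be stricter than substring matching.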

Key architecture decisions

Hybrid retrieval (BM25 + Vector) over pure semantic

Why · Pure vector search missed exact terms (numbers, acronyms, proper nouns) that analysts cared about. RRF fusion gave us +14 points on NDCG@5.
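For readers unfamiliar with RRF, here is a minimal sketch of reciprocal rank fusion over two ranked lists. It is an illustration under assumptions rather than the project's code: the function name, the document ids, and the constant `k=60` (a commonly used default) are not taken from the system described here.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of doc ids; each list contributes 1 / (k + rank) per doc."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the keyword (BM25) and vector retrievers.
bm25_hits = ["doc_b", "doc_a", "doc_c"]
vector_hits = ["doc_a", "doc_d", "doc_b"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Because RRF uses only ranks, BM25 scores and cosine similarities never need to be normalised against each other. A document that appears in both lists (here `doc_a` and `doc_b`) outranks one that appears in only one.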

Semantic chunking over fixed-token chunking

Why · Splitting on section boundaries instead of token counts improved accuracy by ~20% before we even touched the model.
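A rough sketch of section-boundary chunking follows. It makes assumptions the case study does not state: sections are taken to start with markdown-style `#` headings, size is measured in whitespace-separated words rather than model tokens, and oversized sections fall back to fixed windows. `chunk_by_section` is a hypothetical helper name.

```python
import re
from typing import List

# Assumption: a section starts at a markdown-style heading line.
HEADING = re.compile(r"^#{1,6}\s+\S")


def chunk_by_section(text: str, max_words: int = 300) -> List[str]:
    """Split on section headings first; window only sections that are too long."""
    sections: List[List[str]] = []
    current: List[str] = []
    for line in text.splitlines():
        if HEADING.match(line) and current:
            sections.append(current)  # close the previous section
            current = []
        current.append(line)
    if current:
        sections.append(current)

    chunks: List[str] = []
    for lines in sections:
        words = " ".join(lines).split()
        # A section within the limit yields one chunk; longer ones are windowed.
        for start in range(0, len(words), max_words):
            chunks.append(" ".join(words[start:start + max_words]))
    return chunks


doc = "# Scope\nCovers census data.\n# Method\nStratified sampling."
chunks = chunk_by_section(doc)
```

Each chunk keeps its heading, so a retrieved passage still carries the context of the section it came from.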

Cross-encoder re-ranking

Why · Bi-encoder similarity is fast but imprecise. Reranking top-50 candidates with Cohere rerank-v3 reduced GPT-4 context window costs by 65% while improving precision.
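The shape of the re-ranking step can be sketched as below: score every candidate against the query, then keep only the best few for the prompt. The scorer is pluggable on purpose. `overlap_score` is a toy stand-in (an assumption for illustration), where the real system would call a cross-encoder such as the Cohere reranker. The default `top_n=5` follows the top-5 shown in the pipeline diagram.

```python
from typing import Callable, List, Sequence


def rerank(
    query: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
    top_n: int = 5,
) -> List[str]:
    """Score each (query, passage) pair and keep the top_n passages."""
    scored = [(score(query, text), index, text) for index, text in enumerate(candidates)]
    # Best score first; the original retrieval order breaks ties.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [text for _, _, text in scored[:top_n]]


def overlap_score(query: str, passage: str) -> float:
    """Toy scorer (assumption): fraction of query words present in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0


candidates = [
    "budget report 2021",
    "census methodology for population sampling",
    "population census",
]
top = rerank("census methodology", candidates, overlap_score, top_n=2)
```

The cost saving comes from the truncation: only `top_n` passages reach the generator, so fewer tokens go into each prompt.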

Citation-by-default in the prompt

Why · Users don't trust answers they can't verify. Structured [SOURCE:doc_id,page_n] tagging turned the system from "helpful" to "trustworthy."
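Citation post-processing for tags of this shape could look roughly like the sketch below. The exact tag grammar is an assumption: a doc id without commas or spaces, followed by a numeric page, as in `[SOURCE:doc_001,4]`. Both function names are hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Assumed tag grammar: [SOURCE:<doc_id>,<page number>]
CITATION = re.compile(r"\[SOURCE:([^,\]\s]+),\s*(\d+)\]")


def extract_citations(answer: str) -> List[Tuple[str, int]]:
    """Return (doc_id, page) pairs in the order they appear in the answer."""
    return [(doc_id, int(page)) for doc_id, page in CITATION.findall(answer)]


def unverified_citations(
    answer: str, retrieved_doc_ids: Iterable[str]
) -> List[Tuple[str, int]]:
    """Citations pointing at documents that were never retrieved for this query."""
    allowed = set(retrieved_doc_ids)
    return [cite for cite in extract_citations(answer) if cite[0] not in allowed]


answer = "Costs fell [SOURCE:doc_001,4] after re-ranking [SOURCE:doc_009,12]."
citations = extract_citations(answer)
suspect = unverified_citations(answer, ["doc_001", "doc_002"])
```

Flagging citations that do not match a retrieved document is one simple guard against the model inventing sources.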

Impact

  • Reduced information retrieval time from 2–3 hours to under 10 seconds
  • Achieved 92% accuracy on complex multi-document queries
  • Processes 5K+ queries monthly with 87% user satisfaction
  • Cut document research costs by 65% through automation
10s time · 92% accuracy · -65% cost · 5K+/mo queries

What I'd tell someone building this

  • 01 · Evaluation infrastructure pays for itself within a month. Build it first, not last.
  • 02 · Users prefer accurate uncertainty over confident hallucination. Teach the model to say "I don't know."
  • 03 · Chunking strategy and prompt design move the needle 10× more than picking the latest model.
  • 04 · Government Arabic-English content needs first-class language handling — not an afterthought.
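One way to act on lesson 02 is to abstain before generation when retrieval comes back weak. The sketch below is an assumption, not a description of how this system does it: `answer_or_abstain`, the `min_score` cutoff of 0.5, and the abstention wording are all hypothetical, and relevance scores are assumed to lie between 0 and 1.

```python
from typing import Callable, List, Sequence, Tuple

ABSTAIN = "I don't know: no retrieved passage was relevant enough to answer this."


def answer_or_abstain(
    scored_chunks: Sequence[Tuple[float, str]],
    generate: Callable[[List[str]], str],
    min_score: float = 0.5,  # hypothetical cutoff; tune against the gold set
) -> str:
    """Generate only from chunks above the relevance cutoff, else abstain."""
    supported = [text for score, text in scored_chunks if score >= min_score]
    if not supported:
        return ABSTAIN
    return generate(supported)


def fake_generate(chunks: List[str]) -> str:
    """Stand-in for the LLM call (assumption)."""
    return f"answer from {len(chunks)} chunk(s)"


confident = answer_or_abstain([(0.92, "a"), (0.30, "b")], fake_generate)
abstained = answer_or_abstain([(0.20, "x")], fake_generate)
```

A prompt instruction to admit uncertainty complements this check rather than replacing it.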

"Mazhar's RAG system gave us a year of analyst productivity back in three months. The numbers speak for themselves, and the architecture is clean enough that we extended it to two more departments without his help."
Senior Director, Digital Transformation · SCAD

Tech stack

GPT-4 · Azure Cognitive Search · LangChain · Pinecone · Azure Functions · .NET Core · Angular
