19 · 2025 · SCAD

Prompt & RAG Evaluation Harness

Challenge

Every prompt or retrieval tweak was being shipped on vibes: there was no way to measure regressions in the RAG and NL-to-SQL systems.

Approach

Open-source-flavoured eval harness with ~400 graded queries, judge-LLM scoring, and a CI step that blocks regressions in retrieval recall and answer quality.
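
As a rough illustration of how a gate like this can work, here is a minimal sketch, not the project's actual code. It assumes each graded query carries gold document ids, the retriever's ranked ids, and a judge-LLM grade already normalised to [0, 1]. All names (`GradedQuery`, `recall_at_k`, `summarize`, `find_regressions`) and the 0.01 tolerance are hypothetical, and the call to the judge model is left out.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean


@dataclass
class GradedQuery:
    """One graded query. Field names are illustrative assumptions."""
    query: str
    relevant_ids: set[str]    # gold document ids for this query
    retrieved_ids: list[str]  # ranked ids returned by the retriever
    judge_score: float        # judge-LLM grade in [0, 1], computed elsewhere


def recall_at_k(relevant: set[str], retrieved: list[str], k: int) -> float:
    """Fraction of gold documents that appear in the top-k retrieved ids."""
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    hits = relevant & set(retrieved[:k])
    return len(hits) / len(relevant)


def summarize(results: list[GradedQuery], k: int = 5) -> dict[str, float]:
    """Average retrieval recall and judge score over a suite."""
    return {
        "recall_at_k": mean(
            recall_at_k(r.relevant_ids, r.retrieved_ids, k) for r in results
        ),
        "answer_quality": mean(r.judge_score for r in results),
    }


def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,  # assumed slack to absorb judge noise
) -> list[str]:
    """Names of metrics where the candidate falls below baseline - tolerance."""
    return [
        name
        for name, base in baseline.items()
        if candidate.get(name, 0.0) < base - tolerance
    ]
```

In a CI job, a script along these lines would load the stored baseline metrics, run `summarize` on the candidate's results, and exit non-zero whenever `find_regressions` returns a non-empty list, which is what fails the check.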

Impact

  • Caught 6 regressions before they hit production
  • Made prompt iteration data-driven instead of intuition-driven
  • Adopted as the gate on every RAG/NL-to-SQL change

2 suites · 400+ queries · 6 regressions caught

Tech stack

GPT-4 (judge) · Python · GitHub Actions · DVC
