19 · 2025 · SCAD

Prompt & RAG Evaluation Harness

Challenge

Every prompt or retrieval tweak was being shipped on vibes: there was no way to measure regressions in the RAG and NL-to-SQL systems.

Approach

Open-source-flavoured eval harness with ~400 graded queries, judge-LLM scoring, and a CI step that blocks regressions in retrieval recall and answer quality.
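As a rough illustration of the CI gate described above, here is a minimal Python sketch of a retrieval-recall metric and a baseline comparison. It is not the project's actual code: the names GradedQuery, recall_at_k, mean_recall and find_regressions, the metric keys, and the 0.01 tolerance are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class GradedQuery:
    """One graded query (hypothetical shape): ranked retrieved IDs plus the IDs marked relevant."""
    retrieved: list[str]
    relevant: set[str]


def recall_at_k(query: GradedQuery, k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved results."""
    if not query.relevant:
        return 1.0  # assumption: a query with nothing to find counts as a pass
    hits = len(set(query.retrieved[:k]) & query.relevant)
    return hits / len(query.relevant)


def mean_recall(queries: list[GradedQuery], k: int) -> float:
    """Average recall@k over a non-empty suite of graded queries."""
    return sum(recall_at_k(q, k) for q in queries) / len(queries)


def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,  # assumed noise margin, not a value from the project
) -> list[str]:
    """Names of metrics where the candidate fell below the baseline by more than the tolerance.

    A metric missing from the candidate is treated as 0.0, so it is flagged.
    """
    return [
        name
        for name, base in baseline.items()
        if candidate.get(name, 0.0) < base - tolerance
    ]
```

In a CI job, a script along these lines could compute the candidate metrics (recall plus an averaged judge score), call find_regressions against the stored baseline, and exit with a non-zero status when the returned list is non-empty, which is what would block the change.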

Impact

Caught 6 regressions before they hit production
Made prompt iteration data-driven instead of intuition-driven
Adopted as the gate on every RAG/NL-to-SQL change

suites

400+ queries

6 regressions caught

Tech stack

GPT-4 (judge) · Python · GitHub Actions · DVC
