19 · 2025 · SCAD
Prompt & RAG Evaluation Harness
Every prompt or retrieval tweak was being shipped on vibes: there was no way to measure regressions on the RAG and NL-to-SQL systems.
An eval harness in the open-source style: 400+ graded queries, judge-LLM scoring, and a CI step that blocks regressions in retrieval recall and answer quality.
- Caught 6 regressions before they hit production
- Made prompt iteration data-driven instead of intuition-driven
- Adopted as the gate on every RAG/NL-to-SQL change
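The regression gate described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `GradedQuery`, `recall_at_k`, `find_regressions`, and `gate`, the metric keys, and the 0.01 tolerance are all hypothetical. It assumes each graded query carries a set of gold document IDs, and that judge-LLM answer-quality scores have already been computed and averaged upstream.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GradedQuery:
    """One graded query (hypothetical shape)."""
    query_id: str
    relevant_ids: frozenset[str]     # gold document IDs from the graded set
    retrieved_ids: tuple[str, ...]   # ranked IDs returned by the retriever


def recall_at_k(query: GradedQuery, k: int) -> float:
    """Fraction of gold documents that appear in the top-k retrieved results."""
    if not query.relevant_ids:
        return 1.0  # nothing to find, so nothing was missed
    hits = query.relevant_ids & set(query.retrieved_ids[:k])
    return len(hits) / len(query.relevant_ids)


def mean_recall_at_k(queries: list[GradedQuery], k: int) -> float:
    """Average recall@k over a suite of graded queries."""
    if not queries:
        raise ValueError("suite contains no queries")
    return sum(recall_at_k(q, k) for q in queries) / len(queries)


def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,  # assumed slack for judge/retrieval noise
) -> dict[str, tuple[float, float]]:
    """Return metrics where the candidate fell more than `tolerance` below baseline.

    A metric missing from the candidate run is treated as 0.0, so it fails.
    """
    return {
        name: (base, candidate.get(name, 0.0))
        for name, base in baseline.items()
        if candidate.get(name, 0.0) < base - tolerance
    }


def gate(baseline: dict[str, float], candidate: dict[str, float]) -> int:
    """Return a process exit code: 0 to pass the CI step, 1 to block the change."""
    regressions = find_regressions(baseline, candidate)
    for name, (base, cand) in regressions.items():
        print(f"REGRESSION {name}: baseline={base:.3f} candidate={cand:.3f}")
    return 1 if regressions else 0
```

In a CI job, a script would call `sys.exit(gate(baseline, candidate))`, since a non-zero exit code fails the step. Where the baseline metrics are stored (for example, a versioned metrics file) is left open here.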
2 suites · 400+ queries · 6 regressions caught
GPT-4 (judge) · Python · GitHub Actions · DVC