A golden set that started as five hand-curated pairs in Foundation has to scale to something more like 100–500 examples to give you real signal in production. That requires structure beyond "a folder full of JSON files."
Three buckets
- Coverage — one example per category your system should handle. Refreshed when you add new categories.
- Edge cases — examples that exercise specific failure modes you've already hit in production. Each one is a regression test.
- Live traffic samples — periodically sampled real inputs, with the team labeling the right answer. Keeps the set honest about what your system actually sees.
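One way to make the three buckets concrete is to tag every example with the bucket it belongs to, so the set can be audited instead of just stored. The sketch below is a minimal illustration, not a prescribed schema: the `GoldenExample` fields, the bucket names, and both helper functions are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Literal

# Hypothetical bucket labels mirroring the three buckets described above.
Bucket = Literal["coverage", "edge_case", "live_traffic"]


@dataclass(frozen=True)
class GoldenExample:
    """One golden-set record. Field names are illustrative assumptions."""
    id: str
    bucket: Bucket
    category: str
    input: str
    expected: str
    note: str = ""  # e.g. the production failure an edge case guards against


def bucket_counts(examples: Iterable[GoldenExample]) -> Counter:
    """Count examples per bucket to see how the set is balanced."""
    return Counter(ex.bucket for ex in examples)


def uncovered_categories(
    examples: Iterable[GoldenExample], categories: Iterable[str]
) -> set[str]:
    """Return categories that have no example in the coverage bucket."""
    covered = {ex.category for ex in examples if ex.bucket == "coverage"}
    return set(categories) - covered
```

Running `uncovered_categories` whenever a new category ships is one way to notice that the coverage bucket needs a refresh.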
Auto-grading vs. human review
Auto-grade what you can: exact match on category, JSON-schema validity, length range. Use an "LLM-as-judge" for soft quality (helpfulness, tone) — but only after you've calibrated the judge against human labels.
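The deterministic checks and the calibration step could be sketched roughly as follows. This assumes the model returns a JSON object with `category` and `reply` keys, which the lesson does not specify. The required-fields check is a standard-library stand-in for a real JSON Schema validator, and `grade_output` and `judge_agreement` are hypothetical helper names.

```python
import json


def grade_output(
    raw_output: str,
    expected_category: str,
    required_fields: dict[str, type],
    min_len: int,
    max_len: int,
) -> dict[str, bool]:
    """Run the cheap deterministic checks on one model output."""
    checks = {
        "valid_json": False,
        "schema_ok": False,
        "category_match": False,
        "length_ok": False,
    }
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return checks
    checks["valid_json"] = True
    if not isinstance(data, dict):
        return checks  # parsed, but not the object shape we assume
    # Stand-in for schema validation: required keys exist with the right type.
    checks["schema_ok"] = all(
        isinstance(data.get(key), expected_type)
        for key, expected_type in required_fields.items()
    )
    # Exact match on category.
    checks["category_match"] = data.get("category") == expected_category
    # Length range on the free-text reply (assumed field name).
    reply = data.get("reply")
    checks["length_ok"] = isinstance(reply, str) and min_len <= len(reply) <= max_len
    return checks


def judge_agreement(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of examples where the LLM judge matches the human label."""
    if not human_labels or len(judge_labels) != len(human_labels):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```

Raw agreement is the simplest calibration number. What counts as "calibrated enough" is a team decision that this sketch does not settle.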