## What you’ll have when you finish
- one eval definition
- one repeatable rubric
- one run group comparing candidate models on the same dataset
## Before you start
- create an eval dataset (see /guides/create-datasets-from-observed-traffic)
- identify the current production model you want to compare against
## Step 1: define the rubric around the task outcome

Do not start with a vague quality prompt like “is this good?” Start with the actual product outcome you need:

- is the response correct?
- is it complete enough for the workflow?
- is it structured correctly?
- does it stay within the task constraints?
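The outcome-focused questions above can be written down as an explicit rubric rather than a vague quality prompt. The sketch below is a minimal illustration in plain Python; the `RUBRIC` structure, names, and `rubric_prompt` helper are assumptions for this guide, not a real API.

```python
# Hypothetical sketch: an outcome-focused rubric as explicit criteria.
# The eval name, scale, and criteria wording are illustrative only.
RUBRIC = {
    "name": "support-reply-quality",  # illustrative eval name
    "scale": (1, 5),
    "criteria": [
        "Correctness: does the response answer the user's actual question?",
        "Completeness: does it cover every step the workflow requires?",
        "Structure: is the output valid for the downstream consumer?",
        "Constraints: does it stay within the task's length and policy limits?",
    ],
}

def rubric_prompt(rubric: dict) -> str:
    """Render the rubric as a judge prompt with an explicit score range."""
    lo, hi = rubric["scale"]
    lines = [f"Score the response from {lo} to {hi} against each criterion:"]
    lines += [f"- {c}" for c in rubric["criteria"]]
    return "\n".join(lines)

print(rubric_prompt(RUBRIC))
```

Writing the criteria out this way keeps the judge anchored to the product outcome instead of a generic notion of quality.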
## Step 2: quick-test the rubric on one example

Before launching a real run, preview the rubric on one conversation. This is where you find out whether:

- the instructions are too vague
- the score range is too coarse (too few levels) or too broad (more levels than the judge uses consistently)
- the judge model is rewarding the wrong thing
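A single-example preview can be as simple as the sketch below. The `preview_rubric` and `stub_judge` names, signatures, and return shapes are assumptions for illustration; a real judge call would go to a model.

```python
# Hypothetical sketch: score one conversation and surface the judge's
# reasoning before committing to a full run.
def preview_rubric(judge, rubric: str, example: dict) -> dict:
    """Run the rubric against a single example and return score plus reasoning."""
    score, reasoning = judge(rubric, example["request"], example["response"])
    return {"score": score, "reasoning": reasoning}

def stub_judge(rubric, request, response):
    # Stand-in for a real judge-model call, so the preview runs locally.
    return 4, "Correct and complete; minor formatting issue."

example = {
    "request": "How do I reset my password?",
    "response": "Open Settings, choose Security, then select Reset password.",
}
result = preview_rubric(stub_judge, "Score 1-5 against the rubric.", example)
print(result["score"], result["reasoning"])
```

Reading the reasoning on one example is usually enough to catch a rubric that rewards the wrong thing before you spend a full run on it.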
## Step 3: run the baseline comparison

Launch a run group with:

- the eval dataset you created from observed traffic
- your current baseline model
- one or more candidate models
- a fixed judge model
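Structurally, a run group is just those four pieces held together. The sketch below is a minimal illustration; the `RunGroup` class and all dataset and model names are made up for this example.

```python
# Hypothetical sketch: the shape of a run group. Names are illustrative.
from dataclasses import dataclass

@dataclass
class RunGroup:
    dataset: str
    baseline: str
    candidates: list
    judge: str  # keep the judge fixed so scores stay comparable across models

group = RunGroup(
    dataset="observed-traffic-2024-q3",   # illustrative dataset name
    baseline="prod-model-v1",             # illustrative model ids
    candidates=["candidate-a", "candidate-b"],
    judge="judge-model-v1",
)

# One run per model, all scored on the same dataset by the same judge.
runs = [(model, group.dataset, group.judge)
        for model in [group.baseline, *group.candidates]]
print(len(runs))  # baseline plus two candidates
```

Fixing the judge is the important design choice here: if the judge changes between runs, score differences no longer tell you anything about the models.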
## Step 4: inspect both the summary and the rows

At the run-group level, compare:

- average score
- score distribution
- completed vs. failed rows

At the row level, inspect:

- original request
- model output
- judge score
- judge reasoning
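The run-group summary above reduces to a few aggregates over the scored rows. A minimal sketch, assuming rows carry a `score` and a `status` field (an assumption for illustration, not a real schema):

```python
# Hypothetical sketch: summarize one run's rows into the group-level numbers.
from statistics import mean

rows = [  # illustrative scored rows from a single run
    {"score": 4, "status": "completed"},
    {"score": 2, "status": "completed"},
    {"score": None, "status": "failed"},   # failed rows carry no score
    {"score": 5, "status": "completed"},
]

completed = [r for r in rows if r["status"] == "completed"]
scores = [r["score"] for r in completed]
summary = {
    "avg": round(mean(scores), 2),
    "distribution": {s: scores.count(s) for s in sorted(set(scores))},
    "completed": len(completed),
    "failed": len(rows) - len(completed),
}
print(summary)
```

Note that failed rows are excluded from the average but still reported: a candidate with a better average and more failures is not a clear win.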
## Step 5: decide what the result means

If the candidate wins clearly:

- rerun on a broader or larger sample
- decide whether the model is ready for rollout or training promotion

If the candidate does not win clearly:

- fix the rubric
- improve the dataset
- or move into training
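The two branches above amount to a simple decision rule on the averages. The sketch below is a toy illustration; the margin threshold is an assumption, not a standard, and real decisions should also weigh distributions and failure counts.

```python
# Hypothetical sketch: a toy decision rule over run-group averages.
def decide(baseline_avg: float, candidate_avg: float, margin: float = 0.3) -> str:
    """Classify a comparison result; the margin value is illustrative only."""
    if candidate_avg >= baseline_avg + margin:
        return "clear win: rerun on a larger sample, then consider rollout"
    if candidate_avg <= baseline_avg - margin:
        return "clear loss: revisit the rubric or dataset before training"
    return "too close to call: expand the sample or tighten the rubric"

print(decide(3.4, 3.9))  # candidate ahead by 0.5, beyond the margin
```

A rule like this is most useful as a forcing function: it makes you state up front how big a difference you would act on.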
## Verify it worked

You should now have:

- one named eval definition
- one rubric you trust enough to rerun
- one run group you can use as a baseline for future changes
## What to do next

- **Turn Eval Failures into a Training Run**: use failure patterns and low-scoring rows to guide model improvement.
- **Promote a Trained Model to Deployment**: rerun the baseline before moving a trained model into production.