Use this guide when you want a release gate based on your own workflow rather than abstract benchmark scores.

What you’ll have when you finish

  • one eval definition
  • one repeatable rubric
  • one run group comparing candidate models on the same dataset

Before you start

Create an eval dataset from observed traffic; the baseline comparison in Step 3 assumes this dataset already exists.

Step 1: define the rubric around the task outcome

Do not start with a vague quality prompt like “is this good?” Start with the actual product outcome you need:
  • is the response correct?
  • is it complete enough for the workflow?
  • is it structured correctly?
  • does it stay within the task constraints?
Use the built-in templates as a starting point, then tailor the rubric to your workflow.
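As a minimal sketch, the rubric can be expressed as plain data tied to those four outcome questions and then rendered into the judge instructions. All field and function names here are illustrative, not a real API:

```python
# Hypothetical rubric definition: outcome-based criteria plus a score scale.
RUBRIC = {
    "name": "support-reply-rubric-v1",
    "scale": {"min": 1, "max": 5},
    "criteria": [
        "Correct: the answer resolves the user's actual request.",
        "Complete: includes every step the workflow requires.",
        "Structured: follows the required response format.",
        "Constrained: stays within the task's stated limits.",
    ],
}

def render_judge_prompt(rubric: dict) -> str:
    """Turn the rubric into the instruction block sent to the judge model."""
    lines = [
        f"Score the response from {rubric['scale']['min']} "
        f"to {rubric['scale']['max']} against these criteria:"
    ]
    lines += [f"- {criterion}" for criterion in rubric["criteria"]]
    return "\n".join(lines)
```

Keeping the rubric as data (rather than a hand-edited prompt string) makes it easy to version, which matters in Step 3 when the rubric must stay fixed across runs.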

Step 2: quick-test the rubric on one example

Before launching a real run, preview the rubric on one conversation. This is where you find out whether:
  • the instructions are too vague
  • the score range is too coarse or too fine-grained
  • the judge model is rewarding the wrong thing
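A quick test only needs to surface the score and the judge's reasoning for one conversation, so you can check by hand whether the rubric rewards the right thing. A sketch, where `call_judge` is a stand-in for whatever judge-model call your platform exposes:

```python
def call_judge(rubric_prompt: str, conversation: dict) -> dict:
    # Stubbed for illustration; a real judge-model call goes here.
    return {"score": 4, "reasoning": "Correct and complete; minor format issue."}

def quick_test(rubric_prompt: str, conversation: dict) -> dict:
    """Preview the rubric on one conversation before launching a full run."""
    result = call_judge(rubric_prompt, conversation)
    # Return exactly what you need to sanity-check the rubric: a plausible
    # score, and reasoning that shows what the judge actually rewarded.
    return {
        "conversation_id": conversation["id"],
        "score": result["score"],
        "reasoning": result["reasoning"],
    }
```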

Step 3: run the baseline comparison

Launch a run group with:
  • the eval dataset you created from observed traffic
  • your current baseline model
  • one or more candidate models
  • a fixed judge model
Keep the judge model and rubric version stable while comparing candidate models.
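One way to keep that discipline is to make the shared pieces structurally shared: a hypothetical run-group config where the dataset, judge model, and rubric version are defined once and only the model varies per run (all names below are placeholders):

```python
# Hypothetical run-group config: what stays fixed vs. what varies.
RUN_GROUP = {
    "dataset": "observed-traffic-sample",        # eval dataset from real traffic
    "judge_model": "judge-model-v2",             # held fixed across the group
    "rubric_version": "support-reply-rubric-v1", # held fixed across the group
    "models": [
        "baseline-current",                      # current production model
        "candidate-a",
        "candidate-b",
    ],
}

def runs_for(group: dict) -> list[dict]:
    """Expand the group into one run per model, sharing dataset/judge/rubric."""
    shared = {k: group[k] for k in ("dataset", "judge_model", "rubric_version")}
    return [{**shared, "model": model} for model in group["models"]]
```

Because every run inherits the same judge and rubric version, any score difference between runs can only come from the candidate model.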

Step 4: inspect both the summary and the rows

At the run-group level, compare:
  • average score
  • score distribution
  • completed vs failed rows
Then inspect row-level failures:
  • original request
  • model output
  • judge score
  • judge reasoning
This is where you learn whether the candidate is actually better or just better on a narrow slice.
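The two levels of inspection above can be sketched as a pair of helpers over row records. Field names mirror the lists above; the low-score threshold is illustrative:

```python
from statistics import mean

def summarize(rows: list[dict]) -> dict:
    """Run-level view: average score plus completed vs. failed counts."""
    scored = [r for r in rows if r["status"] == "completed"]
    return {
        "avg_score": mean(r["judge_score"] for r in scored) if scored else None,
        "completed": len(scored),
        "failed": len(rows) - len(scored),
    }

def failures(rows: list[dict], threshold: int = 3) -> list[dict]:
    """Row-level view: low-scoring rows with the fields you need to read them."""
    return [
        {k: r[k] for k in ("request", "output", "judge_score", "judge_reasoning")}
        for r in rows
        if r["status"] == "completed" and r["judge_score"] < threshold
    ]
```

Reading the `judge_reasoning` on each failing row is what tells you whether a higher average reflects genuine improvement or a win on a narrow slice.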

Step 5: decide what the result means

If the candidate wins clearly:
  • rerun on a broader or larger sample
  • decide whether the model is ready for rollout or training promotion
If all candidates are weak:
  • fix the rubric
  • improve the dataset
  • or move into training
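The decision logic above can be sketched as a single function over two run summaries. The win margin and "weak" ceiling are illustrative thresholds, not recommendations; pick values that match your score scale:

```python
def decide(baseline_avg: float, candidate_avg: float,
           win_margin: float = 0.5, weak_ceiling: float = 2.5) -> str:
    """Map baseline and candidate average scores to a next action."""
    if baseline_avg < weak_ceiling and candidate_avg < weak_ceiling:
        return "revisit"        # fix the rubric, improve the dataset, or train
    if candidate_avg - baseline_avg >= win_margin:
        return "rerun-broader"  # confirm on a larger sample before rollout
    return "hold"               # no clear win; keep the current baseline
```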

Verify it worked

You should now have:
  • one named eval definition
  • one rubric you trust enough to rerun
  • one run group you can use as a baseline for future changes

What to do next

Turn Eval Failures into a Training Run

Use failure patterns and low-scoring rows to guide model improvement.

Promote a Trained Model to Deployment

Use the baseline again before moving a trained model into production.