Use this guide when you want a release gate based on your own workflow rather than abstract benchmark scores.

What you’ll have when you finish

  • one eval definition
  • one repeatable rubric
  • one run group comparing candidate models on the same dataset

Before you start

Create an eval dataset from observed traffic; the baseline comparison in Step 3 assumes this dataset already exists.

Step 1: define the rubric around the task outcome

Do not start with a vague quality prompt like “is this good?” Start with the actual product outcome you need:
  • is the response correct?
  • is it complete enough for the workflow?
  • is it structured correctly?
  • does it stay within the task constraints?
Use the built-in templates as a starting point, then tailor the rubric to your workflow.
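As a minimal sketch, the rubric can be expressed as plain data tied to those four outcome questions and then rendered into the judge instructions. All field and function names here are illustrative, not a real API:

```python
# Hypothetical rubric definition: outcome-based criteria plus a score scale.
RUBRIC = {
    "name": "support-reply-rubric-v1",
    "scale": {"min": 1, "max": 5},
    "criteria": [
        "Correct: the answer resolves the user's actual request.",
        "Complete: includes every step the workflow requires.",
        "Structured: follows the required response format.",
        "Constrained: stays within the task's stated limits.",
    ],
}

def render_judge_prompt(rubric: dict) -> str:
    """Turn the rubric into the instruction block sent to the judge model."""
    lines = [
        f"Score the response from {rubric['scale']['min']} "
        f"to {rubric['scale']['max']} against these criteria:"
    ]
    lines += [f"- {criterion}" for criterion in rubric["criteria"]]
    return "\n".join(lines)
```

Keeping the rubric as data (rather than a hand-edited prompt string) makes it easy to version, which matters in Step 3 when the rubric must stay fixed across runs.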

Step 2: quick-test the rubric on one example

Before launching a real run, preview the rubric on one conversation. This is where you find out whether:
  • the instructions are too vague
  • the score range is too coarse or too fine-grained
  • the judge model is rewarding the wrong thing
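A quick test only needs to surface the score and the judge's reasoning for one conversation, so you can check by hand whether the rubric rewards the right thing. A sketch, where `call_judge` is a stand-in for whatever judge-model call your platform exposes:

```python
def call_judge(rubric_prompt: str, conversation: dict) -> dict:
    # Stubbed for illustration; a real judge-model call goes here.
    return {"score": 4, "reasoning": "Correct and complete; minor format issue."}

def quick_test(rubric_prompt: str, conversation: dict) -> dict:
    """Preview the rubric on one conversation before launching a full run."""
    result = call_judge(rubric_prompt, conversation)
    # Return exactly what you need to sanity-check the rubric: a plausible
    # score, and reasoning that shows what the judge actually rewarded.
    return {
        "conversation_id": conversation["id"],
        "score": result["score"],
        "reasoning": result["reasoning"],
    }
```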

Step 3: run the baseline comparison

Launch a run group with:
  • the eval dataset you created from observed traffic
  • your current baseline model
  • one or more candidate models
  • a fixed judge model
Keep the judge model and rubric version stable while comparing candidate models.
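One way to keep that discipline is to make the shared pieces structurally shared: a hypothetical run-group config where the dataset, judge model, and rubric version are defined once and only the model varies per run (all names below are placeholders):

```python
# Hypothetical run-group config: what stays fixed vs. what varies.
RUN_GROUP = {
    "dataset": "observed-traffic-sample",        # eval dataset from real traffic
    "judge_model": "judge-model-v2",             # held fixed across the group
    "rubric_version": "support-reply-rubric-v1", # held fixed across the group
    "models": [
        "baseline-current",                      # current production model
        "candidate-a",
        "candidate-b",
    ],
}

def runs_for(group: dict) -> list[dict]:
    """Expand the group into one run per model, sharing dataset/judge/rubric."""
    shared = {k: group[k] for k in ("dataset", "judge_model", "rubric_version")}
    return [{**shared, "model": model} for model in group["models"]]
```

Because every run inherits the same judge and rubric version, any score difference between runs can only come from the candidate model.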

Step 4: inspect both the summary and the rows

At the run-group level, compare:
  • average score
  • score distribution
  • completed vs failed rows
Then inspect row-level failures:
  • original request
  • model output
  • judge score
  • judge reasoning
This is where you learn whether the candidate is actually better or just better on a narrow slice.
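The two levels of inspection above can be sketched as a pair of helpers over row records. Field names mirror the lists above; the low-score threshold is illustrative:

```python
from statistics import mean

def summarize(rows: list[dict]) -> dict:
    """Run-level view: average score plus completed vs. failed counts."""
    scored = [r for r in rows if r["status"] == "completed"]
    return {
        "avg_score": mean(r["judge_score"] for r in scored) if scored else None,
        "completed": len(scored),
        "failed": len(rows) - len(scored),
    }

def failures(rows: list[dict], threshold: int = 3) -> list[dict]:
    """Row-level view: low-scoring rows with the fields you need to read them."""
    return [
        {k: r[k] for k in ("request", "output", "judge_score", "judge_reasoning")}
        for r in rows
        if r["status"] == "completed" and r["judge_score"] < threshold
    ]
```

Reading the `judge_reasoning` on each failing row is what tells you whether a higher average reflects genuine improvement or a win on a narrow slice.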

Step 5: decide what the result means

If the candidate wins clearly:
  • rerun on a broader or larger sample
  • decide whether the model is ready for rollout or training promotion
If all candidates are weak:
  • fix the rubric
  • improve the dataset
  • or move into training
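The decision logic above can be sketched as a single function over two run summaries. The win margin and "weak" ceiling are illustrative thresholds, not recommendations; pick values that match your score scale:

```python
def decide(baseline_avg: float, candidate_avg: float,
           win_margin: float = 0.5, weak_ceiling: float = 2.5) -> str:
    """Map baseline and candidate average scores to a next action."""
    if baseline_avg < weak_ceiling and candidate_avg < weak_ceiling:
        return "revisit"        # fix the rubric, improve the dataset, or train
    if candidate_avg - baseline_avg >= win_margin:
        return "rerun-broader"  # confirm on a larger sample before rollout
    return "hold"               # no clear win; keep the current baseline
```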

Verify it worked

You should now have:
  • one named eval definition
  • one rubric you trust enough to rerun
  • one run group you can use as a baseline for future changes

What to do next

Turn Eval Failures into a Training Run

Use failure patterns and low-scoring rows to guide model improvement.

Promote a Trained Model to Deployment

Use the baseline again before moving a trained model into production.