Evaluate turns datasets and rubrics into repeatable model comparisons. It is the quality loop that keeps platform workflows honest: before training, during training, and before deployment.

What Evaluate gives you

  • reusable eval definitions tied to a rubric
  • judge-model scoring for real outputs
  • repeatable comparisons across candidate models or checkpoints
  • a clean feedback loop into dataset revision and training
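What a reusable eval definition contains is platform-specific; as a hypothetical sketch (names like `EvalDefinition`, `Rubric`, and the field values are illustrative, not a real API), the pieces above might be modeled as:

```python
from dataclasses import dataclass

@dataclass
class Rubric:
    # Hypothetical rubric: the criteria a judge model scores against,
    # plus the scoring range.
    criteria: list[str]
    min_score: int = 1
    max_score: int = 5

@dataclass
class EvalDefinition:
    # Reusable eval definition: a named rubric tied to a dataset
    # and the judge model that scores candidate outputs.
    name: str
    dataset_path: str   # e.g. a JSONL file of prompt/response pairs
    rubric: Rubric
    judge_model: str

helpfulness_eval = EvalDefinition(
    name="helpfulness-v1",
    dataset_path="evals/helpfulness.jsonl",
    rubric=Rubric(criteria=["answers the question", "cites sources"]),
    judge_model="judge-model-name",
)
```

The point of keeping the rubric and scoring range inside the eval definition is that every run, against any candidate model, scores outputs against the same criteria.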

Where the inputs come from

Most teams start from data created in Observe:
  • live traffic captured through the proxy
  • historical uploads imported as JSONL
  • saved eval datasets built from filtered requests
That is why Observe usually comes first in the lifecycle.
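Whichever source you start from, an eval dataset typically ends up as JSONL: one JSON object per line. A minimal loader for that format (field names in the data are up to you) might look like:

```python
import json

def load_eval_dataset(path):
    """Read a JSONL eval dataset: one JSON object per non-empty line."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines between records
                examples.append(json.loads(line))
    return examples
```

This is a sketch, not the platform's importer; it simply shows the shape of the data the steps below operate on.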

The basic workflow

  1. Build or import a representative eval dataset.
  2. Define the rubric and scoring range.
  3. Run the eval against your baseline model.
  4. Compare the baseline to a candidate model or trained checkpoint.
  5. Use low-scoring examples to improve the next dataset or training run.
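Steps 3 and 4 boil down to scoring each example with the judge model and aggregating per candidate. A hedged sketch, with `judge_score` standing in for a real judge-model call and all names illustrative:

```python
def run_eval(dataset, model_output_fn, judge_score):
    """Score one model's outputs over a dataset; return the mean judge score."""
    scores = [judge_score(ex, model_output_fn(ex)) for ex in dataset]
    return sum(scores) / len(scores)

def compare(dataset, baseline_fn, candidate_fn, judge_score):
    """Run the same eval on both models; return (baseline, candidate, delta)."""
    base = run_eval(dataset, baseline_fn, judge_score)
    cand = run_eval(dataset, candidate_fn, judge_score)
    return base, cand, cand - base
```

Because both models are scored against the same dataset and the same judge, the delta is attributable to the model change rather than to drift in the eval itself.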

When to run evals

  • before training to establish a baseline
  • during training to compare new checkpoints
  • before deployment to confirm the model is ready for rollout
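The pre-deployment check is just the baseline-vs-candidate comparison with a threshold attached. As an illustrative gate (the function name and threshold values are assumptions, not platform defaults):

```python
def ready_for_rollout(baseline_mean, candidate_mean, min_delta=0.0, floor=3.0):
    """Hypothetical gate: the candidate must clear an absolute score floor
    and beat the baseline by at least min_delta."""
    return candidate_mean >= floor and (candidate_mean - baseline_mean) >= min_delta
```

For example, with a baseline mean of 3.0, a candidate scoring 3.4 would pass this gate, while one scoring 2.8 would not.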

Next steps

Create your first eval

Define the rubric, scoring range, judge model, and default candidate models.

Run and compare

Launch run groups and compare multiple candidate models against the same dataset.

Build datasets

Start with representative traffic from Observe.

Train a better model

Use eval failures to drive fine-tuning or distillation.

View the platform workflow

See how evals fit into the broader lifecycle.

Talk to an engineer

Meet with us if you want help designing the rubric or evaluation strategy.