Evaluate turns datasets and rubrics into repeatable model comparisons. It is the quality loop that keeps platform workflows honest: before training, during training, and before deployment.

What Evaluate gives you

  • reusable eval definitions tied to a rubric
  • judge-model scoring for real outputs
  • repeatable comparisons across candidate models or checkpoints
  • a clean feedback loop into dataset revision and training
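What a reusable eval definition contains is platform-specific; as a hypothetical sketch (names like `EvalDefinition`, `Rubric`, and the field values are illustrative, not a real API), the pieces above might be modeled as:

```python
from dataclasses import dataclass

@dataclass
class Rubric:
    # Hypothetical rubric: the criteria a judge model scores against,
    # plus the scoring range.
    criteria: list[str]
    min_score: int = 1
    max_score: int = 5

@dataclass
class EvalDefinition:
    # Reusable eval definition: a named rubric tied to a dataset
    # and the judge model that scores candidate outputs.
    name: str
    dataset_path: str   # e.g. a JSONL file of prompt/response pairs
    rubric: Rubric
    judge_model: str

helpfulness_eval = EvalDefinition(
    name="helpfulness-v1",
    dataset_path="evals/helpfulness.jsonl",
    rubric=Rubric(criteria=["answers the question", "cites sources"]),
    judge_model="judge-model-name",
)
```

The point of keeping the rubric and scoring range inside the eval definition is that every run, against any candidate model, scores outputs against the same criteria.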

Where the inputs come from

Most teams start from data created in Observe:
  • live traffic captured through the proxy
  • historical uploads imported as JSONL
  • saved eval datasets built from filtered requests
That is why Observe usually comes first in the lifecycle.
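Whichever source you start from, an eval dataset typically ends up as JSONL: one JSON object per line. A minimal loader for that format (field names in the data are up to you) might look like:

```python
import json

def load_eval_dataset(path):
    """Read a JSONL eval dataset: one JSON object per non-empty line."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines between records
                examples.append(json.loads(line))
    return examples
```

This is a sketch, not the platform's importer; it simply shows the shape of the data the steps below operate on.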

The basic workflow

  1. Build or import a representative eval dataset.
  2. Define the rubric and scoring range.
  3. Run the eval against your baseline model.
  4. Compare the baseline to a candidate model or trained checkpoint.
  5. Use low-scoring examples to improve the next dataset or training run.
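Steps 3 and 4 boil down to scoring each example with the judge model and aggregating per candidate. A hedged sketch, with `judge_score` standing in for a real judge-model call and all names illustrative:

```python
def run_eval(dataset, model_output_fn, judge_score):
    """Score one model's outputs over a dataset; return the mean judge score."""
    scores = [judge_score(ex, model_output_fn(ex)) for ex in dataset]
    return sum(scores) / len(scores)

def compare(dataset, baseline_fn, candidate_fn, judge_score):
    """Run the same eval on both models; return (baseline, candidate, delta)."""
    base = run_eval(dataset, baseline_fn, judge_score)
    cand = run_eval(dataset, candidate_fn, judge_score)
    return base, cand, cand - base
```

Because both models are scored against the same dataset and the same judge, the delta is attributable to the model change rather than to drift in the eval itself.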

When to run evals

  • before training to establish a baseline
  • during training to compare new checkpoints
  • before deployment to confirm the model is ready for rollout
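The pre-deployment check is just the baseline-vs-candidate comparison with a threshold attached. As an illustrative gate (the function name and threshold values are assumptions, not platform defaults):

```python
def ready_for_rollout(baseline_mean, candidate_mean, min_delta=0.0, floor=3.0):
    """Hypothetical gate: the candidate must clear an absolute score floor
    and beat the baseline by at least min_delta."""
    return candidate_mean >= floor and (candidate_mean - baseline_mean) >= min_delta
```

For example, with a baseline mean of 3.0, a candidate scoring 3.4 would pass this gate, while one scoring 2.8 would not.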

Next steps

Create your first eval

Define the rubric, scoring range, judge model, and default candidate models.

Run and compare

Launch run groups and compare multiple candidate models against the same dataset.

Build datasets

Start with representative traffic from Observe.

Train a better model

Use eval failures to drive fine-tuning or distillation.

View the platform workflow

See how evals fit into the broader lifecycle.

Talk to an engineer

Meet with us if you want help designing the rubric or evaluation strategy.