Evals turn datasets and rubrics into repeatable model comparisons. They are the quality loop that keeps platform workflows honest: before training, during training, and before deployment.

What evals give you

  • Reusable eval definitions tied to a rubric
  • Judge-model scoring for real outputs
  • Repeatable comparisons across candidate models or checkpoints
  • A clean feedback loop into dataset revision and training
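As a concrete illustration of the first bullet, here is a minimal sketch of what a reusable eval definition might look like as a data structure. The field names (`rubric`, `score_range`, `dataset`) are assumptions for illustration, not a documented platform schema:

```python
from dataclasses import dataclass, field

# Hypothetical representation of a reusable eval definition.
# Field names are illustrative, not a real platform API.

@dataclass
class EvalDefinition:
    name: str
    rubric: str                          # instructions the judge model scores against
    score_range: tuple[int, int] = (1, 5)
    dataset: list[dict] = field(default_factory=list)  # prompt/response records

    def validate_score(self, score: int) -> bool:
        """Check that a judge score falls inside the declared range."""
        lo, hi = self.score_range
        return lo <= score <= hi

helpfulness = EvalDefinition(
    name="helpfulness-v1",
    rubric="Rate how directly the response answers the user's question.",
)
print(helpfulness.validate_score(4))  # True
```

Keeping the rubric and score range on the definition itself is what makes the eval reusable: every run against any model scores with the same instructions.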

Where the inputs come from

Most teams source eval data in one of three ways:
  • live traffic captured through the proxy
  • historical uploads imported as JSONL
  • saved eval datasets built from filtered requests
That is why capturing traffic usually comes first in the lifecycle.
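For the historical-upload path, a JSONL dataset stores one JSON object per line. A minimal loader might look like the sketch below; the record keys (`prompt`, `response`) are assumed field names, not a documented schema:

```python
import json
import tempfile

def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per non-blank line."""
    records = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:                      # tolerate trailing blank lines
                records.append(json.loads(line))
    return records

# Round-trip two illustrative records through a temporary file.
rows = [
    {"prompt": "What is your refund policy?", "response": "30 days."},
    {"prompt": "Do you ship overseas?", "response": "Yes, to 40 countries."},
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write("\n".join(json.dumps(r) for r in rows))
    path = f.name

print(len(load_jsonl(path)))  # 2
```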
The eval lifecycle

  1. Build or import a representative eval dataset.
  2. Define the rubric and scoring range.
  3. Run the eval against your baseline model.
  4. Compare the baseline to a candidate model or trained checkpoint.
  5. Use low-scoring examples to improve the next dataset or training run.
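Steps 3 through 5 above can be sketched in a few lines. Everything here is illustrative: `judge` is a stand-in for a real judge-model call against the rubric, and the two lambdas stand in for baseline and candidate models:

```python
# Hedged sketch of the run/compare/feedback loop.
# A real `judge` would call a judge model with the rubric and parse its score.

def judge(prompt: str, response: str) -> int:
    # Toy stand-in: reward responses that actually mention the topic.
    return 5 if prompt.lower() in response.lower() else 2

def run_eval(examples, respond):
    """Score a model (a prompt -> response callable) over the dataset."""
    scores = [judge(ex["prompt"], respond(ex["prompt"])) for ex in examples]
    return sum(scores) / len(scores), scores

examples = [{"prompt": "refund policy"}, {"prompt": "shipping time"}]
baseline = lambda p: "I do not know."
candidate = lambda p: f"Our {p} is explained here..."

base_avg, _ = run_eval(examples, baseline)
cand_avg, cand_scores = run_eval(examples, candidate)

# Step 5: low-scoring examples feed the next dataset or training run.
low_scoring = [ex for ex, s in zip(examples, cand_scores) if s <= 2]
print(cand_avg > base_avg)  # True: the candidate beats this weak baseline
```

The important property is that the same dataset and judge are reused across both runs, so the comparison isolates the model change.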

When to run evals

  • Before training to establish a baseline
  • During training to compare new checkpoints
  • Before deployment to confirm the model is ready for rollout
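The pre-deployment check often reduces to a simple gate on the scores. The thresholds below (an absolute bar of 4.0 on a 1-5 scale and a 0.1 non-regression margin) are made-up values for illustration, not platform defaults:

```python
# Illustrative rollout gate over judge-model averages on a 1-5 scale.
# Threshold and margin values are assumptions, not platform defaults.

def ready_for_rollout(candidate_avg: float, baseline_avg: float,
                      min_score: float = 4.0, margin: float = 0.1) -> bool:
    """Pass only if the candidate clears an absolute quality bar and does
    not regress against the baseline by more than `margin`."""
    return candidate_avg >= min_score and candidate_avg >= baseline_avg - margin

print(ready_for_rollout(4.3, 4.2))  # True
print(ready_for_rollout(3.8, 4.2))  # False: below the absolute bar
```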

Next steps

Datasets

Start with representative traffic or historical uploads.

Fine-tuning

Use eval failures to drive fine-tuning or distillation.

E2E Fine-tuning Guide

Go from eval failures to a completed training run.

Talk to an engineer

Meet with us if you want help designing the rubric or evaluation strategy.