What Evaluate gives you
- reusable eval definitions tied to a rubric
- judge-model scoring for real outputs
- repeatable comparisons across candidate models or checkpoints
- a clean feedback loop into dataset revision and training
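The judge-model scoring above can be sketched in a few lines. This is a minimal illustration, not the platform's API: `call_judge_model` is a hypothetical stand-in for whatever judge model you configure, stubbed here with a simple heuristic so the example runs on its own.

```python
# Minimal sketch of judge-model scoring against a rubric.
# `call_judge_model` is a HYPOTHETICAL stand-in: a real setup would
# send the rubric and the model output to a configured judge LLM.

RUBRIC = "Score 1-5: is the answer concise and does it address the question?"

def call_judge_model(rubric: str, question: str, answer: str) -> int:
    """Stub judge: starts at 3, rewards topical and concise answers."""
    score = 3
    if question.lower().split()[0] in answer.lower():
        score += 1
    if len(answer.split()) <= 30:
        score += 1
    return min(score, 5)

def score_outputs(examples: list[dict]) -> list[dict]:
    """Attach a judge score to each (question, answer) record."""
    return [
        {**ex, "score": call_judge_model(RUBRIC, ex["question"], ex["answer"])}
        for ex in examples
    ]

results = score_outputs([
    {"question": "What is JSONL?", "answer": "JSONL is one JSON object per line."},
])
print(results[0]["score"])
```

The key property this models is reuse: the rubric is defined once and applied to every output, so scores stay comparable across runs.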
Where the inputs come from
Most teams start from data created in Observe:
- live traffic captured through the proxy
- historical uploads imported as JSONL
- saved eval datasets built from filtered requests
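For the historical-upload path, the JSONL format is simply one JSON object per line. A minimal loader, with illustrative field names ("prompt", "completion") that you should match to your actual export:

```python
# Sketch of loading a historical eval dataset from JSONL.
# Field names are illustrative, not a platform-mandated schema.
import json
import os
import tempfile

def load_jsonl(path: str) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# Demo: write two records to a temp file, then read them back.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write('{"prompt": "What is JSONL?", "completion": "One JSON object per line."}\n')
    f.write('{"prompt": "Why use it?", "completion": "Streams and appends easily."}\n')
    path = f.name

records = load_jsonl(path)
os.remove(path)
print(len(records))
```

Skipping blank lines makes the loader tolerant of trailing newlines in exported files.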
Recommended workflow
- Build or import a representative eval dataset.
- Define the rubric and scoring range.
- Run the eval against your baseline model.
- Compare the baseline to a candidate model or trained checkpoint.
- Use low-scoring examples to improve the next dataset or training run.
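The comparison and feedback steps above reduce to a small amount of bookkeeping. A sketch under assumed data: given judge scores for a baseline and a candidate on the same dataset (the scores here are made up), report the mean-score delta and collect the candidate's low scorers for the next dataset or training run.

```python
# Sketch of the compare-and-feed-back steps. Score data is illustrative.

def summarize(run: list[dict], threshold: int = 3) -> tuple[float, list[dict]]:
    """Return (mean score, examples scoring below threshold)."""
    mean = sum(ex["score"] for ex in run) / len(run)
    failures = [ex for ex in run if ex["score"] < threshold]
    return mean, failures

# Same dataset, scored against two models (hypothetical results).
baseline = [{"id": 1, "score": 4}, {"id": 2, "score": 2}, {"id": 3, "score": 5}]
candidate = [{"id": 1, "score": 5}, {"id": 2, "score": 2}, {"id": 3, "score": 5}]

base_mean, _ = summarize(baseline)
cand_mean, cand_failures = summarize(candidate)

print(round(cand_mean - base_mean, 2))      # positive delta: candidate improved
print([ex["id"] for ex in cand_failures])   # candidates for dataset revision
```

Because both runs use the same dataset and rubric, the delta is meaningful; the failure list closes the loop back into dataset revision or training.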
When to run evals
- Before training to establish a baseline
- During training to compare new checkpoints
- Before deployment to confirm the model is ready for rollout
Next steps
Create your first eval
Define the rubric, scoring range, judge model, and default candidate models.
Run and compare
Launch run groups and compare multiple candidate models against the same dataset.
Build datasets
Start with representative traffic from Observe.
Train a better model
Use eval failures to drive fine-tuning or distillation.
View the platform workflow
See how evals fit into the broader lifecycle.
Talk to an engineer
Meet with us if you want help designing the rubric or evaluation strategy.