What evals give you
- reusable eval definitions tied to a rubric
- judge-model scoring for real outputs
- repeatable comparisons across candidate models or checkpoints
- a clean feedback loop into dataset revision and training
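As a sketch of what a reusable eval definition can look like, the snippet below pairs a rubric with a judge prompt and a score range. The names (`EvalDefinition`, `judge_prompt`) are illustrative, not a real API.

```python
# Hypothetical sketch: a reusable eval definition ties a rubric and a
# scoring range to a judge prompt that a judge model can answer.
from dataclasses import dataclass

@dataclass
class EvalDefinition:
    name: str
    rubric: str                 # what the judge model should reward or penalize
    score_range: tuple[int, int]  # (min, max) score the judge may assign

    def judge_prompt(self, prompt: str, output: str) -> str:
        lo, hi = self.score_range
        return (
            f"Rubric: {self.rubric}\n"
            f"Score the response from {lo} to {hi}.\n"
            f"Prompt: {prompt}\n"
            f"Response: {output}\n"
            f"Score:"
        )

helpfulness = EvalDefinition(
    name="helpfulness",
    rubric="Answers the question directly and avoids filler.",
    score_range=(1, 5),
)
print(helpfulness.judge_prompt("What is JSONL?", "One JSON object per line."))
```

Because the definition is just rubric plus range, the same eval can be re-run unchanged against every candidate model or checkpoint, which is what makes comparisons repeatable.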
Where the inputs come from
Most teams build eval inputs from one of three sources:
- live traffic captured through the proxy
- historical uploads imported as JSONL
- saved eval datasets built from filtered requests
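Whichever source you use, JSONL is the common interchange shape: one JSON object per line. A minimal round-trip sketch (the `prompt`/`response` field names are assumptions, not a fixed schema):

```python
# Sketch of the JSONL shape an eval dataset import might use.
import json

records = [
    {"prompt": "Summarize this ticket.", "response": "User cannot log in."},
    {"prompt": "Classify sentiment.", "response": "negative"},
]

# Serialize: one JSON object per line.
jsonl = "\n".join(json.dumps(r) for r in records)
print(jsonl)

# Parse it back: each line is an independent JSON document.
parsed = [json.loads(line) for line in jsonl.splitlines()]
assert parsed == records
```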
Recommended workflow
- Build or import a representative eval dataset.
- Define the rubric and scoring range.
- Run the eval against your baseline model.
- Compare the baseline to a candidate model or trained checkpoint.
- Use low-scoring examples to improve the next dataset or training run.
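The steps above can be sketched end to end. Here `run_eval` takes a placeholder scoring function standing in for a real judge-model call; the threshold and toy data are assumptions for illustration.

```python
# Hedged sketch of the workflow: score a dataset, compare baseline vs
# candidate, and keep low scorers for the next dataset or training run.
def run_eval(model, dataset):
    # Score each example; a real system would send outputs to a judge model.
    return [model(ex) for ex in dataset]

def mean(scores):
    return sum(scores) / len(scores)

def low_scoring(dataset, scores, threshold=3):
    # Examples below the threshold feed the next dataset revision.
    return [ex for ex, s in zip(dataset, scores) if s < threshold]

# Toy stand-ins: "models" that map an example straight to a judge score.
dataset = ["a", "bb", "ccc"]
baseline_scores = run_eval(len, dataset)                    # [1, 2, 3]
candidate_scores = run_eval(lambda ex: len(ex) + 1, dataset)  # [2, 3, 4]

delta = mean(candidate_scores) - mean(baseline_scores)  # positive = improvement
hard_cases = low_scoring(dataset, baseline_scores)      # ["a", "bb"]
print(delta, hard_cases)
```

The key design point is that the dataset and eval definition stay fixed while only the model changes, so `delta` isolates the effect of the candidate.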
When to run evals
- Before training to establish a baseline
- During training to compare new checkpoints
- Before deployment to confirm the model is ready for rollout
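A pre-deployment check often reduces to a simple gate on eval scores. A minimal sketch, where the margin value is an assumption you would tune per rubric:

```python
# Illustrative rollout gate: block deployment if the candidate's mean
# eval score regresses past a small margin relative to the baseline.
def ready_for_rollout(baseline_mean: float, candidate_mean: float,
                      margin: float = 0.1) -> bool:
    return candidate_mean >= baseline_mean - margin

print(ready_for_rollout(4.2, 4.25))  # small gain: ship
print(ready_for_rollout(4.2, 3.9))   # regression: hold
```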
Next steps
Datasets
Start with representative traffic or historical uploads.
Fine-tuning
Use eval failures to drive fine-tuning or distillation.
E2E Fine-tuning Guide
Go from eval failures to a completed training run.
Talk to an engineer
Meet with us if you want help designing the rubric or evaluation strategy.