## How it works
- Define a rubric - describe what “good” looks like in plain English
- Pick a dataset - samples from captured traffic or uploaded JSONL
- Select models - the candidates you want to compare
- Run the eval - each sample goes through each model, and an LLM judge scores every output
- Compare results - side-by-side scores show which model wins
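The loop above can be sketched in a few lines. This is a minimal illustration, not this product's actual API: `call_model` and `judge` are hypothetical stand-ins for the model client and the LLM judge.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Result:
    model: str
    sample_id: int
    score: float  # the judge's numeric rubric score

def run_eval(rubric, samples, models, call_model, judge):
    """Run every sample through every model; the judge scores each
    output against the rubric."""
    results = []
    for model in models:
        for i, sample in enumerate(samples):
            output = call_model(model, sample["input"])
            results.append(Result(model, i, judge(rubric, output)))
    return results

def mean_scores(results):
    """Average score per model, for the side-by-side comparison."""
    by_model = {}
    for r in results:
        by_model.setdefault(r.model, []).append(r.score)
    return {m: mean(scores) for m, scores in by_model.items()}
```

The key property is the cross product: every sample is scored for every candidate model, so the per-model averages are directly comparable.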
📍 TODO:MEDIA
Screenshot of the eval page showing the setup flow: rubric, dataset, model selection.
## Key concepts
| Concept | Description |
|---|---|
| LLM-as-a-judge | A capable LLM reads your rubric, examines the model output, and returns a scored judgment. |
| Rubric | A plain English description of a quality dimension, scored numerically. Defines what “good” means for your use case. |
| Eval dataset | A stable, curated set of challenging examples that acts as your benchmark. Pick the hard cases. |
| Offline vs online | Offline evals run against collected samples. Online evals score live traffic as it flows through. Offline is available today; online is coming soon. |
| Train-eval splits | Training and eval data must never overlap. If a model trains on eval examples, the eval becomes meaningless. See the zero-overlap rule. |
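The zero-overlap rule can be checked mechanically before an eval run. A minimal sketch, assuming samples are JSON-serializable dicts (the function names here are illustrative, not part of the product):

```python
import hashlib
import json

def fingerprint(sample: dict) -> str:
    """Stable hash of a sample; sort_keys catches duplicates even
    when dict key order differs."""
    canonical = json.dumps(sample, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def assert_zero_overlap(train_set, eval_set):
    """Raise if any eval example also appears in the training data."""
    train_fps = {fingerprint(s) for s in train_set}
    leaked = [s for s in eval_set if fingerprint(s) in train_fps]
    if leaked:
        raise ValueError(
            f"{len(leaked)} eval example(s) also present in training data"
        )
```

Exact-match hashing only catches verbatim duplicates; near-duplicates (paraphrases of the same example) still need manual review when curating the eval set.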
## Next steps
- Writing rubrics - create rubrics from templates, AI generation, or plain English
- Run a model comparison - compare models head to head on your data
- How LLM-as-a-Judge works - understand the evaluation mechanism
- Read the results - interpret scores and make decisions