Model providers ship updates constantly and prompts drift. Eval gives you a repeatable way to measure model quality before and after every change. If you’re planning to fine-tune a custom model, run evals first. A validated rubric and eval dataset are prerequisites for training. They’re the measuring stick that determines when the model has learned enough, or when to stop to prevent overfitting.

How it works

  1. Define a rubric - describe what “good” looks like in plain English
  2. Pick a dataset - samples from captured traffic or uploaded JSONL
  3. Select models - the candidates you want to compare
  4. Run the eval - each sample goes through each model, and an LLM judge scores every output
  5. Compare results - side-by-side scores show which model wins
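The steps above can be sketched as a simple loop: every sample goes through every candidate model, an LLM judge scores each output against the rubric, and per-model averages are compared. This is an illustrative sketch only; `call_model` and `judge` are hypothetical placeholders standing in for real model and judge API calls.

```python
RUBRIC = "The response must answer the question accurately and concisely."

DATASET = [
    {"input": "What is 2 + 2?", "reference": "4"},
    {"input": "What is the capital of France?", "reference": "Paris"},
]

MODELS = ["model-a", "model-b"]


def call_model(model: str, prompt: str) -> str:
    # Placeholder: a real implementation would call the provider's API.
    return f"{model} response to: {prompt}"


def judge(rubric: str, sample: dict, output: str) -> float:
    # Placeholder LLM judge: a real judge reads the rubric and the output
    # and returns a numeric score. Here we fake it with a reference check.
    return 1.0 if sample["reference"].lower() in output.lower() else 0.0


def run_eval(models: list[str], dataset: list[dict], rubric: str) -> dict[str, float]:
    # Each sample goes through each model; the judge scores every output,
    # and each model gets a mean score for side-by-side comparison.
    results = {}
    for model in models:
        scores = [judge(rubric, s, call_model(model, s["input"])) for s in dataset]
        results[model] = sum(scores) / len(scores)
    return results


print(run_eval(MODELS, DATASET, RUBRIC))
```

The key property is that every model sees the same samples and the same rubric, so the resulting averages are directly comparable.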

📍 TODO:MEDIA

Screenshot of the eval page showing the setup flow: rubric, dataset, model selection.

Key concepts

| Concept | Description |
|---|---|
| LLM-as-a-judge | A capable LLM reads your rubric, examines the model output, and returns a scored judgment. |
| Rubric | A plain English description of a quality dimension, scored numerically. Defines what “good” means for your use case. |
| Eval dataset | A stable, curated set of challenging examples that acts as your benchmark. Pick the hard cases. |
| Offline vs online | Offline evals run against collected samples. Online evals score live traffic as it flows through. Offline is available today; online is coming soon. |
| Train-eval splits | Training and eval data must never overlap. If a model trains on eval examples, the eval becomes meaningless. See the zero-overlap rule. |
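The zero-overlap rule is easy to enforce mechanically: hash each sample and verify the training and eval sets share none. This is a minimal sketch, not the product's implementation; the sample data and hashing scheme are assumptions.

```python
import hashlib
import json


def sample_key(sample: dict) -> str:
    # Hash the canonically serialized sample so duplicates are
    # caught even when samples carry no explicit IDs.
    payload = json.dumps(sample, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


train_set = [{"input": "What is 2 + 2?"}, {"input": "Define latency."}]
eval_set = [{"input": "What is the capital of France?"}]

overlap = {sample_key(s) for s in train_set} & {sample_key(s) for s in eval_set}
assert not overlap, f"zero-overlap rule violated: {len(overlap)} shared sample(s)"
```

Running this check before every training job guarantees the eval stays a valid measuring stick for the fine-tuned model.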

Next steps

Writing rubrics

Create rubrics from templates, AI generation, or plain English.

Run a model comparison

Compare models head to head on your data.

How LLM-as-a-Judge works

Understand the evaluation mechanism.

Read the results

Interpret scores and make decisions.