# Eval

> Measure model quality with rubrics scored by LLM judges. Know which model is better and by how much.

Model providers ship updates constantly and prompts drift. Eval gives you a repeatable way to measure model quality before and after every change. It also helps you compare model options for a given task.

If you're planning to [fine-tune a custom model](/platform/train/overview), run evals first. A validated rubric and eval dataset are prerequisites for training. Together they are the measuring stick that tells you when the model has learned enough, or when to stop training to prevent overfitting.

<Frame>
  <iframe style={{ width: "100%", aspectRatio: "16 / 9", border: 0, display: "block" }} src="https://www.youtube.com/embed/PTfmcGlbwN0?list=PLJzp7SN2tfJsRAU9VGSfSo60CyDJzqhLP&rel=0" title="Evaluate LLMs with LLM-as-a-Judge | Catalyst" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowFullScreen />
</Frame>

## How it works

1. **Define a [rubric](/platform/eval/write-a-rubric)** - describe what "good" looks like in plain English
2. **Pick a dataset** - samples from [captured traffic](/platform/datasets/build-from-traffic) or [uploaded JSONL](/platform/datasets/upload-a-dataset)
3. **Select models** - the candidates you want to compare
4. **Run the eval** - each sample goes through each model, and an [LLM judge](/platform/eval/llm-as-a-judge) scores every output
5. **[Compare results](/platform/eval/read-the-results)** - side-by-side scores show which model wins
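
The five steps above can be sketched as a small loop. This is an illustration only, not the platform's API: `run_eval`, the two stub models, and the keyword-based `toy_judge` are hypothetical stand-ins, and a real run would use an LLM judge rather than a string check.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical types: a model maps a prompt to an output, and a judge
# maps (rubric, prompt, output) to a numeric score.
Model = Callable[[str], str]
Judge = Callable[[str, str, str], float]


def run_eval(
    rubric: str,
    dataset: List[str],
    models: Dict[str, Model],
    judge: Judge,
) -> Dict[str, float]:
    """Run every sample through every model and average the judge's scores."""
    results: Dict[str, float] = {}
    for name, model in models.items():
        scores = [judge(rubric, prompt, model(prompt)) for prompt in dataset]
        results[name] = mean(scores)
    return results


# Toy stand-ins so the sketch runs without network access.
def terse_model(prompt: str) -> str:
    return "Yes."


def polite_model(prompt: str) -> str:
    return "Thanks for asking. Yes, that works."


def toy_judge(rubric: str, prompt: str, output: str) -> float:
    # Assumption: a real judge is an LLM that reads the rubric.
    # This stub only checks for a keyword and ignores the rubric text.
    return 1.0 if "thanks" in output.lower() else 0.0


rubric = "The response is polite and thanks the user."
dataset = ["Can I reset my password?", "Does the plan include support?"]
models = {"terse": terse_model, "polite": polite_model}

scores = run_eval(rubric, dataset, models, toy_judge)
best = max(scores, key=scores.get)
print(scores, best)
```

The shape is the point: one rubric, one fixed dataset, several candidate models, and a single score per model that you can compare side by side.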

## Key concepts

| Concept               | Description                                                                                                                                                                                  |
| --------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **LLM-as-a-judge**    | A capable LLM reads your rubric, examines the model output, and returns a scored judgment.                                                                                                   |
| **Rubric**            | A plain English description of a quality dimension, scored numerically. Defines what "good" means for your use case.                                                                         |
| **Direct rubric**     | The LLM judge grades the model output directly against the rubric, without comparing it to a reference answer.                                                                               |
| **Adherence rubric**  | The LLM judge grades the model output based on how closely it matches a reference response.                                                                                                  |
| **Eval dataset**      | A stable, curated set of challenging examples that acts as your benchmark. Pick the hard cases.                                                                                              |
| **Offline vs online** | Offline evals run against collected samples. Online evals score live traffic as it flows through. Offline is available today; online is coming soon.                                         |
| **Train-eval splits** | Training and eval data must never overlap. If a model trains on eval examples, the eval becomes meaningless. See the [zero-overlap rule](/platform/datasets/overview#the-zero-overlap-rule). |
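
The train-eval split rule in the table above can be checked locally before uploading data. The following is a minimal sketch under stated assumptions: `fingerprint` and `find_overlap` are hypothetical helper names, samples are treated as plain prompt strings, and overlap is defined as an exact match after lowercasing and collapsing whitespace. The platform may define or enforce overlap differently.

```python
import hashlib
from typing import Iterable, Set


def fingerprint(text: str) -> str:
    """Hash a case- and whitespace-normalized copy of the text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_overlap(train: Iterable[str], eval_set: Iterable[str]) -> Set[str]:
    """Return the eval samples whose fingerprint also appears in the training data."""
    train_hashes = {fingerprint(sample) for sample in train}
    return {sample for sample in eval_set if fingerprint(sample) in train_hashes}


# Illustrative data only.
train_prompts = ["Summarize this ticket.", "Translate to French: hello"]
eval_prompts = ["summarize   this ticket.", "Classify the sentiment of this review."]

leaked = find_overlap(train_prompts, eval_prompts)
print(leaked)
```

Any sample returned by `find_overlap` should be removed from one of the two sets before training. Near-duplicates and paraphrases pass this check, so treat it as a floor rather than a guarantee.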

## Next steps

<CardGroup cols={2}>
  <Card title="Writing rubrics" icon="pen" href="/platform/eval/write-a-rubric">
    Create rubrics from templates, AI generation, or plain English.
  </Card>

  <Card title="Run a model comparison" icon="scale-balanced" href="/platform/eval/run-a-comparison">
    Compare models head to head on your data.
  </Card>

  <Card title="How LLM-as-a-Judge works" icon="gavel" href="/platform/eval/llm-as-a-judge">
    Understand the evaluation mechanism.
  </Card>

  <Card title="Read the results" icon="chart-column" href="/platform/eval/read-the-results">
    Interpret scores and make decisions.
  </Card>
</CardGroup>
