An eval measures which model is better for your task, and by how much. You define a rubric that describes what “good” looks like, run your data through candidate models, and let an LLM judge score the outputs. This is how you know whether a smaller, cheaper model can replace the one you’re using today.

What you need

  • A dataset, either captured from live traffic via Observe or uploaded as a JSONL file
  • A rubric, a plain English description of what “good” looks like for your use case
  • Two or more models to compare

Step by step

1. Create an eval dataset

If you’ve already captured traffic, go to the Inference Viewer, filter to the requests that represent your task, and save them as an eval dataset. If you’re starting fresh, upload a JSONL file from the Datasets page.
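If you're uploading a dataset, each line of the JSONL file is one standalone JSON object representing a sample. The exact schema your platform expects may differ from this sketch (the `prompt` and `reference` field names here are assumptions, not the documented format); the point is that every line must parse as valid JSON on its own:

```python
import json

# Hypothetical eval samples -- the field names are illustrative only;
# check the Datasets page for the schema your platform actually expects.
samples = [
    {"prompt": "Summarize this support ticket: ...", "reference": "..."},
    {"prompt": "Classify the sentiment of: ...", "reference": "positive"},
]

# JSONL = one JSON object per line, newline-separated.
with open("eval_dataset.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# Sanity-check before uploading: every line must be valid, standalone JSON.
with open("eval_dataset.jsonl") as f:
    rows = [json.loads(line) for line in f]
print(len(rows))  # 2
```

A quick validation pass like this catches the most common upload failure: a file that is one big JSON array instead of newline-delimited objects.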
2. Create a rubric

Go to the Evals page and create a new rubric. You have three options:
  • Generate from your dataset — let AI analyze your data and produce a rubric automatically
  • Start from a template — pick from pre-built rubrics for common quality dimensions
  • Write your own — describe what “good” looks like in plain English and set the scoring range
See Write a Rubric for details on the template language.
3. Run the eval

Select your rubric, your eval dataset, and the models you want to compare. You can choose from a wide range of models — OpenAI, Anthropic, open-source, or your own custom-trained models if you have any. Click Run. Each sample in the dataset runs through each model, and an LLM judge scores every output against your rubric.
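Conceptually, the run is a nested loop: every sample goes through every model, and the judge scores each output against the rubric. The platform does this for you; in this minimal sketch, `call_model` and `judge_score` are hypothetical stubs standing in for the real model and LLM-judge calls:

```python
# Stub model call -- in reality this hits each candidate model's API.
def call_model(model: str, prompt: str) -> str:
    return f"{model} answer to: {prompt}"

# Stub LLM judge -- in reality an LLM scores the output against the rubric.
def judge_score(rubric: str, prompt: str, output: str) -> int:
    return len(output) % 5 + 1  # fake score in the rubric's 1-5 range

rubric = "The answer is accurate, concise, and grounded in the input."
dataset = ["Summarize ticket #1", "Summarize ticket #2"]
models = ["model-a", "model-b"]

# Every sample runs through every model; the judge scores every output.
scores = {
    m: [judge_score(rubric, p, call_model(m, p)) for p in dataset]
    for m in models
}
print(scores)
```

Because the work is a full cross-product of samples × models, run time and cost grow with both dataset size and the number of candidates.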
4. Compare the results

The comparison view shows side-by-side scores across all models and samples. Use it to decide which model wins, or whether you need to iterate on the rubric.
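"Which model wins" usually comes down to aggregating per-sample scores into a per-model summary. A sketch of that aggregation, with made-up scores (the comparison view computes this for you):

```python
# Hypothetical judge scores per model across four samples.
scores = {
    "model-a": [4, 5, 3, 4],
    "model-b": [3, 3, 4, 3],
}

# Mean score per model, then pick the highest.
means = {m: sum(s) / len(s) for m, s in scores.items()}
winner = max(means, key=means.get)
print(winner, means[winner])  # model-a 4.0
```

A close margin, or high variance across samples, is a signal to iterate on the rubric rather than declare a winner.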

📍 TODO:MEDIA

Screenshot or animation of the eval comparison view showing side-by-side model scores across samples.

Next steps

Write a rubric

Deep dive on rubric design, template variables, and scoring ranges.

Read the results

How to interpret the comparison view and decide which model wins.

Generate a rubric

Let AI create a rubric from your dataset or start from a template.

Train a custom model

No model good enough? Fine-tune one that is.