What you need
- A dataset, either captured from live traffic via Observe or uploaded as a JSONL file
- A rubric, a plain English description of what “good” looks like for your use case
- Two or more models to compare
Step by step
Create an eval dataset
If you’ve already captured traffic, go to the Inference Viewer, filter to the requests that represent your task, and save them as an eval dataset. If you’re starting fresh, upload a JSONL file from the Datasets page.
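If you're uploading, each line of the JSONL file is one standalone JSON object, one sample per line. The field names below are illustrative, not a required schema — check your Datasets page for the exact fields expected. A minimal sketch of producing and sanity-checking such a file:

```python
import json

# Hypothetical sample schema; your platform's expected field names may differ.
samples = [
    {"input": "Summarize: The quick brown fox jumps over the lazy dog.",
     "expected_output": "A fox jumps over a dog."},
    {"input": "Summarize: It was the best of times, it was the worst of times.",
     "expected_output": "Times were both very good and very bad."},
]

with open("eval_dataset.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")  # one JSON object per line

# Sanity-check: every line must parse as standalone JSON.
with open("eval_dataset.jsonl") as f:
    rows = [json.loads(line) for line in f]
print(len(rows))  # 2
```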
Create a rubric
Go to the Evals page and create a new rubric. You have three options:
- Generate from your dataset — let AI analyze your data and produce a rubric automatically
- Start from a template — pick from pre-built rubrics for common quality dimensions
- Write your own — describe what “good” looks like in plain English and set the scoring range
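Whichever option you pick, a rubric boils down to a plain-English description plus a scoring range. The structure below is only illustrative (the field names are assumptions, not the platform's format), but it shows the two pieces a hand-written rubric needs:

```python
# Illustrative rubric for a summarization task; structure is hypothetical.
rubric = {
    "name": "summary-quality",
    "description": (
        "Score the summary on faithfulness and concision. "
        "A good summary covers every key fact from the input, "
        "adds nothing the input does not say, and stays under two sentences."
    ),
    "min_score": 1,  # worst possible score
    "max_score": 5,  # best possible score
}

assert rubric["min_score"] < rubric["max_score"]
print(rubric["name"])
```

Concrete, checkable criteria ("covers every key fact", "under two sentences") give the LLM judge far more to work with than "the summary should be good."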
Run the eval
Select your rubric, your eval dataset, and the models you want to compare. You can choose from a wide range of models: OpenAI, Anthropic, open-source, or your own custom-trained models if you have any. Click Run. Each sample in the dataset runs through each model, and an LLM judge scores every output against your rubric.
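The platform handles all of this for you, but conceptually the run is a nested loop: every sample through every model, every output through the judge. A sketch with stand-in functions (the toy models and stub judge here are assumptions, not a real API):

```python
# Stand-ins for real model calls.
def toy_model_a(prompt: str) -> str:
    return prompt.upper()

def toy_model_b(prompt: str) -> str:
    return prompt[:20]  # truncates to 20 characters

def judge(output: str, rubric: str) -> int:
    # A real judge is an LLM prompted with the rubric; this stub
    # just rewards short outputs on a 1-5 scale.
    return 5 if len(output) <= 20 else 2

rubric = "Good outputs are concise."
dataset = ["Summarize the meeting notes.", "Translate 'hello' to French."]
models = {"model-a": toy_model_a, "model-b": toy_model_b}

# Every sample runs through every model; the judge scores each output.
results = {}  # (model name, sample index) -> score
for name, model in models.items():
    for i, sample in enumerate(dataset):
        output = model(sample)
        results[(name, i)] = judge(output, rubric)

print(results)
```

With N samples and M models you get N × M judged outputs, which is exactly the grid the comparison view displays.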
Compare the results
The comparison view shows side-by-side scores across all models and samples. Use it to decide which model wins, or whether you need to iterate on the rubric.
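Under the hood, "which model wins" usually means comparing per-model averages over the scored grid. A minimal sketch of that aggregation (the scores below are invented for illustration):

```python
from statistics import mean

# Invented scores keyed by (model name, sample index).
results = {
    ("model-a", 0): 2, ("model-a", 1): 4,
    ("model-b", 0): 5, ("model-b", 1): 5,
}

# Group scores by model, then average.
per_model = {}
for (model, _), score in results.items():
    per_model.setdefault(model, []).append(score)

averages = {model: mean(scores) for model, scores in per_model.items()}
winner = max(averages, key=averages.get)
print(averages, winner)
```

If one model wins on average but loses badly on a handful of samples, inspect those samples before declaring a winner — they often reveal a gap in the rubric rather than in the model.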
📍 TODO:MEDIA
Screenshot or animation of the eval comparison view showing side-by-side model scores across samples.
Next steps
Write a rubric
Deep dive on rubric design, template variables, and scoring ranges.
Read the results
How to interpret the comparison view and decide which model wins.
Generate a rubric
Let AI create a rubric from your dataset or start from a template.
Train a custom model
No model good enough? Fine-tune one that is.