What the comparison shows
- Side-by-side plots highlighting where models differ in quality
- Full scores table across all models and samples
- Per-sample breakdown so you can see where specific models excel or struggle (the sketch below shows one way to assemble these views from raw results)
TODO:MEDIA
Screenshot of the eval comparison view showing side-by-side model scores and the scores table.
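To make the table and breakdown concrete, here is a minimal sketch of how they might be assembled outside the UI, assuming eval results can be exported as flat records with `model`, `sample_id`, `rubric`, and `score` fields (the field names, values, and export format are assumptions for illustration, not a documented API):

```python
from collections import defaultdict

# Hypothetical export: one record per (model, sample, rubric) score.
# Field names and values are assumptions for illustration only.
results = [
    {"model": "model-a", "sample_id": "s1", "rubric": "accuracy", "score": 4},
    {"model": "model-a", "sample_id": "s2", "rubric": "accuracy", "score": 3},
    {"model": "model-b", "sample_id": "s1", "rubric": "accuracy", "score": 5},
    {"model": "model-b", "sample_id": "s2", "rubric": "accuracy", "score": 2},
]

# Full scores table: average score per model across all samples and rubrics.
totals = defaultdict(list)
for r in results:
    totals[r["model"]].append(r["score"])
averages = {model: sum(scores) / len(scores) for model, scores in totals.items()}

# Per-sample breakdown: sample_id -> {model: score}, to see where models diverge.
per_sample = defaultdict(dict)
for r in results:
    per_sample[r["sample_id"]][r["model"]] = r["score"]

print(averages)          # {'model-a': 3.5, 'model-b': 3.5}
print(dict(per_sample))  # {'s1': {'model-a': 4, 'model-b': 5}, 's2': {...}}
```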
How to read the results
Look for (the sketch after this list shows one way to compute these checks):
- Overall winner - which model has the highest average score across your rubric
- Edge cases - samples where one model significantly outperforms another
- Rubric dimensions - if you have multiple rubrics, check whether models trade off on different quality dimensions (e.g. one model is more accurate but another has better tone)
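Continuing the same assumed record format as the earlier sketch, the edge-case and rubric-dimension checks could be computed along these lines, with an illustrative gap threshold:

```python
from collections import defaultdict

# Same assumed record shape as the earlier sketch; scores are illustrative only.
results = [
    {"model": "model-a", "sample_id": "s1", "rubric": "accuracy", "score": 4},
    {"model": "model-b", "sample_id": "s1", "rubric": "accuracy", "score": 5},
    {"model": "model-a", "sample_id": "s2", "rubric": "accuracy", "score": 5},
    {"model": "model-b", "sample_id": "s2", "rubric": "accuracy", "score": 2},
    {"model": "model-a", "sample_id": "s1", "rubric": "tone", "score": 3},
    {"model": "model-b", "sample_id": "s1", "rubric": "tone", "score": 5},
]

EDGE_CASE_GAP = 2  # flag samples where the best and worst model differ by this much

# Edge cases: average each model's score per sample, then flag samples with a large spread.
per_sample_model = defaultdict(list)
for r in results:
    per_sample_model[(r["sample_id"], r["model"])].append(r["score"])

sample_scores = defaultdict(dict)
for (sample, model), scores in per_sample_model.items():
    sample_scores[sample][model] = sum(scores) / len(scores)

edge_cases = {
    sample: models
    for sample, models in sample_scores.items()
    if max(models.values()) - min(models.values()) >= EDGE_CASE_GAP
}

# Rubric dimensions: average per (model, rubric) to surface quality trade-offs,
# e.g. one model leading on accuracy while another leads on tone.
by_dimension = defaultdict(list)
for r in results:
    by_dimension[(r["model"], r["rubric"])].append(r["score"])
trade_offs = {key: sum(v) / len(v) for key, v in by_dimension.items()}

print(edge_cases)  # {'s2': {'model-a': 5.0, 'model-b': 2.0}}
print(trade_offs)  # {('model-a', 'accuracy'): 4.5, ('model-b', 'accuracy'): 3.5, ...}
```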
Making decisions
- Which model to use in production - the one that best matches your quality criteria
- Whether to train a custom model - if no off-the-shelf model scores well enough, fine-tuning is the next step
- Whether the rubric needs work - if scores don't align with your intuition, iterate on the rubric before changing models (the sketch below frames these checks in code)
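As a rough illustration only, these decisions can be framed as simple checks against a quality bar and a handful of human spot checks; the threshold, score scale, and variable names below are assumptions, not a prescribed workflow:

```python
# Assumed inputs: per-model average rubric scores (0-5 scale) and a few human
# spot-check judgments on the same samples. All values are illustrative.
model_averages = {"model-a": 4.2, "model-b": 3.1}
QUALITY_BAR = 4.0                 # minimum average you consider production-ready
human_spot_checks = [4, 5, 2, 4]  # human judgments on a handful of samples
rubric_scores = [4, 5, 3, 1]      # rubric scores for the same samples

best_model, best_score = max(model_averages.items(), key=lambda kv: kv[1])

if best_score >= QUALITY_BAR:
    print(f"Use {best_model} in production (avg {best_score:.1f} >= {QUALITY_BAR}).")
else:
    print("No off-the-shelf model clears the bar; consider fine-tuning a custom model.")

# Rubric sanity check: if rubric scores regularly disagree with your own judgment
# on spot-checked samples, iterate on the rubric before switching models.
disagreements = sum(
    1 for human, rubric in zip(human_spot_checks, rubric_scores)
    if abs(human - rubric) >= 2
)
if disagreements > len(human_spot_checks) // 2:
    print("Rubric scores diverge from human spot checks; revisit the rubric first.")
```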