Run a Model Comparison


Pick a rubric, pick a dataset, pick models, and run. Catalyst handles execution and scoring.

Step by step

Select a rubric

Choose the rubric that defines your quality criteria.

Select an eval dataset

Choose the dataset containing your evaluation samples. This can come from captured traffic or a JSONL upload.
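Before uploading, it can help to confirm that the file is valid JSONL, meaning one JSON object per line. The sketch below only checks that structure. The `load_jsonl` helper and the `input` field name are hypothetical and used for illustration; they are not Catalyst's documented schema.

```python
import json


def load_jsonl(text):
    """Parse JSONL text: one JSON object per non-empty line."""
    samples = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)  # raises json.JSONDecodeError on a malformed line
        if not isinstance(record, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        samples.append(record)
    return samples


# Hypothetical sample shape; the "input" field name is an assumption.
example = (
    '{"input": "Summarize this ticket."}\n'
    '{"input": "Translate to French: hello"}\n'
)
samples = load_jsonl(example)
print(len(samples))
```

A malformed line raises an error with enough context to find it, which is usually cheaper than discovering the problem after an upload fails.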

Select models

Pick one or more models to evaluate. You can choose from models by OpenAI and Anthropic, open-source models, and your own custom-trained models.

Run the eval

Every sample in the dataset runs through each selected model. The LLM judge then scores each output against your rubric.

Eval setup flow showing rubric, dataset, and model selection in the dashboard.

How the math works

The eval is a cross-product of samples and models:

10 samples × 3 models = 30 inference outputs
Each output is scored once = 30 judge calls
Results: one score per sample for every model
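The arithmetic above can be sketched as a loop over every (sample, model) pair. This is a minimal illustration, not Catalyst's implementation: `run_comparison`, `fake_generate`, and `fake_judge` are hypothetical stand-ins for the real inference and judge calls.

```python
from itertools import product


def run_comparison(samples, models, generate, judge):
    """Score every (sample, model) pair: one inference and one judge call each."""
    scores = {}
    for sample, model in product(samples, models):
        output = generate(model, sample)  # one inference output per pair
        scores[(sample, model)] = judge(sample, output)  # one judge call per output
    return scores


# Hypothetical stand-ins for real model and judge calls.
def fake_generate(model, sample):
    return f"{model} answer to {sample}"


def fake_judge(sample, output):
    return len(output) % 5 + 1  # placeholder score between 1 and 5


samples = [f"sample-{i}" for i in range(10)]
models = ["model-a", "model-b", "model-c"]

scores = run_comparison(samples, models, fake_generate, fake_judge)
print(len(scores))  # 30 scores: 10 samples x 3 models
```

Because the cost grows with the product of the two counts, adding one model to a 10-sample dataset adds 10 inference outputs and 10 judge calls.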

Next steps

Once the eval completes, go to Read the Results to interpret the comparison view.
