LLM-as-a-judge is the evaluation mechanism in Catalyst. An LLM reads your rubric, looks at the output being evaluated, and returns a numerical score.

How it works

  1. The judge model receives the rubric, the conversation context, and the output to judge
  2. It evaluates the output against the rubric criteria
  3. It returns a judgment with a numerical score within the rubric’s defined range
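The steps above can be sketched in a few lines. This is an illustrative sketch, not Catalyst's actual API: `build_judge_prompt`, `parse_score`, and `judge` are hypothetical helpers, and `call_judge_model` stands in for a real LLM inference call.

```python
# Hypothetical sketch of the judge flow: assemble a prompt from the rubric,
# conversation context, and output, run one inference, and parse a score
# that must fall inside the rubric's defined range.

def build_judge_prompt(rubric, context, output):
    return (
        f"Rubric (score range {rubric['min']}-{rubric['max']}):\n"
        f"{rubric['criteria']}\n\n"
        f"Conversation context:\n{context}\n\n"
        f"Output to judge:\n{output}\n\n"
        "Respond with a single number in the range."
    )

def parse_score(response, rubric):
    score = float(response.strip())
    if not rubric["min"] <= score <= rubric["max"]:
        raise ValueError(f"score {score} is outside the rubric's range")
    return score

def judge(rubric, context, output, call_judge_model):
    # call_judge_model is a placeholder for a real LLM inference.
    prompt = build_judge_prompt(rubric, context, output)
    return parse_score(call_judge_model(prompt), rubric)

# Example with a stubbed model call standing in for a real judge:
rubric = {"min": 1, "max": 5, "criteria": "Answer must be concise and correct."}
score = judge(rubric, "User asked for a summary.", "Short, accurate summary.",
              lambda prompt: "4")
print(score)  # 4.0
```

Note the range check in `parse_score`: rejecting out-of-range responses outright is one simple way to surface a misbehaving judge rather than silently recording a bad score.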

Choosing a judge model

Use the smartest available model as your judge. You want the judge to be more capable than the models being evaluated, since a weaker judge may not reliably distinguish quality differences. You select the judge model when running an eval.

Cost

Judge calls are full LLM inferences, so they have real cost. For offline evals with a bounded dataset, this is usually manageable. For online evals at scale (coming soon), sample rate controls will help manage cost.
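A back-of-envelope calculation shows why sample rate is the lever at scale. The numbers below are illustrative assumptions, not Catalyst pricing:

```python
# Rough judge-cost estimate: one full inference per sampled output.
# All figures here are made-up examples.

def judge_cost(num_outputs, sample_rate, cost_per_call):
    return num_outputs * sample_rate * cost_per_call

# Judging every output in a high-traffic online eval:
print(judge_cost(100_000, 1.0, 0.01))   # 1000.0
# Sampling 5% of outputs cuts the bill proportionally:
print(judge_cost(100_000, 0.05, 0.01))  # 50.0
```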

What the judge doesn’t do

The judge scores against your rubric. It doesn’t independently decide what “good” means. If the rubric is vague or measures the wrong thing, the judge will faithfully score against bad criteria, which is why it’s important to validate your rubric before using it in training.