LLM-as-a-judge is the evaluation mechanism in Catalyst. An LLM reads your rubric, looks at the output being evaluated, and returns a numerical score.

How it works

  1. The judge model receives the rubric, the conversation context, and the output to judge
  2. It evaluates the output against the rubric criteria
  3. It returns a judgment with a numerical score within the rubric’s defined range
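The steps above can be sketched in a few lines. This is an illustrative sketch, not Catalyst's actual API: `build_judge_prompt`, `parse_score`, and `judge` are hypothetical helpers, and `call_judge_model` stands in for a real LLM inference call.

```python
# Hypothetical sketch of the judge flow: assemble a prompt from the rubric,
# conversation context, and output, run one inference, and parse a score
# that must fall inside the rubric's defined range.

def build_judge_prompt(rubric, context, output):
    return (
        f"Rubric (score range {rubric['min']}-{rubric['max']}):\n"
        f"{rubric['criteria']}\n\n"
        f"Conversation context:\n{context}\n\n"
        f"Output to judge:\n{output}\n\n"
        "Respond with a single number in the range."
    )

def parse_score(response, rubric):
    score = float(response.strip())
    if not rubric["min"] <= score <= rubric["max"]:
        raise ValueError(f"score {score} is outside the rubric's range")
    return score

def judge(rubric, context, output, call_judge_model):
    # call_judge_model is a placeholder for a real LLM inference.
    prompt = build_judge_prompt(rubric, context, output)
    return parse_score(call_judge_model(prompt), rubric)

# Example with a stubbed model call standing in for a real judge:
rubric = {"min": 1, "max": 5, "criteria": "Answer must be concise and correct."}
score = judge(rubric, "User asked for a summary.", "Short, accurate summary.",
              lambda prompt: "4")
print(score)  # 4.0
```

Note the range check in `parse_score`: rejecting out-of-range responses outright is one simple way to surface a misbehaving judge rather than silently recording a bad score.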

Choosing a judge model

Use the smartest available model as your judge. You want the judge to be more capable than the models being evaluated, since a weaker judge may not reliably distinguish quality differences. You select the judge model when running an eval.

Cost

Judge calls are full LLM inferences, so they have real cost. For offline evals with a bounded dataset, this is usually manageable. For online evals at scale (coming soon), sample rate controls will help manage cost.
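A back-of-envelope calculation shows why sample rate is the lever at scale. The numbers below are illustrative assumptions, not Catalyst pricing:

```python
# Rough judge-cost estimate: one full inference per sampled output.
# All figures here are made-up examples.

def judge_cost(num_outputs, sample_rate, cost_per_call):
    return num_outputs * sample_rate * cost_per_call

# Judging every output in a high-traffic online eval:
print(judge_cost(100_000, 1.0, 0.01))   # 1000.0
# Sampling 5% of outputs cuts the bill proportionally:
print(judge_cost(100_000, 0.05, 0.01))  # 50.0
```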

What the judge doesn’t do

The judge scores against your rubric. It doesn’t independently decide what “good” means. If the rubric is vague or measures the wrong thing, the judge will faithfully score against bad criteria, which is why it’s important to validate your rubric before using it in training.