LLM-as-a-judge is the evaluation mechanism in Catalyst. An LLM reads your rubric, looks at the output being evaluated, and returns a numerical score. These scores are then aggregated into a final evaluation result.

How it works

  1. The judge model receives the rubric, the conversation context, and the output to judge
  2. It evaluates the output against the rubric criteria
  3. It returns a judgment with a numerical score within the rubric’s defined range, as the sketch below illustrates
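
The sketch below makes the steps above concrete, assuming an OpenAI-compatible chat completions endpoint. The base URL, model ID, prompt layout, and JSON response convention are all illustrative assumptions, not Catalyst’s internal judge implementation.

```python
# A minimal sketch of one judge call, under the assumptions stated above:
# build a prompt from the rubric, the conversation context, and the output
# to judge, then parse the judge's numeric score.
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inference.net/v1",  # assumption: OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)

def judge_output(rubric: str, context: str, output: str, judge_model: str) -> float:
    """Score one output against the rubric; the score should fall in the rubric's range."""
    prompt = (
        f"Rubric:\n{rubric}\n\n"
        f"Conversation context:\n{context}\n\n"
        f"Output to judge:\n{output}\n\n"
        'Reply with only JSON: {"score": <number>}'
    )
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
    )
    return float(json.loads(response.choices[0].message.content)["score"])

# Per-sample scores are then aggregated into a final evaluation result;
# a simple mean is shown here for illustration.
samples = [
    {"context": "User: What is the capital of France?", "output": "Paris."},
    {"context": "User: What is 2 + 2?", "output": "4."},
]
rubric = "Score 1-5: does the response answer the question correctly?"
scores = [judge_output(rubric, s["context"], s["output"], "judge-model-id") for s in samples]
print(sum(scores) / len(scores))
```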

Choosing a judge model

Use the smartest available model as your judge. You want the judge to be more capable than the models being evaluated, since a weaker judge may not reliably distinguish quality differences. You select the judge model when running an eval.
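
In code terms, the judge model is just a parameter of the eval run. Reusing the hypothetical `judge_output` sketch above, with placeholder model IDs:

```python
# Placeholder model IDs; the judge should be more capable than the model it scores.
MODEL_UNDER_EVAL = "my-finetuned-model"     # hypothetical: the model being evaluated
JUDGE_MODEL = "strongest-available-model"   # hypothetical: the judge selected at eval time

# The judge model is supplied when the eval runs, e.g. as an argument to the
# judge call from the sketch above.
score = judge_output(
    rubric="Score 1-5: does the response answer the question correctly?",
    context="User: What is the capital of France?",
    output="Paris.",
    judge_model=JUDGE_MODEL,
)
```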

Building the judge context

The judge sees only the context you define in your rubric, so the rubric must carry everything the judge needs. Catalyst provides a defined way to declare template variables that are filled in with sample data each time an eval executes. Build the rubric carefully and verify the rendered sample contents during your first evaluation run.
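
The exact variable syntax is defined by Catalyst; the sketch below just illustrates the idea with Python string templating, and the variable names are hypothetical.

```python
from string import Template

# Hypothetical rubric with template variables; Catalyst defines the real syntax.
rubric_template = Template(
    "Score 1-5 how well the response answers the user.\n"
    "User message: $user_message\n"
    "Model response: $model_output\n"
)

# At eval time, each sample's fields are substituted into the rubric,
# so the judge sees fully rendered text rather than raw placeholders.
sample = {
    "user_message": "What is the capital of France?",
    "model_output": "Paris.",
}
judge_context = rubric_template.substitute(sample)
print(judge_context)  # inspect the rendered contents on your first run
```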

Cost

Judge calls are full LLM inferences, so they have real cost. For offline evals with a bounded dataset, this is usually manageable. For online evals at scale (coming soon), sample rate controls will help manage cost.
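
A rough back-of-envelope estimate is straightforward; the token counts and prices below are placeholders, so substitute your judge model’s actual rates.

```python
# Back-of-envelope judge cost for an offline eval. All numbers are placeholders.
num_samples = 1_000                 # rows in the eval dataset
tokens_in_per_call = 2_000          # rubric + context + output fed to the judge
tokens_out_per_call = 200           # the judge's scored verdict
price_in_per_m = 3.00               # $ per 1M input tokens (placeholder)
price_out_per_m = 15.00             # $ per 1M output tokens (placeholder)

cost = num_samples * (
    tokens_in_per_call / 1e6 * price_in_per_m
    + tokens_out_per_call / 1e6 * price_out_per_m
)
print(f"Estimated judge cost: ${cost:.2f}")  # ~$9.00 with these numbers
```

Once online evals ship, a sample rate scales the same arithmetic down: judging 10% of outputs costs a tenth as much.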

What the judge doesn’t do

The judge scores against your rubric. It doesn’t independently decide what “good” means. If the rubric is vague or measures the wrong thing, the judge will faithfully score against bad criteria; a criterion like “the response is helpful” invites inconsistent scores, while “the response answers the user’s question without unsupported claims” gives the judge something checkable. This is why it’s important to validate your rubric before using it in training.