Offline evaluation
Offline eval is the standard flow: run a dataset through one or more models, score the outputs, and compare the results. You control what gets evaluated and when.
This is what’s available today. The inputs typically come from captured production traffic, but the outputs are re-generated. You select which models to run against rather than judging the original production outputs.
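The offline flow above can be sketched as a simple loop. This is an illustrative sketch only: `run_model`, `score_output`, `MODELS`, and the dataset shape are hypothetical names, not part of any real Catalyst API.

```python
import statistics

def run_model(model: str, prompt: str) -> str:
    """Stand-in for a model call; offline eval re-generates outputs per model."""
    return f"{model} answer to: {prompt}"

def score_output(output: str, expected: str) -> float:
    """Toy scorer: exact match. Real setups use rubrics or judge models."""
    return 1.0 if output.strip() == expected.strip() else 0.0

# Inputs captured from production traffic; outputs are re-generated below.
dataset = [
    {"prompt": "What is 2+2?", "expected": "4"},
    {"prompt": "Capital of France?", "expected": "Paris"},
]

MODELS = ["model-a", "model-b"]  # you choose which models to run against
for model in MODELS:
    scores = [
        score_output(run_model(model, row["prompt"]), row["expected"])
        for row in dataset
    ]
    print(model, statistics.mean(scores))
```

The key point the sketch captures: the outputs being scored come from the re-run, not from what production actually served.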
Online evaluation
Online evaluation is coming soon.
Online eval will score live production traffic as it flows through Catalyst. Instead of re-running samples through models, it judges the actual outputs your users are seeing.
How it differs from offline:
- Sample rate controls - evaluate a percentage of traffic to manage cost
- Real outputs - judge what your model actually produced, not a re-run
- Continuous - always running, not triggered manually
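The sample-rate control can be sketched as a per-request gate. `SAMPLE_RATE` and `should_evaluate` are hypothetical names used for illustration, not a documented Catalyst interface.

```python
import random

SAMPLE_RATE = 0.1  # evaluate ~10% of live traffic to manage cost

def should_evaluate(rng: random.Random) -> bool:
    """Decide per request whether this production output gets scored."""
    return rng.random() < SAMPLE_RATE

# Simulate 10,000 production requests flowing through the gate.
rng = random.Random(42)
sampled = sum(should_evaluate(rng) for _ in range(10_000))
print(f"evaluated {sampled} of 10000 requests")
```

Because the gate is probabilistic per request, cost scales linearly with the sample rate while the evaluated slice stays representative of live traffic.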
Current limitation
Today, you always select a model to generate new outputs for evaluation. You can’t run a rubric directly against captured production outputs. This means offline evals answer “how would model X perform on this data?” rather than “how did my production model actually perform?”
Near-term, the platform will support judging captured production outputs directly against rubrics, bridging toward full online evaluation.
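Judging captured outputs directly might look like the following. This is a speculative sketch of that near-term direction, assuming a captured prompt/output record shape and a `rubric_score` function; none of these names come from an existing API.

```python
# Production pairs as captured: both the prompt and the output users saw.
captured = [
    {"prompt": "Capital of France?", "output": "Paris"},
    {"prompt": "What is 2+2?", "output": "4"},
]

def rubric_score(prompt: str, output: str) -> float:
    """Toy rubric: non-empty, reasonably concise answers pass.
    A real rubric would encode task-specific criteria."""
    return 1.0 if output and len(output) < 50 else 0.0

# No model call here: the rubric runs against outputs production already served.
scores = [rubric_score(r["prompt"], r["output"]) for r in captured]
print(sum(scores) / len(scores))
```

The difference from the offline sketch is the absence of any model invocation: this answers "how did my production model actually perform?" rather than "how would model X perform?".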