Offline evaluation
Offline eval is the standard flow: run a dataset through one or more models, score the outputs, and compare the results. You control what gets evaluated and when.
This is what’s available today. The inputs typically come from captured production traffic, but the outputs are re-generated. You select which models to run against rather than judging the original production outputs.
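The offline flow above can be sketched as a simple loop. This is an illustrative sketch only: `run_model`, `score_output`, `MODELS`, and the dataset shape are hypothetical names, not part of any real Catalyst API.

```python
import statistics

def run_model(model: str, prompt: str) -> str:
    """Stand-in for a model call; offline eval re-generates outputs per model."""
    return f"{model} answer to: {prompt}"

def score_output(output: str, expected: str) -> float:
    """Toy scorer: exact match. Real setups use rubrics or judge models."""
    return 1.0 if output.strip() == expected.strip() else 0.0

# Inputs captured from production traffic; outputs are re-generated below.
dataset = [
    {"prompt": "What is 2+2?", "expected": "4"},
    {"prompt": "Capital of France?", "expected": "Paris"},
]

MODELS = ["model-a", "model-b"]  # you choose which models to run against
for model in MODELS:
    scores = [
        score_output(run_model(model, row["prompt"]), row["expected"])
        for row in dataset
    ]
    print(model, statistics.mean(scores))
```

The key point the sketch captures: the outputs being scored come from the re-run, not from what production actually served.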
Online evaluation
Online evaluation is coming soon.
Online eval will score live production traffic as it flows through Catalyst. Instead of re-running samples through models, it judges the actual outputs your users are seeing.
How it differs from offline:
- Sample rate controls - evaluate a percentage of traffic to manage cost
- Real outputs - judge what your model actually produced, not a re-run
- Continuous - always running, not triggered manually
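The sample-rate control can be sketched as a per-request gate. `SAMPLE_RATE` and `should_evaluate` are hypothetical names used for illustration, not a documented Catalyst interface.

```python
import random

SAMPLE_RATE = 0.1  # evaluate ~10% of live traffic to manage cost

def should_evaluate(rng: random.Random) -> bool:
    """Decide per request whether this production output gets scored."""
    return rng.random() < SAMPLE_RATE

# Simulate 10,000 production requests flowing through the gate.
rng = random.Random(42)
sampled = sum(should_evaluate(rng) for _ in range(10_000))
print(f"evaluated {sampled} of 10000 requests")
```

Because the gate is probabilistic per request, cost scales linearly with the sample rate while the evaluated slice stays representative of live traffic.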
Current limitation
Today, you always select a model to generate new outputs for evaluation. You can’t run a rubric directly against captured production outputs. This means offline evals answer “how would model X perform on this data?” rather than “how did my production model actually perform?”
Near-term, the platform will support judging captured production outputs directly against rubrics, bridging toward full online evaluation.
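Judging captured outputs directly might look like the following. This is a speculative sketch of that near-term direction, assuming a captured prompt/output record shape and a `rubric_score` function; none of these names come from an existing API.

```python
# Production pairs as captured: both the prompt and the output users saw.
captured = [
    {"prompt": "Capital of France?", "output": "Paris"},
    {"prompt": "What is 2+2?", "output": "4"},
]

def rubric_score(prompt: str, output: str) -> float:
    """Toy rubric: non-empty, reasonably concise answers pass.
    A real rubric would encode task-specific criteria."""
    return 1.0 if output and len(output) < 50 else 0.0

# No model call here: the rubric runs against outputs production already served.
scores = [rubric_score(r["prompt"], r["output"]) for r in captured]
print(sum(scores) / len(scores))
```

The difference from the offline sketch is the absence of any model invocation: this answers "how did my production model actually perform?" rather than "how would model X perform?".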