Catalyst is a platform for building and deploying task-specific AI models. Instead of relying on large general-purpose models for every task, Catalyst helps you collect production data, evaluate model quality, fine-tune smaller models optimized for your workload, and deploy them on dedicated infrastructure. The platform also provides access to open-source and Inference.net-trained models (like Schematron for structured data extraction) through an OpenAI-compatible API. Not every team goes through every stage. Many start with observability and evals alone. The platform is useful at every step — use only the parts you need, and add more as your requirements grow.
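As a taste of the API surface, here is a minimal sketch of calling a hosted model through the OpenAI-compatible API using the official OpenAI Python client. The base URL and the Schematron model ID shown are assumptions for illustration; check the API reference for the exact values.

```python
# A minimal sketch: calling an Inference.net-hosted model through the
# OpenAI-compatible API. Base URL and model ID are assumptions; see
# the API reference for the real values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inference.net/v1",  # assumed platform endpoint
    api_key="YOUR_INFERENCE_API_KEY",
)

response = client.chat.completions.create(
    model="inference-net/schematron-8b",  # hypothetical ID for Schematron
    messages=[
        {"role": "user", "content": "Extract name and price as JSON: 'Widget, $9.99'"}
    ],
)
print(response.choices[0].message.content)
```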

Observe

Record and analyze your production LLM traffic. Catalyst Gateway sits between your app and your LLM provider, capturing every request, response, cost, and latency metric with less than 10ms of overhead. Keep using any provider or model; Gateway is transparent.

Outcome: Full visibility into how your AI features perform in production, broken down by model, task, and provider.
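In practice, adopting a transparent gateway is usually just a base-URL change in your existing client. A minimal sketch, assuming an OpenAI-style client (the Gateway URL below is a placeholder; use the one from your Catalyst dashboard):

```python
# Route existing traffic through Catalyst Gateway by changing the
# client's base URL. The URL below is a placeholder, not the real
# Gateway endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR-GATEWAY-URL/v1",  # placeholder Gateway endpoint
    api_key="YOUR_API_KEY",
)

# The request itself is unchanged: Gateway forwards it to your provider
# and records the request, response, cost, and latency along the way.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # keep using any provider or model
    messages=[{"role": "user", "content": "Summarize this support ticket: ..."}],
)
print(response.choices[0].message.content)
```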

Get started with Observe

Set up Gateway and start capturing LLM traffic.

Datasets

Curate collections of LLM inputs and outputs for evaluation and training. Datasets can come from your live production traffic captured through Observe, or from files you upload directly.

Outcome: Clean, representative datasets scoped to specific tasks, ready to power evals and training.
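If you're uploading files rather than capturing traffic, a common shape for a record is an OpenAI-style chat transcript, one JSON object per line. A sketch of what one JSONL record might look like (the schema here is illustrative, not necessarily the platform's documented format):

```python
# Illustrative only: one chat-format record appended to a JSONL file.
# The exact schema Catalyst expects may differ; see the Datasets docs.
import json

record = {
    "messages": [
        {"role": "system", "content": "Extract the invoice fields as JSON."},
        {"role": "user", "content": "Invoice #123 from Acme Corp, total $450.00"},
        {
            "role": "assistant",
            "content": '{"invoice_id": "123", "vendor": "Acme Corp", "total": 450.0}',
        },
    ]
}

with open("invoice-extraction.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```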

Get started with Datasets

Build or upload your first dataset.

Eval

Measure model quality with rubrics scored by LLM judges. Define what “good” looks like for your use case, then score model outputs systematically across candidates. Evals tell you which model is better and by how much, so you can make decisions with data instead of intuition.

Outcome: A repeatable, data-driven way to measure model quality before and after every change, and a validated rubric that can guide training.
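To make the idea concrete, here is a minimal LLM-as-judge sketch: one rubric criterion, scored by a judge model over a batch of candidate outputs. The judge model, rubric wording, and 1–5 scale are all assumptions for illustration, not the platform's built-in eval API.

```python
# A minimal LLM-as-judge sketch. Judge model, rubric text, and the
# 1-5 scale are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "You are grading a support-ticket summary. "
    "Score 1-5 for whether it captures the customer's core complaint. "
    "Reply with the number only."
)

def judge(output: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o",  # any capable judge model works here
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": output},
        ],
    )
    return int(resp.choices[0].message.content.strip())

candidate_outputs = ["The customer reports being double-billed in March."]
scores = [judge(o) for o in candidate_outputs]
print(sum(scores) / len(scores))  # mean rubric score for this candidate
```

Running the same rubric over outputs from two models gives you a side-by-side comparison: which model is better, and by how much.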

Get started with Eval

Define quality, measure it, and compare models.

Train

Fine-tune a task-specific model on your production data. The result is a model that’s smaller, faster, and cheaper to run than the general-purpose model it replaces, while being more accurate for your workload. You don’t need to be an ML engineer to use it.

Outcome: A trained, task-specific model that has been validated against your rubric, ready to deploy.

Get started with Train

Fine-tune a model on your data.

Deploy

Ship your trained model to a dedicated GPU with an OpenAI-compatible API. The API uses the same base URL and API key as the rest of the Inference platform, so switching from an off-the-shelf model to your custom model is a one-line code change.

Outcome: A production endpoint serving your custom model, and the beginning of the next improvement loop: deploy, observe, eval, retrain.
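A sketch of that one-line change, assuming an OpenAI-style client already pointed at the platform (the base URL and both model IDs are placeholders):

```python
# The one-line switch: same client, same base URL and key, new model ID.
# Base URL and model IDs below are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inference.net/v1",  # assumed platform endpoint
    api_key="YOUR_INFERENCE_API_KEY",
)

response = client.chat.completions.create(
    # model="meta-llama/llama-3.1-8b-instruct",  # before: off-the-shelf model
    model="your-org/invoice-extractor-v1",       # after: your deployed custom model
    messages=[{"role": "user", "content": "Extract the fields from this invoice: ..."}],
)
print(response.choices[0].message.content)
```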

Get started with Deploy

Ship your model to a dedicated GPU.

Pick your starting point

Record your first LLM call

Route traffic through Catalyst Gateway to automatically trace LLM calls and view metrics.

Run your first eval

Define quality, measure it, and compare models side by side.

Train and deploy a model

The full loop: data, training, and a production endpoint.

Use the Inference API

Access open-source and Inference.net models directly.