Deploy gives you a dedicated GPU serving your fine-tuned model. The API is OpenAI-compatible, so switching from an off-the-shelf model to your custom model is a one-line code change. This is the last step in the loop and the beginning of the next one.

📍 TODO:MEDIA

Screenshot of the deployments dashboard showing a running deployment with status and endpoint info.
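Because the API is OpenAI-compatible, the switch really is one line: the model parameter in the request body. A minimal sketch, assuming placeholder model ids (the exact ids for your deployment come from the dashboard):

```python
import json

def chat_payload(model: str, user_message: str) -> dict:
    """Build an OpenAI-compatible chat completions request body."""
    return {
        # The one-line change: swap an off-the-shelf model id
        # for your deployment's model id.
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }

# Before: chat_payload("some-base-model", "Hello")      (hypothetical id)
# After:  chat_payload("my-finetuned-model", "Hello")   (hypothetical id)
body = json.dumps(chat_payload("my-finetuned-model", "Hello"))
```

Everything else in the request, and in the response you get back, keeps the same shape.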

Key concepts

  • Dedicated GPU: Your model runs on its own GPU, with no shared infrastructure and no noisy neighbors. Compute is determined by the recipe used during training.
  • OpenAI-compatible API: Same base URL, same API key; just swap the model parameter. Structured outputs, function calling, and all standard API features work the same way.
  • Self-serve vs Managed: Self-serve gives you a single GPU, deployed in a few clicks, great for validation and early production. Managed provides multi-GPU serving at production scale.
  • The improvement loop: Deploy → observe production performance → run evals to catch regressions → train the next version. The loop continues.
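Putting the "same base URL, same API key" idea concretely: a hedged sketch of building a chat completions request with only the standard library. The base URL, API key, and model id below are placeholders, not real endpoints.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com/v1"  # placeholder; use your Inference API base URL
API_KEY = "YOUR_API_KEY"                 # placeholder; same key as the rest of the API

def chat_request(model: str, messages: list) -> urllib.request.Request:
    """Build a chat completions request; sending it is left to the caller."""
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps({"model": model, "messages": messages}).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = chat_request("my-finetuned-model", [{"role": "user", "content": "Hello"}])
# urllib.request.urlopen(req)  # would send the request to a live deployment
```

Any OpenAI-compatible client library works the same way; only the base URL and model id point at your deployment.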

Two paths

  • Self-serve: Single dedicated GPU. Deploy in a few clicks. Great for validation and early production.
  • Managed: Multi-GPU provisioning sized to your traffic. Talk to the team for production-scale serving.
Self-serve is the default experience. Managed deployments are for when you need more than a single GPU.

What you can deploy today

  • Models trained on the Catalyst platform
  • Served via an OpenAI-compatible API (chat completions endpoint)
  • Same base URL and API key as the rest of the Inference API
Deploying off-the-shelf open source models and bringing your own already-trained models are coming soon. See Open Source Models for details.
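Since deployments are served through the standard chat completions endpoint, features like function calling keep the standard request shape. A hedged sketch with an illustrative tool definition and placeholder model id:

```python
import json

# The "tools" field follows the standard OpenAI function-calling schema.
# The tool itself (get_weather) and the model id are illustrative placeholders.
payload = {
    "model": "my-finetuned-model",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}
body = json.dumps(payload)
```

Structured outputs and other standard request options work the same way: build the request as you would for any OpenAI-compatible endpoint, pointing the model parameter at your deployment.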

Next steps

Deploy a trained model

Name it, click deploy, start serving.

Call your deployment

One line of code to switch over.

Manage and monitor

Monitor your deployment’s health and usage.

Scale to production

When you need more than a single GPU.