📍 TODO:MEDIA
Screenshot of the deployments dashboard showing a running deployment with status and endpoint info.
## Key concepts
| Concept | Description |
|---|---|
| Dedicated GPU | Your model runs on its own GPU. No shared infrastructure, no noisy neighbors. Compute is determined by the recipe used during training. |
| OpenAI-compatible API | Same base URL, same API key, just swap the model parameter. Structured outputs, function calling, and all standard API features work the same way. |
| Self-serve vs. managed | Self-serve gives you a single dedicated GPU you can deploy in a few clicks; managed provisions multiple GPUs sized to production traffic. The two are compared under "Two paths" below. |
| The improvement loop | Deploy → observe production performance → run evals to catch regressions → train the next version. The loop continues. |
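The "just swap the model parameter" point in the table above can be sketched as a request body: everything in the OpenAI-style request stays identical except the `model` field. Both model names below are illustrative placeholders, not real deployments.

```python
# Minimal sketch: switching traffic to a dedicated deployment only
# changes the `model` field of an OpenAI-style chat request.
# Both model names below are illustrative placeholders.

def build_chat_body(model: str, user_message: str) -> dict:
    """Build an OpenAI-compatible chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }

before = build_chat_body("some-base-model", "Hello!")
after = build_chat_body("my-deployed-model", "Hello!")

# Only the model name differs; the messages and schema are unchanged.
print(before["model"], "->", after["model"])
```

Because the schema is unchanged, structured outputs and function calling carry over to the new model without code changes.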
## Two paths
| Path | Best for |
|---|---|
| Self-serve | Single dedicated GPU. Deploy in a few clicks. Great for validation and early production. |
| Managed | Multi-GPU provisioning sized to your traffic. Talk to the team for production-scale serving. |
## What you can deploy today
- Models trained on the Catalyst platform
- Served via an OpenAI-compatible API (chat completions endpoint)
- Same base URL and API key as the rest of the Inference API
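As a concrete sketch of the bullets above, here is what a raw call to the chat completions endpoint looks like using only the Python standard library. The base URL, API key, and model name are placeholders; substitute the values shown on your deployment's endpoint info.

```python
import json
import urllib.request

# Sketch of a raw chat-completions call using only the standard library.
# The base URL, API key, and model name below are placeholders.

def build_chat_request(base_url: str, api_key: str,
                       model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request("https://api.example.com/v1", "YOUR_API_KEY",
                         "my-deployed-model", "Hello!")
print(req.full_url)

# To actually send it (requires a live endpoint):
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)
```

In practice you would use an OpenAI-compatible client library and pass the same base URL and API key; the raw request is shown only to make the endpoint shape explicit.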
## Next steps

- **Deploy a trained model.** Name it, click deploy, start serving.
- **Call your deployment.** One line of code to switch over.
- **Manage and monitor.** Track your deployment's health and usage.
- **Scale to production.** When you need more than a single GPU.