In the dashboard, this workflow lives under Deployments.

What you choose when creating a deployment

The self-serve deployment flow starts with a few core decisions:
  • name - the human-readable deployment name
  • model - the model or model reference you want to serve
  • speed target - a tradeoff between higher throughput and a lighter capacity footprint
  • instance count - the initial serving footprint
Depending on your account, you may also see more advanced deployment settings.

Exact fields the dashboard submits

The current dashboard create flow in inference/apps/web/src/pages/dashboard/deployments/NewDeployment.page.tsx submits fields like:
  • name - Human-readable deployment name
  • teamId - Owning team
  • modelId - Selected model
  • desiredInstances - Initial instance count
  • desiredTokensPerSecond - Throughput target derived from the speed slider
  • publicModelIdentifier - Public deployment ID, if overridden or auto-generated
  • lockedConfigId / lockedConfigVersion - Optional locked engine config
  • requirements.cards - Optional GPU requirements
  • flagOverrides - Optional runtime flag overrides
  • environmentVariableOverrides - Optional environment variable overrides
  • configFileOverrides - Optional config file overrides
  • isServerlessDeployment - Optional public serverless toggle
  • serverlessCostPerMillionIn / serverlessCostPerMillionOut - Optional public serverless pricing
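The field list above can be captured as a type sketch. The field names come from the create flow; the concrete TypeScript types are assumptions for illustration, not the dashboard's actual definitions.

```typescript
// Type sketch of the create-deployment payload. Field names mirror the
// list above; the value types (string, number, Record) are assumptions.
interface CreateDeploymentRequest {
  name: string;
  teamId: string;
  modelId: string;
  desiredInstances: number;
  desiredTokensPerSecond: number;
  publicModelIdentifier?: string;
  lockedConfigId?: string;
  lockedConfigVersion?: number;
  requirements?: { cards?: string[] };
  flagOverrides?: Record<string, string>;
  environmentVariableOverrides?: Record<string, string>;
  configFileOverrides?: Record<string, string>;
  isServerlessDeployment?: boolean;
  serverlessCostPerMillionIn?: number;
  serverlessCostPerMillionOut?: number;
}

// Only the first five fields are required in this sketch; everything
// else maps to the optional sections described above.
const example: CreateDeploymentRequest = {
  name: "my-production-deployment",
  teamId: "team_123",
  modelId: "google/gemma-3-27b-it",
  desiredInstances: 1,
  desiredTokensPerSecond: 125,
};
```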
The current speed slider maps to an approximate desiredTokensPerSecond range of 50 to 200.
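As an illustration, the slider-to-throughput mapping could look like the sketch below. Only the 50 to 200 range comes from the create flow; the linear interpolation and the `sliderToTokensPerSecond` helper name are assumptions.

```typescript
// Hypothetical helper: linearly interpolate a 0-100 slider position
// into the 50-200 desiredTokensPerSecond range mentioned above.
// The linear mapping is an assumption; only the range is documented.
function sliderToTokensPerSecond(sliderPercent: number): number {
  const min = 50;
  const max = 200;
  // Clamp out-of-range slider values before interpolating.
  const clamped = Math.min(100, Math.max(0, sliderPercent));
  return Math.round(min + (clamped / 100) * (max - min));
}

sliderToTokensPerSecond(50); // → 125, the range midpoint
```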

Source-backed sample values

This sample mirrors the exact fields and value shapes used by the dashboard create flow.
{
  "name": "my-production-deployment",
  "teamId": "team_123",
  "modelId": "google/gemma-3-27b-it",
  "desiredInstances": 1,
  "desiredTokensPerSecond": 125,
  "publicModelIdentifier": "your-team/my-production-deployment-a1b2c3"
}
The public identifier format comes from DeploymentMetadata.tsx and defaults to teamSlug/name-randomId. Advanced metadata, locked config, GPU requirements, and override sections currently come from the superadmin-only advanced configuration UI.
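The default identifier shape can be sketched as follows. The teamSlug/name-randomId format is stated above; the slugification and random-suffix details are assumptions for illustration, not the code in DeploymentMetadata.tsx.

```typescript
// Sketch of the default public identifier, teamSlug/name-randomId.
// The slugify and random-suffix logic here are illustrative assumptions.
function defaultPublicModelIdentifier(teamSlug: string, name: string): string {
  // Lowercase the name and collapse non-alphanumeric runs into hyphens.
  const slugName = name.toLowerCase().replace(/[^a-z0-9]+/g, "-");
  // Short base-36 suffix, e.g. "a1b2c3" (generation scheme assumed).
  const randomId = Math.random().toString(36).slice(2, 8);
  return `${teamSlug}/${slugName}-${randomId}`;
}

defaultPublicModelIdentifier("your-team", "my-production-deployment");
// e.g. "your-team/my-production-deployment-a1b2c3"
```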

Suggested workflow

  1. create a clear deployment name
  2. choose the model you want to serve
  3. set the initial speed and instance count
  4. create the deployment
  5. open the deployment overview and verify the public model identifier and status

After creation

Once the deployment exists, you can use the detail tabs to inspect:
  • Overview for deployment info and API usage
  • Instances for runtime capacity and instance status
  • Inferences for recent traffic served by the deployment
  • Settings for administrative actions like deletion

A practical starting point

Start smaller, validate the workload, and then scale instance count once you have real traffic and latency data.

Need help?

If you want help planning deployment topology, scaling, or rollout strategy, meet with our team.