Most deployments are private: only the owning team can call them, and they are billed by GPU capacity. The Inference team can also enable a deployment for serverless use, which makes it callable by every account on the platform and billed per token, or offered for free.

Calling a serverless deployment

Serverless deployments work exactly like any other model on the OpenAI-compatible API. Pass the deployment’s model path as the model field of the request:
curl https://api.inference.net/v1/chat/completions \
  -H "Authorization: Bearer $INFERENCE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "inference-net/example-model",
    "messages": [{"role": "user", "content": "Hello, world!"}]
  }'
Streaming, structured outputs, and the other chat completions features work the same as on catalog models.
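The same request can be assembled from Python's standard library. The following is a minimal sketch rather than an official client: the helper name build_chat_request is hypothetical, and the model path is the placeholder from the curl example above. The function only constructs the request and does not send it.

```python
import json
import urllib.request

API_URL = "https://api.inference.net/v1/chat/completions"


def build_chat_request(api_key: str, model_path: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat completions request for a deployment.

    build_chat_request is a hypothetical helper name used for illustration.
    """
    payload = {
        # The deployment's model path goes in the "model" field.
        "model": model_path,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send the request, pass the returned object to urllib.request.urlopen and parse the response body with json.load. Reading the key from the INFERENCE_API_KEY environment variable, as the curl example does, keeps it out of source code.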

Billing

Serverless deployments are priced in USD per 1M tokens, with separate input and output rates set per deployment. Usage is billed to the calling team’s credit balance like any other serverless inference:
  • Requests are authorized against your credit balance up front; if the balance can’t cover the estimated cost, the API responds with HTTP 402 (Payment Required).
  • The actual charge is settled when the inference completes, from the real token counts reported by the serving engine.
  • Failed inferences are never billed.
  • Charges appear in your usage dashboard under the deployment’s model path.
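The per-token arithmetic described above can be sketched as follows. This is an illustration under stated assumptions, not the platform's billing code: the function names are hypothetical, the rates in the usage note are made-up values, and how the platform computes its up-front cost estimate is not documented here, so the balance check simply takes an estimate as input.

```python
from decimal import Decimal

# Rates are quoted in USD per 1M tokens.
TOKENS_PER_UNIT = Decimal(1_000_000)


def token_cost_usd(input_tokens: int, output_tokens: int,
                   input_rate: Decimal, output_rate: Decimal,
                   succeeded: bool = True) -> Decimal:
    """Return the charge for one inference, given per-1M-token rates.

    Hypothetical helper: input and output tokens are priced separately,
    and a failed inference costs nothing.
    """
    if not succeeded:
        return Decimal(0)
    input_cost = Decimal(input_tokens) * input_rate / TOKENS_PER_UNIT
    output_cost = Decimal(output_tokens) * output_rate / TOKENS_PER_UNIT
    return input_cost + output_cost


def passes_preauthorization(balance: Decimal, estimated_cost: Decimal) -> bool:
    """Mirror the up-front check: a False result corresponds to a 402 response."""
    return balance >= estimated_cost
```

With hypothetical rates of 0.50 USD (input) and 1.50 USD (output) per 1M tokens, a request that uses 12,000 input tokens and 3,000 output tokens costs 0.006 + 0.0045 = 0.0105 USD. Decimal is used instead of float so that small per-request charges add up without rounding drift.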

Free deployments

A serverless deployment with no prices set is public and free: anyone on the platform can call it and no credits are charged or required. Free deployments still count against your standard serverless rate limits.

Limits

Serverless deployment requests share your team’s serverless inference rate limits. Context-window limits are enforced by the deployment’s engine rather than the platform catalog, so an oversized prompt is rejected by the engine serving the model, not by a platform-level check.
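A client that shares rate limits with the rest of its team usually benefits from retrying with backoff. The sketch below rests on assumptions that this page does not confirm: that a rate-limited request is signalled with HTTP 429 (the standard Too Many Requests status), and that other client errors, such as 402 or an oversized prompt rejected by the engine, should not be retried. Both function names are hypothetical.

```python
from typing import List

# Assumption: rate limiting is reported as HTTP 429. Other 4xx responses
# (for example 402, or a rejected oversized prompt) are treated as permanent.
RETRYABLE_STATUS = {429}


def should_retry(status: int) -> bool:
    """Return True only for statuses where retrying later might succeed."""
    return status in RETRYABLE_STATUS


def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 30.0) -> List[float]:
    """Exponential backoff schedule in seconds: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]
```

A caller would sleep for each delay in turn while should_retry returns True for the response status, and give up once the schedule is exhausted. Retrying an oversized prompt unchanged cannot succeed, so the prompt should be shortened instead.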