# Serverless Deployments

> Platform-wide deployments billed per token, or offered for free.

Most deployments are private: only the owning team can call them, and they are
billed by GPU capacity. The Inference team can also make a deployment
**serverless-enabled**, which makes it callable by every account on the
platform and either billed per token or offered for free.

## Calling a serverless deployment

Serverless deployments work exactly like any other model on the
OpenAI-compatible API. Pass the deployment's model path in the `model` field:

```bash theme={"system"}
curl https://api.inference.net/v1/chat/completions \
  -H "Authorization: Bearer $INFERENCE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "inference-net/example-model",
    "messages": [{"role": "user", "content": "Hello, world!"}]
  }'
```

Streaming, structured outputs, and the other chat completions features work
the same as on catalog models.
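
As a minimal sketch of the streaming case, the request below adds the standard OpenAI-compatible `"stream": true` flag to the same call. The helper names `build_stream_payload` and `stream_chat` are hypothetical, and the model path is the placeholder used above, not a real deployment.

```bash
# Build a chat completions body with the OpenAI-compatible "stream" flag set.
# Plain printf does no JSON escaping, so keep quotes and backslashes out of the
# arguments (or build the body with a JSON tool instead).
build_stream_payload() {
  local model="$1" prompt="$2"
  printf '{"model": "%s", "stream": true, "messages": [{"role": "user", "content": "%s"}]}' \
    "$model" "$prompt"
}

# Send the request. -N turns off curl's output buffering so chunks are printed
# as they arrive; -s hides the progress meter.
stream_chat() {
  curl -sN https://api.inference.net/v1/chat/completions \
    -H "Authorization: Bearer $INFERENCE_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$(build_stream_payload "$1" "$2")"
}
```

Calling `stream_chat "inference-net/example-model" "Hello, world!"` should then print the response incrementally instead of as a single JSON document, assuming the deployment streams the way catalog models do.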

## Billing

Serverless deployments are priced in **USD per 1M tokens**, with separate
input and output rates set per deployment. Usage is billed to the calling
team's credit balance like any other serverless inference:

* Requests are authorized against your credit balance up front; if the
  balance can't cover the estimated cost, the API responds with HTTP `402`.
* The actual charge is settled when the inference completes, from the real
  token counts reported by the serving engine.
* Failed inferences are never billed.
* Charges appear in your usage dashboard under the deployment's model path.
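
To make the per-1M-token pricing concrete, here is a small sketch of the arithmetic: each token count is multiplied by its rate and the sum is divided by one million. The function name `estimate_cost_usd` is hypothetical and the rates in the example call are invented for illustration; real rates are set per deployment.

```bash
# Estimate the USD charge for one request.
# Arguments: input_tokens output_tokens input_rate output_rate
# Rates are in USD per 1M tokens.
estimate_cost_usd() {
  awk -v in_tok="$1" -v out_tok="$2" -v in_rate="$3" -v out_rate="$4" \
    'BEGIN { printf "%.6f\n", (in_tok * in_rate + out_tok * out_rate) / 1000000 }'
}

# 1,200 input tokens at $0.50/1M plus 300 output tokens at $1.50/1M (made-up rates)
estimate_cost_usd 1200 300 0.50 1.50   # prints 0.001050
```

This only approximates what a request would cost: the settled charge is based on the token counts the serving engine reports, which are not known until the inference completes.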

## Free deployments

A serverless deployment with **no prices set is public and free**: anyone on
the platform can call it and no credits are charged or required. Free
deployments still count against your standard serverless rate limits.

## Limits

Serverless deployment requests share your team's serverless inference rate
limits. Context-window limits are enforced by the deployment's engine rather
than the platform catalog, so an oversized prompt is rejected by the serving
engine, not by a platform-side check.
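
Because a request can be turned away by the credit check, the rate limiter, or the engine, it can help to surface the HTTP status next to the response body. The sketch below is one way to do that with curl; `explain_status` and `call_chat` are hypothetical helper names. Only `402` is documented on this page, so every other non-2xx status is reported generically rather than guessed at.

```bash
# Map an HTTP status code to a short hint.
# Assumption: only 402 has a documented meaning here; other codes are passed through.
explain_status() {
  case "$1" in
    2??) echo "ok" ;;
    402) echo "insufficient credits" ;;
    *)   echo "request rejected (HTTP $1)" ;;
  esac
}

# Send a chat completions body ($1). The status hint goes to stderr and the
# response body to stdout, so the body can still be piped to another tool.
call_chat() {
  local body_file status
  body_file="$(mktemp)"
  # -o writes the body to a file; -w prints only the status code to stdout.
  status="$(curl -s -o "$body_file" -w '%{http_code}' \
    https://api.inference.net/v1/chat/completions \
    -H "Authorization: Bearer $INFERENCE_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$1")"
  echo "status $status: $(explain_status "$status")" >&2
  cat "$body_file"
  rm -f "$body_file"
}
```

For an oversized prompt, the response body printed by `call_chat` is where the engine's own error message would be expected to appear.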
