Serverless Deployments

Most models on Inference.net are already serverless: the open-source and closed-source models in the catalog are callable by any account and billed per token. See the API quickstart for how to call them. This page covers serverless deployments, which bring the same access and billing model to deployments. Most deployments are private: only the owning team can call them, and they are billed by GPU capacity. The Inference team can also serverless-enable a deployment, which makes it callable by every account on the platform and either billed per token or offered for free.

Calling a serverless deployment

Serverless deployments work exactly like any other model on the OpenAI-compatible API. Pass the deployment’s model path as the model parameter. The same request is shown below in TypeScript, Python, Rust, Go, and curl:

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.inference.net/v1",
  apiKey: process.env.INFERENCE_API_KEY,
});

const response = await client.chat.completions.create({
  model: "inference-net/example-model",
  messages: [{ role: "user", content: "Hello, world!" }],
});

console.log(response.choices[0].message.content);

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inference.net/v1",
    api_key=os.environ["INFERENCE_API_KEY"],
)

response = client.chat.completions.create(
    model="inference-net/example-model",
    messages=[{"role": "user", "content": "Hello, world!"}],
)

print(response.choices[0].message.content)

use serde_json::{json, Value};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let api_key = std::env::var("INFERENCE_API_KEY")?;

    let response: Value = reqwest::Client::new()
        .post("https://api.inference.net/v1/chat/completions")
        .bearer_auth(api_key)
        .json(&json!({
            "model": "inference-net/example-model",
            "messages": [{"role": "user", "content": "Hello, world!"}]
        }))
        .send()
        .await?
        .json()
        .await?;

    println!(
        "{}",
        response["choices"][0]["message"]["content"]
            .as_str()
            .unwrap_or_default()
    );
    Ok(())
}

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

func main() {
	body, _ := json.Marshal(map[string]any{
		"model": "inference-net/example-model",
		"messages": []map[string]string{
			{"role": "user", "content": "Hello, world!"},
		},
	})

	req, _ := http.NewRequest("POST", "https://api.inference.net/v1/chat/completions", bytes.NewReader(body))
	req.Header.Set("Authorization", "Bearer "+os.Getenv("INFERENCE_API_KEY"))
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var result struct {
		Choices []struct {
			Message struct {
				Content string `json:"content"`
			} `json:"message"`
		} `json:"choices"`
	}
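	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		panic(err)
	}
	// Guard against an empty choices array (for example, on an error response).
	if len(result.Choices) == 0 {
		panic("no choices in response: " + resp.Status)
	}
	fmt.Println(result.Choices[0].Message.Content)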
}

curl https://api.inference.net/v1/chat/completions \
  -H "Authorization: Bearer $INFERENCE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "inference-net/example-model",
    "messages": [{"role": "user", "content": "Hello, world!"}]
  }'

Streaming, structured outputs, and the other chat completions features work the same as on catalog models.
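As an illustration of the streaming case, here is a minimal Python sketch. It assumes the deployment streams chunks in the usual OpenAI-compatible shape (text arriving in `choices[0].delta.content`); the helper names `collect_stream` and `stream_reply` are hypothetical and not part of any SDK.

```python
from typing import Any, Iterable


def collect_stream(chunks: Iterable[Any]) -> str:
    """Concatenate the text deltas from a stream of chat completion chunks."""
    parts = []
    for chunk in chunks:
        # Some chunks (for example a trailing usage chunk) may carry no choices.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        # delta.content can be None on role-only or final chunks.
        if delta:
            parts.append(delta)
    return "".join(parts)


def stream_reply(client: Any, model: str, prompt: str) -> str:
    """Request a streamed completion and return the assembled text.

    `client` is assumed to be an OpenAI SDK client configured as in the
    Python example above.
    """
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # ask the API to send incremental chunks
    )
    return collect_stream(stream)
```

With the client from the Python example, `stream_reply(client, "inference-net/example-model", "Hello, world!")` would return the full reply once the stream ends. To display text as it arrives, print each delta inside the loop instead of collecting it.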

Billing

Serverless deployments are priced in USD per 1M tokens, with separate input and output rates set per deployment. Usage is billed to the calling team’s credit balance like any other serverless inference:

Requests are authorized against your credit balance up front; if the balance can’t cover the estimated cost, the API responds with 402.
The actual charge is settled when the inference completes, from the real token counts reported by the serving engine.
Failed inferences are never billed.
Charges appear in your usage dashboard under the deployment’s model path.
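The per-token pricing above reduces to simple arithmetic, sketched below in Python. The function name `token_cost_usd` and the rates in the usage note are made up for illustration; real rates are set per deployment, and the platform's up-front estimate may be computed differently from this settled cost.

```python
def token_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_usd_per_million: float,
    output_usd_per_million: float,
) -> float:
    """Cost of one request from its token counts and per-1M-token rates."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    # Input and output tokens are priced separately, each per 1M tokens.
    total = (
        input_tokens * input_usd_per_million
        + output_tokens * output_usd_per_million
    )
    return total / 1_000_000
```

For example, at hypothetical rates of $0.50 per 1M input tokens and $1.50 per 1M output tokens, a request with 1,200 input tokens and 300 output tokens costs (1,200 × 0.50 + 300 × 1.50) / 1,000,000 = $0.00105. If the response includes an OpenAI-style `usage` object, its `prompt_tokens` and `completion_tokens` fields would be the natural inputs here.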

Free deployments

A serverless deployment with no prices set is public and free: anyone on the platform can call it and no credits are charged or required. Free deployments still count against your standard serverless rate limits.

Limits

Serverless deployment requests share your team’s serverless inference rate limits. Context-window limits are enforced by the deployment’s engine rather than the platform catalog, so an oversized prompt is rejected by the model itself.
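Because these requests count against shared rate limits, a client may want to retry rate-limited calls with backoff. The Python sketch below is one way to do that and makes several assumptions: `RateLimited` is a hypothetical exception that your own request function raises when the API signals a rate limit (commonly HTTP 429, though this page does not specify the status code), and the retry count and delays are arbitrary defaults.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical marker: raised by the caller's request function
    when the API reports that a rate limit was hit."""


def call_with_backoff(
    request: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `request`, retrying on RateLimited with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Double the delay ceiling each attempt and apply jitter
            # (a random factor between 0.5 and 1.0) to spread out retries.
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.0)
            sleep(delay)
    raise RuntimeError("unreachable")
```

The `sleep` parameter exists only so the delay can be replaced in tests. Errors that retrying cannot fix, such as an oversized prompt rejected by the engine, should not be mapped to `RateLimited`.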
