Background jobs are the right path when you want the normal inference request shape, but you do not need the response immediately.

How they work

  • send the request to the asynchronous path instead of the normal realtime path
  • get a generation identifier back immediately
  • poll for the result later, or use a webhook

Source-backed curl example

The request shape below matches the async test helpers in inference/apps/relay/tests/e2e/utils/inference-api.ts and the route mounted at /v1/slow/chat/completions.
curl https://api.inference.net/v1/slow/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $INFERENCE_API_KEY" \
  -d '{
    "model": "meta-llama/llama-3.2-3b-instruct/fp-16",
    "messages": [
      {"role": "system", "content": "You do exactly what the user asks."},
      {"role": "user", "content": "Respond with a single word of your choice and nothing else."}
    ],
    "max_tokens": 10,
    "stream": false
  }'
The async response includes an id. That is the generation identifier you use for polling.
curl https://api.inference.net/v1/generation/$GENERATION_ID \
  -H "Authorization: Bearer $INFERENCE_API_KEY"
Once the generation is complete, the response payload has the same GenerationForClient shape that completion webhooks deliver in their data field.
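The two curl calls above can be combined into a submit-then-poll flow. The following is a minimal Python sketch, not an official client. The URLs and headers are taken from the curl examples; the helper names (build_submit_request, build_poll_request, send, poll_until) and the backoff parameters are illustrative. This page does not say which field marks a generation as complete, so the is_done check is left to the caller.

```python
import json
import time
import urllib.request

BASE_URL = "https://api.inference.net/v1"


def build_submit_request(api_key, body):
    """Build the POST to the slow chat completions route (same shape as the curl above)."""
    return urllib.request.Request(
        f"{BASE_URL}/slow/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def build_poll_request(api_key, generation_id):
    """Build the GET that retrieves one generation by its identifier."""
    return urllib.request.Request(
        f"{BASE_URL}/generation/{generation_id}",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def send(request):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def poll_until(fetch, is_done, max_attempts=10, initial_delay=1.0,
               max_delay=30.0, sleep=time.sleep):
    """Call fetch() until is_done(payload) is true, doubling the wait after each miss.

    is_done is supplied by the caller because the completion field is an
    assumption, not something this page documents.
    """
    delay = initial_delay
    for attempt in range(max_attempts):
        payload = fetch()
        if is_done(payload):
            return payload
        if attempt < max_attempts - 1:
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise TimeoutError(f"generation not complete after {max_attempts} attempts")
```

A typical call would pass the id from the submit response into the poll request, for example poll_until(lambda: send(build_poll_request(key, submitted["id"])), is_done=my_check), where my_check is a placeholder for whatever completion test applies to your payloads. HTTP error responses surface as urllib.error.HTTPError and are not handled in this sketch.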

Canonical path

The canonical name for the background path on Inference.net is slow, as in /v1/slow/chat/completions. In some parts of the product and codebase, async appears as an alias for it.

Best fit

Use background jobs for:
  • cost-sensitive inference
  • non-interactive generation
  • longer-running requests
  • workflows that can finish later and notify your system with a webhook

Result retrieval

After submission, retrieve the completed result using the generation identifier. Background results preserve the original request and the final response, which makes them useful for later inspection and dataset curation.
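Because each retrieved result carries both the request and the response, one simple way to keep results for later inspection is to append them to a JSONL file. The sketch below is one possible approach, not a documented feature. It stores the payload exactly as returned and assumes nothing about its fields; the function names and the file layout are illustrative.

```python
import json
from pathlib import Path


def append_generation(path, generation):
    """Append one retrieved generation payload to a JSONL file, one object per line."""
    with Path(path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(generation, ensure_ascii=False) + "\n")


def load_generations(path):
    """Read the archived payloads back as a list of dictionaries."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Keeping the payload unmodified means any later filtering for dataset curation can be done on the archive without another round trip to the API.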

Webhooks

For async flows, webhooks are usually better than tight polling loops. Relevant events include:
  • generation.completed
  • async-embedding.completed
To receive a webhook instead of polling, include a webhook_id in the request metadata:
{
  "model": "meta-llama/llama-3.2-3b-instruct/fp-16",
  "messages": [{"role": "user", "content": "Hello, webhook test!"}],
  "stream": false,
  "n": 1,
  "temperature": 1,
  "top_p": 1,
  "metadata": {
    "webhook_id": "wh_123"
  }
}
See /reference/webhooks.
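If you build request bodies in code, attaching the webhook can be a small helper. This is a minimal sketch that assumes only the metadata.webhook_id field shown above; the with_webhook name is hypothetical, and wh_123 is the placeholder identifier from the example.

```python
import copy


def with_webhook(body, webhook_id):
    """Return a copy of a request body with metadata.webhook_id set.

    Existing metadata keys are preserved and the input body is not mutated.
    """
    updated = copy.deepcopy(body)
    metadata = updated.setdefault("metadata", {})
    metadata["webhook_id"] = webhook_id
    return updated
```

Returning a copy lets the same base body be reused for requests with and without a webhook.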

When to use another mode instead