Learn how to use our OpenAI-compatible Asynchronous Inference API to send individual inference requests that complete within 24-72 hours at reduced costs. Simply use /v1/slow instead of /v1/ in your API calls to access this feature.

Background inference is cheaper and easier to build with when your application isn't serving real-time inference.

The Asynchronous Inference API is compatible with all the models we offer.

Webhook support is only available for /chat/completions calls. Support for /completions will come later.

Overview

The Asynchronous Inference API provides a simple way to make cost-effective inference requests when immediate responses aren’t required. By using the /v1/slow prefix instead of /v1/, you can:

  1. Get immediate request IDs: Your API call returns instantly with a unique ID.
  2. Save on costs: Enjoy significantly cheaper pricing compared to synchronous requests.
  3. Complete on a flexible timeline: Requests finish within 24-72 hours.
  4. Keep the same familiar API: Uses the exact same request format as our standard endpoints.

This API is perfect for use cases like:

  • Large-scale content generation
  • Batch document processing
  • Non-urgent data analysis
  • Cost-sensitive workloads
  • Background processing tasks

Getting Started

Using the Asynchronous Inference API is as simple as changing your base URL from /v1/ to /v1/slow/. The API maintains full compatibility with the OpenAI SDK.

import OpenAI from "openai";

const openai = new OpenAI({
    baseURL: "https://api.inference.net/v1/slow", // Note the /v1/slow prefix
    apiKey: process.env.INFERENCE_API_KEY,
});

Making Asynchronous Requests

1. Submit a Request

Make requests exactly as you would with the standard API, but responses will include a request ID instead of the completion result:

const response = await openai.chat.completions.create({
  model: "meta-llama/llama-3.2-1b-instruct/fp-8",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "What is the capital of France?" }
  ],
  max_tokens: 1000
});

console.log(response.id); // Returns immediately with request ID

The initial response will include a unique request ID:

{
  "id": "N2mZQjrvh-k_m8nMMN7Jn",
  "choices": [],
  "created": 1749061362809,
  "model": "meta-llama/llama-3.1-8b-instruct/fp-16",
  "object": "chat.completion"
}

2. Retrieve Results

Once your request is processed, retrieve the results using the generation endpoint:

curl https://api.inference.net/v1/generation/N2mZQjrvh-k_m8nMMN7Jn \
  -H "Authorization: Bearer $INFERENCE_API_KEY"

The completed response includes both the original request and the generation result:

{
  "request": {
    "messages": [
      {"content": "You are a helpful assistant.", "role": "system"},
      {"content": "What is the meaning of life?", "role": "user"}
    ],
    "model": "meta-llama/llama-3.1-8b-instruct/fp-16",
    "stream": false,
    "max_tokens": 8,
    "metadata": {"webhook_id": "mPufxRcrw"}
  },
  "response": {
    "id": "N2mZQjrvh-k_m8nMMN7Jn",
    "object": "chat.completion",
    "created": 1749061362,
    "model": "meta-llama/llama-3.1-8b-instruct/fp-16",
    "choices": [
      {
        "index": 0,
        "message": {
          "role": "assistant",
          "content": "The meaning of life is a complex and",
          "reasoning_content": null,
          "tool_calls": null
        },
        "logprobs": null,
        "finish_reason": "length",
        "matched_stop": null
      }
    ],
    "usage": {
      "prompt_tokens": 48,
      "total_tokens": 56,
      "completion_tokens": 8,
      "prompt_tokens_details": null
    },
    "system_fingerprint": ""
  },
  "state": "Success",
  "stateMessage": "Generation successful",
  "dispatchedAt": "2025-06-04T18:22:42.807Z",
  "finishedAt": "2025-06-04T18:22:42.912Z"
}

Request States

Asynchronous requests can have the following states:

  • Queued: Request received and queued for processing
  • In Progress: Request is currently being processed
  • Success: Request completed successfully, results available
  • Failed: Request failed due to an error
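
Webhooks are the recommended way to learn about completion (see Best Practices below), but if you do need to poll, a simple loop against the generation endpoint can wait for a terminal state. This is a minimal sketch assuming the state values above; the polling interval is an arbitrary placeholder you should tune for your workload:

// Poll the generation endpoint until the request reaches a terminal state.
// Assumes INFERENCE_API_KEY is set; the default interval is an arbitrary choice.
async function waitForGeneration(requestId, intervalMs = 60_000) {
  while (true) {
    const res = await fetch(`https://api.inference.net/v1/generation/${requestId}`, {
      headers: { Authorization: `Bearer ${process.env.INFERENCE_API_KEY}` },
    });
    const generation = await res.json();

    if (generation.state === "Success") return generation.response;
    if (generation.state === "Failed") {
      throw new Error(`Generation failed: ${generation.stateMessage}`);
    }

    // Still Queued or In Progress -- wait before checking again.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}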

Best Practices

  1. Store Request IDs: Always save the returned request ID for later retrieval (see the sketch after this list).
  2. Use Webhooks: Instead of polling, set up webhooks for real-time notifications when requests complete. See our Getting Started with Webhooks guide.
  3. Handle Failures: Have a fallback plan for requests that fail during processing.
  4. Batch When Possible: For multiple requests, consider using our Batch API for better organization.
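
As an illustration of the first point, one pattern is to persist the returned request ID alongside your own job metadata at submission time, so the result can be matched up later whether it arrives via a webhook or a poll. The in-memory Map and job fields below are hypothetical placeholders for whatever store your application uses:

// Hypothetical job store -- swap in your own database or queue.
const pendingJobs = new Map();

async function submitSlowJob(documentId, prompt) {
  const response = await openai.chat.completions.create({
    model: "meta-llama/llama-3.2-1b-instruct/fp-8",
    messages: [{ role: "user", content: prompt }],
  });

  // Save the request ID so the result can be retrieved (or matched to a
  // webhook delivery) once processing finishes, possibly days later.
  pendingJobs.set(response.id, { documentId, submittedAt: new Date() });
  return response.id;
}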

Supported Endpoints

The Asynchronous Inference API supports the following endpoints:

  • /v1/slow/chat/completions
  • /v1/slow/completions

Simply replace /v1/ with /v1/slow/ in your existing code to use asynchronous processing.
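
For example, a /v1/slow/completions call looks the same through the SDK as a synchronous one. This sketch reuses the client from Getting Started; the prompt and max_tokens are chosen purely for illustration:

// Legacy completions endpoint via the same client -- only the base URL differs.
const completion = await openai.completions.create({
  model: "meta-llama/llama-3.2-1b-instruct/fp-8",
  prompt: "Summarize the French Revolution in one sentence.",
  max_tokens: 200,
});

console.log(completion.id); // Returns immediately with a request ID, as above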

Pricing and Limits

  • Pricing: Significantly reduced compared to synchronous requests (contact sales for specific rates)
  • Completion Time: 24-72 hours
  • Rate Limits: More generous than synchronous endpoints
  • Request Expiration: Requests expire after 72 hours if not completed

For specific pricing information and higher rate limits, please contact [email protected].