Learn how to use our OpenAI-compatible Asynchronous Inference API to send individual inference requests that complete within 24-72 hours at reduced costs. Simply use /v1/slow instead of /v1/ in your API calls to access this feature.

Background inference is cheaper and easier to build with when your application isn't serving real-time inference.

The Asynchronous Inference API is compatible with all the models we offer.

Webhook support is only available for /chat/completions calls. Support for /completions will come later.

Overview

The Asynchronous Inference API provides a simple way to make cost-effective inference requests when immediate responses aren’t required. By using the /v1/slow prefix instead of /v1/, you can:

  1. Get immediate request IDs: Your API call returns instantly with a unique ID.
  2. Save on costs: Enjoy significantly cheaper pricing compared to synchronous requests.
  3. Complete on a flexible timeline: Requests finish within 24-72 hours.
  4. Keep the same familiar API: Uses the exact same request format as our standard endpoints.

This API is perfect for use cases like:

  • Large-scale content generation
  • Batch document processing
  • Non-urgent data analysis
  • Cost-sensitive workloads
  • Background processing tasks

Getting Started

Using the Asynchronous Inference API is as simple as changing your base URL from /v1/ to /v1/slow/. The API maintains full compatibility with the OpenAI SDK.

import OpenAI from "openai";

const openai = new OpenAI({
    baseURL: "https://api.inference.net/v1/slow", // Note the /v1/slow prefix
    apiKey: process.env.INFERENCE_API_KEY,
});

Making Asynchronous Requests

1. Submit a Request

Make requests exactly as you would with the standard API, but responses will include a request ID instead of the completion result:

const response = await openai.chat.completions.create({
  model: "meta-llama/llama-3.2-1b-instruct/fp-8",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "What is the capital of France?" }
  ],
  max_tokens: 1000
});

console.log(response.id); // Returns immediately with request ID

The initial response will include a unique request ID:

{
  "id": "N2mZQjrvh-k_m8nMMN7Jn",
  "choices": [],
  "created": 1749061362809,
  "model": "meta-llama/llama-3.1-8b-instruct/fp-16",
  "object": "chat.completion"
}

2. Retrieve Results

Once your request is processed, retrieve the results using the generation endpoint:

curl https://api.inference.net/v1/generation/N2mZQjrvh-k_m8nMMN7Jn \
  -H "Authorization: Bearer $INFERENCE_API_KEY"

The completed response includes both the original request and the generation result:

{
  "request": {
    "messages": [
      {"content": "You are a helpful assistant.", "role": "system"},
      {"content": "What is the meaning of life?", "role": "user"}
    ],
    "model": "meta-llama/llama-3.1-8b-instruct/fp-16",
    "stream": false,
    "max_tokens": 8,
    "metadata": {"webhook_id": "mPufxRcrw"}
  },
  "response": {
    "id": "N2mZQjrvh-k_m8nMMN7Jn",
    "object": "chat.completion",
    "created": 1749061362,
    "model": "meta-llama/llama-3.1-8b-instruct/fp-16",
    "choices": [
      {
        "index": 0,
        "message": {
          "role": "assistant",
          "content": "The meaning of life is a complex and",
          "reasoning_content": null,
          "tool_calls": null
        },
        "logprobs": null,
        "finish_reason": "length",
        "matched_stop": null
      }
    ],
    "usage": {
      "prompt_tokens": 48,
      "total_tokens": 56,
      "completion_tokens": 8,
      "prompt_tokens_details": null
    },
    "system_fingerprint": ""
  },
  "state": "Success",
  "stateMessage": "Generation successful",
  "dispatchedAt": "2025-06-04T18:22:42.807Z",
  "finishedAt": "2025-06-04T18:22:42.912Z"
}

Request States

Asynchronous requests can have the following states:

  • Queued: Request received and queued for processing
  • In Progress: Request is currently being processed
  • Success: Request completed successfully, results available
  • Failed: Request failed due to an error
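
Webhooks are the recommended way to learn about completion (see Best Practices below), but if you do need to poll, a simple loop against the generation endpoint can wait for a terminal state. This is a minimal sketch assuming the state values above; the polling interval is an arbitrary placeholder you should tune for your workload:

// Poll the generation endpoint until the request reaches a terminal state.
// Assumes INFERENCE_API_KEY is set; the default interval is an arbitrary choice.
async function waitForGeneration(requestId, intervalMs = 60_000) {
  while (true) {
    const res = await fetch(`https://api.inference.net/v1/generation/${requestId}`, {
      headers: { Authorization: `Bearer ${process.env.INFERENCE_API_KEY}` },
    });
    const generation = await res.json();

    if (generation.state === "Success") return generation.response;
    if (generation.state === "Failed") {
      throw new Error(`Generation failed: ${generation.stateMessage}`);
    }

    // Still Queued or In Progress -- wait before checking again.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}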

Best Practices

  1. Store Request IDs: Always save the returned request ID for later retrieval (see the sketch after this list).
  2. Use Webhooks: Instead of polling, set up webhooks for real-time notifications when requests complete. See our Getting Started with Webhooks guide.
  3. Handle Failures: Have a fallback plan for requests that fail during processing.
  4. Batch When Possible: For multiple requests, consider using our Batch API for better organization.
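
As an illustration of the first point, one pattern is to persist the returned request ID alongside your own job metadata at submission time, so the result can be matched up later whether it arrives via a webhook or a poll. The in-memory Map and job fields below are hypothetical placeholders for whatever store your application uses:

// Hypothetical job store -- swap in your own database or queue.
const pendingJobs = new Map();

async function submitSlowJob(documentId, prompt) {
  const response = await openai.chat.completions.create({
    model: "meta-llama/llama-3.2-1b-instruct/fp-8",
    messages: [{ role: "user", content: prompt }],
  });

  // Save the request ID so the result can be retrieved (or matched to a
  // webhook delivery) once processing finishes, possibly days later.
  pendingJobs.set(response.id, { documentId, submittedAt: new Date() });
  return response.id;
}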

Supported Endpoints

The Asynchronous Inference API supports the following endpoints:

  • /v1/slow/chat/completions
  • /v1/slow/completions

Simply replace /v1/ with /v1/slow/ in your existing code to use asynchronous processing.
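
For example, a /v1/slow/completions call looks the same through the SDK as a synchronous one. This sketch reuses the client from Getting Started; the prompt and max_tokens are chosen purely for illustration:

// Legacy completions endpoint via the same client -- only the base URL differs.
const completion = await openai.completions.create({
  model: "meta-llama/llama-3.2-1b-instruct/fp-8",
  prompt: "Summarize the French Revolution in one sentence.",
  max_tokens: 200,
});

console.log(completion.id); // Returns immediately with a request ID, as above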

Pricing and Limits

  • Pricing: Significantly reduced compared to synchronous requests (contact sales for specific rates)
  • Completion Time: 24-72 hours
  • Rate Limits: More generous than synchronous endpoints
  • Request Expiration: Requests expire after 72 hours if not completed

For specific pricing information and higher rate limits, please contact [email protected].