# Overview

> Make cost-effective inference requests with flexible completion times.

Learn how to use our OpenAI-compatible Asynchronous Inference API to send individual inference requests that complete within 24-72 hours at reduced costs. Simply use `/v1/slow` instead of `/v1/` in your API calls to access this feature.

Background inference is cheaper and easier to build with when your application doesn't need real-time responses.

<Note>
  Asynchronous Inference API is compatible with all the [models](https://inference.net/models) we offer.
</Note>

<Warning>
  Webhook support is only available for `/chat/completions` calls. Support for `/completions` will come later.
</Warning>

## Overview

The Asynchronous Inference API provides a simple way to make cost-effective inference requests when immediate responses aren't required. By using the `/v1/slow` prefix instead of `/v1/`, you can:

1. **Get immediate request IDs:** Your API call returns instantly with a unique ID.
2. **Save on costs:** Enjoy significantly cheaper pricing compared to synchronous requests.
3. **Flexible completion:** Requests complete within 24-72 hours.
4. **Same familiar API:** Uses the exact same request format as our standard endpoints.

This API is perfect for use cases like:

* Large-scale content generation
* Batch document processing
* Non-urgent data analysis
* Cost-sensitive workloads
* Background processing tasks

## Getting Started

Using the Asynchronous Inference API is as simple as changing your base URL from `/v1/` to `/v1/slow/`. The API maintains full compatibility with the OpenAI SDK.

<CodeGroup>
  ```typescript TypeScript theme={"system"}
  import OpenAI from "openai";

  const client = new OpenAI({
    baseURL: "https://api.inference.net/v1/slow", // Note the /v1/slow prefix
    apiKey: process.env.INFERENCE_API_KEY,
  });
  ```

  ```python Python theme={"system"}
  import os
  from openai import OpenAI

  client = OpenAI(
      base_url="https://api.inference.net/v1/slow",  # Note the /v1/slow prefix
      api_key=os.environ["INFERENCE_API_KEY"],
  )
  ```

  ```bash cURL theme={"system"}
  # Use /v1/slow/ instead of /v1/ in the URL
  curl https://api.inference.net/v1/slow/chat/completions \
    -H "Authorization: Bearer $INFERENCE_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{ ... }'
  ```
</CodeGroup>

## Making Asynchronous Requests

### 1. Submit a Request

Make requests exactly as you would with the standard API. Instead of the completion result, the response contains a request ID that you use to retrieve the result later:

<CodeGroup>
  ```typescript TypeScript theme={"system"}
  const response = await client.chat.completions.create({
    model: "google/gemma-3-27b-instruct/bf-16",
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "What is the capital of France?" },
    ],
    max_tokens: 1000,
  });

  console.log(response.id); // Returns immediately with request ID
  ```

  ```python Python theme={"system"}
  response = client.chat.completions.create(
      model="google/gemma-3-27b-instruct/bf-16",
      messages=[
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "What is the capital of France?"},
      ],
      max_tokens=1000,
  )

  print(response.id)  # Returns immediately with request ID
  ```

  ```bash cURL theme={"system"}
  curl https://api.inference.net/v1/slow/chat/completions \
    -H "Authorization: Bearer $INFERENCE_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "google/gemma-3-27b-instruct/bf-16",
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
      ],
      "max_tokens": 1000
    }'
  ```
</CodeGroup>

The initial response will include a unique request ID:

```json JSON theme={"system"}
{
  "id": "N2mZQjrvh-k_m8nMMN7Jn",
  "choices": [],
  "created": 1749061362809,
  "model": "google/gemma-3-27b-instruct/bf-16",
  "object": "chat.completion"
}
```

### 2. Retrieve Results

Once your request is processed, retrieve the results using the generation endpoint:

<CodeGroup>
  ```typescript TypeScript theme={"system"}
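  // ID returned by the submit call in step 1 (response.id)
  const generationId = "N2mZQjrvh-k_m8nMMN7Jn";

  const response = await fetch(
    `https://api.inference.net/v1/generation/${generationId}`,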
    {
      headers: {
        Authorization: `Bearer ${process.env.INFERENCE_API_KEY}`,
      },
    },
  );
  const result = await response.json();
  ```

  ```python Python theme={"system"}
  import os
  import requests

  # ID returned by the submit call in step 1 (response.id)
  generation_id = "N2mZQjrvh-k_m8nMMN7Jn"

  response = requests.get(
      f"https://api.inference.net/v1/generation/{generation_id}",
      headers={"Authorization": f"Bearer {os.environ['INFERENCE_API_KEY']}"},
  )
  result = response.json()
  ```

  ```bash cURL theme={"system"}
  curl https://api.inference.net/v1/generation/N2mZQjrvh-k_m8nMMN7Jn \
    -H "Authorization: Bearer $INFERENCE_API_KEY"
  ```
</CodeGroup>

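The completed response includes both the original request and the generation result. Note that the sample below shows a different prompt than step 1 and a `max_tokens` of 8, which is why the output is cut off and `finish_reason` is `length`: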

```json JSON theme={"system"}
{
  "request": {
    "messages": [
      {"content": "You are a helpful assistant.", "role": "system"},
      {"content": "What is the meaning of life?", "role": "user"}
    ],
    "model": "google/gemma-3-27b-instruct/bf-16",
    "stream": false,
    "max_tokens": 8,
    "metadata": {"webhook_id": "mPufxRcrw"}
  },
  "response": {
    "id": "N2mZQjrvh-k_m8nMMN7Jn",
    "object": "chat.completion",
    "created": 1749061362,
    "model": "google/gemma-3-27b-instruct/bf-16",
    "choices": [
      {
        "index": 0,
        "message": {
          "role": "assistant",
          "content": "The meaning of life is a complex and",
          "reasoning_content": null,
          "tool_calls": null
        },
        "logprobs": null,
        "finish_reason": "length",
        "matched_stop": null
      }
    ],
    "usage": {
      "prompt_tokens": 48,
      "total_tokens": 56,
      "completion_tokens": 8,
      "prompt_tokens_details": null
    },
    "system_fingerprint": ""
  },
  "state": "Success",
  "stateMessage": "Generation successful",
  "finishedAt": "2025-06-04T18:22:42.912Z"
}
```
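
To illustrate the response shape above, here is a minimal sketch of pulling the assistant text out of a retrieved result. The helper name `extract_completion_text` is hypothetical, and the sketch assumes that a result whose `state` is not `Success` carries no usable choices.

```python
from typing import Optional


def extract_completion_text(result: dict) -> Optional[str]:
    """Return the first choice's text, or None if no completed output is present."""
    # Assumption: only results in the "Success" state carry a usable response.
    if result.get("state") != "Success":
        return None
    choices = (result.get("response") or {}).get("choices") or []
    if not choices:
        return None
    return choices[0]["message"]["content"]


# Trimmed copy of the sample response shown above.
sample = {
    "response": {
        "id": "N2mZQjrvh-k_m8nMMN7Jn",
        "choices": [
            {
                "index": 0,
                "message": {
                    "role": "assistant",
                    "content": "The meaning of life is a complex and",
                },
                "finish_reason": "length",
            }
        ],
    },
    "state": "Success",
}

print(extract_completion_text(sample))
```

Returning `None` rather than raising keeps the helper usable on results that are still pending or that failed.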

## Request States

Asynchronous requests can have the following states, reported in the `state` field of the generation response:

| State       | Description                                       |
| ----------- | ------------------------------------------------- |
| Queued      | Request received and queued for processing        |
| In Progress | Request is currently being processed              |
| Success     | Request completed successfully, results available |
| Failed      | Request failed due to an error                    |

## Best Practices

1. **Store Request IDs:** Always save the returned request ID for later retrieval.
2. **Use Webhooks:** Instead of polling, set up webhooks for real-time notifications when requests complete. See our [Getting Started with Webhooks](/api/async-inference/webhooks/getting-started-with-webhooks) guide.
3. **Handle Failures:** Have a fallback plan for requests that fail during processing.
4. **Batch When Possible:** For multiple requests, consider using our [Batch API](/api/async-inference/batch-api) for better organization.
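
As one way to follow the first practice, request IDs can be appended to a local JSON Lines file right after submission. This is only a sketch: the file layout and the helper names `record_request_id` and `load_request_ids` are assumptions, and a production system would more likely use a database.

```python
import json
from pathlib import Path
from typing import List


def record_request_id(log_path: str, request_id: str, note: str = "") -> None:
    """Append one request ID (plus an optional note) as a JSON line."""
    with Path(log_path).open("a", encoding="utf-8") as log_file:
        log_file.write(json.dumps({"id": request_id, "note": note}) + "\n")


def load_request_ids(log_path: str) -> List[str]:
    """Return all recorded request IDs in submission order."""
    path = Path(log_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as log_file:
        return [json.loads(line)["id"] for line in log_file if line.strip()]
```

After each submit call, `record_request_id("pending.jsonl", response.id)` would persist the ID so it survives a process restart.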

## Supported Endpoints

The Asynchronous Inference API supports the following endpoints:

* `/v1/slow/chat/completions`
* `/v1/slow/completions`

Simply replace `/v1/` with `/v1/slow/` in your existing code to use asynchronous processing.
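
If your code builds URLs by hand, the prefix swap can be sketched as a small helper. The function name `to_slow_url` is hypothetical, and the sketch assumes every URL you pass in has a path that starts with `/v1`.

```python
from urllib.parse import urlsplit, urlunsplit


def to_slow_url(url: str) -> str:
    """Rewrite a /v1/... URL to its /v1/slow/... equivalent."""
    parts = urlsplit(url)
    path = parts.path
    # Already asynchronous: leave the URL untouched.
    if path == "/v1/slow" or path.startswith("/v1/slow/"):
        return url
    if path == "/v1" or path.startswith("/v1/"):
        slow_path = "/v1/slow" + path[len("/v1"):]
        return urlunsplit(parts._replace(path=slow_path))
    raise ValueError(f"URL path does not start with /v1: {url}")


print(to_slow_url("https://api.inference.net/v1/chat/completions"))
```

Apply this only to the supported endpoints listed above. The retrieval endpoint `/v1/generation/{id}` stays under `/v1/`.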

## Pricing and Limits

* **Pricing:** Significantly reduced compared to synchronous requests (contact sales for specific rates)
* **Completion Time:** 24-72 hours
* **Rate Limits:** More generous than synchronous endpoints
* **Request Expiration:** Requests expire after 72 hours if not completed

For specific pricing information and higher rate limits, please contact [support@inference.net](mailto:support@inference.net).
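
To plan around the 72-hour expiration, you can estimate a deadline from the `created` value in the submit response. This sketch rests on two assumptions that the page does not confirm: that `created` in the submit response is in milliseconds since the Unix epoch (the sample value has 13 digits), and that the expiration window is counted from that moment. The name `expiration_deadline` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
EXPIRATION_WINDOW = timedelta(hours=72)


def expiration_deadline(created_ms: int) -> datetime:
    """Estimate when a request expires, assuming the window starts at `created`."""
    # Integer milliseconds avoid float rounding in the timestamp conversion.
    created_at = EPOCH + timedelta(milliseconds=created_ms)
    return created_at + EXPIRATION_WINDOW


deadline = expiration_deadline(1749061362809)  # `created` from the sample submit response
print(deadline.isoformat())
```

Comparing this deadline against the current time gives a rough cutoff after which a request that is still not in the `Success` state can be resubmitted.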
