Learn how to use our OpenAI-compatible Asynchronous Inference API to send individual inference requests that complete within 24-72 hours at reduced costs. Simply use /v1/slow instead of /v1/ in your API calls to access this feature.
Background inference is cheaper and easier to build with when your application isn't serving real-time inference.
Asynchronous Inference API is compatible with all the models we offer.
Webhook support is only available for /chat/completions calls. Support for /completions will come later.
Overview
The Asynchronous Inference API provides a simple way to make cost-effective inference requests when immediate responses aren't required. By using the /v1/slow prefix instead of /v1/, you can:
- Get immediate request IDs: Your API call returns instantly with a unique ID.
- Save on costs: Enjoy significantly cheaper pricing compared to synchronous requests.
- Flexible completion: Requests complete within 24-72 hours.
- Same familiar API: Uses the exact same request format as our standard endpoints.
This API is perfect for use cases like:
- Large-scale content generation
- Batch document processing
- Non-urgent data analysis
- Cost-sensitive workloads
- Background processing tasks
Getting Started
Using the Asynchronous Inference API is as simple as changing your base URL from /v1/ to /v1/slow/. The API maintains full compatibility with the OpenAI SDK.
import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "https://api.inference.net/v1/slow", // Note the /v1/slow prefix
  apiKey: process.env.INFERENCE_API_KEY,
});
Making Asynchronous Requests
1. Submit a Request
Make requests exactly as you would with the standard API, but responses will include a request ID instead of the completion result:
const response = await openai.chat.completions.create({
  model: "meta-llama/llama-3.2-1b-instruct/fp-8",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "What is the capital of France?" }
  ],
  max_tokens: 1000
});
console.log(response.id); // Returns immediately with request ID
The initial response will include a unique request ID:
{
  "id": "N2mZQjrvh-k_m8nMMN7Jn",
  "choices": [],
  "created": 1749061362809,
  "model": "meta-llama/llama-3.1-8b-instruct/fp-16",
  "object": "chat.completion"
}
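Because choices is empty at this stage, the only field you need to keep is the ID. Below is a minimal sketch of storing it for later retrieval; the in-memory map is a hypothetical stand-in for whatever database or job store your application uses:

// Hypothetical stand-in for a database or job-queue record
const pendingRequests = new Map<string, { submittedAt: number }>();
pendingRequests.set(response.id, { submittedAt: Date.now() });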
2. Retrieve Results
Once your request is processed, retrieve the results using the generation endpoint:
curl https://api.inference.net/v1/generation/N2mZQjrvh-k_m8nMMN7Jn \
-H "Authorization: Bearer $INFERENCE_API_KEY"
The completed response includes both the original request and the generation result:
{
  "request": {
    "messages": [
      {"content": "You are a helpful assistant.", "role": "system"},
      {"content": "What is the meaning of life?", "role": "user"}
    ],
    "model": "meta-llama/llama-3.1-8b-instruct/fp-16",
    "stream": false,
    "max_tokens": 8,
    "metadata": {"webhook_id": "mPufxRcrw"}
  },
  "response": {
    "id": "N2mZQjrvh-k_m8nMMN7Jn",
    "object": "chat.completion",
    "created": 1749061362,
    "model": "meta-llama/llama-3.1-8b-instruct/fp-16",
    "choices": [
      {
        "index": 0,
        "message": {
          "role": "assistant",
          "content": "The meaning of life is a complex and",
          "reasoning_content": null,
          "tool_calls": null
        },
        "logprobs": null,
        "finish_reason": "length",
        "matched_stop": null
      }
    ],
    "usage": {
      "prompt_tokens": 48,
      "total_tokens": 56,
      "completion_tokens": 8,
      "prompt_tokens_details": null
    },
    "system_fingerprint": ""
  },
  "state": "Success",
  "stateMessage": "Generation successful",
  "finishedAt": "2025-06-04T18:22:42.912Z"
}
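If you want to retrieve results from code rather than curl, the generation endpoint can be called with any HTTP client. Here is a minimal sketch using fetch, assuming the endpoint and response shape shown above:

// Hypothetical helper: fetch a generation by request ID
async function getGeneration(requestId: string) {
  // Same URL and Authorization header as the curl example above
  const res = await fetch(`https://api.inference.net/v1/generation/${requestId}`, {
    headers: { Authorization: `Bearer ${process.env.INFERENCE_API_KEY}` },
  });
  if (!res.ok) {
    throw new Error(`Failed to fetch generation: ${res.status}`);
  }
  return res.json(); // { request, response, state, stateMessage, finishedAt }
}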
Request States
Asynchronous requests can have the following states:
| Status | Description |
| --- | --- |
| Queued | Request received and queued for processing |
| In Progress | Request is currently being processed |
| Success | Request completed successfully, results available |
| Failed | Request failed due to an error |
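As a sketch, polling code might branch on the state field like this, reusing the hypothetical getGeneration helper from above:

const result = await getGeneration("N2mZQjrvh-k_m8nMMN7Jn");

switch (result.state) {
  case "Success":
    // Results are available under response.choices, as in the sample response above
    console.log(result.response.choices[0].message.content);
    break;
  case "Failed":
    // See Best Practices below: have a fallback plan for failed requests
    console.error(`Generation failed: ${result.stateMessage}`);
    break;
  default:
    // "Queued" or "In Progress": check again later, or use webhooks instead of polling
    break;
}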
Best Practices
- Store Request IDs: Always save the returned request ID for later retrieval.
- Use Webhooks: Instead of polling, set up webhooks for real-time notifications when requests complete. See our Getting Started with Webhooks guide.
- Handle Failures: Have a fallback plan for requests that fail during processing.
- Batch When Possible: For multiple requests, consider using our Batch API for better organization.
Supported Endpoints
The Asynchronous Inference API supports the following endpoints:
- /v1/slow/chat/completions
- /v1/slow/completions
Simply replace /v1/ with /v1/slow/ in your existing code to use asynchronous processing.
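For example, a legacy text completion submitted through the OpenAI SDK might look like the sketch below (the model name is reused from the examples above; remember that webhook notifications for /completions are not yet available):

// Uses the same client from Getting Started, configured with the /v1/slow base URL
const completion = await openai.completions.create({
  model: "meta-llama/llama-3.1-8b-instruct/fp-16",
  prompt: "Write a haiku about asynchronous processing.",
  max_tokens: 100,
});

console.log(completion.id); // Returns immediately with a request ID, just like chat completions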
Pricing and Limits
- Pricing: Significantly reduced compared to synchronous requests (contact sales for specific rates)
- Completion Time: 24-72 hours
- Rate Limits: More generous than synchronous endpoints
- Request Expiration: Requests expire after 72 hours if not completed
For specific pricing information and higher rate limits, please contact [email protected].