Overview
Make cost-effective inference requests with flexible completion times.
Learn how to use our OpenAI-compatible Asynchronous Inference API to send individual inference requests that complete within 24-72 hours at reduced costs. Simply use `/v1/slow` instead of `/v1/` in your API calls to access this feature.
Background inference is cheaper and easier to build with when your application isn’t serving real-time inference.
The Asynchronous Inference API is compatible with all the models we offer.
Webhook support is only available for `/chat/completions` calls; support for `/completions` will come later.
Overview
The Asynchronous Inference API provides a simple way to make cost-effective inference requests when immediate responses aren’t required. By using the `/v1/slow` prefix instead of `/v1/`, you can:
- Get immediate request IDs: Your API call returns instantly with a unique ID.
- Save on costs: Enjoy significantly cheaper pricing compared to synchronous requests.
- Flexible completion: Requests complete within 24-72 hours.
- Same familiar API: Uses the exact same request format as our standard endpoints.
This API is perfect for use cases like:
- Large-scale content generation
- Batch document processing
- Non-urgent data analysis
- Cost-sensitive workloads
- Background processing tasks
Getting Started
Using the Asynchronous Inference API is as simple as changing your base URL from `/v1/` to `/v1/slow/`. The API maintains full compatibility with the OpenAI SDK.
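For example, with the official `openai` Python SDK, only the client's base URL changes. This is a minimal sketch; the host below is a placeholder, so substitute your provider's actual URL:

```python
from openai import OpenAI

# Placeholder host -- substitute your provider's actual URL.
client = OpenAI(
    base_url="https://api.example.com/v1/slow",
    api_key="YOUR_API_KEY",
)
```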
Making Asynchronous Requests
1. Submit a Request
Make requests exactly as you would with the standard API, but responses will include a request ID instead of the completion result:
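As an illustrative sketch using plain HTTP (the host, model name, and prompt are placeholders, not values from this documentation):

```python
import requests

BASE_URL = "https://api.example.com/v1/slow"  # placeholder host
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

payload = {
    "model": "your-model-name",  # any model the provider offers
    "messages": [{"role": "user", "content": "Summarize last quarter's report."}],
}

# Same request format as the synchronous endpoint; only the prefix differs.
resp = requests.post(f"{BASE_URL}/chat/completions", json=payload, headers=HEADERS)
print(resp.json())  # returns immediately with a request ID, not the completion
```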
The initial response will include a unique request ID:
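The exact payload shape isn't pinned down here; as a rough sketch, it might look something like the following (field names are assumptions, not confirmed):

```python
# Hypothetical immediate response -- field names are illustrative only.
{
    "id": "req_abc123",  # unique request ID; store it for retrieval
    "status": "Queued",  # initial state (see Request States below)
}
```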
2. Retrieve Results
Once your request is processed, retrieve the results using the generation endpoint:
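The exact path of the generation endpoint isn't shown here; the sketch below assumes a `GET /generation` route keyed by the request ID, which you should verify against the API reference:

```python
import requests

BASE_URL = "https://api.example.com/v1/slow"  # placeholder host
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

request_id = "req_abc123"  # the ID returned at submission time

# Hypothetical route and query parameter -- verify against the API reference.
result = requests.get(
    f"{BASE_URL}/generation", params={"id": request_id}, headers=HEADERS
)
print(result.json())
```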
The completed response includes both the original request and the generation result:
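Again as an illustrative sketch only, a completed response might be shaped roughly like this:

```python
# Hypothetical completed response -- structure is illustrative, not confirmed.
{
    "id": "req_abc123",
    "status": "Success",
    "request": {"model": "your-model-name", "messages": ["..."]},  # original request
    "response": {"choices": ["..."]},  # generation result in the standard shape
}
```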
Request States
Asynchronous requests can have the following states:
| Status | Description |
| --- | --- |
| Queued | Request received and queued for processing |
| In Progress | Request is currently being processed |
| Success | Request completed successfully, results available |
| Failed | Request failed due to an error |
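A client can poll for these states until the request resolves. The sketch below reuses the hypothetical generation endpoint and `status` field from above, both of which are assumptions to verify against the API reference:

```python
import time

import requests

BASE_URL = "https://api.example.com/v1/slow"  # placeholder host
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

def wait_for_result(request_id: str, interval_s: float = 600.0) -> dict:
    """Poll the (hypothetical) generation endpoint until the request resolves."""
    while True:
        body = requests.get(
            f"{BASE_URL}/generation", params={"id": request_id}, headers=HEADERS
        ).json()
        status = body.get("status")  # assumed field name
        if status == "Success":
            return body
        if status == "Failed":
            raise RuntimeError(f"request {request_id} failed")
        # Queued / In Progress: with a 24-72 hour window, poll sparingly.
        time.sleep(interval_s)
```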
Best Practices
- Store Request IDs: Always save the returned request ID for later retrieval.
- Use Webhooks: Instead of polling, set up webhooks for real-time notifications when requests complete. See our Getting Started with Webhooks guide; a minimal receiver sketch follows this list.
- Handle Failures: Have a fallback plan for requests that fail during processing.
- Batch When Possible: For multiple requests, consider using our Batch API for better organization.
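The webhook payload schema isn't specified here; this minimal Flask sketch assumes a JSON body carrying the request ID and final status, so check the Getting Started with Webhooks guide for the actual schema:

```python
from flask import Flask, request

app = Flask(__name__)

# Hypothetical payload: JSON with the request ID and final status. Check the
# Getting Started with Webhooks guide for the actual schema.
@app.route("/webhooks/async-inference", methods=["POST"])
def handle_completion():
    event = request.get_json(force=True)
    if event.get("status") == "Success":
        print(f"Request {event['id']} completed; fetch results now")
    return "", 204
```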
Supported Endpoints
The Asynchronous Inference API supports the following endpoints:
- `/v1/slow/chat/completions`
- `/v1/slow/completions`

Simply replace `/v1/` with `/v1/slow/` in your existing code to use asynchronous processing.
Pricing and Limits
- Pricing: Significantly reduced compared to synchronous requests (contact sales for specific rates)
- Completion Time: 24-72 hours
- Rate Limits: More generous than synchronous endpoints
- Request Expiration: Requests expire after 72 hours if not completed
For specific pricing information and higher rate limits, please contact [email protected].