Use /v1/slow instead of /v1/ in your API calls to access this feature.
Background inference is cheaper and easier to build with when your application isn’t serving real-time inference.
The Asynchronous Inference API is compatible with all the models we offer.
Webhook support is only available for /chat/completions calls. Support for /completions will come later.
Overview
The Asynchronous Inference API provides a simple way to make cost-effective inference requests when immediate responses aren’t required. By using the /v1/slow prefix instead of /v1/, you can:
- Get immediate request IDs: Your API call returns instantly with a unique ID.
- Save on costs: Enjoy significantly cheaper pricing compared to synchronous requests.
- Flexible completion: Requests complete within 24-72 hours.
- Same familiar API: Uses the exact same request format as our standard endpoints.
Ideal use cases include:
- Large-scale content generation
- Batch document processing
- Non-urgent data analysis
- Cost-sensitive workloads
- Background processing tasks
Getting Started
Using the Asynchronous Inference API is as simple as changing your base URL from /v1/ to /v1/slow/. The API maintains full compatibility with the OpenAI SDK.
Making Asynchronous Requests
1. Submit a Request
Make requests exactly as you would with the standard API, but responses will include a request ID instead of the completion result:
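For illustration, here is a minimal sketch using the OpenAI Python SDK. The host name, model name, and the exact response field carrying the request ID (`id` below) are assumptions; check the API reference for the values that apply to your account.

```python
# A minimal sketch, assuming an OpenAI-compatible host; the base URL and
# model name below are placeholders, not real values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example.com/v1/slow",  # /v1/slow instead of /v1
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="your-model-name",  # any model the provider offers
    messages=[{"role": "user", "content": "Summarize this document..."}],
)

# The call returns immediately. Assumption: the request ID is surfaced in
# the response's `id` field; store it for later retrieval.
request_id = response.id
print(request_id)
```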
2. Retrieve Results
Once your request is processed, retrieve the results using the generation endpoint:
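As a sketch of what retrieval might look like, assuming a `GET /v1/slow/generation/{request_id}` style path (the actual route is documented in the API reference):

```python
# Hypothetical retrieval call; the endpoint path is an assumption.
import requests

request_id = "req_abc123"  # the ID returned at submission time

resp = requests.get(
    f"https://api.example.com/v1/slow/generation/{request_id}",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
print(resp.json())  # the completed result, once the request has finished
```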
Request States
Asynchronous requests can have the following states:
| Status | Description |
|---|---|
| Queued | Request received and queued for processing |
| In Progress | Request is currently being processed |
| Success | Request completed successfully, results available |
| Failed | Request failed due to an error |
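If you do poll rather than use webhooks, the loop below is one way to handle these states. It is a sketch only: the endpoint path and the `status` field name are assumptions, and the status strings mirror the table above.

```python
# A minimal polling sketch; endpoint path and response shape are assumptions.
import time
import requests

def wait_for_result(request_id: str, api_key: str, interval: int = 900) -> dict:
    url = f"https://api.example.com/v1/slow/generation/{request_id}"
    headers = {"Authorization": f"Bearer {api_key}"}
    while True:
        data = requests.get(url, headers=headers).json()
        status = data.get("status")
        if status == "Success":
            return data  # results are available
        if status == "Failed":
            raise RuntimeError(f"Request {request_id} failed")
        time.sleep(interval)  # Queued or In Progress: check again later
```

Given the 24-72 hour completion window, poll sparingly; webhooks (see Best Practices below) avoid polling entirely.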
Best Practices
- Store Request IDs: Always save the returned request ID for later retrieval.
- Use Webhooks: Instead of polling, set up webhooks for real-time notifications when requests complete. See our Getting Started with Webhooks guide and the receiver sketch after this list.
- Handle Failures: Have a fallback plan for requests that fail during processing.
- Batch When Possible: For multiple requests, consider using our Batch API for better organization.
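As a rough sketch of the webhook pattern, here is a hypothetical receiver using Flask. The payload fields (`request_id`, `status`) and the delivery format are assumptions; the Getting Started with Webhooks guide covers the actual schema and signature verification.

```python
# Hypothetical webhook receiver; the payload shape is an assumption.
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhooks/inference", methods=["POST"])
def inference_complete():
    event = request.get_json()
    request_id = event.get("request_id")
    if event.get("status") == "Success":
        # Fetch the finished result from the generation endpoint,
        # or read it from the payload if it is included.
        print(f"{request_id} completed")
    else:
        print(f"{request_id} failed; trigger your fallback path")
    return "", 200

if __name__ == "__main__":
    app.run(port=8000)
```

Recall that webhook notifications are currently available only for /chat/completions requests.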
Supported Endpoints
The Asynchronous Inference API supports the following endpoints:
- /v1/slow/chat/completions
- /v1/slow/completions
Simply replace /v1/ with /v1/slow/ in your existing code to use asynchronous processing.
Pricing and Limits
- Pricing: Significantly reduced compared to synchronous requests (contact sales for specific rates)
- Completion Time: 24-72 hours
- Rate Limits: More generous than synchronous endpoints
- Request Expiration: Requests expire after 72 hours if not completed