Batch API
Process jobs asynchronously with Batch API.
Learn how to use our OpenAI-compatible Batch API to send asynchronous groups of inference requests to Inference Cloud, with nearly unlimited rate limits and fast completion times. The service is ideal for processing a large number of jobs that don’t require immediate responses.
The Batch API is currently compatible with all the models we offer.
Overview
While some uses require you to send synchronous requests, there are many cases where requests do not need an immediate response or rate limits prevent you from executing a large number of queries quickly. Batch processing jobs are often helpful in use cases like:
- Extracting structured data from a large number of documents.
- Generating synthetic data for training.
- Translating a large number of documents into other languages.
- Summarizing a large number of customer interactions.
Inference Cloud’s Batch API offers a straightforward set of endpoints that allow you to upload a batch of requests, kick off a batch processing job, query for the status of the batch, and eventually retrieve the collected results when the batch is complete.
Compared to using standard endpoints directly, Batch API has:
- Higher rate limits: Substantially more headroom compared to the synchronous APIs.
- Fast completion times: Each batch completes within 24 hours (and often much more quickly).
Getting Started
You’ll need an Inference Cloud account and API key to use the Batch API. See our Quick Start Guide for instructions on how to create an account and get an API key.
Install the OpenAI SDK for your language of choice.
To connect to Inference Cloud using the OpenAI SDK, you will need to set the `baseURL` to `https://batch.inference.net/v1` and the `apiKey` to your Inference Cloud API key, as shown below:
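For example, with the TypeScript SDK (a minimal sketch; the environment variable name `INFERENCE_API_KEY` is just a placeholder):

```typescript
import OpenAI from "openai";

// Point the OpenAI SDK at Inference Cloud's Batch API base URL.
const client = new OpenAI({
  baseURL: "https://batch.inference.net/v1",
  apiKey: process.env.INFERENCE_API_KEY, // your Inference Cloud API key
});
```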
Running A Batch Processing Job
1. Preparing Your Batch File
Prepare a `.jsonl` file where each line is a separate JSON object that represents an individual request.
Each JSON object must be on a single line and cannot contain any line breaks.
Each JSON object must include the following fields:
- `custom_id`: A unique identifier for the request. This is used to reference the request’s results after completion. It must be unique for each request in the file.
- `method`: The HTTP method to use for the request. Currently, only `POST` is supported.
- `url`: The URL to send the request to. Currently, only `/v1/chat/completions` and `/v1/completions` are supported.
- `body`: The request body, which contains the input for the inference request. The parameters in each line’s `body` field are the same as the parameters for the underlying endpoint specified by the `url` field. See this example for more details.
Here’s an example of an input file with 2 requests using the `/v1/chat/completions` endpoint:
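(The model IDs and prompts below are illustrative placeholders; substitute any model supported by Inference Cloud.)

```jsonl
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "your-model-id", "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is 2 + 2?"}], "max_tokens": 100}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "your-model-id", "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Write a one-line haiku about the ocean."}], "max_tokens": 100}}
```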
And here is an example of an input file using the `/v1/completions` endpoint:
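(Again, the model ID and prompts are placeholders.)

```jsonl
{"custom_id": "request-1", "method": "POST", "url": "/v1/completions", "body": {"model": "your-model-id", "prompt": "Once upon a time,", "max_tokens": 50}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/completions", "body": {"model": "your-model-id", "prompt": "The capital of France is", "max_tokens": 5}}
```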
2. Uploading Your Batch Input File
In order to create a Batch Processing job, you must first upload your input file.
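A sketch using the TypeScript SDK (the filename `batch_input.jsonl` is a placeholder):

```typescript
import fs from "fs";
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://batch.inference.net/v1",
  apiKey: process.env.INFERENCE_API_KEY,
});

// Upload the .jsonl input file with purpose "batch".
const inputFile = await client.files.create({
  file: fs.createReadStream("batch_input.jsonl"),
  purpose: "batch",
});

console.log(inputFile.id);
```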
The response will look similar to this, depending on the language you are using:
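Assuming an OpenAI-compatible Files response, the returned file object looks roughly like this (values are illustrative):

```json
{
  "id": "file-abc123",
  "object": "file",
  "bytes": 120000,
  "created_at": 1715021300,
  "filename": "batch_input.jsonl",
  "purpose": "batch"
}
```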
3. Starting the Batch Processing Job
Once you’ve successfully uploaded your input file, you can use the ID of the file to create a batch.
In this case, let’s assume the file ID is `file-abc123`.
For now, the completion window can only be set to `24h`.
To associate custom metadata with the batch, you can provide an optional `metadata` parameter.
This metadata is not used by Inference Cloud when processing requests, but it is included when you retrieve the batch’s status, so you can use it to identify and reference the batch later.
Note: The Batch Processing job will begin processing immediately after creation.
Create the Batch
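A minimal sketch with the TypeScript SDK, reusing the `client` configured earlier (the metadata values are placeholders):

```typescript
const batch = await client.batches.create({
  input_file_id: "file-abc123",
  endpoint: "/v1/chat/completions",
  completion_window: "24h",
  metadata: {
    description: "nightly document summarization job", // optional custom metadata
  },
});

console.log(batch.id);
```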
This request will return a batch object with metadata about your batch:
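Since the API is OpenAI-compatible, the batch object has roughly the following shape (timestamps and counts are illustrative):

```json
{
  "id": "batch_abc123",
  "object": "batch",
  "endpoint": "/v1/chat/completions",
  "errors": null,
  "input_file_id": "file-abc123",
  "completion_window": "24h",
  "status": "validating",
  "output_file_id": null,
  "error_file_id": null,
  "created_at": 1715021400,
  "expires_at": 1715107800,
  "completed_at": null,
  "request_counts": {
    "total": 0,
    "completed": 0,
    "failed": 0
  },
  "metadata": {
    "description": "nightly document summarization job"
  }
}
```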
Inference Cloud supports a `webhook_url` parameter that you can set to receive a webhook notification when the batch is complete. The `webhook_url` must be an HTTPS URL that can receive POST requests.
Your webhook will receive a POST with a request JSON body that looks like this:
If you are using TypeScript for the OpenAI SDK, specifying the `webhook_url` in the request body will result in a type error because it is not an officially supported parameter. You can safely ignore this error by casting the body as type `BatchCreateParams`, like this:
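A sketch (the exact import path for `BatchCreateParams` may vary with your SDK version, and the webhook URL is a placeholder):

```typescript
import type { BatchCreateParams } from "openai/resources/batches";

const batch = await client.batches.create({
  input_file_id: "file-abc123",
  endpoint: "/v1/chat/completions",
  completion_window: "24h",
  webhook_url: "https://example.com/webhooks/batch-complete", // not in the official type, hence the cast
} as BatchCreateParams);
```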
4. Checking the Status of a Batch
You can check the status of a batch at any time, which will also return a Batch object.
Check the status of a batch by retrieving it using the Batch ID assigned to it by Inference Cloud (represented here by `batch_abc123`).
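For example, with the TypeScript SDK:

```typescript
const batch = await client.batches.retrieve("batch_abc123");

console.log(batch.status);
```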
The status of a given Batch object can be any of the following:
| Status | Description |
| --- | --- |
| `validating` | The input file is being validated before the batch can begin. |
| `failed` | The input file has failed the validation process. |
| `in_progress` | The input file was successfully validated and the batch is currently being run. |
| `finalizing` | The batch has completed and the results are being prepared. |
| `completed` | The batch has been completed and the results are ready. |
| `expired` | The batch was not able to be completed within the 24-hour time window. |
| `cancelling` | The batch is being cancelled (may take up to 10 minutes). |
| `cancelled` | The batch was cancelled. |
5. Retrieving the Results
You will receive an email notification when the batch is complete.
Once the batch is complete, you can download the output by making a request against the Files API using the `output_file_id` field from the Batch object.
Similarly, you can retrieve the error file (containing all failed requests) by making a request against the Files API using the `error_file_id` field from the Batch object.
Supposing the output file ID is `output-file-id`, as in the following example:
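A sketch using the TypeScript SDK:

```typescript
// Download the batch output file and print its JSONL contents.
const fileResponse = await client.files.content("output-file-id");
const fileContents = await fileResponse.text();

console.log(fileContents);
```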
The output `.jsonl` file will have one response line for every successful request line in the input file. Any failed requests in the batch will have their error information written to an error file that can be found via the batch’s `error_file_id`.
Note that the output line order may not match the input line order.
Instead of relying on order to process your results, use the `custom_id` field, which will be present in each line of your output file and allows you to map requests in your input to results in your output.
Listing All Batches
At any time, you can see all your batches. For users with many batches, you can use the `limit` and `after` parameters to paginate your results.
If an `after` parameter is provided, the list will return batches after the specified batch ID.
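For example, with the TypeScript SDK (the `after` value is a placeholder batch ID):

```typescript
const batches = await client.batches.list({
  limit: 10,
  after: "batch_abc123", // return batches that come after this batch ID
});

for await (const batch of batches) {
  console.log(batch.id, batch.status);
}
```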
Batch Expiration
Batches that do not complete in time eventually move to an `expired` state; unfinished requests within that batch are cancelled, and any responses to completed requests are made available via the batch’s output file. You will only be charged for tokens consumed from any completed requests.
Expired requests will be written to your error file with the message as shown below. You can use the `custom_id` to retrieve the request data for expired requests.
Rate Limits
Batch API rate limits are separate from existing per-model rate limits. A single batch may include up to 50,000 requests, and a batch input file can be up to 200 MB in size. If you need higher rate limits, please contact us at [email protected].
Compatibility Notes
1. Batch Cancellation
Although the OpenAI SDK supports the ability to cancel an in-progress batch, Inference Cloud does not currently support batch cancellation. This is under development and will be available soon.
2. Model Availability
Inference Cloud’s Batch Processing is compatible with all of Inference Cloud’s supported models. See our list of supported models for a complete list.