Rate Limits and Quotas - Inference.net Documentation

Direct API defaults

Current baseline limits for the direct API are:

Language models: 500 requests per minute
Image models: 100 requests per minute

These are the simplest numbers to plan around when sending traffic to the shared realtime API.
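
As a rough illustration, a client can throttle itself to stay under these defaults. The sketch below is a generic sliding-window limiter, not part of any Inference.net SDK. The class name `SlidingWindowLimiter` and the two limiter variables are hypothetical, and the sketch assumes limits are counted over a rolling 60-second window, which the documentation above does not specify.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Client-side limiter: at most `max_requests` per `window_seconds`.

    Hypothetical helper for illustration; the rolling-window behaviour is
    an assumption, not documented server behaviour.
    """

    def __init__(self, max_requests, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self.sent = deque()  # timestamps of requests still inside the window

    def wait_time(self):
        """Return seconds to wait before the next request (0.0 if allowed now)."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.sent and now - self.sent[0] >= self.window_seconds:
            self.sent.popleft()
        if len(self.sent) < self.max_requests:
            return 0.0
        return self.window_seconds - (now - self.sent[0])

    def acquire(self, sleep=time.sleep):
        """Block until a request slot is free, then record the request."""
        delay = self.wait_time()
        while delay > 0:
            sleep(delay)
            delay = self.wait_time()
        self.sent.append(self.clock())


# Baseline limits from the list above.
language_limiter = SlidingWindowLimiter(500)
image_limiter = SlidingWindowLimiter(100)
```

Calling `language_limiter.acquire()` before each language-model request keeps a single process at or below 500 requests per minute. Multiple processes sharing one account would need a shared counter instead.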

When limits become the wrong tool

If you are hitting rate limits regularly, the answer is often to change the execution mode rather than just ask for a larger number. Consider:

/api/background-jobs for delayed single-request workflows
/guides/choose-realtime-background-group-or-batch when you need help choosing between background jobs, group jobs, and batch
/api/batch for large offline workloads
/deploy/overview if the workload deserves dedicated capacity
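
One way to judge whether you are hitting limits "regularly" is to track what share of recent requests were rate limited. The sketch below is a hypothetical helper, not an Inference.net API. It assumes rate-limited responses carry HTTP status 429, which the text above does not confirm, and the window size is arbitrary.

```python
from collections import deque


class RateLimitTracker:
    """Tracks the share of recent responses that were rate limited.

    Assumption: a rate-limited response has HTTP status 429.
    """

    def __init__(self, window=1000):
        # Keep only the most recent `window` outcomes.
        self.outcomes = deque(maxlen=window)

    def record(self, status_code):
        """Record one response; True means it was rate limited."""
        self.outcomes.append(status_code == 429)

    def limited_fraction(self):
        """Fraction of recorded responses that were rate limited."""
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)
```

A persistently high fraction suggests the workload may fit one of the execution modes listed above better than the realtime API. What counts as "high" depends on your latency requirements.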

Need a higher limit?

If you need more headroom on the direct API, meet with our team or contact [email protected].
