Generation Rate Limits

Rate limits for model inference requests are based on your account tier:
Tier           Requests per minute (RPM)
Free           30
Paid           250
Custom teams   500 - 1,000,000 (per-team overrides)
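A client can avoid hitting its tier's cap by throttling itself before the gateway does. The sketch below is a minimal client-side sliding-window limiter; the `TIER_RPM` mapping, tier keys, and `RpmThrottle` class are illustrative names, not part of the platform API.

```python
import time
from collections import deque

# Requests-per-minute caps by account tier (values from the table above).
# The "free"/"paid" keys are illustrative; use your own tier identifiers.
TIER_RPM = {"free": 30, "paid": 250}

class RpmThrottle:
    """Client-side sliding-window limiter for a given RPM cap."""

    def __init__(self, rpm):
        self.rpm = rpm
        self.sent = deque()  # timestamps of requests in the last 60 s

    def wait_time(self, now=None):
        """Seconds to wait before the next request is allowed (0 if none)."""
        now = time.monotonic() if now is None else now
        # Drop timestamps that have aged out of the 60-second window.
        while self.sent and now - self.sent[0] >= 60:
            self.sent.popleft()
        if len(self.sent) < self.rpm:
            return 0.0
        # The window frees a slot when its oldest request turns 60 s old.
        return 60 - (now - self.sent[0])

    def record(self, now=None):
        """Note that a request was just sent."""
        self.sent.append(time.monotonic() if now is None else now)
```

Before each request, call `wait_time()` and sleep for the returned duration, then `record()` the send.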

Deployed Model Rate Limits

Models you deploy on the platform have a rate limit of 300 RPM per instance. If you need higher throughput, scale up the number of instances — total RPM equals the number of instances multiplied by 300.
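The scaling rule above reduces to simple arithmetic: to size a deployment for a target throughput, divide by 300 and round up. A small sketch (the `instances_needed` helper is a hypothetical name, not a platform function):

```python
import math

RPM_PER_INSTANCE = 300  # per-instance deployed-model limit from the docs

def instances_needed(target_rpm):
    """Smallest instance count whose combined RPM meets the target."""
    return max(1, math.ceil(target_rpm / RPM_PER_INSTANCE))

def total_rpm(instances):
    """Aggregate limit: instances multiplied by the per-instance cap."""
    return instances * RPM_PER_INSTANCE
```

For example, a 1,000 RPM target needs 4 instances, which gives 1,200 RPM of headroom in total.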

Batch API Rate Limits

  • Batch file upload: 1 per minute
  • Batch processing rate limits are separate from generation rate limits. See the Batch API docs for details.
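When submitting several batch files, the simplest way to respect the one-upload-per-minute limit is to pace the calls on the client. A minimal sketch, assuming an `upload_fn` you supply from your own HTTP client (the function and parameter names are hypothetical):

```python
import time

UPLOAD_INTERVAL_S = 60  # batch file uploads are limited to 1 per minute

def paced_uploads(upload_fn, files, sleep=time.sleep, clock=time.monotonic):
    """Upload files one at a time, spacing calls at least 60 s apart."""
    last = None
    for f in files:
        now = clock()
        if last is not None and now - last < UPLOAD_INTERVAL_S:
            # Wait out the remainder of the one-minute interval.
            sleep(UPLOAD_INTERVAL_S - (now - last))
        upload_fn(f)
        last = clock()
```

The injectable `sleep` and `clock` parameters keep the helper testable without real waits.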

General API Rate Limits

The Catalyst gateway enforces a general rate limit of 100 requests per second per API key or IP address. This applies across all endpoints.
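Requests that exceed the gateway limit are typically rejected with HTTP 429, so clients should retry with backoff rather than hammering the endpoint. A minimal sketch using exponential backoff with full jitter; the `call` function's `(status, body)` return shape is an assumption you would adapt to your HTTP client:

```python
import random
import time

def with_backoff(call, max_attempts=5, base=1.0, sleep=time.sleep):
    """Retry `call` while it reports HTTP 429, backing off between tries.

    `call` is assumed to return a (status_code, body) tuple; adapt this
    to whatever HTTP client you use.
    """
    for attempt in range(max_attempts):
        status, body = call()
        if status != 429:
            return status, body
        # Exponential backoff with full jitter: 0..base*2^attempt seconds.
        sleep(random.uniform(0, base * 2 ** attempt))
    return status, body  # still rate-limited after all attempts
```

Honoring a `Retry-After` response header, when the gateway sends one, is a reasonable refinement over pure exponential backoff.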

Increasing Your Limits

If you need higher rate limits, contact us or use the support chat to request a custom tier.