For detailed rate limit information, including current limits by tier, see the Inference API rate limits page.

What happens when you hit a limit

You’ll receive an HTTP 429 Too Many Requests response. Back off and retry using exponential backoff, ideally with jitter so concurrent clients don’t retry in lockstep.
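The retry pattern can be sketched as follows. This is a minimal illustration using only the standard library; the endpoint URL and headers are placeholders, not the real API surface, and the base delay and cap are example values you should tune.

```python
import random
import time
import urllib.error
import urllib.request

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Exponential backoff with full jitter: a random delay in
    [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def post_with_retry(url, data, headers, max_attempts=5):
    """POST `data` to `url`, retrying on HTTP 429 with backoff.

    Any other error, or a 429 on the final attempt, is re-raised.
    `url` and `headers` are hypothetical placeholders here.
    """
    for attempt in range(max_attempts):
        req = urllib.request.Request(url, data=data, headers=headers, method="POST")
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code == 429 and attempt < max_attempts - 1:
                time.sleep(backoff_delay(attempt))  # wait, then retry
                continue
            raise
```

Full jitter (a uniformly random delay up to the exponential cap) spreads retries out more evenly than a fixed exponential schedule, which helps when many clients hit the limit at once.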

Deployment-specific limits

Self-serve deployments on a single GPU have inherent throughput limits. If traffic exceeds capacity, latency increases and requests eventually return 429s. For higher throughput, see Scale to Production.

Requesting higher limits

If you need higher rate limits, contact the team to discuss your use case.