For detailed rate limit information, including current limits by tier, see the Inference API rate limits page.

What happens when you hit a limit

You’ll receive an HTTP 429 Too Many Requests response. Back off and retry using exponential backoff, ideally with jitter so concurrent clients don’t retry in lockstep.
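The retry pattern can be sketched as follows. This is a minimal illustration using only the standard library; the endpoint URL and headers are placeholders, not the real API surface, and the base delay and cap are example values you should tune.

```python
import random
import time
import urllib.error
import urllib.request

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Exponential backoff with full jitter: a random delay in
    [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def post_with_retry(url, data, headers, max_attempts=5):
    """POST `data` to `url`, retrying on HTTP 429 with backoff.

    Any other error, or a 429 on the final attempt, is re-raised.
    `url` and `headers` are hypothetical placeholders here.
    """
    for attempt in range(max_attempts):
        req = urllib.request.Request(url, data=data, headers=headers, method="POST")
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code == 429 and attempt < max_attempts - 1:
                time.sleep(backoff_delay(attempt))  # wait, then retry
                continue
            raise
```

Full jitter (a uniformly random delay up to the exponential cap) spreads retries out more evenly than a fixed exponential schedule, which helps when many clients hit the limit at once.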

Deployment-specific limits

Self-serve deployments on a single GPU have inherent throughput limits. If traffic exceeds capacity, latency increases and requests eventually return 429s. For higher throughput, see Scale to Production.

Requesting higher limits

If you need higher rate limits, contact the team to discuss your use case.