
For detailed rate limit information including current limits by tier, see the Inference API rate limits page.

What happens when you hit a limit

You’ll receive a 429 Too Many Requests response. Rather than immediately resending the request, retry with exponential backoff.
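The retry behavior above can be sketched as a small helper that wraps any request-sending callable. This is an illustrative sketch, not an official client: the `send` callable, its `(status, body)` return shape, and the delay parameters are all assumptions for the example.

```python
import random
import time

def with_backoff(send, max_retries=5, base_delay=1.0, max_delay=30.0):
    """Call `send()` and retry on 429, sleeping with exponential backoff.

    `send` is any zero-argument callable returning (status_code, body);
    this shape is an assumption for the sketch, not part of the API.
    """
    for attempt in range(max_retries + 1):
        status, body = send()
        if status != 429:
            return status, body
        if attempt == max_retries:
            break
        # Exponential backoff: base, 2*base, 4*base, ... capped at max_delay,
        # plus a little random jitter so retries from many clients spread out.
        delay = min(base_delay * (2 ** attempt), max_delay)
        time.sleep(delay + random.uniform(0, delay * 0.1))
    return status, body

# Usage with a fake client that is rate-limited twice before succeeding.
calls = {"n": 0}
def fake_send():
    calls["n"] += 1
    return (429, "rate limited") if calls["n"] < 3 else (200, "ok")

status, body = with_backoff(fake_send, base_delay=0.01)
```

In a real client, `send` would issue the HTTP request to the API; the jitter keeps simultaneous clients from retrying in lockstep.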

Deployment-specific limits

Self-serve deployments on a single GPU have inherent throughput limits. If traffic exceeds capacity, requests slow down and eventually return 429s. For higher throughput, see Scale to Production.

Requesting higher limits

If your workload needs higher rate limits, contact the team.