Generate concise, descriptive captions for images with Vision-Language Models
Image captioning is the process of generating natural language descriptions for images using Vision-Language Models. Our API makes it easy to create accessible, SEO-friendly, and contextually relevant captions for any image at scale.
With vision models, you can generate captions by passing a base64-encoded image to the /chat/completions endpoint and asking for a description. The model analyzes the visual content and produces human-readable text describing what it sees.
You’ll need an Inference.net account and API key. See our Quick Start Guide for instructions on how to create an account and get an API key.
Install the OpenAI SDK for your language of choice and set the base URL to https://api.inference.net/v1.
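As a minimal sketch, a captioning request in Python might look like the following. The model name and the local file path are placeholders, not real identifiers; substitute a vision-capable model available on your account.

```python
import base64
from openai import OpenAI

# Point the OpenAI SDK at Inference.net (API key from your dashboard).
client = OpenAI(
    base_url="https://api.inference.net/v1",
    api_key="YOUR_INFERENCE_API_KEY",
)

# Encode the image as base64 and embed it in a data URL.
with open("photo.jpg", "rb") as f:  # placeholder path
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="your-vision-model",  # placeholder: use a vision-capable model on your account
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Write a concise, descriptive caption for this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```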
When using Structured Outputs, always instruct the model to respond in JSON format (e.g. via a system prompt).
Guarantee that every response contains exactly one alt_text field by using Structured Outputs.
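The sketch below shows one way to do this, assuming Inference.net accepts an OpenAI-style `response_format` with a JSON schema; the schema name and model are placeholders, and the `client` and `image_b64` variables come from the earlier example.

```python
import json

# JSON schema with a single required alt_text field.
caption_schema = {
    "type": "object",
    "properties": {
        "alt_text": {
            "type": "string",
            "description": "A concise caption suitable for an alt attribute.",
        }
    },
    "required": ["alt_text"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="your-vision-model",  # placeholder, as above
    messages=[
        # Per the note above: explicitly instruct the model to respond in JSON.
        {"role": "system", "content": "Respond in JSON with a single alt_text field."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Caption this image."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        },
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "caption", "schema": caption_schema, "strict": True},
    },
)

caption = json.loads(response.choices[0].message.content)["alt_text"]
print(caption)
```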
Processing thousands of images? Use the Batch API to caption large datasets asynchronously and avoid rate-limit constraints.
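As a rough sketch, and assuming the Batch API mirrors the OpenAI batch interface (a JSONL file of requests uploaded with purpose `batch`), a captioning batch could be submitted like this. The `encoded_images` list and model name are assumptions; see the Batch API docs for the authoritative request shape.

```python
import json

# Build a JSONL file with one captioning request per image.
with open("caption_batch.jsonl", "w") as f:
    for i, image_b64 in enumerate(encoded_images):  # assumption: list of base64 strings
        request = {
            "custom_id": f"image-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "your-vision-model",  # placeholder, as above
                "messages": [
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": "Write a concise caption for this image."},
                            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                        ],
                    }
                ],
            },
        }
        f.write(json.dumps(request) + "\n")

# Upload the file and create the batch job.
batch_file = client.files.create(file=open("caption_batch.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id)  # poll client.batches.retrieve(batch.id) until status == "completed"
```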
For more advanced vision workflows—object detection, multi-modal Q&A, or scene classification—see the Vision Models guide.