Introduction
Image captioning is the process of generating natural language descriptions for images using Vision-Language Models. Our API makes it easy to create accessible, SEO-friendly, and contextually relevant captions for any image at scale. With Vision models, you can generate captions by passing a base64-encoded image to the `/chat/completions` endpoint and asking for a description. The model analyzes the visual content and produces human-readable text that describes what it sees.
Key Benefits
- Accessibility – Generate alt-text at scale for better web accessibility
- SEO & Discovery – Create rich, indexable metadata for visual content
- Content Automation – Auto-fill social preview text, product descriptions, and more
- Zero-shot Capability – Works out-of-the-box on any domain without fine-tuning
Getting Started
You’ll need an Inference.net account and API key. See our Quick Start Guide for instructions on how to create an account and get an API key. Install the OpenAI SDK for your language of choice and set the base URL to `https://api.inference.net/v1`.
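For example, with the Python SDK (the environment-variable name below is an assumption; store your key however you prefer):

```python
import os

from openai import OpenAI

# Point the OpenAI SDK at Inference.net's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.inference.net/v1",
    api_key=os.environ["INFERENCE_API_KEY"],  # assumed variable name
)
```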
Quick Example
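A minimal captioning call, reusing the client from above, might look like the following sketch. The model name is a placeholder; substitute any vision-capable model from our model list.

```python
import base64

# Read a local image and encode it as a base64 data URL.
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="your-vision-model",  # placeholder model ID
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```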
Structured Outputs for Clean Alt-Text
You can guarantee that every response contains an `alt_text` field by using Structured Outputs. When using Structured Outputs, always instruct the model to respond in JSON format (e.g. via a system prompt).
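A sketch of such a request, reusing `client` and `image_b64` from the earlier examples; the schema name and prompt wording are illustrative:

```python
import json

response = client.chat.completions.create(
    model="your-vision-model",  # placeholder model ID
    messages=[
        {
            "role": "system",
            "content": "Respond in JSON with a single 'alt_text' field describing the image.",
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                }
            ],
        },
    ],
    # Constrain the response to a JSON object with a required alt_text string.
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "alt_text_caption",  # illustrative schema name
            "schema": {
                "type": "object",
                "properties": {"alt_text": {"type": "string"}},
                "required": ["alt_text"],
                "additionalProperties": False,
            },
            "strict": True,
        },
    },
)

alt_text = json.loads(response.choices[0].message.content)["alt_text"]
print(alt_text)
```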
Processing thousands of images? Use the Batch API to caption large datasets asynchronously and avoid rate-limit constraints.
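As a sketch, assuming the Batch API accepts OpenAI-style JSONL request files (the field names below follow that schema), you could build a batch input like this:

```python
import base64
import json

# Write one chat-completion request per image to a JSONL batch file.
with open("caption_batch.jsonl", "w") as out:
    for i, path in enumerate(["img_001.jpg", "img_002.jpg"]):
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode("utf-8")
        request = {
            "custom_id": f"image-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "your-vision-model",  # placeholder model ID
                "messages": [
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": "Write one-sentence alt text for this image."},
                            {
                                "type": "image_url",
                                "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                            },
                        ],
                    }
                ],
            },
        }
        out.write(json.dumps(request) + "\n")
```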
Best Practices
- Keep it short – Aim for 1–2 sentences (≤ 120 characters) to maximize usefulness for screen readers and SEO.
- Be objective – Describe what is visible, not what you assume (avoid gender, emotions, or unverified context).
- Add context when relevant – If the image appears on a product page, include the product name (e.g. “Red leather backpack on a white background”). A reusable system prompt encoding these guidelines is sketched after this list.
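The wording below is illustrative, not prescriptive; tune it for your domain:

```python
# An example system prompt that encodes the best practices above.
CAPTION_SYSTEM_PROMPT = (
    "Write alt text for the image in one sentence of at most 120 characters. "
    "Describe only what is visible; do not guess gender, emotions, or other "
    "unverified context. If product context is provided, include the product name."
)
```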
Limitations & Considerations
- Caption quality depends on image resolution and clarity. Images larger than 1 MB are not well supported, and the maximum number of images per request depends on the model.
- Vision models may occasionally hallucinate unseen details. Consider manual spot checks for critical use cases.
- Using images counts toward your token budget. See the Vision guide for the token calculation formula.
- Some uncommon image formats may not be supported.
For more advanced vision workflows—object detection, multi-modal Q&A, or scene classification—see the Vision Models guide.