Introduction

Image captioning is the process of generating natural language descriptions for images using Vision-Language Models. Our API makes it easy to create accessible, SEO-friendly, and contextually relevant captions for any image at scale.

With Vision models, you can generate captions by passing a base64-encoded image to the /chat/completions endpoint and asking for a description. The model analyzes the visual content and produces human-readable text that describes what it sees.

Key Benefits

  • Accessibility – Generate alt-text at scale for better web accessibility
  • SEO & Discovery – Create rich, indexable metadata for visual content
  • Content Automation – Auto-fill social preview text, product descriptions, and more
  • Zero-shot Capability – Works out-of-the-box on any domain without fine-tuning

Getting Started

You’ll need an Inference.net account and API key. See our Quick Start Guide for instructions on how to create an account and get an API key.

Install the OpenAI SDK for your language of choice and set the base URL to https://api.inference.net/v1.

import os
from openai import OpenAI

openai = OpenAI(
    base_url="https://api.inference.net/v1",
    api_key=os.getenv("INFERENCE_API_KEY"),
)
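
With the client configured, a minimal caption request just sends the image alongside a short instruction; the Quick Example below layers Structured Outputs on top of this pattern. This sketch reuses the model and example image from that example.

import base64, requests

# Download an example image and encode it as a base64 data URI
url = "https://upload.wikimedia.org/wikipedia/commons/3/3f/Crystal_Project_bug.png"
img = requests.get(url).content
data_uri = f"data:image/png;base64,{base64.b64encode(img).decode()}"

# Ask a vision model to describe the image in plain text
completion = openai.chat.completions.create(
    model="meta-llama/llama-3.2-11b-instruct/fp-16",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": data_uri}},
            ],
        },
    ],
)
print(completion.choices[0].message.content)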

Quick Example

Structured Outputs for Clean Alt-Text

Use Structured Outputs to guarantee that every response contains exactly one alt_text field. When using Structured Outputs, always instruct the model to respond in JSON format (e.g. via a system prompt), as the example below does.

import base64, requests

# Fetch the image and encode it as a base64 data URI
url = "https://upload.wikimedia.org/wikipedia/commons/3/3f/Crystal_Project_bug.png"
img = requests.get(url).content

data_uri = f"data:image/png;base64,{base64.b64encode(img).decode()}"

completion = openai.chat.completions.create(
    model="meta-llama/llama-3.2-11b-instruct/fp-16",
    messages=[
        {
            "role": "system",
            "content": "Generate an alt text caption. Respond in JSON format.",
        },
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_uri}},
            ],
        },
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "caption",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "alt_text": {"type": "string"},
                },
                "required": ["alt_text"],
                "additionalProperties": False,
            },
        },
    },
)
print(completion.choices[0].message.content)
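
Because the schema marks alt_text as required, the response content is a JSON string you can parse directly:

import json

# The schema guarantees exactly one alt_text field
caption = json.loads(completion.choices[0].message.content)
print(caption["alt_text"])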

Processing thousands of images? Use the Batch API to caption large datasets asynchronously and avoid rate-limit constraints.
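
As a rough sketch, assuming the Batch API accepts OpenAI-style JSONL request files (one /chat/completions request per line; see the Batch API guide for the exact format and upload flow), you could assemble a batch input like this. The build_batch_file helper and the captions_batch.jsonl filename are illustrative, not part of the API:

import json

def build_batch_file(image_data_uris, path="captions_batch.jsonl"):
    """Write one caption request per line for each pre-built data URI (hypothetical helper)."""
    with open(path, "w") as f:
        for i, data_uri in enumerate(image_data_uris):
            request = {
                "custom_id": f"image-{i}",   # assumption: OpenAI-style batch request lines
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "meta-llama/llama-3.2-11b-instruct/fp-16",
                    "messages": [
                        {"role": "system", "content": "Generate an alt text caption. Respond in JSON format."},
                        {"role": "user", "content": [{"type": "image_url", "image_url": {"url": data_uri}}]},
                    ],
                },
            }
            f.write(json.dumps(request) + "\n")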

Best Practices

  • Keep it short – Aim for 1–2 sentences (≤ 120 characters) to maximize usefulness for screen readers and SEO; a simple length check is sketched after this list.
  • Be objective – Describe what is visible, not what you assume (avoid gender, emotions, or unverified context).
  • Add context when relevant – If the image appears in a product page, include the product name (e.g. “Red leather backpack on a white background”).
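
One way to enforce the length guidance is a quick post-check on each generated caption; this is just a sketch, with the 120-character limit taken from the bullet above:

MAX_ALT_TEXT_CHARS = 120  # matches the guidance above

def check_alt_text(alt_text: str) -> str:
    """Reject captions that exceed the recommended length so they can be regenerated or trimmed."""
    if len(alt_text) > MAX_ALT_TEXT_CHARS:
        raise ValueError(f"Alt text is {len(alt_text)} characters; keep it under {MAX_ALT_TEXT_CHARS}.")
    return alt_text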

Limitations & Considerations

  • Caption quality depends on image resolution and clarity. Images larger than 1 MB are not well supported (one way to stay under this limit is sketched after this list), and the maximum number of images per request depends on the model.
  • Vision models may occasionally hallucinate unseen details. Consider manual spot checks for critical use cases.
  • Using images counts toward your token budget. See the Vision guide for the token calculation formula.
  • Some uncommon image formats may not be supported.
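
To stay under the size guidance, you can downscale and re-encode oversized images before building the data URI. A minimal sketch using Pillow (an assumption; any imaging library works):

import base64
import io

from PIL import Image

def image_to_data_uri(raw_bytes: bytes, max_bytes: int = 1_000_000) -> str:
    """Downscale and re-encode images above max_bytes, then return a base64 data URI."""
    mime = "image/png"  # assumption: source images are PNG; adjust for your data
    if len(raw_bytes) > max_bytes:
        img = Image.open(io.BytesIO(raw_bytes)).convert("RGB")
        img.thumbnail((1024, 1024))  # shrink in place, preserving aspect ratio
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=85)
        raw_bytes, mime = buf.getvalue(), "image/jpeg"
    return f"data:{mime};base64,{base64.b64encode(raw_bytes).decode()}"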

For more advanced vision workflows—object detection, multi-modal Q&A, or scene classification—see the Vision Models guide.