Best fit
Use vision for:
- image captioning and alt text
- multimodal Q&A
- product or document image extraction
- scene understanding and tagging
Request model
Vision requests use the normal chat-completions shape. The main difference is that the user message includes image content. On Inference.net today, the safest input pattern is:
- encode the image as a base64 data URI
- include it in the message content array
- pair it with an explicit text instruction
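The pattern above can be sketched as a payload in the OpenAI-compatible content-array style. This is a minimal sketch: the model id is a placeholder, and the tiny byte string stands in for real image data you would read from a file.

```python
import base64

# Placeholder bytes standing in for a real image; in practice:
# image_bytes = open("photo.png", "rb").read()
image_bytes = b"\x89PNG\r\n\x1a\n"

# Encode the image as a base64 data URI.
data_uri = "data:image/png;base64," + base64.b64encode(image_bytes).decode("ascii")

# Standard chat-completions shape: the user message's content array
# pairs an explicit text instruction with the image data URI.
payload = {
    "model": "example/vision-model",  # placeholder model id
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": data_uri}},
            ],
        }
    ],
}
```

Send this payload as the JSON body of a normal chat-completions request; only the content array differs from a text-only call.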
Current practical limits
- images should be sent as data URIs, not remote URLs
- supported formats include png, jpg/jpeg, gif, and webp
- keep request bodies below the documented request-size limits
- monitor token usage because image inputs consume tokens too
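A quick pre-flight check covers two of the limits above: base64 encoding inflates image data by roughly 4/3, so it is worth verifying the encoded request body before sending. The size ceiling below is a placeholder; substitute the documented request-size limit for your account.

```python
import base64
import json
import os

MAX_REQUEST_BYTES = 10 * 1024 * 1024  # placeholder; use the documented limit

def to_data_uri(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URI."""
    return f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")

# Stand-in for real image data read from disk.
image_bytes = os.urandom(300 * 1024)
uri = to_data_uri(image_bytes)

# Measure the serialized body, since that is what counts against the limit.
body = json.dumps({"messages": [{"role": "user", "content": uri}]})
within_limit = len(body.encode("utf-8")) < MAX_REQUEST_BYTES
```

If `within_limit` is false, downscale or recompress the image before encoding rather than truncating the data URI.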
Recommended paths
- use /tutorials/image-captioning for alt text and captioning workflows
- use /tutorials/video-understanding-with-cliptagger for multi-frame analysis with ClipTagger and asynchronous, large-scale processing paths
- use /api/structured-outputs when the output must match a schema