Use Vision when you want a model to reason over both text and images in the same request.

Best fit

Use Vision for:
  • image captioning and alt text
  • multimodal Q&A
  • product or document image extraction
  • scene understanding and tagging

Request model

Vision requests use the normal chat-completions shape. The main difference is that the user message includes image content. On Inference.net today, the safest input pattern is:
  • encode the image as a base64 data URI
  • include it in the message content array
  • pair it with an explicit text instruction
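The pattern above can be sketched as a small helper that builds the user message. This is a minimal sketch assuming an OpenAI-compatible content array (`type: "text"` and `type: "image_url"` parts); the model name is a placeholder, and the exact part names should be verified against the Inference.net API reference.

```python
import base64


def build_vision_message(image_bytes: bytes, mime: str, prompt: str) -> dict:
    """Pair a base64 data URI with an explicit text instruction in one user message."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    data_uri = f"data:{mime};base64,{b64}"
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": data_uri}},
        ],
    }


# The request body then follows the normal chat-completions shape:
payload = {
    "model": "your-vision-model",  # placeholder -- substitute a vision-capable model
    "messages": [
        build_vision_message(b"\x89PNG...", "image/png", "Describe this image."),
    ],
}
```

POST this payload to the chat-completions endpoint as usual; only the message content differs from a text-only request.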

Current practical limits

  • images should be sent as data URIs, not remote URLs
  • supported formats include png, jpg/jpeg, gif, and webp
  • keep request bodies below the documented request-size limits
  • monitor token usage because image inputs consume tokens too
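These limits can be enforced before sending a request. The sketch below assumes a hypothetical 10 MB request-body budget (`MAX_ENCODED_BYTES` is a placeholder; check the documented limit) and the format list from above; note that base64 inflates the payload, so the check runs against the encoded size, not the raw file size.

```python
import base64

# Formats listed in the limits above; the byte budget is a hypothetical placeholder
SUPPORTED_MIME = {"image/png", "image/jpeg", "image/gif", "image/webp"}
MAX_ENCODED_BYTES = 10 * 1024 * 1024  # assumed 10 MB budget -- verify the real limit


def to_data_uri(raw: bytes, mime: str) -> str:
    """Validate format and size, then encode image bytes as a base64 data URI."""
    if mime not in SUPPORTED_MIME:
        raise ValueError(f"unsupported image format: {mime}")
    uri = f"data:{mime};base64," + base64.b64encode(raw).decode("ascii")
    # base64 grows the payload by roughly 33%, so budget against the encoded size
    if len(uri) > MAX_ENCODED_BYTES:
        raise ValueError("encoded image exceeds the request-size budget")
    return uri
```

Rejecting oversized or unsupported images client-side gives a clearer error than a failed API call, and keeps request bodies within the documented limits.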