Best fit
Use vision for:
- image captioning and alt text
- multimodal Q&A
- product or document image extraction
- scene understanding and tagging
Request model
Vision requests use the normal chat-completions shape. The main difference is that the user message includes image content. On Inference.net today, the safest input pattern is:
- encode the image as a base64 data URI
- include it in the message content array
- pair it with an explicit text instruction
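The pattern above can be sketched as a payload in the OpenAI-compatible content-array style. This is a minimal sketch: the model id is a placeholder, and the tiny byte string stands in for real image data you would read from a file.

```python
import base64

# Placeholder bytes standing in for a real image; in practice:
# image_bytes = open("photo.png", "rb").read()
image_bytes = b"\x89PNG\r\n\x1a\n"

# Encode the image as a base64 data URI.
data_uri = "data:image/png;base64," + base64.b64encode(image_bytes).decode("ascii")

# Standard chat-completions shape: the user message's content array
# pairs an explicit text instruction with the image data URI.
payload = {
    "model": "example/vision-model",  # placeholder model id
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": data_uri}},
            ],
        }
    ],
}
```

Send this payload as the JSON body of a normal chat-completions request; only the content array differs from a text-only call.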
Current practical limits
- images should be sent as data URIs, not remote URLs
- supported formats include png, jpg/jpeg, gif, and webp
- keep request bodies below the documented request-size limits
- monitor token usage because image inputs consume tokens too
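A quick pre-flight check covers two of the limits above: base64 encoding inflates image data by roughly 4/3, so it is worth verifying the encoded request body before sending. The size ceiling below is a placeholder; substitute the documented request-size limit for your account.

```python
import base64
import json
import os

MAX_REQUEST_BYTES = 10 * 1024 * 1024  # placeholder; use the documented limit

def to_data_uri(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URI."""
    return f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")

# Stand-in for real image data read from disk.
image_bytes = os.urandom(300 * 1024)
uri = to_data_uri(image_bytes)

# Measure the serialized body, since that is what counts against the limit.
body = json.dumps({"messages": [{"role": "user", "content": uri}]})
within_limit = len(body.encode("utf-8")) < MAX_REQUEST_BYTES
```

If `within_limit` is false, downscale or recompress the image before encoding rather than truncating the data URI.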
Recommended paths
- use /tutorials/image-captioning for alt text and captioning workflows
- use /tutorials/video-understanding-with-cliptagger for multi-frame analysis with ClipTagger and asynchronous, large-scale processing paths
- use /api/structured-outputs when the output must match a schema