# Vision

> Use models to extract information from images.

## Introduction

**Vision Models** are *multi-modal models* that accept both text and images as input.
You can use vision models to extract information from images (for example, by asking the model to describe the image).
This guide explains how to use Vision Models with the Inference API.

## Getting Started

You'll need an Inference.net account and API key. See our [Quick Start Guide](/api/api-quickstart) for instructions on how to create an account and get an API key.

Install the [OpenAI SDK](https://platform.openai.com/docs/libraries) for your language of choice.
To connect to Inference.net using the OpenAI SDK, you will need to set the base URL to `https://api.inference.net/v1`.
In the following examples, we are reading the API key from the environment variable `INFERENCE_API_KEY`.

## Step By Step Example

To use image inputs with the Inference API:

1. Encode your image as a base64 string
2. Include the base64 string in a [Data URI](https://developer.mozilla.org/en-US/docs/Web/URI/Reference/Schemes/data) with an image mimetype (e.g. `image/png`)
3. Include the Data URI in the `content` array of a user message
4. Send the request to the Inference API and inspect the response

### Step 1: Encode your image as a Data URI

<CodeGroup>
  <Metadata text="vision/encode_data_uri[series=step_by_step]" />

  ```typescript TypeScript theme={"system"}
  const url = "https://upload.wikimedia.org/wikipedia/commons/3/3f/Crystal_Project_bug.png";

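  const response = await fetch(url);
  if (!response.ok) {
    // Fail early instead of encoding an error page as an image
    throw new Error(`Failed to fetch image: ${response.status}`);
  }
  const buffer = Buffer.from(await response.arrayBuffer());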
  const base64 = buffer.toString("base64");
  const dataUri = `data:image/png;base64,${base64}`;
  ```

  <Metadata text="vision/encode_data_uri[series=step_by_step]" />

  ```python Python theme={"system"}
  import base64
  import requests

  url = "https://upload.wikimedia.org/wikipedia/commons/3/3f/Crystal_Project_bug.png"

  response = requests.get(url)
  response.raise_for_status()  # Fail early instead of encoding an error page
  image_data = response.content

  encoded_string = base64.b64encode(image_data).decode("utf-8")

  data_uri = f"data:image/png;base64,{encoded_string}"
  ```

  <Metadata text="vision/encode_data_uri[series=step_by_step]" />

  ```bash cURL theme={"system"}
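  # Encode an image file to a base64 data URI.
  # `tr -d '\n'` strips the line wraps that GNU base64 adds by default,
  # which would otherwise break the JSON request body.
  DATA_URI="data:image/png;base64,$(base64 < image.png | tr -d '\n')"
  # Or fetch and encode from a URL:
  DATA_URI="data:image/png;base64,$(curl -s https://upload.wikimedia.org/wikipedia/commons/3/3f/Crystal_Project_bug.png | base64 | tr -d '\n')"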
  ```
</CodeGroup>
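
The examples above hard-code `image/png`. When encoding local files in other supported formats, the MIME type in the Data URI should match the file. Here is a minimal sketch, assuming images are read from disk. The helper name `file_to_data_uri` and the extension table are illustrative and not part of the Inference API:

```python
import base64
from pathlib import Path

# Extension-to-MIME table for the formats named under Limitations.
# This mapping is an illustrative assumption, not an official list.
MIME_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def file_to_data_uri(path: str) -> str:
    """Read a local image file and return it as a base64 Data URI."""
    file_path = Path(path)
    suffix = file_path.suffix.lower()
    if suffix not in MIME_TYPES:
        raise ValueError(f"Unsupported image extension: {suffix!r}")
    encoded = base64.b64encode(file_path.read_bytes()).decode("ascii")
    return f"data:{MIME_TYPES[suffix]};base64,{encoded}"
```

The returned string can be used wherever the examples on this page use `data_uri`.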

### Step 2: Structure and send your request

<CodeGroup>
  <Metadata text="vision/structure_request[series=step_by_step]" />

  ```typescript TypeScript theme={"system"}
  import OpenAI from "openai";

  const client = new OpenAI({
    baseURL: "https://api.inference.net/v1",
    apiKey: process.env.INFERENCE_API_KEY,
  });

  const completion = await client.chat.completions.create({
    model: "google/gemma-3-27b-instruct/bf-16",
    messages: [
      {
        role: "system",
        content: "You are a helpful assistant that can answer questions about the image.",
      },
      {
        role: "user",
        content: [
          {
            type: "image_url",
            image_url: { url: dataUri },
          },
          {
            type: "text",
            text: "What is in this image?",
          },
        ],
      },
    ],
  });

  console.log(completion.choices[0].message.content);
  ```

  <Metadata text="vision/structure_request[series=step_by_step]" />

  ```python Python theme={"system"}
  import os
  from openai import OpenAI

  client = OpenAI(
      base_url="https://api.inference.net/v1",
      api_key=os.environ["INFERENCE_API_KEY"],
  )

  completion = client.chat.completions.create(
      model="google/gemma-3-27b-instruct/bf-16",
      messages=[
          {
              "role": "system",
              "content": "You are a helpful assistant that can answer questions about the image.",
          },
          {
              "role": "user",
              "content": [
                  {
                      "type": "image_url",
                      "image_url": {"url": data_uri},
                  },
                  {
                      "type": "text",
                      "text": "What is in this image?",
                  },
              ],
          },
      ],
  )

  print(completion.choices[0].message.content)
  ```

  <Metadata text="vision/structure_request[series=step_by_step]" />

  ```bash cURL theme={"system"}
  curl https://api.inference.net/v1/chat/completions \
    -H "Authorization: Bearer $INFERENCE_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "google/gemma-3-27b-instruct/bf-16",
      "messages": [
        {
          "role": "system",
          "content": "You are a helpful assistant that can answer questions about the image."
        },
        {
          "role": "user",
          "content": [
            {
              "type": "image_url",
              "image_url": { "url": "'"$DATA_URI"'" }
            },
            {
              "type": "text",
              "text": "What is in this image?"
            }
          ]
        }
      ]
    }'
  ```
</CodeGroup>
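
When a request carries more than one image, it can help to build the `messages` list with a small helper. The sketch below is one possible shape, not an official utility. `build_vision_messages` is a hypothetical name, and the image-first ordering simply mirrors the examples above:

```python
def build_vision_messages(prompt: str, data_uris: list[str]) -> list[dict]:
    """Build a chat `messages` list with the images first, then the text prompt."""
    # One `image_url` content part per Data URI, in the order given.
    content = [
        {"type": "image_url", "image_url": {"url": uri}} for uri in data_uris
    ]
    # The text question goes last, as in the examples above.
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]
```

The result can be passed as the `messages` argument of `client.chat.completions.create(...)`.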

## Limitations

* Remote image URLs are not supported in the request body. Images must be embedded as base64 Data URIs.
* Supported image formats include webp, png, gif and jpg/jpeg.
* The total size of the request body must be less than 1 MB.
* Each request can contain a maximum of 2 images.
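
These limits can be checked on the client before a request is sent. The following is a rough sketch rather than an official validator. `check_request` is a hypothetical helper, "1 MB" is assumed to mean 1,000,000 bytes (the stricter reading), and the size is estimated from `json.dumps`, which may differ slightly from the bytes an SDK actually sends:

```python
import json

MAX_IMAGES = 2
# Assumption: "1 MB" is treated as 1,000,000 bytes, the stricter reading.
MAX_BODY_BYTES = 1_000_000
SUPPORTED_MIME_TYPES = {"image/webp", "image/png", "image/gif", "image/jpeg"}


def check_request(payload: dict) -> list[str]:
    """Return a list of problems found in a chat completions payload."""
    problems = []

    # Collect the URL of every `image_url` content part.
    image_urls = []
    for message in payload.get("messages", []):
        content = message.get("content")
        if isinstance(content, list):
            for part in content:
                if part.get("type") == "image_url":
                    image_urls.append(part["image_url"]["url"])

    if len(image_urls) > MAX_IMAGES:
        problems.append(f"{len(image_urls)} images, maximum is {MAX_IMAGES}")

    for url in image_urls:
        if not url.startswith("data:"):
            problems.append("image is not a Data URI")
            continue
        # The MIME type sits between "data:" and the first ";".
        mime_type = url[len("data:"):].split(";", 1)[0]
        if mime_type not in SUPPORTED_MIME_TYPES:
            problems.append(f"unsupported image type: {mime_type}")

    # Estimate the serialized body size in bytes.
    body_size = len(json.dumps(payload).encode("utf-8"))
    if body_size >= MAX_BODY_BYTES:
        problems.append(f"body is {body_size} bytes, limit is {MAX_BODY_BYTES}")
    return problems
```

Base64 encoding makes an image roughly one third larger than the original file, so the raw image bytes need to stay well under the body limit.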

## Token Usage

Images in a request count toward its total token usage.
The exact count varies, but the following formula gives a reasonable approximation of the tokens used by a single image:

```
h = min(2, max(1, HEIGHT / 560))
w = min(2, max(1, WIDTH / 560))
tokens = h * w * 1601
```

In plain English:

1. The image height and width in pixels are both divided by 560
2. The resulting height and width are clamped between 1 and 2
3. Finally, the height and width are multiplied together and then multiplied by 1,601
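
The three steps above can be sketched as a small function. This is an illustration of the approximation only. `estimate_image_tokens` is a hypothetical name, and rounding to a whole number is an added assumption:

```python
def estimate_image_tokens(height_px: int, width_px: int) -> int:
    """Approximate the tokens used by one image of the given pixel size."""
    # Divide each dimension by 560, then clamp the result to the range 1..2.
    h = min(2, max(1, height_px / 560))
    w = min(2, max(1, width_px / 560))
    # Multiply the clamped factors together, then by 1,601.
    return round(h * w * 1601)


print(estimate_image_tokens(32, 32))      # 1601
print(estimate_image_tokens(1120, 1120))  # 6404
```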

Here is a table of image dimensions and their corresponding estimated token counts:

| Height | Width  | Tokens | Note                                                           |
| ------ | ------ | ------ | -------------------------------------------------------------- |
| 32px   | 32px   | 1,601  | Images smaller than 560x560 are still considered 560x560       |
| 560px  | 560px  | 1,601  |                                                                |
| 1120px | 1120px | 6,404  | 6,404 is the approximate maximum token usage of a single image |

The above formula and table are approximations. We suggest that you:

* Explicitly check your image dimensions before submitting them to the API to avoid high token usage.
* Monitor your token usage and adjust your requests if necessary.
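
One dependency-free way to check dimensions is to read them from the file header. The sketch below handles PNG only and relies on the PNG layout, in which the `IHDR` chunk directly follows the 8-byte signature. `png_dimensions` is an illustrative name, and other formats would need their own parsing or an imaging library:

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data: bytes) -> tuple[int, int]:
    """Return (width, height) in pixels from the bytes of a PNG file."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("Not a valid PNG header")
    # Width and height are big-endian unsigned 32-bit integers at bytes 16-23.
    width, height = struct.unpack(">II", data[16:24])
    return width, height
```

An image wider or taller than 560px is estimated to use more than the 1,601-token minimum, so resizing before encoding can reduce usage.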

See the [Models](https://inference.net/models) page for current pricing per token for Vision Models.
