Introduction

Vision Models are multi-modal models that accept both text and images as input. You can use vision models to extract information from images (for example, by asking the model to describe the image). This guide explains how to use Vision Models with the Inference API.

Currently, we support the following vision models:

  • meta-llama/llama-3.2-11b-instruct/fp-16

Getting Started

You’ll need an Inference Cloud account and API key. See our Quick Start Guide for instructions on how to create an account and get an API key.

Install the OpenAI SDK for your language of choice. To connect to Inference Cloud with the OpenAI SDK, set the base URL to https://api.inference.net/v1. The following examples read the API key from the environment variable INFERENCE_API_KEY.

Step By Step Example

To use image inputs with the Inference API:

  1. Encode your image as a base64 string
  2. Include the base64 string in a Data URI with an image MIME type (e.g. image/png)
  3. Include the Data URI in the content array of a user message
  4. Send the request to the Inference API and inspect the response

Step 1: Encode your image as a Data URI

import base64
import requests

url = "https://upload.wikimedia.org/wikipedia/commons/3/3f/Crystal_Project_bug.png"

# Download the raw image bytes
response = requests.get(url)
response.raise_for_status()
image_data = response.content

# Base64-encode the bytes and wrap them in a Data URI with the matching image MIME type
encoded_string = base64.b64encode(image_data).decode("utf-8")
data_uri = f"data:image/png;base64,{encoded_string}"
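
If your image is a local file rather than a remote URL, the same encoding applies. Here is a minimal sketch (the filename bug.png is just a placeholder):

import base64

# Read the raw bytes of a local image file (the filename here is only an example)
with open("bug.png", "rb") as f:
    image_data = f.read()

# Base64-encode the bytes and wrap them in a Data URI with the matching MIME type
encoded_string = base64.b64encode(image_data).decode("utf-8")
data_uri = f"data:image/png;base64,{encoded_string}"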

Step 2: Structure and send your request


import os
from openai import OpenAI

# Configure the OpenAI client to point at the Inference API
client = OpenAI(
    base_url="https://api.inference.net/v1",
    api_key=os.getenv("INFERENCE_API_KEY"),
)

# The user message pairs the image (as the Data URI from Step 1) with a text prompt
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant that can answer questions about the image."
    },
    {
        "role": "user", 
        "content": [
            {
                "type": "image_url",
                "image_url": { "url": data_uri }
            },
            {
                "type": "text",
                "text": "What is in this image?"
            },
        ]
    }
]

# Send the request to a vision-capable model
completion = client.chat.completions.create(
    model="meta-llama/llama-3.2-11b-instruct/fp-16",
    messages=messages,
)

# Print the model's description of the image
print(completion.choices[0].message.content)

Limitations

  • Sending an image by URL directly in the request body is not supported; images must be base64-encoded and included as Data URIs.
  • Supported image formats include webp, png, gif, and jpg/jpeg.
  • The total size of the request body must be less than 1MB.
  • Each request can contain a maximum of 2 images.
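
A lightweight pre-flight check against these limits can be run before sending a request. The sketch below is our own illustration based on the limits listed above (the helper name is ours); it reuses the data_uri built in Step 1:

# Rough pre-flight validation based on the limits above (helper name is illustrative)
ALLOWED_MIME_TYPES = {"image/webp", "image/png", "image/gif", "image/jpeg"}
MAX_BODY_BYTES = 1_000_000  # the request body must stay under 1MB
MAX_IMAGES = 2

def check_data_uris(data_uris):
    if len(data_uris) > MAX_IMAGES:
        raise ValueError(f"At most {MAX_IMAGES} images are allowed per request")
    for uri in data_uris:
        # A Data URI looks like "data:image/png;base64,...."
        mime_type = uri.split(";", 1)[0].removeprefix("data:")
        if mime_type not in ALLOWED_MIME_TYPES:
            raise ValueError(f"Unsupported image format: {mime_type}")
    # The Data URIs dominate the request size; leave headroom for the rest of the body
    if sum(len(uri) for uri in data_uris) > MAX_BODY_BYTES:
        raise ValueError("Combined image payload exceeds the 1MB request body limit")

check_data_uris([data_uri])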

Token Usage

Using images in a request counts towards the total token usage for a request. The exact token count will vary, but a handy approximation of the number of tokens used by an image is the following formula:

h = min(2, max(1, HEIGHT / 560))
w = min(2, max(1, WIDTH / 560))
tokens = h * w * 1601

In plain English:

  1. The image height and width in pixels are both divided by 560
  2. The resulting height and width are clamped between 1 and 2
  3. Finally, the height and width are multiplied together and then multiplied by 1,601

Here is a table of image dimensions and their corresponding estimated token counts:

Height    Width     Tokens    Note
32px      32px      1,601     Images smaller than 560x560 are still considered 560x560
560px     560px     1,601
1120px    1120px    6,404     6,404 is the approximate maximum token usage of a single image
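
This approximation can also be written as a small helper function. The sketch below (the function name is ours) simply restates the formula:

def estimate_image_tokens(width_px: int, height_px: int) -> int:
    # Divide each dimension by 560, then clamp the result between 1 and 2
    h = min(2, max(1, height_px / 560))
    w = min(2, max(1, width_px / 560))
    # Multiply the clamped factors together, then by 1,601
    return round(h * w * 1601)

print(estimate_image_tokens(32, 32))      # 1601
print(estimate_image_tokens(560, 560))    # 1601
print(estimate_image_tokens(1120, 1120))  # 6404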

The formula and table above are approximations. We suggest that you:

  • Explicitly check your image dimensions before submitting images to the API to avoid unexpectedly high token usage (a sketch follows this list).
  • Monitor your token usage and adjust your requests if necessary.
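
For the dimension check, a quick sketch using Pillow (our choice of library here; the filename is just an example) reads an image's size and flags anything that will incur the maximum token cost:

from PIL import Image

# Read the image dimensions before encoding and uploading it.
# Per the table above, anything at or beyond 1120x1120 costs the maximum ~6,404 tokens.
with Image.open("bug.png") as img:
    width, height = img.size

if width >= 1120 or height >= 1120:
    print("This image will use the maximum ~6,404 tokens; consider downscaling before encoding.")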

See the Models page for current pricing per token for Vision Models.