Vision
Use models to extract information from images.
Introduction
Vision Models are multi-modal models that accept both text and images as input. You can use vision models to extract information from images (for example, by asking the model to describe the image). This guide explains how to use Vision Models with the Inference API.
Currently, we support the following vision models:
- meta-llama/llama-3.2-11b-instruct/fp-16
Getting Started
You’ll need an Inference Cloud account and API key. See our Quick Start Guide for instructions on how to create an account and get an API key.
Install the OpenAI SDK for your language of choice.
To connect to Inference Cloud using the OpenAI SDK, you will need to set the base URL to https://api.inference.net/v1.
In the following examples, we are reading the API key from the environment variable INFERENCE_API_KEY.
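For example, a minimal client setup with the Python SDK might look like the following sketch (assuming the `openai` package is installed and `INFERENCE_API_KEY` is set in your environment):

```python
import os

from openai import OpenAI

# Point the OpenAI SDK at Inference Cloud and read the API key from the environment.
client = OpenAI(
    base_url="https://api.inference.net/v1",
    api_key=os.environ["INFERENCE_API_KEY"],
)
```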
Step By Step Example
To use image inputs with the Inference API:
- Encode your image as a base64 string
- Include the base64 string in a Data URI with an image mimetype (e.g. `image/png`)
- Include the Data URI in the `content` array of a user message
- Send the request to the Inference API and inspect the response
Step 1: Encode your image as a Data URI
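For example, in Python (a sketch assuming a local file named `image.png`; the filename and mimetype are illustrative):

```python
import base64

# Read the image and encode its raw bytes as a base64 string.
with open("image.png", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

# Wrap the base64 string in a Data URI with the matching image mimetype.
data_uri = f"data:image/png;base64,{b64_image}"
```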
Step 2: Structure and send your request
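Here is a sketch of the request, reusing the `client` and `data_uri` from the previous steps. The content structure below follows the OpenAI SDK's `image_url` convention, which this guide assumes applies to the Inference API as well:

```python
# Send a user message whose content array contains both a text prompt and the image Data URI.
response = client.chat.completions.create(
    model="meta-llama/llama-3.2-11b-instruct/fp-16",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": data_uri}},
            ],
        }
    ],
)

# Inspect the model's description of the image.
print(response.choices[0].message.content)
```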
Limitations
- We do not support passing image URLs directly in the request body; images must be base64-encoded as described above.
- Supported image formats include webp, png, gif and jpg/jpeg.
- The total size of the request body must be less than 1MB.
- Each request can contain a maximum of 2 images.
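If you want to guard against these limits client-side, a rough pre-flight check might look like the sketch below (the constants and function name are illustrative, not part of the SDK):

```python
from pathlib import Path

# Illustrative pre-flight checks mirroring the limits above.
SUPPORTED_EXTENSIONS = {".webp", ".png", ".gif", ".jpg", ".jpeg"}
MAX_REQUEST_BYTES = 1_000_000   # the total request body must stay under 1MB
MAX_IMAGES_PER_REQUEST = 2

def check_images(paths: list[str]) -> None:
    if len(paths) > MAX_IMAGES_PER_REQUEST:
        raise ValueError("A request can contain at most 2 images.")
    for path in paths:
        if Path(path).suffix.lower() not in SUPPORTED_EXTENSIONS:
            raise ValueError(f"Unsupported image format: {path}")
    # Base64 encoding inflates data by roughly 4/3, so account for that here.
    estimated_encoded_bytes = sum(Path(p).stat().st_size * 4 // 3 for p in paths)
    if estimated_encoded_bytes >= MAX_REQUEST_BYTES:
        raise ValueError("Encoded images would likely exceed the 1MB request body limit.")
```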
Token Usage
Images included in a request count toward the request's total token usage. The exact token count will vary, but a handy approximation of the number of tokens used by a single image is the following formula:
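`tokens ≈ clamp(height / 560, 1, 2) × clamp(width / 560, 1, 2) × 1,601`, where height and width are in pixels.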
In plain English:
- The image height and width in pixels are both divided by 560
- The resulting height and width are clamped between 1 and 2
- Finally, the height and width are multiplied together and then multiplied by 1,601
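As a sketch, here is one way to compute this approximation in Python (the helper name is illustrative, not part of the SDK):

```python
def estimate_image_tokens(width_px: int, height_px: int) -> int:
    """Approximate the token usage of a single image using the formula above."""
    def clamp(value: float, low: float, high: float) -> float:
        return max(low, min(high, value))

    width_factor = clamp(width_px / 560, 1, 2)
    height_factor = clamp(height_px / 560, 1, 2)
    return round(width_factor * height_factor * 1_601)

# A 1120x1120 image: 2 * 2 * 1,601 = 6,404 estimated tokens.
print(estimate_image_tokens(1120, 1120))  # 6404
```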
Here is a table of image dimensions and their corresponding estimated token counts:
| Height | Width | Tokens | Note |
|---|---|---|---|
| 32px | 32px | 1,601 | Images smaller than 560x560 are still considered 560x560 |
| 560px | 560px | 1,601 | |
| 1120px | 1120px | 6,404 | 6,404 is the approximate maximum token usage of a single image |
The formula and table above are approximations. We suggest that you:
- Explicitly check your image dimensions before submitting them to the API to avoid high token usage (see the sketch after this list).
- Monitor your token usage and adjust your requests if necessary.
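For example, to check dimensions before encoding, here is a sketch using the Pillow library (an assumption; any image library works) together with the `estimate_image_tokens` helper sketched above:

```python
from PIL import Image

# Read the image dimensions before encoding and sending the request.
with Image.open("image.png") as img:
    width_px, height_px = img.size

print(f"{width_px}x{height_px} -> ~{estimate_image_tokens(width_px, height_px)} tokens")
```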
See the Models page for current pricing per token for Vision Models.