Introduction

Programmatic video understanding enables developers to build applications that extract structured information from video content at scale, transforming raw frames into searchable, actionable data. By processing video frames through specialized vision language models, developers can build systems that understand visual content as deeply as they understand text. ClipTagger-12b is a custom 12-billion parameter vision language model (VLM) designed for video understanding at massive scale. The model was trained by the Inference.net research team in collaboration with Grass to meet their trillion-scale video frame captioning needs. On our benchmarks, ClipTagger-12b matches Gemini-2.5-pro at captioning quality while being 10x cheaper and 3x faster. The model enables teams operating at scale to perform temporal video understanding, content indexing, and automated video analysis workflows at state-of-the-art model quality while maintaining control over their data and costs.

Key Features

  • Frontier-quality – Performance comparable to top closed models like Gemini-2.5-pro.
  • Cost-Efficient – ~95% lower costs compared to Gemini-2.5-pro.
  • Production-Ready – Battle-tested on trillion-scale video frame captioning workloads.
  • Temporal Consistency – Maintains semantic consistency across video frames.
  • Open Source – The model is open source and available on Hugging Face.

Processing at Scale

For large-scale workloads, we recommend using our asynchronous APIs, such as the Group API described in Basic Usage below, rather than sending one synchronous request per frame.

Getting Started

You’ll need an Inference.net account and API key. See our Quick Start Guide for setup instructions. Install the OpenAI SDK and set the base URL to https://api.inference.net/v1.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inference.net/v1",
    api_key=os.environ.get("INFERENCE_API_KEY"),
)

Basic Usage

The model processes a single video frame per request, so to analyze a video you send one request per frame you want to process. Processing every frame is not always necessary: depending on your use case, you may want to process a subset of frames, such as keyframes or frames with high motion. You can use our Group API to submit multiple frames at once and receive a response once all frames have been processed, which is often ideal for processing 10-20 frames per video. The model requires a specific system prompt and user prompt to generate structured output. Here’s a complete example:
import base64
import requests
import json
from typing import List
from pydantic import BaseModel, conlist

# Define the response model using Pydantic
class ClipTaggerResponse(BaseModel):
    description: str
    objects: conlist(str, max_length=10)
    actions: conlist(str, max_length=5)
    environment: str
    content_type: str
    specific_style: str
    production_quality: str
    summary: str
    logos: List[str]

# System and user prompts (use exactly as shown for best results)
SYSTEM_PROMPT = "You are an image annotation API trained to analyze YouTube video keyframes. You will be given instructions on the output format, what to caption, and how to perform your job. Follow those instructions. For descriptions and summaries, provide them directly and do not lead them with 'This image shows' or 'This keyframe displays...', just get right into the details."

USER_PROMPT = """
You are an image annotation API trained to analyze YouTube video keyframes. You must respond with a valid JSON object matching the exact structure below.

Your job is to extract detailed **factual elements directly visible** in the image. Do not speculate or interpret artistic intent, camera focus, or composition. Do not include phrases like "this appears to be", "this looks like", or anything about the image itself. Describe what **is physically present in the frame**, and nothing more.

Return JSON in this structure:

{
    "description": "A detailed, factual account of what is visibly happening (4 sentences max). Only mention concrete elements or actions that are clearly shown. Do not include anything about how the image is styled, shot, or composed. Do not lead the description with something like 'This image shows' or 'this keyframe is...', just get right into the details.",
    "objects": ["object1 with relevant visual details", "object2 with relevant visual details", ...],
    "actions": ["action1 with participants and context", "action2 with participants and context", ...],
    "environment": "Detailed factual description of the setting and atmosphere based on visible cues (e.g., interior of a classroom with fluorescent lighting, or outdoor forest path with snow-covered trees).",
    "content_type": "The type of content it is, e.g. 'real-world footage', 'video game', 'animation', 'cartoon', 'CGI', 'VTuber', etc.",
    "specific_style": "Specific genre, aesthetic, or platform style (e.e., anime, 3D animation, mobile gameplay, vlog, tutorial, news broadcast, etc.)",
    "production_quality": "Visible production level: e.g., 'professional studio', 'amateur handheld', 'webcam recording', 'TV broadcast', etc.",
    "summary": "One clear, comprehensive sentence summarizing the visual content of the frame. Like the description, get right to the point.",
    "logos": ["logo1 with visual description", "logo2 with visual description", ...]
}

Rules:
- Be specific and literal. Focus on what is explicitly visible.
- Do NOT include interpretations of emotion, mood, or narrative unless it's visually explicit.
- No artistic or cinematic analysis.
- Always include the language of any text in the image if present as an object, e.g. "English text", "Japanese text", "Russian text", etc.
- Maximum 10 objects and 5 actions.
- Return an empty array for 'logos' if none are present.
- Always output strictly valid JSON with proper escaping.
- Output **only the JSON**, no extra text or explanation.
"""

# Function to encode image from URL
def encode_image_url(image_url):
    response = requests.get(image_url)
    response.raise_for_status()  # Fail fast if the image could not be downloaded
    return base64.b64encode(response.content).decode('utf-8')

# Example usage
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
base64_image = encode_image_url(image_url)

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": USER_PROMPT},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{base64_image}",
                    "detail": "high"
                },
            },
        ],
    },
]

response = client.chat.completions.create(
    model="inference-net/cliptagger-12b",
    messages=messages,
    temperature=0.1,
    max_tokens=2000,
    response_format={"type": "json_object"},
)

# Parse and validate the JSON response
raw_result = json.loads(response.choices[0].message.content)
result = ClipTaggerResponse(**raw_result)  # This will raise ValidationError if the response doesn't match the schema

# Now 'result' is a typed Pydantic model instance
print(result.model_dump_json(indent=2))

# You can access typed properties
print(f"Description: {result.description}")
print(f"Objects found: {len(result.objects)}")

Output

The output below was generated for the boardwalk image used in the example above. The model reliably returns this exact JSON structure for every frame, with all fields always present. Even when certain elements aren’t detected (like the actions and logos here), the model returns empty arrays rather than omitting fields.
{
  "description": "A wooden boardwalk path extends from the foreground into the distance, cutting through a field of tall, vibrant green grass. The path is flanked on both sides by the dense grass. In the background, a line of trees is visible on the horizon under a blue sky with scattered white clouds.",
  "objects": [
    "Wooden boardwalk",
    "Tall green grass",
    "Blue sky",
    "White clouds",
    "Trees"
  ],
  "actions": [],
  "environment": "An outdoor, natural landscape, likely a marsh or wetland, on a clear day. The scene is characterized by a wooden boardwalk, lush green vegetation, and a bright blue sky with wispy clouds.",
  "content_type": "real-world footage",
  "specific_style": "landscape photography",
  "production_quality": "professional photography",
  "summary": "A wooden boardwalk path winds through a lush green field under a bright blue sky with scattered clouds.",
  "logos": []
}
The model’s response schema is fixed and cannot be changed.
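Because the schema is fixed, a validation failure usually indicates a truncated or malformed response rather than a structural change. The sketch below shows one way to handle that case; it assumes the client, messages, and ClipTaggerResponse from the example above, and the retry count is arbitrary.
from pydantic import ValidationError

def annotate_with_retry(messages, retries=2):
    # Retry on truncated or malformed JSON; the schema itself never changes.
    last_error = None
    for _ in range(retries + 1):
        response = client.chat.completions.create(
            model="inference-net/cliptagger-12b",
            messages=messages,
            temperature=0.1,
            max_tokens=2000,
            response_format={"type": "json_object"},
        )
        try:
            return ClipTaggerResponse(**json.loads(response.choices[0].message.content))
        except (json.JSONDecodeError, ValidationError) as error:
            last_error = error
    raise last_error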

Keyframe Selection

There are several ways to select which frames of a video to send to the model.

1. Frame-by-frame analysis

The simplest approach is to process every frame of the video. This gives the most complete coverage, but it is also the most expensive option. As noted in Basic Usage, many workloads only need a sampled subset of frames.
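If you are working from a local video file, the sketch below shows one way to sample frames at a fixed interval using OpenCV; OpenCV is our assumption here, and any frame-extraction tool (such as ffmpeg) works just as well. A small interval approaches frame-by-frame analysis, while a larger one yields the 10-20 frames per video that pair well with the Group API.
import base64
import cv2  # pip install opencv-python

def extract_frames(video_path, every_n_seconds=2.0):
    """Return a list of (timestamp_seconds, base64_jpeg) pairs sampled at a fixed interval."""
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
    step = max(1, int(round(fps * every_n_seconds)))
    frames = []
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break  # end of video
        if index % step == 0:
            encoded, buffer = cv2.imencode(".jpg", frame)
            if encoded:
                frames.append((index / fps, base64.b64encode(buffer.tobytes()).decode("utf-8")))
        index += 1
    capture.release()
    return frames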

Example Use Cases

Video Search & Discovery - Build searchable video databases by indexing structured metadata and tracking object persistence across frames. Content Moderation - Automated content analysis with consistent categorization for content type detection and quality verification. Accessibility - Generate consistent alt-text and scene summaries for video content with frame-by-frame descriptions. Ad Verification - Ensure sponsored content compliance by tracking product visibility and logo appearances.

Best Practices

  1. Use the exact prompts – The provided system and user prompts are optimized for best results
  2. Set low temperature – Use temperature: 0.1 for consistent output
  3. Enable JSON mode – Always set response_format: {"type": "json_object"}
  4. Process frames systematically – Maintain temporal order for better analysis
  5. Batch similar content – Group frames from the same video or scene (see the sketch below)
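As a sketch of points 4 and 5, the example below submits all sampled frames from one video together and keeps the results in temporal order. It assumes the SYSTEM_PROMPT, USER_PROMPT, and ClipTaggerResponse definitions from Basic Usage plus the extract_frames helper sketched under Keyframe Selection, and it uses the OpenAI SDK’s async client purely for client-side concurrency.
import asyncio
import json
import os
from openai import AsyncOpenAI

async_client = AsyncOpenAI(
    base_url="https://api.inference.net/v1",
    api_key=os.environ.get("INFERENCE_API_KEY"),
)

async def annotate_frame(base64_image):
    # Same prompts, temperature, and JSON mode as the Basic Usage example.
    response = await async_client.chat.completions.create(
        model="inference-net/cliptagger-12b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": USER_PROMPT},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}", "detail": "high"}},
                ],
            },
        ],
        temperature=0.1,
        max_tokens=2000,
        response_format={"type": "json_object"},
    )
    return ClipTaggerResponse(**json.loads(response.choices[0].message.content))

async def annotate_video(video_path):
    # asyncio.gather preserves input order, so the annotations stay in temporal order.
    frames = extract_frames(video_path)  # (timestamp, base64_jpeg) pairs
    annotations = await asyncio.gather(*(annotate_frame(b64) for _, b64 in frames))
    return [(timestamp, annotation) for (timestamp, _), annotation in zip(frames, annotations)]

# results = asyncio.run(annotate_video("video.mp4"))
For workloads much larger than this, prefer the asynchronous APIs mentioned in Processing at Scale over client-side fan-out.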

Limitations

  • Single video frame per request only. Use the Group API to process multiple frames at once.
  • Maximum image size: 1MB (see the resizing sketch below)
  • Supported formats: JPEG, PNG, WebP, GIF
  • English-only descriptions (though it can identify text in other languages)
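To stay under the 1MB limit, frames can be downscaled or re-encoded before they are sent. The sketch below is one way to do that with Pillow; Pillow itself, the target dimensions, and the JPEG quality are our assumptions, not requirements of the API.
import io
from PIL import Image  # pip install Pillow

MAX_IMAGE_BYTES = 1_000_000  # 1MB limit

def ensure_under_limit(jpeg_bytes, max_side=1280, quality=85):
    # Re-encode only when the frame already exceeds the limit.
    if len(jpeg_bytes) <= MAX_IMAGE_BYTES:
        return jpeg_bytes
    image = Image.open(io.BytesIO(jpeg_bytes))
    image.thumbnail((max_side, max_side))  # downscale in place, preserving aspect ratio
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG", quality=quality)
    return buffer.getvalue()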

Support

For technical support or custom deployment options, contact the Inference.net team.
ClipTagger-12b was developed in partnership with Grass to process over a billion videos while maintaining perfect schema adherence.