
Introduction
Programmatic video understanding enables developers to build applications that extract structured information from video content at scale, transforming raw frames into searchable, actionable data. By processing video frames through specialized vision language models, developers can build systems that understand visual content as deeply as they understand text.

ClipTagger-12b is a custom 12-billion-parameter vision language model (VLM) designed for video understanding at massive scale. The model was trained by the Inference.net research team in collaboration with Grass to meet their trillion-scale video frame captioning needs. Read the announcement blog post.

On our benchmarks, ClipTagger-12b achieves quality on par with GPT-4.1 and outperforms Claude 4 Sonnet, while delivering 15–17x lower costs. We use Gemini-2.5-Pro as an independent judge to evaluate caption quality. The model enables teams operating at scale to perform temporal video understanding, content indexing, and automated video analysis workflows at state-of-the-art quality while maintaining control over their data and costs.

Key Features
- Frontier-quality – Comparable to GPT-4.1 and better than Claude 4 Sonnet.
- Cost-Efficient – 15x lower cost than GPT-4.1 and 17x lower than Claude 4 Sonnet.
- Production-Ready – Battle-tested on trillion-scale video frame captioning workloads.
- Temporal Consistency – Maintains semantic consistency across video frames.
- Open Source – The model is open source and available on Hugging Face.
Processing at Scale
For large-scale workloads, we recommend using our asynchronous APIs.

Batch API
Ideal for processing thousands to millions of frames: up to 50,000 requests per batch, a 24-hour completion window, webhook notifications, and ~95% cost savings versus synchronous requests.
Group API
Perfect for smaller batches (up to 50 frames) with simpler integration: a direct JSON request body, the ability to group related frames and track their processing as a unit, and webhook support.
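As a rough sketch of the client-side half of this pattern, the helper below chunks frame file paths into groups of at most 50. The actual Group API endpoint and request schema are not shown here; take those from the API reference.

```python
from typing import Iterator

GROUP_LIMIT = 50  # maximum frames per group, per the description above


def chunk_frames(frame_paths: list[str], size: int = GROUP_LIMIT) -> Iterator[list[str]]:
    """Yield groups of frame paths, each small enough for one Group API request."""
    for start in range(0, len(frame_paths), size):
        yield frame_paths[start:start + size]

# Each chunk would then be submitted as one group and tracked as a unit
# (the submission call itself depends on the Group API request schema).
```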
Getting Started
You’ll need an Inference.net account and API key. See our Quick Start Guide for setup instructions. Install the OpenAI SDK and set the base URL to https://api.inference.net/v1.
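For example, with the official OpenAI Python SDK, pointing the client at the Inference.net endpoint looks roughly like the sketch below (the INFERENCE_API_KEY environment variable name is an assumption; use wherever you store your key):

```python
import os

from openai import OpenAI  # pip install openai

# OpenAI-compatible client pointed at the Inference.net endpoint.
client = OpenAI(
    base_url="https://api.inference.net/v1",
    api_key=os.environ["INFERENCE_API_KEY"],  # assumed env var name
)
```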
Basic Usage
The model processes a single video frame per request, so to analyze a video you send one request per frame you want to analyze. Processing every frame of a video is not always necessary: depending on your use case, you may want to process a subset of frames, such as keyframes or frames with high motion. You can use our Group API to submit multiple frames at once and receive a response once all frames have been processed, which is often ideal for processing 10-20 frames per video.

The model requires a specific system prompt and user prompt to generate structured output. Use the provided prompts exactly as given: ClipTagger-12b was specifically trained with this prompt structure to ensure consistent, high-quality output.
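A minimal sketch of a single-frame request with the Python OpenAI SDK is shown below, reusing the client from Getting Started. The model slug and the SYSTEM_PROMPT / USER_PROMPT placeholders are assumptions; substitute the exact prompts published with the model.

```python
import base64

# Placeholders: copy the exact prompts from the model's prompt reference.
SYSTEM_PROMPT = "<exact ClipTagger-12b system prompt>"
USER_PROMPT = "<exact ClipTagger-12b user prompt>"

# Encode one video frame as a base64 data URL.
with open("frame_0001.jpg", "rb") as f:
    frame_b64 = base64.b64encode(f.read()).decode("utf-8")

completion = client.chat.completions.create(
    model="inference-net/cliptagger-12b",  # assumed model slug; check your dashboard
    temperature=0.1,                       # low temperature for consistent output
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": USER_PROMPT},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"},
                },
            ],
        },
    ],
)

print(completion.choices[0].message.content)  # JSON string describing the frame
```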
Output
We’re going to use the image below as an example.
When a category has no relevant content (like actions or logos in the example above), the model returns empty arrays rather than omitting fields.

The JSON response schema shown above is the model's standard output format. All fields are always present in the response, with empty arrays returned when no relevant elements are detected (e.g., no logos or actions in the frame).
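As an illustrative sketch only: parsing the completion from the earlier example might look like the snippet below. Only the actions and logos fields are named in this guide, so treat the other keys shown in the comment as assumptions and refer to the model card for the authoritative schema.

```python
import json

frame_data = json.loads(completion.choices[0].message.content)

# Illustrative shape (field names other than "actions" and "logos" are assumed):
# {
#     "description": "A quiet residential street lined with parked cars.",
#     "objects": ["parked cars", "street", "houses", "trees"],
#     "actions": [],   # always present; empty when nothing is detected
#     "logos": [],     # always present; empty when nothing is detected
# }
if not frame_data["logos"]:
    print("No logos detected in this frame")
```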
Keyframe selection
There are several ways to select keyframes for a video.

1. Frame-by-frame analysis
The simplest way to select keyframes is to process every frame of the video. This is the most accurate approach, but it is also the most expensive.
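A cheaper middle ground is to sample frames at a fixed interval. The sketch below uses OpenCV (an assumption; any frame-extraction tool works) to grab one JPEG-encoded frame every few seconds.

```python
import cv2  # pip install opencv-python


def sample_frames(video_path: str, every_n_seconds: float = 2.0) -> list[bytes]:
    """Return one JPEG-encoded frame for every `every_n_seconds` of video."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS is unavailable
    step = max(1, int(fps * every_n_seconds))

    frames: list[bytes] = []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            encoded, buffer = cv2.imencode(".jpg", frame)
            if encoded:
                frames.append(buffer.tobytes())
        index += 1

    cap.release()
    return frames
```

Each returned frame can then be base64-encoded and sent as its own request, or submitted together through the Group API.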
Example Use Cases
- Video Search & Discovery – Build searchable video databases by indexing structured metadata and tracking object persistence across frames.
- Content Moderation – Automated content analysis with consistent categorization for content type detection and quality verification.
- Accessibility – Generate consistent alt-text and scene summaries for video content with frame-by-frame descriptions.
- Ad Verification – Ensure sponsored content compliance by tracking product visibility and logo appearances.
Best Practices
- Use the exact prompts – The provided system and user prompts are optimized for best results
- Set low temperature – Use `temperature: 0.1` for consistent output
- Enable JSON mode – Always set `response_format: {"type": "json_object"}`
- Process frames systematically – Maintain temporal order for better analysis
- Batch similar content – Group frames from the same video or scene
Limitations
- Single video frame per request only. Use the Group API to process multiple frames at once.
- Maximum image size: 1MB
- Supported formats: JPEG, PNG, WebP, GIF
- English-only descriptions (though it can identify text in other languages)
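Given these limits, a small pre-flight check like the sketch below can reject frames before they are submitted (the strict 1,000,000-byte reading of the 1MB limit is an assumption):

```python
import os

MAX_BYTES = 1_000_000  # assumed interpretation of the 1MB limit
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".gif"}


def validate_frame(path: str) -> None:
    """Raise ValueError if a frame would exceed the documented limits."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported format: {ext}")
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        raise ValueError(f"Frame is {size} bytes, over the 1MB limit")
```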
Support
For technical support or custom deployment options:
- Email: [email protected]
- Schedule a call for a consultation
ClipTagger-12b was developed in partnership with Grass to process over a billion videos while maintaining perfect schema adherence.