Video Understanding with ClipTagger - Inference.net Documentation

Use this tutorial when you want a repeatable video-understanding workflow rather than a one-off multimodal prompt.

Best fit

large sets of frames
scene tagging
factual captioning
video metadata enrichment

Recommended stack

model: ClipTagger
small related bundles: background jobs or group jobs
large queues: batch

Workflow

decide the frame sampling strategy
choose whether the job is a small related bundle (background or group jobs) or a large offline queue (batch)
run frames through ClipTagger
aggregate frame outputs into your higher-level video result
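
The workflow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ClipTagger API: `tag_frame` is a hypothetical stand-in for whatever call sends one frame to ClipTagger, the `"objects"` key is an assumed shape for a per-frame result, and the `batch_threshold` default is an arbitrary placeholder for wherever you draw the line between a small bundle and a large queue.

```python
import math
from collections import Counter
from typing import Callable, Iterable


def sample_timestamps(duration_s: float, interval_s: float) -> list[float]:
    """Fixed-interval sampling: timestamps (seconds) in [0, duration_s)."""
    if duration_s < 0 or interval_s <= 0:
        raise ValueError("duration_s must be >= 0 and interval_s must be > 0")
    count = math.ceil(duration_s / interval_s)
    return [round(i * interval_s, 3) for i in range(count)]


def choose_mode(n_frames: int, batch_threshold: int = 1000) -> str:
    """Pick a job style. The threshold is a hypothetical placeholder."""
    return "batch" if n_frames >= batch_threshold else "background"


def aggregate_tags(
    frame_outputs: Iterable[dict],
    key: str = "objects",      # assumed field name in each frame result
    min_share: float = 0.5,    # keep tags seen in at least this share of frames
) -> list[str]:
    """Roll per-frame tags up into a video-level tag list."""
    outputs = list(frame_outputs)
    if not outputs:
        return []
    counts: Counter = Counter()
    for out in outputs:
        # Count each tag at most once per frame.
        counts.update(set(out.get(key, [])))
    cutoff = min_share * len(outputs)
    return sorted(tag for tag, n in counts.items() if n >= cutoff)


def tag_video(
    duration_s: float,
    interval_s: float,
    tag_frame: Callable[[float], dict],  # hypothetical: timestamp -> frame result
) -> list[str]:
    """Sample frames, tag each one, then aggregate into a video result."""
    outputs = [tag_frame(t) for t in sample_timestamps(duration_s, interval_s)]
    return aggregate_tags(outputs)
```

Keeping the sampling and aggregation steps as plain functions means the same logic applies whether frames are sent as a small background bundle or as a large batch; only the submission step in the middle changes.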

