From giant generalists to lean, production-ready models
| Step | What Happens | Why You Care |
|---|---|---|
| 1. Fine-tune a teacher (pick any strong open-source or proprietary checkpoint). | We make the model excellent on your task, no compromises. | Accurate. |
| 2. Distill that teacher into a smaller student (7–27 B). | The student learns by copying the teacher's answers (see the first sketch after this table). | 5–10× faster & cheaper. |
| 3. Deploy the student behind the same /chat/completions endpoint. | Zero code changes besides the model name (second sketch below). | Straight to prod. |
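Here's a minimal sketch of step 2's data collection, assuming an OpenAI-compatible endpoint; the model name, base URL, and file paths are hypothetical placeholders. The student's training set is simply the teacher's answers to your own task prompts:

```python
# Step 2 sketch: build the student's training set from teacher answers.
# Assumes an OpenAI-compatible /chat/completions endpoint; the model name,
# base_url, and file paths are hypothetical placeholders.
import json

from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

TEACHER = "my-finetuned-teacher"  # hypothetical fine-tuned teacher checkpoint

with open("task_prompts.jsonl") as prompts, open("distill_data.jsonl", "w") as out:
    for line in prompts:
        prompt = json.loads(line)["prompt"]
        resp = client.chat.completions.create(
            model=TEACHER,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # deterministic answers make cleaner training labels
        )
        # Each (prompt, teacher answer) pair becomes one supervised
        # fine-tuning example for the smaller student.
        out.write(json.dumps({
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant",
                 "content": resp.choices[0].message.content},
            ]
        }) + "\n")
```

And step 3 really is a one-line change at the call site. Once the student is trained and served, swap the model string and touch nothing else:

```python
# Step 3 sketch: deployment is just a model-name swap ("student-7b" is a
# hypothetical name for the distilled student).
resp = client.chat.completions.create(
    model="student-7b",  # was: "my-finetuned-teacher"
    messages=[{"role": "user", "content": "Tag this image caption: ..."}],
)
print(resp.choices[0].message.content)
```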
| Use case | Why Distill? |
|---|---|
| Image or video tagging | Teacher handles vision; student runs on a single GPU for nightly batches. |
| High-volume classification | 10K requests/sec without burning $$. |
| Strict JSON / XML extraction | Student stays inside the schema; latency drops below 100 ms (sketch below). |
| Edge deployment | Fit a 7 B model on CPUs or mobile, no external API call. |
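To make the JSON-extraction row concrete, here's a hedged sketch assuming the serving stack supports OpenAI-style structured outputs (`response_format` with a JSON schema); the model name and schema are illustrative, not a specific product API:

```python
# Structured-output sketch: constrain the student to a JSON schema.
# Assumes OpenAI-style structured outputs; model name and schema are
# illustrative placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["vendor", "total", "currency"],
    "additionalProperties": False,
}

resp = client.chat.completions.create(
    model="student-7b",  # hypothetical distilled student
    messages=[{"role": "user", "content": "Extract fields from this invoice: ..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "invoice", "schema": invoice_schema, "strict": True},
    },
)
print(resp.choices[0].message.content)  # parses cleanly against the schema
```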
Rule of thumb: If the teacher nails quality but misses your latency or cost SLO, distill it.
*[Chart: Gemma 12 B distilled accuracy]*
| Modality | Example Tasks |
|---|---|
| Text → Text | Code completion, instruction following, data extraction |
| Image → Text | Captioning, classification, tagging |
| Video → Text | Captioning, Q&A, summarization |
| Audio → Text | Transcription, Q&A (often with video) |