From giant generalists to lean, production-ready models
Large frontier models already know how to do your task, but running them in production can be painfully slow and expensive.
Distillation fixes that.
Step | What Happens | Why You Care |
---|---|---|
1. Fine-tune a teacher (pick any strong open-source or proprietary checkpoint). | We make the model excellent on your task, no compromises. | Accurate. |
2. Distill that teacher into a smaller student (3-12 B). | The student learns by copying the teacher’s answers. | 5-10× faster & cheaper. |
3. Deploy the student behind the same /chat/completions endpoint. | Zero code changes besides the model name. | Straight to prod. |
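To make step 3 concrete, here's a minimal sketch of the swap, assuming an OpenAI-compatible /chat/completions endpoint; the base URL, API key, and model names are placeholders, not a specific product API.

```python
# Minimal sketch: swapping the teacher for the student behind the same endpoint.
# base_url, api_key, and model names are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://YOUR-ENDPOINT/v1", api_key="YOUR_KEY")

resp = client.chat.completions.create(
    # model="teacher-27b",        # before distillation
    model="student-12b",          # after: the model name is the only change
    messages=[{"role": "user", "content": "Tag this customer email: ..."}],
)
print(resp.choices[0].message.content)
```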
You don’t need labeled outputs.
Provide a pile of inputs (dog images, video frames, customer emails).
The teacher generates the supervision signal; the student learns from it.
You may already have users passing data through a large, overly expensive LLM. That traffic can be reused directly as training data for the student!
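Here's a rough sketch of how that dataset comes together: unlabeled inputs (or logged user requests) go through the fine-tuned teacher, and the resulting input/output pairs become the student's training set. The endpoint, model name, and file names below are placeholders.

```python
# Sketch: the teacher labels raw inputs to build the distillation dataset.
# Endpoint, model name, and file paths are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="https://YOUR-TEACHER-ENDPOINT/v1", api_key="YOUR_KEY")

with open("raw_inputs.jsonl") as src, open("distill_train.jsonl", "w") as dst:
    for line in src:
        user_input = json.loads(line)["input"]
        resp = client.chat.completions.create(
            model="teacher-finetune",   # your fine-tuned teacher
            messages=[{"role": "user", "content": user_input}],
            temperature=0,              # deterministic supervision signal
        )
        pair = {
            "messages": [
                {"role": "user", "content": user_input},
                {"role": "assistant", "content": resp.choices[0].message.content},
            ]
        }
        dst.write(json.dumps(pair) + "\n")
```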
Use-case | Why Distill? |
---|---|
Image or video tagging | Teacher handles vision; student runs on a single GPU for nightly batches. |
High-volume classification | 10 K requests/sec without burning $$. |
Strict JSON / XML extraction | Student stays inside schema; latency drops below 100 ms. |
Edge deployment | Fit a 7 B model on CPUs or mobile, no external API call. |
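For the strict JSON extraction row, a simple belt-and-suspenders pattern is to ask the student for JSON and validate the result against your schema client-side. This is an illustrative sketch, not our exact serving setup: the schema, endpoint, and model name are made up, and JSON mode via `response_format` assumes your serving stack supports it.

```python
# Sketch: strict JSON extraction with a distilled student, validated client-side.
# Schema, endpoint, and model name are illustrative placeholders.
import json
from jsonschema import validate
from openai import OpenAI

SCHEMA = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "amount": {"type": "number"}},
    "required": ["name", "amount"],
}

client = OpenAI(base_url="https://YOUR-ENDPOINT/v1", api_key="YOUR_KEY")
resp = client.chat.completions.create(
    model="student-12b",
    messages=[{"role": "user", "content": "Extract name and amount as JSON: ..."}],
    response_format={"type": "json_object"},   # assumes JSON mode is supported
)

payload = json.loads(resp.choices[0].message.content)
validate(instance=payload, schema=SCHEMA)      # raises if the student leaves the schema
```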
Rule of thumb: If the teacher nails quality but misses your latency or cost SLO, distill it.
Here's an example of a training run where we distilled a 27B teacher into a 12B student. We first fine-tuned the teacher on 100k examples passed through Gemini-2.5-Pro, then distilled it into the 12B model. The distilled student learned the same task as the teacher, but at a fraction of the cost and latency.
[Chart: Gemma 12B distilled model accuracy]
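Under the hood, the student fine-tuning step in a run like this is ordinary supervised training on the teacher-labeled pairs. Below is a minimal sketch using Hugging Face datasets and TRL's SFTTrainer; the checkpoint name, file path, and hyperparameters are placeholders, not the exact recipe from the run above.

```python
# Sketch: fine-tune the student on teacher-labeled pairs (distill_train.jsonl
# from the earlier step). Checkpoint, paths, and hyperparameters are placeholders.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train = load_dataset("json", data_files="distill_train.jsonl", split="train")

trainer = SFTTrainer(
    model="google/gemma-3-12b-it",      # placeholder 12B student checkpoint
    train_dataset=train,                # rows shaped as {"messages": [...]}
    args=SFTConfig(
        output_dir="student-12b-distilled",
        per_device_train_batch_size=4,
        num_train_epochs=2,
        learning_rate=2e-5,
    ),
)
trainer.train()
```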
Technique | When We Recommend It |
---|---|
Full fine-tune | Maximum quality; you own the weights. |
LoRA / Adapters | Small dataset, rapid iteration, weights ≤ 1 GB. |
Quantised LoRA | Edge devices, CPU inference. |
Distillation | Production latency/cost constraints. |
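For the LoRA rows, a minimal sketch with Hugging Face peft (plus bitsandbytes for the quantised variant) looks roughly like this; the base checkpoint and hyperparameters are illustrative, not a tuned recipe.

```python
# Sketch: attach LoRA adapters to a base model; the quantization_config line
# gives the quantised-LoRA variant. Checkpoint and hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-12b-it",      # placeholder base checkpoint
    quantization_config=bnb,      # drop this line for plain (non-quantised) LoRA
    device_map="auto",
)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()   # adapters are a small fraction of the weights
```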
Our team has shipped hundreds of custom models. Email [email protected]
and we’ll pick the right recipe for you.
Need results by next week? We’ve done it before. Ping us—let’s build something that actually ships.