Fine-tuning takes a small open-source model and trains it on your data so it becomes very good at one specific task. The result is a model that’s smaller, faster, and cheaper to run than the general-purpose model it replaces, while being more accurate for your workload. You don’t need to be an ML engineer to fine-tune.

Training is one optimization lever, not the only one. Many teams get substantial value from observability, evals, and prompt tuning alone. Fine-tuning is for when you’ve confirmed via evals that off-the-shelf models aren’t good enough for a specific task, and you have the data to prove it.

Prerequisites

Training requires three things before you start:
  1. A training dataset - the data the model learns from. Build it from captured traffic or upload your own.
  2. An eval dataset - measures learning progress. Must have zero overlap with training data.
  3. A validated rubric - run it against your eval dataset first to confirm it measures what you care about. See Set Up Your First Eval.
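
The zero-overlap requirement between training and eval data is worth enforcing mechanically rather than by eye. As a minimal sketch (the `split_train_eval` helper and the `{"input": ...}` record shape are assumptions, not the platform's API), you can deduplicate by input hash before splitting so the same prompt can never land in both sets:

```python
import hashlib
import random

def split_train_eval(examples, eval_size, seed=0):
    """Split examples into disjoint train/eval sets.

    Deduplicates by a hash of each input first, so an identical prompt
    can never appear in both sets, then holds out `eval_size` examples
    for evaluation and returns (train, eval).
    """
    seen = set()
    unique = []
    for ex in examples:
        key = hashlib.sha256(ex["input"].encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    # Shuffle deterministically so the split is reproducible.
    random.Random(seed).shuffle(unique)
    return unique[eval_size:], unique[:eval_size]
```

Because the eval set is your benchmark, fix the seed and keep the held-out examples stable across runs.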

📍 TODO:MEDIA

Visual showing the training loop: select recipe → launch run → monitor progress → converge or early-stop → deploy.

Key concepts

  • Recipe - A pre-configured training setup with a vetted base model, optimized parameters, and compute config. You pick by task difficulty; the platform handles the ML complexity.
  • Training dataset - The data the model learns from. Diversity and quality matter most. Build from captured traffic or upload your own.
  • Eval dataset - A separate dataset that measures learning progress. Must have zero overlap with training data so scores reflect generalization rather than memorization.
  • Rubric - The quality criteria that guide training. Mid-training evals use it to decide when to stop. If the rubric is wrong, the model optimizes for the wrong thing.
  • Mid-training evals - Periodic quality checks during training. If scores improve, training continues. If they degrade, training stops early to prevent overfitting.
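
The early-stop behavior of mid-training evals can be sketched as a simple patience rule: stop once the eval score has failed to improve on its best value for a set number of consecutive checks. This is an illustrative sketch only; the `should_stop_early` helper and the `patience` parameter are assumptions, not the platform's actual stopping logic:

```python
def should_stop_early(scores, patience=2):
    """Return True if the eval score has failed to beat the best score
    seen so far for `patience` consecutive checks."""
    best = float("-inf")
    bad_checks = 0
    for score in scores:
        if score > best:
            best = score        # new best: reset the patience counter
            bad_checks = 0
        else:
            bad_checks += 1     # no improvement over the best so far
            if bad_checks >= patience:
                return True
    return False
```

With `patience=2`, a score history of `[0.5, 0.6, 0.55, 0.54]` triggers a stop, while a steadily improving `[0.5, 0.6, 0.7]` does not.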

Why fine-tune

  • Reduce latency - a smaller, task-specific model responds faster than a general-purpose one
  • Reduce cost - smaller models cost less to serve at scale
  • Improve accuracy - a model trained on your data and scored against your rubric is optimized for exactly what you need
  • Maintain ownership - you own the model artifact and control where it runs

What makes good training data

The quality of your trained model depends directly on the quality of your data. A few principles:
  • Diversity matters most. Training data should cover the range of inputs the model will see in production — different phrasings, edge cases, varying complexity. A narrow dataset produces a narrow model.
  • Real traffic beats synthetic data. Production inputs reflect what users actually send. Build datasets from live traffic when possible.
  • Scope to a single task. A dataset built from mixed traffic teaches the model many things poorly. Use task tags to filter for one objective at a time.
  • More is generally better, but quality trumps volume. A thousand clean, diverse examples outperform ten thousand repetitive ones.

For eval data, the goal is different: pick a small, stable set of hard examples that stress-test the model. Don’t change your eval dataset often — it’s your benchmark. See Datasets for more on building both types.
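
The "scope to a single task" and "quality trumps volume" principles above can be sketched as a small filtering pass over captured traffic. The `build_task_dataset` helper, the `"task"`/`"input"` record fields, and the `max_per_input` cap are all hypothetical names for illustration, assuming traffic records carry a task tag:

```python
from collections import Counter

def build_task_dataset(records, task_tag, max_per_input=1):
    """Filter captured traffic to a single task and drop repeats.

    Keeps at most `max_per_input` examples per normalized input, so a
    handful of repetitive prompts can't dominate the dataset.
    """
    counts = Counter()
    out = []
    for record in records:
        if record.get("task") != task_tag:
            continue  # scope to one objective at a time
        key = record["input"].strip().lower()
        if counts[key] >= max_per_input:
            continue  # drop near-identical repeats
        counts[key] += 1
        out.append(record)
    return out
```

Normalizing inputs before counting (here just strip + lowercase) is a crude stand-in for real near-duplicate detection, but it captures the idea: a smaller, diverse dataset beats a large repetitive one.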

Next steps

Choose a recipe

Pick a pre-configured training setup for your task.

Launch a training run

End-to-end flow from datasets to queued job.

Monitor a training run

Track progress, graphs, and logs during training.

Deploy a trained model

Ship your model to a dedicated GPU.