What you’ll have when you finish
- one training job launched against paired training and eval datasets
- a clear training objective tied to the eval baseline
- a checklist for deciding whether the result is worth promoting
Before you start
- complete /guides/build-a-real-traffic-eval-baseline
- create both a training dataset and an eval dataset with /guides/create-datasets-from-observed-traffic
Step 1: decide whether this is a training problem
Training is worth it when:
- the baseline model consistently fails on the same task pattern
- the eval dataset is representative and stable
- the improvement target matters enough to justify a new model artifact
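The first criterion above can be made concrete with a small gate: tag each baseline eval failure with a task-pattern label, and treat training as worthwhile only when one pattern accounts for a meaningful share of all cases. This is a minimal sketch; the function name, labels, and the 30% threshold are illustrative assumptions, not product behavior.

```python
from collections import Counter

def is_training_problem(failure_patterns, total_cases, min_pattern_share=0.3):
    """Return True when one task pattern dominates baseline failures.

    failure_patterns: one hypothetical task-pattern label per failed eval case.
    A single pattern covering >= min_pattern_share of all cases suggests a
    consistent, trainable failure mode rather than noise.
    """
    if not failure_patterns:
        return False
    pattern, count = Counter(failure_patterns).most_common(1)[0]
    return count / total_cases >= min_pattern_share

# Example: 8 of 20 eval cases fail on the same (made-up) pattern.
failures = ["extract_dates"] * 8 + ["other"] * 2
is_training_problem(failures, total_cases=20)  # True: 8/20 = 0.4 >= 0.3
```

A scattered set of unrelated failures (no dominant pattern) is usually better addressed with prompt or data fixes than a new model artifact.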
Step 2: choose the model improvement path
Use fine-tuning when the base model is close but not good enough. Use distillation when the teacher model already performs well and your bigger problem is cost, latency, or serving footprint.
Step 3: launch the training job with paired datasets
The current self-serve flow expects:
- one training dataset
- one eval dataset
- one eval definition and version
- one base model
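Because the flow expects exactly one of each input, it helps to validate the launch payload before submitting rather than letting the job fail after queueing. The sketch below assumes a payload-building step; the field names and ID formats are hypothetical, not the actual launch API.

```python
def build_training_job_payload(training_dataset_id, eval_dataset_id,
                               eval_id, eval_version, base_model):
    """Assemble a paired-dataset launch payload (illustrative field names).

    Validates that every required input is present up front, mirroring the
    one-of-each requirement of the self-serve flow.
    """
    fields = {
        "training_dataset_id": training_dataset_id,
        "eval_dataset_id": eval_dataset_id,
        "eval_id": eval_id,
        "eval_version": eval_version,
        "base_model": base_model,
    }
    missing = [name for name, value in fields.items() if not value]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return fields

# Hypothetical IDs for illustration only.
payload = build_training_job_payload(
    training_dataset_id="ds_train_123",
    eval_dataset_id="ds_eval_456",
    eval_id="eval_abc",
    eval_version="v3",
    base_model="base-small",
)
```

Keeping the eval definition and version pinned in the payload is what ties the training run back to the baseline you built earlier.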
Step 4: monitor the run, not just the final status
Watch the training job detail page for:
- job state: queued, running, completed, or failed
- current step vs total steps
- current loss
- checkpoint evals and score distribution
- final model reference and weights
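Monitoring the run rather than just the final status can be sketched as a polling loop that surfaces step count and loss on every check, so a stalled run is visible before the terminal state flips. This assumes some `get_status` callable that returns the fields above; the shape of that dict is an assumption for illustration.

```python
import time

def wait_for_training_job(get_status, poll_seconds=30, timeout_seconds=3600):
    """Poll a training job until it reaches a terminal state.

    get_status: hypothetical callable returning a dict like
      {"state": "running", "current_step": 120, "total_steps": 1000,
       "current_loss": 1.83}
    Prints progress each poll so a flat loss or frozen step counter is
    noticeable long before the job finishes or fails.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = get_status()
        print(f"{status['state']}: step {status.get('current_step')}/"
              f"{status.get('total_steps')}, loss {status.get('current_loss')}")
        if status["state"] in ("completed", "failed"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("training job did not reach a terminal state in time")
```

A failed job returned by this loop still carries its last step and loss, which is usually enough to tell a data problem from an infrastructure one.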
Step 5: decide if the result is good enough to promote
A completed training job is not automatically a production-ready model. Before promotion, confirm:
- the training job finished successfully
- the model beats or matches the baseline on the eval you trust
- the latency and cost tradeoffs still make sense
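The promotion checklist above can be encoded as explicit gates so that a "no" decision is auditable: every failed gate is reported, not just the first. The function and its thresholds are a sketch under assumed inputs, not a prescribed policy.

```python
def should_promote(job_state, model_score, baseline_score,
                   latency_ms, max_latency_ms, cost_ratio, max_cost_ratio=1.0):
    """Apply the promotion checklist as explicit gates.

    Returns (decision, reasons): reasons is empty when every gate passes,
    otherwise it lists each failed check. cost_ratio is the new model's
    serving cost relative to the baseline (assumed convention).
    """
    reasons = []
    if job_state != "completed":
        reasons.append(f"job state is {job_state!r}, not 'completed'")
    if model_score < baseline_score:
        reasons.append(f"score {model_score} is below baseline {baseline_score}")
    if latency_ms > max_latency_ms:
        reasons.append(f"latency {latency_ms}ms exceeds budget {max_latency_ms}ms")
    if cost_ratio > max_cost_ratio:
        reasons.append(f"cost ratio {cost_ratio} exceeds {max_cost_ratio}")
    return (not reasons, reasons)

ok, why = should_promote("completed", model_score=0.86, baseline_score=0.81,
                         latency_ms=240, max_latency_ms=300, cost_ratio=0.9)
# ok is True; why is an empty list
```

Scoring against "the eval you trust" matters here: compare on the same eval definition and version you pinned at launch, or the baseline comparison is meaningless.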
Verify it worked
You should now have:
- one completed training job, or one active job with a clear monitoring path
- one model reference to evaluate for promotion
What to do next
Promote a Trained Model to Deployment
Move the trained result into a dedicated serving path and validate it.
Meet with Us
Talk to our team if you want help with dataset strategy, distillation, or rollout planning.