Training runs should generally succeed without intervention. If a run fails, it usually indicates an underlying issue worth investigating.

Common failures

Model collapse / divergence
  What happened: The model trained but performs poorly in practice; it may have overfit even though eval scores look good.
  What to do: Check data quality and diversity. Try a different recipe.

Insufficient data
  What happened: The model fails to generalize.
  What to do: Add more representative samples to the training dataset.

Poor data quality
  What happened: The training data isn’t diverse enough.
  What to do: Curate a more varied dataset. Ensure it covers edge cases.

Rate limited
  What happened: Too many concurrent training runs. The platform rate-limits usage to prevent abuse.
  What to do: Wait for your current run to finish, then retry.
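Several of the failures above trace back to the dataset itself. As a quick first pass before retraining, you can compute rough quality signals such as the duplicate rate and label balance. This is a minimal sketch, not a platform feature: the `(text, label)` pair format and the `dataset_stats` helper are assumptions for illustration; adapt them to your actual training data.

```python
from collections import Counter

def dataset_stats(examples):
    """Rough data-quality signals: duplicate rate and label balance.

    `examples` is assumed to be a list of (text, label) pairs —
    a hypothetical format; substitute your real training format.
    """
    texts = [text for text, _ in examples]
    labels = [label for _, label in examples]
    n = len(examples)
    # Fraction of examples that are exact duplicates of an earlier one.
    duplicate_rate = 1 - len(set(texts)) / n
    # Share of the dataset taken by the most common label;
    # values near 1.0 suggest a badly imbalanced dataset.
    top_label_share = max(Counter(labels).values()) / n
    return {
        "examples": n,
        "duplicate_rate": duplicate_rate,
        "top_label_share": top_label_share,
    }

stats = dataset_stats([
    ("the cat sat", "pos"),
    ("the cat sat", "pos"),   # exact duplicate
    ("dogs bark loudly", "neg"),
    ("birds sing", "pos"),
])
print(stats)  # → 4 examples, 25% duplicates, top label covers 75%
```

High duplicate rates or a single dominant label are cheap early warnings for the "insufficient data" and "poor data quality" failures before you spend another training run.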

What you see on failure

  • An error message on the training run
  • A description of the failure condition
  • A retry button
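If you are retrying rate-limited runs from a script rather than the retry button, exponential backoff with jitter avoids hammering the platform. This is a generic sketch: `start_run` stands in for whatever call your client library makes to launch a training run, and raising `RuntimeError` on rate limiting is an assumption, not the platform's actual error type.

```python
import random
import time

def retry_with_backoff(start_run, max_attempts=5, base_delay=1.0):
    """Retry a rate-limited submission with exponential backoff.

    `start_run` is a hypothetical callable that raises RuntimeError
    when the platform rate-limits the request; the real client API
    and exception type will differ.
    """
    for attempt in range(max_attempts):
        try:
            return start_run()
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            # Exponential backoff plus jitter so concurrent clients
            # don't all retry at the same instant.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

Capping attempts matters: if the run still fails after several backoffs, that is the point to stop retrying and contact support instead.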

When to reach out

If training fails unexpectedly, contact support. Failures usually indicate something worth looking into — the team can help diagnose whether it’s a data issue, a recipe mismatch, or a platform problem.

Talk to an engineer

Meet with our team to discuss training failures or get help with your approach.

Training succeeded?

Model trained and eval scores look good? Deploy it to a dedicated GPU and start serving traffic.

Deploy your model

Ship your trained model to production.