Common failures
| Failure | What happened | What to do |
|---|---|---|
| Model collapse / divergence | The model finished training but performs poorly in practice. It may have overfit even though eval scores look good. | Check data quality and diversity. Try a different recipe. |
| Insufficient data | Model fails to generalize well. | Add more representative samples to the training dataset. |
| Poor data quality | Training data isn’t diverse enough. | Curate a more varied dataset and make sure it covers edge cases. |
| Rate limited | Too many concurrent training runs. The platform rate-limits usage to prevent abuse. | Wait for your current run to finish, then retry. |
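The data checks suggested in the table can be approximated before starting a run. The sketch below is only an illustration, assuming the training set can be loaded as a list of text examples. The function name `dataset_report` and the two metrics (duplicate ratio and type-token ratio) are hypothetical choices for this example, not part of the platform.

```python
from collections import Counter


def dataset_report(examples: list[str]) -> dict[str, float]:
    """Summarize size, duplication, and vocabulary spread of a text dataset.

    Hypothetical helper for illustration; not a platform API.
    """
    # Normalize case and whitespace so trivially different copies count as duplicates.
    normalized = [" ".join(text.lower().split()) for text in examples]
    counts = Counter(normalized)
    total = len(normalized)
    duplicates = total - len(counts)

    # Whitespace tokenization is a rough stand-in for a real tokenizer.
    tokens = [tok for text in normalized for tok in text.split()]
    return {
        "examples": total,
        "duplicate_ratio": duplicates / total if total else 0.0,
        # Unique tokens divided by all tokens; low values suggest repetitive data.
        "type_token_ratio": len(set(tokens)) / len(tokens) if tokens else 0.0,
    }


samples = [
    "Reset my password",
    "reset my  password",
    "Cancel my subscription",
    "Where is my invoice?",
]
report = dataset_report(samples)
print(report)
```

A high duplicate ratio or a low type-token ratio is only a rough hint that the dataset may lack variety. Neither number replaces reviewing the examples for coverage of edge cases.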
What you see on failure
- An error message on the training run
- A description of the failure condition
- A retry button
When to reach out
If training fails unexpectedly, contact support. Failures usually indicate something worth looking into, and the team can help diagnose whether it’s a data issue, a recipe mismatch, or a platform problem.
Talk to an engineer
Meet with our team to discuss training failures or get help with your approach.
Training succeeded?
Model trained and eval scores look good? Deploy it to a dedicated GPU and start serving traffic.
Deploy your model
Ship your trained model to production.