Training runs should generally succeed without intervention. If a run fails, it usually indicates an underlying issue worth investigating.

Common failures

Model collapse / divergence
  What happened: The model trained but performs poorly in practice; it may have overfit even though eval scores look good.
  What to do: Check data quality and diversity. Try a different recipe.

Insufficient data
  What happened: The model fails to generalize.
  What to do: Add more representative samples to the training dataset.

Poor data quality
  What happened: The training data isn’t diverse enough.
  What to do: Curate a more varied dataset. Ensure it covers edge cases.

Rate limited
  What happened: Too many concurrent training runs. The platform rate-limits usage to prevent abuse.
  What to do: Wait for your current run to finish, then retry.
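Several of the failures above trace back to the dataset itself. As a quick first pass before retraining, you can compute rough quality signals such as the duplicate rate and label balance. This is a minimal sketch, not a platform feature: the `(text, label)` pair format and the `dataset_stats` helper are assumptions for illustration; adapt them to your actual training data.

```python
from collections import Counter

def dataset_stats(examples):
    """Rough data-quality signals: duplicate rate and label balance.

    `examples` is assumed to be a list of (text, label) pairs —
    a hypothetical format; substitute your real training format.
    """
    texts = [text for text, _ in examples]
    labels = [label for _, label in examples]
    n = len(examples)
    # Fraction of examples that are exact duplicates of an earlier one.
    duplicate_rate = 1 - len(set(texts)) / n
    # Share of the dataset taken by the most common label;
    # values near 1.0 suggest a badly imbalanced dataset.
    top_label_share = max(Counter(labels).values()) / n
    return {
        "examples": n,
        "duplicate_rate": duplicate_rate,
        "top_label_share": top_label_share,
    }

stats = dataset_stats([
    ("the cat sat", "pos"),
    ("the cat sat", "pos"),   # exact duplicate
    ("dogs bark loudly", "neg"),
    ("birds sing", "pos"),
])
print(stats)  # → 4 examples, 25% duplicates, top label covers 75%
```

High duplicate rates or a single dominant label are cheap early warnings for the "insufficient data" and "poor data quality" failures before you spend another training run.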

What you see on failure

  • An error message on the training run
  • A description of the failure condition
  • A retry button
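If you are retrying rate-limited runs from a script rather than the retry button, exponential backoff with jitter avoids hammering the platform. This is a generic sketch: `start_run` stands in for whatever call your client library makes to launch a training run, and raising `RuntimeError` on rate limiting is an assumption, not the platform's actual error type.

```python
import random
import time

def retry_with_backoff(start_run, max_attempts=5, base_delay=1.0):
    """Retry a rate-limited submission with exponential backoff.

    `start_run` is a hypothetical callable that raises RuntimeError
    when the platform rate-limits the request; the real client API
    and exception type will differ.
    """
    for attempt in range(max_attempts):
        try:
            return start_run()
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            # Exponential backoff plus jitter so concurrent clients
            # don't all retry at the same instant.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

Capping attempts matters: if the run still fails after several backoffs, that is the point to stop retrying and contact support instead.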

When to reach out

If training fails unexpectedly, contact support. Failures usually indicate something worth looking into — the team can help diagnose whether it’s a data issue, a recipe mismatch, or a platform problem.

Talk to an engineer

Meet with our team to discuss training failures or get help with your approach.

Training succeeded?

Model trained and eval scores look good? Deploy it to a dedicated GPU and start serving traffic.

Deploy your model

Ship your trained model to production.