
# Troubleshooting Training Failures

> Common training failures, what they look like, and how to recover.

Training runs should generally succeed without intervention. If a run fails, it usually indicates an underlying issue worth investigating.

## Common failures

| Failure                         | What happened                                                                       | What to do                                                 |
| ------------------------------- | ----------------------------------------------------------------------------------- | ---------------------------------------------------------- |
| **Model collapse / divergence** | Model trained but performs poorly. May have overfit despite good eval scores.       | Check data quality and diversity. Try a different recipe.  |
| **Insufficient data**           | Too few samples for the model to generalize well.                                   | Add more representative samples to the training dataset.   |
| **Poor data quality**           | Training data isn't diverse enough.                                                 | Curate a more varied dataset. Ensure it covers edge cases. |
| **Rate limited**                | Too many concurrent training runs. The platform rate-limits usage to prevent abuse. | Wait for your current run to finish, then retry.           |
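
Since several of the failures above trace back to the dataset, a quick local check of size and duplication can help narrow things down before retrying. The following is a minimal sketch, not part of the platform: it assumes your training records are available as a list of JSON-serializable dicts, and the `summarize_dataset` helper, the `prompt`/`response` field names, and the `min_samples` threshold are all hypothetical and chosen only for illustration.

```python
import json
from collections import Counter


def summarize_dataset(records, min_samples=100):
    """Return basic size and duplication stats for a list of training records.

    `records` is a list of JSON-serializable dicts. `min_samples` is an
    arbitrary illustrative threshold, not a platform requirement.
    """
    # Serialize with sorted keys so that key order does not hide duplicates.
    keys = [json.dumps(r, sort_keys=True, ensure_ascii=False) for r in records]
    counts = Counter(keys)
    total = len(keys)
    unique = len(counts)
    # Fraction of records that are exact copies of an earlier record.
    duplicate_ratio = 0.0 if total == 0 else (total - unique) / total
    return {
        "total": total,
        "unique": unique,
        "duplicate_ratio": duplicate_ratio,
        "too_small": total < min_samples,
    }


if __name__ == "__main__":
    # Hypothetical records; the field names are assumptions for this example.
    sample = [
        {"prompt": "Reset my password", "response": "Go to Settings."},
        {"prompt": "Reset my password", "response": "Go to Settings."},
        {"prompt": "Cancel my order", "response": "Open Orders and choose Cancel."},
    ]
    print(summarize_dataset(sample, min_samples=2))
```

A high `duplicate_ratio` or a `too_small` flag only hints at a data problem. Exact-match deduplication will not catch near-duplicates or missing edge cases, so treat this as a first pass rather than a diagnosis.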

## What you see on failure

* An error message on the training run
* A description of the failure condition
* A retry button

## When to reach out

If training fails unexpectedly, contact support. The team can help diagnose whether the cause is a data issue, a recipe mismatch, or a platform problem.

<Card title="Talk to an engineer" icon="calendar" href="https://inference.net/meet-with-us/">
  Meet with our team to discuss training failures or get help with your approach.
</Card>

## Training succeeded?

Model trained and eval scores look good? Deploy it to a dedicated GPU and start serving traffic.

<Card title="Deploy your model" icon="server" href="/platform/deploy/overview">
  Ship your trained model to production.
</Card>
