# Datasets

> Curate datasets from production traffic or your own files for evals and training.

Datasets are collections of LLM inputs and outputs used for evaluation and fine-tuning. They can come from two places: your live production traffic captured through [Gateway](/platform/gateway/overview), or files you [upload directly](/platform/datasets/upload-a-dataset).

Everything downstream depends on good data. Evals need representative examples to measure model quality. Training needs diverse, high-quality samples to teach a model your task. Datasets are where both start.

<Frame>
  <iframe style={{ width: "100%", aspectRatio: "16 / 9", border: 0, display: "block" }} src="https://www.youtube.com/embed/WIohm_V4aHo?list=PLJzp7SN2tfJsRAU9VGSfSo60CyDJzqhLP&rel=0" title="Turn Production Traffic Into LLM Training Data | Catalyst" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowFullScreen />
</Frame>

## Types of datasets

| Type                 | Purpose                                       | How it evolves                                                               |
| -------------------- | --------------------------------------------- | ---------------------------------------------------------------------------- |
| **Eval dataset**     | Measures model quality against a rubric       | Stays stable: a fixed set of challenging examples that act as your benchmark |
| **Training dataset** | Data the model learns from during fine-tuning | Changes often as you iterate on data quality and coverage                    |

### The zero-overlap rule

Catalyst automatically enforces zero overlap between training and eval datasets. If a training dataset overlaps with an eval dataset, the overlapping data is excluded from the training dataset when a new training run is created.
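
Catalyst applies this rule for you, so the following is only an illustration of the idea. It is a minimal Python sketch that assumes two examples overlap when their inputs are identical; how Catalyst actually detects overlap is not described here, and the `input` and `output` field names and both helper functions are hypothetical.

```python
import hashlib
import json


def example_key(example: dict) -> str:
    """Return a stable fingerprint of an example, based on its input only."""
    # Assumption: "overlap" means the inputs are identical.
    canonical = json.dumps(example["input"], sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def exclude_eval_overlap(training: list[dict], evals: list[dict]) -> list[dict]:
    """Drop every training example whose input also appears in the eval set."""
    eval_keys = {example_key(e) for e in evals}
    return [t for t in training if example_key(t) not in eval_keys]


# Hypothetical records; the field names are illustrative.
eval_set = [{"input": "Summarize: the cat sat.", "output": "A cat sat."}]
train_set = [
    {"input": "Summarize: the cat sat.", "output": "Cat sat down."},
    {"input": "Summarize: the dog ran.", "output": "A dog ran."},
]

filtered = exclude_eval_overlap(train_set, eval_set)
print(len(filtered))  # 1
```

Only the second training example survives, because the first shares its input with an eval example. Keeping the two sets disjoint means eval scores measure generalization, not memorization.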

## Key concepts

| Concept                | Description                                                                                                                                                 |
| ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Build from traffic** | Filter your captured production inferences and save them as a dataset. The best datasets come from real usage.                                              |
| **Upload**             | Bring your own JSONL files when you have curated data or are migrating from another platform.                                                               |
| **Dataset format**     | The schema your data needs to follow. See [Dataset Formats](/platform/datasets/formats) for supported fields and validation rules.                          |
| **Task tags**          | Use [task tags](/platform/gateway/tasks) when building from traffic to filter by objective. This gives you clean, focused samples instead of mixed traffic. |
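
Before uploading a JSONL file, a quick local check that every line parses as a JSON object can catch basic mistakes. The sketch below is a generic sanity check, not Catalyst's validation: the `check_jsonl` helper and the sample field names are hypothetical, and the actual required fields are defined in [Dataset Formats](/platform/datasets/formats).

```python
import json


def check_jsonl(text: str) -> list[str]:
    """Return a list of problems found; an empty list means every line parsed."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are skipped in this sketch
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append(f"line {number}: invalid JSON ({err.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: expected a JSON object")
    return problems


# Hypothetical sample: line 2 is an array and line 3 is missing a closing brace.
sample = '{"input": "hi", "output": "hello"}\n[1, 2]\n{"input": "broken"\n'
issues = check_jsonl(sample)
for issue in issues:
    print(issue)
```

This only confirms that the file is structurally valid JSONL. It does not check the fields a dataset needs.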

## Tips for good datasets

* **Diverse training data** leads to models that generalize well. A model trained on narrow, homogeneous data tends to struggle with edge cases.
* **Stable eval data** gives you a consistent benchmark. Avoid changing your eval dataset frequently: it's the measuring stick.
* **Start with production traffic** when possible. Real user inputs reflect the actual distribution of requests your model will see, which synthetic data struggles to reproduce.
* **Use task tags** to filter by objective before saving a dataset. A dataset scoped to a single task is almost always more useful than one built from mixed traffic.
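
The last tip can be pictured with a small Python sketch. It assumes captured records are available locally as dictionaries with a `task` field; that record shape and the `filter_by_task` helper are hypothetical, and in Catalyst the filtering happens when you build a dataset from traffic.

```python
from collections import Counter

# Hypothetical captured records; the "task" field name is an assumption.
records = [
    {"task": "summarize", "input": "Summarize: the cat sat.", "output": "A cat sat."},
    {"task": "classify", "input": "Is this spam? WIN NOW", "output": "spam"},
    {"task": "summarize", "input": "Summarize: the dog ran.", "output": "A dog ran."},
    {"input": "untagged request", "output": "untagged response"},
]


def filter_by_task(items: list[dict], task: str) -> list[dict]:
    """Keep only the records tagged with the given task."""
    return [item for item in items if item.get("task") == task]


# Count records per tag first to see how mixed the traffic is.
tag_counts = Counter(item.get("task", "untagged") for item in records)
summarize_only = filter_by_task(records, "summarize")

print(dict(tag_counts))
print(len(summarize_only))
```

Counting tags before filtering shows how mixed the traffic is and whether a single task has enough examples to be worth saving as its own dataset.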

## Next steps

<CardGroup cols={2}>
  <Card title="Build from traffic" icon="satellite-dish" href="/platform/datasets/build-from-traffic">
    Turn filtered production traffic into a dataset.
  </Card>

  <Card title="Upload a dataset" icon="upload" href="/platform/datasets/upload-a-dataset">
    Bring your own JSONL files.
  </Card>

  <Card title="Set up your first eval" icon="flask" href="/get-started/run-first-eval">
    Use your dataset to compare models.
  </Card>

  <Card title="Train a custom model" icon="brain" href="/get-started/train-and-deploy">
    Use your dataset to fine-tune a task-specific model.
  </Card>
</CardGroup>
