> ## Documentation Index
> Fetch the complete documentation index at: https://docs.inference.net/llms.txt
> Use this file to discover all available pages before exploring further.

# Build a Dataset from Traffic

> Turn production traffic into datasets for evaluation and training.

The most useful datasets come from real production traffic. Catalyst lets you filter your captured inferences and save the results as a dataset, ready for evals or training.

You can create a dataset from traffic in two places: the **Create Dataset** button in the Datasets tab, or **Save as Dataset** in the [Inference Viewer](/platform/gateway/inference-viewer). Both follow the same flow.

## The flow

<Steps>
  <Step title="Filter your traffic">
    Filter by model, task, provider, status code, or any tracked dimension until you have a representative slice of traffic.
  </Step>

  <Step title="Choose a dataset type">
    Decide whether this will be an **eval dataset** or a **training dataset**. Remember the [zero-overlap rule](/platform/datasets/overview#the-zero-overlap-rule), training and eval data must never share examples.
  </Step>

  <Step title="Save as dataset">
    Name the dataset and save. It's immediately available for evals or training.
  </Step>
</Steps>

## Getting clean samples

The quality of your dataset depends on how well you filter. A few tips:

* **Filter by [task](/platform/gateway/tasks)** to get samples for a specific objective rather than a mix of everything
* **Exclude errors** unless you specifically want failure cases (e.g. for training a model to handle edge cases)
* **Check the date range** - a dataset pulled from a single day might not capture the full variety of inputs your app sees

## Eval vs training: different goals, different data

**Eval datasets** should be small, stable, and challenging. Pick examples that represent the hard cases — the ones where you're not sure the model will get it right. These become your benchmark, so don't change them often.

**Training datasets** should be large, diverse, and representative. The more variety, the better the model generalizes. Iterate on these as you learn what the model struggles with.

## Next steps

<CardGroup cols={2}>
  <Card title="Upload your own data" icon="upload" href="/platform/datasets/upload-a-dataset">
    Already have curated data? Upload it directly.
  </Card>

  <Card title="Dataset formats" icon="file-code" href="/platform/datasets/formats">
    Supported schemas and validation rules.
  </Card>
</CardGroup>
