Once you have your first observed request, the next job is to turn useful slices of traffic into reusable datasets.

Why you would do this

Datasets are the bridge from passive observability to active model improvement. Without them, you cannot run meaningful evals or launch a trustworthy training run.

What you’ll have when you finish

  • one eval dataset
  • one training dataset
  • a clean separation between evaluation and training data

Before you start

You should already have inference traffic flowing, with at least one observed request visible on the Inferences page.

Step 1: narrow the traffic to one real workflow

On the Inferences page, filter by the dimensions that make the workflow coherent:
  • environment
  • task
  • provider
  • model
  • status
  • time range
Your goal is not “all traffic.” Your goal is “the slice of traffic I want to measure and improve.”
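Conceptually, this step is just a filter over inference records. The sketch below illustrates it with hypothetical field names (`environment`, `task`, `status`, `timestamp`) and a hand-built `inferences` list; it is not the product's actual schema or API.

```python
from datetime import datetime, timezone

# Hypothetical inference records; field names mirror the filter dimensions above.
inferences = [
    {"id": 1, "environment": "production", "task": "support-bot",
     "provider": "openai", "model": "gpt-4o", "status": "success",
     "timestamp": datetime(2026, 1, 10, tzinfo=timezone.utc)},
    {"id": 2, "environment": "staging", "task": "support-bot",
     "provider": "openai", "model": "gpt-4o", "status": "success",
     "timestamp": datetime(2026, 1, 11, tzinfo=timezone.utc)},
]

def filter_slice(records, *, environment, task, status, start, end):
    """Keep only records belonging to one coherent workflow slice."""
    return [
        r for r in records
        if r["environment"] == environment
        and r["task"] == task
        and r["status"] == status
        and start <= r["timestamp"] < end
    ]

slice_ = filter_slice(
    inferences,
    environment="production",
    task="support-bot",
    status="success",
    start=datetime(2026, 1, 1, tzinfo=timezone.utc),
    end=datetime(2026, 2, 1, tzinfo=timezone.utc),
)
print([r["id"] for r in slice_])  # only the production record remains
```

The point of the sketch: every dimension you filter on makes the slice more coherent, and the time range bounds it so the same slice can be recreated later.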

Step 2: make sure the slice is valid for datasets

The current dataset creation flow enforces a few important constraints:
  • datasets only include successful requests
  • eval datasets are capped at 10,000 rows
  • training datasets can go up to 1,000,000 rows
  • training datasets must exclude an eval dataset to prevent leakage
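As a rough sketch, the first three constraints amount to checks like the following. The function name, record shape, and the idea of applying these checks yourself are illustrative assumptions; the product enforces them for you, and the row caps are the ones listed above.

```python
EVAL_MAX_ROWS = 10_000        # eval dataset cap
TRAINING_MAX_ROWS = 1_000_000 # training dataset cap

def validate_slice(records, *, kind):
    """Apply the dataset constraints listed above to a candidate slice.

    kind is "eval" or "training"; returns the rows that would be kept.
    """
    # Datasets only include successful requests, so everything else is dropped.
    successful = [r for r in records if r["status"] == "success"]
    cap = EVAL_MAX_ROWS if kind == "eval" else TRAINING_MAX_ROWS
    if len(successful) > cap:
        raise ValueError(f"{kind} dataset exceeds the {cap:,}-row cap")
    return successful

rows = validate_slice(
    [{"status": "success"}, {"status": "error"}],
    kind="eval",
)
print(len(rows))  # the failed request is dropped, leaving 1 row
```

The fourth constraint, excluding an eval dataset from training, is handled in Step 4.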

Step 3: save an eval dataset first

Start with the eval dataset. Why first? Because the training flow requires you to choose an existing eval dataset whose overlapping rows will be excluded. Characteristics of a good eval dataset:
  • representative of the real workflow
  • small enough to review and rerun often
  • difficult enough to catch regressions

Step 4: save the training dataset

When you save the training dataset, choose the eval dataset you just created. The current product automatically excludes overlapping inferences from training so you do not leak eval data into the training set. If overlap is high, fix it before training:
  • if overlap is 100%, the training dataset would be empty
  • if overlap is above 25%, the product warns that the training slice is probably too narrow
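The overlap rules above can be restated as a small calculation, assuming inferences are identified by IDs. This is a sketch of the logic, not the product's implementation; the function name is hypothetical, but the 100% and 25% thresholds come straight from the list above.

```python
def overlap_report(training_ids, eval_ids):
    """Measure how much of a training slice overlaps an eval dataset."""
    overlap = set(training_ids) & set(eval_ids)
    ratio = len(overlap) / len(training_ids) if training_ids else 0.0
    remaining = len(training_ids) - len(overlap)  # rows left after exclusion
    if ratio == 1.0:
        warning = "training dataset would be empty after exclusion"
    elif ratio > 0.25:
        warning = "training slice is probably too narrow"
    else:
        warning = None
    return remaining, ratio, warning

# Half of the training rows also appear in the eval dataset:
remaining, ratio, warning = overlap_report([1, 2, 3, 4], [1, 2])
print(remaining, ratio, warning)  # 2 rows survive; 50% overlap trips the warning
```

High overlap usually means the training filter is nearly identical to the eval filter; widen the training slice (a longer time range, more tasks) rather than shrinking the eval dataset.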

Step 5: name datasets for reruns, not for today

Good dataset names should still make sense after multiple evals or retraining cycles. Examples:
  • Support bot production eval - Jan 2026
  • Support bot training slice - production, tier-1 issues
  • Extraction eval - invoice workflow

Verify it worked

You should now have:
  • one dataset of type eval
  • one dataset of type training
  • a training dataset that excludes the eval rows

What to do next

Build a Real-Traffic Eval Baseline

Use the eval dataset to create your first repeatable quality check.

Turn Eval Failures into a Training Run

Once the baseline is stable, use the paired datasets to launch training.