Why you would do this
Datasets are the bridge from passive observability to active model improvement. Without them, you cannot run meaningful evals or launch a trustworthy training run.

What you’ll have when you finish
- one eval dataset
- one training dataset
- a clean separation between evaluation and training data
Before you start
- complete /start-here/observe-quickstart
- confirm you can filter real traffic in the dashboard
Step 1: narrow the traffic to one real workflow
On the Inferences page, filter by the dimensions that make the workflow coherent:
- environment
- task
- provider
- model
- status
- time range
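If you export the slice for inspection, the same filtering can be sketched in plain Python. This is a minimal sketch, not the product's API: the record fields (`environment`, `task`, `status`, `timestamp`) and the `filter_slice` helper are hypothetical names chosen for illustration.

```python
from datetime import datetime

# Hypothetical inference records, shaped like a dashboard export.
inferences = [
    {"environment": "production", "task": "support-bot", "provider": "openai",
     "model": "gpt-4o", "status": "success", "timestamp": datetime(2026, 1, 10)},
    {"environment": "staging", "task": "support-bot", "provider": "openai",
     "model": "gpt-4o", "status": "success", "timestamp": datetime(2026, 1, 11)},
]

def filter_slice(records, *, environment, task, status, since):
    """Keep only records matching one coherent workflow.

    Provider and model would be filtered the same way.
    """
    return [
        r for r in records
        if r["environment"] == environment
        and r["task"] == task
        and r["status"] == status
        and r["timestamp"] >= since
    ]

slice_ = filter_slice(
    inferences,
    environment="production",
    task="support-bot",
    status="success",
    since=datetime(2026, 1, 1),
)
print(len(slice_))  # 1
```

The point of the sketch is that every dimension narrows the slice toward one coherent workflow; a slice that needs no filters is probably mixing workflows.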
Step 2: make sure the slice is valid for datasets
The current dataset creation flow enforces a few important constraints:
- datasets only include successful requests
- eval datasets are capped at 10,000 rows
- training datasets can go up to 1,000,000 rows
- training datasets must exclude an eval dataset to prevent leakage
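The product enforces these constraints for you, but if you pre-screen a slice yourself, the first three rules can be sketched as a small validation helper. This is a hypothetical sketch; `validate_slice` and the record shape are not part of the product.

```python
EVAL_MAX_ROWS = 10_000         # eval dataset cap, per the constraints above
TRAINING_MAX_ROWS = 1_000_000  # training dataset cap

def validate_slice(records, dataset_type):
    """Drop failed requests and check the slice against the row cap."""
    successful = [r for r in records if r["status"] == "success"]
    cap = EVAL_MAX_ROWS if dataset_type == "eval" else TRAINING_MAX_ROWS
    if len(successful) > cap:
        raise ValueError(f"{dataset_type} dataset exceeds {cap:,} rows")
    return successful

rows = [{"status": "success"}, {"status": "error"}, {"status": "success"}]
print(len(validate_slice(rows, "eval")))  # 2
```

The fourth constraint, eval exclusion, depends on which eval dataset you pick, which is why the eval dataset has to exist first (Step 3).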
Step 3: save an eval dataset first
Start with the eval dataset. Why first? Because the training flow depends on you choosing an existing eval dataset to exclude overlapping rows.

Good eval dataset characteristics:
- representative of the real workflow
- small enough to review and rerun often
- difficult enough to catch regressions
Step 4: save the training dataset
When you save the training dataset, choose the eval dataset you just created. The product automatically excludes overlapping inferences from training so you do not leak eval data into the training set. If overlap is high, fix it before training:
- if overlap is 100%, the training dataset would be empty
- if overlap is above 25%, the product warns that the training slice is probably too narrow
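The exclusion-and-warning logic above can be sketched with set arithmetic on inference IDs. This is an illustrative sketch of the behavior described, not the product's implementation; `split_training` is a hypothetical name.

```python
def split_training(candidate_ids, eval_ids, warn_threshold=0.25):
    """Remove eval rows from a candidate training slice and report overlap."""
    overlap = candidate_ids & eval_ids
    ratio = len(overlap) / len(candidate_ids) if candidate_ids else 0.0
    if ratio == 1.0:
        # Every candidate row is in the eval set: nothing left to train on.
        raise ValueError("100% overlap: training dataset would be empty")
    if ratio > warn_threshold:
        print(f"warning: {ratio:.0%} overlap; training slice may be too narrow")
    return candidate_ids - eval_ids, ratio

training_ids, ratio = split_training({"a", "b", "c", "d"}, {"a"})
print(sorted(training_ids), ratio)  # ['b', 'c', 'd'] 0.25
```

High overlap usually means the training filter is nearly identical to the eval filter; widen the training slice (longer time range, more tasks) rather than shrinking the eval set.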
Step 5: name datasets for reruns, not for today
Good dataset names should still make sense after multiple evals or retraining cycles. Examples:
- Support bot production eval - Jan 2026
- Support bot training slice - production, tier-1 issues
- Extraction eval - invoice workflow
Verify it worked
You should now have:
- one dataset of type eval
- one dataset of type training
- a training dataset that excludes the eval rows
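If you export both datasets, the third point is cheap to double-check: no eval row ID should appear in the training set. A minimal sketch, assuming you can get row IDs for each dataset (the function name is hypothetical):

```python
def verify_no_leakage(eval_ids, training_ids):
    """Confirm the training dataset excludes every eval row."""
    leaked = eval_ids & training_ids
    assert not leaked, f"eval rows leaked into training: {sorted(leaked)}"
    return True

print(verify_no_leakage({"a", "b"}, {"c", "d"}))  # True
```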
What to do next
Build a Real-Traffic Eval Baseline
Use the eval dataset to create your first repeatable quality check.
Turn Eval Failures into a Training Run
Once the baseline is stable, use the paired datasets to launch training.