Why you would do this
Datasets are the bridge from passive observability to active model improvement. Without them, you cannot run meaningful evals or launch a trustworthy training run.

What you’ll have when you finish
- one eval dataset
- one training dataset
- a clean separation between evaluation and training data
Before you start
- complete /start-here/observe-quickstart
- confirm you can filter real traffic in the dashboard
Step 1: narrow the traffic to one real workflow
On the Inferences page, filter by the dimensions that make the workflow coherent:
- environment
- task
- provider
- model
- status
- time range
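If you export the slice for inspection, the same filtering can be sketched in plain Python. This is a minimal sketch, not the product's API: the record fields (`environment`, `task`, `status`, `timestamp`) and the `filter_slice` helper are hypothetical names chosen for illustration.

```python
from datetime import datetime

# Hypothetical inference records, shaped like a dashboard export.
inferences = [
    {"environment": "production", "task": "support-bot", "provider": "openai",
     "model": "gpt-4o", "status": "success", "timestamp": datetime(2026, 1, 10)},
    {"environment": "staging", "task": "support-bot", "provider": "openai",
     "model": "gpt-4o", "status": "success", "timestamp": datetime(2026, 1, 11)},
]

def filter_slice(records, *, environment, task, status, since):
    """Keep only records matching one coherent workflow.

    Provider and model would be filtered the same way.
    """
    return [
        r for r in records
        if r["environment"] == environment
        and r["task"] == task
        and r["status"] == status
        and r["timestamp"] >= since
    ]

slice_ = filter_slice(
    inferences,
    environment="production",
    task="support-bot",
    status="success",
    since=datetime(2026, 1, 1),
)
print(len(slice_))  # 1
```

The point of the sketch is that every dimension narrows the slice toward one coherent workflow; a slice that needs no filters is probably mixing workflows.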
Step 2: make sure the slice is valid for datasets
The current dataset creation flow enforces a few important constraints:
- datasets only include successful requests
- eval datasets are capped at 10,000 rows
- training datasets can go up to 1,000,000 rows
- training datasets must exclude an eval dataset to prevent leakage
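The product enforces these constraints for you, but if you pre-screen a slice yourself, the first three rules can be sketched as a small validation helper. This is a hypothetical sketch; `validate_slice` and the record shape are not part of the product.

```python
EVAL_MAX_ROWS = 10_000         # eval dataset cap, per the constraints above
TRAINING_MAX_ROWS = 1_000_000  # training dataset cap

def validate_slice(records, dataset_type):
    """Drop failed requests and check the slice against the row cap."""
    successful = [r for r in records if r["status"] == "success"]
    cap = EVAL_MAX_ROWS if dataset_type == "eval" else TRAINING_MAX_ROWS
    if len(successful) > cap:
        raise ValueError(f"{dataset_type} dataset exceeds {cap:,} rows")
    return successful

rows = [{"status": "success"}, {"status": "error"}, {"status": "success"}]
print(len(validate_slice(rows, "eval")))  # 2
```

The fourth constraint, eval exclusion, depends on which eval dataset you pick, which is why the eval dataset has to exist first (Step 3).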
Step 3: save an eval dataset first
Start with the eval dataset. Why first? Because the training flow depends on you choosing an existing eval dataset to exclude overlapping rows.

Good eval dataset characteristics:
- representative of the real workflow
- small enough to review and rerun often
- difficult enough to catch regressions
Step 4: save the training dataset
When you save the training dataset, choose the eval dataset you just created. The product automatically excludes overlapping inferences from training so you do not leak eval data into the training set. If overlap is high, fix it before training:
- if overlap is 100%, the training dataset would be empty
- if overlap is above 25%, the product warns that the training slice is probably too narrow
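The exclusion-and-warning logic above can be sketched with set arithmetic on inference IDs. This is an illustrative sketch of the behavior described, not the product's implementation; `split_training` is a hypothetical name.

```python
def split_training(candidate_ids, eval_ids, warn_threshold=0.25):
    """Remove eval rows from a candidate training slice and report overlap."""
    overlap = candidate_ids & eval_ids
    ratio = len(overlap) / len(candidate_ids) if candidate_ids else 0.0
    if ratio == 1.0:
        # Every candidate row is in the eval set: nothing left to train on.
        raise ValueError("100% overlap: training dataset would be empty")
    if ratio > warn_threshold:
        print(f"warning: {ratio:.0%} overlap; training slice may be too narrow")
    return candidate_ids - eval_ids, ratio

training_ids, ratio = split_training({"a", "b", "c", "d"}, {"a"})
print(sorted(training_ids), ratio)  # ['b', 'c', 'd'] 0.25
```

High overlap usually means the training filter is nearly identical to the eval filter; widen the training slice (longer time range, more tasks) rather than shrinking the eval set.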
Step 5: name datasets for reruns, not for today
Good dataset names should still make sense after multiple evals or retraining cycles. Examples:
- Support bot production eval - Jan 2026
- Support bot training slice - production, tier-1 issues
- Extraction eval - invoice workflow
Verify it worked
You should now have:
- one dataset of type eval
- one dataset of type training
- a training dataset that excludes the eval rows
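If you export both datasets, the third point is cheap to double-check: no eval row ID should appear in the training set. A minimal sketch, assuming you can get row IDs for each dataset (the function name is hypothetical):

```python
def verify_no_leakage(eval_ids, training_ids):
    """Confirm the training dataset excludes every eval row."""
    leaked = eval_ids & training_ids
    assert not leaked, f"eval rows leaked into training: {sorted(leaked)}"
    return True

print(verify_no_leakage({"a", "b"}, {"c", "d"}))  # True
```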
What to do next
Build a Real-Traffic Eval Baseline
Use the eval dataset to create your first repeatable quality check.
Turn Eval Failures into a Training Run
Once the baseline is stable, use the paired datasets to launch training.