Datasets are collections of LLM inputs and outputs used for evaluation and fine-tuning. They can come from two places: your live production traffic captured through Observe, or files you upload directly. Everything downstream depends on good data. Evals need representative examples to measure model quality. Training needs diverse, high-quality samples to teach a model your task. Datasets are where both start.

Types of datasets

Type | Purpose | How it evolves
Eval dataset | Measures model quality against a rubric | Stays stable, a fixed set of challenging examples that act as your benchmark
Training dataset | Data the model learns from during fine-tuning | Changes often as you iterate on data quality and coverage

The zero-overlap rule

Training and eval datasets must not share any data. If the model trains on eval examples, the eval becomes meaningless because you’re testing memorization, not generalization. Always keep them separate.
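One way to check the rule before training is to fingerprint every record and confirm that no fingerprint appears in both datasets. The sketch below assumes both datasets are JSONL with one JSON object per line. The function names, the sample field names (`input`, `output`), and the choice to compare whole records are illustrative assumptions, not part of the platform.

```python
import hashlib
import json


def record_fingerprint(record: dict) -> str:
    """Hash a record so that key order does not affect the result."""
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def load_fingerprints(jsonl_text: str) -> set[str]:
    """Fingerprint every non-empty line of a JSONL document."""
    return {
        record_fingerprint(json.loads(line))
        for line in jsonl_text.splitlines()
        if line.strip()
    }


def find_overlap(train_jsonl: str, eval_jsonl: str) -> set[str]:
    """Return fingerprints of records present in both datasets."""
    return load_fingerprints(train_jsonl) & load_fingerprints(eval_jsonl)


# Hypothetical records; the field names are assumptions for illustration.
train = '{"input": "a", "output": "1"}\n{"input": "b", "output": "2"}\n'
evals = '{"output": "2", "input": "b"}\n{"input": "c", "output": "3"}\n'

overlap = find_overlap(train, evals)
if overlap:
    print(f"{len(overlap)} record(s) appear in both datasets")
```

Exact-match hashing only catches identical records. Near-duplicates, such as the same prompt with different whitespace, would need normalization before hashing.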

Key concepts

Concept | Description
Build from traffic | Filter your captured production inferences and save them as a dataset. The best datasets come from real usage.
Upload | Bring your own JSONL files when you have curated data or are migrating from another platform.
Dataset format | The schema your data needs to follow. See Dataset Formats for supported fields and validation rules.
Task tags | Use task tags when building from traffic to filter by objective. This gives you clean, focused samples instead of mixed traffic.
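To illustrate what filtering by task tag does conceptually, here is a minimal offline sketch that keeps only the records for one objective and serializes them as JSONL. The record shape, the `task_tags` field name, and both helper functions are hypothetical. The actual schema is defined in Dataset Formats, and in practice the filtering happens on your captured traffic rather than in a local script.

```python
import json


def filter_by_task(records: list[dict], task: str) -> list[dict]:
    """Keep records whose (assumed) task_tags list contains the given task."""
    return [r for r in records if task in r.get("task_tags", [])]


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSONL: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


# Hypothetical captured inferences; field names are assumptions.
captured = [
    {"input": "Summarize ticket 1", "output": "Short summary", "task_tags": ["summarize"]},
    {"input": "Classify ticket 2", "output": "billing", "task_tags": ["classify"]},
    {"input": "Summarize ticket 3", "output": "Another summary", "task_tags": ["summarize"]},
]

summaries = filter_by_task(captured, "summarize")
jsonl_text = to_jsonl(summaries)
print(jsonl_text, end="")
```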

Tips for good datasets

  • Diverse training data leads to models that generalize well. If your training data covers only a narrow slice of inputs, the trained model is likely to struggle on edge cases.
  • Stable eval data gives you a consistent benchmark. Don’t change your eval dataset frequently; it’s the measuring stick.
  • Start with production traffic when possible. Real user inputs reflect the actual distribution of requests your model will see, which synthetic data struggles to reproduce.
  • Use task tags to filter by objective before saving a dataset. A dataset scoped to a single task is almost always more useful than one built from mixed traffic.
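If you keep a local copy of your eval dataset, one way to notice accidental changes is to record a fingerprint of it and compare that value before each eval run. This is a sketch under the assumption that the dataset is JSONL. The `dataset_fingerprint` helper and the sample records are hypothetical, not a platform feature.

```python
import hashlib
import json


def dataset_fingerprint(jsonl_text: str) -> str:
    """Digest of a JSONL dataset that ignores record order and key order."""
    digests = sorted(
        hashlib.sha256(
            json.dumps(json.loads(line), sort_keys=True).encode("utf-8")
        ).hexdigest()
        for line in jsonl_text.splitlines()
        if line.strip()
    )
    return hashlib.sha256("\n".join(digests).encode("utf-8")).hexdigest()


# Hypothetical eval records; field names are assumptions.
baseline = '{"input": "q1", "output": "a1"}\n{"input": "q2", "output": "a2"}\n'
reordered = '{"input": "q2", "output": "a2"}\n{"input": "q1", "output": "a1"}\n'
edited = '{"input": "q1", "output": "a1"}\n{"input": "q2", "output": "changed"}\n'

same = dataset_fingerprint(baseline) == dataset_fingerprint(reordered)
changed = dataset_fingerprint(baseline) != dataset_fingerprint(edited)
print(same, changed)
```

Scores from two eval runs are only comparable when the fingerprint recorded with each run is the same.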

Next steps

Build from traffic

Turn filtered production traffic into a dataset.

Upload a dataset

Bring your own JSONL files.

Set up your first eval

Use your dataset to compare models.

Train a custom model

Use your dataset to fine-tune a task-specific model.