Datasets can be created from captured traffic or uploaded as JSONL files. This page covers the supported formats.
The system auto-detects the format from the first valid line. All rows in a file must use the same format.
You cannot mix source-backed and Hugging Face rows in the same file. Mixed-format files fail validation.
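To make the detection rule concrete, here is a minimal sketch of how a file could be checked locally before upload. The function names and the format labels are hypothetical, and treating a `request` key as source-backed and a `messages` key as Hugging Face is an assumption drawn from the field tables on this page, not the service's actual detection logic.

```python
import json
from typing import Iterable, List, Optional


def classify_row(row: object) -> Optional[str]:
    """Return a format label for one parsed row, or None if it matches neither."""
    if not isinstance(row, dict):
        return None
    if "request" in row:      # assumption: source-backed rows carry "request"
        return "source-backed"
    if "messages" in row:     # assumption: Hugging Face rows carry "messages"
        return "hugging-face"
    return None


def detect_format(lines: Iterable[str]) -> Optional[str]:
    """Mimic the documented rule: the first valid line decides the format."""
    for line in lines:
        try:
            label = classify_row(json.loads(line))
        except json.JSONDecodeError:
            continue  # not valid JSON, keep looking
        if label is not None:
            return label
    return None


def find_mixed_rows(lines: Iterable[str]) -> List[int]:
    """Return 1-based line numbers whose format differs from the detected one."""
    materialized = list(lines)
    expected = detect_format(materialized)
    mismatched = []
    for number, line in enumerate(materialized, start=1):
        try:
            label = classify_row(json.loads(line))
        except json.JSONDecodeError:
            continue  # JSON errors are a separate problem from mixed formats
        if label is not None and label != expected:
            mismatched.append(number)
    return mismatched
```

Running `find_mixed_rows` over a file's lines before uploading gives the line numbers that would need to be moved into a separate file.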
In the source-backed format, each line has a top-level request object and an optional response object, each holding the raw provider body.
| Field | Required | Description |
|---|---|---|
| request | Yes | Raw provider request body that includes a usable model value |
| response | No | Raw provider response body, or null if you only have requests |
Validation notes:
- The request must include a usable model value.
- response may be omitted or set to null if you only have request-side data.
{"request":{"model":"gpt-4","messages":[{"role":"user","content":"Hello"}],"temperature":0.7,"max_tokens":100},"response":{"id":"chatcmpl-123","object":"chat.completion","created":1700000000,"model":"gpt-4","choices":[{"index":0,"message":{"role":"assistant","content":"Hi there!"},"finish_reason":"stop"}]}}
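The checks for a source-backed row can be approximated locally. The sketch below is illustrative only: `check_source_backed_line` is a hypothetical helper, and reading "usable model value" as "a non-empty string" is an assumption, since this page does not define it more precisely.

```python
import json
from typing import List


def check_source_backed_line(line: str) -> List[str]:
    """Return a list of problems found in one source-backed JSONL line."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(row, dict):
        return ["row must be a JSON object"]

    problems = []
    request = row.get("request")
    if not isinstance(request, dict):
        problems.append("request must be an object")
    else:
        model = request.get("model")
        # Assumption: "usable" means a non-empty string.
        if not isinstance(model, str) or not model.strip():
            problems.append("request.model must be a non-empty string")

    # response may be missing or null; otherwise this sketch expects an object.
    response = row.get("response")
    if response is not None and not isinstance(response, dict):
        problems.append("response must be an object or null")
    return problems
```

An empty list means the line passed these local checks; the service may apply further validation that is not modeled here.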
In the Hugging Face format, each line has a top-level messages array of role/content objects.
| Field | Required | Description |
|---|---|---|
| messages | Yes | Array of { role, content } objects (at least one required) |
| id | No | Optional row identifier (stored in metadata) |
| tools | No | Optional top-level tool definitions |
Valid roles: system, user, assistant, tool.
Additional supported fields:
- content may be a string or an array of content parts for multimodal rows.
- Assistant messages may include tool_calls.
- Tool messages must include tool_call_id.
- Top-level tools are preserved on import.
{"messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"What is 2+2?"},{"role":"assistant","content":"4"}]}
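The rules for Hugging Face rows (non-empty messages, valid roles, tool_call_id on tool messages) can be sketched as a local pre-check. `check_messages_row` is a hypothetical helper, and allowing content to be absent or null is an assumption made so that assistant messages carrying only tool_calls are not rejected; the real validator may differ.

```python
from typing import Any, Dict, List

VALID_ROLES = {"system", "user", "assistant", "tool"}


def check_messages_row(row: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one parsed Hugging Face-format row."""
    messages = row.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["messages must be a non-empty array"]

    problems = []
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"messages[{index}] must be an object")
            continue
        role = message.get("role")
        if not isinstance(role, str) or role not in VALID_ROLES:
            problems.append(f"messages[{index}] has invalid role {role!r}")
        content = message.get("content")
        # Assumption: missing or null content is tolerated; otherwise it must
        # be a string or an array of content parts.
        if content is not None and not isinstance(content, (str, list)):
            problems.append(f"messages[{index}] content must be a string or array")
        if role == "tool" and "tool_call_id" not in message:
            problems.append(f"messages[{index}] is a tool message without tool_call_id")
    return problems
```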
When importing Hugging Face-format rows, the system:
- Treats the last assistant turn as the imported response and earlier turns as request context
- Synthesizes request/response payloads so evals and detail views work
- Sets request_model to unknown-imported-model
- Sets token usage and costs to zero
- Stores the original row id in metadata as importOriginalRowId
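The import steps listed above can be pictured with a rough sketch. Only `unknown-imported-model` and `importOriginalRowId` come from this page; the function name and the `context`, `response`, `usage`, and `metadata` keys are hypothetical and do not reflect the stored schema. The page does not say what happens to turns that follow the last assistant turn, so this sketch simply leaves them out.

```python
from typing import Any, Dict, List, Optional


def split_imported_row(row: Dict[str, Any]) -> Dict[str, Any]:
    """Split a Hugging Face-format row into request context and response."""
    messages: List[Dict[str, Any]] = row["messages"]

    # Find the index of the last assistant turn, if any.
    last_assistant: Optional[int] = None
    for index, message in enumerate(messages):
        if message.get("role") == "assistant":
            last_assistant = index

    if last_assistant is None:
        # No assistant turn: everything is context and there is no response.
        context, response = messages, None
    else:
        context = messages[:last_assistant]
        response = messages[last_assistant]

    metadata: Dict[str, Any] = {}
    if "id" in row:
        metadata["importOriginalRowId"] = row["id"]

    return {
        "request_model": "unknown-imported-model",
        "context": context,                  # hypothetical key
        "response": response,                # hypothetical key
        "usage": {"tokens": 0, "cost": 0},   # hypothetical shape, zeroed as documented
        "metadata": metadata,
    }
```

A practical consequence is that cost and token figures for imported rows are placeholders, so they should not be compared with figures from captured traffic.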
Validation behavior
- Files must be valid JSONL.
- Invalid rows are reported with line numbers in the upload status details.
- Uploads can complete with partial failures if at least one row imports successfully.
- If every row fails validation, the upload status is failed.
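The partial-failure rule above amounts to a small decision: the upload fails only when no row imports. A minimal sketch follows, in which `summarize_upload` and the `completed` label are hypothetical; `failed` is the only status value named on this page.

```python
from typing import Dict, Iterable


def summarize_upload(total_rows: int, failed_lines: Iterable[int]) -> Dict[str, object]:
    """Derive an upload outcome from the line numbers that failed validation."""
    failed = sorted(set(failed_lines))
    imported = total_rows - len(failed)
    # The upload fails only when no row imported successfully.
    status = "failed" if imported <= 0 else "completed"  # "completed" is a placeholder label
    return {"status": status, "imported": imported, "failed_lines": failed}
```

For example, a three-row file with one bad line still completes, with the bad line number reported alongside the result.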
Upload limits
| Limit | Value |
|---|---|
| Maximum file size | 10 GB |
| Maximum line count | 1,000,000 |
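A file can be checked against these limits before uploading. This is a sketch under assumptions: `check_upload_limits` is a hypothetical helper, 10 GB is interpreted as 10 × 1024³ bytes, and every physical line (including blank ones) is counted. The service may measure both differently.

```python
import os
from typing import List

MAX_FILE_BYTES = 10 * 1024**3  # assumption: binary gigabytes
MAX_LINES = 1_000_000


def check_upload_limits(path: str,
                        max_bytes: int = MAX_FILE_BYTES,
                        max_lines: int = MAX_LINES) -> List[str]:
    """Return a list of limit violations for the file at `path`."""
    problems = []
    size = os.path.getsize(path)
    if size > max_bytes:
        problems.append(f"file is {size} bytes, limit is {max_bytes}")

    # Stream the file in binary mode so large files are not loaded into memory.
    with open(path, "rb") as handle:
        line_count = sum(1 for _ in handle)
    if line_count > max_lines:
        problems.append(f"file has {line_count} lines, limit is {max_lines}")
    return problems
```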
Datasets can be downloaded in two formats:
| Format | Description | Best for |
|---|---|---|
| Hugging Face | { id, messages } per row (default) | Training, fine-tuning, sharing |
| Source-backed | { request, response } per row | Re-uploading, round-trip testing |
In the UI, click Download and choose the format. In the CLI:
# Default (Hugging Face)
inf dataset download my-dataset
# Source-backed format
inf dataset download my-dataset --format source-backed
Hugging Face exports skip rows with empty message arrays. Source-backed exports include all rows with a valid request payload. Row counts may differ between formats.