# Dataset Formats and Schemas

> JSONL upload formats, required fields, validation rules, and upload limits.

Datasets can be created from [captured traffic](/platform/datasets/build-from-traffic) or [uploaded as JSONL files](/platform/datasets/upload-a-dataset). This page covers the supported formats.

## Supported formats

The system auto-detects the format from the first valid line. All rows in a file must use the same format.

<Warning>
  You cannot mix source-backed and Hugging Face rows in the same file. Mixed-format files fail validation.
</Warning>
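As a rough local pre-check, the detection rule can be sketched in Python. This is only an illustration, not the platform's actual detector: the helper names `detect_format` and `check_single_format` are hypothetical, and the sketch assumes a row is classified purely by its top-level `request` or `messages` key, with `request` winning if both are present.

```python
import json

def detect_format(row):
    """Classify one parsed row by its top-level keys (assumed rule)."""
    if isinstance(row, dict):
        if "request" in row:
            return "source-backed"
        if "messages" in row:
            return "hugging-face"
    return None

def check_single_format(lines):
    """Return (detected_format, line_numbers_that_use_another_format)."""
    detected = None
    mismatched = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        try:
            fmt = detect_format(json.loads(line))
        except json.JSONDecodeError:
            fmt = None
        if fmt is None:
            continue  # unparseable or unrecognized rows do not drive detection
        if detected is None:
            detected = fmt  # the first valid line decides the format
        elif fmt != detected:
            mismatched.append(lineno)
    return detected, mismatched
```

A non-empty list of mismatched line numbers means the file mixes formats and would be expected to fail validation.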

### Source-backed format

Each line is a JSON object with a top-level `request` object and an optional `response` object. Both hold raw provider bodies.

| Field      | Required | Description                                                     |
| ---------- | -------- | --------------------------------------------------------------- |
| `request`  | Yes      | Raw provider request body that includes a usable `model` value  |
| `response` | No       | Raw provider response body, or `null` if you only have requests |

Validation notes:

* The request must include a usable model value.
* `response` may be omitted or set to `null` if you only have request-side data.

<Metadata text="datasets/format-source-backed" />

```json theme={"system"}
{"request":{"model":"gpt-4","messages":[{"role":"user","content":"Hello"}],"temperature":0.7,"max_tokens":100},"response":{"id":"chatcmpl-123","object":"chat.completion","created":1700000000,"model":"gpt-4","choices":[{"index":0,"message":{"role":"assistant","content":"Hi there!"},"finish_reason":"stop"}]}}
```
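The rules for source-backed rows can be checked before upload with a short Python sketch. The function name `check_source_backed_row` is hypothetical, and because this page does not define what makes a `model` value "usable", the sketch assumes it means a non-empty string.

```python
import json

def check_source_backed_row(line):
    """Return a list of problems found in one source-backed JSONL line."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(row, dict):
        return ["row is not a JSON object"]

    problems = []
    request = row.get("request")
    if not isinstance(request, dict):
        problems.append("missing or non-object `request`")
    else:
        model = request.get("model")
        # Assumption: "usable" is taken to mean a non-empty string.
        if not isinstance(model, str) or not model.strip():
            problems.append("`request.model` is missing or empty")

    # `response` may be omitted or null; otherwise it should be an object.
    response = row.get("response")
    if response is not None and not isinstance(response, dict):
        problems.append("`response` must be an object, null, or omitted")
    return problems
```

An empty list means the row passed these local checks. The server may apply additional rules that are not covered here.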

### Hugging Face format

Each line has a top-level `messages` array with `role`/`content` objects.

| Field      | Required | Description                                                  |
| ---------- | -------- | ------------------------------------------------------------ |
| `messages` | Yes      | Array of `{ role, content }` objects (at least one required) |
| `id`       | No       | Optional row identifier (stored in metadata)                 |
| `tools`    | No       | Optional top-level tool definitions                          |

Valid roles: `system`, `user`, `assistant`, `tool`.

Additional supported fields:

* `content` may be a string or an array of content parts for multimodal rows.
* Assistant messages may include `tool_calls`.
* Tool messages must include `tool_call_id`.
* Top-level `tools` are preserved on import.

<Metadata text="datasets/format-huggingface" />

```json theme={"system"}
{"messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"What is 2+2?"},{"role":"assistant","content":"4"}]}
```
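The field rules for this format can be approximated locally with the Python sketch below. It is illustrative only: `check_hf_row` is a hypothetical helper, and the allowance for assistant messages that carry `tool_calls` but no `content` is an assumption, not something this page states.

```python
VALID_ROLES = {"system", "user", "assistant", "tool"}

def check_hf_row(row):
    """Return a list of problems found in one parsed Hugging Face-format row."""
    messages = row.get("messages") if isinstance(row, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["`messages` must be a non-empty array"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"messages[{i}] is not an object")
            continue
        role = msg.get("role")
        if not isinstance(role, str) or role not in VALID_ROLES:
            problems.append(f"messages[{i}] has invalid role {role!r}")
        content = msg.get("content")
        # Assumption: an assistant message with tool_calls may omit content.
        has_tool_calls = role == "assistant" and bool(msg.get("tool_calls"))
        if not isinstance(content, (str, list)) and not has_tool_calls:
            problems.append(f"messages[{i}] content must be a string or an array")
        if role == "tool" and not msg.get("tool_call_id"):
            problems.append(f"messages[{i}] is a tool message without `tool_call_id`")
    return problems
```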

When importing Hugging Face-format rows, the system:

* Treats the last assistant turn as the imported response and earlier turns as request context
* Synthesizes request/response payloads so evals and detail views work
* Sets `request_model` to `unknown-imported-model`
* Sets token usage and costs to zero
* Stores the original row `id` in metadata as `importOriginalRowId`
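The import steps above can be pictured with a small Python sketch. The output shape is hypothetical and only mirrors the names mentioned on this page (`request_model`, `importOriginalRowId`); the real synthesized payloads are not documented here. The sketch also assumes that any turns after the last assistant turn are dropped, which this page does not specify.

```python
def split_hf_row(row):
    """Split a Hugging Face-format row into request context and response."""
    messages = row["messages"]

    # Find the index of the last assistant turn, if there is one.
    last = None
    for i in range(len(messages) - 1, -1, -1):
        if messages[i].get("role") == "assistant":
            last = i
            break

    if last is None:
        context, reply = list(messages), None  # request-only row
    else:
        # Assumption: turns after the last assistant turn are dropped.
        context, reply = messages[:last], messages[last]

    return {
        "request_model": "unknown-imported-model",
        "request_messages": context,    # hypothetical field name
        "response_message": reply,      # hypothetical field name
        "usage": {"total_tokens": 0, "cost": 0},  # hypothetical shape; zeroed on import
        "metadata": {"importOriginalRowId": row.get("id")},
    }
```

For the example row above, the system and user turns become the request context and `{"role":"assistant","content":"4"}` becomes the response.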

## Validation behavior

* Files must be valid JSONL.
* Invalid rows are reported with line numbers in the upload status details.
* Uploads can complete with partial failures if at least one row imports successfully.
* If every row fails validation, the upload status is `failed`.
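The outcome rules above can be mimicked locally with the following Python sketch. It is not the service's implementation: `summarize_upload` and its `check_row` callback are hypothetical, `completed` is an assumed name for the non-failed status (this page only names `failed`), and blank lines are assumed to be skipped.

```python
import json

def summarize_upload(lines, check_row):
    """Validate JSONL lines and report per-line errors plus an overall status.

    `check_row` takes a parsed row and returns a list of problem strings.
    """
    errors = {}     # line number -> list of problems
    imported = 0
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines are skipped
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            errors[lineno] = [f"invalid JSON: {exc.msg}"]
            continue
        problems = check_row(row)
        if problems:
            errors[lineno] = problems
        else:
            imported += 1

    # Partial failures are tolerated; zero imported rows means failure.
    status = "failed" if imported == 0 else "completed"
    return {"status": status, "imported": imported, "errors": errors}
```

Running this over a file before uploading gives the same kind of line-numbered report you would otherwise read from the upload status details.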

## Upload limits

| Limit              | Value     |
| ------------------ | --------- |
| Maximum file size  | 10 GB     |
| Maximum line count | 1,000,000 |
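A file can be checked against these limits before upload with a sketch like the one below. The helper `check_upload_limits` is hypothetical. This page does not say whether 10 GB is decimal or binary, so the sketch assumes the more conservative decimal value (10 × 1000³ bytes), and it counts every physical line, including blank ones.

```python
import os

MAX_BYTES = 10 * 1000**3   # assumption: 10 GB read as decimal gigabytes
MAX_LINES = 1_000_000

def check_upload_limits(path, max_bytes=MAX_BYTES, max_lines=MAX_LINES):
    """Return a list of limit violations for the file at `path`."""
    problems = []

    size = os.path.getsize(path)
    if size > max_bytes:
        problems.append(f"file is {size} bytes; limit is {max_bytes}")

    # Stream the file so large uploads are never loaded into memory.
    line_count = 0
    with open(path, "rb") as fh:
        for line_count, _ in enumerate(fh, start=1):
            if line_count > max_lines:
                break  # no need to read further once the limit is exceeded
    if line_count > max_lines:
        problems.append(f"file has more than {max_lines} lines")
    return problems
```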

## Download formats

Datasets can be downloaded in two formats:

| Format            | Description                          | Best for                         |
| ----------------- | ------------------------------------ | -------------------------------- |
| **Hugging Face**  | `{ id, messages }` per row (default) | Training, fine-tuning, sharing   |
| **Source-backed** | `{ request, response }` per row      | Re-uploading, round-trip testing |

In the UI, click **Download** and choose the format. In the CLI:

<Metadata text="datasets/download-cli" />

```bash theme={"system"}
# Default (Hugging Face)
inf dataset download my-dataset

# Source-backed format
inf dataset download my-dataset --format source-backed
```

<Note>
  Hugging Face exports skip rows with empty message arrays, while source-backed exports include every row with a valid request payload. As a result, the two formats can produce different row counts for the same dataset.
</Note>
