inf evals
The full eval loop is paste-able from the terminal: create a rubric, launch a run against one or more models, and inspect the scores. Models are addressed by route IDs of the form <provider>:<model-alias> (e.g. openai:gpt-5.2). Use inf models list to discover every route ID available to your team — see Route IDs for the full format.
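A minimal end-to-end sketch using the commands documented below (the file name and bracketed IDs are placeholders; substitute the IDs each command prints):

```bash
# 1. Create a rubric from a judge-prompt file
inf eval rubric create -n "helpfulness" -f ./rubric.md

# 2. Launch a run group against two models, scored by a judge model
inf eval run \
  --rubric-id <rubric-id> \
  --dataset-id <dataset-id> \
  --models openai:gpt-5.2,<another-route-id> \
  --judge-model <judge-route-id>

# 3. Track progress and inspect per-model scores
inf eval get <run-group-id>
```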
inf eval rubric create
Create a rubric — the judge prompt an eval run scores responses against. Rubrics live in the active project, carry versioned prompt content, and are passed to inf eval run by ID. The template must contain the placeholder {{ eval_model_response }} where the model’s response will be injected for scoring.
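For illustration, a minimal judge-prompt file might look like the following (the scoring criteria are placeholders; only the {{ eval_model_response }} placeholder is required):

```markdown
Score the response below from 1 to 10 for helpfulness.

Criteria:
- Directly addresses the user's question
- Accurate and free of contradictions

Response to score:

{{ eval_model_response }}
```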
Options
| Flag | Required | Description | Default |
|---|---|---|---|
| -n, --name <name> | Yes | Rubric name | — |
| -f, --file <path> | Yes | Path to a markdown file containing the judge prompt template | — |
| --max-score <n> | No | Maximum score for the rubric (2–100) | 10 |
| --project-id <id> | No | Project to create the rubric in | Active project |
On success, the command prints the new rubric's ID, which you pass to inf eval run.
Examples
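A representative invocation (the rubric name and file path are illustrative):

```bash
# Create a 10-point rubric from a local judge-prompt file
inf eval rubric create -n "helpfulness" -f ./rubric.md --max-score 10
```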
inf eval rubric get
Get details of a rubric — ID, name, latest version number, version count, score range, and a preview of the template.
Arguments
| Argument | Required | Description |
|---|---|---|
| id | Yes | Full UUID, 4+ character prefix, or exact rubric name |
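Any of the following resolve to the same rubric (the ID prefix and name are placeholders):

```bash
inf eval rubric get 3f8a           # 4+ character ID prefix
inf eval rubric get helpfulness    # exact rubric name
```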
inf eval rubric delete
Archive (soft-delete) a rubric. Rubrics cannot be hard-deleted — archiving hides them from inf eval rubrics but preserves their eval history. Restore from the dashboard if needed.
Alias: inf eval rubric archive <id> — both names do the same thing; use whichever reads clearer in your script.
Arguments
| Argument | Required | Description |
|---|---|---|
| id | Yes | Full UUID, 4+ character prefix, or exact rubric name |
Options
| Flag | Required | Description | Default |
|---|---|---|---|
| -y, --yes | Yes in non-TTY environments | Skip the confirmation prompt | Off |
The command asks for interactive confirmation unless -y is passed. In non-TTY environments (CI, scripts) it refuses to run without -y.
Examples
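A typical scripted invocation (the ID prefix is a placeholder):

```bash
# Archive by ID prefix, skipping the confirmation prompt (required in CI)
inf eval rubric delete 3f8a -y
```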
inf eval rubrics
List rubrics in the active project.
Alias: inf eval defs.
Options
| Flag | Required | Description | Default |
|---|---|---|---|
| --include-archived | No | Include archived rubrics | Off |
The table shows truncated rubric IDs; use --json for full UUIDs.
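For example, to list everything with full UUIDs:

```bash
# Include archived rubrics and emit full UUIDs as JSON
inf eval rubrics --include-archived --json
```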
inf eval run
Launch a new eval run group against one or more models, scored by a judge model.
Options
| Flag | Required | Description | Default |
|---|---|---|---|
| --rubric-id <id> | Yes | Rubric ID | — |
| --dataset-id <id> | Yes | Eval-type dataset ID (create one with inf dataset create -t eval) | — |
| --models <ids> | Yes | Comma-separated model route IDs — run inf models list to discover them | — |
| --judge-model <id> | Yes | Route ID of the judge model — run inf models list --judge-only to filter | — |
| --rubric-version-id <id> | No | Pin to a specific rubric version | Latest version |
| --sample-size <n> | No | Samples drawn from the dataset per model (1–100) | 100 |
| -n, --name <name> | No | Display name for the run group | Auto-generated |
The command prints the new run group's ID; use the inf eval get <id> follow-up command to track progress.
Examples
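A representative launch (the bracketed IDs are placeholders; openai:gpt-5.2 is the example route ID from above):

```bash
# Evaluate two models on 50 samples each, judged by openai:gpt-5.2
inf eval run \
  --rubric-id <rubric-id> \
  --dataset-id <dataset-id> \
  --models openai:gpt-5.2,<another-route-id> \
  --judge-model openai:gpt-5.2 \
  --sample-size 50 \
  -n "helpfulness-baseline"
```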
inf eval list
List eval run groups for a given rubric.
Alias: inf eval ls.
Options
| Flag | Required | Description | Default |
|---|---|---|---|
| --rubric-id <id> | Yes | Rubric ID to list runs for | — |
| --rubric-version-id <id> | No | Filter by a specific rubric version | — |
Each row shows the run group's ID, name, status (pending, running, failed, or completed), and creation date.
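For example (placeholder IDs):

```bash
# List run groups for a rubric, filtered to one version
inf eval list --rubric-id <rubric-id> --rubric-version-id <version-id>
```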
inf eval get
View detailed information about a specific eval run group.
Arguments
| Argument | Required | Description |
|---|---|---|
| id | Yes | The eval run group ID |
Output
The detail view covers the run group itself, followed by a sub-table of individual runs:

| Field | Description |
|---|---|
| id | Run group ID |
| rubricId / rubricVersionId | Rubric and pinned version |
| evalDatasetId | Dataset the run group scored |
| judgeProvider / judgeModelId | Judge model scoring the responses |
| models | How many models were evaluated in this run group |
| created | Run group creation timestamp |
| Runs sub-table | One row per model: run ID, provider, model, status, average score, failed sample count, completed/total samples. When avg score is —, the adjacent N failed hint shows how many samples the judge couldn’t score. |
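For example, with the run group ID printed by inf eval run:

```bash
# Inspect a run group and its per-model runs
inf eval get <run-group-id>
```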
inf eval datasets
List datasets available for evaluations (type = eval).
Options
| Flag | Required | Description | Default |
|---|---|---|---|
| -l, --limit <n> | No | Maximum number of results | 50 |
| --include-archived | No | Include archived datasets | Off |
Create new eval datasets with inf dataset create -t eval … or the dashboard. The output shows the dataset ID (8-char prefix), name, inference count, and creation date.
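For example, using the flags above:

```bash
# Show up to 10 eval datasets, including archived ones
inf eval datasets -l 10 --include-archived
```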