Evals

Run and inspect model evaluations from the command line. Manage rubrics (the judge prompts evals run against), list and inspect run groups, launch new runs, and browse eval-ready datasets.

Alias: inf evals

The full eval loop can be pasted straight into a terminal:

# 1. Create a rubric from a markdown file
inf eval rubric create -n support-tickets-v1 -f ./rubric.md
# → Rubric rub_abc12 / version rv_xyz45 created.

# 2. Materialize an eval dataset (traffic-backed, upload-backed, or from a file)
inf dataset create -n demo-eval -t eval --file ./samples.jsonl
# → Dataset ds_def78 created.

# 3. Launch an eval run group
inf eval run \
  --rubric-id rub_abc12 \
  --dataset-id ds_def78 \
  --models openai:gpt-5.2,anthropic:claude-sonnet-4-6 \
  --judge-model anthropic:claude-sonnet-4-6
# → Run group rg_20260415_152340 created.

# 4. Track progress
inf eval get rg_20260415_152340

Route IDs look like <provider>:<model-alias> (e.g. openai:gpt-5.2). Use inf models list to discover every route ID available to your team — see Route IDs for the full format.
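
Scripts that accept route IDs from users may want to check the `<provider>:<model-alias>` shape before passing them on. The following is a minimal sketch in plain shell. The helper names (`route_provider`, `route_model`, `is_route_id`) are hypothetical and not part of the inf CLI, and the check covers only the shape of the string, not whether the route exists for your team.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for working with <provider>:<model-alias> route IDs.
# They are illustrative only and are not shipped with the inf CLI.

route_provider() {
  # Keep everything before the first colon.
  printf '%s\n' "${1%%:*}"
}

route_model() {
  # Keep everything after the first colon.
  printf '%s\n' "${1#*:}"
}

is_route_id() {
  # Succeeds when there is at least one character on each side of a colon.
  case "$1" in
    ?*:?*) return 0 ;;
    *)     return 1 ;;
  esac
}
```

For example, `route_provider openai:gpt-5.2` prints `openai`, and `is_route_id gpt-5.2` fails because the provider part is missing.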

`inf eval rubric create`

Create a rubric — the judge prompt an eval run scores responses against. Rubrics live in the active project, carry versioned prompt content, and are passed to inf eval run by ID. The template must contain the placeholder {{ eval_model_response }} where the model’s response will be injected for scoring.

inf eval rubric create -n <name> -f <path-to-markdown>

Options

Flag	Required	Description	Default
`-n, --name <name>`	Yes	Rubric name	—
`-f, --file <path>`	Yes	Path to a markdown file containing the judge prompt template	—
`--max-score <n>`	No	Maximum score for the rubric (2–100)	`10`
`--project-id <id>`	No	Project to create the rubric in	Active project

Prints the rubric ID and the first version ID. Use them directly with inf eval run.

Examples

# Create a rubric with the default 0–10 scoring scale
inf eval rubric create -n support-tickets-v1 -f ./rubric.md

# Create a rubric with a 0–100 scale
inf eval rubric create -n quality-v2 -f ./quality.md --max-score 100
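
Since the template must contain the `{{ eval_model_response }}` placeholder and `--max-score` accepts 2 to 100, a script can check both locally before calling `inf eval rubric create`. This is a sketch under assumptions: the function names are hypothetical, and the placeholder check matches only the exact spelling shown above (whether the CLI tolerates other spacing is not documented here).

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight checks to run before `inf eval rubric create`.
# Neither function is part of the inf CLI.

rubric_has_placeholder() {
  # -F matches the placeholder literally; -q reports via exit status only.
  grep -qF '{{ eval_model_response }}' "$1"
}

valid_max_score() {
  # Reject empty or non-numeric input first.
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
  esac
  # Then enforce the documented 2-100 range.
  [ "$1" -ge 2 ] && [ "$1" -le 100 ]
}
```

A wrapper could then run `rubric_has_placeholder ./rubric.md && valid_max_score 100` and only invoke the CLI when both succeed.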

`inf eval rubric get`

Get details of a rubric — ID, name, latest version number, version count, score range, and a preview of the template.

inf eval rubric get <id>

Arguments

Argument	Required	Description
`id`	Yes	Full UUID, 4+ character prefix, or exact rubric name

Ambiguous prefixes print the candidate list and abort.

`inf eval rubric delete`

Archive (soft-delete) a rubric. Rubrics cannot be hard-deleted — archiving hides them from inf eval rubrics but preserves their eval history. Restore from the dashboard if needed.

inf eval rubric delete <id>

Alias: inf eval rubric archive <id> — both names do the same thing; use whichever reads clearer in your script.

Arguments

Argument	Required	Description
`id`	Yes	Full UUID, 4+ character prefix, or exact rubric name

Options

Flag	Required	Description	Default
`-y, --yes`	Yes in non-TTY environments	Skip the confirmation prompt	Off

In an interactive terminal, the CLI asks for confirmation unless -y is passed. In non-TTY environments (CI, scripts) the command refuses to run without -y.

Examples

# Archive interactively (prompts for confirmation)
inf eval rubric delete support-tickets-v1

# Archive non-interactively
inf eval rubric archive rub_abc12 --yes
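
A script that runs both interactively and in CI can decide at run time whether to add `--yes`. The sketch below assumes the CLI's TTY check lines up with whether standard input is a terminal, which this page does not confirm. `confirm_flag` is a hypothetical helper, not an inf command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print --yes only when no terminal is attached.
# Assumption: testing stdin with [ -t 0 ] approximates the CLI's own TTY check.

confirm_flag() {
  if [ -t 0 ]; then
    # Interactive terminal: print nothing so the CLI shows its prompt.
    printf ''
  else
    # Non-TTY (CI, pipes): the delete command refuses to run without --yes.
    printf '%s' '--yes'
  fi
}
```

Used unquoted, as in `inf eval rubric delete rub_abc12 $(confirm_flag)`, the flag disappears entirely in an interactive session and expands to `--yes` elsewhere.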

`inf eval rubrics`

List rubrics in the active project.

inf eval rubrics

Alias: inf eval defs

Options

Flag	Required	Description	Default
`--include-archived`	No	Include archived rubrics	Off

Shows the rubric ID (8-char prefix), name, latest version, total version count, and creation date. Use --json for full UUIDs.

`inf eval run`

Launch a new eval run group against one or more models, scored by a judge model.

inf eval run \
  --rubric-id <id> \
  --dataset-id <id> \
  --models <route-id-csv> \
  --judge-model <route-id>

Options

Flag	Required	Description	Default
`--rubric-id <id>`	Yes	Rubric ID	—
`--dataset-id <id>`	Yes	Eval-type dataset ID (create one with `inf dataset create -t eval`)	—
`--models <ids>`	Yes	Comma-separated model route IDs — run `inf models list` to discover them	—
`--judge-model <id>`	Yes	Route ID of the judge model — run `inf models list --judge-only` to filter	—
`--rubric-version-id <id>`	No	Pin to a specific rubric version	Latest version
`--sample-size <n>`	No	Samples drawn from the dataset per model (1–100)	`100`
`-n, --name <name>`	No	Display name for the run group	Auto-generated

Prints the run group ID and an inf eval get <id> follow-up command to track progress.

Examples

# Launch a run against two models, with one of them also acting as judge
inf eval run \
  --rubric-id rub_abc12 \
  --dataset-id ds_def78 \
  --models openai:gpt-5.2,anthropic:claude-sonnet-4-6 \
  --judge-model anthropic:claude-sonnet-4-6

# Pin to a specific rubric version
inf eval run \
  --rubric-id rub_abc12 \
  --rubric-version-id rv_xyz45 \
  --dataset-id ds_def78 \
  --models openai:gpt-5.2 \
  --judge-model anthropic:claude-sonnet-4-6
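
When the model list lives in a shell array or comes from arguments, the comma-separated value for `--models` can be assembled rather than typed by hand. A minimal sketch, where `join_models` is a hypothetical helper and not part of the inf CLI:

```shell
#!/usr/bin/env bash
# Hypothetical helper: join route IDs into the CSV form --models expects.

join_models() {
  # With IFS set to a comma (local to this function), "$*" joins all
  # arguments using that comma as the separator.
  local IFS=","
  printf '%s\n' "$*"
}
```

For instance, `inf eval run ... --models "$(join_models openai:gpt-5.2 anthropic:claude-sonnet-4-6)"` passes `openai:gpt-5.2,anthropic:claude-sonnet-4-6`.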

`inf eval list`

List eval run groups for a given rubric.

inf eval list --rubric-id <id>

Alias: inf eval ls

Options

Flag	Required	Description	Default
`--rubric-id <id>`	Yes	Rubric ID to list runs for	—
`--rubric-version-id <id>`	No	Filter by a specific rubric version	—

Shows the run group ID (8-char prefix), rubric version, model count, derived status (pending, running, failed, or completed), and creation date.

`inf eval get`

View detailed information about a specific eval run group.

inf eval get <id>

Arguments

Argument	Required	Description
`id`	Yes	The eval run group ID

Output

The detail view covers the run group itself, followed by a sub-table of individual runs:

Field	Description
`id`	Run group ID
`rubricId` / `rubricVersionId`	Rubric and pinned version
`evalDatasetId`	Dataset the run group scored
`judgeProvider` / `judgeModelId`	Judge model scoring the responses
`models`	How many models were evaluated in this run group
`created`	Run group creation timestamp
Runs sub-table	One row per model: run ID, provider, model, status, average score, failed sample count, `completed/total` samples. When avg score is `—`, the adjacent `N failed` hint shows how many samples the judge couldn’t score.
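
The `completed/total` column lends itself to a quick progress figure. The sketch below assumes the cell is rendered as two integers separated by a slash (for example `42/100`); the exact output format is not specified here, and `progress_percent` is a hypothetical helper rather than an inf command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a "completed/total" cell into a whole percentage.
# Assumption: the input looks like "42/100".

progress_percent() {
  # Split the argument on the slash.
  local completed="${1%%/*}"
  local total="${1##*/}"
  if [ "$total" -eq 0 ]; then
    # Avoid dividing by zero when a run has no samples yet.
    printf '0\n'
  else
    # Integer division, so the result is rounded down.
    printf '%s\n' "$(( completed * 100 / total ))"
  fi
}
```

`progress_percent 42/100` prints `42`, and `progress_percent 1/3` prints `33`.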

`inf eval datasets`

List datasets available for evaluations (type = eval).

inf eval datasets

Options

Flag	Required	Description	Default
`-l, --limit <n>`	No	Maximum number of results	`50`
`--include-archived`	No	Include archived datasets	Off

Eval datasets are materialized via inf dataset create -t eval … or the dashboard. The output shows the dataset ID (8-char prefix), name, inference count, and creation date.
