Run and inspect model evaluations from the command line. Manage rubrics (the judge prompts evals run against), list and inspect run groups, launch new runs, and browse eval-ready datasets.
Alias: inf evals

The full eval loop is pasteable from the terminal:
# 1. Create a rubric from a markdown file
inf eval rubric create -n support-tickets-v1 -f ./rubric.md
# → Rubric rub_abc12 / version rv_xyz45 created.

# 2. Materialize an eval dataset (traffic-backed, upload-backed, or from a file)
inf dataset create -n demo-eval -t eval --file ./samples.jsonl
# → Dataset ds_def78 created.

# 3. Launch an eval run group
inf eval run \
  --rubric-id rub_abc12 \
  --dataset-id ds_def78 \
  --models openai:gpt-5.2,anthropic:claude-sonnet-4-6 \
  --judge-model anthropic:claude-sonnet-4-6
# → Run group rg_20260415_152340 created.

# 4. Track progress
inf eval get rg_20260415_152340
Route IDs look like <provider>:<model-alias> (e.g. openai:gpt-5.2). Use inf models list to discover every route ID available to your team — see Route IDs for the full format.
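
For example, to see which route IDs you can pass to --models and --judge-model:
# Every route ID available to your team
inf models list

# Only models that can serve as judges
inf models list --judge-only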

inf eval rubric create

Create a rubric — the judge prompt an eval run scores responses against. Rubrics live in the active project, carry versioned prompt content, and are passed to inf eval run by ID. The template must contain the placeholder {{ eval_model_response }} where the model’s response will be injected for scoring.
inf eval rubric create -n <name> -f <path-to-markdown>

Options

| Flag | Required | Description | Default |
| --- | --- | --- | --- |
| -n, --name <name> | Yes | Rubric name | |
| -f, --file <path> | Yes | Path to a markdown file containing the judge prompt template | |
| --max-score <n> | No | Maximum score for the rubric (2–100) | 10 |
| --project-id <id> | No | Project to create the rubric in | Active project |
Prints the rubric ID and the first version ID. Use them directly with inf eval run.

Examples

# Create a rubric with the default 0–10 scoring scale
inf eval rubric create -n support-tickets-v1 -f ./rubric.md

# Create a rubric with a 0–100 scale
inf eval rubric create -n quality-v2 -f ./quality.md --max-score 100
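
The markdown file itself is free-form apart from the required placeholder. A minimal sketch of what a judge prompt might look like (the wording below is illustrative, not a prescribed template):
# Write a minimal judge prompt to ./rubric.md; the {{ eval_model_response }}
# placeholder is required and marks where the model's response is injected
cat > ./rubric.md <<'EOF'
You are grading a support-ticket reply for accuracy and tone.

Reply to grade:
{{ eval_model_response }}

Return a score from 0 to 10 with a one-sentence justification.
EOF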

inf eval rubric get

Get details of a rubric — ID, name, latest version number, version count, score range, and a preview of the template.
inf eval rubric get <id>

Arguments

| Argument | Required | Description |
| --- | --- | --- |
| id | Yes | Full UUID, 4+ character prefix, or exact rubric name |
Ambiguous prefixes print the candidate list and abort.
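
All three identifier forms behave the same, so these are interchangeable (using the placeholder IDs from the examples above):
# By exact name
inf eval rubric get support-tickets-v1

# By a 4+ character ID prefix
inf eval rubric get rub_a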

inf eval rubric delete

Archive (soft-delete) a rubric. Rubrics cannot be hard-deleted — archiving hides them from inf eval rubrics but preserves their eval history. Restore from the dashboard if needed.
inf eval rubric delete <id>
Alias: inf eval rubric archive <id> — both names do the same thing; use whichever reads clearer in your script.

Arguments

| Argument | Required | Description |
| --- | --- | --- |
| id | Yes | Full UUID, 4+ character prefix, or exact rubric name |

Options

| Flag | Required | Description | Default |
| --- | --- | --- | --- |
| -y, --yes | Yes in non-TTY environments | Skip the confirmation prompt | Off |
In an interactive terminal, the CLI asks for confirmation unless -y is passed. In non-TTY environments (CI, scripts) the command refuses to run without -y.

Examples

# Archive interactively (prompts for confirmation)
inf eval rubric delete support-tickets-v1

# Archive non-interactively
inf eval rubric archive rub_abc12 --yes

inf eval rubrics

List rubrics in the active project.
inf eval rubrics
Alias: inf eval defs

Options

| Flag | Required | Description | Default |
| --- | --- | --- | --- |
| --include-archived | No | Include archived rubrics | Off |
Shows the rubric ID (8-char prefix), name, latest version, total version count, and creation date. Use --json for full UUIDs.
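
To feed full UUIDs into a script, the --json output can be post-processed; a sketch with jq, assuming the output is a JSON array whose objects expose an id field (the exact shape isn't documented here):
# Print the full UUID of every rubric (`.id` is an assumed field name)
inf eval rubrics --json | jq -r '.[].id'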

inf eval run

Launch a new eval run group against one or more models, scored by a judge model.
inf eval run \
  --rubric-id <id> \
  --dataset-id <id> \
  --models <route-id-csv> \
  --judge-model <route-id>

Options

| Flag | Required | Description | Default |
| --- | --- | --- | --- |
| --rubric-id <id> | Yes | Rubric ID | |
| --dataset-id <id> | Yes | Eval-type dataset ID (create one with inf dataset create -t eval) | |
| --models <ids> | Yes | Comma-separated model route IDs; run inf models list to discover them | |
| --judge-model <id> | Yes | Route ID of the judge model; run inf models list --judge-only to filter | |
| --rubric-version-id <id> | No | Pin to a specific rubric version | Latest version |
| --sample-size <n> | No | Samples drawn from the dataset per model (1–100) | 100 |
| -n, --name <name> | No | Display name for the run group | Auto-generated |
Prints the run group ID and an inf eval get <id> follow-up command to track progress.

Examples

# Launch a run against two models with a third as judge
inf eval run \
  --rubric-id rub_abc12 \
  --dataset-id ds_def78 \
  --models openai:gpt-5.2,anthropic:claude-sonnet-4-6 \
  --judge-model anthropic:claude-sonnet-4-6

# Pin to a specific rubric version
inf eval run \
  --rubric-id rub_abc12 \
  --rubric-version-id rv_xyz45 \
  --dataset-id ds_def78 \
  --models openai:gpt-5.2 \
  --judge-model anthropic:claude-sonnet-4-6
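
Since inf eval get prints a one-shot snapshot, a simple way to follow a run group to completion is to re-run it on an interval, for example with watch:
# Refresh the detail view every 30 seconds while the runs progress
watch -n 30 inf eval get rg_20260415_152340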

inf eval list

List eval run groups for a given rubric.
inf eval list --rubric-id <id>
Alias: inf eval ls

Options

| Flag | Required | Description | Default |
| --- | --- | --- | --- |
| --rubric-id <id> | Yes | Rubric ID to list runs for | |
| --rubric-version-id <id> | No | Filter by a specific rubric version | |
Shows the run group ID (8-char prefix), rubric version, model count, derived status (pending, running, failed, or completed), and creation date.
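
For example, to narrow the listing to run groups that used a specific pinned version:
# All run groups for the rubric
inf eval ls --rubric-id rub_abc12

# Only run groups pinned to one rubric version
inf eval ls --rubric-id rub_abc12 --rubric-version-id rv_xyz45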

inf eval get

View detailed information about a specific eval run group.
inf eval get <id>

Arguments

| Argument | Required | Description |
| --- | --- | --- |
| id | Yes | The eval run group ID |

Output

The detail view covers the run group itself, followed by a sub-table of individual runs:
| Field | Description |
| --- | --- |
| id | Run group ID |
| rubricId / rubricVersionId | Rubric and pinned version |
| evalDatasetId | Dataset the run group scored |
| judgeProvider / judgeModelId | Judge model scoring the responses |
| models | How many models were evaluated in this run group |
| created | Run group creation timestamp |
| Runs sub-table | One row per model: run ID, provider, model, status, average score, failed sample count, completed/total samples. When the avg score is blank, the adjacent "N failed" hint shows how many samples the judge couldn't score. |
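
For scripting, the detail view can be post-processed if the --json flag mentioned under inf eval rubrics also applies to this command (an assumption; check inf eval get --help). The runs/status field names below are likewise assumed from the field table above:
# Count runs that have not yet completed (assumes --json support and a
# `runs` array whose entries carry a `status` field)
inf eval get rg_20260415_152340 --json \
  | jq '[.runs[] | select(.status != "completed")] | length'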

inf eval datasets

List datasets available for evaluations (type = eval).
inf eval datasets

Options

| Flag | Required | Description | Default |
| --- | --- | --- | --- |
| -l, --limit <n> | No | Maximum number of results | 50 |
| --include-archived | No | Include archived datasets | Off |
Eval datasets are materialized via inf dataset create -t eval … or the dashboard. The output shows the dataset ID (8-char prefix), name, inference count, and creation date.
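
Both flags compose; for instance, to widen the listing:
# Show up to 100 eval datasets, including archived ones
inf eval datasets --limit 100 --include-archived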