Run and inspect model evaluations from the command line. Manage rubrics (the judge prompts evals run against), list and inspect run groups, launch new runs, and browse eval-ready datasets.
Alias: inf evals
The full eval loop is paste-able from the terminal:
# 1. Create a rubric from a markdown file
inf eval rubric create -n support-tickets-v1 -f ./rubric.md
# → Rubric rub_abc12 / version rv_xyz45 created.
# 2. Materialize an eval dataset (traffic-backed, upload-backed, or from a file)
inf dataset create -n demo-eval -t eval --file ./samples.jsonl
# → Dataset ds_def78 created.
# 3. Launch an eval run group
inf eval run \
--rubric-id rub_abc12 \
--dataset-id ds_def78 \
--models openai:gpt-5.2,anthropic:claude-sonnet-4-6 \
--judge-model anthropic:claude-sonnet-4-6
# → Run group rg_20260415_152340 created.
# 4. Track progress
inf eval get rg_20260415_152340
Route IDs look like <provider>:<model-alias> (e.g. openai:gpt-5.2). Use inf models list to discover every route ID available to your team — see Route IDs for the full format.
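A script that assembles route IDs may want to check their shape before calling the CLI. The sketch below uses a hypothetical helper, `is_route_id`, which is not part of `inf`. It only checks for a non-empty provider and a non-empty alias around a colon and rejects whitespace; that is an assumption about what a well-formed ID looks like, and `inf models list` remains the authoritative source.

```shell
# Hypothetical helper (not part of the inf CLI).
# Returns 0 when "$1" has the shape <provider>:<model-alias>.
is_route_id() {
  case "$1" in
    *[[:space:]]*) return 1 ;;  # assumption: route IDs contain no whitespace
    ?*:?*)         return 0 ;;  # non-empty text on both sides of a colon
    *)             return 1 ;;
  esac
}
```

A shape check like this does not confirm that the route exists for your team; it only catches typos such as a missing provider prefix.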
inf eval rubric create
Create a rubric — the judge prompt an eval run scores responses against. Rubrics live in the active project, carry versioned prompt content, and are passed to inf eval run by ID. The template must contain the placeholder {{ eval_model_response }} where the model’s response will be injected for scoring.
inf eval rubric create -n <name> -f <path-to-markdown>
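Since the template must contain the placeholder, a pre-flight check on the markdown file can be useful in scripts. This is a minimal sketch with a hypothetical helper name, `has_placeholder`; it matches the exact spelling shown above, including the inner spaces, and whether the CLI tolerates other spacing is not covered here.

```shell
# Hypothetical pre-flight helper (not part of the inf CLI).
# Succeeds only if the file contains the literal placeholder string.
has_placeholder() {
  grep -qF '{{ eval_model_response }}' "$1"
}
```

One way to use it: `has_placeholder ./rubric.md && inf eval rubric create -n support-tickets-v1 -f ./rubric.md`.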
Options
| Flag | Required | Description | Default |
|---|---|---|---|
| -n, --name <name> | Yes | Rubric name | — |
| -f, --file <path> | Yes | Path to a markdown file containing the judge prompt template | — |
| --max-score <n> | No | Maximum score for the rubric (2–100) | 10 |
| --project-id <id> | No | Project to create the rubric in | Active project |
Prints the rubric ID and the first version ID. Use them directly with inf eval run.
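To pass the printed IDs straight into a later command, a script could extract them from the confirmation line. The sketch below assumes the line is shaped like the quick-start sample (`Rubric rub_abc12 / version rv_xyz45 created.`); the real output may differ, and the helper name `parse_rubric_ids` is hypothetical. If the command offers machine-readable output, that would be the more robust choice.

```shell
# Hypothetical helper (not part of the inf CLI).
# Reads CLI output on stdin and prints "<rubric-id> <version-id>".
# Assumes a line shaped like: Rubric <id> / version <id> created.
parse_rubric_ids() {
  sed -n 's|.*Rubric \([^ ]*\) / version \([^ ]*\) created.*|\1 \2|p'
}
```

If the pattern does not match, nothing is printed, so callers should treat empty output as a failure.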
Examples
# Create a rubric with the default 0–10 scoring scale
inf eval rubric create -n support-tickets-v1 -f ./rubric.md
# Create a rubric with a 0–100 scale
inf eval rubric create -n quality-v2 -f ./quality.md --max-score 100
inf eval rubric get
Get details of a rubric — ID, name, latest version number, version count, score range, and a preview of the template.
Arguments
| Argument | Required | Description |
|---|---|---|
| id | Yes | Full UUID, 4+ character prefix, or exact rubric name |
Ambiguous prefixes print the candidate list and abort.
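The lookup rule can be illustrated with a small stand-alone sketch. This is not the CLI's implementation: the helper name `resolve_rubric`, the input format (lines of `<id> <name>`, with names free of spaces), the exit codes, and the choice to let an exact match win over a prefix match are all assumptions made for illustration.

```shell
# Illustration only (not the inf CLI's code).
# $1 = query; stdin = lines of "<id> <name>".
# Prints the resolved ID; exits 2 when a prefix is ambiguous, 1 when nothing matches.
resolve_rubric() {
  awk -v q="$1" '
    $1 == q || $2 == q { exact = $1 }                 # full ID or exact name
    length(q) >= 4 && index($1, q) == 1 {             # prefix of 4+ characters
      n++; cands = cands (n > 1 ? " " : "") $1
    }
    END {
      if (exact != "") { print exact; exit 0 }
      if (n == 1)      { print cands; exit 0 }
      if (n > 1)       { print "ambiguous: " cands > "/dev/stderr"; exit 2 }
      print "not found: " q > "/dev/stderr"; exit 1
    }'
}
```

In this sketch a three-character query never matches by prefix, mirroring the 4+ character requirement in the table above.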
inf eval rubric delete
Archive (soft-delete) a rubric. Rubrics cannot be hard-deleted — archiving hides them from inf eval rubrics but preserves their eval history. Restore from the dashboard if needed.
inf eval rubric delete <id>
Alias: inf eval rubric archive <id> — both names do the same thing; use whichever reads clearer in your script.
Arguments
| Argument | Required | Description |
|---|---|---|
| id | Yes | Full UUID, 4+ character prefix, or exact rubric name |
Options
| Flag | Required | Description | Default |
|---|---|---|---|
| -y, --yes | Yes in non-TTY environments | Skip the confirmation prompt | Off |
In an interactive terminal, the CLI asks for confirmation unless -y is passed. In non-TTY environments (CI, scripts) the command refuses to run without -y.
Examples
# Archive interactively (prompts for confirmation)
inf eval rubric delete support-tickets-v1
# Archive non-interactively
inf eval rubric archive rub_abc12 --yes
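A wrapper script that runs both interactively and in CI could decide whether to pass `--yes` based on whether it is attached to a terminal. The sketch below is one way to do that; `needs_yes` and `archive_rubric` are hypothetical names, and testing stdin with `[ -t 0 ]` is an assumption, since the docs do not say which stream the CLI inspects.

```shell
# Hypothetical helpers (not part of the inf CLI).

# True (exit 0) when stdin is not a terminal, e.g. in CI or under cron.
needs_yes() {
  ! [ -t 0 ]
}

# Archive a rubric, adding --yes only when no prompt could be answered.
archive_rubric() {
  if needs_yes; then
    inf eval rubric archive "$1" --yes
  else
    inf eval rubric archive "$1"
  fi
}
```

Keeping the prompt in interactive sessions preserves the safety check for humans while letting the same script run unattended.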
inf eval rubrics
List rubrics in the active project.
Alias: inf eval defs
Options
| Flag | Required | Description | Default |
|---|---|---|---|
| --include-archived | No | Include archived rubrics | Off |
Shows the rubric ID (8-char prefix), name, latest version, total version count, and creation date. Use --json for full UUIDs.
inf eval run
Launch a new eval run group against one or more models, scored by a judge model.
inf eval run \
--rubric-id <id> \
--dataset-id <id> \
--models <route-id-csv> \
--judge-model <route-id>
Options
| Flag | Required | Description | Default |
|---|---|---|---|
| --rubric-id <id> | Yes | Rubric ID | — |
| --dataset-id <id> | Yes | Eval-type dataset ID (create one with inf dataset create -t eval) | — |
| --models <ids> | Yes | Comma-separated model route IDs — run inf models list to discover them | — |
| --judge-model <id> | Yes | Route ID of the judge model — run inf models list --judge-only to filter | — |
| --rubric-version-id <id> | No | Pin to a specific rubric version | Latest version |
| --sample-size <n> | No | Samples drawn from the dataset per model (1–100) | 100 |
| -n, --name <name> | No | Display name for the run group | Auto-generated |
Prints the run group ID and an inf eval get <id> follow-up command to track progress.
Examples
# Launch a run against two models, with one of them also serving as judge
inf eval run \
--rubric-id rub_abc12 \
--dataset-id ds_def78 \
--models openai:gpt-5.2,anthropic:claude-sonnet-4-6 \
--judge-model anthropic:claude-sonnet-4-6
# Pin to a specific rubric version
inf eval run \
--rubric-id rub_abc12 \
--rubric-version-id rv_xyz45 \
--dataset-id ds_def78 \
--models openai:gpt-5.2 \
--judge-model anthropic:claude-sonnet-4-6
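When the model list and sample size come from variables, a script can build the `--models` value and check the `--sample-size` range before launching. The helpers below, `join_models` and `valid_sample_size`, are hypothetical and written as a sketch; the 1–100 bounds come from the options table, everything else is illustrative.

```shell
# Hypothetical helpers (not part of the inf CLI).

# Join all arguments with commas: a b c -> a,b,c
join_models() {
  ( IFS=,; printf '%s\n' "$*" )   # subshell keeps the caller's IFS untouched
}

# Succeed only for an integer between 1 and 100 inclusive.
valid_sample_size() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;      # empty or contains a non-digit
  esac
  [ "$1" -ge 1 ] && [ "$1" -le 100 ]
}
```

For example, `inf eval run --rubric-id rub_abc12 --dataset-id ds_def78 --models "$(join_models openai:gpt-5.2 anthropic:claude-sonnet-4-6)" --judge-model anthropic:claude-sonnet-4-6` passes the same list as the first example above.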
inf eval list
List eval run groups for a given rubric.
inf eval list --rubric-id <id>
Alias: inf eval ls
Options
| Flag | Required | Description | Default |
|---|---|---|---|
| --rubric-id <id> | Yes | Rubric ID to list runs for | — |
| --rubric-version-id <id> | No | Filter by a specific rubric version | — |
Shows the run group ID (8-char prefix), rubric version, model count, derived status (pending, running, failed, or completed), and creation date.
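A script that waits on a run group needs to know when to stop checking. The sketch below classifies the four derived statuses; treating `completed` and `failed` as final and the other two as still in progress is an assumption based on their names, and `is_terminal_status` is a hypothetical helper, not a CLI feature.

```shell
# Hypothetical helper (not part of the inf CLI).
# Exit 0: run group is finished; exit 1: still in progress; exit 2: unknown status.
is_terminal_status() {
  case "$1" in
    completed|failed) return 0 ;;
    pending|running)  return 1 ;;
    *) echo "unknown status: $1" >&2; return 2 ;;
  esac
}
```

How the status string is obtained from the CLI output is left open here, since the exact output layout is not specified in this reference.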
inf eval get
View detailed information about a specific eval run group.
Arguments
| Argument | Required | Description |
|---|---|---|
| id | Yes | The eval run group ID |
Output
The detail view covers the run group itself, followed by a sub-table of individual runs:
| Field | Description |
|---|---|
| id | Run group ID |
| rubricId / rubricVersionId | Rubric and pinned version |
| evalDatasetId | Dataset the run group scored |
| judgeProvider / judgeModelId | Judge model scoring the responses |
| models | How many models were evaluated in this run group |
| created | Run group creation timestamp |
| Runs sub-table | One row per model: run ID, provider, model, status, average score, failed sample count, completed/total samples. When avg score is —, the adjacent N failed hint shows how many samples the judge couldn’t score. |
inf eval datasets
List datasets available for evaluations (type = eval).
Options
| Flag | Required | Description | Default |
|---|---|---|---|
| -l, --limit <n> | No | Maximum number of results | 50 |
| --include-archived | No | Include archived datasets | Off |
Eval datasets are materialized via inf dataset create -t eval … or the dashboard. The output shows the dataset ID (8-char prefix), name, inference count, and creation date.