Training

Kick off and manage model training runs from the command line. Discover recipes and trainable base models, queue new runs (with override flags for the trickier trainingConfig knobs), cancel in-flight jobs, and zoom in on errors without scrolling through raw logs.

Alias: inf train

The full training loop can be pasted straight into a terminal:

# 1. Materialize training and eval datasets
inf dataset create -n my-train-split -t training --file ./train.jsonl
inf dataset create -n my-eval-split  -t eval     --file ./eval.jsonl
# → Datasets ds_trn_abc12 / ds_evl_def34 created.

# 2. Create (or reuse) a rubric the evals will score against
inf eval rubric create -n my-rubric -f ./rubric.md
# → Rubric rub_xyz56 / version rv_ver78 created.

# 3. Pick a recipe and base model
inf training recipes
inf training models

# 4. Queue the training run
inf training create \
  --name distill-hardreasoning-qwen3.5-4b-v2 \
  --recipe inf-public-training-recipe:qwen-3.5-4b-fft \
  --training-dataset ds_trn_abc12 \
  --eval-dataset     ds_evl_def34 \
  --rubric           rub_xyz56 \
  --sample-packing   false \
  --num-epochs       5 \
  --task-id distill-hardreasoning-v2
# → Training job job_90ab12 queued.

# 5. Track progress until it finishes
inf training poll job_90ab12

`inf training models`

Discover the base models you can fine-tune and (with --judge) the judge models that can score checkpoints.

inf training models

Options

Flag	Required	Description	Default
`--judge`	No	List judge models instead of base models (their IDs are what `--judge-model` on `inf training create` accepts)	Off

The table shows each model’s canonical alias, full name, and ID prefix. Pair with --json when scripting to preserve full IDs — those full IDs are what you pass to --base-model / --judge-model on inf training create.

Examples

# List base models you can fine-tune
inf training models

# List judge models instead
inf training models --judge

# Dump full IDs for scripting
inf training models --json | jq -r '.[].id'
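
Because the table only shows an ID prefix, a script may need to expand a prefix into the full ID. The following is a minimal sketch of one way to do that. resolve_model_id is a hypothetical helper, not part of the inf CLI. It assumes that --json emits an array of objects with an id field (as the jq example above implies), that --json can be combined with --judge, and that jq is installed.

```shell
# resolve_model_id is a hypothetical helper, not an inf subcommand.
# Assumption: `inf training models --json` prints an array of objects
# that each carry an "id" field, as the jq example above implies.
resolve_model_id() {
  prefix="$1"
  shift
  # Any remaining arguments (for example --judge) are passed through.
  inf training models --json "$@" |
    jq -r --arg p "$prefix" '.[] | select(.id | startswith($p)) | .id'
}
```

Calling resolve_model_id with a prefix copied from the table prints every full ID that starts with it, one per line. If more than one line comes back, the prefix is ambiguous and should be lengthened before it is passed to --base-model or --judge-model.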

`inf training recipes`

Recipes bundle a base model, a judge model, a GPU plan, and a full trainingConfig. inf training recipes lists everything visible to the active project (public recipes plus the project’s own recipes).

inf training recipes

Options

Flag	Required	Description	Default
`--include-archived`	No	Include archived recipes	Off
`--public-only`	No	Show only public recipes	Off
`--project-only`	No	Show only the active project’s recipes	Off

Only super-admins can fork a public recipe into a project recipe. If you need to customize a recipe’s trainingConfig, use the override flags on inf training create rather than trying to clone the recipe.

`inf training recipes get`

Inspect a specific recipe, including its full trainingConfig. Useful for spotting knobs you may want to override at queue time.

inf training recipes get <id>

Arguments

Argument	Required	Description
`id`	Yes	Recipe ID

Examples

# List recipes visible to the active project
inf training recipes

# Inspect a specific recipe (including its trainingConfig)
inf training recipes get inf-public-training-recipe:qwen-3.5-4b-fft

# Only show project-owned recipes
inf training recipes --project-only

`inf training create`

Queue a new training run. You can either specify a recipe (recommended — it pre-fills base model, judge, GPU plan, and trainingConfig) or pass individual flags. Override flags let you tweak specific trainingConfig fields without forking the recipe.

inf training create \
  --name <name> \
  --training-dataset <id> \
  --eval-dataset <id> \
  --rubric <id>

Alias: inf training queue

Options

Flag	Required	Description	Default
`-n, --name <name>`	Yes	Display name for the run	—
`--training-dataset <id>`	Yes	Training-type dataset ID (also accepts `--training-dataset-id`)	—
`--eval-dataset <id>`	Yes	Eval-type dataset ID (also accepts `--eval-dataset-id`)	—
`--rubric <id>`	Yes	Rubric ID (also accepts `--rubric-id`)	—
`--rubric-version-id <id>`	No	Pin to a specific rubric version	Latest version
`--recipe <id>`	No	Training recipe ID — pre-fills base model, judge, GPU plan, and `trainingConfig` (also `--recipe-id`)	—
`--base-model <id>`	No	Base model to fine-tune (overrides the recipe, also `--base-model-id`)	Recipe default
`--judge-model <id>`	No	Judge model for evals (overrides the recipe, also `--judge-model-id`)	Recipe default
`--num-nodes <n>`	No	Number of training nodes	Recipe default
`--gpus-per-node <n>`	No	GPUs per node (1–8)	Recipe default
`--task-id <id>`	No	Associate the run with an existing task	—
`--sample-packing <bool>`	No	Override `trainingConfig.sample_packing` (`true`/`false`)	Recipe default
`--num-epochs <n>`	No	Override `trainingConfig.num_epochs`	Recipe default
`--max-steps <n>`	No	Override `trainingConfig.max_steps` (use `-1` for “no cap”)	Recipe default
`--learning-rate <n>`	No	Override `trainingConfig.learning_rate`	Recipe default
`--dry-run`	No	Print the exact tRPC payload that would be sent and exit — nothing is created	Off

GPU hardware selection is managed server-side and is not configurable from the CLI.

When --rubric-version-id is omitted, the CLI fetches the latest version of the rubric passed to --rubric before queueing. Pin a version if you need reproducibility across re-runs.

On success, the command prints the new training job ID along with an inf training poll <id> follow-up command. The run status starts as queued while datasets are prepared, then moves to running.
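
To make the version-pinning advice concrete, here is a small sketch that always queues with an explicit rubric version, so re-runs score against the same rubric text. queue_pinned is a hypothetical wrapper, not an inf subcommand. It uses only flags from the Options table above.

```shell
# queue_pinned is a hypothetical wrapper, not an inf subcommand.
# It refuses to queue unless a rubric version is supplied explicitly.
queue_pinned() {
  if [ "$#" -ne 6 ]; then
    echo "usage: queue_pinned <name> <recipe> <train-ds> <eval-ds> <rubric> <rubric-version>" >&2
    return 2
  fi
  inf training create \
    --name "$1" \
    --recipe "$2" \
    --training-dataset "$3" \
    --eval-dataset "$4" \
    --rubric "$5" \
    --rubric-version-id "$6"
}
```

With the IDs from the walkthrough at the top of this page, the rubric arguments would be rub_xyz56 and rv_ver78.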

Escape hatches for recipe-pinned configs

The public recipes are opinionated: sample_packing is often true, and num_epochs / max_steps are tuned for the typical dataset size. When your dataset is small or shaped differently, those defaults can cause torchrun to crash. One example is the error "args.max_steps must be set to a positive value if dataloader does not have a length, was -1", which is what happens when sample packing collapses a tiny dataset into fewer batches than the FSDP shard count.

The --sample-packing, --num-epochs, --max-steps, and --learning-rate flags let you override these fields without forking the recipe (which only super-admins can do). Pair them with --dry-run to sanity-check the resulting payload before burning a job slot.

Examples

# Minimal queue with a recipe
inf training create \
  --name distill-v1 \
  --recipe inf-public-training-recipe:qwen-3.5-4b-fft \
  --training-dataset ds_trn_abc12 \
  --eval-dataset     ds_evl_def34 \
  --rubric           rub_xyz56

# Override recipe defaults for a small dataset and verify the payload first
inf training create \
  --name distill-hardreasoning-qwen3.5-4b-v2 \
  --recipe inf-public-training-recipe:qwen-3.5-4b-fft \
  --training-dataset ds_trn_abc12 \
  --eval-dataset     ds_evl_def34 \
  --rubric           rub_xyz56 \
  --sample-packing false \
  --num-epochs 5 \
  --dry-run

# Use the `queue` alias
inf training queue \
  --name distill-v1 \
  --base-model model_abc \
  --judge-model model_xyz \
  --training-dataset ds_trn_abc12 \
  --eval-dataset     ds_evl_def34 \
  --rubric           rub_xyz56
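
The dry-run advice above can be turned into a two-step habit: preview the payload, then queue with exactly the same arguments. The following is a sketch only. preview_then_queue is a hypothetical helper, and it assumes that a rejected --dry-run exits with a non-zero status, which this page does not confirm.

```shell
# preview_then_queue is a hypothetical helper, not an inf subcommand.
# It runs `inf training create` twice with identical arguments:
# first with --dry-run to print the payload, then for real on confirmation.
# Assumption: a rejected --dry-run exits non-zero.
preview_then_queue() {
  inf training create "$@" --dry-run || {
    echo "dry run failed; nothing was queued" >&2
    return 1
  }
  printf 'Queue this run? [y/N] ' >&2
  read -r answer
  case "$answer" in
    y|Y) inf training create "$@" ;;
    *)   echo "aborted; nothing was queued" >&2; return 1 ;;
  esac
}
```

Pass it the same flags you would give inf training create, without --dry-run. Because both calls share one argument list, the payload you reviewed is the payload that gets queued.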

`inf training list`

Display a table of training runs in the active project.

inf training list

Alias: inf training ls

Options

Flag	Required	Description	Default
`-s, --status <status>`	No	Filter by status: `exporting_datasets`, `queued`, `running`, `cycling`, `completed`, `failed`, `cancelled`, or `timed_out`	All statuses
`-l, --limit <n>`	No	Maximum number of results	`20`
`--offset <n>`	No	Offset for pagination	`0`

The table shows the run ID (8-char prefix), name, status (color-coded), base model, progress (currentStep/totalSteps), current loss, and creation date.

Examples

# Show only running training jobs
inf training list --status running

# Get the first 50 training runs
inf training list --limit 50

# Page through results
inf training list --limit 20 --offset 40
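
The offset for a given page is (page - 1) × limit, so the last example above is page 3 at 20 results per page. A minimal sketch of that arithmetic follows. list_page is a hypothetical helper, not part of the inf CLI.

```shell
# list_page is a hypothetical helper, not an inf subcommand.
# Usage: list_page <page> [limit]   (pages are 1-indexed)
list_page() {
  page="$1"
  limit="${2:-20}"                      # 20 is the documented default for --limit
  offset=$(( (page - 1) * limit ))      # page 1 -> 0, page 3 at limit 20 -> 40
  inf training list --limit "$limit" --offset "$offset"
}
```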

`inf training get`

View detailed information about a specific training run. Without --error, prints the full detail view. With --error, dumps only the fields that matter when a run crashed — ideal for triaging failed jobs from CI or a script.

inf training get <id>

Arguments

Argument	Required	Description
`id`	Yes	The training job ID

Options

Flag	Required	Description	Default
`--error`	No	Print only the status, status detail, and error message in a highlighted block (for `failed` runs)	Off

Output

The default detail view includes every field on the training job:

Field	Description
`id`, `name`	Run identifier and display name
`status`	One of the status values listed under `inf training list`
`baseModelId`	Base model the run fine-tunes from
`adapter`	Adapter type (e.g. LoRA)
`currentStep` / `totalSteps`	Progress counters
`currentLoss`	Most recent training loss
`numNodes`	Number of nodes participating in the run
`provider` / `providerRunId`	Underlying training provider and its internal run ID
`evalDatasetName` / `rubricName`	Eval configuration (if the run is configured for mid-training evals)
`filteredDatasetName`	Training dataset name
`startedAt` / `completedAt` / `createdAt`	Lifecycle timestamps
`statusDetail` / `errorMessage`	Populated when the run fails or ends in a non-success state

With --error, the output collapses to just status, statusDetail, and errorMessage.

Examples

# Full detail view
inf training get job_abc123

# Just the error payload for a failed run
inf training get job_abc123 --error

`inf training cancel`

Cancel a queued or running training job. The call is rejected for completed and already-cancelled jobs.

inf training cancel <id>

Arguments

Argument	Required	Description
`id`	Yes	The training job ID

Options

Flag	Required	Description	Default
`-y, --yes`	Yes in non-TTY environments	Skip the confirmation prompt	Off

In an interactive terminal, the CLI asks for confirmation unless -y is passed. In non-TTY environments (CI, scripts) the command refuses to run without -y.

Examples

# Cancel interactively
inf training cancel job_abc123

# Cancel non-interactively (required in CI)
inf training cancel job_abc123 --yes
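
In a cleanup script you may want to cancel several jobs and keep going when one of them has already finished. The sketch below shows one way to do that. cancel_all is a hypothetical helper, and it assumes that a rejected cancel exits with a non-zero status.

```shell
# cancel_all is a hypothetical helper, not an inf subcommand.
# Assumption: a rejected cancel (for example, on a completed job)
# exits non-zero.
cancel_all() {
  failed=0
  for id in "$@"; do
    # --yes is required here because scripts run without a TTY.
    if inf training cancel "$id" --yes; then
      echo "cancelled: $id"
    else
      echo "could not cancel: $id" >&2
      failed=$((failed + 1))
    fi
  done
  # Succeed only if every cancel call succeeded.
  [ "$failed" -eq 0 ]
}
```

The helper attempts every ID it is given and returns a failure status at the end if any cancel was rejected, so a CI step can still flag the problem.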

`inf training logs`

Stream or view log entries for a training run. Logs are color-coded by level, and each line is prefixed with the training phase it came from (torchrun_init, training, inference_export, …) so you can tell setup crashes apart from training crashes at a glance.

inf training logs <id>

Arguments

Argument	Required	Description
`id`	Yes	The training job ID

Options

Flag	Required	Description	Default
`-l, --limit <n>`	No	Maximum number of log entries	`50`
`--level <level>`	No	Filter by log level (e.g. `error`, `warn`, `info`)	All levels
`--phase <phase>`	No	Filter by training phase (substring match — e.g. `torchrun_init`, `training`)	No filter
`-f, --follow`	No	Continuously poll for new logs (like `tail -f`)	Off

Log entries are timestamped and color-coded: errors in red, warnings in yellow, info in blue. In follow mode, the CLI polls for new logs every 3 seconds until you press Ctrl+C.

Examples

# View the last 50 log entries
inf training logs job_abc123

# Stream logs in real time
inf training logs job_abc123 --follow

# Only show error logs
inf training logs job_abc123 --level error

# Only show setup-phase logs (everything before training starts)
inf training logs job_abc123 --phase torchrun_init
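
The filters can plausibly be stacked to narrow a setup crash down to its error lines. The sketch below assumes that --phase, --level, and --limit can be combined in a single call, which this page does not state explicitly. setup_errors is a hypothetical helper.

```shell
# setup_errors is a hypothetical helper, not an inf subcommand.
# Assumption: --phase, --level, and --limit can be combined in one call.
# Usage: setup_errors <job-id> [limit]
setup_errors() {
  inf training logs "$1" \
    --phase torchrun_init \
    --level error \
    --limit "${2:-50}"    # 50 is the documented default for --limit
}
```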

`inf training poll`

Wait for a training run to complete, printing status updates as the status changes.

inf training poll <id>

Arguments

Argument	Required	Description
`id`	Yes	The training job ID

Options

Flag	Required	Description	Default
`-i, --interval <seconds>`	No	Poll interval in seconds	`10`

The CLI prints a status line each time the status changes, showing the status, progress, and current loss. It exits automatically when the run reaches completed, failed, cancelled, or timed_out. On a failed exit, the CLI points you at inf training get <id> --error for the full error payload.

Examples

# Poll every 10 seconds (default)
inf training poll job_abc123

# Poll every 30 seconds
inf training poll job_abc123 --interval 30
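
For CI, polling and error triage can be chained so that a failed run leaves its error payload in the build log. The following is a sketch under one loud assumption: that inf training poll exits with a non-zero status when the run ends in anything other than completed. This page does not document the exit codes, so verify that before relying on it. wait_and_triage is a hypothetical helper.

```shell
# wait_and_triage is a hypothetical helper, not an inf subcommand.
# Assumption: `inf training poll` exits non-zero when the run does not
# end in `completed`. Check this against your CLI version.
# Usage: wait_and_triage <job-id> [interval-seconds]
wait_and_triage() {
  id="$1"
  interval="${2:-10}"    # 10 is the documented default for --interval
  if inf training poll "$id" --interval "$interval"; then
    echo "run $id completed"
  else
    # Print the condensed error view, then the most recent error logs.
    inf training get "$id" --error
    inf training logs "$id" --level error --limit 20
    return 1
  fi
}
```

If the exit-code assumption does not hold, the same two triage commands can be run unconditionally after the poll returns.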
