Kick off and manage model training runs from the command line. Discover recipes and trainable base models, queue new runs (with override flags for the trickier trainingConfig knobs), cancel in-flight jobs, and zoom in on errors without scrolling through raw logs.

Alias: inf train

The full training loop is pasteable from the terminal:
# 1. Materialize training and eval datasets
inf dataset create -n my-train-split -t training --file ./train.jsonl
inf dataset create -n my-eval-split  -t eval     --file ./eval.jsonl
# → Datasets ds_trn_abc12 / ds_evl_def34 created.

# 2. Create (or reuse) a rubric the evals will score against
inf eval rubric create -n my-rubric -f ./rubric.md
# → Rubric rub_xyz56 / version rv_ver78 created.

# 3. Pick a recipe and base model
inf training recipes
inf training models

# 4. Queue the training run
inf training create \
  --name distill-hardreasoning-qwen3.5-4b-v2 \
  --recipe inf-public-training-recipe:qwen-3.5-4b-fft \
  --training-dataset ds_trn_abc12 \
  --eval-dataset     ds_evl_def34 \
  --rubric           rub_xyz56 \
  --sample-packing   false \
  --num-epochs       5 \
  --task-id distill-hardreasoning-v2
# → Training job job_90ab12 queued.

# 5. Track progress until it finishes
inf training poll job_90ab12

inf training models

Discover the base models you can fine-tune and (with --judge) the judge models that can score checkpoints.
inf training models

Options

| Flag | Required | Description | Default |
| --- | --- | --- | --- |
| --judge | No | List judge models instead of base models (use with --judge-model) | Off |
The table shows each model’s canonical alias, full name, and ID prefix. Pair with --json when scripting to preserve full IDs — those full IDs are what you pass to --base-model / --judge-model on inf training create.

Examples

# List base models you can fine-tune
inf training models

# List judge models instead
inf training models --judge

# Dump full IDs for scripting
inf training models --json | jq -r '.[].id'
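When a script needs to resolve a model's canonical alias to the full ID expected by --base-model or --judge-model, the --json output can be filtered with jq. A sketch, assuming the JSON objects expose alias and id fields (check your actual --json output and adjust the paths to match):

```shell
# Resolve a base-model alias to its full ID for use with --base-model.
# The field names (.alias, .id) are assumptions about the JSON shape.
BASE_MODEL_ID=$(inf training models --json \
  | jq -r '.[] | select(.alias == "qwen-3.5-4b") | .id')
echo "$BASE_MODEL_ID"
```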

inf training recipes

Recipes bundle a base model + judge model + GPU plan + full trainingConfig. inf training recipes lists everything visible to the active project (public recipes + the project’s own recipes).
inf training recipes

Options

| Flag | Required | Description | Default |
| --- | --- | --- | --- |
| --include-archived | No | Include archived recipes | Off |
| --public-only | No | Show only public recipes | Off |
| --project-only | No | Show only the active project’s recipes | Off |
Only super-admins can fork a public recipe into a project recipe. If you need to customize a recipe’s trainingConfig, use the override flags on inf training create rather than trying to clone the recipe.

inf training recipes get

Inspect a specific recipe, including its full trainingConfig. Useful for spotting knobs you may want to override at queue time.
inf training recipes get <id>

Arguments

| Argument | Required | Description |
| --- | --- | --- |
| id | Yes | Recipe ID |

Examples

# List recipes visible to the active project
inf training recipes

# Inspect a specific recipe (including its trainingConfig)
inf training recipes get inf-public-training-recipe:qwen-3.5-4b-fft

# Only show project-owned recipes
inf training recipes --project-only

inf training create

Queue a new training run. You can either specify a recipe (recommended — it pre-fills base model, judge, GPU plan, and trainingConfig) or pass individual flags. Override flags let you tweak specific trainingConfig fields without forking the recipe.
inf training create \
  --name <name> \
  --training-dataset <id> \
  --eval-dataset <id> \
  --rubric <id>
Alias: inf training queue

Options

| Flag | Required | Description | Default |
| --- | --- | --- | --- |
| -n, --name <name> | Yes | Display name for the run | |
| --training-dataset <id> | Yes | Training-type dataset ID (also accepts --training-dataset-id) | |
| --eval-dataset <id> | Yes | Eval-type dataset ID (also accepts --eval-dataset-id) | |
| --rubric <id> | Yes | Rubric ID (also accepts --rubric-id) | |
| --rubric-version-id <id> | No | Pin to a specific rubric version | Latest version |
| --recipe <id> | No | Training recipe ID; pre-fills base model, judge, GPU plan, and trainingConfig (also --recipe-id) | |
| --base-model <id> | No | Base model to fine-tune (overrides the recipe; also --base-model-id) | Recipe default |
| --judge-model <id> | No | Judge model for evals (overrides the recipe; also --judge-model-id) | Recipe default |
| --num-nodes <n> | No | Number of training nodes | Recipe default |
| --gpus-per-node <n> | No | GPUs per node (1–8) | Recipe default |
| --task-id <id> | No | Associate the run with an existing task | |
| --sample-packing <bool> | No | Override trainingConfig.sample_packing (true/false) | Recipe default |
| --num-epochs <n> | No | Override trainingConfig.num_epochs | Recipe default |
| --max-steps <n> | No | Override trainingConfig.max_steps (use -1 for “no cap”) | Recipe default |
| --learning-rate <n> | No | Override trainingConfig.learning_rate | Recipe default |
| --dry-run | No | Print the exact tRPC payload that would be sent and exit; nothing is created | Off |
GPU hardware selection is managed server-side and is not configurable from the CLI. When --rubric-version-id is omitted, the CLI fetches the latest version of --rubric before queueing; pin a version if you need reproducibility across re-runs. On success, the command prints the new training job ID along with a ready-to-paste inf training poll <id> follow-up command. The run status starts as queued while datasets are prepared, then moves to running.
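To make a re-run score against the exact same rubric wording, the version pin slots into any queue command. A sketch reusing the IDs from the walkthrough at the top of this page (rub_xyz56 and its version rv_ver78), with --dry-run so nothing is created while you verify the payload:

```shell
# Pin the rubric version for reproducibility, then inspect the payload.
inf training create \
  --name distill-v1-pinned \
  --recipe inf-public-training-recipe:qwen-3.5-4b-fft \
  --training-dataset ds_trn_abc12 \
  --eval-dataset     ds_evl_def34 \
  --rubric           rub_xyz56 \
  --rubric-version-id rv_ver78 \
  --dry-run
```

Drop --dry-run once the printed payload looks right.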

Escape hatches for recipe-pinned configs

The public recipes are opinionated: sample_packing is often true, and num_epochs / max_steps are tuned for the typical dataset size. When your dataset is small or shaped differently, those defaults can cause torchrun to crash (for example, with args.max_steps must be set to a positive value if dataloader does not have a length, was -1, which is what happens when sample packing collapses a tiny dataset into fewer batches than the FSDP shard count). The --sample-packing, --num-epochs, --max-steps, and --learning-rate flags give you override knobs without forking the recipe, which only super-admins can do. Pair them with --dry-run to sanity-check the resulting payload before burning a job slot.

Examples

# Minimal queue with a recipe
inf training create \
  --name distill-v1 \
  --recipe inf-public-training-recipe:qwen-3.5-4b-fft \
  --training-dataset ds_trn_abc12 \
  --eval-dataset     ds_evl_def34 \
  --rubric           rub_xyz56

# Override recipe defaults for a small dataset and verify the payload first
inf training create \
  --name distill-hardreasoning-qwen3.5-4b-v2 \
  --recipe inf-public-training-recipe:qwen-3.5-4b-fft \
  --training-dataset ds_trn_abc12 \
  --eval-dataset     ds_evl_def34 \
  --rubric           rub_xyz56 \
  --sample-packing false \
  --num-epochs 5 \
  --dry-run

# Use the `queue` alias
inf training queue \
  --name distill-v1 \
  --base-model model_abc \
  --judge-model model_xyz \
  --training-dataset ds_trn_abc12 \
  --eval-dataset     ds_evl_def34 \
  --rubric           rub_xyz56

inf training list

Display a table of training runs in the active project.
inf training list
Alias: inf training ls

Options

| Flag | Required | Description | Default |
| --- | --- | --- | --- |
| -s, --status <status> | No | Filter by status: exporting_datasets, queued, running, cycling, completed, failed, cancelled, or timed_out | All statuses |
| -l, --limit <n> | No | Maximum number of results | 20 |
| --offset <n> | No | Offset for pagination | 0 |
The table shows the run ID (8-char prefix), name, status (color-coded), base model, progress (currentStep/totalSteps), current loss, and creation date.

Examples

# Show only running training jobs
inf training list --status running

# Get the first 50 training runs
inf training list --limit 50

# Page through results
inf training list --limit 20 --offset 40

inf training get

View detailed information about a specific training run. Without --error, prints the full detail view. With --error, dumps only the fields that matter when a run crashed — ideal for triaging failed jobs from CI or a script.
inf training get <id>

Arguments

| Argument | Required | Description |
| --- | --- | --- |
| id | Yes | The training job ID |

Options

| Flag | Required | Description | Default |
| --- | --- | --- | --- |
| --error | No | Print only the status, status detail, and error message in a highlighted block (for failed runs) | Off |

Output

The default detail view includes every field on the training job:
| Field | Description |
| --- | --- |
| id, name | Run identifier and display name |
| status | One of the status values listed under inf training list |
| baseModelId | Base model the run fine-tunes from |
| adapter | Adapter type (e.g. LoRA) |
| currentStep / totalSteps | Progress counters |
| currentLoss | Most recent training loss |
| numNodes | Number of nodes participating in the run |
| provider / providerRunId | Underlying training provider and their internal run ID |
| evalDatasetName / rubricName | Eval configuration (if the run is configured for mid-training evals) |
| filteredDatasetName | Training dataset name |
| startedAt / completedAt / createdAt | Lifecycle timestamps |
| statusDetail / errorMessage | Populated when the run fails or ends in a non-success state |
With --error, the output collapses to just status, statusDetail, and errorMessage.

Examples

# Full detail view
inf training get job_abc123

# Just the error payload for a failed run
inf training get job_abc123 --error

inf training cancel

Cancel a queued or running training job. Completed and already-cancelled jobs will reject the call.
inf training cancel <id>

Arguments

| Argument | Required | Description |
| --- | --- | --- |
| id | Yes | The training job ID |

Options

| Flag | Required | Description | Default |
| --- | --- | --- | --- |
| -y, --yes | Yes in non-TTY environments | Skip the confirmation prompt | Off |
In an interactive terminal, the CLI asks for confirmation unless -y is passed. In non-TTY environments (CI, scripts) the command refuses to run without -y.

Examples

# Cancel interactively
inf training cancel job_abc123

# Cancel non-interactively (required in CI)
inf training cancel job_abc123 --yes

inf training logs

Stream or view log entries for a training run. Logs are color-coded by level, and each line is prefixed with the training phase it came from (torchrun_init, training, inference_export, …) so you can tell setup crashes apart from training crashes at a glance.
inf training logs <id>

Arguments

| Argument | Required | Description |
| --- | --- | --- |
| id | Yes | The training job ID |

Options

| Flag | Required | Description | Default |
| --- | --- | --- | --- |
| -l, --limit <n> | No | Maximum number of log entries | 50 |
| --level <level> | No | Filter by log level (e.g. error, warn, info) | All levels |
| --phase <phase> | No | Filter by training phase (substring match, e.g. torchrun_init, training) | No filter |
| -f, --follow | No | Continuously poll for new logs (like tail -f) | Off |
Log entries are timestamped and color-coded: errors in red, warnings in yellow, info in blue. In follow mode, the CLI polls for new logs every 3 seconds until you press Ctrl+C.

Examples

# View the last 50 log entries
inf training logs job_abc123

# Stream logs in real time
inf training logs job_abc123 --follow

# Only show error logs
inf training logs job_abc123 --level error

# Only show setup-phase logs (everything before training starts)
inf training logs job_abc123 --phase torchrun_init

inf training poll

Wait for a training run to complete, printing status updates as the status changes.
inf training poll <id>

Arguments

| Argument | Required | Description |
| --- | --- | --- |
| id | Yes | The training job ID |

Options

| Flag | Required | Description | Default |
| --- | --- | --- | --- |
| -i, --interval <seconds> | No | Poll interval in seconds | 10 |
The CLI prints a status line each time the status changes, showing the status, progress, and current loss. It exits automatically when the run reaches completed, failed, cancelled, or timed_out. On a failed exit, the CLI points you at inf training get <id> --error for the full error payload.

Examples

# Poll every 10 seconds (default)
inf training poll job_abc123

# Poll every 30 seconds
inf training poll job_abc123 --interval 30