trainingConfig knobs), cancel in-flight jobs, and zoom in on errors without scrolling through raw logs.
Alias: inf train
The full training loop is copy-pasteable into the terminal:
inf training models
Discover the base models you can fine-tune and (with --judge) the judge models that can score checkpoints.
Options
| Flag | Required | Description | Default |
|---|---|---|---|
| --judge | No | List judge models instead of base models (use with --judge-model) | Off |
Use --json when scripting to preserve full IDs — those full IDs are what you pass to --base-model / --judge-model on inf training create.
Examples
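Typical invocations (output IDs will vary by project):

```shell
# List base models available for fine-tuning
inf training models

# List judge models instead, with full IDs for scripting
inf training models --judge --json
```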
inf training recipes
Recipes bundle a base model + judge model + GPU plan + full trainingConfig. inf training recipes lists everything visible to the active project (public recipes + the project’s own recipes).
Options
| Flag | Required | Description | Default |
|---|---|---|---|
| --include-archived | No | Include archived recipes | Off |
| --public-only | No | Show only public recipes | Off |
| --project-only | No | Show only the active project’s recipes | Off |
inf training recipes get
Inspect a specific recipe, including its full trainingConfig. Useful for spotting knobs you may want to override at queue time.
Arguments
| Argument | Required | Description |
|---|---|---|
| id | Yes | Recipe ID |
Examples
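For example, with a placeholder recipe ID:

```shell
# Inspect a recipe's full trainingConfig before deciding what to override
inf training recipes get rcp_abc123
```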
inf training create
Queue a new training run. You can either specify a recipe (recommended — it pre-fills base model, judge, GPU plan, and trainingConfig) or pass individual flags. Override flags let you tweak specific trainingConfig fields without forking the recipe.
inf training queue
Options
| Flag | Required | Description | Default |
|---|---|---|---|
| -n, --name <name> | Yes | Display name for the run | — |
| --training-dataset <id> | Yes | Training-type dataset ID (also accepts --training-dataset-id) | — |
| --eval-dataset <id> | Yes | Eval-type dataset ID (also accepts --eval-dataset-id) | — |
| --rubric <id> | Yes | Rubric ID (also accepts --rubric-id) | — |
| --rubric-version-id <id> | No | Pin to a specific rubric version | Latest version |
| --recipe <id> | No | Training recipe ID — pre-fills base model, judge, GPU plan, and trainingConfig (also --recipe-id) | — |
| --base-model <id> | No | Base model to fine-tune (overrides the recipe; also --base-model-id) | Recipe default |
| --judge-model <id> | No | Judge model for evals (overrides the recipe; also --judge-model-id) | Recipe default |
| --num-nodes <n> | No | Number of training nodes | Recipe default |
| --gpus-per-node <n> | No | GPUs per node (1–8) | Recipe default |
| --task-id <id> | No | Associate the run with an existing task | — |
| --sample-packing <bool> | No | Override trainingConfig.sample_packing (true/false) | Recipe default |
| --num-epochs <n> | No | Override trainingConfig.num_epochs | Recipe default |
| --max-steps <n> | No | Override trainingConfig.max_steps (use -1 for “no cap”) | Recipe default |
| --learning-rate <n> | No | Override trainingConfig.learning_rate | Recipe default |
| --dry-run | No | Print the exact tRPC payload that would be sent and exit — nothing is created | Off |
If --rubric-version-id is omitted, the CLI fetches the latest version of --rubric before queueing — pin a version if you need reproducibility across re-runs. On success, the command prints the new training job ID along with an inf training poll <id> follow-up command. The run starts in the queued status while datasets are prepared, then moves to running.
Escape hatches for recipe-pinned configs
The public recipes are opinionated: sample_packing is often true, and num_epochs / max_steps are tuned for the typical dataset size. When your dataset is small or shaped differently, those defaults can cause torchrun to crash (for example, with args.max_steps must be set to a positive value if dataloader does not have a length, was -1 — which is what happens when sample packing collapses a tiny dataset into fewer batches than the FSDP shard count).
The --sample-packing, --num-epochs, --max-steps, and --learning-rate flags give you override knobs without needing to fork the recipe (only super-admins can fork). Pair them with --dry-run to sanity-check the resulting payload before burning a job slot.
Examples
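A recipe-based run with a couple of trainingConfig overrides (all IDs below are placeholders):

```shell
# Queue from a recipe, disabling sample packing and uncapping max_steps
# for a small dataset. Drop --dry-run to actually create the run.
inf training create \
  -n "tiny-dataset-smoke-test" \
  --training-dataset ds_train_123 \
  --eval-dataset ds_eval_456 \
  --rubric rub_789 \
  --recipe rcp_abc123 \
  --sample-packing false \
  --max-steps -1 \
  --dry-run
```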
inf training list
Display a table of training runs in the active project.
inf training ls
Options
| Flag | Required | Description | Default |
|---|---|---|---|
| -s, --status <status> | No | Filter by status: exporting_datasets, queued, running, cycling, completed, failed, cancelled, or timed_out | All statuses |
| -l, --limit <n> | No | Maximum number of results | 20 |
| --offset <n> | No | Offset for pagination | 0 |
Each row shows progress (currentStep/totalSteps), current loss, and creation date.
Examples
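Common filters and pagination:

```shell
# Most recent failed runs
inf training list -s failed -l 5

# Page through older runs
inf training ls --limit 20 --offset 20
```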
inf training get
View detailed information about a specific training run. Without --error, prints the full detail view. With --error, dumps only the fields that matter when a run crashed — ideal for triaging failed jobs from CI or a script.
Arguments
| Argument | Required | Description |
|---|---|---|
| id | Yes | The training job ID |
Options
| Flag | Required | Description | Default |
|---|---|---|---|
| --error | No | Print only the status, status detail, and error message in a highlighted block (for failed runs) | Off |
Output
The default detail view includes every field on the training job:

| Field | Description |
|---|---|
| id, name | Run identifier and display name |
| status | One of the status values listed under inf training list |
| baseModelId | Base model the run fine-tunes from |
| adapter | Adapter type (e.g. LoRA) |
| currentStep / totalSteps | Progress counters |
| currentLoss | Most recent training loss |
| numNodes | Number of nodes participating in the run |
| provider / providerRunId | Underlying training provider and their internal run ID |
| evalDatasetName / rubricName | Eval configuration (if the run is configured for mid-training evals) |
| filteredDatasetName | Training dataset name |
| startedAt / completedAt / createdAt | Lifecycle timestamps |
| statusDetail / errorMessage | Populated when the run fails or ends in a non-success state |
With --error, the output collapses to just status, statusDetail, and errorMessage.
Examples
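Both modes, with a placeholder job ID:

```shell
# Full detail view
inf training get tj_abc123

# Just the failure fields, for triaging from CI
inf training get tj_abc123 --error
```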
inf training cancel
Cancel a queued or running training job. Completed and already-cancelled jobs will reject the call.
Arguments
| Argument | Required | Description |
|---|---|---|
| id | Yes | The training job ID |
Options
| Flag | Required | Description | Default |
|---|---|---|---|
| -y, --yes | Yes in non-TTY environments | Skip the confirmation prompt | Off |
The command prompts for confirmation unless -y is passed. In non-TTY environments (CI, scripts) it refuses to run without -y.
Examples
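Interactive and non-interactive cancellation (job ID is a placeholder):

```shell
# Interactive: confirms before cancelling
inf training cancel tj_abc123

# Non-interactive (CI): skip the prompt
inf training cancel tj_abc123 -y
```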
inf training logs
Stream or view log entries for a training run. Logs are color-coded by level, and each line is prefixed with the training phase it came from (torchrun_init, training, inference_export, …) so you can tell setup crashes apart from training crashes at a glance.
Arguments
| Argument | Required | Description |
|---|---|---|
| id | Yes | The training job ID |
Options
| Flag | Required | Description | Default |
|---|---|---|---|
| -l, --limit <n> | No | Maximum number of log entries | 50 |
| --level <level> | No | Filter by log level (e.g. error, warn, info) | All levels |
| --phase <phase> | No | Filter by training phase (substring match — e.g. torchrun_init, training) | No filter |
| -f, --follow | No | Continuously poll for new logs (like tail -f) | Off |
When following, stop with Ctrl+C.
Examples
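Viewing and following logs (job ID is a placeholder):

```shell
# Last 50 log entries
inf training logs tj_abc123

# Follow error-level logs from the training phase
inf training logs tj_abc123 -f --level error --phase training
```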
inf training poll
Wait for a training run to complete, printing an update each time the status changes.
Arguments
| Argument | Required | Description |
|---|---|---|
| id | Yes | The training job ID |
Options
| Flag | Required | Description | Default |
|---|---|---|---|
| -i, --interval <seconds> | No | Poll interval in seconds | 10 |
Polling exits when the run reaches a terminal status: completed, failed, cancelled, or timed_out. On a failed exit, the CLI points you at inf training get <id> --error for the full error payload.
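A typical invocation (job ID is a placeholder):

```shell
# Poll every 30 seconds until the run reaches a terminal status
inf training poll tj_abc123 -i 30
```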