Kick off and manage model training runs from the command line. Discover recipes and trainable base models, queue new runs (with override flags for the trickier trainingConfig knobs), cancel in-flight jobs, and zoom in on errors without scrolling through raw logs.
Alias: inf train
The full training loop is paste-able from the terminal:
# 1. Materialize training and eval datasets
inf dataset create -n my-train-split -t training --file ./train.jsonl
inf dataset create -n my-eval-split -t eval --file ./eval.jsonl
# → Datasets ds_trn_abc12 / ds_evl_def34 created.
# 2. Create (or reuse) a rubric the evals will score against
inf eval rubric create -n my-rubric -f ./rubric.md
# → Rubric rub_xyz56 / version rv_ver78 created.
# 3. Pick a recipe and base model
inf training recipes
inf training models
# 4. Queue the training run
inf training create \
--name distill-hardreasoning-qwen3.5-4b-v2 \
--recipe inf-public-training-recipe:qwen-3.5-4b-fft \
--training-dataset ds_trn_abc12 \
--eval-dataset ds_evl_def34 \
--rubric rub_xyz56 \
--sample-packing false \
--num-epochs 5 \
--task-id distill-hardreasoning-v2
# → Training job job_90ab12 queued.
# 5. Track progress until it finishes
inf training poll job_90ab12
inf training models
Discover the base models you can fine-tune and (with --judge) the judge models that can score checkpoints.
Options
| Flag | Required | Description | Default |
|---|---|---|---|
| --judge | No | List judge models instead of base models (use with --judge-model) | Off |
The table shows each model’s canonical alias, full name, and ID prefix. Pair with --json when scripting to preserve full IDs — those full IDs are what you pass to --base-model / --judge-model on inf training create.
Examples
# List base models you can fine-tune
inf training models
# List judge models instead
inf training models --judge
# Dump full IDs for scripting
inf training models --json | jq -r '.[].id'
inf training recipes
A recipe bundles a base model, a judge model, a GPU plan, and a full trainingConfig. inf training recipes lists every recipe visible to the active project: public recipes plus the project’s own recipes.
Options
| Flag | Required | Description | Default |
|---|---|---|---|
| --include-archived | No | Include archived recipes | Off |
| --public-only | No | Show only public recipes | Off |
| --project-only | No | Show only the active project’s recipes | Off |
Only super-admins can fork a public recipe into a project recipe. If you need to customize a recipe’s trainingConfig, use the override flags on inf training create rather than trying to clone the recipe.
inf training recipes get
Inspect a specific recipe, including its full trainingConfig. Useful for spotting knobs you may want to override at queue time.
inf training recipes get <id>
Arguments
| Argument | Required | Description |
|---|---|---|
| id | Yes | Recipe ID |
Examples
# List recipes visible to the active project
inf training recipes
# Inspect a specific recipe (including its trainingConfig)
inf training recipes get inf-public-training-recipe:qwen-3.5-4b-fft
# Only show project-owned recipes
inf training recipes --project-only
inf training create
Queue a new training run. You can either specify a recipe (recommended — it pre-fills base model, judge, GPU plan, and trainingConfig) or pass individual flags. Override flags let you tweak specific trainingConfig fields without forking the recipe.
inf training create \
--name <name> \
--training-dataset <id> \
--eval-dataset <id> \
--rubric <id>
Alias: inf training queue
Options
| Flag | Required | Description | Default |
|---|---|---|---|
| -n, --name <name> | Yes | Display name for the run | None |
| --training-dataset <id> | Yes | Training-type dataset ID (also accepts --training-dataset-id) | None |
| --eval-dataset <id> | Yes | Eval-type dataset ID (also accepts --eval-dataset-id) | None |
| --rubric <id> | Yes | Rubric ID (also accepts --rubric-id) | None |
| --rubric-version-id <id> | No | Pin to a specific rubric version | Latest version |
| --recipe <id> | No | Training recipe ID; pre-fills base model, judge, GPU plan, and trainingConfig (also --recipe-id) | None |
| --base-model <id> | No | Base model to fine-tune (overrides the recipe, also --base-model-id) | Recipe default |
| --judge-model <id> | No | Judge model for evals (overrides the recipe, also --judge-model-id) | Recipe default |
| --num-nodes <n> | No | Number of training nodes | Recipe default |
| --gpus-per-node <n> | No | GPUs per node (1–8) | Recipe default |
| --task-id <id> | No | Associate the run with an existing task | None |
| --sample-packing <bool> | No | Override trainingConfig.sample_packing (true/false) | Recipe default |
| --num-epochs <n> | No | Override trainingConfig.num_epochs | Recipe default |
| --max-steps <n> | No | Override trainingConfig.max_steps (use -1 for “no cap”) | Recipe default |
| --learning-rate <n> | No | Override trainingConfig.learning_rate | Recipe default |
| --dry-run | No | Print the exact tRPC payload that would be sent, then exit without creating anything | Off |
GPU hardware selection is managed server-side and is not configurable from the CLI.
When --rubric-version-id is omitted, the CLI fetches the latest version of the rubric passed to --rubric before queueing. Pin a version if you need reproducibility across re-runs.
On success, the command prints the new training job ID along with an inf training poll <id> follow-up command. The run status starts as queued while datasets are prepared, then moves to running.
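As a minimal sketch of version pinning, the wrapper below forwards its arguments to inf training create and appends both --rubric and --rubric-version-id, so re-runs score against the same rubric version. The function name queue_pinned is hypothetical, and the IDs in the usage comment are the placeholder values from the walkthrough at the top of this page.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the CLI): queue a run with the rubric
# version pinned so that re-runs use the same rubric text.
# Usage: queue_pinned <rubric-id> <rubric-version-id> [other create flags...]
queue_pinned() {
  local rubric_id="$1" rubric_version_id="$2"
  shift 2
  # Remaining arguments are passed through to `inf training create` unchanged.
  inf training create "$@" \
    --rubric "$rubric_id" \
    --rubric-version-id "$rubric_version_id"
}

# Example with placeholder IDs:
# queue_pinned rub_xyz56 rv_ver78 \
#   --name distill-v1 \
#   --recipe inf-public-training-recipe:qwen-3.5-4b-fft \
#   --training-dataset ds_trn_abc12 \
#   --eval-dataset ds_evl_def34
```

Recording the rubric version ID next to the job ID (for example in your CI logs) makes it easier to reproduce a run later.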
Escape hatches for recipe-pinned configs
The public recipes are opinionated: sample_packing is often true, and num_epochs / max_steps are tuned for a typical dataset size. When your dataset is small or shaped differently, those defaults can crash torchrun. One example is the error args.max_steps must be set to a positive value if dataloader does not have a length, was -1, which appears when sample packing collapses a tiny dataset into fewer batches than the FSDP shard count.
The --sample-packing, --num-epochs, --max-steps, and --learning-rate flags override those fields without forking the recipe, which only super-admins can do. Pair them with --dry-run to sanity-check the resulting payload before burning a job slot.
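The preview-then-queue pattern can be sketched as a small shell function. This is an illustration only: queue_after_preview is a hypothetical name, and the sketch assumes that inf training create --dry-run exits non-zero when the arguments are rejected, which this page does not confirm.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the payload with --dry-run first, then queue
# the same run for real only if the dry run succeeded.
# ASSUMPTION: `inf training create --dry-run` returns a non-zero exit code
# on invalid input; the exit codes are not documented here.
queue_after_preview() {
  # "$@" holds the flags shared by both invocations.
  if ! inf training create "$@" --dry-run; then
    echo "dry run failed; nothing was queued" >&2
    return 1
  fi
  inf training create "$@"
}

# Example with placeholder IDs:
# queue_after_preview \
#   --name distill-hardreasoning-qwen3.5-4b-v2 \
#   --recipe inf-public-training-recipe:qwen-3.5-4b-fft \
#   --training-dataset ds_trn_abc12 \
#   --eval-dataset ds_evl_def34 \
#   --rubric rub_xyz56 \
#   --sample-packing false \
#   --num-epochs 5
```

Because both calls receive the same argument list, the payload you review is the payload that gets queued.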
Examples
# Minimal queue with a recipe
inf training create \
--name distill-v1 \
--recipe inf-public-training-recipe:qwen-3.5-4b-fft \
--training-dataset ds_trn_abc12 \
--eval-dataset ds_evl_def34 \
--rubric rub_xyz56
# Override recipe defaults for a small dataset and verify the payload first
inf training create \
--name distill-hardreasoning-qwen3.5-4b-v2 \
--recipe inf-public-training-recipe:qwen-3.5-4b-fft \
--training-dataset ds_trn_abc12 \
--eval-dataset ds_evl_def34 \
--rubric rub_xyz56 \
--sample-packing false \
--num-epochs 5 \
--dry-run
# Use the `queue` alias
inf training queue \
--name distill-v1 \
--base-model model_abc \
--judge-model model_xyz \
--training-dataset ds_trn_abc12 \
--eval-dataset ds_evl_def34 \
--rubric rub_xyz56
inf training list
Display a table of training runs in the active project.
Alias: inf training ls
Options
| Flag | Required | Description | Default |
|---|---|---|---|
| -s, --status <status> | No | Filter by status: exporting_datasets, queued, running, cycling, completed, failed, cancelled, or timed_out | All statuses |
| -l, --limit <n> | No | Maximum number of results | 20 |
| --offset <n> | No | Offset for pagination | 0 |
The table shows the run ID (8-char prefix), name, status (color-coded), base model, progress (currentStep/totalSteps), current loss, and creation date.
Examples
# Show only running training jobs
inf training list --status running
# Get the first 50 training runs
inf training list --limit 50
# Page through results
inf training list --limit 20 --offset 40
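The paging example above corresponds to page 3 at 20 rows per page. A small helper can compute the flags for any page; page_flags is a hypothetical name, and the sketch assumes --offset counts rows rather than pages, which matches the example but is not stated explicitly.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the --limit/--offset flags for a 1-based page.
# ASSUMPTION: --offset is a row offset, so page N starts at (N - 1) * size.
page_flags() {
  local page="$1" size="${2:-20}"   # 20 mirrors the documented default limit
  echo "--limit ${size} --offset $(( (page - 1) * size ))"
}

# Example: page 3 at the default page size expands to --limit 20 --offset 40.
# The substitution is left unquoted on purpose so it splits into four words.
# inf training list $(page_flags 3)
```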
inf training get
View detailed information about a specific training run. Without --error, prints the full detail view. With --error, dumps only the fields that matter when a run crashed — ideal for triaging failed jobs from CI or a script.
Arguments
| Argument | Required | Description |
|---|---|---|
| id | Yes | The training job ID |
Options
| Flag | Required | Description | Default |
|---|---|---|---|
| --error | No | Print only the status, status detail, and error message in a highlighted block (for failed runs) | Off |
Output
The default detail view includes every field on the training job:
| Field | Description |
|---|---|
| id, name | Run identifier and display name |
| status | One of the status values listed under inf training list |
| baseModelId | Base model the run fine-tunes from |
| adapter | Adapter type (e.g. LoRA) |
| currentStep / totalSteps | Progress counters |
| currentLoss | Most recent training loss |
| numNodes | Number of nodes participating in the run |
| provider / providerRunId | Underlying training provider and their internal run ID |
| evalDatasetName / rubricName | Eval configuration (if the run is configured for mid-training evals) |
| filteredDatasetName | Training dataset name |
| startedAt / completedAt / createdAt | Lifecycle timestamps |
| statusDetail / errorMessage | Populated when the run fails or ends in a non-success state |
With --error, the output collapses to just status, statusDetail, and errorMessage.
Examples
# Full detail view
inf training get job_abc123
# Just the error payload for a failed run
inf training get job_abc123 --error
inf training cancel
Cancel a queued or running training job. The call is rejected if the job has already completed or been cancelled.
Arguments
| Argument | Required | Description |
|---|---|---|
| id | Yes | The training job ID |
Options
| Flag | Required | Description | Default |
|---|---|---|---|
| -y, --yes | Yes in non-TTY environments | Skip the confirmation prompt | Off |
In an interactive terminal, the CLI asks for confirmation unless -y is passed. In non-TTY environments (CI, scripts) the command refuses to run without -y.
Examples
# Cancel interactively
inf training cancel job_abc123
# Cancel non-interactively (required in CI)
inf training cancel job_abc123 --yes
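For CI cleanup steps that need to cancel several runs, a loop along these lines may be useful. cancel_jobs is a hypothetical helper, and the sketch assumes a rejected cancel (for example on a completed job) surfaces as a non-zero exit code, which this page does not spell out.

```shell
#!/usr/bin/env bash
# Hypothetical helper: cancel every job ID passed as an argument without
# prompting, and keep going when one cancel is rejected.
# ASSUMPTION: a rejected cancel makes `inf training cancel` exit non-zero.
cancel_jobs() {
  local failed=0 job_id
  for job_id in "$@"; do
    if ! inf training cancel "$job_id" --yes; then
      echo "could not cancel ${job_id}" >&2
      failed=1
    fi
  done
  # Returns 1 if any cancel failed, 0 otherwise.
  return "$failed"
}

# Example with placeholder IDs:
# cancel_jobs job_abc123 job_90ab12
```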
inf training logs
Stream or view log entries for a training run. Logs are color-coded by level, and each line is prefixed with the training phase it came from (torchrun_init, training, inference_export, …) so you can tell setup crashes apart from training crashes at a glance.
Arguments
| Argument | Required | Description |
|---|---|---|
| id | Yes | The training job ID |
Options
| Flag | Required | Description | Default |
|---|---|---|---|
| -l, --limit <n> | No | Maximum number of log entries | 50 |
| --level <level> | No | Filter by log level (e.g. error, warn, info) | All levels |
| --phase <phase> | No | Filter by training phase (substring match, e.g. torchrun_init or training) | No filter |
| -f, --follow | No | Continuously poll for new logs (like tail -f) | Off |
Log entries are timestamped and color-coded: errors in red, warnings in yellow, info in blue. In follow mode, the CLI polls for new logs every 3 seconds until you press Ctrl+C.
Examples
# View the last 50 log entries
inf training logs job_abc123
# Stream logs in real time
inf training logs job_abc123 --follow
# Only show error logs
inf training logs job_abc123 --level error
# Only show setup-phase logs (everything before training starts)
inf training logs job_abc123 --phase torchrun_init
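When triaging a setup crash, it can help to narrow by phase and level at once. The sketch below assumes the --phase, --level, and --limit flags can be combined in a single call, which the examples above do not demonstrate; setup_errors is a hypothetical name and the limit of 200 is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show only error-level entries from the setup phase.
# ASSUMPTION: --phase, --level, and --limit may be combined in one call.
setup_errors() {
  local job_id="$1" limit="${2:-200}"   # 200 is arbitrary, not a CLI default
  inf training logs "$job_id" \
    --phase torchrun_init \
    --level error \
    --limit "$limit"
}

# Example with a placeholder ID:
# setup_errors job_abc123
```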
inf training poll
Wait for a training run to complete, printing status updates as the status changes.
Arguments
| Argument | Required | Description |
|---|---|---|
| id | Yes | The training job ID |
Options
| Flag | Required | Description | Default |
|---|---|---|---|
| -i, --interval <seconds> | No | Poll interval in seconds | 10 |
The CLI prints a status line each time the status changes, showing the status, progress, and current loss. It exits automatically when the run reaches completed, failed, cancelled, or timed_out. On a failed exit, the CLI points you at inf training get <id> --error for the full error payload.
Examples
# Poll every 10 seconds (default)
inf training poll job_abc123
# Poll every 30 seconds
inf training poll job_abc123 --interval 30
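In a CI job, polling and error triage can be chained in one step. The following is a sketch under a stated assumption: that inf training poll exits non-zero when the run ends in a non-success state. This page describes what poll prints on failure but not its exit code, so verify that before relying on it. wait_or_triage is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: block until the run finishes; if polling reports a
# failure, print the condensed error payload to stderr and fail the step.
# ASSUMPTION: `inf training poll` exits non-zero for failed, cancelled, or
# timed_out runs. The exit code is not documented here.
wait_or_triage() {
  local job_id="$1" interval="${2:-30}"
  if inf training poll "$job_id" --interval "$interval"; then
    return 0
  fi
  inf training get "$job_id" --error >&2
  return 1
}

# Example with a placeholder ID:
# wait_or_triage job_abc123 60
```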