| Term | Definition |
|---|---|
| Task | A user-defined objective that groups LLM calls (e.g., “summarize document,” “classify ticket”). Tasks persist even as the implementation changes. |
| Rubric | A plain-English description of how to judge model output on a quality dimension; an LLM judge scores outputs against it numerically. |
| Recipe | A pre-configured training setup including base model, training parameters, and compute config. |
| Dataset | A curated set of inference samples used for evaluation or training. |
| Eval dataset | A dataset used to measure model quality. Should remain stable over time so scores stay comparable across runs, and must not overlap with training data. |
| Training dataset | A dataset the model learns from during fine-tuning. Evolves as you iterate on data quality. |
| Inference | A single LLM request-response pair captured by the platform. |
| Deployment | A trained model running on a dedicated GPU, accessible via an OpenAI-compatible API. |
| Mid-training eval | A periodic evaluation run during training that scores model checkpoints against the rubric. Used for early stopping. |
| LLM-as-a-judge | The evaluation mechanism where an LLM scores model outputs against rubric criteria. |
| TTFT | Time to first token: the delay between sending a request and receiving the first streamed token. Measures streaming responsiveness. |
| Early stopping | Automatically halting training when eval scores degrade, preventing overfitting. |
| Overfitting | When a model memorizes training data instead of learning generalizable patterns. Detected by degrading eval scores. |
| Distillation | Training a smaller model to replicate the quality of a larger model, reducing cost and latency. |
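The LLM-as-a-judge mechanism above can be sketched as a prompt-and-parse loop: wrap the plain-English rubric and the candidate output in a scoring prompt, send it to a judge model, and extract the numeric score from the reply. This is a minimal illustration, not the platform's implementation; the rubric text, prompt wording, and `call_judge` function are assumptions.

```python
import re

# Example plain-English rubric (hypothetical; any quality dimension works).
RUBRIC = "Rate the summary's faithfulness to the source from 1 to 5."

def build_judge_prompt(rubric: str, model_output: str) -> str:
    """Wrap the rubric and the candidate output in a scoring prompt."""
    return (
        f"{rubric}\n\n"
        f"Output to score:\n{model_output}\n\n"
        "Reply with only the numeric score."
    )

def parse_score(judge_reply: str) -> float:
    """Extract the first number from the judge's reply as the score."""
    match = re.search(r"\d+(?:\.\d+)?", judge_reply)
    if match is None:
        raise ValueError(f"no score found in: {judge_reply!r}")
    return float(match.group())

# In practice, build_judge_prompt(...) would be sent to a judge LLM
# (a hypothetical call_judge function), and parse_score applied to the reply:
#   score = parse_score(call_judge(build_judge_prompt(RUBRIC, output)))
```

Parsing defensively matters because judges sometimes reply "Score: 4" or "4.5/5" instead of a bare number.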
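Early stopping, as defined above, can be reduced to a simple rule over the sequence of mid-training eval scores: halt once the last few checkpoints all score below the best score seen earlier. A minimal sketch, assuming scores are rubric averages per checkpoint and a hypothetical `patience` parameter controlling how many degraded evals to tolerate:

```python
def should_stop(eval_scores: list[float], patience: int = 2) -> bool:
    """Return True when the most recent `patience` mid-training evals
    each scored below the best score from earlier checkpoints,
    suggesting the model has begun to overfit."""
    if len(eval_scores) <= patience:
        return False  # not enough history to judge a downward trend
    best_so_far = max(eval_scores[:-patience])
    return all(score < best_so_far for score in eval_scores[-patience:])
```

For example, with scores rising to 0.71 and then falling for two consecutive evals, `should_stop([0.60, 0.68, 0.71, 0.69, 0.66])` returns `True`, while a still-improving run returns `False`.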