| Term | Definition |
| --- | --- |
| Task | A user-defined objective that groups LLM calls (e.g., “summarize document,” “classify ticket”). Tasks persist even as the implementation changes. |
| Rubric | A plain-English description of how to judge model output on a quality dimension. Scored numerically by an LLM judge. |
| Recipe | A pre-configured training setup including base model, training parameters, and compute config. |
| Dataset | A curated set of inference samples used for evaluation or training. |
| Eval dataset | A dataset used to measure model quality. Should remain stable over time. Must not overlap with training data. |
| Training dataset | A dataset the model learns from during fine-tuning. Evolves as you iterate on data quality. |
| Inference | A single LLM request-response pair captured by the platform. |
| Deployment | A trained model running on a dedicated GPU, accessible via an OpenAI-compatible API. |
| Mid-training eval | A periodic evaluation run during training that scores model checkpoints against the rubric. Used for early stopping. |
| LLM-as-a-judge | The evaluation mechanism where an LLM scores model outputs against rubric criteria. |
| TTFT | Time to first token. Measures streaming responsiveness. |
| Early stopping | Automatically halting training when eval scores degrade, preventing overfitting. |
| Overfitting | When a model memorizes training data instead of learning generalizable patterns. Detected by degrading eval scores. |
| Distillation | Training a smaller model to replicate the quality of a larger model, reducing cost and latency. |
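TTFT is typically measured client-side: start a timer when the request is sent and stop it when the first streamed token arrives. A minimal sketch, assuming the deployment exposes a token iterator (the `fake_stream` generator below is a hypothetical stand-in for a real streaming response, not part of the platform):

```python
import time

def measure_ttft(stream):
    """Return seconds elapsed until the first token arrives, or None if empty."""
    start = time.monotonic()
    for _token in stream:
        # The first yielded token ends the measurement.
        return time.monotonic() - start
    return None  # stream produced no tokens

# Hypothetical stand-in for a streaming deployment response.
def fake_stream(delay=0.05, tokens=("Hello", " world")):
    for t in tokens:
        time.sleep(delay)  # simulate per-token generation latency
        yield t

ttft = measure_ttft(fake_stream())
```

The same pattern works against any OpenAI-compatible streaming endpoint by iterating over the response chunks instead of `fake_stream`.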