## How it works
- Define a rubric - describe what “good” looks like in plain English
- Pick a dataset - samples from captured traffic or uploaded JSONL
- Select models - the candidates you want to compare
- Run the eval - each sample goes through each model, and an LLM judge scores every output
- Compare results - side-by-side scores show which model wins
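The loop above can be sketched in a few lines. This is a minimal illustration, not this product's actual API: `call_model` and `judge` are hypothetical stand-ins for the model client and the LLM judge.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Result:
    model: str
    sample_id: int
    score: float  # the judge's numeric rubric score

def run_eval(rubric, samples, models, call_model, judge):
    """Run every sample through every model; the judge scores each
    output against the rubric."""
    results = []
    for model in models:
        for i, sample in enumerate(samples):
            output = call_model(model, sample["input"])
            results.append(Result(model, i, judge(rubric, output)))
    return results

def mean_scores(results):
    """Average score per model, for the side-by-side comparison."""
    by_model = {}
    for r in results:
        by_model.setdefault(r.model, []).append(r.score)
    return {m: mean(scores) for m, scores in by_model.items()}
```

The key property is the cross product: every sample is scored for every candidate model, so the per-model averages are directly comparable.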
📍 TODO:MEDIA
Screenshot of the eval page showing the setup flow: rubric, dataset, model selection.
## Key concepts
| Concept | Description |
|---|---|
| LLM-as-a-judge | A capable LLM reads your rubric, examines the model output, and returns a scored judgment. |
| Rubric | A plain English description of a quality dimension, scored numerically. Defines what “good” means for your use case. |
| Eval dataset | A stable, curated set of challenging examples that acts as your benchmark. Pick the hard cases. |
| Offline vs online | Offline evals run against collected samples. Online evals score live traffic as it flows through. Offline is available today; online is coming soon. |
| Train-eval splits | Training and eval data must never overlap. If a model trains on eval examples, the eval becomes meaningless. See the zero-overlap rule. |
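The zero-overlap rule can be checked mechanically before an eval run. A minimal sketch, assuming samples are JSON-serializable dicts (the function names here are illustrative, not part of the product):

```python
import hashlib
import json

def fingerprint(sample: dict) -> str:
    """Stable hash of a sample; sort_keys catches duplicates even
    when dict key order differs."""
    canonical = json.dumps(sample, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def assert_zero_overlap(train_set, eval_set):
    """Raise if any eval example also appears in the training data."""
    train_fps = {fingerprint(s) for s in train_set}
    leaked = [s for s in eval_set if fingerprint(s) in train_fps]
    if leaked:
        raise ValueError(
            f"{len(leaked)} eval example(s) also present in training data"
        )
```

Exact-match hashing only catches verbatim duplicates; near-duplicates (paraphrases of the same example) still need manual review when curating the eval set.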
## Next steps
- Writing rubrics - create rubrics from templates, AI generation, or plain English
- Run a model comparison - compare models head to head on your data
- How LLM-as-a-Judge works - understand the evaluation mechanism
- Read the results - interpret scores and make decisions