
An eval measures which model is better for your task, and by how much. You define a rubric that describes what “good” looks like, run your data through candidate models, and let an LLM judge score the outputs. This is how you know whether a smaller, cheaper model can replace the one you’re using today. This guide uses the Customer Support Chatbot demo project, which comes pre-loaded with a dataset and rubric so you can run an eval immediately — no data required. Once you’ve seen how it works, you can apply the same process to your own data.
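
Conceptually, an eval is just a loop: send each dataset sample to every candidate model, then have a judge model grade each response against the rubric and compare the averages. The sketch below illustrates that loop rather than the dashboard's actual implementation; it assumes an OpenAI-compatible chat client, and the model IDs, judge prompt, and sample data are hypothetical placeholders.

```python
# Illustrative sketch of what an eval does conceptually -- not the dashboard's
# implementation. Assumes an OpenAI-compatible chat API; the model IDs, judge
# prompt, and sample data below are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()  # point base_url / api_key at whichever provider you use

candidates = ["large-model", "small-model"]   # hypothetical model IDs
rubric = "Responses must be polite, accurate, and under 100 words."
dataset = [{"prompt": "My order arrived damaged. What do I do?"}]

scores = {model: [] for model in candidates}
for sample in dataset:
    for model in candidates:
        # Send the sample to the candidate model.
        answer = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": sample["prompt"]}],
        ).choices[0].message.content

        # Have an LLM judge grade the answer against the rubric.
        judgment = client.chat.completions.create(
            model="judge-model",  # hypothetical judge model ID
            messages=[{
                "role": "user",
                "content": f"Rubric:\n{rubric}\n\nResponse:\n{answer}\n\n"
                           "Score this response from 1 to 10. Reply with only the number.",
            }],
        ).choices[0].message.content
        scores[model].append(float(judgment.strip()))

# Average score per model gives the side-by-side comparison.
for model, model_scores in scores.items():
    print(model, sum(model_scores) / len(model_scores))
```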

Start the demo project

If you haven’t already, create the demo project:
  1. From the dashboard, navigate to the Learn page (or the Create a Project page).
  2. Find Customer Support Chatbot and click Start with demo project.
[Screenshot: Customer Support Chatbot demo project card with the Start with demo project button]
This creates a new project in your account pre-loaded with everything you need:
| Artifact | Name | Purpose |
| --- | --- | --- |
| Eval dataset | customer-support-eval | Sample customer support conversations to evaluate against |
| Training dataset | customer-support-train | Used later for training a model |
| Rubric | Customer support rubric | Defines what a good customer support response looks like: tone, format, and accuracy criteria |
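
The exact contents of these artifacts live in the dashboard, but the rough shape is worth having in mind. The snippet below is a hypothetical illustration of what one eval-dataset sample and a rubric might contain; it is not the demo project's actual data or schema.

```python
# Hypothetical illustration of the kind of content these artifacts hold.
# The field names and text are made up, not the demo project's schema.

# One sample from an eval dataset: the input sent to each candidate model.
eval_sample = {
    "messages": [
        {"role": "user", "content": "I was charged twice for my subscription. Can you help?"},
    ],
}

# A rubric: plain-English criteria the LLM judge scores responses against.
rubric = """
- Tone: empathetic and professional; never blames the customer.
- Format: short paragraphs that end with a clear next step.
- Accuracy: only promises actions the support team can actually take.
"""
```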

Run an eval

1. Navigate to Evals

Open your Customer Support Chatbot project and go to the Evals tab. Click New Eval.

2. Select the rubric and dataset

The demo project’s rubric and the customer-support-eval dataset are already available in your project. Select them.
[Screenshot: Eval setup form with the rubric and dataset selected]

3. Pick models to compare

Choose two or more models to evaluate. You can pick any combination from the model catalog — OpenAI, Anthropic, open-source, or any other available model. For a quick comparison, try picking a large model and a smaller one to see how they stack up.

4. Run the eval

Click Run. Each sample from the dataset is sent to each model, and an LLM judge scores every response against the rubric.

5. Compare the results

When the eval completes, the comparison view shows side-by-side scores across all models and samples. Look at overall scores to see which model wins, and drill into individual samples to understand where models differ.
[Screenshot: Eval results comparison view showing scores across models]

What you just learned

  • Rubrics define your quality bar in plain English — the LLM judge uses them to score outputs
  • Evals run your data through multiple models and score the results, giving you a data-driven comparison
  • You can re-run evals anytime — after changing the rubric, adding models, or later after training a custom model to see how it compares

Next steps

Train a custom model

Use the same demo project to train and deploy a model.

Write a rubric

Learn how to write your own rubrics for your specific use case.

Read the results

Deep dive on interpreting the comparison view.

Build a dataset

Create datasets from your own data — captured traffic or uploaded files.