Optimize an Agent End to End with Catalyst Tracing and HALO

Most agents have problems you can’t see from the outside: silent failures that still report OK, tool calls that fire twice, a consistently slow call quietly burning latency and spend. They’re hard to catch, and with most tracing platforms you’re left parsing spans yourself. Feeding your traces to an LLM doesn’t always fix it either. Traces carry so much context that a general-purpose model can’t reliably find the patterns across runs that cause your agent to underperform.

That’s why we built HALO (Hierarchical Agent Loop Optimizer): a novel way to inspect traces with an RLM (Recursive Language Model, covered in Step 7), tuned for context-ingestion efficiency so you get as much signal as possible out of every run.

This guide walks the full loop: trace, measure, analyze, fix, repeat. You’ll install Catalyst tracing, record real traces, explore them in the dashboard, run HALO to find what to improve, and apply the fixes it hands back. We follow a real example the whole way through: our own internal GTM agent, which handles significant day-to-day traffic.

Install tracing once, then a repeating loop: capture traces, run HALO, apply fixes, and back to capturing traces

Prefer to watch? The video below walks through the same loop end to end, so use whichever you like (or both).

Before you start

You need:

A free Inference account.
An app or agent that makes LLM calls. Any provider or framework works (OpenAI, Anthropic, Gemini, LangChain, LangGraph, the Vercel AI SDK, OpenAI Agents, and more).
An API key from the dashboard.

Step 1: Install tracing

Tracing captures the full execution of your agent: every LLM call, tool call, framework step, and any custom spans you wrap. HALO reads those traces, so the richer your traces, the better the analysis. The fastest path is to let the Inference CLI drive a coding agent that wires the SDK in for you.

Install the CLI and sign in

npm install -g @inference/cli && inf auth login

Your browser opens to authenticate.

Run instrumentation in your project

From your project root:

cd /path/to/your/project && inf instrument --mode tracing

The CLI scans your codebase for LLM clients and agent frameworks, installs the Catalyst tracing SDK, wires setup() into your entrypoint, adds stable service and agent identity, and shows you every change before applying it.

Prefer to wire it in yourself? Install the Catalyst tracing SDK directly — @inference/tracing for TypeScript or inference-catalyst-tracing for Python:

bun add @inference/tracing

npm install @inference/tracing

pnpm add @inference/tracing

yarn add @inference/tracing

pip install inference-catalyst-tracing

The Tracing Quickstart guide has the full manual setup for TypeScript and Python, plus how to wrap an agent span so your runs group cleanly in the dashboard.

Step 2: Record traces

Run your app the way you normally would. Traces stream to Catalyst as your code executes. The goal at this stage is volume and variety: the more real runs you capture, the more signal HALO has to work with. Exercise the paths you actually care about, including the ones that go wrong. Errors, retries, and slow paths are exactly what HALO is looking for.

You can still experiment with HALO using development data, but the full value comes from analyzing traces of production-level traffic.

Confirm traces are arriving before moving on. You can check from the CLI:

inf trace list --range 1h

Or open the Traces tab in the dashboard. If everything is set up correctly, you’ll see your traces listed in the table.

The Traces tab in the Inference dashboard showing a live list of captured traces with columns for time, root span, kind, service, environment, and agent

If you see rows here, tracing is live and you’re ready to start digging in.

Step 3: Explore a trace

Click any trace and a detail sheet slides open. This is where the real picture of a run lives, and it has a few tabs worth knowing.

The trace tree

The first tab shows the top-level trace and every span nested underneath it. Click into any span to expand it. For each one you get:

Inputs and outputs, the exact payload that went in and came back out.
Span attributes, including cost and input/output token counts per span, so you can see precisely which step spent what.
Raw JSON, the unmodified span data, which you can also download for offline use or to feed into your own tooling.

This is the view you reach for when you want to know exactly what happened on a single run, step by step.
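As a rough sketch of what the downloaded raw JSON makes possible in your own tooling, the snippet below totals cost and tokens for one run and names the most expensive step. The field names (`name`, `attributes`, `cost_usd`, `input_tokens`, `output_tokens`) are assumptions for illustration, not the actual Catalyst export schema, so check them against a real download first.

```python
import json

# Hypothetical export: a list of spans. Field names are assumed, not the real schema.
raw = """
[
  {"name": "plan", "attributes": {"cost_usd": 0.0021, "input_tokens": 850, "output_tokens": 120}},
  {"name": "search_tool", "attributes": {"cost_usd": 0.0, "input_tokens": 0, "output_tokens": 0}},
  {"name": "answer", "attributes": {"cost_usd": 0.0064, "input_tokens": 2300, "output_tokens": 410}}
]
"""


def summarize(spans):
    """Total cost and tokens for a run, plus the name of the most expensive span."""
    total_cost = sum(s["attributes"].get("cost_usd", 0.0) for s in spans)
    total_tokens = sum(
        s["attributes"].get("input_tokens", 0) + s["attributes"].get("output_tokens", 0)
        for s in spans
    )
    priciest = max(spans, key=lambda s: s["attributes"].get("cost_usd", 0.0))
    return {
        "cost_usd": round(total_cost, 6),
        "tokens": total_tokens,
        "priciest": priciest["name"],
    }


summary = summarize(json.loads(raw))
print(summary)
```

In practice you would replace the inline string with the contents of the file you downloaded from the span view.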

The trace detail sheet open on the trace/spans tab, with the span tree on one side and an expanded span showing its inputs, outputs, span attributes including cost and token counts, and the raw JSON and download controls

The timeline

Switch to the timeline tab and the same run is plotted along a timeline. You see the total length of the run, then the duration of each individual span (each agent step, tool call, MCP call, and so on) stacked underneath. This is the fastest way to spot what’s actually slow: a single tool call eating ten seconds jumps right out, where it would be easy to miss in the tree.
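The timeline does this ranking visually. If you want the same answer from exported span data, a minimal sketch looks like the following. The `start_ms`, `end_ms`, and `parent` fields are hypothetical names chosen for the example.

```python
# Hypothetical span records for one run; field names are assumptions.
spans = [
    {"name": "agent.run", "start_ms": 0, "end_ms": 12400, "parent": None},
    {"name": "llm.plan", "start_ms": 50, "end_ms": 1300, "parent": "agent.run"},
    {"name": "tool.crm_lookup", "start_ms": 1350, "end_ms": 11350, "parent": "agent.run"},
    {"name": "llm.answer", "start_ms": 11400, "end_ms": 12350, "parent": "agent.run"},
]


def duration(span):
    """Wall-clock length of a span in milliseconds."""
    return span["end_ms"] - span["start_ms"]


def slowest_children(spans, root="agent.run"):
    """Rank the root's child spans by duration, with each one's share of the total run time."""
    by_name = {s["name"]: s for s in spans}
    total = duration(by_name[root])
    children = [s for s in spans if s["parent"] == root]
    ranked = sorted(children, key=duration, reverse=True)
    return [(s["name"], duration(s), round(duration(s) / total, 2)) for s in ranked]


ranked = slowest_children(spans)
for name, ms, share in ranked:
    print(f"{name}: {ms} ms ({share:.0%} of the run)")
```

Here the single tool call accounts for most of the run, which is exactly the kind of outlier the timeline tab makes obvious.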

The thread

The thread tab is the human-readable view of the whole run. It renders the entire conversation top to bottom, from the first message to the last, the way you’d read a chat. For a chat-style agent that’s the back-and-forth with the user. For a job that kicks off subagents, it’s the data those subagents return and what the main agent did with it. When you just want to read what happened without parsing spans, this is the tab.

Step 4: Filter and search

A list of traces is only useful if you can find the one you want. The Traces tab gives you a deep filter panel down the left side: filter by status, model, service, provider, user, session, numeric ranges like token count and cost, and any custom span attribute you’ve attached.

Search is where it gets powerful. This is not a metadata search over names or IDs. It’s a deep search across the full content inside every span: the messages, the tool inputs, the tool outputs, the errors, and any custom attributes you’ve attached. Type a single word that shows up deep inside a chat conversation, a tool’s JSON payload, or an error string, and it surfaces every trace where that word appears, combined with whatever filters you have selected. Finding the one run where a user said “refund” and the model errored is a single query, not an afternoon of scrolling.
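To make the difference between metadata search and deep content search concrete, here is a small local sketch of the semantics: a term is matched against everything inside a trace’s spans, and filters narrow the result. It only illustrates the idea, not how the dashboard implements search, and the record shape is invented for the example.

```python
import json

# Hypothetical trace records with assumed field names.
traces = [
    {"id": "t1", "status": "error", "model": "gpt-4o-mini",
     "spans": [{"input": "I want a refund for order 9", "output": None,
                "error": "timeout calling billing"}]},
    {"id": "t2", "status": "ok", "model": "gpt-4o-mini",
     "spans": [{"input": "Where is my order?",
                "output": {"tool": "lookup", "note": "refund not needed"}, "error": None}]},
    {"id": "t3", "status": "error", "model": "gpt-4o",
     "spans": [{"input": "Cancel my plan", "output": None, "error": "invalid plan id"}]},
]


def deep_search(traces, term, **filters):
    """Return IDs of traces that pass every filter and contain `term` anywhere in span content."""
    term = term.lower()
    hits = []
    for trace in traces:
        if any(trace.get(key) != value for key, value in filters.items()):
            continue
        # Serialize the full span content so nested payloads and error strings are searched too.
        if term in json.dumps(trace["spans"]).lower():
            hits.append(trace["id"])
    return hits


print(deep_search(traces, "refund"))
print(deep_search(traces, "refund", status="error"))
```

The first call matches the word both in a user message and inside a tool’s JSON output. The second combines the same search with a status filter to isolate the run that errored.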

The Traces tab with a deep search for a keyword and status and model filters applied, showing matching traces with the search term highlighted inside the chat content

Step 5: Group your runs into agents

Everything so far works for plain LLM traces, and that alone is useful. But where this really pays off is with full agent loops. Any run you wrap with a stable agent identity (an agentId, an agentName, and a few optional fields) gets promoted out of the raw trace list and rolled up under the Agents tab. Instead of scattered individual calls, you get one workspace per agent: its metrics, its sessions, its traces, and its HALO analysis, all grouped together.

If you instrumented with the CLI or an AI coding agent, it likely set an agentId for you. It’s worth a quick look to make sure that ID is the stable, readable value you actually want, because that’s the key everything groups on. Depending on your framework, you may need to tweak where the wrapper goes.

Here’s what setting it looks like with the OpenAI Agents SDK, first in TypeScript and then in Python. The same shape applies to any framework:

import { agentSpan, setup } from "@inference/tracing";
import * as agents from "@openai/agents";
import { Agent, run } from "@openai/agents";
import OpenAI from "openai";

const tracing = await setup({
  modules: { openai: OpenAI, openaiAgents: agents },
});

const supportAgent = new Agent({
  name: "SupportAgent",
  instructions: "Help customers with order questions.",
  model: "gpt-4o-mini",
});

// Placeholder for illustration: swap in your real chat or thread ID.
const conversationId = "conversation-123";

await agentSpan(
  {
    agentId: "support-agent",      // Stable ID everything groups on. Keep it constant across deploys and renames.
    agentName: "Support Agent",    // Human-readable label shown in the Agents dashboard.
    spanName: "support-agent.run", // Name of the top-level span for this run.
    sessionId: conversationId,     // Ties multi-turn runs into one conversation. Think of a chat ID or a Slack thread ID.
    userId: "user_8675309",        // Optional. Lets you filter every trace down to a single user.
    role: "support",               // Optional. Useful when one workflow has several agents (triage, refunds, billing).
    system: "openai",              // The framework or provider powering the run.
  },
  async (span) => {
    const input = "Where is order ABC-123?";
    span.setInput(input);
    const result = await run(supportAgent, input);
    span.setOutput(String(result.finalOutput ?? ""));
  },
);

await tracing.shutdown();

import asyncio

from agents import Agent, Runner
from inference_catalyst_tracing import agent_span, setup

tracing = setup()

support_agent = Agent(
    name="SupportAgent",
    instructions="Help customers with order questions.",
    model="gpt-4o-mini",
)

# Placeholder for illustration: swap in your real chat or thread ID.
conversation_id = "conversation-123"


async def main() -> None:
    with agent_span(
        tracing.tracer,
        agent_id="support-agent",        # Stable ID everything groups on. Keep it constant across deploys and renames.
        agent_name="Support Agent",      # Human-readable label shown in the Agents dashboard.
        span_name="support-agent.run",   # Name of the top-level span for this run.
        session_id=conversation_id,      # Ties multi-turn runs into one conversation. Think of a chat ID or a Slack thread ID.
        user_id="user_8675309",          # Optional. Lets you filter every trace down to a single user.
        role="support",                  # Optional. Useful when one workflow has several agents (triage, refunds, billing).
        system="openai",                 # The framework or provider powering the run.
    ) as span:
        user_message = "Where is order ABC-123?"
        span.set_input(user_message)
        # Runner.run is a coroutine, so it must be awaited inside an async function.
        result = await Runner.run(support_agent, input=user_message)
        span.set_output(str(result.final_output or ""))


asyncio.run(main())
tracing.shutdown()

Agents group on agentId. Set it once on the top-level span and every run for that agent rolls up together. If your runs aren’t grouping the way you expect, the ID is almost always the thing to check. See Agent identity.
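A quick way to see why the ID has to stay constant: grouping is a plain equality match on that one value, so any variation splits the agent into separate workspaces. The sketch below uses invented run records to show the effect. It is not how the dashboard is implemented.

```python
from collections import Counter

# Hypothetical top-level spans. One deploy accidentally baked a version into the ID.
runs = [
    {"trace_id": "a1", "agent_id": "support-agent"},
    {"trace_id": "a2", "agent_id": "support-agent"},
    {"trace_id": "a3", "agent_id": "support-agent-v2"},
    {"trace_id": "a4", "agent_id": "support-agent"},
]

# Grouping on the raw ID yields two "agents" even though all four runs are the same agent.
groups = Counter(run["agent_id"] for run in runs)
for agent_id, count in groups.items():
    print(f"{agent_id}: {count} run(s)")
```

If you want to track versions, keep them out of the ID itself so history stays in one place.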

Step 6: Explore your agent

Open the Agents tab and you get a list of every agent you’ve traced, grouped by agentId. Each one is a card with its name, some high-level info, and a small graph of its usage, so you can see all your agents and how busy they’ve been at a glance.

The Agents tab showing the list of traced agents grouped by agentId, where each card has the agent name, high-level info, and a usage graph

Click into an agent and you land in its workspace. This is where most teams spend their time. It has a handful of sub-tabs.

Overview

The overview is your agent’s home page. It rolls up run counts, error rate, latency, token usage, and cost, all scoped to this one agent, and charts them over time so you can spot a spike in errors or watch cost creep up after a change. It’s the first place to look when you want to know how the agent has been behaving lately.

Want to track more than the built-in metrics? Signals turn your traces into structured labels you define in plain language — “is this NSFW?”, “did the user get frustrated?”, “was the task completed?” An LLM judge labels matching spans automatically, so the result charts right here alongside your other metrics and becomes something you can filter and break down by. See Measure Your Agent’s Quality with Signals for the full walkthrough.

Sessions

A session is the whole back-and-forth rolled up into a single view. Think of it as a conversation. You set the sessionId to something like a chat ID or a Slack thread ID, and every trace tagged with that ID becomes part of the same session. The sessionId should be unique and persistent for a single instance of a conversation that has a clear start and end. The Sessions sub-tab lists one row per session with high-level metrics, and you can drill into any session to see each trace and span inside it, with the exact same trace detail (tree, timeline, thread) you saw back in Step 3.
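One possible way to build a session ID that is unique and persistent per conversation is to derive it from identifiers your chat platform already gives you. The sketch below assumes a Slack-style channel ID and thread timestamp. The `slack_session_id` helper and the record fields are hypothetical, not part of the tracing SDK.

```python
from collections import defaultdict


def slack_session_id(channel_id: str, thread_ts: str) -> str:
    """Same value for every message in a thread, different for every other thread."""
    return f"slack:{channel_id}:{thread_ts}"


# Hypothetical traces from two threads in the same channel.
traces = [
    {"trace_id": "t1", "channel": "C01", "thread_ts": "1718000000.000100"},
    {"trace_id": "t2", "channel": "C01", "thread_ts": "1718000000.000100"},
    {"trace_id": "t3", "channel": "C01", "thread_ts": "1718003600.000200"},
]

# Roll traces up by session ID, the way a session view groups them.
sessions = defaultdict(list)
for trace in traces:
    sessions[slack_session_id(trace["channel"], trace["thread_ts"])].append(trace["trace_id"])

for session_id, trace_ids in sessions.items():
    print(session_id, trace_ids)
```

You would pass the resulting string as the sessionId (or session_id) on the agent span from Step 5.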

Traces

The Traces sub-tab is the same trace view from Steps 3 and 4, just pre-filtered to this agent. All the same deep search and filtering applies, scoped to the runs that belong here.

Analysis

The Analysis sub-tab is the HALO workspace, and it’s where the next step happens.

Step 7: Run HALO

Most observability tools stop at charts and leave the digging to you. HALO does the digging, with an architecture built specifically for this problem. HALO is an open-source agent-loop optimizer hosted right inside the Agents dashboard, and the key detail is that it’s an RLM rather than a regular general-purpose model.

An RLM (Recursive Language Model) swaps the usual llm.completion(prompt) for an rlm.completion(prompt). Instead of cramming every trace into one context window, it holds your traces as a variable in a code environment and lets the model programmatically examine, decompose, and recursively call sub-models over them.

That distinction matters here. Traces are enormous, and a general-purpose model run over them either blows past its context window or overfits to the error in a single trace instead of generalizing to the systemic, harness-level problem behind it. Because an RLM’s context is effectively unbounded, HALO can surface trends and recurring patterns across thousands of runs over time while still catching the granular one-off failure buried deep in a single span. As far as we know, no other trace-analysis tool is built this way.

From there the flow is straightforward: HALO decomposes your traces, identifies the systemic failure modes, and writes up concrete fixes with citations back to the exact traces each finding came from. This guide uses the hosted version, which runs the same engine against the traces you’ve already collected with no extra setup. You can also self-host it and point it at an exported trace file.

Open the Analysis sub-tab. Your previous HALO runs and chats live in a list down the left. On the right is a prompt window you can type anything into. We give you a sensible default prompt that works reasonably well out of the box, but you’ll get sharper results the more specific you are. You can also set the time range HALO should analyze, plus advanced options like span limit, max depth, and max turns.
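To make the recursive idea behind an RLM more concrete, here is a toy sketch of the pattern described above: the traces live in a variable, the data is split until each piece is small, a sub-call handles each piece, and the partial findings are merged. This is not HALO’s implementation. The `sub_model` function is a plain Python stand-in for a model call, and `chunk_size` and `max_depth` are illustrative parameters only.

```python
from collections import Counter


def sub_model(chunk):
    """Stand-in for a model call: report the error strings seen in one small chunk of traces."""
    return Counter(t["error"] for t in chunk if t["error"])


def analyze(traces, chunk_size=2, depth=0, max_depth=3):
    """Recursively split the traces until a chunk is small enough, then merge partial findings."""
    if len(traces) <= chunk_size or depth >= max_depth:
        return sub_model(traces)
    mid = len(traces) // 2
    left = analyze(traces[:mid], chunk_size, depth + 1, max_depth)
    right = analyze(traces[mid:], chunk_size, depth + 1, max_depth)
    return left + right  # Counter addition merges counts from both halves


# Hypothetical traces: every third run hits the same tool error.
traces = [
    {"id": i, "error": "crm_lookup: invalid email" if i % 3 == 0 else None}
    for i in range(10)
]

findings = analyze(traces)
print(findings.most_common(1))
```

No single call ever sees all ten traces, yet the merged result still shows that one error recurs across runs. That is the property that lets this style of analysis scale past a single context window.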

The Analysis sub-tab, with the list of previous HALO runs and chats on the left and the prompt window on the right, showing the time-range picker and the advanced options for span limit, max depth, and max turns

A few tips for the prompt:

The default is a good general starting point. Tighten it to your actual question for a sharper answer: “Why is the enrichment agent timing out?”, “Which tool calls return empty results most often?”, “Find redundant LLM calls in the planning loop.”
Tighter time windows give HALO more focused signal. A single problem agent over the last 24 hours beats a firehose of everything from the last month.

Put HALO on a schedule

For agents already in production, you don’t want to remember to run HALO by hand. Open the schedule create sheet and set it on a repeating schedule (hourly, daily, weekly) so it reviews recent traces automatically and you read reports as they land. You can set up as many schedules as you want, each with its own prompt, so different schedules analyze your traces from different perspectives. One could watch for cost regressions daily while another hunts for reliability issues weekly.

The HALO schedule create sheet, showing the schedule cadence, the prompt field, and the time range, with the ability to create multiple schedules each with its own prompt

Step 8: Read the report and keep chatting

HALO works through the traces in the window and produces a report. It’s ranked by impact and it goes deep, so expect it to scroll well past the fold. On our own GTM agent, this exact run surfaced things we’d have spent days finding by hand, and in some cases never found at all, because the spans looked healthy:

Tools being called with invalid inputs.
Queries that were failing without bubbling up, so the span still showed an ok status while the work underneath had actually broken.
Duplicate tool calls that repeated the same work for no benefit.
A ranked list of recommended fixes for each finding.
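For a sense of what two of these patterns look like in span data, the sketch below flags duplicate tool calls and ok-status spans whose output still carries an error. It is a hand-written illustration over invented records with assumed field names, not the logic HALO runs.

```python
import json
from collections import Counter

# Hypothetical tool-call spans; field names are assumptions.
spans = [
    {"trace_id": "t1", "tool": "crm_lookup", "input": {"email": "a@x.com"},
     "status": "ok", "output": {"rows": 3}},
    {"trace_id": "t1", "tool": "crm_lookup", "input": {"email": "a@x.com"},
     "status": "ok", "output": {"rows": 3}},
    {"trace_id": "t1", "tool": "run_query", "input": {"sql": "select 1"},
     "status": "ok", "output": {"error": "syntax error"}},
    {"trace_id": "t2", "tool": "run_query", "input": {"sql": "select 2"},
     "status": "ok", "output": {"rows": 1}},
]


def duplicate_calls(spans):
    """Tool calls repeated with identical input inside the same trace."""
    counts = Counter(
        (s["trace_id"], s["tool"], json.dumps(s["input"], sort_keys=True)) for s in spans
    )
    return [(trace_id, tool) for (trace_id, tool, _), n in counts.items() if n > 1]


def silent_failures(spans):
    """Spans that report ok while their output contains an error payload."""
    return [
        (s["trace_id"], s["tool"])
        for s in spans
        if s["status"] == "ok" and isinstance(s["output"], dict) and "error" in s["output"]
    ]


print("duplicates:", duplicate_calls(spans))
print("silent failures:", silent_failures(spans))
```

Checks like these are easy to write once you know what to look for. The hard part, which the report handles, is discovering which patterns matter across many runs.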

Every finding cites the specific trace IDs it came from, so you can click straight from a finding into the trace tree that produced it and confirm it matches what you’re seeing before you act. We shipped fixes for the top findings the same afternoon: the silent query failures now surface as real errors instead of hiding behind an ok status, and the redundant tool calls are gone. That’s the whole point of the loop. It turns invisible problems into a short, ranked to-do list.

A completed HALO report showing ranked findings like invalid tool inputs, silent query failures on ok-status spans, and duplicate tool calls, along with a list of recommended fixes

Two things make the report genuinely actionable:

Hand it to a coding agent. Copy a finding and its recommended fix straight into your coding agent and let it make the edit. The fixes are usually specific enough to apply directly: tightening a prompt, adding a guardrail, removing a redundant call, adjusting a tool description.
Keep chatting with HALO. The report isn’t the end of the conversation. Once it lands you can ask follow-ups in the same thread: “give me more concise ways to fix these,” “what are the highest priorities in your opinion?”, or “which tool calls are the most expensive, and what alternatives would get the same result?” If you’re not getting the results you want, adjust your prompt and run it again.

Step 9: Apply the fixes

A finding is only useful once it’s in your code, and this is the step where the whole loop pays off. There are three ways to get a fix from a HALO report into your codebase, from most automated to most manual. Pick whichever fits how you work.

Let your coding agent apply them (the MCP)

Connect the Inference MCP server to your coding agent (Claude Code, Cursor, or any MCP client) and just ask it to pull the report and apply the fixes:

Apply the latest fixes from the HALO report for the Gator Flue Agent

A coding agent connected to the Inference MCP server pulling the latest HALO report for the gator flow agent and editing the prompts, tool definitions, and harness code directly in the repo

Just ask in natural language and your assistant resolves the agent name to its identity, finds the HALO reports, and edits your prompts, tool definitions, and harness logic directly, grounded in the same findings and trace citations you read in the dashboard. That’s the moment the loop closes itself. The same traces you’ve been sending all along get analyzed, turned into a ranked to-do list, and applied to your harness from a single line of natural language. See the MCP server guide for setup and more example prompts.

From the CLI

Prefer the terminal? The inf halo commands pull the same report from the command line. Read it with inf halo conversation get <conversation-id> and pipe it into whatever coding agent you use, or read it inline.

Copy it yourself

The manual path always works, and we mentioned it back in Step 8: copy the recommendation (and any suggested change) straight from the report into your coding agent, or make the edit by hand.

Copy the recommendation from the report and, if you use a coding agent, paste it in.
Make the edit in your prompt, tool definition, or harness logic.
Save and move on to the next finding.

Step 10: Close the loop

A fix isn’t done until you’ve confirmed it worked.

Ship the change and let your app run.
Capture a fresh window of traces.
Run HALO again over the new window (or wait for the next scheduled run).
Confirm the finding is gone, and pick up the next one.
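If you want a number to back up the last step, one simple approach is to measure how often a finding’s pattern appears before and after the fix. The sketch below does that for silent failures, using invented span records and an assumed record shape. Treat it as a sanity check alongside the next HALO report, not a replacement for it.

```python
def finding_rate(spans, matches):
    """Fraction of spans in a window that still match a finding's pattern."""
    return sum(1 for s in spans if matches(s)) / len(spans) if spans else 0.0


def is_silent_failure(span):
    # Assumed shape: an ok status whose output still carries an error payload.
    return span["status"] == "ok" and "error" in span["output"]


# Hypothetical windows of spans captured before and after shipping the fix.
before = [
    {"status": "ok", "output": {"error": "syntax error"}},
    {"status": "ok", "output": {"rows": 2}},
    {"status": "ok", "output": {"error": "timeout"}},
    {"status": "ok", "output": {"rows": 5}},
]
after = [
    {"status": "error", "output": {"error": "syntax error"}},  # now surfaces as a real error
    {"status": "ok", "output": {"rows": 2}},
    {"status": "ok", "output": {"rows": 7}},
    {"status": "ok", "output": {"rows": 5}},
]

before_rate = finding_rate(before, is_silent_failure)
after_rate = finding_rate(after, is_silent_failure)
print(f"silent failures: {before_rate:.0%} -> {after_rate:.0%}")
```

Note that the failing query in the second window still fails. What changed is that it now reports an error status, which is the outcome the fix was meant to produce.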

That’s the HALO loop. Each pass tightens your agent: fewer errors, lower latency, less wasted spend. The teams that get the most out of HALO run this continuously rather than once.

What’s coming

The loop above works today. And because all of it runs off the traces you’re already sending, the features we’re building next light up on that same data with no new instrumentation. Install tracing today and these turn on automatically as they ship, against the history you’ve already collected:

Connect your GitHub repo. Link the repo behind your agent so HALO can read your actual code as additional context, then ground its suggestions in your real prompts, tool definitions, and harness logic instead of inferring them from traces alone.

All of this builds on the traces you’re already collecting. The earlier you instrument, the more history HALO has to work with the day these land.

Where to go next

That’s the whole loop. If you haven’t instrumented yet, create a free account and run inf instrument --mode tracing in your project. Your first useful HALO report is about 20 minutes away.

Your traces are yours. We don’t train on them, and we never use your data to improve our own models. What we do give you is the option to put that data to work for you: when you’re ready, you can train your own custom model on the runs you’ve already captured, entirely within our platform and under your control. A small model fine-tuned on your agent’s real traffic is often faster and cheaper than a general-purpose one at the same task, and instrumenting today means that dataset is sitting ready whenever you want it. That’s the second half of the loop: optimize the harness with HALO, and optimize the model with your own traces.

Try it on a demo repo first

Run this whole loop on a ready-made, pre-instrumented agent before pointing it at your own.

Run inference through the gateway

One API for every model, with usage, latency, and cost traced automatically.

Train a custom model

Turn the traces you’ve captured into a fine-tuned model that’s faster and cheaper at your task.

HALO on GitHub

The open-source HALO engine, methodology, and benchmarks. MIT licensed.

Tracing integrations

Every framework and provider Catalyst tracing supports, with setup for each.
