The ZyG Blog

Fine-Tune Your Agent, Not the LLM

Shahar Katz, Growth Squad

We fine-tuned two models on the same task with the same base LLM. One was 16% more confident and significantly more consistent. The only difference: whether the agent generated the training data, or a bare LLM did. Even the same LLM is off-policy if it's not the same agent.

At ZyG, we run a media buying agent that manages advertising budgets autonomously. It calls APIs, fetches real campaign metrics, runs analysis sub-agents, and autonomously performs actions on media platforms. ZyG has in-house media buyer experts who review every decision and provide ongoing feedback - approve, decline, or correct.

That feedback is gold for fine-tuning. But turning expert corrections into training data is harder than it sounds, because a tool-using agent is not a general-purpose LLM. It's an LLM operating within a narrow distribution defined by a specific system prompt, a fixed set of tools, and the data those tools return. The base model was never trained on this distribution. Fine-tuning must teach the model to operate as this agent, not to become a better general-purpose model.

The problem with the obvious approach

The standard approach: take the expert's feedback, feed it to a corrector LLM, and have it generate a "corrected" JSON decision. Then fine-tune on that. This is sequence-level knowledge distillation - the same pattern behind Alpaca and Orca. We call it Off-Policy Corrector Distillation (OPCD).

The problem: the corrector is not the agent. Even if it's the same underlying LLM (Gemini 2.5 Pro in our case), it runs without the agent's 2,000-line system prompt, without calling the Meta Ads API, and without seeing the agent's data. Its output is on-policy for the LLM; it's off-policy for the agent. We identified three specific ways this breaks down:

Vocabulary mismatch: Suppose the agent decided to increase the budget because of good overall performance over the past week, and the expert says "scale down because of yesterday's performance." Steered toward that verdict, the agent itself would produce structured JSON with metric citations that agrees with the expert but reasons in its own terms: "performance was good at $110 but yesterday's CPA was $240, which is overwhelmingly above target, so we should scale down." A corrector LLM can't bridge that gap without access to the actual data. It would produce something vague like "yesterday's performance was bad so we scaled down the budget," which is not something you want to train on.

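Information leakage: The expert may know the outcome of a decision (hindsight) or draw on external knowledge the agent can't access at inference time. A correction built on that knowledge pushes the model toward reasoning it cannot reproduce from its own inputs.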

Wrong optimization target: The gap between the expert's ideal and the agent's output has two parts: what the agent can't reach (architecture gap) and what it didn't reach (performance gap). Only the second is actionable. Off-policy correction conflates both.

Oracle injection: let the agent write its own training data

Our method - On-Policy Oracle Injection (OPOI) - is simple: inject the expert's verdict into the agent's prompt as a small steering signal, run the actual agent end-to-end with its real tools, capture the full trace, then clean out the oracle references. The agent produces its own ground truth. The oracle steers the conclusion; the vocabulary, tool calls, and reasoning are the agent's own.
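The inject, run, capture, clean loop described above can be sketched in a few lines. This is a minimal illustration, not our production pipeline: the `<oracle>` delimiters, the function names, and the `run_agent` callable are all hypothetical stand-ins, and real cleaning also has to catch free-text mentions of the expert that a delimiter match would miss.

```python
import re
from typing import Callable

# Hypothetical delimiters; the post does not specify how the oracle is marked.
ORACLE_OPEN = "<oracle>"
ORACLE_CLOSE = "</oracle>"


def inject_oracle(prompt: str, verdict: str) -> str:
    """Append the expert's verdict to the prompt as a small, delimited hint."""
    return f"{prompt}\n{ORACLE_OPEN}Expert verdict: {verdict}{ORACLE_CLOSE}"


def clean_oracle(text: str) -> str:
    """Strip every delimited oracle span so the training example never shows it."""
    pattern = re.escape(ORACLE_OPEN) + r".*?" + re.escape(ORACLE_CLOSE)
    return re.sub(pattern, "", text, flags=re.DOTALL).rstrip()


def build_opoi_example(
    prompt: str, verdict: str, run_agent: Callable[[str], str]
) -> dict:
    """Run the agent on the steered prompt, then pair its trace with the clean prompt."""
    steered = inject_oracle(prompt, verdict)
    # Assumed interface: run_agent executes the full agent (tools, sub-agents)
    # and returns its trace as text.
    trace = run_agent(steered)
    return {"prompt": clean_oracle(steered), "completion": clean_oracle(trace)}
```

The design point is that the completion comes from the agent's own run, so its vocabulary and tool calls stay on-distribution; only the steering hint is removed before training.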

This isn't perfectly on-policy, because the oracle is still an external signal. But it's a minimal intervention: a few lines added to the prompt and cleaned out afterwards. The claim: injecting a small oracle and cleaning it up contaminates the training data less than replacing the agent's entire output with one from a different source.

What the numbers show

We fine-tuned two models from the same base (Gemini 2.5 Pro): one on OPOI data (agent-generated) and one on OPCD data (corrector-generated). Then we evaluated both on identical held-out prompts using likelihood-ratio OOD detection and action accuracy with consensus scoring.

The OOD score is the log-likelihood ratio between the base model and the fine-tuned model:

$$\log S = \log p_{\text{base}}(x) - \log p_{\text{fine-tuned}}(x)$$

When log S is negative, the fine-tuned model is more confident than the base - the input is in-distribution for the fine-tuned model. More negative means more deeply internalised. If fine-tuning on agent-generated data (OPOI) produces more negative scores than fine-tuning on corrector-generated data (OPCD), it means the agent's distribution was learned more strongly.
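As a rough sketch of how this score can be computed, assume each model returns per-token log-probabilities for the same token sequence x. The function name and the `per_token` switch are illustrative assumptions; the post does not say whether the reported ratio is normalised by length, so both variants are shown.

```python
from typing import Sequence


def ood_score(
    base_logprobs: Sequence[float],
    tuned_logprobs: Sequence[float],
    per_token: bool = False,
) -> float:
    """log S = log p_base(x) - log p_fine-tuned(x).

    Each argument holds one model's per-token log-probabilities for the same
    sequence; their sum is that model's log-likelihood of the sequence.
    A negative result means the fine-tuned model is the more confident one.
    """
    if len(base_logprobs) != len(tuned_logprobs):
        raise ValueError("both models must score the same token sequence")
    if not base_logprobs:
        raise ValueError("need at least one token")
    score = sum(base_logprobs) - sum(tuned_logprobs)
    # Assumption: dividing by length gives a per-token version of the score.
    return score / len(base_logprobs) if per_token else score
```

For example, `ood_score([-0.5, -0.7], [-0.3, -0.4])` is negative, because the fine-tuned model assigns the sequence a higher log-likelihood than the base model does.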

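| Metric            | Base  | LLM (OPCD) | LLM (OPOI) | OPOI vs OPCD       |
|-------------------|-------|------------|------------|--------------------|
| Action accuracy   | 69.7% | 71.9%      | 72.7%      | Similar            |
| Consensus         | 0.919 | 0.854      | 0.894      | OPOI wins          |
| OOD avgLogprob    | -0.56 | -0.352     | -0.294     | 16% more confident |
| OOD ratio vs base | n/a   | -0.202     | -0.277     | 37% stronger shift |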

Consensus = Fraction of 3 runs that agree on the same action for a given input. OOD avgLogprob = mean per-token log-probability (higher = more confident). OOD ratio = log-likelihood ratio between base and fine-tuned model (more negative = more in-distribution).
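The consensus definition can be made concrete with a short sketch. It reads "fraction of runs that agree" as the share of runs choosing the most common action, averaged over inputs; that reading, the function names, and the action labels are assumptions for illustration.

```python
from collections import Counter
from typing import Sequence


def consensus(actions: Sequence[str]) -> float:
    """Share of runs that picked the most common action for one input."""
    if not actions:
        raise ValueError("need at least one run")
    _, top_count = Counter(actions).most_common(1)[0]
    return top_count / len(actions)


def mean_consensus(runs_per_input: Sequence[Sequence[str]]) -> float:
    """Average the per-input consensus over an evaluation set."""
    scores = [consensus(actions) for actions in runs_per_input]
    return sum(scores) / len(scores)
```

Under this reading, three runs that all choose the same action score 1.0, two out of three score 2/3, and three different actions score 1/3.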

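Both models learn the right answers: action accuracy rises from 69.7% for the base to 71.9% (OPCD) and 72.7% (OPOI), a gain of roughly 2 to 3 percentage points. But the OPOI model is 16% more confident, has a 37% stronger distribution shift from the base, and shows less consistency degradation: consensus drops from 0.919 to 0.894 for OPOI, against 0.854 for OPCD. The OPCD model flips between actions across runs. It learned what to answer but not why, because those answers came from a foreign distribution.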
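One reading of the table that reproduces the headline percentages is shown below. The exact formulas are not spelled out in the post, so treat this as an assumption that happens to match the reported figures.

```latex
\begin{align*}
\text{confidence gain} &= \frac{0.352 - 0.294}{0.352} = \frac{0.058}{0.352} \approx 0.165 \quad (\approx 16\%) \\
\text{shift ratio} &= \frac{-0.277}{-0.202} \approx 1.37 \quad (\approx 37\%\ \text{stronger}) \\
\text{accuracy gain} &= 71.9 - 69.7 = 2.2, \qquad 72.7 - 69.7 = 3.0 \quad \text{(percentage points)}
\end{align*}
```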

What this means for agent builders

If you're fine-tuning a tool-using agent from expert feedback, the provenance of your training data matters as much as its correctness. Having a corrector LLM generate "the right answer" teaches the model a distribution it will never operate in. Having the agent generate the right answer itself - steered by an oracle, cleaned afterwards - teaches the distribution it actually needs.

The on-policy distillation literature identifies agent-level fine-tuning as an open problem. OPOI is one practical answer: it extends privileged-information self-distillation to settings where the agent can't find the answer on its own and no automated reward function exists. Both methods teach the right answers; only on-policy training teaches the agent's distribution.

Full paper: Shahar Katz, On-Policy Oracle Injection for Fine-Tuning Tool-Using Agents.

