
Building Evals for Deep Agents

Imagine AI agents that don't just pass tests—they master real-world chaos. Deep Agents shows us how evals aren't checkboxes; they're the chisel carving tomorrow's intelligence.

AI agent traces visualized as evolving neural pathways under eval pressure

Key Takeaways

  • Targeted evals trump eval volume—focus on production behaviors for true agent smarts.
  • Dogfooding traces and artisanal tests create strong, self-documenting eval suites.
  • This mirrors software testing evolution, predicting reliable agents as the new platform.

What if your AI agent’s brilliance isn’t magic—it’s a relentless grind of targeted tests, each one nudging it closer to human-like smarts?

Enter Deep Agents evals: the unsung heroes powering this open-source, model-agnostic harness behind beasts like Fleet and Open SWE. Picture evals as invisible vectors—each success or failure twisting the agent’s behavior, like wind sculpting dunes over eons. Miss an efficient file read? Tweak the prompt, refine the tool description, and watch the behavior shift. It’s evolution, accelerated.

But here’s the trap.

Blindly piling on hundreds—thousands—of evals? That’s the illusion of progress. You game the suite, score high, then watch your agent flop in production.

More evals ≠ better agents. Instead, build targeted evals that reflect desired behaviors in production.

Deep Agents nails this. They catalog production must-haves: multi-file retrieval, chaining 5+ tool calls smoothly. No aggregate benchmarks. Just laser-focused curation.
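What might a "laser-focused" eval look like in practice? A minimal sketch, assuming a hypothetical `EvalCase` shape (the class name, fields, and trace format are all illustrative, not the actual Deep Agents harness):

```python
from dataclasses import dataclass

# Hypothetical sketch of a targeted eval case: each case encodes one
# production behavior (e.g. "chain 5+ tool calls"), not an aggregate score.
@dataclass
class EvalCase:
    name: str
    prompt: str
    tags: list[str]          # taxonomy labels, e.g. ["tool_use"]
    min_tool_calls: int = 0  # the production behavior we expect

    def check(self, trace: dict) -> bool:
        """Pass iff the agent's trace exhibits the targeted behavior."""
        return len(trace.get("tool_calls", [])) >= self.min_tool_calls

case = EvalCase(
    name="multi_tool_chain",
    prompt="Refactor utils.py and update its tests.",
    tags=["tool_use", "file_operations"],
    min_tool_calls=5,
)
print(case.check({"tool_calls": ["read"] * 6}))  # True: 6 calls >= 5
```

The point is the shape: one behavior, one check, explicit tags—so a failure points at exactly one thing to fix.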

How Do Deep Agents Curate Evals That Stick?

Dogfooding first. Teams live with these agents, traces spilling failures like confetti. Open SWE, their background coding warrior, hits diverse codebases—conventions clashing, contexts shifting. Every glitch? Traced to LangSmith, dissected by Polly or Insights agents. Boom: new eval born, regression-proofing the fix.

External pulls next. Snag gems from Terminal Bench 2.0 or BFCL, tweak for the agent. Harbor sandboxes coding evals safely.

Artisanal ones? Hand-crafted unit tests for read_file precision or state-tracking wizardry. Diversity rules: single-step zaps to 10-turn marathons.

They tag ruthlessly—file_operations, retrieval, tool_use. A taxonomy for the win.

Category        | What It Tests
----------------|---------------------------------------------------------------------
file_operations | File tools (read/write/edit/ls/grep/glob), parallel calls, pagination
retrieval       | Multi-file info hunts, search smarts, multi-hop synthesis
tool_use        | Tool picks, chaining, state across turns
memory          | Context recall, preference extraction, durable persistence
conversation    | Clarifying vagueness, multi-turn action fidelity
summarization   | Overflow handling, compaction recovery
unit_tests      | Prompt passthrough, interrupts, subagent routing
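Tagging like this makes sliced scoring trivial. A minimal sketch (the suite contents here are invented for illustration):

```python
# Hypothetical sketch: slicing an eval suite by taxonomy tag so each
# category reports its own pass rate instead of one aggregate number.
suite = [
    {"name": "read_paginated", "tags": ["file_operations"], "passed": True},
    {"name": "grep_parallel",  "tags": ["file_operations"], "passed": False},
    {"name": "multi_hop",      "tags": ["retrieval"],       "passed": True},
]

def pass_rate(suite: list[dict], tag: str) -> float:
    """Pass rate over only the cases carrying the given tag."""
    sliced = [c for c in suite if tag in c["tags"]]
    return sum(c["passed"] for c in sliced) / len(sliced)

print(pass_rate(suite, "file_operations"))  # 0.5
print(pass_rate(suite, "retrieval"))        # 1.0
```

One flaky file operation no longer hides inside a healthy overall score.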

Groups like these? No single-score nonsense. Real per-category insight into where the agent actually struggles.

And traces? Shared LangSmith project turns the team into eval guardians. Spot failure modes, fix, rerun—cheaply, since targeted means no eval bloat.
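The trace-to-eval loop can be sketched as a small transform—a hedged illustration only (field names like `failure_mode` are invented, not the LangSmith trace schema):

```python
# Hypothetical sketch: promoting a failed production trace into a
# regression eval case. Field names are illustrative.
def trace_to_eval(trace: dict) -> dict:
    """Turn one observed failure into a permanent, tagged regression test."""
    return {
        "name": f"regression_{trace['id']}",
        "prompt": trace["input"],
        "tags": trace.get("tags", []) + ["regression"],
        "expected_behavior": trace["failure_mode"],
    }

failed = {
    "id": "run_42",
    "input": "Find every caller of parse_config across the repo.",
    "tags": ["retrieval"],
    "failure_mode": "stopped after searching a single file",
}
case = trace_to_eval(failed)
print(case["name"])  # regression_run_42
print(case["tags"])  # ['retrieval', 'regression']
```

Every spotted failure mode becomes a standing guard against its own recurrence.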

My hot take: this mirrors software’s unit-testing revolution in the ’90s. Back then, devs ditched big-bang integration for atomic tests—birth of agile empires. Deep Agents evals? Same leap for agents. Bold prediction: in two years, they’ll make agents as debuggable as CRUD apps, birthing an agent economy rivaling apps today.

Why Does Dogfooding Turn Traces into Gold?

Traces aren’t logs—they’re behavior X-rays.

Scale hits? Polly dives in, spotting patterns humans miss. Claude Code or Deep CLI? Same play, LangSmith CLI pulling traces.

Open SWE example: bug-fix PRs exploding from traced flubs. Production diversity breeds strong evals—far beyond synthetic benchmarks.

It’s shared ownership. Anyone jumps traces, proposes nudges. Cost? Slashed by ditching fluff evals.

But hype alert—companies love bragging benchmark wins. Deep Agents skips that, chasing production fidelity. Smart.

What Metrics Make Deep Agents Tick?

The original post spills into model selection, but the ethos shines: verifiable, self-documenting evals with docstrings explaining the “why.” Categories enable sliced runs—file ops passing at 95%? Greenlight.
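A release gate built on those sliced runs might look like this minimal sketch (thresholds and results are invented numbers, not Deep Agents policy):

```python
# Hypothetical CI gate: greenlight only when every eval category clears
# its own bar (e.g. file_operations >= 0.95), never a single blended score.
thresholds = {"file_operations": 0.95, "retrieval": 0.90}
results    = {"file_operations": 0.96, "retrieval": 0.88}

def greenlight(results: dict, thresholds: dict) -> tuple[bool, list[str]]:
    """Return (ok, failing_categories) given per-category pass rates."""
    failing = [cat for cat, bar in thresholds.items()
               if results.get(cat, 0.0) < bar]
    return (len(failing) == 0, failing)

ok, failing = greenlight(results, thresholds)
print(ok)       # False
print(failing)  # ['retrieval']
```

A blended average would have passed here; per-category gating catches the weak retrieval slice.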

End-to-end only. No mock stubs. Real agent runs, user-simulated where needed.

Pressure mounts over time. Each passing eval? Cumulative nudge toward prod perfection.

Think agents as living software—evals the compiler feedback loop. We’re witnessing platform shift: agents atop LLMs, evals the OS kernel.

Energy here? Electric. Wonder at traces birthing intelligence, categories mapping agent minds—it’s poetry in code.

Review loops seal it. Output traces dissected, coverage gaps plugged. Team-wide, always.

Is This the Future of Agent Reliability?

Absolutely.

Targeted evals dodge the “eval hell” of overtesting irrelevant behaviors. Production-mirrored, they’re the North Star.

Historical parallel: like Darwin’s finches, evals adapt agents to niches—file jungles, tool chains, memory mazes.

Critique? Even they warn against eval addiction. Wise.

Deep Agents open-sources this—your move, builders.



Frequently Asked Questions

What are Deep Agents evals?

Targeted tests shaping agent behavior, from dogfooding traces to artisanal units, tagged for categories like tool_use.

How to build evals for AI agents?

Catalog prod behaviors, source from traces/benchmarks/hand-craft, add docstrings/tags, trace to shared hub for reviews.

Why targeted evals over massive benchmarks?

Avoids illusion of progress; focuses on verifiable prod skills, saves costs, drives real improvements via traces.

Written by Marcus Rivera

Tech journalist covering AI business and enterprise adoption. 10 years in B2B media.



Originally reported by LangChain Blog
