Tracing MCP Server LLM Calls Fully

Imagine debugging an AI agent where 90% of your tool's delay hides in an untraceable LLM call. This fix changes that for MCP servers, handing devs real observability.


Key Takeaways

  • MCP sampling calls now get full spans, exposing the 80%+ of tool latency hidden in delegated LLM calls.
  • Dashboard delivers glanceable metrics: rates, P95s, errors by tool — optimize fast.
  • This mirrors early microservices tracing; poised to standardize before agent swarms hit prod.

Your AI agent stalls on a ‘summarize’ tool. Two seconds vanish. Was it the code? The API? Or — hidden — the LLM the server delegated to?

For devs wiring up agentic workflows, that’s the daily grind with MCP servers. No visibility into sampling calls meant guessing games on performance. Now? Traces light up those ghosts, dashboards surface the truth, and optimization stops being witchcraft.

Why MCP Sampling Broke Your Traces

The MCP spec lets servers — which hold no API keys of their own — bounce LLM work back to clients. Smart, right? Your orchestrator hits ‘summarize’; the server needs GPT-4o to chew the text; it samples the client’s LLM. The response flows back. Clean delegation.
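To make that concrete, here’s a minimal sketch of the delegation path. ctx.mcpReq.requestSampling() is the article’s call; the handler shape, type, and message format are my assumptions:

    // Minimal sketch of the delegation path. ctx.mcpReq.requestSampling()
    // is the article's API; the handler shape, type, and message format
    // are illustrative assumptions.
    type SamplingCtx = {
      mcpReq: { requestSampling(req: object): Promise<{ content: string }> };
    };

    async function summarizeTool(ctx: SamplingCtx, args: { text: string }) {
      // No API key on the server — ask the client's LLM to generate instead.
      const result = await ctx.mcpReq.requestSampling({
        messages: [{ role: 'user', content: `Summarize:\n${args.text}` }],
        maxTokens: 500,
      });
      return result.content; // flows back out as the tool's response
    }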

But traces? Crickets. Middleware snags tool calls fine. Sampling? A method invocation deep in handler guts. No span. A 2.1-second tool where 1.8 seconds burned on generation — invisible. You’d tweak the wrong 300ms.

“The tool call triggers an LLM call which is invisible in the trace. The middleware from article #7 traces tools/call summarize — but the sampling call inside it? Ghost. No span, no duration, no model name.”

That’s the black box they cracked last time. This? The sequel that makes it demo-ready.

Look, we’ve seen this movie. Early microservices choked on untraced RPCs — remember Zipkin’s rise? MCP sampling is agentic AI’s RPC. Ignore it, and your tools become distributed mysteries. Fix it now, and you’re ahead of the multi-agent swarm coming.

How They Wrapped the Ghost Calls

Four tweaks. Dead simple.

First, toadEyeMiddleware for the basics — a span on every tool call.

Then, a traceSampling wrapper around ctx.mcpReq.requestSampling(). Pass the model and token counts. Boom: a SpanKind.CLIENT span named “chat gpt-4o”. It captures duration (1834ms!), gen_ai.request.model, even mcp.server.name.
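A minimal sketch of what such a wrapper can look like with the OpenTelemetry JS API — the span name and attributes follow the article; the exact signature and error handling are my assumptions:

    import { trace, SpanKind, SpanStatusCode } from '@opentelemetry/api';

    const tracer = trace.getTracer('mcp-server');

    // Wrap a sampling round-trip in a CLIENT span so it nests under the
    // active tool span.
    async function traceSampling<T>(
      opts: { model: string; serverName?: string },
      samplingFn: () => Promise<T>,
    ): Promise<T> {
      return tracer.startActiveSpan(
        `chat ${opts.model}`, // e.g. "chat gpt-4o"
        {
          kind: SpanKind.CLIENT,
          attributes: {
            'gen_ai.request.model': opts.model,
            'mcp.server.name': opts.serverName ?? 'unknown',
          },
        },
        async (span) => {
          try {
            return await samplingFn(); // the actual client LLM round-trip
          } catch (err) {
            span.setStatus({ code: SpanStatusCode.ERROR });
            throw err;
          } finally {
            span.end(); // duration (e.g. 1834ms) lands on the span here
          }
        },
      );
    }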

Nested perfection:

tools/call summarize          2.1s
└── chat gpt-4o (sampling)    1.8s

Actual logic? 300ms. Optimize that, not phantoms.

Code’s a one-liner import. Handler body: wrap your sampling. Five minutes to try — server up, client agent pinging, traces pouring into your OTel backend.
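In practice the wiring can look like this — assuming the wrapper sketched above and a handler shaped like the summarize tool:

    import { traceSampling } from './tracing'; // the one-liner import (path assumed)

    // Inside the tool handler: wrap the sampling call; nothing else changes.
    const result = await traceSampling({ model: 'gpt-4o' }, () =>
      ctx.mcpReq.requestSampling({ messages, maxTokens: 500 }),
    );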

But metrics? That’s where it gets Bloomberg-sharp. Prometheus queries baked in. No vague counters.

The Dashboard That Answers ‘What’s Breaking?’

Glanceable table up top:

Tool Call Rate | Avg Duration | Error Rate | Resource Reads
12.4 req/s     | 45.2 ms      | 2.3%       | 3.1 req/s

Red on errors? Drill down.

Timeseries for call rates by tool. Agent shifting from calculate to search? The lines show it. P50/P95 durations per tool — search P95 spiking past 2s? PagerDuty incoming.

Errors stacked by type: RateLimitError on search (8.7%), ValidationError on calculate (0%). Resource reads by URI — hot data sources screaming.
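The queries themselves aren’t reprinted here, so treat this as a sketch of the PromQL such panels typically run — the metric names are my assumptions (typical OTel-to-Prometheus export naming), not the project’s own:

    // Illustrative PromQL per dashboard panel; metric names are assumed.
    const panelQueries = {
      callRateByTool: 'sum(rate(mcp_tool_calls_total[5m])) by (tool)',
      p95DurationByTool:
        'histogram_quantile(0.95, sum(rate(mcp_tool_duration_ms_bucket[5m])) by (le, tool))',
      errorRateByTool:
        'sum(rate(mcp_tool_errors_total[5m])) by (tool)' +
        ' / sum(rate(mcp_tool_calls_total[5m])) by (tool)',
    };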

Bottom: merged table per tool.

Tool          Rate (req/s)   Avg (ms)   P95 (ms)   Errors
calculate     8.2            12.3       24.1       0%
get-weather   3.1            145.2      312.8      3.2%
search        1.1            890.4      2134.0     8.7%

This isn’t fluff. It’s the boardroom view: costs correlate to durations, errors to revenue leaks. Scale agents? You’ll thank these four stats.

Is This the Zipkin Moment for AI Agents?

Here’s my edge: no one says it, but MCP observability lags LangChain’s by miles — those ecosystems trace everything, hype or not. MCP’s purer spec shines, yet tooling hid gems like sampling.

Prediction? As agents chain servers (multi-MCP incoming), this tracing becomes table stakes. Ignore it, your prod agents black out under load. My bet: toad-eye forks everywhere by Q2, standards body nods, OpenAI clients mandate it.

Skeptical? It’s not corporate spin — open-source, reproducible. But watch: vendors will rebrand this as ‘enterprise agent mesh’ at $10k/mo. Grab the free version first.

Why Does Full MCP Tracing Matter for Your Stack?

Devs on Claude or GPT agents — MCP’s your bridge to custom tools. Without this, sampling’s a liability: costs spike unseen, latencies compound in chains.

Market math: GPT-4o at $5/1M input tokens. A hammered summarize tool sampling 500 tokens per call? At a sustained 12 req/s, that’s roughly $2.6k a day — call it $78k a month — burned blind. Traces expose it; the dashboard lets you throttle.
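Spelled out, assuming that load is sustained around the clock:

    // Back-of-envelope: sustained sampling cost at the cited rates.
    const reqPerSec = 12;
    const tokensPerCall = 500;
    const usdPer1MTokens = 5; // GPT-4o input pricing cited above

    const tokensPerMonth = reqPerSec * tokensPerCall * 86_400 * 30; // ≈ 15.55B
    const usdPerMonth = (tokensPerMonth / 1e6) * usdPer1MTokens;
    console.log(Math.round(usdPerMonth)); // 77760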

Real people? Indie devs shipping agents hit scale pains first. No more ‘it works on my machine’ — traces prove it.

Teams? SREs sleep better. One P95 alert beats 50 user tickets.

And clients? Faster agents, lower bills. Win.

But — caveat — it’s OTel-first. Grafana ready? Good. Raw Prometheus? Grindier setup. Still, baseline beats zero.



Frequently Asked Questions

What is MCP sampling?

MCP servers use it to delegate LLM calls back to clients, since servers lack API keys. Essential for tools needing generation, like summarizers.

How do you trace MCP server LLM calls?

Wrap ctx.mcpReq.requestSampling() in traceSampling({model: 'gpt-4o'}). Middleware handles the rest — spans nest under tools.

Does this dashboard work out of the box?

Yes, Prometheus queries provided. Plug into Grafana for tables, charts, alerts on rates, durations, errors.

Written by Priya Sundaram

Hardware and infrastructure reporter. Tracks GPU wars, chip design, and the compute economy.



Originally reported by dev.to
