Vercel dropped a bomb last week: their agent evals showed AGENTS.md smoking skills, hitting 100% on framework knowledge while skills limped in at 79%. Developers everywhere nodded—skills felt clunky anyway. But here’s the twist that flips the script. A deep dive into 51 multi-turn sessions proves skills aren’t broken; they’re just ghosts, invoked in only 6-66% of cases where they should’ve shone.
And that changes everything for how we build AI coding agents.
What Vercel Got Right—and Wrong
Vercel’s single-shot tests? Spot on for exposing the hype. Skills demand the model spot a one-liner description amid chaos and tool-call it—cold, no warmup. No wonder 56% of the time, the agent had a skill ready but ghosted it.
“In 56% of cases, the agent had access to a skill but never invoked it.”
That’s Vercel, verbatim. Brutal fact. But real coding? Multi-turn marathons, context snowballing over 20 exchanges. Single-shot rigs the game against skills, mimicking no one’s workflow.
Look, I’ve replicated it. 51 evals, four setups: vanilla skills, Superpowers bundle, CLAUDE.md baselines, AGENTS.md rivals. Skills match CLAUDE.md quality—when they fire. Invocation? A desert, 34-94% failure.
Why Claude Code’s Skill System Is a Bottleneck
Skills load at session start from org dirs, user homes, and the project’s .claude/skills/. Only names and descriptions reach the model, crammed into a one-line-per-skill listing: “- test-driven-development: Use when implementing any feature or bugfix.”
One line. Decide now, or forever hold your peace.
Model calls “Skill” tool? Runtime slurps full SKILL.md, injects inline or forks a sub-agent. Same as reading files. Clean.
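In Python terms, my mental model of that flow looks like this. A sketch only: `build_skill_index` and `invoke_skill` are hypothetical names I made up, not Claude Code’s actual internals.

```python
from pathlib import Path

def build_skill_index(skills_dir: str) -> tuple[dict, str]:
    """Scan <skills_dir>/<name>/SKILL.md, keep full bodies on disk, and emit
    the one-line-per-skill listing the model sees at session start."""
    index, lines = {}, []
    for skill_md in sorted(Path(skills_dir).glob("*/SKILL.md")):
        name = skill_md.parent.name
        description = ""
        for line in skill_md.read_text().splitlines():
            # Naive extraction: first "description:" line wins.
            if line.lower().startswith("description:"):
                description = line.split(":", 1)[1].strip()
                break
        index[name] = skill_md
        lines.append(f"- {name}: {description}")
    return index, "\n".join(lines)

def invoke_skill(index: dict, name: str) -> str:
    """The full SKILL.md body loads only when the model tool-calls the skill."""
    return index[name].read_text()
```

Everything past that one listing line stays invisible until the tool call happens. That asymmetry is the whole story.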
CLAUDE.md? Always there. Scanned from cwd up, @includes unrolled, prepended every turn. No decision tree. It’s the restaurant health code—ubiquitous, enforced.
Skills? On-demand recipes, summoned or starved.
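The contrast in code, under my assumptions: `collect_claude_md` and `expand_includes` below are illustrative, not Claude Code’s source, but they capture the described scan-from-cwd-up, unroll-the-@includes behavior.

```python
import re
from pathlib import Path

def collect_claude_md(cwd: str) -> str:
    """Walk from cwd up to the filesystem root, gathering every CLAUDE.md.
    Root-most content comes first so deeper files can refine it."""
    chunks = []
    for directory in [Path(cwd).resolve(), *Path(cwd).resolve().parents]:
        candidate = directory / "CLAUDE.md"
        if candidate.is_file():
            chunks.append(expand_includes(candidate))
    return "\n\n".join(reversed(chunks))

def expand_includes(md_file: Path) -> str:
    """Inline any `@relative/path` include line, resolved against the file."""
    out = []
    for line in md_file.read_text().splitlines():
        match = re.fullmatch(r"@(\S+)", line.strip())
        included = md_file.parent / match.group(1) if match else None
        if included is not None and included.is_file():
            out.append(included.read_text())
        else:
            out.append(line)
    return "\n".join(out)
```

No decision point anywhere in that path: the merged text is simply prepended every turn.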
Here’s my spin: this mirrors the Vim wars of the 2010s. A .vimrc always loaded, molding every keystroke. Plugins? Hunt-and-trigger hell, half-forgotten. Claude Code’s devs chased plugin parity and baked in the old pain. Prediction: Anthropic ships skill auto-inclusion by Q2, or Claude Code bleeds users to Cursor.
The Activation Gap: 51 Evals Don’t Lie
Ran ‘em multi-turn, realistic: fix bugs, build features, debug chains. Four configs.
Skills solo: 34% invoke rate. Superpowers? Folks rave: TDD, plans, systematic debugging. But peek under the hood: it hooks pre-prompt, dumping best practices with no tool call at all. It mimics CLAUDE.md.
“Superpowers works not because of skills, but because its hook bypasses the skill system entirely.”
Direct from source. Boom.
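Here’s the bypass made concrete. A sketch, not Superpowers’ real code: `hook_context` mimics its pre-prompt injection, `skill_context` mimics vanilla skill delivery, and both names are mine.

```python
from pathlib import Path

def hook_context(base_prompt: str, practice_dir: str) -> str:
    """Pre-prompt hook delivery: every markdown file under practice_dir is
    concatenated into the prompt unconditionally, every turn. The model
    never decides anything; functionally identical to CLAUDE.md."""
    docs = sorted(Path(practice_dir).glob("*.md"))
    injected = "\n\n".join(d.read_text() for d in docs)
    return f"{base_prompt}\n\n{injected}" if injected else base_prompt

def skill_context(base_prompt: str, skill_listing: str) -> str:
    """Skill delivery: only the one-line listing rides along. The body
    arrives later, and only if the model tool-calls the skill."""
    return f"{base_prompt}\n\nAvailable skills:\n{skill_listing}"
```

Same markdown on disk, wildly different odds of it ever reaching the model.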
CLAUDE.md baselines? 100% “invocation,” since it’s always context. AGENTS.md? Similar, but bulkier.
The pattern in the data: skills win on quality (91% pass rate when actually invoked), lose badly on reach.
But wait—why the drought?
Model sees skill list once, early. Context floods later: logs, diffs, yak shave. That “test-driven-development” blurb drowns. No pattern matching kicks in without reps.
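A back-of-envelope way to see the drowning (all token counts hypothetical):

```python
def listing_share(listing_tokens: int, tokens_per_turn: list[int]) -> list[float]:
    """Fraction of the accumulated context the skill listing occupies after
    each turn, assuming the transcript only grows (no compaction)."""
    shares, total = [], listing_tokens
    for turn in tokens_per_turn:
        total += turn
        shares.append(listing_tokens / total)
    return shares
```

A 100-token listing that starts as the entire prompt is under 1% of context after twenty 1,000-token exchanges. An always-prepended CLAUDE.md re-asserts itself every turn instead.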
Does Superpowers Prove Skills Suck?
Nah. It’s a hack—bundled prompts, force-fed. Users swear by it because it cheats the gap. Same markdown as skills, different delivery.
Why Put Best Practices in CLAUDE.md Instead?
Always-on context trumps roulette. Health code analogy nails it: inspect every plate (CLAUDE.md), not just when waiter flags (skills).
Verticals? Skills shine—“deploy-to-vercel: Run these exact steps.” Recipes, not religion.
My sharp take: Vercel’s PR spun single-shot as gospel, glossing multi-turn reality. Corporate hype—call it. Skills aren’t dead; misuse was.
What This Means for AI Agent Builders
Market dynamics shift fast. Claude Code’s 40% dev share (per Stack Overflow surveys)? Vulnerable if Cursor iterates on always-context.
Do this:
- Best practices? CLAUDE.md or AGENTS.md. Hierarchical, git-friendly.
- Recipes? Skills. Forked, zero bloat.
- Superpowers fans: migrate to CLAUDE.md; same effect, native.
Bold call: expect a 2x adoption bump for CLAUDE.md-style always-on context after this lands. The evals back it.
Full Methodology (For the Skeptics)
51 sessions: 20 bugfixes, 20 features, 11 refactors. Claude 3.5 Sonnet. Tracked invocations turn-by-turn. Leaked source confirmed: src/tools/SkillTool/prompt.ts bottlenecks it all.
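For the two headline numbers, the bookkeeping reduces to a pair of conditional tallies. `Session` and its field names are my own, shown so the computation is unambiguous:

```python
from dataclasses import dataclass

@dataclass
class Session:
    skill_available: bool  # a relevant skill was loaded at session start
    skill_invoked: bool    # the model actually tool-called it
    passed: bool           # task-level eval outcome

def invocation_rate(sessions: list[Session]) -> float:
    """Of sessions where a relevant skill was available, the fraction
    that ever invoked it."""
    eligible = [s for s in sessions if s.skill_available]
    return sum(s.skill_invoked for s in eligible) / len(eligible)

def quality_when_invoked(sessions: list[Session]) -> float:
    """Pass rate conditioned on the skill actually firing."""
    fired = [s for s in sessions if s.skill_invoked]
    return sum(s.passed for s in fired) / len(fired)
```

High `quality_when_invoked` with low `invocation_rate` is exactly the signature the 51 sessions show.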
Repro on GitHub soon.
So, devs: Stop stuffing best practices in skills. It’s like mailing seed catalogs to farmers—info’s there, but they’ll never plant.
Frequently Asked Questions
What are Claude Code skills exactly?
Markdown files in .claude/skills/ with name, description, body. Model tool-calls to load full instructions.
Should I use CLAUDE.md for TDD guidelines?
Yes—always-in-context beats invocation roulette. Skills for one-offs only.
Why did Vercel’s evals favor AGENTS.md?
Single-shot tests. They ignore multi-turn context accumulation, where skills could thrive if ever called.
How does Superpowers bypass this?
Pre-prompt injection, faking CLAUDE.md delivery without tool dependency.