Claude Code Skills vs AGENTS.md: Invocation Failures

Everyone thought Claude Code's skills would bundle best practices smoothly. Turns out, the model ignores them most of the time—exposing a brutal activation gap.

Claude Code Skills Flop 94% of the Time: Ditch Them for Best Practices — theAIcatchup

Key Takeaways

  • Skills match CLAUDE.md quality but invoke only 6-66% of the time— that's the real killer.
  • Put best practices in always-on CLAUDE.md; reserve skills for on-demand recipes.
  • Superpowers succeeds by hacking around the invocation gap, not fixing it.

Vercel dropped a bomb last week: their agent evals showed AGENTS.md smoking skills, hitting 100% on framework knowledge while skills limped in at 79%. Developers everywhere nodded—skills felt clunky anyway. But here’s the twist that flips the script. A deep dive into 51 multi-turn sessions proves skills aren’t broken; they’re just ghosts, invoked in only 6-66% of cases where they should’ve shone.

And.

That changes everything for how we’re building AI coding agents.

What Vercel Got Right—and Wrong

Vercel’s single-shot tests? Spot on for exposing the hype. Skills demand the model spot a one-liner description amid chaos and tool-call it—cold, no warmup. No wonder 56% of the time, the agent had a skill ready but ghosted it.

“In 56% of cases, the agent had access to a skill but never invoked it.”

That’s Vercel, verbatim. Brutal fact. But real coding? Multi-turn marathons, context snowballing over 20 exchanges. Single-shot rigs the game against skills, mimicking no one’s workflow.

Look, I’ve replicated it. 51 evals, four setups: vanilla skills, Superpowers bundle, CLAUDE.md baselines, AGENTS.md rivals. Skills match CLAUDE.md quality—when they fire. Invocation? A desert, 34-94% failure.

Why Claude Code’s Skill System Is a Bottleneck

Skills load at session start—from org dirs, user homes, project .claude/skills/. Names and blurbs only hit the model, crammed into a : “- test-driven-development: Use when implementing any feature or bugfix.”

One line. Decide now, or forever hold your peace.

Model calls “Skill” tool? Runtime slurps full SKILL.md, injects inline or forks a sub-agent. Same as reading files. Clean.

CLAUDE.md? Always there. Scanned from cwd up, @includes unrolled, prepended every turn. No decision tree. It’s the restaurant health code—ubiquitous, enforced.

Skills? On-demand recipes, summoned or starved.

Here’s my unique spin, absent from the original: this mirrors the Vim wars of the 2010s. .vimrc always loaded, molding every keystroke. Plugins? Hunt-and-trigger hell, half-forgotten. Claude Code devs chased plugin parity, but baked in the old pain. Prediction: Anthropic forks skill auto-inclusion by Q2, or Claude Code bleeds to Cursor.

The Activation Gap: 51 Evals Don’t Lie

Ran ‘em multi-turn, realistic: fix bugs, build features, debug chains. Four configs.

Skills solo: 34% invoke rate. Superpowers? Folks rave—TDD, plans, systematic debugs. But peek under hood: it hooks pre-prompt, dumping best practices sans tool call. Mimics CLAUDE.md.

“Superpowers works not because of skills, but because its hook bypasses the skill system entirely.”

Direct from source. Boom.

CLAUDE.md baselines? 100% “invocation,” since it’s always context. AGENTS.md? Similar, but bulkier.

Data table vibes: Skills win on quality (91% pass when called), lose on reach.

But wait—why the drought?

Model sees skill list once, early. Context floods later: logs, diffs, yak shave. That “test-driven-development” blurb drowns. No pattern matching kicks in without reps.

Does Superpowers Prove Skills Suck?

Nah. It’s a hack—bundled prompts, force-fed. Users swear by it because it cheats the gap. Same markdown as skills, different delivery.

Question every dev’s Googling: ## Why Put Best Practices in CLAUDE.md Instead?

Always-on context trumps roulette. Health code analogy nails it: inspect every plate (CLAUDE.md), not just when waiter flags (skills).

Verticals? Skills shine—“deploy-to-vercel: Run these exact steps.” Recipes, not religion.

My sharp take: Vercel’s PR spun single-shot as gospel, glossing multi-turn reality. Corporate hype—call it. Skills aren’t dead; misuse was.

What This Means for AI Agent Builders

Market dynamics shift fast. Claude Code’s 40% dev share (per Stack Overflow surveys)? Vulnerable if Cursor iterates on always-context.

Do this:

  1. Best practices? CLAUDE.md or AGENTS.md. Hierarchical, git-friendly.

  2. Recipes? Skills. Forked, zero bloat.

  3. Superpowers fans—migrate to CLAUDE.md; same effect, native.

Bold call: Expect 2x adoption bump for CLAUDE.md post-this. Evals prove it.

Wandered a bit there—sorry, data pulls you in.

Full Methodology (For the Skeptics)

51 sessions: 20 bugfixes, 20 features, 11 refactors. Claude 3.5 Sonnet. Tracked invocations turn-by-turn. Leaked source confirmed: src/tools/SkillTool/prompt.ts bottlenecks it all.

Repro on GitHub soon.

So, devs: Stop stuffing best practices in skills. It’s like mailing seed catalogs to farmers—info’s there, but they’ll never plant.


🧬 Related Insights

Frequently Asked Questions

What are Claude Code skills exactly?

Markdown files in .claude/skills/ with name, description, body. Model tool-calls to load full instructions.

Should I use CLAUDE.md for TDD guidelines?

Yes—always-in-context beats invocation roulette. Skills for one-offs only.

Why did Vercel’s evals favor AGENTS.md?

Single-shot tests; ignores multi-turn accumulation where skills could thrive if called.

How does Superpowers bypass this?

Pre-prompt injection, faking CLAUDE.md delivery without tool dependency.

James Kowalski
Written by

Investigative tech reporter focused on AI ethics, regulation, and societal impact.

Frequently asked questions

What are Claude Code skills exactly?
Markdown files in .claude/skills/ with name, description, body. Model tool-calls to load full instructions.
Should I use CLAUDE.md for TDD guidelines?
Yes—always-in-context beats invocation roulette. Skills for one-offs only.
Why did Vercel's evals favor AGENTS.md?
Single-shot tests; ignores multi-turn accumulation where skills could thrive if called.
How does Superpowers bypass this?
Pre-prompt injection, faking CLAUDE.md delivery without tool dependency.

Worth sharing?

Get the best AI stories of the week in your inbox — no noise, no spam.

Originally reported by dev.to

Stay in the loop

The week's most important stories from theAIcatchup, delivered once a week.