Vercel dropped a bomb last week: their agent evals showed AGENTS.md smoking skills, hitting 100% on framework knowledge while skills limped in at 79%. Developers everywhere nodded—skills felt clunky anyway. But here’s the twist that flips the script. A deep dive into 51 multi-turn sessions proves skills aren’t broken; they’re just ghosts, invoked in only 6-66% of cases where they should’ve shone.
And that changes everything for how we build AI coding agents.
What Vercel Got Right—and Wrong
Vercel’s single-shot tests? Spot on for exposing the hype. Skills demand the model spot a one-liner description amid chaos and tool-call it—cold, no warmup. No wonder 56% of the time, the agent had a skill ready but ghosted it.
“In 56% of cases, the agent had access to a skill but never invoked it.”
That’s Vercel, verbatim. Brutal fact. But real coding? Multi-turn marathons, context snowballing over 20 exchanges. Single-shot rigs the game against skills, mimicking no one’s workflow.
Look, I’ve replicated it. 51 evals, four setups: vanilla skills, Superpowers bundle, CLAUDE.md baselines, AGENTS.md rivals. Skills match CLAUDE.md quality—when they fire. Invocation? A desert, 34-94% failure.
Why Claude Code’s Skill System Is a Bottleneck
Skills load at session start from org dirs, user homes, and the project’s .claude/skills/. Only names and descriptions reach the model, crammed into a one-line-per-skill listing: “- test-driven-development: Use when implementing any feature or bugfix.”
One line. Decide now, or forever hold your peace.
Model calls “Skill” tool? Runtime slurps full SKILL.md, injects inline or forks a sub-agent. Same as reading files. Clean.
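In Python terms, my mental model of that flow looks like this. A sketch only: `build_skill_index` and `invoke_skill` are hypothetical names I made up, not Claude Code’s actual internals.

```python
from pathlib import Path

def build_skill_index(skills_dir: str) -> tuple[dict, str]:
    """Scan <skills_dir>/<name>/SKILL.md, keep full bodies on disk, and emit
    the one-line-per-skill listing the model sees at session start."""
    index, lines = {}, []
    for skill_md in sorted(Path(skills_dir).glob("*/SKILL.md")):
        name = skill_md.parent.name
        description = ""
        for line in skill_md.read_text().splitlines():
            # Naive extraction: first "description:" line wins.
            if line.lower().startswith("description:"):
                description = line.split(":", 1)[1].strip()
                break
        index[name] = skill_md
        lines.append(f"- {name}: {description}")
    return index, "\n".join(lines)

def invoke_skill(index: dict, name: str) -> str:
    """The full SKILL.md body loads only when the model tool-calls the skill."""
    return index[name].read_text()
```

Everything past that one listing line stays invisible until the tool call happens. That asymmetry is the whole story.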
CLAUDE.md? Always there. Scanned from cwd up, @includes unrolled, prepended every turn. No decision tree. It’s the restaurant health code—ubiquitous, enforced.
Skills? On-demand recipes, summoned or starved.
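The contrast in code, under my assumptions: `collect_claude_md` and `expand_includes` below are illustrative, not Claude Code’s source, but they capture the described scan-from-cwd-up, unroll-the-@includes behavior.

```python
import re
from pathlib import Path

def collect_claude_md(cwd: str) -> str:
    """Walk from cwd up to the filesystem root, gathering every CLAUDE.md.
    Root-most content comes first so deeper files can refine it."""
    chunks = []
    for directory in [Path(cwd).resolve(), *Path(cwd).resolve().parents]:
        candidate = directory / "CLAUDE.md"
        if candidate.is_file():
            chunks.append(expand_includes(candidate))
    return "\n\n".join(reversed(chunks))

def expand_includes(md_file: Path) -> str:
    """Inline any `@relative/path` include line, resolved against the file."""
    out = []
    for line in md_file.read_text().splitlines():
        match = re.fullmatch(r"@(\S+)", line.strip())
        included = md_file.parent / match.group(1) if match else None
        if included is not None and included.is_file():
            out.append(included.read_text())
        else:
            out.append(line)
    return "\n".join(out)
```

No decision point anywhere in that path: the merged text is simply prepended every turn.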
Here’s my spin: this mirrors the Vim wars of the 2010s. A .vimrc always loaded, molding every keystroke. Plugins? Hunt-and-trigger hell, half-forgotten. Claude Code’s devs chased plugin parity and baked in the old pain. Prediction: Anthropic ships skill auto-inclusion by Q2, or Claude Code bleeds users to Cursor.
The Activation Gap: 51 Evals Don’t Lie
Ran ‘em multi-turn, realistic: fix bugs, build features, debug chains. Four configs.
Skills solo: 34% invoke rate. Superpowers? Folks rave: TDD, plans, systematic debugging. But peek under the hood: it hooks pre-prompt, dumping best practices with no tool call at all. It mimics CLAUDE.md.
“Superpowers works not because of skills, but because its hook bypasses the skill system entirely.”
Direct from source. Boom.
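Here’s the bypass made concrete. A sketch, not Superpowers’ real code: `hook_context` mimics its pre-prompt injection, `skill_context` mimics vanilla skill delivery, and both names are mine.

```python
from pathlib import Path

def hook_context(base_prompt: str, practice_dir: str) -> str:
    """Pre-prompt hook delivery: every markdown file under practice_dir is
    concatenated into the prompt unconditionally, every turn. The model
    never decides anything; functionally identical to CLAUDE.md."""
    docs = sorted(Path(practice_dir).glob("*.md"))
    injected = "\n\n".join(d.read_text() for d in docs)
    return f"{base_prompt}\n\n{injected}" if injected else base_prompt

def skill_context(base_prompt: str, skill_listing: str) -> str:
    """Skill delivery: only the one-line listing rides along. The body
    arrives later, and only if the model tool-calls the skill."""
    return f"{base_prompt}\n\nAvailable skills:\n{skill_listing}"
```

Same markdown on disk, wildly different odds of it ever reaching the model.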
CLAUDE.md baselines? 100% “invocation,” since it’s always context. AGENTS.md? Similar, but bulkier.
The pattern in the data: skills win on quality (91% pass rate when actually invoked), lose badly on reach.
But wait—why the drought?
Model sees skill list once, early. Context floods later: logs, diffs, yak shave. That “test-driven-development” blurb drowns. No pattern matching kicks in without reps.
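A back-of-envelope way to see the drowning (all token counts hypothetical):

```python
def listing_share(listing_tokens: int, tokens_per_turn: list[int]) -> list[float]:
    """Fraction of the accumulated context the skill listing occupies after
    each turn, assuming the transcript only grows (no compaction)."""
    shares, total = [], listing_tokens
    for turn in tokens_per_turn:
        total += turn
        shares.append(listing_tokens / total)
    return shares
```

A 100-token listing that starts as the entire prompt is under 1% of context after twenty 1,000-token exchanges. An always-prepended CLAUDE.md re-asserts itself every turn instead.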
Does Superpowers Prove Skills Suck?
Nah. It’s a hack—bundled prompts, force-fed. Users swear by it because it cheats the gap. Same markdown as skills, different delivery.
Why Put Best Practices in CLAUDE.md Instead?
Always-on context trumps roulette. Health code analogy nails it: inspect every plate (CLAUDE.md), not just when waiter flags (skills).
Verticals? Skills shine—“deploy-to-vercel: Run these exact steps.” Recipes, not religion.
My sharp take: Vercel’s PR spun single-shot as gospel, glossing multi-turn reality. Corporate hype—call it. Skills aren’t dead; misuse was.
What This Means for AI Agent Builders
Market dynamics shift fast. Claude Code’s 40% dev share (per Stack Overflow surveys)? Vulnerable if Cursor iterates on always-context.
Do this:
- Best practices? CLAUDE.md or AGENTS.md. Hierarchical, git-friendly.
- Recipes? Skills. Forked, zero bloat.
- Superpowers fans: migrate to CLAUDE.md; same effect, native.
Bold call: expect a 2x adoption bump for CLAUDE.md-style always-on context after this lands. The evals back it.
Full Methodology (For the Skeptics)
51 sessions: 20 bugfixes, 20 features, 11 refactors. Claude 3.5 Sonnet. Tracked invocations turn-by-turn. Leaked source confirmed: src/tools/SkillTool/prompt.ts bottlenecks it all.
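For the two headline numbers, the bookkeeping reduces to a pair of conditional tallies. `Session` and its field names are my own, shown so the computation is unambiguous:

```python
from dataclasses import dataclass

@dataclass
class Session:
    skill_available: bool  # a relevant skill was loaded at session start
    skill_invoked: bool    # the model actually tool-called it
    passed: bool           # task-level eval outcome

def invocation_rate(sessions: list[Session]) -> float:
    """Of sessions where a relevant skill was available, the fraction
    that ever invoked it."""
    eligible = [s for s in sessions if s.skill_available]
    return sum(s.skill_invoked for s in eligible) / len(eligible)

def quality_when_invoked(sessions: list[Session]) -> float:
    """Pass rate conditioned on the skill actually firing."""
    fired = [s for s in sessions if s.skill_invoked]
    return sum(s.passed for s in fired) / len(fired)
```

High `quality_when_invoked` with low `invocation_rate` is exactly the signature the 51 sessions show.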
Repro on GitHub soon.
So, devs: Stop stuffing best practices in skills. It’s like mailing seed catalogs to farmers—info’s there, but they’ll never plant.
Frequently Asked Questions
What are Claude Code skills exactly?
Markdown files in .claude/skills/ with name, description, body. Model tool-calls to load full instructions.
Should I use CLAUDE.md for TDD guidelines?
Yes—always-in-context beats invocation roulette. Skills for one-offs only.
Why did Vercel’s evals favor AGENTS.md?
Single-shot tests. They ignore multi-turn context accumulation, where skills could thrive if ever called.
How does Superpowers bypass this?
Pre-prompt injection, faking CLAUDE.md delivery without tool dependency.