17 points. That’s the performance chasm in a study pitting three agent frameworks against 731 coding problems. Same model. Identical tasks. Only difference? Instruction scaffolding.
We chase the next big LLM — Sonnet 3.5, o1-preview, whatever GPT-Next is called — like it’s the holy grail. But here’s the kicker: your crummy CLAUDE.md or copilot-instructions.md wreaks more havoc than picking the ‘wrong’ model.
And nobody tests them. Not you, not your team. You tweak a paragraph, git push, cross fingers. Sound familiar?
I’ve seen it all in 20 years covering this Valley circus. Back when unit tests were ‘optional,’ codebases rotted overnight. Now we CI/CD everything — except the files steering your AI agents. Dead file refs from six-month-old renames. Bloated fluff chewing tokens. Contradictions between docs. It’s amateur hour.
Why Do Instructions Crush Model Choice?
Think APIs without contracts. Chaos. Your AI instructions? Same deal, but with hallucinations on steroids.
Scan any repo’s instruction files after three months. Dead refs everywhere — ‘check src/auth.ts,’ but it’s authentication.ts now. AI hunts ghosts, spits garbage.
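You don’t need a tool to see this for yourself. Here is a crude shell pass that pulls path-looking strings out of CLAUDE.md and flags anything that no longer exists on disk; the regex is my assumption about what a reference looks like, not how agenteval parses them:

```bash
# Crude dead-ref check: extract path-looking strings from CLAUDE.md
# and flag any that no longer exist. The pattern is a rough guess,
# not agenteval's parser.
grep -oE '[A-Za-z0-9_./-]+\.(ts|js|py|md|json)' CLAUDE.md | sort -u | while read -r ref; do
  [ -e "$ref" ] || echo "dead ref: $ref"
done
```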
Fluff kills too. ‘Make sure to thoroughly test every edge case robustly.’ Twenty-five tokens of nothing. In a 200K window, that’s code context flushed. I’ve parsed hundreds; ‘it is important that,’ ‘please ensure’ infest them like weeds.
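Grepping for the usual offenders takes seconds. The phrase list below is my guess at common filler, not agenteval’s rule set:

```bash
# Flag filler phrases across the instruction files you track.
# The phrase list is a guess at the usual offenders, not agenteval's rules.
grep -inE 'make sure to|please ensure|it is important that|thoroughly|robustly' \
  CLAUDE.md AGENTS.md .github/copilot-instructions.md 2>/dev/null
```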
Worse: conflicts. ‘Always semicolons.’ Then, pages later, ‘Follow Prettier’ — and the project’s Prettier config strips ‘em. Model flips a coin. Multi-person edits? Guaranteed drift.
Bloat balloons. 300-line CLAUDE.md plus 200-line AGENTS.md? Thousands of tokens gone before the model reads a line of your code. Performance nosedives across the board.
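If you want a number, a common back-of-the-envelope heuristic is roughly four characters per token. This is a rough estimate, not a real tokenizer:

```bash
# Rough token estimate: ~4 characters per token is a common heuristic,
# not a real tokenizer. Good enough to spot bloat.
wc -c CLAUDE.md AGENTS.md 2>/dev/null | awk '{printf "%-35s ~%d tokens\n", $2, $1/4}'
```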
My unique angle: this mirrors the NoSQL hype crash of 2010. Everyone ditched relational DBs for ‘scalable’ schemas, ignored data integrity. Billions lost. Today, we’re schema-less with AI prompts — until tools like agenteval force discipline.
Agenteval: No-BS Linter for AI Docs
CLI. Dead simple.
```bash
curl -fsSL https://raw.githubusercontent.com/lukasmetzler/agenteval/main/install.sh | bash
agenteval lint
```
Parses Markdown, token-counts, flags dead refs, fluff, vagueness. Deterministic. Sub-second. No LLM magic — just rules.
First run on my project:
```
CLAUDE.md
  ERROR  Referenced file “docs/schema.md” does not exist
         → Remove the reference or create the missing file
  info   Section “Testing” contains 1 filler phrase(s)
         → Rewrite without phrases like ‘make sure to’
  info   Vague instructions: “be careful with error handling”
         → Replace with a specific example or threshold
```
Actionable. Every time.
Supports CLAUDE.md, AGENTS.md, .github/copilot-instructions.md, .cursorrules, even Anthropic skills. Scoped too.
But linting’s table stakes. Agenteval goes deeper.
Can You Prove Your Tweaks Work?
- Harvest: Mines git history for AI commits — detects Claude, Copilot, Cursor, Devin, 14 tools total. Spits real benchmarks from your past work, with instruction snapshots.
- Run: Feeds tasks to agents in isolated worktrees. Scores on correctness (right files?), precision (no extras?), efficiency (tokens?), conventions.
- Compare: Side-by-side runs. Did that rewrite boost scores?
No fake tasks. Your history’s the gold standard.
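If you’re curious what harvesting has to work with: many coding agents leave co-author trailers in commit messages. This one-liner is just a taste of the idea; the trailer patterns below are conventions I’ve seen in the wild, and agenteval’s detection of 14 tools is surely broader:

```bash
# Find commits carrying AI co-author trailers in their messages.
# These trailer patterns are common conventions, not agenteval's detector.
git log --all --extended-regexp \
  --grep='Co-[Aa]uthored-[Bb]y: .*(Claude|Copilot|Cursor|Devin)' --oneline
```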
Skeptical? Good. Valley loves ‘eval’ tools that game metrics. Agenteval sidesteps — real commits, no synthesis. But watch: if adoption spikes, expect PR spin on ‘pass rates.’ Who profits? The tool’s open-source (for now), but expect enterprise forks charging SaaS bucks.
Look, we’ve normalized testing code. APIs. UIs. But AI instructions? The puppet strings pulling your productivity. Ignore ‘em, and that 17-point gap becomes your daily tax.
Agenteval won’t save sloppy teams. But for serious shops — the ones scaling AI agents beyond toys — it’s the sanity check we should’ve built years ago.
One more warning: duplicated instructions across files. CLAUDE.md says tabs; AGENTS.md says it too. Double the tokens, zero gain. Update one, forget the other? Boom, contradictions. Seen it tank deploys.
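A quick way to catch the copy-paste flavor of this is to list lines that appear verbatim in both files (paraphrased or contradictory rules still need a human, or a real linter):

```bash
# Lines that appear verbatim in both CLAUDE.md and AGENTS.md.
# Catches copy-paste duplication only; contradictions need more than this.
comm -12 <(sort -u CLAUDE.md) <(sort -u AGENTS.md)
```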
This isn’t hype. It’s the unglamorous grind making AI dev reliable. Finally.
Why Does This Matter for Developers?
Tokens cost cash. Fluff’s a leak. Dead refs waste cycles. Precision scores catch over-eager agents rewriting half your repo.
Historical parallel: JUnit’s rise killed cowboy coding. Agenteval could do that for prompts.
Prediction: In 12 months, top teams mandate instruction CI. Laggards eat 20% efficiency hits.
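The simplest version of ‘instruction CI’ is a pre-commit hook that refuses to ship when the lint fails. I’m assuming agenteval lint exits non-zero on errors; check the tool’s docs before wiring this in:

```bash
#!/usr/bin/env bash
# .git/hooks/pre-commit -- the simplest possible instruction CI.
# Assumes `agenteval lint` exits non-zero when it finds errors.
agenteval lint || {
  echo "Instruction files failed lint -- fix them before committing." >&2
  exit 1
}
```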
But — and it’s a big but — git-harvest assumes clean commit history. Messy repos? Garbage in, garbage out. Clean yours first.
Worth the install? If you’re past the ‘wow, Copilot!’ phase — yes.
Frequently Asked Questions
What is agenteval and how do I install it?
CLI tool for linting and evaluating AI instruction files. Curl the install script, run ‘agenteval lint’.
Does agenteval work with GitHub Copilot and Cursor?
Yes, supports copilot-instructions.md, .cursorrules, CLAUDE.md, AGENTS.md, and more.
How does agenteval benchmark real performance?
Harvests your git history for AI-assisted commits, replays those tasks in isolated worktrees, and scores correctness, precision, efficiency, and conventions.