AI just got punked by contracts.
Picture this: a ghost fee lurking in plain sight, multiplying into millions, while AI scrolls past like it’s window shopping. That’s the thrill — and terror — of tricking AI with fake contracts, the ultimate stress test for agentic dreams.
I dove into Aniruddh Suresh’s wild experiment, recreating those poisoned deals myself on top models. Why? Because we’re at the edge of AI as decision-maker, not just sidekick. SaaS giants promise it’ll audit your legalese faster than coffee kicks in. But does it? Let’s unpack the traps, the triumphs, the faceplants.
The Ghost Tax That Haunts
Contract 1. Lethality: 9/10. A sleek Master Services Agreement, Delaware jurisdiction, all the polish. Then, bam: Article 11 buries a “Sync Maintenance Fee” of $0.001 per data object per day, billed on the maximum monthly object count.
Ten million objects? That’s $3.65 million yearly. Invisible ink for humans too; we skim tiny numbers. But AI?
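For scale, the arithmetic a reviewer needs is trivial. A minimal sketch (the rate and object count come from the contract above; the function name and the millidollar trick are mine):

```python
# Sketch of the Ghost Tax arithmetic. $0.001 = 1 millidollar per object-day;
# integer millidollars avoid float drift until the final conversion.
MILLIDOLLARS_PER_OBJECT_DAY = 1  # the $0.001/object/day rate from Article 11

def annual_sync_fee_usd(objects: int, days: int = 365) -> float:
    """Annualized 'Sync Maintenance Fee' in dollars."""
    return objects * days * MILLIDOLLARS_PER_OBJECT_DAY / 1000

print(f"${annual_sync_fee_usd(10_000_000):,.0f}")  # $3,650,000 per year
```

Three lines of multiplication. That's the whole trap.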
Most models greenlit it. “Standard,” they chirped, dazzled by formatting, blind to math. One passed — Claude, crunching numbers unprompted. Others? Crickets.
Here’s the raw AI fail from Suresh’s test:

[Screenshot caption: “An AI that failed this contract test”]

(Okay, that’s just the caption, but you get it: screenshots scream complacency.)

And the winner:

[Screenshot caption: “An AI that passed this contract test”]
But here’s my twist, my fresh angle: this echoes 1990s spreadsheets. Remember Lotus 1-2-3? Flawless formulas, zero common sense. Users punched in bad data, boom — garbage profits. AI’s there now, pattern-matching prose, not probing peril. Yet.
Short para punch: Math blindness kills trust.
Can AI Sniff Out a Golden Ticket Scam?
Flip the script. Contract 2: obscenely client-friendly. Provider pays $100k/month, free engineers, $1 liability cap, 24-month payout post-termination. No strings. Absurd.
Any lawyer yells red flag: too good means trap. Does AI bring that same skepticism?
Suresh’s quote nails it:
“This agreement is extraordinarily one-sided in your favour. Free capital, free infrastructure, free staff, full IP control, total revenu…”
(Cut off, but you see: one model woke up.) Others? Partied on, missing the “why would anyone sign this?” vibe.
I tested GPT-4o, Gemini, Llama. Mixed bag. Gemini flagged the lopsidedness hard: “Implausible commercially — likely a mistake or ploy.” GPT? Waffled, suggested tweaks but no alarm bells. It’s like handing a kid unlimited candy — joy first, cavities later.
Dig deeper. This probes world modeling. AI grasps syntax, semantics — but business reality? Spotty. Generous terms scream bait: maybe future litigation bait, maybe an unspoken IP Trojan horse.
One para wonder: Agents evolve, but today’s agents still miss the street smarts.
Now, the others — quick hits, ‘cause pace matters.
Contract 3: Surface mess, fair core. AI nitpicked style, ignored substance. Humans do worse under deadline.
Contract 4: Absurd clauses screaming invalidity. Most caught ‘em.
Contract 5: Fundamental trap — unsigned authority. Basic stuff; all nailed it.
Why Can’t AI Do Simple Math on Fees?
Back to the ghost. Why the math flop? Token prediction favors fluency over arithmetic. LLMs train on text, not calculators. But wait — chain-of-thought prompting fixes half. Unprompted? Nope.
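The chain-of-thought nudge can be as blunt as a suffix bolted onto the prompt. A sketch (the wording is mine, not Suresh’s, and a real harness would wrap this around an actual API call):

```python
# Hypothetical chain-of-thought suffix: force the implicit arithmetic into
# an explicit, checkable step before the model renders a verdict.
COT_SUFFIX = (
    "Before any verdict, list every fee clause, write out the arithmetic "
    "for its annual cost at 10,000,000 data objects, and flag anything "
    "above $100,000 per year."
)

def build_prompt(contract_text: str) -> str:
    """Contract text plus the arithmetic-forcing instruction."""
    return f"Review this contract:\n\n{contract_text}\n\n{COT_SUFFIX}"
```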
I ran my own: fed the Ghost Tax to o1-preview. It multiplied. “Potential $3M+ annual hit — flag urgently.” Progress! Yet vanilla GPT-4? Still blind.
Analogy time: AI’s a supersonic jet — Mach 5, wrong heading. Tools like code interpreters graft on Wolfram brains. Future? Native reasoning layers, like neurons firing numbers instinctively.
Bold prediction — my unique spin: By 2026, multi-agent systems will swarm contracts. One agent math-checks, another sanity-checks incentives, third simulates five-year fallout. No single model; orchestra. We’ve seen hints in AutoGen, CrewAI. Contracts first domino.
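Here’s how that orchestra might look in miniature. Everything below is hypothetical: the agents are stubs standing in for LLM calls, and the thresholds are illustrative, not Suresh’s.

```python
from typing import Callable

def math_agent(contract: dict) -> list[str]:
    """Recompute every fee clause instead of trusting the prose."""
    flags = []
    for clause, fee in contract.get("fees", {}).items():
        annual = fee["rate"] * fee["units"] * 365
        if annual > contract.get("deal_value", 0) * 0.1:  # arbitrary 10% threshold
            flags.append(f"{clause}: ~${annual:,.0f}/yr dwarfs the deal value")
    return flags

def incentive_agent(contract: dict) -> list[str]:
    """Sanity-check incentives: implausibly one-sided terms are bait."""
    if contract.get("liability_cap", float("inf")) <= 1:
        return ["$1 liability cap: why would the counterparty sign this?"]
    return []

def review(contract: dict, agents: list[Callable]) -> list[str]:
    """Orchestrator: union of all agents' flags; any flag escalates to a human."""
    return [flag for agent in agents for flag in agent(contract)]

ghost_tax = {"deal_value": 500_000,
             "fees": {"Sync Maintenance Fee": {"rate": 0.001, "units": 10_000_000}}}
print(review(ghost_tax, [math_agent, incentive_agent]))
```

The design point: no agent trusts another’s silence. Any single flag bubbles up to a human.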
But hype alert. Companies pitch “agentic AI” as if it clears the room of lawyers. Suresh’s test? Reality check. It’s a tireless clerk, not sage counsel. Overtrust, and ghost taxes bankrupt you.
Look, energy here: AI’s platform shift rivals electricity. But wiring’s faulty — shock risk high.
What About the Absurdly Messy One?
Contract 3 details: typos galore, wonky structure, but terms golden. AI obsessed over polish: “Unprofessional, revise.” Missed the fairness. Humans glance past mess to meat.
So — pattern emerges. AI excels syntax, stumbles strategy. It’s literalist lawyer, not crafty closer.
Tested myself: Grok caught structure sins, skimmed equity. “Viable if cleaned.” Close, but no cigar on deeper fairness.
And the final trap? Authority void — no signer empowered. Every model screamed it. Low-hanging fruit.
Real-World Workflow Shakeup
You’re a founder or an analyst. Feed AI 100 contracts daily? It’ll cull 80% boilerplate, flag 15% basics. But ghosts, goldens? You ride shotgun.
The shape of Suresh’s gap: surface over substance, words over worlds. The fix runs through tools: parse contracts into spreadsheets, run the simulations.
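The “parse to spreadsheets” half of that fix can be sketched in a few lines. The assumptions are mine: a naive regex for per-unit daily fees and a known object count; real clauses would need a far sturdier parser.

```python
import re

# Naive extractor for "$X per <unit> per day" fee clauses (pattern is mine).
FEE_PATTERN = re.compile(r"\$(\d+\.\d+)\s+per\s+([\w\s]+?)\s+per\s+day", re.IGNORECASE)

def annualized_exposure(text: str, unit_count: int) -> dict[str, float]:
    """Map each matched fee unit to its yearly cost at the given count."""
    return {unit.strip(): float(rate) * unit_count * 365
            for rate, unit in FEE_PATTERN.findall(text)}

clause = "Provider shall charge a Sync Maintenance Fee of $0.001 per data object per day."
print(annualized_exposure(clause, unit_count=10_000_000))
```

Once the fee is a number in a table instead of a phrase in a paragraph, $3.65M stops hiding.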
Enthused? Hell yes. This isn’t failure; it’s map to v2. Agentic AI learns from these punkings, iterates monthly. Imagine 2030: AI negotiates end-to-end, humans veto rarities.
Critique the spin: Vendors demo cherry-picked wins. Suresh-style traps? Crickets in keynotes.
One sentence stunner: Trust, but verify — always.
🧬 Related Insights
- Read more: Railway’s $100M Gambit: Custom Data Centers to Supercharge AI Devs
- Read more: GroundedPlanBench Proves VLMs Need Spatial Smarts for Real Robot Tasks
Frequently Asked Questions
Will AI replace contract lawyers?
Not yet — it catches basics, misses sneaky math and sanity checks. Hybrid teams win.
How do I test AI on my contracts?
Craft traps like ghost fees, too-good deals. No hints. See what slips.
What’s the future of AI contract review?
Multi-agent swarms with math/reason tools — autonomous by 2026.