Reddit comments exploding like fireworks on the Fourth—‘Vague!’ ‘Arbitrary!’ ‘Hand-wavy!’—that’s how my shiny new AI decision tool, Arbiter, went down in flames just weeks after launch.
Zoom out: I’d fed business decisions into GPT-4o, got back a neat JSON with recommendations, pros/cons, confidence scores. Looked pro. Felt like decision intelligence. Wrong.
Users nailed it. No evidence-weighing mechanism. Confidence pulled from thin air. Run it twice? Flip-flop city.
But here’s the spark—real decisions aren’t one-shot reasoning blasts. They’re staged battles: define the ring (constraints), arm the fighters (advocacy), let the judge score (arbitration). I rebuilt Arbiter around that. Now it’s a constraint-driven arbitrator, not a summarizer in disguise.
Why Did My First AI Decision Tool Implode?
One prompt. One LLM call. Boom—JSON. The 'senior strategy analyst' persona worked for polish, failed for rigor.
Justifications? Confident prose papering over zero weights. Confidence scores? Cosmetic fluff—85% on mush, 75% on gold. Repeat runs? Total chaos, no debate.
It mimicked smarts. Didn’t deliver them. Like asking a mirror to pick your outfit—reflects bias, not logic.
What Changed: The Four-Stage Pipeline That Forces Honesty
User drops a decision: ‘Should we build in-house or buy SaaS?’
Stage 1 hits. Constraint extraction. No skipping this gatekeeper.
It spits structured JSON—hard constraints (budget caps), soft ones (team disruption, weighted), criteria (go-live in 4 months), risks, non-negotiables, even flags unknowns like ‘team capacity?’. Every ID (HC1, SC1) becomes gospel for later stages.
“The point isn’t the format. The point is that every downstream stage now references the same constraint IDs.”
Boom—the hand-waving collapses. Advocates must map every claim to HC1 or bust. The Arbitrator scores against the exact same frame. No free-form poetry.
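Here's a sketch of what a Stage 1 output might look like. The ID scheme (HC1, SC1, DC1) and field ideas come from the article; the exact field names, weights, and the budget figure are my illustrative assumptions, not Arbiter's actual schema:

```python
# Illustrative Stage 1 constraint-extraction output.
# Field names and the $200k figure are hypothetical; only the
# HC/SC/DC ID convention is taken from the article.
constraints = {
    "hard_constraints": [
        {"id": "HC1", "text": "Total cost under $200k/year"},
    ],
    "soft_constraints": [
        {"id": "SC1", "text": "Minimize team disruption", "weight": 0.3},
    ],
    "decision_criteria": [
        {"id": "DC1", "text": "Go-live within 4 months"},
    ],
    "unknown_critical_inputs": ["team capacity?"],
}

def all_ids(c: dict) -> set:
    """Collect every constraint ID so later stages can validate
    that advocates and the arbitrator only cite real IDs."""
    keys = ("hard_constraints", "soft_constraints", "decision_criteria")
    return {item["id"] for key in keys for item in c[key]}
```

Downstream stages can reject any brief that cites an ID outside `all_ids(constraints)`—that's the "gospel" enforcement in code form.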
Stage 2: Research. No leaning on stale training data. The Tavily search API fires three targeted queries in parallel, then synthesizes findings with source-quality flags—uncertainty baked in, not ignored.
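A minimal sketch of the parallel research stage. The real pipeline uses Tavily's API; here `tavily_search` is a stand-in stub (and the example queries are mine), so the parallel-fan-out shape is the point, not the search call:

```python
import concurrent.futures

def tavily_search(query: str) -> dict:
    """Stand-in for a Tavily API call (e.g. tavily-python's client.search).
    Returns stubbed results with a source-quality flag."""
    return {"query": query, "results": [], "source_quality": "unknown"}

def research_stage(queries: list) -> list:
    # Fire all queries in parallel, preserving order of results.
    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as pool:
        return list(pool.map(tavily_search, queries))

findings = research_stage([
    "SaaS vs in-house build total cost of ownership",   # hypothetical query
    "SaaS vendor uptime benchmarks",                    # hypothetical query
    "in-house build time-to-production estimates",      # hypothetical query
])
```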
Then, the magic: Stage 3, parallel Independent Advocates. One LLM per option. Each builds the strongest case, steel-manning it against constraints. No winner-take-all; they clash in JSON briefs.
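The advocate fan-out can be sketched with asyncio. The LLM call is stubbed (the real version would await a provider's API); the brief's field names are my assumptions:

```python
import asyncio

async def advocate(option: str, constraints: dict) -> dict:
    """Stand-in for one LLM call that steel-mans `option`
    against the shared constraint IDs."""
    await asyncio.sleep(0)  # real version: await an LLM API call here
    return {
        "option": option,
        "claims": [{"constraint_id": "HC1",
                    "argument": f"{option} fits the budget"}],
    }

async def advocacy_stage(options: list, constraints: dict) -> list:
    # One independent advocate per option, run concurrently.
    return list(await asyncio.gather(
        *(advocate(o, constraints) for o in options)
    ))

briefs = asyncio.run(
    advocacy_stage(["build in-house", "buy SaaS"], {"HC1": "budget cap"})
)
```

Each advocate never sees the others' briefs—independence is what makes the Stage 4 clash meaningful.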
Finally, Stage 4: The Arbitrator. Cold judge. Scores advocates head-to-head on constraints, weighs evidence strength, drops confidence if gaps yawn (those unknown_critical_inputs bite back). Outputs a Decision Brief—recommendation, traceable scores, next steps.
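A toy version of the arbitration math, assuming weighted per-constraint scores and a flat confidence penalty per unresolved unknown (the weights, scores, and penalty size are all illustrative, not Arbiter's actual formula):

```python
def arbitrate(briefs: list, weights: dict, unknowns: list) -> dict:
    """Score each brief as a weighted sum over constraint claims;
    dock confidence for every unknown_critical_input still open."""
    scored = []
    for b in briefs:
        score = sum(weights.get(c["constraint_id"], 0.0) * c["score"]
                    for c in b["claims"])
        scored.append((b["option"], round(score, 2)))
    scored.sort(key=lambda t: t[1], reverse=True)
    confidence = max(0.0, 0.95 - 0.1 * len(unknowns))  # arbitrary penalty
    return {"recommendation": scored[0][0],
            "scores": scored,
            "confidence": confidence}

briefs = [
    {"option": "buy SaaS", "claims": [
        {"constraint_id": "HC1", "score": 0.9},
        {"constraint_id": "SC2", "score": 0.7}]},
    {"option": "build in-house", "claims": [
        {"constraint_id": "HC1", "score": 0.6},
        {"constraint_id": "SC2", "score": 0.8}]},
]
verdict = arbitrate(briefs, {"HC1": 0.6, "SC2": 0.4},
                    unknowns=["vendor uptime data"])
```

The key property: every number in the verdict traces back to a constraint ID and a weight, so "92% confidence" is arithmetic, not vibes.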
Each stage? Own prompt, own call, JSON handoff. Chained. Unbreakable.
How This Mimics the Human Brain’s Secret Weapon
Think jazz improv vs. symphony. Single prompt? Solo riff—brilliant or bust. Pipeline? Orchestra sections handing off themes, building crescendo.
My unique twist: this isn’t just better prompts. It’s the microservices moment for LLMs. Remember monolith apps crashing under scale? We split to services. LLMs crash under reasoning load—split to stages. Prediction: by 2026, 70% of enterprise AI agents run modular pipelines like this, or die trying. Single-call hype? Dead.
Corporate spin calls it ‘agentic AI.’ Nah—this is constraint-driven arbitration, forcing models to eat their own logic. Arbiter doesn’t guess. It litigates.
Tested it on ‘in-house vs. SaaS.’ Original: vague nod to buy. New: Advocate A crushes on HC1 budget but flops SC2 disruption; B flips it. Arbitrator: Buy, 92% confidence, traceable to DC1 metrics from fresh searches. Gaps flagged: vendor uptime data missing.
Energy surges here—AI’s platform shift hits decisions. No more ‘trust the black box.’ Trace every swing.
But wait—web search grounding? Tavily shines, but hallucinations lurk if queries miss. Advocates bias toward flash? Prompts hammer ‘strongest case, no cherry-picking.’ Still, iterate.
Why Does This Matter for Your Next Big Call?
Businesses drown in options. Humans are blind to their own biases. LLMs hallucinate. Arbiter? Your neutral referee.
Scale it: plug into Slack, auto-brief execs. Devs? Embed it in CI/CD: 'deploy now or stage?' Wonder: what if every email chain ended with an Arbitrator brief?
It’s alive on GitHub—fork, tweak constraints, own it. AI democratized, finally rigorous.
Will This Replace Human Deciders?
Short answer: augments, doesn’t oust. Humans set constraints; it enforces them blindly. Magic when you feed it messy reality.
How Do I Build My Own Constraint-Driven Arbitrator?
Grab LangChain or Haystack, chain LLM calls with JSON schemas. Start with extraction—it’s the unlock. Tavily for search. Parallel advocates via async. Boom—your Arbiter.
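The chaining skeleton can be this small. The LLM call is stubbed (swap in LangChain, Haystack, or your provider's SDK); the stage names echo the article's pipeline, and the JSON-in/JSON-out handoff is the part that matters:

```python
import json

def llm(prompt: str) -> str:
    """Stand-in for any chat-completion call that returns JSON text."""
    return json.dumps({"stage_output": prompt[:40]})

def stage(name: str, payload: dict) -> dict:
    # Each stage: its own prompt, its own call, JSON handoff.
    prompt = f"[{name}] {json.dumps(payload)}"
    return json.loads(llm(prompt))

def pipeline(decision: str) -> dict:
    out = {"decision": decision}
    for name in ("extract_constraints", "research",
                 "advocate", "arbitrate"):
        out = stage(name, out)
    return out
```

In a real build, each `stage` would validate its output against a JSON schema before handing off—fail fast beats a poisoned downstream brief.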
Can Arbiter Handle Super-Complex Decisions?
Yes, scales with stages. Add research depth, more advocates. Confidence dips on unknowns—forcing human input. Smart humility.
This pipeline? AI’s nervous system upgrade. Feel the pulse.