Twenty-plus AI agents, nine servers, zero message loss.
That’s the real-world constraint that forced a reckoning with conventional wisdom. Because here’s the thing about distributed AI agent communication: the playbook everyone reaches for—RabbitMQ, Redis Pub/Sub, NATS—assumes a world that doesn’t exist in this scenario.
Most architectural decisions are born from pain. This one is no different.
The Scaling Problem Nobody Talks About
A single AI assistant? Easy. Give it a prompt, wait for output, move on. But the moment you deploy 10 agents across multiple machines, each handling different domains, the communication problem surfaces like a fault line waiting to crack.
Think of it like a restaurant kitchen scaling from one chef to 20 specialists. One person can remember orders. But 20 people? They need a system—a way to pass tickets, confirm status, request ingredients from across the kitchen, and broadcast when a dish is ready.
Our system had to handle this:
- Manager agents delegating tasks to specialist agents (the head chef telling sous chefs what to cook)
- State synchronization when work completes (confirming a dish is plated)
- Information queries between agents (“Hey, do we have basil in stock?”)
- Broadcast announcements (“System maintenance in 10 minutes”)
“RabbitMQ, Redis Pub/Sub, or NATS would be overkill. AI agent communication characteristics: low message frequency (tens to hundreds per day), small message bodies (text instructions), inbox semantics required.”
This insight cuts right to why enterprise solutions fail here. They’re optimized for high-frequency, high-throughput scenarios. Stock exchanges. Payment systems. Real-time dashboards. But AI agents? They check in every few minutes via heartbeat, fire off a cron task at scheduled times, spend most of their existence dormant.
Push-based architectures assume always-on consumers. AI agents are the opposite.
The Elegance of HTTP + SQLite
What emerged wasn’t clever. It was necessary.
A Node.js HTTP service. SQLite database. Two endpoints: /api/send and /api/inbox/:id. That’s it.
Agents pull their messages instead of waiting for pushes. They send via HTTP POST. Everything persists in a single SQLite file. No connections to maintain. No broker processes to monitor. No cluster coordination.
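The whole shape of the service fits in a few lines of Node-style JavaScript. This is an illustrative sketch, not the project's actual code: a Map stands in for the SQLite table, and names like `handleSend` and `handleInbox` are assumptions.

```javascript
// Minimal sketch of the bus's two endpoints. A Map stands in for the
// SQLite table; the real service persists rows to a single SQLite file.

function createBus() {
  return { inboxes: new Map(), nextId: 1 };
}

// POST /api/send — append a message to the recipient's inbox.
function handleSend(bus, { from, to, subject, body }) {
  const msg = {
    id: bus.nextId++,
    from, to, subject, body,
    timestamp: Date.now(),
    read: false,
  };
  if (!bus.inboxes.has(to)) bus.inboxes.set(to, []);
  bus.inboxes.get(to).push(msg);
  return { ok: true, id: msg.id };
}

// GET /api/inbox/:id — return all pending (unread) messages.
function handleInbox(bus, agentId) {
  return (bus.inboxes.get(agentId) || []).filter((m) => !m.read);
}
```

An HTTP layer (Node's built-in `http` module, or Express) would route the two paths to these handlers; everything else is storage.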
The API itself reads like email protocol from 1995—because email solved this problem 30 years ago, and we pretended we’d invented something better.
Here’s what makes it work:
Pull semantics match AI agent behavior. An agent wakes up, queries /api/inbox/agent-name, receives all pending messages, processes them, pulls again later. No heartbeat timeouts. No “consumer offline” states. No disconnect handling. Pure idempotence—the most underrated property in distributed systems.
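That wake-pull-process cycle can be sketched as below. A hypothetical in-memory bus client stands in for the HTTP calls; `pull` and `ack` are assumed names, not the service's real API.

```javascript
// Fake bus client: pull() returns pending messages, ack(ids) marks them read.
function makeFakeBus(initial) {
  const msgs = initial.map((m, i) => ({ id: i + 1, read: false, ...m }));
  return {
    pull: () => msgs.filter((m) => !m.read),
    ack: (ids) => msgs.forEach((m) => { if (ids.includes(m.id)) m.read = true; }),
  };
}

// One wake-up of an agent: pull, process, ack. No connection state,
// no offline handling — if this crashes before ack, the messages
// simply come back on the next pull.
function wakeAndProcess(bus, handle) {
  const pending = bus.pull();
  for (const msg of pending) handle(msg);
  bus.ack(pending.map((m) => m.id));
  return pending.length;
}
```

Running the cycle again with nothing pending is a no-op, which is exactly the idempotence the paragraph above is pointing at.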
Inbox storage prevents loss. Messages sit in SQLite until an agent marks them read. If the agent crashes mid-task? The message is still there. If the bus service restarts? All messages persist. This single design choice eliminated an entire category of failure modes that plague push-based systems.
The format is so simple it feels dumb. From/To/Subject/Body/Timestamp. The classic email envelope. No priority queues. No tags. No threading. Agents don’t need Gmail features—they need reliable delivery and order.
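In code, the entire format is one small constructor. Field names here are inferred from the description above, not taken from the real service:

```javascript
// The whole message schema: From/To/Subject/Body/Timestamp. Nothing else.
function makeMessage(from, to, subject, body) {
  for (const [name, v] of [['from', from], ['to', to], ['body', body]]) {
    if (typeof v !== 'string' || v.length === 0) {
      throw new Error(`missing required field: ${name}`);
    }
  }
  return {
    from,
    to,
    subject: subject || '(no subject)',
    body,
    timestamp: new Date().toISOString(),
  };
}
```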
Broadcast uses a clever escape hatch: to: ALL copies the message to every agent’s inbox. Rate-limited to 10 messages per minute per sender (because we learned about broadcast storms the hard way).
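A sketch of how that fanout and rate limit might fit together. `makeRateLimiter` and `deliver` are illustrative names; the real service presumably does this inside its SQLite write path.

```javascript
// Sliding-window rate limiter: at most `limit` sends per sender per window.
function makeRateLimiter(limit, windowMs, now = Date.now) {
  const history = new Map(); // sender -> timestamps of recent sends
  return function allow(sender) {
    const cutoff = now() - windowMs;
    const recent = (history.get(sender) || []).filter((t) => t > cutoff);
    if (recent.length >= limit) return false;
    recent.push(now());
    history.set(sender, recent);
    return true;
  };
}

// to: ALL fans out by copying the message into every known inbox.
function deliver(inboxes, agents, msg, allow) {
  if (!allow(msg.from)) return { ok: false, error: 'rate limited' };
  const targets = msg.to === 'ALL' ? agents : [msg.to];
  for (const agent of targets) {
    if (!inboxes.has(agent)) inboxes.set(agent, []);
    inboxes.get(agent).push({ ...msg, to: agent });
  }
  return { ok: true, delivered: targets.length };
}
```

Note the limit counts sends, not delivered copies: one broadcast to 20 agents is one send, so a looping agent gets cut off after 10 iterations rather than 10 inbox writes.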
What Actually Broke in Production
Two months of stable operation. ~200 messages daily. Zero message loss. But getting there required colliding with reality.
Message loss taught us the first lesson. The initial design skipped persistence. Messages flew through RAM, processed, disappeared. Then a bus server restart wiped an agent’s instruction mid-task. One failure, one lesson: message loss is worse than latency by orders of magnitude. SQLite fixed it.
The broadcast storm was educational. An agent logic error created a loop: send message, receive copy, process trigger, send new message, repeat. Within seconds, thousands of copies flooded every agent’s inbox. The system stayed up but became noise. Rate limiting—crude but effective—became a first-class feature.
Single point of failure is real but manageable. The bus server going down means no new communication until it’s back online. But agents degrade gracefully. They run local-only, complete what they can, sync state when the bus revives. Not elegant, but functional.
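That degrade-and-resync behavior can be sketched as a local outbox that buffers failed sends. `makeOutbox` and `sendFn` are hypothetical names, not the project's API:

```javascript
// Graceful degradation: when the bus is down, outgoing messages queue
// locally; flush() retries them once it comes back.
function makeOutbox(sendFn) {
  const queue = [];
  return {
    send(msg) {
      try {
        sendFn(msg);
        return true;
      } catch {
        queue.push(msg); // bus unreachable — keep working locally
        return false;
      }
    },
    flush() {
      let sent = 0;
      while (queue.length > 0) {
        try {
          sendFn(queue[0]);
          queue.shift();
          sent++;
        } catch {
          break; // still down; try again on the next flush
        }
      }
      return sent;
    },
    get pending() { return queue.length; },
  };
}
```

The agent keeps completing local work either way; the queue just makes "sync state when the bus revives" a one-line call.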
Why This Wins Over Enterprise Solutions
Complexity is the enemy of reliability. It sounds philosophical until you’re debugging a Kubernetes cluster at 3 AM because a RabbitMQ rebalance went sideways.
Enterprise message queues optimize for problems AI agents don’t have. They assume:
- Constant consumer connections
- High message throughput (thousands per second)
- Complex routing logic
- Cluster deployment
- Operational overhead
This system assumes the opposite. Because the opposite is true.
What matters here:
Pull instead of push. Matches intermittent online behavior. Agents control timing. No resource drain waiting for messages that may never come.
Inbox semantics, not topic subscriptions. Each agent has a private inbox. No fanout confusion. No accidental message duplication. Broadcast is explicit, not implicit.
Minimal operational surface. SQLite is a file. The bus is a Node.js process. No cluster coordination, no quorum decisions, no consensus algorithms. Operational simplicity directly translates to fewer failure modes.
Network-layer security. The system runs inside a trusted internal network. No public endpoints. No TLS negotiation overhead. Just HTTP calls between known peers with a shared token.
We’re not writing papers about this. We’re running 20+ agents on nine nodes, hitting 99.99% uptime, and sleeping at night.
The Broader Pattern
There’s a meta-lesson buried here that extends beyond message buses: fitness for purpose destroys generic solutions.
The moment you accept that your system’s constraints are unique—intermittent agents, small payloads, low frequency, offline tolerance—you stop shopping for off-the-shelf platforms and start building. You end up with something boring, obvious, and devastatingly effective.
It’s the opposite of the current AI moment’s energy. Everyone’s trying to build the next architectural revolution. Sometimes the revolution is realizing that a 1990s email pattern with a SQLite file is exactly what you need.
Twenty-plus agents. Two months stable. Zero message loss. Sometimes the simplest solution really is enough.
Frequently Asked Questions
What happens if the message bus server goes down? Agents degrade to local-only mode—they can’t communicate with each other, but they continue processing their own tasks. When the bus comes back online, they sync state and resume normal operation. It’s not ideal, but it’s better than a cascading failure.
Can this scale to 100 or 1000 agents? Maybe, but you’d hit SQLite’s write concurrency limit: it allows only one writer at a time. The current design assumes low message volume (hundreds per day, not thousands per second). If you triple the agent count and message frequency, you’d probably need a real database. But at that scale, you’d also need a different architecture—likely moving to a proper message broker.
Why not just use email or a shared database? Email adds external dependency complexity. A shared database (PostgreSQL, etc.) introduces more operational overhead than SQLite with less benefit at this scale. HTTP + SQLite is the minimum viable system that covers all the constraints without incurring unnecessary complexity.