Distributed Locks Are a Code Smell

Three identical support tickets hit in 60 seconds: one customer charged thrice for a single order. The culprit? A sneaky Redis lock expiry during a JVM GC pause.


Key Takeaways

  • Distributed locks fail spectacularly under GC pauses and network delays, leading to real overcharges.
  • Redlock debate: Practical for some, disastrous for safety-critical work without fencing.
  • A shift to idempotency keys and sagas is set to displace lock reliance within five years.

Support tickets flooded in — three furious ones, all within a minute, screaming about triple-charged payments.

A single customer’s order. Bulletproof pipeline, they said. Hours of debugging later, the truth emerged in four brutal minutes.

Service A grabs a Redis lock, 10-second TTL, starts processing the payment. Boom — JVM stop-the-world GC. Twelve seconds frozen solid. No crash, no logs. Just… paused.

Lock expires. Service B snags it, charges again. Service A thaws, clueless, charges a third time during a spike. Distributed locks — that warm blanket over your microservices — just yanked open a trapdoor.
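Here's that doomed pattern as a minimal sketch, assuming the Jedis client (chargeCustomer and the key name are hypothetical; the 10-second TTL matches the incident):

```java
import java.util.UUID;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class NaiveRedisLock {
    public static void main(String[] args) {
        try (Jedis redis = new Jedis("localhost", 6379)) {
            String token = UUID.randomUUID().toString();
            // SET lock:order:42 <token> NX PX 10000 -- a 10-second TTL, as in the incident.
            String ok = redis.set("lock:order:42", token, SetParams.setParams().nx().px(10_000));
            if (!"OK".equals(ok)) return; // someone else holds the lock

            // DANGER: if a stop-the-world GC pause here outlasts 10 seconds,
            // the TTL expires, Service B acquires the "same" lock, and both proceed.
            chargeCustomer("order:42"); // hypothetical payment call

            redis.del("lock:order:42"); // worse: if we were paused, this deletes B's lock
        }
    }
    static void chargeCustomer(String orderId) { /* hypothetical */ }
}
```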

What a Garbage Collection Pause Really Does to Locks

Look, local locks? Ironclad. Type synchronized in Java, call Lock() on a sync.Mutex in Go — OS and CPU enforce it. Atomic compare-and-swap on shared memory. Physics doesn’t lie.
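The "physics" here is a hardware compare-and-swap. A tiny Java demo of why local exclusion actually holds:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class LocalLockDemo {
    private static final AtomicBoolean busy = new AtomicBoolean(false);
    private static int counter = 0;

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                while (!busy.compareAndSet(false, true)) { } // spin: hardware CAS
                counter++;                                    // critical section
                busy.set(false);                              // release (volatile write)
            }
        };
        Thread a = new Thread(work), b = new Thread(work);
        a.start(); b.start(); a.join(); b.join();
        System.out.println(counter); // always 200000: the CPU serializes the CAS
    }
}
```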

Distributed? Smoke and mirrors. No shared memory. Clocks drift (NTP jumps ‘em around). VMs stall silently. GitHub once saw 90-second packet delays.

A distributed lock gives you absolutely none of this.

It’s an opinion, the lock service’s best guess: “You probably hold it right now.” That “probably”? Heavy lifting.

Here’s the thing — accept it’s probabilistic, and you pivot: What if both think they own it?

Why Redlock’s Fame Hides Fatal Flaws

Martin Kleppmann (author of Designing Data-Intensive Applications) eviscerates Redlock. Antirez (Redis creator) fires back. Epic distributed-systems cage match.

Redlock: acquire the same lock on 5 independent Redis nodes, hold it only with a majority (3+), trust TTL-based expiry for safety. One node fails? The lock survives.
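A sketch of the acquisition step, assuming Jedis and skipping the published algorithm's cleanup-on-failure and clock-drift margin:

```java
import java.util.List;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class RedlockSketch {
    // Try to acquire the same lock on 5 independent Redis nodes.
    static boolean acquire(List<Jedis> nodes, String key, String token, long ttlMs) {
        long start = System.nanoTime();
        int acquired = 0;
        for (Jedis node : nodes) {
            try {
                if ("OK".equals(node.set(key, token, SetParams.setParams().nx().px(ttlMs)))) {
                    acquired++;
                }
            } catch (Exception e) { /* node down: skip, a majority is still possible */ }
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        // Valid only with a majority AND time left on the TTL after acquisition.
        return acquired >= 3 && elapsedMs < ttlMs;
    }
}
```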

Kleppmann: No fencing tokens — no monotonic IDs to kill stale writes. And timing assumptions? Busted by GC pauses, clock jumps, network hiccups.
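Fencing in miniature: the downstream store, not the lock service, rejects stale writers. An in-memory sketch (in real systems the monotonic token comes from the lock service, e.g. ZooKeeper's zxid, and the check lives in storage):

```java
import java.util.concurrent.atomic.AtomicLong;

public class FencedStore {
    private final AtomicLong highestSeen = new AtomicLong(-1);

    // Every lock grant carries a monotonically increasing token; the store
    // refuses writes whose token is older than one it has already seen.
    public boolean write(long fencingToken, String payload) {
        long prev;
        do {
            prev = highestSeen.get();
            if (fencingToken <= prev) return false; // stale holder: reject
        } while (!highestSeen.compareAndSet(prev, fencingToken));
        persist(payload); // hypothetical durable write
        return true;
    }

    private void persist(String payload) { /* hypothetical */ }
}
```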

Antirez counters: Elapsed-time checks dodge acquisition delays. Random tokens + check-and-set work. Use monotonic clocks.
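The check-and-set Antirez means is the documented single-instance Redis pattern: only the holder's random token can delete the lock, atomically via Lua. A sketch assuming Jedis:

```java
import java.util.List;
import redis.clients.jedis.Jedis;

public class SafeRelease {
    // Compare-and-delete runs atomically inside Redis, so we never
    // free a lock that has since passed to another holder.
    private static final String RELEASE_SCRIPT =
        "if redis.call('get', KEYS[1]) == ARGV[1] then " +
        "  return redis.call('del', KEYS[1]) " +
        "else return 0 end";

    static boolean release(Jedis redis, String key, String token) {
        Object result = redis.eval(RELEASE_SCRIPT, List.of(key), List.of(token));
        return Long.valueOf(1L).equals(result);
    }
}
```

Note what this buys and what it doesn't: it stops you from releasing someone else's lock, but a paused holder can still write stale data long before it ever reaches release.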

Both right — sorta. Redlock nails cron dedupes, cache stampedes. But data safety? Crumbles.

The Hidden Architecture Shift You’re Missing

And here’s my take: this reeks of the 1980s Therac-25 radiation overdoses. Race conditions there too — hardware interlocks were stripped out, software alone was trusted to sequence the beam, and timing bugs let it fire at full power. Distributed locks echo that: we crave single-machine guarantees in a network of lies.

Prediction? Sagas and idempotency keys eclipse locks by 2028. Why fight probability when you can make duplicates harmless? Companies like Stripe already idempotent-everything; the rest will follow as GCs grow wilder in mega-clusters.
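The idea in miniature, with an in-memory map standing in for a unique-keyed database table (Stripe's API does this via an Idempotency-Key header; doCharge is hypothetical):

```java
import java.util.concurrent.ConcurrentHashMap;

public class IdempotentCharges {
    // Maps idempotency key -> result of the first successful charge.
    private final ConcurrentHashMap<String, String> processed = new ConcurrentHashMap<>();

    // Duplicate calls with the same key return the original result instead
    // of charging again -- retries and dual lock holders become harmless.
    public String charge(String idempotencyKey, long amountCents) {
        return processed.computeIfAbsent(idempotencyKey,
                k -> doCharge(amountCents)); // runs at most once per key in this process
    }

    private String doCharge(long amountCents) {
        return "charge-" + System.nanoTime(); // hypothetical gateway call
    }
}
```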

But devs cling. Why?

Corporate hype sells Redis as “distributed mutex.” Nah — it’s a probabilistic club bouncer, not a vault door. That PR spin ignores the GC elephant.


Real fix? Ditch locks for leader election (ZooKeeper, etcd). Or the outbox pattern: write the event in the same database transaction as the state change, let a poller publish it. No timing bets.
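An outbox sketch over plain JDBC (the table names and connection URL are assumptions):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class OutboxWriter {
    // Business change and outbox event commit in ONE local transaction;
    // a separate poller publishes outbox rows. No distributed lock, no
    // timing assumptions -- at-least-once delivery plus idempotent consumers.
    public static void recordPayment(String orderId, long amountCents) throws Exception {
        try (Connection cx = DriverManager.getConnection("jdbc:postgresql://localhost/app")) {
            cx.setAutoCommit(false);
            try (PreparedStatement pay = cx.prepareStatement(
                     "INSERT INTO payments(order_id, amount_cents) VALUES (?, ?)");
                 PreparedStatement out = cx.prepareStatement(
                     "INSERT INTO outbox(topic, payload) VALUES ('payment.recorded', ?)")) {
                pay.setString(1, orderId);
                pay.setLong(2, amountCents);
                pay.executeUpdate();
                out.setString(1, "{\"orderId\":\"" + orderId + "\"}");
                out.executeUpdate();
                cx.commit(); // both rows or neither
            } catch (Exception e) {
                cx.rollback();
                throw e;
            }
        }
    }
}
```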

But wander with me — imagine scaling payments sans locks. Event sourcing. Every action an event, idempotent handlers. Kafka Streams does this dance daily.
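The handler side of that dance, in miniature: dedupe on event ID so redelivery is a no-op (production would keep the seen-set in the same transaction as the state change, not in memory):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class IdempotentHandler {
    // In production this would be a unique index updated in the same DB
    // transaction as the state change; in-memory here to keep the sketch small.
    private final Set<String> appliedEventIds = ConcurrentHashMap.newKeySet();

    // Kafka-style redelivery is expected; applying the same event twice does nothing.
    public void onEvent(String eventId, String payload) {
        if (!appliedEventIds.add(eventId)) return; // already applied: skip
        apply(payload); // hypothetical state transition
    }

    private void apply(String payload) { /* ... */ }
}
```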

Is Redlock Safe Enough for Production?

Depends. Duplicate jobs? Sure. Money? Hell no.

Test it: spin up a 5-node Redis cluster. Pause a lock holder for 15 seconds (SIGSTOP on the process is the easiest stand-in for a GC pause). Watch dual holders emerge. (I did — twice the charges.)
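No cluster handy? A Thread.sleep stands in for the pause well enough to show the dual holder, assuming Jedis against a single local Redis:

```java
import java.util.UUID;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class DualHolderRepro {
    public static void main(String[] args) throws Exception {
        String key = "lock:repro";
        try (Jedis a = new Jedis("localhost", 6379);
             Jedis b = new Jedis("localhost", 6379)) {
            String tokenA = UUID.randomUUID().toString();
            a.set(key, tokenA, SetParams.setParams().nx().px(10_000)); // A holds, 10s TTL

            Thread.sleep(15_000); // stand-in for the 15-second GC pause

            String tokenB = UUID.randomUUID().toString();
            String ok = b.set(key, tokenB, SetParams.setParams().nx().px(10_000));
            System.out.println("B acquired during A's pause: " + "OK".equals(ok)); // true
            // A still believes it holds the lock -> two writers, double charge.
        }
    }
}
```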

Kleppmann wins on safety; Antirez on pragmatism. Your call: safe enough, or sorry enough?

Why Does This Still Plague Microservices?

Services multiply — Kubernetes pods flap, traffic spikes trigger GC pauses. Locks feel simple. They’re not.

Shift underway: Serverless (Lambda) laughs at locks — stateless, retries baked in. Architectural pivot from mutex myths to eventual consistency.

Locks smell. Sniff ‘em out.

FAQ time.



Frequently Asked Questions

What causes distributed lock failures?

GC pauses, clock skew, network delays — real-world timing violations shred assumptions.

Are distributed locks ever safe?

For low-stakes like deduping jobs, yeah. Money or data integrity? Pick another tool.

What replaces distributed locks?

Leader election, idempotency, sagas, outbox — bet on patterns, not probabilities.

Written by James Kowalski

Investigative tech reporter focused on AI ethics, regulation, and societal impact.

🧬 Related Insights

- Read more: [Playwright Stealth's Silent Failures: 7 Patches to Dodge 2026 Bot Hunters](https://devtoolsfeed.com/article/playwright-stealths-silent-failures-7-patches-to-dodge-2026-bot-hunters/)
- Read more: [PURESLOP.md: The CLI Sabotaging Your AI Coder on Purpose](https://devtoolsfeed.com/article/pureslopmd-the-cli-sabotaging-your-ai-coder-on-purpose/)

Originally reported by dev.to