Relvy AI Automated On-Call Runbooks

Imagine your pager silent at 3 AM because an AI nailed the fix. Relvy AI does just that, swapping wild LLM guesses for rock-solid runbook machines.

Relvy AI: The Runbook Revolution Ending On-Call Hell for Devs — theAIcatchup

Key Takeaways

  • Relvy AI uses runbook DAGs and pre-analysis tools to sidestep LLM pitfalls like hallucinations and context overflow.
  • Local VPC deployment ensures security; notebook logs provide full transparency.
  • This shifts on-call from reactive chaos to deterministic execution — a platform leap for DevOps.

Alert blasts through Slack. P99 latency on your critical API? Through the roof. No human stirring yet — but Relvy AI’s already traversing its runbook DAG, correlating that spike to a fresh deploy from 20 minutes ago.

That’s the pitch, anyway. Relvy AI promises automated on-call runbooks for engineering teams, ditching the hallucinatory mess of generic LLMs for something brutally deterministic. And here’s the thing: in a world where on-call burnout costs teams millions (PagerDuty’s own surveys peg it at $3.6 million per large org annually), this could be the pivot we’ve needed.

But.

Relvy doesn’t reinvent the wheel — they shatter it. Current LLMs like Claude 3.5 Sonnet or GPT-4o flop hard on root cause analysis, scraping by with under 40% accuracy on benchmarks. Why? Context overflow from terabytes of telemetry drowns them. No enterprise smarts to flag ‘normal’ cron-induced spikes. And that exploration drag? It torches your time-to-mitigation window.

Why Relvy AI Crushes the OpenRCA Problem

They anchor everything in a Runbook State Machine. Forget open-ended LLM chit-chat. Alerts trigger a DAG of diagnostic nodes — each a targeted tool call, not a prose poem.
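The idea is easy to sketch. A minimal, hypothetical runbook DAG might look like this (node names and the traversal are illustrative, not Relvy's actual API):

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Node:
    name: str
    tool: Callable[[dict], dict]  # a targeted tool call returning structured facts
    children: list = field(default_factory=list)

def run_dag(root: Node, ctx: dict) -> dict:
    """Depth-first traversal: each diagnostic node enriches a shared context."""
    ctx.update(root.tool(ctx))
    for child in root.children:
        run_dag(child, ctx)
    return ctx

# Two toy diagnostic nodes wired into a tiny DAG
check_latency = Node("check_latency", lambda c: {"p99_ms": 1200})
correlate_deploy = Node("correlate_deploy", lambda c: {"recent_deploy": c["p99_ms"] > 1000})
check_latency.children.append(correlate_deploy)

result = run_dag(check_latency, {"alert": "p99_latency"})
# The context accumulates structured facts, not free-form prose.
```

Each node returns a dict, so downstream nodes reason over hard data instead of the previous node's prose.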

Take their TelemetryTool. It spits Z-score anomalies or STL decomps, hands the agent clean JSON: {"anomalies_detected": 3, "period": "past 30m"}. No raw logs bloating the context window. Boom — token count plummets, hallucinations evaporate.
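A TelemetryTool-style summarizer is simple to approximate. This sketch (my own, stdlib-only; the function name and output shape mirror the example above but are assumptions) flags Z-score outliers and returns only the compact JSON summary:

```python
import json
import statistics

def detect_anomalies(series, threshold=3.0, period="past 30m"):
    """Z-score outlier count in, compact JSON summary out -- never raw logs."""
    mean = statistics.mean(series)
    stdev = statistics.stdev(series)
    anomalies = [x for x in series if abs(x - mean) / stdev > threshold]
    return json.dumps({"anomalies_detected": len(anomalies), "period": period})

# 20 normal latency samples plus one obvious spike
latencies = [112, 118, 115, 120, 114] * 4 + [980]
print(detect_anomalies(latencies))  # {"anomalies_detected": 1, "period": "past 30m"}
```

A few dozen tokens of structured truth replace megabytes of raw telemetry in the agent's context window.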

Then correlate_with_deployment? Grabs the last five commits. Structured truth, not vibes.
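In spirit, that tool is just a filter over a deploy log. A hypothetical version (the function signature and record shape are my assumptions):

```python
from datetime import datetime, timedelta

def correlate_with_deployments(alert_time, deploys, window_minutes=30, limit=5):
    """Return the most recent deploys inside the alert window, as structured facts."""
    recent = [d for d in deploys
              if timedelta(0) <= alert_time - d["at"] <= timedelta(minutes=window_minutes)]
    recent.sort(key=lambda d: d["at"], reverse=True)
    return {"recent_deploys": [d["sha"] for d in recent[:limit]]}

alert = datetime(2025, 6, 1, 3, 0)
deploys = [
    {"sha": "a1b2c3", "at": datetime(2025, 6, 1, 2, 40)},   # 20 min before the alert
    {"sha": "d4e5f6", "at": datetime(2025, 5, 31, 22, 0)},  # hours earlier, irrelevant
]
print(correlate_with_deployments(alert, deploys))  # {'recent_deploys': ['a1b2c3']}
```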

“By using these targeted tools, we reduce the token load significantly. The agent receives a structured JSON object describing the anomaly, which acts as a ‘ground truth’ anchor, preventing the hallucination of non-existent error patterns.”

Relvy’s own words — and they’re spot on. This isn’t AI hype; it’s engineering hygiene.

Local-first too. Docker/Helm deploys in your VPC. No telemetry exfil to the cloud. Datadog, Prometheus, Honeycomb? All fair game, zero latency.

Three threads hum in parallel: observation polls for anomalies, RAG-boosted reasoning matches signatures to runbooks, action layer fires CLI mitigations — rollbacks, restarts, traffic shifts.
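That three-layer loop can be caricatured with queues between threads. This toy pipeline (runbook names, sentinels, and the single-shot observer are all illustrative):

```python
import queue
import threading

anomalies, plans, executed = queue.Queue(), queue.Queue(), []

RUNBOOKS = {"p99_latency_spike": "rollback_last_deploy"}

def observe():
    anomalies.put("p99_latency_spike")  # in reality: poll Datadog/Prometheus
    anomalies.put(None)                 # sentinel: shut the pipeline down

def reason():
    while (sig := anomalies.get()) is not None:
        plans.put(RUNBOOKS.get(sig, "escalate_to_human"))  # in reality: RAG signature match
    plans.put(None)

def act():
    while (plan := plans.get()) is not None:
        executed.append(plan)           # in reality: CLI rollback/restart/traffic shift

threads = [threading.Thread(target=f) for f in (observe, reason, act)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(executed)  # ['rollback_last_deploy']
```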

Does Relvy Beat PagerDuty’s Human Handover?

PagerDuty? Solid for routing pings, but mitigation's still you at 3 AM, bleary-eyed in a new tab. Relvy hands off only on low-confidence ambiguity, surfacing a notebook with cells logging every step: input data, agent thoughts, visualizations.

Example cell:

{
  "step": "Check Endpoint Latency",
  "status": "completed",
  "data": {
    "avg_latency": "450ms",
    "p99_latency": "1200ms",
    "anomaly_confidence": true
  },
  "agent_thought": "P99 deviated 3.2 std devs from 7-day avg"
}

Transparency kills the black-box fear. Engineers trust it because they can replay the tape.

Market dynamics scream opportunity. On-call tools hit $2B+ TAM, growing 15% YoY per Gartner. But pure AI plays like Microsoft’s Copilot for DevOps? Still generative fluff, prone to the same RCA pitfalls Relvy sidesteps.

My take: Relvy’s betting on determinism over dazzle, and that’s smart. Remember Jenkins in 2010? CI/CD was manual hell till scripted pipelines locked it down. Relvy does that for incidents — unique insight here — turning SRE from firefighting to orchestration.

Skeptical? Fair. Runbooks need constant tuning, or they ossify. Relvy's DAGs must auto-evolve via RAG, or teams bail. But early adopters (they hint at Fortune 500 pilots) report 70% TTM cuts. If it scales, PagerDuty stock dips 10-15% in 18 months — bold call, but data backs it.

The Hidden Gotcha in Relvy’s Stack

Enterprise context. Generic LLMs are blind to your quirks — Endpoint_A's cron spike? Normal. Endpoint_B's? Catastrophe. Relvy layers RAG over your docs, runbooks, past incidents. But onboarding? Non-trivial. Map your observability first, or it's DOA.

Latency overhead? Nil — it's VPC-bound. And that notebook? Audit gold, a compliance dream for SOC 2.

And the PR spin? None here. Relvy calls out LLM limits upfront. Refreshing in AI-land.

Zoom out: On-call’s a $4B drag yearly (Atlassian stats). Relvy targets 20-30% automation capture. Makes sense — if your stack’s mature.

Bootstrappers? Skip. Raw chaos needs humans.

Prediction: By Q4 2025, 15% of SRE teams run Relvy-like agents. Determinism wins.

Why Does This Matter for On-Call Teams?

Burnout's real. 68% of engineers dread pager duty (PagerDuty State of Incident Report). Relvy offloads the rote — anomaly hunting, deploy correlation, low-risk fixes.

Frees you for RCA deep dives, architecture fixes. Or sleep.

Numbers: Confidence >80%? Auto-mitigate. Else, human-in-loop with primed notebook. Hybrid heaven.
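That gate is one conditional. A minimal sketch, assuming a confidence field and the 80% threshold from above (the dict shapes and names are hypothetical):

```python
def route(step_result, threshold=0.8):
    """Auto-mitigate above the confidence threshold; otherwise hand off to a human."""
    if step_result["confidence"] > threshold:
        return {"action": "auto_mitigate", "runbook": step_result["runbook"]}
    # Human-in-the-loop: escalate with the primed notebook attached
    return {"action": "escalate", "notebook": step_result}

print(route({"confidence": 0.92, "runbook": "rollback_last_deploy"}))
print(route({"confidence": 0.55, "runbook": "rollback_last_deploy"}))
```

The point is that the split is deterministic and auditable, not a vibe inside the model.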

Competition? Incident.io, FireHydrant — response-focused, light on autonomy. Blameless? Post-mortems. Relvy's the executor.

One punchy caveat: High-cardinality data still bites. Their tools summarize, but edge cases (microservices soup) demand custom nodes.

Worth it? For scale-ups with observability hygiene, yes. Hype-free upgrade.


Frequently Asked Questions

What is Relvy AI? Relvy AI automates on-call responses using runbook DAGs and tool interfaces, integrating with Datadog/Prometheus for deterministic incident mitigation.

Does Relvy AI replace on-call engineers? No — it handles routine diagnostics and fixes, escalating ambiguities with transparent notebooks for human review.

How much does Relvy AI cost? Pricing isn’t public yet; expect per-incident or seat-based, starting around $50/engineer/month based on similar tools.

James Kowalski
Written by

Investigative tech reporter focused on AI ethics, regulation, and societal impact.



Originally reported by Dev.to
