Testing AI in Rails: RSpec for Non-Deterministic Outputs

AI features in Rails sound cool until tests flake out. Here's how RSpec tames the chaos with smart, no-nonsense strategies.


Key Takeaways

  • Ditch exact string matches; test output shapes and patterns instead.
  • VCR + WebMock: 70% stubs, 25% cassettes, 5% live for bulletproof suites.
  • Error handling specs prevent most prod AI failures—stub those 429s now.

Flaky AI tests suck.

And they’ve been sucking since devs first glued LLMs into apps. You’ve got your chatbots, your RAG pipelines humming along in dev—then tests fail randomly because OpenAI decided to rephrase its answer. Original tutorials? They skip this mess. But skip it, and your Rails app crumbles in production. Time to test smarter, not harder.

Why Test Non-Deterministic AI Outputs in Rails?

Look, Rails devs love their green test suites. Green means shippable. But AI spits out different crap every call. “What is Ruby’s main strength?” One run: poetry. Next: haiku. Traditional asserts? Dead on arrival.

Here’s the quote that nails it:

AI outputs are non-deterministic. Ask the same question twice, get two different answers. Traditional “assert equals” testing breaks down.

Spot on. Ignore this, and you’re building on sand. I’ve seen teams burn cash on live API tests—hitting rate limits, racking bills. Worse: prod bugs when the model tweaks itself overnight.

My hot take? This is the Y2K for AI devs. Back in ‘99, everyone ignored date overflows. Boom—systems tanked. Today, non-determinism is your overflow. Fix it now, or watch your app glitch when GPT-5 drops weird replies.

Short fix? Add rspec-rails, webmock, and vcr to your Gemfile's test group, then run rails generate rspec:install. Boom: armed.
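The wiring behind that one-liner is small. A minimal sketch, assuming the standard gem names and the usual spec/support layout:

```ruby
# Gemfile, test group -- the three gems this article leans on
group :test do
  gem 'rspec-rails'
  gem 'webmock'
  gem 'vcr'
end

# spec/support/vcr.rb -- minimal VCR wiring; filter your key
# before it ever lands in a committed cassette
VCR.configure do |c|
  c.cassette_library_dir = 'spec/cassettes'
  c.hook_into :webmock
  c.filter_sensitive_data('<OPENAI_API_KEY>') { ENV['OPENAI_API_KEY'] }
end
```

Make sure spec/support files are loaded from rails_helper, and you're set.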

VCR: Record Once, Replay Forever?

VCR’s the star here. Records real HTTP calls to cassettes. First run: pays OpenAI. After? Free replays. Deterministic bliss.

Take this spec:

VCR.use_cassette('chat_about_ruby') do
  service = ChatService.new
  response = service.ask("What is Ruby's main strength?")
  expect(response).to be_a(String)
  expect(response.length).to be > 20
  expect(response.downcase).to match(/ruby|programming|language/)
end

Punchy. Tests shape, not exact words. Length check? Regex for keywords? Smart. No more “expected ‘Ruby rocks’ got ‘Ruby is awesome’” fails.

But here’s the snark—VCR ain’t perfect. Change your prompt? Old cassette’s stale. Regenerate often, or you’re testing yesterday’s model. And sensitive keys? Filtered, sure. But leak one, and OpenAI revokes it. (A buddy of mine found out the hard way.)
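VCR's record modes take the edge off the staleness problem. Both options below are real VCR settings; re_record_interval takes seconds:

```ruby
# record: :once only hits the network when the cassette is missing;
# re_record_interval expires the cassette so you refresh periodically
# instead of replaying a stale model forever.
VCR.use_cassette('chat_about_ruby',
                 record: :once,
                 re_record_interval: 7 * 24 * 60 * 60) do
  # ...same shape expectations as above...
end
```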

Stubs: Ditch the API for Unit Speed

Unit tests? Stub everything. No network. Fake embeddings, mock 429s. Fast as hell.

let(:fake_embedding) { Array.new(1536) { rand(-1.0..1.0) } }
stub_request(:post, 'https://api.openai.com/v1/embeddings')
  .to_return(
    status: 200,
    body: { data: [{ embedding: fake_embedding }] }.to_json, # illustrative response shape
    headers: { 'Content-Type' => 'application/json' }
  )

Expect vector length. All floats. Done. Seventy percent of your suite—pure logic, error paths, transforms. No AI roulette.
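Those two expectations, run against the stubbed payload as plain Ruby (the body shape mirrors OpenAI's embeddings response; the fake vector is the one from above):

```ruby
require 'json'

# The fake vector you'd serve from the WebMock stub.
fake_embedding = Array.new(1536) { rand(-1.0..1.0) }

# The stubbed body, shaped like an embeddings API response.
stub_body = { data: [{ embedding: fake_embedding }] }.to_json

# What the spec actually checks: right length, all floats.
vector = JSON.parse(stub_body).dig('data', 0, 'embedding')
```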

Error handling? Crucial. Most AI apps die here.

Stub rate limits:

stub_request(:post, 'https://api.openai.com/v1/chat/completions')
  .to_return(status: 429, body: { error: { message: 'Rate limit exceeded' } }.to_json)
expect { service.ask("anything") }.not_to raise_error

Timeouts. Malformed JSON. Test ‘em. Prod won’t forgive “API barfed, app down.”
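For that spec to pass, the service side needs something to catch the garbage. A minimal rescue sketch—the method name and fallback are illustrative, not from the article:

```ruby
require 'json'

# Degrade gracefully when the API returns a body JSON.parse chokes on,
# instead of letting the exception take the request down.
def safe_parse(body)
  JSON.parse(body)
rescue JSON::ParserError
  { 'error' => 'malformed response' }
end
```

Pair it with a rescue for Net::OpenTimeout around the HTTP call itself, and WebMock's to_timeout stub has something to exercise.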

Schema Checks: Tame Structured Outputs

AI JSON? Parse and validate.

parsed = JSON.parse(result)
expect(parsed).to have_key('summary')
expect(parsed['key_points']).to be_an(Array)

Keys present. Arrays right type. Patterns hold. Wording varies? Who cares.

Prompt tweaks? Spec the prompt itself.

it 'produces output under 200 words' do
  VCR.use_cassette('summarizer_long_doc') do
    word_count = result.split.size
    expect(word_count).to be <= 200
  end
end

Entities preserved? Regex again. ‘Ruby’ and ‘Matsumoto’ stick around.
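The entity check as plain Ruby—the summary string here is a stand-in for whatever your cassette replays; the pattern is the point:

```ruby
# Case-insensitive regexes beat exact matches: wording shifts, entities shouldn't.
summary = "Matz (Yukihiro Matsumoto) created Ruby to make programmers happy."

required = [/ruby/i, /matsumoto/i]
entities_preserved = required.all? { |pattern| summary.match?(pattern) }
```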

How Production-Ready Are These RSpec Tricks?

Split your suite: 70% stubs (lightning fast), 25% VCR (real-ish), 5% live (tagged, rare). Run rspec spec/services/ for AI bits. Doc format? --format documentation.
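Keeping that 5% of live specs out of the default run is one config line. The :live tag is a convention, not an RSpec built-in, and the env var name is my own:

```ruby
# spec/spec_helper.rb -- exclude live-API specs unless explicitly requested.
RSpec.configure do |config|
  config.filter_run_excluding :live unless ENV['RUN_LIVE_SPECS']
end

# Then tag the rare live specs:
# it 'answers against the real API', :live do ... end
```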

Skeptical? Me too—at first. But this beats flaky chaos. Prediction: teams ignoring it? Six months to brittle messes. OpenAI tweaks models quarterly—your untested prompts break. Meanwhile, VCR crews ship confidently.

Rails AI ain’t toy. Chat interfaces, voice-to-text—real user cash on line. Test structure, errors, shapes. Exact strings? For suckers.

One caveat. VCR cassettes bloat. Compress ‘em. And for mega-scale? Consider factories for fakes. But start here.



Frequently Asked Questions

How do you test AI features in Rails with RSpec?

Use VCR for recordings, WebMock for stubs. Test shapes, lengths, schemas—not exact text.

What’s the best way to handle non-deterministic AI outputs?

Stub units, cassette integrations. Add error stubs for 429s, timeouts.

Does VCR work for OpenAI in Rails tests?

Yes—records once, replays free. Filter keys, set record: :once.

Written by Sarah Chen

AI research editor covering LLMs, benchmarks, and the race between frontier labs. Previously at MIT CSAIL.



Originally reported by dev.to
