Flaky AI tests suck.
And they’ve been sucking since devs first glued LLMs into apps. You’ve got your chatbots, your RAG pipelines humming along in dev—then tests fail randomly because OpenAI decided to rephrase its answer. Original tutorials? They skip this mess. But skip it, and your Rails app crumbles in production. Time to test smarter, not harder.
Why Test Non-Deterministic AI Outputs in Rails?
Look, Rails devs love their green test suites. Green means shippable. But AI spits out different crap every call. “What is Ruby’s main strength?” One run: poetry. Next: haiku. Traditional asserts? Dead on arrival.
Here’s the quote that nails it:
AI outputs are non-deterministic. Ask the same question twice, get two different answers. Traditional “assert equals” testing breaks down.
Spot on. Ignore this, and you’re building on sand. I’ve seen teams burn cash on live API tests: hitting rate limits, racking up bills. Worse: prod bugs when the provider updates the model overnight.
My hot take? This is the Y2K for AI devs. In the late ’90s, two-digit years were a known time bomb, and the teams that put off fixing them paid for it in emergency patches. Today, non-determinism is your overflow. Fix it now, or watch your app glitch when GPT-5 drops weird replies.
Short fix? Add RSpec, WebMock, and VCR to your Gemfile, bundle, then run rails generate rspec:install. Boom, armed.
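For reference, the setup could look roughly like this. It is a sketch of a typical Gemfile, not the only valid layout; the group placement is a common convention, and no versions are pinned here.

```ruby
# Gemfile (excerpt)
group :development, :test do
  gem "rspec-rails" # provides the rspec:install generator
end

group :test do
  gem "webmock" # blocks and stubs HTTP requests
  gem "vcr"     # records and replays HTTP interactions
end
```

After `bundle install`, `rails generate rspec:install` creates the RSpec configuration files.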
VCR: Record Once, Replay Forever?
VCR’s the star here. Records real HTTP calls to cassettes. First run: pays OpenAI. After? Free replays. Deterministic bliss.
Take this spec:
VCR.use_cassette('chat_about_ruby') do
  service = ChatService.new
  response = service.ask("What is Ruby's main strength?")
  expect(response).to be_a(String)
  expect(response.length).to be > 20
  expect(response.downcase).to match(/ruby|programming|language/)
end
Punchy. Tests shape, not exact words. Length check? Regex for keywords? Smart. No more “expected ‘Ruby rocks’ got ‘Ruby is awesome’” fails.
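If several specs repeat these shape checks, they can live in one plain-Ruby helper. A minimal sketch: `answer_shape_errors`, the 20-character floor, and the keyword list are illustrative choices taken from the spec above, not part of RSpec or VCR.

```ruby
# Hypothetical helper: returns a list of shape problems instead of raising,
# so a spec can assert `expect(answer_shape_errors(response)).to be_empty`.
def answer_shape_errors(response, min_length: 20, keywords: /ruby|programming|language/)
  return ["expected a String, got #{response.class}"] unless response.is_a?(String)

  errors = []
  errors << "too short (#{response.length} chars)" if response.length <= min_length
  errors << "no expected keyword found" unless response.downcase.match?(keywords)
  errors
end
```

Returning messages rather than a bare boolean means a failing spec tells you which check the answer missed.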
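But here’s the snark: VCR ain’t perfect. Change your prompt? The old cassette is stale, and because VCR matches requests on method and URI by default, it will happily keep replaying the old answer. Regenerate often, or you’re testing yesterday’s model. And sensitive keys? VCR only filters what you tell it to filter. Commit a cassette with a live API key in it and you’ll be rotating that key at best. (Ask me how I know: a buddy did exactly that.)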
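A VCR setup that handles the key problem might look like the following. This is a sketch under assumptions: the file path is a common convention, and it assumes the key is read from `ENV["OPENAI_API_KEY"]`; adjust both to your app.

```ruby
# spec/support/vcr.rb (path is a convention, assumed here)
require "vcr"
require "webmock/rspec"

VCR.configure do |config|
  config.cassette_library_dir = "spec/cassettes"
  config.hook_into :webmock
  # Swap the real key for a placeholder before anything is written to disk.
  config.filter_sensitive_data("<OPENAI_API_KEY>") { ENV["OPENAI_API_KEY"] }
  # :once records when no cassette file exists and replays afterwards.
  config.default_cassette_options = { record: :once }
end
```

After a prompt change, deleting the cassette file and rerunning the spec forces a fresh recording.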
Stubs: Ditch the API for Unit Speed
Unit tests? Stub everything. No network. Fake embeddings, mock 429s. Fast as hell.
let(:fake_embedding) { Array.new(1536) { rand(-1.0..1.0) } }
stub_request(:post, 'https://api.openai.com/v1/embeddings')
.to_return(...)
Expect vector length. All floats. Done. Seventy percent of your suite—pure logic, error paths, transforms. No AI roulette.
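The stub above leaves the response body elided. One way to build a body for it, sketched here with a hypothetical `fake_embedding_body` helper: the field layout follows OpenAI's embeddings response format as I understand it, so check it against the API version you target. The model name is a placeholder.

```ruby
require "json"

# Hypothetical helper: builds a JSON body shaped like an embeddings response.
def fake_embedding_body(dimensions: 1536, seed: 42)
  rng = Random.new(seed) # seeded so the fake vector is identical on every run
  vector = Array.new(dimensions) { rng.rand(-1.0..1.0) }
  {
    object: "list",
    data: [{ object: "embedding", index: 0, embedding: vector }],
    model: "fake-embedding-model", # placeholder, not a real model name
    usage: { prompt_tokens: 0, total_tokens: 0 }
  }.to_json
end

# What a stubbed client would see after parsing the body.
vector = JSON.parse(fake_embedding_body).dig("data", 0, "embedding")
```

The string can be passed as `body:` to WebMock's `to_return`. Seeding the generator is a deliberate departure from bare `rand`: a random fake that changes every run brings back the flakiness the stub was meant to remove.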
Error handling? Crucial. Most AI apps die here.
Stub rate limits:
stub_request(...).to_return(status: 429, body: { error: { message: 'Rate limit exceeded' } }.to_json)
expect { service.ask("anything") }.not_to raise_error
Timeouts. Malformed JSON. Test ’em. Prod won’t forgive “API barfed, app down.”
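For `not_to raise_error` to pass, something in the service has to turn bad responses into values. A minimal sketch of that layer, with hypothetical names (`ChatResult`, `parse_chat_response`) and assuming the chat response nests its text under `choices[0].message.content`:

```ruby
require "json"

# Hypothetical result type; names are illustrative only.
ChatResult = Struct.new(:ok, :text, :error, keyword_init: true)

# Turns a raw HTTP status and body into a ChatResult without raising.
def parse_chat_response(status, body)
  return ChatResult.new(ok: false, error: :rate_limited) if status == 429
  return ChatResult.new(ok: false, error: :upstream_error) unless status == 200

  parsed = JSON.parse(body)
  raise TypeError, "unexpected top-level JSON" unless parsed.is_a?(Hash)

  text = parsed.dig("choices", 0, "message", "content")
  return ChatResult.new(ok: false, error: :empty_response) if text.to_s.strip.empty?

  ChatResult.new(ok: true, text: text)
rescue JSON::ParserError, TypeError
  # Malformed or unexpectedly shaped body: report it instead of crashing.
  ChatResult.new(ok: false, error: :malformed_response)
end
```

Because this takes a status and a string, every error path can be unit tested with no HTTP stub at all. For timeouts at the HTTP level, WebMock offers `to_timeout` on a stubbed request.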
Schema Checks: Tame Structured Outputs
AI JSON? Parse and validate.
parsed = JSON.parse(result)
expect(parsed).to have_key('summary')
expect(parsed['key_points']).to be_an(Array)
Keys present. Arrays right type. Patterns hold. Wording varies? Who cares.
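Once there are more than two or three keys, the same checks can be driven by a small table. A sketch in plain Ruby; `SUMMARY_SCHEMA` and `schema_errors` are hypothetical names, and the schema only covers the two keys used above.

```ruby
require "json"

# Hypothetical schema: required key => expected Ruby class after JSON.parse.
SUMMARY_SCHEMA = { "summary" => String, "key_points" => Array }.freeze

# Returns a list of schema violations; an empty list means the output passed.
def schema_errors(json_text, schema = SUMMARY_SCHEMA)
  parsed = JSON.parse(json_text)
  return ["top-level value is not an object"] unless parsed.is_a?(Hash)

  schema.map do |key, type|
    if !parsed.key?(key)
      "missing key: #{key}"
    elsif !parsed[key].is_a?(type)
      "#{key} should be #{type}, got #{parsed[key].class}"
    end
  end.compact
rescue JSON::ParserError => e
  ["invalid JSON: #{e.message}"]
end
```

In a spec this becomes `expect(schema_errors(result)).to be_empty`, and a failure prints exactly which key went wrong.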
Prompt tweaks? Spec the prompt itself.
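it 'keeps output within 200 words' do
  VCR.use_cassette('summarizer_long_doc') do
    # `summarizer` and `long_doc` are placeholders for your own service and fixture.
    result = summarizer.summarize(long_doc)
    word_count = result.split.size
    expect(word_count).to be <= 200
  end
end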
Entities preserved? Regex again. ‘Ruby’ and ‘Matsumoto’ stick around.
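The entity check can be sketched as a helper that reports which names went missing. Both method names here are hypothetical, and matching is case-insensitive by choice.

```ruby
# Hypothetical helper: which required entities are absent from the summary?
# Matches whole words, ignoring case.
def missing_entities(summary, entities)
  entities.reject { |name| summary.match?(/\b#{Regexp.escape(name)}\b/i) }
end

# Same word-count rule as the spec above: whitespace-separated tokens.
def within_word_limit?(summary, limit = 200)
  summary.split.size <= limit
end
```

A spec would then read `expect(missing_entities(result, %w[Ruby Matsumoto])).to be_empty`.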
How Production-Ready Are These RSpec Tricks?
Split your suite: 70% stubs (lightning fast), 25% VCR (real-ish), 5% live (tagged, rare). Run bundle exec rspec spec/services/ for just the AI bits. Want readable output? Add --format documentation.
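The "tagged, rare" live slice can be kept out of normal runs with RSpec metadata. A sketch: `RUN_LIVE` is a hypothetical environment variable picked for this example, and the spec file name is made up.

```ruby
# spec/spec_helper.rb (excerpt)
RSpec.configure do |config|
  # Skip examples tagged `live: true` unless RUN_LIVE is set.
  config.filter_run_excluding live: true unless ENV["RUN_LIVE"]
end

# spec/services/chat_service_live_spec.rb (hypothetical file)
RSpec.describe ChatService, live: true do
  it "returns a non-empty answer from the real API" do
    expect(ChatService.new.ask("What is Ruby's main strength?")).not_to be_empty
  end
end
```

Run the live slice on demand with `RUN_LIVE=1 bundle exec rspec --tag live`. With WebMock and VCR loaded, real connections are blocked by default, so live examples also need network access re-enabled around them (for instance via `WebMock.allow_net_connect!` and `VCR.turned_off`).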
Skeptical? Me too, at first. But this beats flaky chaos. Prediction: teams ignoring it? Six months to brittle messes. Providers update models on their own schedule, and untested prompts break when they do. Meanwhile, VCR crews ship confidently.
Rails AI ain’t a toy. Chat interfaces, voice-to-text: real user cash is on the line. Test structure, errors, shapes. Exact strings? For suckers.
One caveat. VCR cassettes bloat. Compress ’em. And for mega-scale? Consider factories that build fake responses instead of recording everything. But start here.
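On the compression point, VCR ships a `:compressed` serializer that can be selected per cassette. A minimal sketch, reusing the cassette name from earlier:

```ruby
VCR.use_cassette("summarizer_long_doc", serialize_with: :compressed) do
  # exercise the code under test here
end
```

Compressed cassettes are no longer readable in a diff, so this trades reviewability for size. Cassettes previously recorded in another format would need to be recorded again.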
Frequently Asked Questions
How do you test AI features in Rails with RSpec?
Use VCR for recordings, WebMock for stubs. Test shapes, lengths, schemas—not exact text.
What’s the best way to handle non-deterministic AI outputs?
Stub units, cassette integrations. Add error stubs for 429s, timeouts.
Does VCR work for OpenAI in Rails tests?
Yes—records once, replays free. Filter keys, set record: :once.