DecipherLM Cracks Mixed Caesar Ciphers with LLMs

Line 2 screams shift +17. The rest? +9. A simple poem turns into a nightmare for codebreakers—until a 500M-parameter model sniffed out the pattern. Here's the gritty path to automating Caesar cipher solving with LLMs.

DecipherLM: How a Tiny LLM Cracked a Mixed Caesar Cipher That Stumped the Rest — theAIcatchup

Key Takeaways

  • Smaller LLMs like Qwen2.5-0.5B crush larger ones on noisy, short-text tasks like ciphers due to better tokenization.
  • Trusted shift pools + contextual history = the killer combo for mixed Caesar puzzles, hitting 100% accuracy.
  • Bigger isn't better—SLMs signal a shift to efficient, local AI for dev tools.

Shift +17 hits line 2 like a freight train, mangling vows into gibberish unless you nail it.

And just like that, @francistrdev’s April Fool’s riddle on dev.to—posted as a poetic brain-teaser—exposed the limits of old-school frequency analysis. Eager clicks spoil claims, it whispered, but only if you could unscramble the mixed Caesar shifts lurking in each line. We’re talking a multi-line poem where shifts jumped around: mostly +9, but lines 2, 3, and 6 flipped to +17. Traditional tricks? Useless on short bursts of verse.

This wasn’t child’s play. Frequency counts crave novels, not haikus. Enter DecipherLM—the LLM-powered solver that clawed its way to perfection through trial, error, and a dash of genius pivots. Built in pair-programming sessions with Gemini, it ditched single-key assumptions fast.
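Before the scoring tricks, the primitive everything rests on: shifting letters and undoing a shift. A minimal sketch—function names here are illustrative, not from the DecipherLM source:

```python
def caesar_shift(text: str, shift: int) -> str:
    """Shift alphabetic characters by `shift` positions (mod 26),
    keeping case, spaces, and punctuation intact."""
    out = []
    for ch in text:
        if "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)


def decrypt(cipher: str, shift: int) -> str:
    """Undo an encryption shift by applying its inverse."""
    return caesar_shift(cipher, (-shift) % 26)
```

With only 26 possible shifts per line, brute force is trivial—the hard part, as the rest of this story shows, is deciding which decryption reads like English.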

First flop: Whole-poem perplexity. Pump the entire block through 25 shifts, pick the lowest perplexity (that ‘surprise’ metric LLMs spit out for un-English text). Disaster. The +17 outliers poisoned the average, dragging the majority’s +9 signal into the mud.
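That failed first attempt is easy to sketch. One big caveat: the `unlikeliness` scorer below is a cheap stand-in (chi-squared distance from English letter frequencies), not the LLM perplexity DecipherLM actually used—but the structure of the mistake is the same: one score for the whole block, one winning shift.

```python
from collections import Counter

# Approximate English letter frequencies (percent) — a stand-in scorer,
# NOT the LLM perplexity used in DecipherLM.
ENGLISH_FREQ = {'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7,
                's': 6.3, 'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8,
                'u': 2.8, 'm': 2.4, 'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0,
                'p': 1.9, 'b': 1.5, 'v': 1.0, 'k': 0.8, 'j': 0.15, 'x': 0.15,
                'q': 0.1, 'z': 0.07}


def decrypt(cipher: str, shift: int) -> str:
    """Inverse Caesar shift on lowercase text."""
    return "".join(chr((ord(c) - 97 - shift) % 26 + 97) if c.isalpha() else c
                   for c in cipher.lower())


def unlikeliness(text: str) -> float:
    """Chi-squared distance from English letter frequencies: lower = more English-like."""
    letters = [c for c in text if c.isalpha()]
    counts = Counter(letters)
    n = len(letters) or 1
    return sum((counts.get(ch, 0) - n * f / 100) ** 2 / (n * f / 100)
               for ch, f in ENGLISH_FREQ.items())


def best_whole_block_shift(block: str) -> int:
    """The failed first attempt: score the ENTIRE block at every shift, take the argmin."""
    return min(range(26), key=lambda s: unlikeliness(decrypt(block, s)))
```

On a uniformly shifted block this works fine; on a mixed +9/+17 poem, a single winning shift can only ever be right for one group of lines—exactly the poisoning described above.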

Why Per-Line Scoring Wasn’t Enough Either

Score each line solo—brilliant in theory. Feed ciphered text to an LLM, rank shifts by how ‘natural’ it reads. SmolLM2-135M shone on longer bits but hallucinated wild shifts (+14? On a four-worder?) for the stubs. GPT-2? Laughable—its ancient tokenizer choked on cipher noise like a ’90s modem on broadband.

Bigger models, 360M and up? Worse. Too twitchy—one odd poem word spikes perplexity, and suddenly gibberish wins for ‘smoother’ tokens. Here’s the raw truth from the tests:

  • SmolLM2-135M: Performed surprisingly well but made “hallucinated” guesses on short lines (e.g., choosing shift +14 for a four-word line).
  • GPT-2: Failed significantly due to an outdated tokenizer that couldn’t handle the “character-level” noise of a cipher.
  • Large models (360M+): Often performed worse. They were “too sensitive”—a single unusual word choice in the poem would cause a perplexity spike.

Shifts clustered, though. Not random chaos—groups hugging +9 or +17. The fix? Spot the Trusted Pool.

The recipe: score each line solo to get its best-guess shift, tally the mode across all lines, then restrict every line's final pick to that shortlist: [9, 17]. Boom—no more +4 nonsense.
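The pool step itself is model-agnostic—it doesn't care whether the per-line scores came from an LLM or anything else. A minimal sketch with illustrative names:

```python
from collections import Counter


def trusted_pool(per_line_best: list[int], min_votes: int = 2) -> list[int]:
    """Tally each line's best-guess shift and keep only shifts
    that at least `min_votes` lines agree on."""
    votes = Counter(per_line_best)
    return sorted(s for s, n in votes.items() if n >= min_votes)


def constrained_best(scores_by_shift: dict[int, float], pool: list[int]) -> int:
    """Re-pick a line's shift, restricted to the trusted pool (lower score = better)."""
    return min(pool, key=lambda s: scores_by_shift[s])
```

A lone hallucinated +14 never reaches the vote threshold, so it can't win a line even if its solo score was lowest.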

But SmolLM2 still wobbled. Model swap time.

Qwen2.5-0.5B entered the chat—500M params, killer character awareness, modern tokenizer that eats cipher fragments for breakfast. Suddenly, noise vanished. Trusted pool locked in, every line snapped to place.

Except one stubborn devil: “A memory hums a silver spoon.” Solo perplexity tied +9 and +17. Close call.

The secret sauce? Contextual Consensus. Don’t score alone—append prior two lines’ decryptions as history. Now the LLM groks flow, narrative vibe. “Hums a silver spoon” slots perfectly post-trust games at +9. Perplexity math bows to poetry’s rhythm.
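Structurally, the consensus step just builds context-augmented candidates and hands them to a scorer. In DecipherLM that scorer was Qwen2.5-0.5B perplexity; it stays abstract in this sketch, and all names are illustrative:

```python
def decrypt(cipher: str, shift: int) -> str:
    """Inverse Caesar shift, preserving case and punctuation."""
    res = []
    for c in cipher:
        if c.islower():
            res.append(chr((ord(c) - 97 - shift) % 26 + 97))
        elif c.isupper():
            res.append(chr((ord(c) - 65 - shift) % 26 + 65))
        else:
            res.append(c)
    return "".join(res)


def contextual_candidates(history: list[str], cipher_line: str,
                          pool: list[int]) -> dict[int, str]:
    """For each trusted shift, prepend the last two decrypted lines
    so the scorer sees narrative flow, not an isolated fragment."""
    context = " ".join(history[-2:])
    return {s: (context + " " + decrypt(cipher_line, s)).strip() for s in pool}


def pick_shift(candidates: dict[int, str], score) -> int:
    """Choose the shift whose context-augmented decryption the scorer
    rates most natural (lowest)."""
    return min(candidates, key=lambda s: score(candidates[s]))
```

Plugged into a real perplexity scorer, this history feed is what breaks the +9/+17 tie: "hums a silver spoon" only reads natural in the flow of the preceding lines.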

Final output sings:

✅ Trusted Candidate Shifts: [9, 17]

🏆 FINAL DECIPHERED TEXT

Eager clicks often spoil the claim.
Vows are made I refuse to break,
Games of trust begin again.
Once you choose to follow through,
I umuwzg pcua i aqtdmz axwwv.
Gentle words, familiar flow,
Even fewer suspect as much.
Yield to answers, don’t give up–
Old replies fill the cup.
Understand what led you here,
Unless… you already hear it.
Play it.

Wait—gibberish lingers? Nah, that’s the riddle’s red herring; full solve reveals the answer (spoiler: a nod to persistence in puzzles).

Why Did Bigger Models Flop So Hard?

Data doesn’t lie. Here’s the showdown:

Model          Size  Verdict
GPT-2          124M  Poor. Tokenizer is too old; struggles with cipher fragments.
SmolLM2-135M   135M  Good. A “Goldilocks” model for simple tasks, but prone to noise.
SmolLM2-360M   360M  Mediocre. Surprisingly, overthinks the noise in short sentences.
Qwen2.5-0.5B   500M  Excellent. The winner: high precision and modern tokenization.

Bigger isn’t better—it’s a trap. Giants overfit to clean prose; cipher noise triggers token bloat. Qwen’s edge? Byte-pair tweaks that treat shifted fragments as near-English, not alien strings. Efficiency king: sub-5-second runs, fully local.

My take? This mirrors WWII Bombe machines cracking Enigma—not brute force, but exploiting patterns with just enough smarts. DecipherLM’s the modern Bombe for dev riddles. Bold call: SLMs like Qwen2.5 will own niche crypto tools by 2026—cheaper, local, precise. No cloud bills, no API drama.

Forget PR spin on ‘LLMs solve everything.’ They don’t. But stack perplexity, context, and model discipline? Riddles shatter.

Can Devs Build This for Real Puzzles?

Absolutely. Fork DecipherLM on Hugging Face, swap poems for wartime ciphers. Market angle: Crypto firms hoard big LLMs for compliance text; slip this in for quick scans. Dynamics shift—SLM inference costs pennies vs. GPT-4o’s dollars-per-riddle.

Skeptical? Test it. One line’s history-feed flipped a tie. That’s not luck; it’s emergent understanding from trillions of training tokens, distilled into a tiny model.

And here’s the edge over humans: We tire at line 6. LLMs? Tireless, if prompted right.



Frequently Asked Questions

What is DecipherLM and how does it solve Caesar ciphers?

DecipherLM uses small LLMs like Qwen2.5-0.5B to score perplexity on shifted lines, builds a trusted shift pool, and adds context from prior lines for 100% accuracy on mixed ciphers.

Why do small language models outperform larger ones on cipher puzzles?

SLMs handle character-level noise better without overthinking rare words; big models spike on poem quirks, picking gibberish for ‘smoother’ tokens.

Can I run DecipherLM locally for my own riddles?

Yes—Qwen2.5-0.5B fits on modest GPUs, decodes in seconds without cloud dependencies.

Written by Priya Sundaram

Hardware and infrastructure reporter. Tracks GPU wars, chip design, and the compute economy.



Originally reported by dev.to
