Qwen & Gemma Fail Zork: AI Agent Limits

Picture your AI sidekick, primed for adventure, suddenly vomiting Thai script mid-Zork quest. That's the chaos when Qwen and Gemma tackle text adventures — and it exposes why agents falter on simple navigation.

[Image: Qwen outputting Thai script during a Zork session on a terminal screen]

Key Takeaways

  • Tight prompting triggers multilingual glitches in models like Qwen, a warning sign for agents given unsupervised tool access.
  • Dynamic state summaries and thought parameters boost Zork scores but can't conquer maze amnesia.
  • Local models expose agent scaffolding flaws that frontier AIs mask — essential stress test for production.

My RTX 5080 hummed quietly in the dim room as Qwen 2.5 14B decided Thai was the way forward in Zork — actual Thai script, laced with Chinese glyphs, instead of ‘go north.’

Getting Qwen and Gemma to play Zork sounded simple. Pure text. No visuals to hallucinate. Just a graph of rooms, puzzles, that infamous maze. But here’s the thing: these models ace isolated reasoning — dissect a log file, trace a single service — yet crumble when holding a mental map across turns. I built the rig to probe that exact fracture.

Jericho parses the original Zork binary and spits out state. Ollama serves the models locally. Pi Coding Agent orchestrates. Feed in the game output (location, inventory, score), prompt the model for a command. Loop. No hand-holding: ‘You’re the player. Execute only.’
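The loop itself is tiny. Here is a minimal sketch of the orchestration pattern, with a toy two-room environment standing in for Jericho's `FrotzEnv` and a canned-response function standing in for the Ollama call; everything named here is illustrative, not the actual rig:

```python
# Minimal agent loop: observe -> prompt -> act, no hand-holding.
# ToyEnv stands in for jericho.FrotzEnv; fake_model stands in for an Ollama call.

class ToyEnv:
    """Two-room stand-in for the Zork binary."""
    def __init__(self):
        self.room, self.score = "west of house", 0

    def step(self, action):
        if self.room == "west of house" and action == "go north":
            self.room, self.score = "north of house", 10
        done = self.room == "north of house"
        return f"{self.room} | score {self.score}", self.score, done

def fake_model(prompt):
    # A real call might look like: ollama.chat(model="qwen2.5:14b", messages=[...])
    return "go north" if "west of house" in prompt else "look"

def play(env, model, max_turns=10):
    obs, score, done = env.room, 0, False
    transcript = []
    for _ in range(max_turns):
        prompt = f"You are the player. Execute only.\nState: {obs}\nCommand:"
        action = model(prompt)
        obs, score, done = env.step(action)
        transcript.append((action, obs))
        if done:
            break
    return score, transcript

final_score, log = play(ToyEnv(), fake_model)
print(final_score)  # 10
```

The point of the shape: the model never sees anything but the current observation, which is exactly the fracture the experiments below probe.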

Qwen broke first. Not with bad moves. With therapy-speak.

“In Zork, you typically want to explore your surroundings by using commands like LOOK and EXAMINE…”

Trained to assist, not embody. Crank the constraints (‘English only, no speech, tools exclusively’) and bam: Thai. ‘推进完毕’ (‘progress done’). Multilingual bleed under pressure. It’s not a bug; it’s the model’s token soup bubbling over when you dam the English channel. Qwen’s a beast at code and translation. But chain it as an agent? Unpredictable tongues signal deeper chaos: what if it rm -rf’s your prod in Mandarin?

Gemma 4 26B, fresh from Google, stayed English. Stable. But scores? Zilch. Ten points max, teasing competence before looping. West. East. West. Stuck.

Why Do Even Big Models Ditch Zork’s Map?

Look, the architecture’s the villain. Static prompts (‘observe, think, act’) teach format, not flow. The model mimics ReAct papers, outputting JSON theater instead of calling tools. Fix: dynamite the prompt. Embed a state summary every turn: ‘You: kitchen. Lamp: held. Score: 10. Valid: north, open.’ Add a ‘thought:’ field for scratchpad reasoning, quarantined inside the tool call. No chit-chat escape hatch.
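What per-turn state injection looks like in practice, sketched as a prompt builder and a strict tool-call parser (the field names and format are my own, not what Pi Coding Agent actually uses):

```python
import json

def build_state_summary(location, inventory, score, valid_actions):
    """Compress game state into a one-line summary injected every turn."""
    return (f"You: {location}. Holding: {', '.join(inventory) or 'nothing'}. "
            f"Score: {score}. Valid: {', '.join(valid_actions)}.")

def parse_tool_call(raw):
    """Accept only a JSON tool call. The 'thought' field is the scratchpad,
    quarantined inside the call so reasoning can't leak out as chit-chat."""
    call = json.loads(raw)
    if set(call) != {"thought", "command"}:
        raise ValueError("no extra fields, no prose")
    return call["command"]

summary = build_state_summary("kitchen", ["lamp"], 10, ["north", "open window"])
print(summary)
# You: kitchen. Holding: lamp. Score: 10. Valid: north, open window.

cmd = parse_tool_call('{"thought": "window first", "command": "open window"}')
print(cmd)  # open window
```

Anything that isn't valid JSON with exactly those two keys raises, which is what closes the assistant-mode escape hatch.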

That nudged scores to 50-ish. Gemma lit the lamp, dropped the grate, grabbed the sword. Progress! But the maze. Oh, the maze. A dozen rooms of twisty passages, all alike. Models map it once, fine. Twice? Vapor. With no persistent graph in context, each turn is fresh amnesia. They reinvent directions and chase ghosts.

And temperature? Zero locks in the tools: reliable calls, dumb plays. 0.7 sparks creativity (the maze gets navigated sporadically) but 20% of tool calls fail and the chit-chat creeps back. It’s a vise: squeeze for reasoning, lose reliability. Or vice versa.

Here’s my angle the original misses: this echoes early symbolic AI. Remember SHRDLU? A blocks world in a sandbox: parse, stack, query flawlessly. Scale to Zork’s open graph? Nope. The brittleness is identical. Today’s agents? Same trap, scaled up on bigger params. Prediction: without baked-in graph networks or external memory (vector stores tuned for spatial graphs), no frontier model conquers text adventures without hints. Small locals like Qwen and Gemma? Perfect canaries: they expose the scaffolding rot before you bet the farm.

But corporate spin? ‘Just prompt better!’ Nah. Google’s Gemma drop hyped MoE magic for reasoning. Zork begs to differ — it’s state persistence, not token math.

Can Local Models Ever Beat Zork?

Short answer: not solo. Scaffolding saves ’em. Dynamic state injection tripled scores. The thought param let reasoning chain without slipping into assistant mode. Still, the maze claims 90% of runs. Why? Context collapse. 128k tokens sound vast? Zork’s 200 rooms, objects, and lore fill it quick, and models prioritize recency, not topology.

Tweak: vectorize the map. Embed rooms by description, query nearest. (I hacked a prototype; scores hit 180, but that’s cheating pure agency.) Or fine-tune on Zork traces. But that’s not ‘reasoning’; it’s memorization. The why: transformers are sequence junkies, not graph natives. The maze demands topology (shortest paths, cycles) that’s alien to next-token prediction.
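The external memory the maze demands is embarrassingly simple to build outside the model. A sketch of a persistent room graph with BFS shortest-path, i.e. the topology a transformer won't hold in context (room names and exits here are illustrative):

```python
from collections import deque

class RoomGraph:
    """Persistent map the agent updates each turn, instead of re-deriving
    the maze from context. Edges are (room, direction) -> room."""
    def __init__(self):
        self.edges = {}

    def record(self, src, direction, dst):
        self.edges.setdefault(src, {})[direction] = dst

    def shortest_path(self, start, goal):
        """BFS: return the list of directions from start to goal, or None."""
        queue, seen = deque([(start, [])]), {start}
        while queue:
            room, path = queue.popleft()
            if room == goal:
                return path
            for direction, nxt in self.edges.get(room, {}).items():
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [direction]))
        return None

g = RoomGraph()
g.record("maze1", "west", "maze2")
g.record("maze2", "up", "maze3")
g.record("maze1", "east", "maze1")   # the classic self-loop
print(g.shortest_path("maze1", "maze3"))  # ['west', 'up']
```

Feed the model `shortest_path` output instead of asking it to remember twelve identical rooms, and the amnesia problem becomes a bookkeeping problem.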

Glimpses of hope. One Gemma run mapped the maze verbally and backtracked systematically. Scored 220. Rare. Like watching a toddler solve a Rubik’s cube blindfolded.

Deeper: this torches the ‘agents everywhere’ hype. Your cloud alert bot? Fine for linear traces. Distributed systems are maze-like: queues, services, cycles. The agent will loop and hallucinate. Stress-test with Zork before prod.

Worse for open-source locals. Qwen’s multilingual gift turns curse under harness. Gemma’s MoE shines in bursts — but consistency? Nope.

What Scaffolding Actually Fixes — And Breaks

State summaries: win. Cut hallucinated rooms by 70%.

Tool-only output: prevents Thai jailbreaks.

But there’s a creativity tax. Temp 0 makes it a robot: it loops ‘inventory’ forever.
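The ‘tool-only, English-only’ guardrail reduces to a cheap validator sitting between model and game. A sketch; the verb whitelist and the length cutoff are my assumptions, not what the harness actually ships:

```python
VALID_VERBS = {"go", "look", "take", "open", "drop", "light", "inventory"}

def validate_command(raw):
    """Reject Thai jailbreaks and assistant chit-chat before they hit the game.
    Returns the cleaned command, or None to trigger a retry."""
    text = raw.strip().lower()
    if not text or not text.isascii():   # catches Thai / CJK token bleed
        return None
    if len(text.split()) > 4:            # 'In Zork, you typically want to...'
        return None
    verb = text.split()[0]
    return text if verb in VALID_VERBS else None

print(validate_command("go north"))   # go north
print(validate_command("推进完毕"))     # None (non-ASCII rejected)
print(validate_command("In Zork, you typically want to explore"))  # None
```

A `None` means re-prompt, which is where the 20% tool-failure tax at higher temperatures actually gets paid.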

A historical parallel: Infocom’s Zork spawned parser tech that helped inspire modern NLP. Now LLMs can parse it, but they can’t inhabit it. Ironic full circle. Bold call: by 2026, agent frameworks will embed graph neural networks as standard equipment. Otherwise, mazes everywhere (git histories, microservices) stay unconquered.

PR spin callout: ‘small models stress-test scaffolding.’ Yes, but don’t pretend it’s an equal fight. Frontier models like o1-preview crush Zork. Locals lag on architecture, not just params.

So build the scaffolding strong. Validate outputs cross-lingually. Persist maps externally. Or watch your agent wander off into Thai deserts.



Frequently Asked Questions

Why did Qwen output Thai when playing Zork?

Tight constraints suppressed English patterns, so it defaulted to high-probability multilingual tokens from its training data — a pressure-valve failure mode.

How well did Gemma perform in Zork compared to Qwen?

Gemma stayed in English and hit occasional 50-point runs with better prompting, but both looped endlessly in the maze without persistent state.

What does Zork reveal about building AI agents?

Agents excel at single-step reasoning but lose the plot on graph navigation — demanding external memory and dynamic scaffolding for real-world tasks like tracing distributed systems.

Written by James Kowalski

Investigative tech reporter focused on AI ethics, regulation, and societal impact.



Originally reported by Dev.to
