Everyone expected LLMs to storm video games via massive vision models — feast on screenshots, spit out controller inputs, burn GPU cycles like Atari agents on steroids.
Wrong.
Russell Harper’s PvP-AI demo on the Commander X16 emulator, in which an LLM plays an 8-bit game through structured “smart senses”, shreds that script. Structured text feeds the model the game state: enemy positions, player health, bullet counts. No fuzzy pixels. No waveform audio. Just crisp, JSON-like summaries built from the game’s old-school touch and EMF sensors. And GPT-4o? It doesn’t just play. It schemes.
Here’s the shift: this isn’t brute-force screen-scraping. It’s architectural elegance — feeding LLMs what they crave (language) instead of forcing a square peg into a visual hole. Expectation was compute-heavy hacks. Reality? Lean, mean agentic AI that learns across sessions.
Harper wired the ChatGPT API (model gpt-4o) to PvP-AI, an 8-bit shoot-’em-up running on a Commander X16 emulator.
What the Hell Are ‘Smart Senses’?
Think vintage hardware vibes. The Commander X16, David Murray’s love letter to 8-bit nostalgia, packs these quirky inputs: touch pads for player position, EMF coils sensing magnetic fields from metal bits on-screen. Harper hijacks ’em and turns the raw signals into text blobs: “Player at x=50, y=30; Enemy1 at x=80, health=2; Bullets: 3 left.”
Simple. Brutal. Effective.
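Want the shape of it? Here’s a minimal Python sketch of that adapter. The field names and dict layout are my guesses, not Harper’s actual code:

```python
# Minimal sketch of a "smart senses" adapter: raw sensor readings in,
# LLM-ready text blob out. Field names are illustrative, not Harper's code.

def senses_to_text(state: dict) -> str:
    """Flatten a game-state dict into the compact summary the LLM sees."""
    lines = [f"Player at x={state['player']['x']}, y={state['player']['y']}"]
    for name, enemy in state["enemies"].items():
        lines.append(f"{name} at x={enemy['x']}, health={enemy['health']}")
    lines.append(f"Bullets: {state['bullets']} left")
    return "; ".join(lines)

state = {
    "player": {"x": 50, "y": 30},
    "enemies": {"Enemy1": {"x": 80, "health": 2}},
    "bullets": 3,
}
print(senses_to_text(state))
# -> "Player at x=50, y=30; Enemy1 at x=80, health=2; Bullets: 3 left"
```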
The LLM gets this per turn, ponders — via persistent notes — then outputs moves: fire, thrust, dodge. Over three games? It evolves. Spots patterns in the built-in AI’s dumb loops. Discovers an exploit: time shots to clip the edge, rack impossible scores.
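The turn loop is barely more code. A hedged sketch using the standard openai Python client; the prompt wording and the NOTES: convention are my assumptions, not Harper’s:

```python
# Sketch of the per-turn loop: state text in, action + updated notes out.
# Assumes the official openai Python client; prompt/notes format are guesses.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
notes = ""         # persists across turns (and, saved to disk, across games)

def take_turn(state_text: str) -> str:
    global notes
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "You are playing an 8-bit shoot-'em-up. Reply with one action "
                "(fire/thrust/dodge) on the first line, then updated strategy "
                "notes after a 'NOTES:' marker."
            )},
            {"role": "user", "content": f"STATE: {state_text}\nNOTES: {notes}"},
        ],
    )
    text = response.choices[0].message.content
    # Split the reply into the action and the carried-forward notes.
    action, _, notes = text.partition("NOTES:")
    return action.strip()
```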
But — here’s my dig — OpenAI’s PR would spin this as ‘multimodal mastery.’ Nah. This thrives because it’s unimodal text. No vision-model baggage. It’s the anti-hype: proves LLMs shine when you speak their language, not yours.
How GPT-4o Went Full Strategist
First game: cautious probes. Fire sparingly, hug walls.
Second: bolder arcs, predictive dodges — “Enemy telegraphs left; preempt.”
Third: exploit city. Notes pile up — “AI stalls at screen edge; rapid-fire pins it.” Boom. High score.
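Cross-game memory? It can be as dumb as a file. A sketch, since Harper doesn’t publish the mechanism; the path and format are made up:

```python
# Sketch of cross-session memory: the strategy notes outlive the process.
# Storage path and format are assumptions; any persistent store would do.
from pathlib import Path

NOTES_FILE = Path("pvp_ai_notes.txt")  # hypothetical location

def load_notes() -> str:
    return NOTES_FILE.read_text() if NOTES_FILE.exists() else ""

def save_notes(notes: str) -> None:
    NOTES_FILE.write_text(notes)

# Game 1 ends with: "AI stalls at screen edge; rapid-fire pins it."
# Game 3 starts with those notes already in context -- exploit on turn one.
```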
Watch the recordings on Harper’s site (pvp-ai.russell-harper.com/#v3). The emulator chugs along at the X16’s native 8MHz; the LLM responds in seconds. Latency? Negligible for turn-based play. But scaling to real-time? That’s the rub, and the opportunity.
Wander a bit: this mirrors 1990s chess engines. Deep Blue didn’t ‘see’ a board; it crunched a symbolic board representation. Text state crushed pixel-parsing. Decades on, GPT-4o rediscovers that truth for agents. My unique angle? This isn’t evolution; it’s regression to efficiency. Modern AI bloats on vision; retro constraints force smarts. Prediction: embedded IoT agents (drones, robots) will ditch cameras for sensor text. Cheaper. Sharper. Unstoppable.
Why Does This Matter for AI Agents?
Developers, listen up. You’re building agents? Ditch screenshots.
Pixels leak noise — OCR fails on glitches, lighting shifts kill models. Structured senses? Parse once, feed forever. Harper’s hack ports anywhere: APIs summarizing sim states, logs from factories, telemetry from cars.
Skeptical take: is this real agency? It’s prompted turns, not autonomous loops. Fair. But notes persist — strategy compounds. That’s proto-memory. Add tools (fire API? Check health endpoint?), and you’ve got Devin-level autonomy on a 6502 CPU.
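Wiring those tools up is one function-calling schema away. A sketch against OpenAI’s real tools parameter; the fire and check_health tools are the hypotheticals from the paragraph above, not actual endpoints:

```python
# Sketch: exposing game actions as tools via OpenAI function calling.
# The "fire" and "check_health" tools are hypothetical, per the text above.
from openai import OpenAI

client = OpenAI()

tools = [
    {"type": "function", "function": {
        "name": "fire",
        "description": "Fire one shot at the given angle.",
        "parameters": {"type": "object",
                       "properties": {"angle": {"type": "number"}},
                       "required": ["angle"]},
    }},
    {"type": "function", "function": {
        "name": "check_health",
        "description": "Return the player's current health.",
        "parameters": {"type": "object", "properties": {}},
    }},
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "STATE: Enemy1 at x=80; you at x=50."}],
    tools=tools,
)
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)  # dispatch to the game
```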
Corporate spin check: Anthropic and its ilk hype ‘constitutional AI’ for safety. Here? Raw GPT-4o finds exploits sans guardrails. Risky? Sure. But that’s agency: unintended brilliance.
One-paragraph ramble: Imagine Minecraft bots, not scraping renders, but querying block APIs. Or StarCraft: text fog-of-war diffs. Efficiency explodes; costs plummet. This demo whispers the future: interface design wins wars. Not model size.
Can Text-Only Inputs Revolutionize Game AI?
Short answer: yes, for niches.
Full: vision models (GPT-4V, Gemini) gobble tokens on frames — 10x cost for marginal gains. Text? Pennies. Commander X16 proves it in pixel-perfect retro land.
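Rough math, assuming OpenAI’s published high-detail image formula for GPT-4o (85 base tokens plus 170 per 512-pixel tile); prices drift, so trust the ratio, not the digits:

```python
# Rough token math: one 1280x720 frame vs. a ~40-token text summary.
# Assumes OpenAI's documented high-detail formula: 85 + 170 per 512px tile.
import math

def image_tokens(w: float, h: float) -> int:
    scale = min(1.0, 2048 / max(w, h))       # fit inside 2048x2048
    w, h = w * scale, h * scale
    scale = min(1.0, 768 / min(w, h))        # shrink shortest side to 768
    w, h = w * scale, h * scale
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

frame = image_tokens(1280, 720)              # -> 1105 tokens per frame
summary = 40                                 # generous text-state estimate
print(frame, summary, round(frame / summary))  # pixels cost ~28x the tokens
```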
Critique Harper’s setup — it’s turn-based shmup, not twitch FPS. But extend: real-time via streaming summaries? Async reasoning? We’re close.
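What would real-time even look like? One plausible shape, sketched below: the game ticks at full speed, the agent always reasons over the freshest snapshot, and stale states get dropped rather than queued. Every name here is hypothetical:

```python
# Sketch: decoupling a 60Hz game loop from slow LLM turns.
# The game overwrites a single "latest state" slot; the agent snapshots it
# whenever it finishes thinking, so stale states are dropped, never queued.
import asyncio

latest_state: dict = {}

async def game_loop() -> None:
    for tick in range(100):                  # ~1.7s of simulated play
        latest_state.update(tick=tick, player_x=50 + tick % 5)
        await asyncio.sleep(1 / 60)          # 60Hz simulation

async def think(state: dict) -> str:
    await asyncio.sleep(0.5)                 # stand-in for LLM API latency
    return "dodge"

async def agent_loop() -> None:
    while True:
        if not latest_state:                 # nothing to reason over yet
            await asyncio.sleep(0.01)
            continue
        state = dict(latest_state)           # snapshot the freshest state
        action = await think(state)          # slow call; game keeps ticking
        print(f"tick={state['tick']} -> {action}")

async def main() -> None:
    agent = asyncio.create_task(agent_loop())
    await game_loop()
    agent.cancel()

asyncio.run(main())
```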
Historical parallel (my insight): 1970s PLATO games used text grids for battleships. LLMs close the loop — inference on ancient iron. Bold call: by 2026, indie games ship with LLM hooks standard. Modders feast.
The Exploit That Stole the Show
LLM’s notes evolve: “Opponent AI pathing flaw — hesitates at y=200.”
Then: cheese mode. Position, spam. Built-in AI — rule-based relic — folds.
This? Pure emergence. No fine-tune. Just context window + chain-of-thought.
Frequently Asked Questions
What is PvP-AI on Commander X16?
It’s an 8-bit shoot-’em-up for the Commander X16 where players duel in arenas, using touch/EMF inputs for position and firing. Retro as hell, and it runs on the emulator or on open-source X16 hardware.
How does GPT-4o play games without pixels?
Via ‘smart senses’ — text summaries of game state from sensors: positions, health, ammo. Model reasons in language, outputs actions.
Will LLMs replace traditional game AI?
Not fully — vision needed for open-world chaos. But for structured sims? Absolutely, especially with text interfaces.