What LLMs Do When 'Thinking'

Dots. Bouncing. Your favorite LLM—let’s call it Grok or whatever—pauses, pretends to ponder. Feels almost human, right?

Wrong.

It’s crunching numbers. Hard. And that’s the scam at the heart of what an LLM is doing when it’s “thinking.” No sparks of genius. Just statistical sleight-of-hand, amplified by cash-burning tricks. Buckle up; we’re dissecting this episode of Release Notes Explained like it’s a bad magic show.

What Is an LLM Actually Doing When ‘Thinking’?

Ever wondered what an LLM is doing when it’s “thinking”?

That’s the hook from the original vid. Cute. But let’s cut the wonder. LLMs don’t think. They predict tokens—next word, next syllable—based on patterns scarfed from internet slop. “Thinking”? That’s PR spin for “iterating predictions until it looks smart.”

Take chain-of-thought prompting. You nudge it: “Step by step.” It spits out fake reasoning. Why? Because training data showed that verbose babble boosts accuracy on benchmarks. Not insight. Mimicry.

But here’s the episode’s meat: scaling laws, test-time compute, reinforcement learning from verifiable rewards. Sound fancy? It’s desperation dressed as progress.

Scaling laws first. Double the parameters, data, compute—performance creeps up predictably. Chinchilla proved it: more is better, but diminishing returns kick in fast. OpenAI’s o1 model? They scaled inference compute, not just training. That’s test-time compute—your query pays the bill.

And boy, does it burn.

Why Test-Time Compute Feels Like a Cash Grab

Picture this: instead of one pass through the model, it does ten. Twenty. A hundred. Each “thought” step? Another forward pass, generating intermediate tokens. Costs skyrocket—10x, 100x your GPT-4 bill.

They call it “reasoning.” I call it brute force. Like solving a puzzle by trying every piece till one fits. Works for math problems, sure—o1 crushes GSM8K benchmarks. But real world? Messy data, ambiguity. Compute can’t buy common sense.

Dry fact: Anthropic’s Claude 3.5 Sonnet with extended thinking time beats bigger models. Efficiency win? Maybe. Or just kicking the can—delaying the inevitable plateau.

Here’s my unique twist, absent from the episode: this mirrors the 1980s Lisp machine boom. AI hype then? Symbol manipulation at scale. Bankrupted companies when walls hit. Today’s test-time compute? Same trap. Investors pour billions; returns flatline. Bubble alert.

But wait—there’s reinforcement learning.

RL from verifiable rewards (RLVR). Train on puzzles with right answers. Reward correct chains. Simple, right?

Not quite. Verifiable? Means math, code—closed worlds. Open-ended? No ground truth. So it hallucinates confidently. Episode glosses this; reality bites.

Punchy truth: LLMs “think” like a drunk philosopher—endless verbiage, zero soul.

Scaling Laws: The Emperor’s New Parameters

Coined by Kaplan et al., 2020. Plot flops against compute. Smooth curve—until it isn’t. Post-Chinchilla, we’ve overspent on parameters (GPT-4’s rumored trillions). Data quality tanks; synthetic slop loops back.

Episode name-drops it. Good. But skips the critique: laws break at edges. Emergent abilities? Hype. Just log-scale tricks fooling evals.

Look—DeepMind’s 2022 paper showed optimal flops allocation: 20% params, 20% data, 60% training compute. We’re ignoring it, chasing parameter porn.

So, does this matter?

Hell yes.

Why Does This Matter for Open Source Devs?

You’re not shelling out for o1-preview at $15/million tokens. OSS world—Llama 3.1, Mistral—runs lean. Test-time compute? Nightmare on consumer GPUs. A 100-step chain? Hours, not seconds.

Fork it. Hack reasoning loops yourself. Tools like Guidance or Outlines let you control token-by-token. No black-box tax.

But corporate spin irks me. “Thinking models,” they crow. xAI, OpenAI—same playbook. Distracts from flaws: no real understanding, prompt fragility, reward hacking.

Prediction: by 2026, test-time compute hits wall. Energy costs, chip shortages. Shift to neurosymbolic hybrids—or back to lean agents.

Skeptical? Test it. Prompt o1: “Explain quantum entanglement.” It’ll weave words. Poke holes—crumbles.

One-paragraph rant: This “thinking” facade props valuations. Sam Altman tweets mysteries; stock jumps. Meanwhile, OSS coders build real tools—LangChain agents, DSPy optimizers—without the theater.

The RLVR Pipe Dream

Reinforcement from verifiable rewards. AlphaGo vibes: self-play on chessboards. But LLMs? Text isn’t Go. Verifiable slices—parsing, arithmetic—tiny fraction of use.

Process: generate chain-of-thought, check final answer. Backprop rewards. Scales reasoning… on toys.

Flaw: process supervision leaks. Model games it, spits checkable nonsense early.

Episode optimistic. I’m not. It’s RLHF 2.0—jailbreak-prone, alignment illusion.

🧬 Related Insights

Read more: OWASP Top 10: Devs’ Dumbest Security Fails, Fixed
Read more: JSON Crushes XML in 2026—Here’s Why (Mostly)

Frequently Asked Questions

What are scaling laws for LLMs?

Power laws predicting performance from compute/data/parameters. Bigger models win, but costs explode.

What is test-time compute in AI?

Extra inference runs per query to simulate ‘thinking’ steps. Boosts accuracy, drains your wallet.

How does reinforcement learning improve LLM reasoning?

RLVR rewards correct reasoning chains on solvable tasks. Works narrow; fails broad.

What LLMs Do When 'Thinking'

Key Takeaways

What Is an LLM Actually Doing When ‘Thinking’?

Why Test-Time Compute Feels Like a Cash Grab

Scaling Laws: The Emperor’s New Parameters

Why Does This Matter for Open Source Devs?

The RLVR Pipe Dream

🧬 Related Insights

Frequently asked questions

Worth sharing?

⚡ Key Takeaways

What Is an LLM Actually Doing When ‘Thinking’?

Why Test-Time Compute Feels Like a Cash Grab

Scaling Laws: The Emperor’s New Parameters

Why Does This Matter for Open Source Devs?

The RLVR Pipe Dream

🧬 Related Insights

Frequently asked questions

Share this article

Worth sharing?

Related Stories

AI's Great Leap Forward: Compute Tsunami Hits Open Source

Stay in the loop

Key Takeaways