AI Research

Word Embeddings from Shannon 1948, Pre-Word2Vec

Word embeddings didn't spring from Word2Vec. They trace to Shannon's 1948 brilliance. This week's AI digest reveals the deep history — and fixes for bland bots.

Word Embeddings: Shannon's 1948 Secret, Not Word2Vec Myth — theAIcatchup

Key Takeaways

  • Word embeddings originated with Shannon's 1948 information theory, predating neural networks.
  • RoPE enables massive context windows via clever rotations, hand-computable elegance.
  • Tune RAG chunk overlap to 10-20% for real recall gains; ignore it at your peril.

Word embeddings? 1948.

Claude Shannon dropped the bomb that year, embedding words into numerical space via information theory, decades before neural nets stole the spotlight. Look, everyone credits Mikolov’s 2013 Word2Vec for vector magic, king-queen analogies, Paris minus France plus Germany landing on Berlin. But that’s surface-level hype. Dig deeper: Shannon’s work on the statistical structure of language encoded words as points in a high-dimensional haze of probabilities, predicting redundancy and surprise. It’s the architectural root, probability distributions masquerading as vectors, that still props up every LLM today.
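A toy version of that analogy arithmetic, with made-up 3-d vectors standing in for learned embeddings (real ones run to hundreds of dimensions):

```python
import numpy as np

# Hand-picked toy vectors for illustration only; real embeddings are learned.
vecs = {
    "paris":   np.array([0.9, 0.1, 0.8]),
    "france":  np.array([0.8, 0.1, 0.1]),
    "germany": np.array([0.7, 0.9, 0.1]),
    "berlin":  np.array([0.8, 0.9, 0.8]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# paris - france + germany should land nearest to berlin.
query = vecs["paris"] - vecs["france"] + vecs["germany"]
print(max(vecs, key=lambda w: cosine(query, vecs[w])))  # berlin
```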

And here’s the kicker: we’re looping back. Modern efficiency plays, like quantized models on phones, echo Shannon’s zero-shot entropy hacks. No gradients needed.

How Shannon’s Math Beat Neural Nets to the Punch

Picture 1948. Transistors barely exist. Yet Shannon’s “A Mathematical Theory of Communication” maps language as Markov chains, assigning each word a coordinate based on conditional probabilities. Word A links to B with p(B|A); stack those likelihoods over the vocabulary and that’s your embedding right there, a sparse vector of conditional probabilities.
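A minimal sketch of that idea, with a toy corpus of my own choosing: estimate p(next | word) from bigram counts and read each word’s row of conditional probabilities as its sparse vector.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the dog sat on the rug".split()

# Count bigrams, then normalize each row into p(next | word).
bigrams = Counter(zip(corpus, corpus[1:]))
totals = Counter(corpus[:-1])

embedding = defaultdict(dict)
for (word, nxt), count in bigrams.items():
    embedding[word][nxt] = count / totals[word]

# Each word is now a sparse vector of conditional probabilities.
print(embedding["the"])  # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
print(embedding["sat"])  # {'on': 1.0}
```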

Fast-forward. Neural embeddings densify it, train on co-occurrence. But why the obsession? Because Shannon nailed entropy as geometry. Words cluster by mutual information; outliers scream novelty. Neural nets just brute-forced the same insight with backprop muscle.

Why word embeddings trace back to Shannon’s 1948 information theory, not neural networks.

That’s the newsletter’s mic-drop line. Spot on. Yet most AI lore skips it, peddling a neural-native myth. My take? It’s deliberate amnesia — easier to fund “revolutionary” nets than admit we’re polishing 75-year-old glass.

Short para: History matters.

Now, layer in positional encodings. Without them, transformers treat sequences like bags of words — order vanishes. This week’s deep-dive hand-computes every method, culminating in RoPE. Rotary Position Embeddings rotate queries and keys in complex space, baking position as a rotation matrix. No added params; pure elegance.

But — em-dash alert — why RoPE over absolute sins? Absolute encodings (learned or sinusoidal) bloat at sequence lengths past 512. Relative ones like T5’s flop on extrapolation. RoPE? Scales to 1M+ tokens because rotations compound geometrically, preserving dot-product distances. It’s Shannon-esque: relative info over absolute coords.

We computed RoPE by hand this week. θ_i = 10000^{-2i/d}; at position m, each pair of query dimensions rotates by angle m·θ_i, applied through cosθ and sinθ. Brutal, beautiful math.
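A compact numpy version of that hand computation, on a toy 4-dimensional head (real implementations fuse the same rotation into the attention kernel):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate each consecutive pair of dims of x by angle pos * theta_i."""
    d = x.shape[-1]                                # head dimension, must be even
    theta = base ** (-2 * np.arange(d // 2) / d)   # theta_i = 10000^(-2i/d)
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    x1, x2 = x[0::2], x[1::2]                      # pair up dimensions
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin                # 2-d rotation per pair
    out[1::2] = x1 * sin + x2 * cos
    return out

q = np.array([1.0, 0.0, 1.0, 0.0])                 # toy query
k = np.array([0.0, 1.0, 0.0, 1.0])                 # toy key

print(rope(q, 5) @ rope(k, 3))                     # positions 5 and 3
print(rope(q, 105) @ rope(k, 103))                 # positions 105 and 103: same value
```

The two dot products match because only the offset between positions survives the rotation; that relative-only dependence is what lets RoPE stretch to long contexts.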

Is Gemma 4 Finally Frontier-Ready?

Google’s Gemma 4 hits leaderboards, but rankings? Sketchy territory for open models. LMSys Arena pits it against Llama-3.1-405B, Claude 3.5. Scores tantalizing — 1280+ Elo. Yet open-source evals lag closed giants in nuance.

Here’s my unique angle: Gemma echoes 1980s Lisp machines. Modular, hackable hardware-software stacks birthed AI then; now, open weights on TPUs do it again. Prediction? By 2026, Gemma forks dominate edge inference, crushing proprietary clouds on cost. Architectural shift: from API serfdom to local sovereignty.

Skepticism check. Rankings hype closed models’ safety rails as “smarts.” Open ones raw-dog reality — bugs expose true caps.

One sentence: Rankings lie.

Shift gears. AI labs vs. governments. Anthropic balks at surveillance deals; OpenAI swoops. Nuance? Both dipped toes in defense — Anthropic’s Claude aids targeting sims. Real clash: scope creep. Governments want backdoors; labs want cash without chains.

As AI capabilities mature, the relationship between AI labs and governments is getting complicated, fast.

Nuanced truth. Precedent? Labs fragment — safety-first vs. power-hungry. Watch Palantir consolidate the scraps.

Why Your RAG Pipeline Chunks Like Crap

Chunk overlap. Ignored param. Zero overlap? Context splits mid-sentence; retrieval grabs halves. Boom, hallucinated glue.

Tune to 10-20%. Eval recall first. Concrete advice: test domain queries, measure F1 on retrieved chunks, iterate. Production RAG demands it, or you’re piping noise.
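A bare-bones sketch of that chunking step, splitting on words rather than real tokens; in production you’d count tokens with your model’s tokenizer and respect sentence boundaries, so treat the sizes as placeholders.

```python
def chunk_text(text, chunk_size=512, overlap=100):
    """Split text into word-level chunks, each repeating the last `overlap`
    words of the previous chunk so boundary sentences survive in one piece."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = "word " * 2000                        # stand-in for a real document
chunks = chunk_text(doc)                    # 100/512 ≈ 20% overlap
print(len(chunks), len(chunks[0].split()))  # 5 chunks of 512 words
```

Then run your domain queries against both the overlapped and non-overlapped index and keep whichever setting wins on recall and F1.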

And AI writing slop? That bland sameness — “dive into,” “realm of” — stems from prompt poverty. New free guide bans 50+ phrases, dual-model review. Smart fix. But why so uniform? Tokenizers favor frequent fluff; base models regurgitate web sludge.

Why Positional Encoding Evolves — Or Dies

Eight layers from prompts to agents. Stateless chat → tool-calling → multi-agent orchestration. Positional encoding underpins it all.

Traditional XAI? Useless for agents. Saliency maps shatter on parallelism. Build graph-based attribution instead — trace decisions across threads.
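As a hedged sketch of what that could look like, assuming you already log each agent step and which earlier outputs it consumed (the trace below is invented):

```python
import networkx as nx

# Each edge means "the downstream step consumed the upstream step's output".
trace = nx.DiGraph()
trace.add_edges_from([
    ("user_prompt", "planner.plan"),
    ("planner.plan", "researcher.search"),
    ("planner.plan", "coder.draft"),
    ("researcher.search", "coder.draft"),
    ("coder.draft", "final_answer"),
])

# Attribution for the final answer: every upstream decision that fed into it,
# across agents and threads, rather than a per-token saliency map.
print(nx.ancestors(trace, "final_answer"))
# e.g. {'user_prompt', 'planner.plan', 'researcher.search', 'coder.draft'}
```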

Modular text-to-KG? One-command gold. Raw corpus → entities → relations → Cypher queries. Why? LLMs choke on flat text; graphs inject structure.
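The last hop of such a pipeline could look like this minimal sketch: already-extracted (subject, relation, object) triples turned into idempotent Cypher MERGE statements. The triples, the Entity label, and the property name are illustrative assumptions; the upstream entity and relation extraction would come from an LLM or NER step.

```python
triples = [
    ("Claude Shannon", "AUTHORED", "A Mathematical Theory of Communication"),
    ("A Mathematical Theory of Communication", "PUBLISHED_IN", "1948"),
    ("RoPE", "USED_IN", "LLaMA"),
]

def to_cypher(subj, rel, obj):
    """Emit MERGE statements so reloading the same triples stays idempotent."""
    return (
        f"MERGE (a:Entity {{name: '{subj}'}}) "
        f"MERGE (b:Entity {{name: '{obj}'}}) "
        f"MERGE (a)-[:{rel}]->(b);"
    )

for s, r, o in triples:
    print(to_cypher(s, r, o))
```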

Deep breath. That’s the week’s guts.

Look, AI’s architectural spine of embeddings, positions, and retrieval revives Shannon because compute walls loom. Neural bloat hits limits; info-theoretic sparsity wins. Bold call: 2030 sees hybrid systems, Shannon-style sparse vectors fused with sparse MoEs, slashing inference 10x on wearables.

Critique the spin. Newsletters like LAI peddle tips amid ads — fine, but that anti-slop guide? It’s meta: AI fixing AI blandness. Circular, but effective.

What Happens When Embeddings Go Agentic?

Agents demand dynamic embeddings. Static Word2Vec dies; learn on-the-fly from trajectories. Shannon parallel: adaptive Huffman codes for streams. Future? Entropy-minimizing embeddings, self-updating per task.

Gemma 4 tests it — open agents incoming.

Wander a bit: Recall 1960s semantic nets, and the 1980s Cyc project buried under scaling failures? Today, with vectors, they resurrect as KG-retrievers.

Punchy: Cycles repeat.

Dense para time. Governments push because agents weaponize — autonomous drones need strong posenc for long-horizon planning; embeddings encode battlefields as state vectors. Anthropic’s “no”? PR sheen over profit calculus. OpenAI fills void, but watch EU regs kneecap all. Meanwhile, chunking tips save RAG deploys daily — overlap your 512-token chunks by 100, watch recall jump 15%. RoPE hand-math reveals why LLMs extrapolate: relative angles decay predictably, unlike ALiBi’s linear hacks. Gemma ranks high ‘cause lightweight — 27B params punch 70B weights. Slop guide? Paste, prompt, profit; bans “use” forever.



Frequently Asked Questions

What are word embeddings’ real origins?

Claude Shannon’s 1948 info theory encoded words as probability vectors — neural nets just densified it later.

How does RoPE positional encoding work?

Rotates query/key vectors by position angle; scales to megatokens without param explosion.

Can Gemma 4 beat closed models?

Leaderboard says yes on raw smarts, but open evals expose safety gaps — edge wins await.

Written by Sarah Chen

AI research editor covering LLMs, benchmarks, and the race between frontier labs. Previously at MIT CSAIL.



Originally reported by Towards AI
