Large language models have reshaped the landscape of artificial intelligence, powering everything from chatbots and code assistants to scientific research tools. Yet for many practitioners and enthusiasts, the inner workings of these systems remain opaque. Understanding how LLMs actually function is essential for anyone building with, evaluating, or making decisions about AI technology.
This guide breaks down the three foundational pillars of modern LLMs: the transformer architecture, the attention mechanism, and the tokenization process that converts human language into something a neural network can process.
The Transformer Architecture: A Paradigm Shift
Before transformers arrived in 2017, most natural language processing relied on recurrent neural networks (RNNs), most notably long short-term memory (LSTM) networks. These architectures processed text sequentially, one word at a time, which created two serious problems: they were slow to train and struggled to maintain context over long passages.
The transformer, introduced in the landmark paper "Attention Is All You Need" by Vaswani et al., solved both problems by processing entire sequences in parallel. Instead of reading a sentence word by word, a transformer examines all words simultaneously and learns the relationships between them.
A transformer consists of two main components: an encoder that reads and understands input text, and a decoder that generates output text. Models like BERT use only the encoder, while GPT-style models use only the decoder. The original transformer used both for machine translation tasks.
Each encoder layer contains two sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network. Decoder layers add a third sub-layer, cross-attention over the encoder's output. Residual connections and layer normalization wrap each sub-layer, stabilizing training and allowing gradients to flow through deep networks without vanishing.
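As a concrete illustration, here is a minimal sketch of one encoder layer in PyTorch. The dimensions are arbitrary, and the details (post-sub-layer normalization, a ReLU feed-forward network, PyTorch's built-in multi-head attention) follow the original paper's recipe rather than any particular production model.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One transformer encoder layer: self-attention and a feed-forward network,
    each wrapped in a residual connection followed by layer normalization."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # multi-head self-attention sub-layer
        x = self.norm1(x + attn_out)       # residual connection + layer norm
        x = self.norm2(x + self.ff(x))     # feed-forward sub-layer, same wrapping
        return x

x = torch.randn(1, 10, 512)                # a batch of one sequence: 10 tokens, 512-dim embeddings
print(EncoderLayer()(x).shape)             # torch.Size([1, 10, 512])
```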
Self-Attention: How Models Understand Context
The self-attention mechanism is arguably the most important innovation in the transformer. It allows the model to weigh the importance of every word in a sequence relative to every other word, capturing long-range dependencies that previous architectures missed.
Here is how self-attention works step by step; a short code sketch follows the list:
- Query, Key, and Value vectors: For each token in the input, the model creates three vectors by multiplying the token's embedding by learned weight matrices. The query represents what the token is looking for, the key represents what the token offers, and the value represents the actual information carried.
- Attention scores: The model computes a dot product between each query and all keys, producing a score that indicates how much attention one token should pay to another. These scores are scaled by the square root of the key dimension so the dot products do not grow so large that the softmax saturates and gradients vanish.
- Softmax normalization: The scaled scores pass through a softmax function, converting them into probabilities that sum to one. This creates an attention distribution over all tokens in the sequence.
- Weighted sum: Finally, the model multiplies each value vector by its corresponding attention weight and sums the results, producing a context-aware representation for each token.
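Those four steps translate almost line for line into code. The sketch below is a single attention head in NumPy; the embeddings and projection matrices are random placeholders chosen only to show the shapes and the arithmetic, whereas a real model learns these weights during training.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.
    X: (seq_len, d_model) token embeddings; W_q, W_k, W_v: projection matrices."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v                 # 1. query, key, value vectors
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # 2. scaled dot-product scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # 3. softmax over each row
    return weights @ V                                  # 4. attention-weighted sum of values

rng = np.random.default_rng(0)
d_model, d_k, seq_len = 16, 8, 4
X = rng.normal(size=(seq_len, d_model))                 # embeddings for 4 tokens
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)           # (4, 8): one context vector per token
```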
Consider the sentence "The cat sat on the mat because it was tired." When processing the word "it", the attention mechanism assigns more weight to "cat" than to "mat", correctly resolving the pronoun reference. This ability to capture contextual relationships is what makes transformers so powerful.
Multi-Head Attention
Rather than computing attention once, transformers use multi-head attention, running several attention computations in parallel with different learned weight matrices. Each head can focus on different types of relationships: one might capture syntactic structure, another semantic similarity, and another positional patterns. The outputs of all heads are concatenated and linearly transformed into the final representation.
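The sketch below extends the same idea to multiple heads, again with random placeholder weights: the model dimension is split across heads, each head runs scaled dot-product attention independently, and the per-head outputs are concatenated and passed through a final output projection.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """Multi-head self-attention over x of shape (seq_len, d_model).
    All weight matrices have shape (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # Project, then split the feature dimension into heads: (num_heads, seq_len, d_head)
    q = (x @ W_q).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    k = (x @ W_k).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    v = (x @ W_v).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention, computed independently in every head
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    heads = weights @ v                                  # (num_heads, seq_len, d_head)
    # Concatenate the heads and apply the final linear projection
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
d_model, seq_len, num_heads = 64, 5, 8
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
print(multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads).shape)  # (5, 64)
```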
GPT-3, for example, uses 96 attention heads in each of its 96 layers, and GPT-4-class models are widely believed to be larger still, though their exact configurations have not been published. This combination of depth and width gives such models enormous capacity to model complex language patterns.
Tokenization: Converting Language to Numbers
Neural networks cannot process raw text directly. Tokenization is the critical preprocessing step that converts human-readable text into sequences of integers that the model can work with.
Why Not Just Use Words?
Using whole words as tokens creates an impossibly large vocabulary. English alone has hundreds of thousands of words, and when you add technical jargon, names, and multilingual text, the vocabulary becomes unmanageable. Word-level tokenization also cannot handle misspellings, neologisms, or morphological variations gracefully.
Subword Tokenization
Modern LLMs use subword tokenization methods, most commonly Byte Pair Encoding (BPE) or close relatives such as WordPiece and the unigram model implemented in SentencePiece. These algorithms strike a balance between character-level and word-level tokenization.
BPE works by starting with individual characters and iteratively merging the most frequent pairs. Common words like "the" become single tokens, while rare words are broken into meaningful subword units. For example, "unhappiness" might be tokenized as [un, happi, ness], allowing the model to understand its components even if it has never seen the exact word before.
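The merge loop at the heart of BPE fits in a few lines. In the sketch below, the corpus is a handful of words with invented frequencies, each written as space-separated symbols; a real tokenizer trains the same procedure on billions of words and performs tens of thousands of merges.

```python
from collections import Counter

# Toy corpus: word (as space-separated symbols) -> invented frequency
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get)

def apply_merge(corpus, pair):
    """Fuse every occurrence of the chosen pair into a single symbol."""
    new_corpus = {}
    for word, freq in corpus.items():
        symbols, merged, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        new_corpus[" ".join(merged)] = freq
    return new_corpus

for step in range(4):                       # each step adds one new token to the vocabulary
    pair = most_frequent_pair(corpus)
    corpus = apply_merge(corpus, pair)
    print(f"merge {step + 1}: {pair} -> {''.join(pair)}")
print(corpus)
```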
GPT-4 uses a tokenizer with roughly 100,000 tokens in its vocabulary. The average English word requires about 1.3 tokens, while code and non-English languages often require more tokens per word, which affects both cost and context window utilization.
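To check these ratios on your own text, OpenAI's open-source tiktoken library exposes the same encodings its models use. The sketch below assumes tiktoken is installed and uses the cl100k_base encoding associated with GPT-4-era models.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Tokenization converts human-readable text into sequences of integers."
tokens = enc.encode(text)

print(tokens)                                  # a list of integer token IDs
print(len(text.split()), "words ->", len(tokens), "tokens")
print(enc.decode(tokens))                      # decoding round-trips the original string
```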
Positional Encoding
Since transformers process all tokens in parallel rather than sequentially, they have no inherent sense of word order. Positional encodings are added to token embeddings to inject information about each token's position in the sequence. The original transformer used fixed sinusoidal functions for this purpose; GPT-2 and GPT-3 use learned positional embeddings, and many recent models adopt schemes such as Rotary Position Embeddings (RoPE) that handle long contexts better.
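A sketch of the original sinusoidal scheme in NumPy: each position receives a fixed vector of sines and cosines at different frequencies, and that vector is simply added to the token embedding before the first layer. The dimensions here are arbitrary.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                             # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                             # odd dimensions: cosine
    return pe

embeddings = np.random.default_rng(0).normal(size=(10, 64))  # 10 tokens, d_model = 64
x = embeddings + sinusoidal_positional_encoding(10, 64)      # added, not concatenated
print(x.shape)                                               # (10, 64)
```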
Putting It All Together: How LLMs Generate Text
During training, an LLM processes billions of text examples, adjusting its parameters to predict the next token given all preceding tokens. This process, called causal language modeling, teaches the model the statistical patterns of language at every level, from grammar and facts to reasoning patterns and style.
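In practice the training targets are just the input sequence shifted by one position, so every token in a training example provides a prediction target. The token IDs below are made up purely for illustration.

```python
# Causal language modeling: the target at each position is simply the next token.
tokens = [101, 7, 42, 13, 5]        # hypothetical token IDs for one training example
inputs = tokens[:-1]                 # [101, 7, 42, 13]
targets = tokens[1:]                 # [7, 42, 13, 5]
for x, y in zip(inputs, targets):
    print(f"given ...{x}, predict {y}")
```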
During inference (when generating text), the model works autoregressively:
- It takes the input prompt, tokenizes it, and processes it through all transformer layers.
- The final layer outputs a probability distribution over the entire vocabulary for the next token.
- A sampling strategy (greedy, top-k, top-p, or temperature-based) selects the next token.
- That token is appended to the sequence, and the process repeats until a stopping condition is met.
This autoregressive loop is why LLMs generate text one token at a time, even though the underlying architecture processes sequences in parallel during the forward pass.
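The loop itself is short. In the sketch below, fake_logits is a stand-in for a real transformer forward pass (its name, the vocabulary size, and the end-of-sequence ID are all invented for illustration), and the sampler combines temperature scaling with top-k filtering.

```python
import numpy as np

VOCAB_SIZE = 100
EOS_TOKEN = 0
rng = np.random.default_rng(0)

def fake_logits(tokens):
    """Stand-in for a transformer forward pass: one logit per vocabulary entry,
    deterministic given the prefix so the example is reproducible."""
    return np.random.default_rng(sum(tokens)).normal(size=VOCAB_SIZE)

def sample_next(logits, temperature=0.8, top_k=10):
    """Temperature + top-k sampling over the next-token distribution."""
    logits = logits / temperature
    top = np.argsort(logits)[-top_k:]                 # keep the k highest-scoring tokens
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                              # softmax over the surviving tokens
    return int(rng.choice(top, p=probs))

def generate(prompt_tokens, max_new_tokens=20):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):                   # autoregressive loop: one token per step
        next_token = sample_next(fake_logits(tokens))
        tokens.append(next_token)
        if next_token == EOS_TOKEN:                   # stopping condition
            break
    return tokens

print(generate([5, 17, 42]))
```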
Scale and Its Consequences
Modern LLMs derive much of their capability from sheer scale. GPT-3 has 175 billion parameters, and later models are believed to be significantly larger. Training these models requires thousands of GPUs running for months and consumes electricity on the order of hundreds to thousands of megawatt-hours.
Research has shown that model capabilities often emerge unpredictably at certain scales. Abilities like few-shot learning, chain-of-thought reasoning, and code generation appear to improve dramatically once models cross certain parameter thresholds, a phenomenon researchers call emergent abilities.
However, scale alone is not sufficient. Training data quality, instruction tuning, reinforcement learning from human feedback (RLHF), and architectural refinements all play critical roles in producing models that are not just capable but also aligned with human intentions.
Practical Implications
Understanding these fundamentals has direct practical value. Knowing how tokenization works helps you craft more efficient prompts and estimate API costs. Understanding attention mechanisms explains why models sometimes lose track of instructions in very long contexts. Recognizing that LLMs are fundamentally next-token predictors helps set realistic expectations about their reasoning abilities.
As the field continues to evolve with innovations like mixture-of-experts architectures, state-space models, and longer context windows, the transformer remains the foundation upon which modern AI language capabilities are built. A solid grasp of its mechanics is the starting point for anyone serious about working with or understanding artificial intelligence.