Attention Is All You Need: Understanding Transformer Architecture
The transformer architecture revolutionized artificial intelligence and natural language processing, establishing itself as the foundation for modern AI systems. From BERT to GPT, nearly every state-of-the-art language model relies on the principles that the groundbreaking paper “Attention Is All You Need” introduced. This article provides a comprehensive exploration of transformer architecture, breaking down its key components and explaining why it has become the dominant paradigm in deep learning.
1. What is a transformer?
A transformer is a neural network architecture that handles sequential data without relying on recurrence or convolution. Unlike traditional recurrent neural networks (RNNs) or convolutional neural networks (CNNs), the transformer model processes entire sequences simultaneously using a mechanism called attention. This parallel processing capability makes transformers significantly faster and more efficient at capturing long-range dependencies in data.
The transformer deep learning architecture specifically addresses the limitations of previous sequential models. Traditional RNNs struggled with long sequences because information had to pass through many time steps, leading to vanishing or exploding gradients. Transformers solve this problem by allowing every position in a sequence to directly attend to every other position, creating direct connections regardless of distance.
Key advantages of transformers
The transformer neural network offers several compelling advantages that explain its widespread adoption. First, parallelization allows transformers to process all tokens in a sequence simultaneously, dramatically reducing training time compared to sequential models. Second, the attention mechanism enables the model to capture dependencies between distant elements in a sequence more effectively than RNNs or LSTMs.
Third, transformers scale highly effectively, meaning developers can make them larger by adding more layers, attention heads, or hidden dimensions, consistently improving performance with increased capacity. This scalability has enabled the development of massive models like GPT-3 with 175 billion parameters.
Consider a simple example: translating the sentence “The bank can guarantee deposits will eventually cover future tuition costs because the bank is very reliable.” A transformer can simultaneously understand that “bank” refers to a financial institution in both instances, “cover” relates to financial protection, and “deposits” connects to the financial context—all without processing the sentence word by word.
2. The transformer architecture overview
The original transformer architecture follows an encoder-decoder structure, where the encoder processes the input sequence and the decoder generates the output sequence. Both encoder and decoder consist of stacked layers, each containing multiple sub-components that work together to transform the input data.
Encoder-decoder transformer structure
The encoder consists of a stack of identical layers, typically six in the original implementation. Each encoder layer contains two main sub-components: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. Residual connections and layer normalization wrap both sub-components.
The decoder also consists of a stack of identical layers, with an additional sub-component inserted between the self-attention and feed-forward layers. This third component performs multi-head attention over the encoder’s output, allowing the decoder to focus on relevant parts of the input sequence while generating output.
Here’s a simplified visualization of the flow:
Input Sequence → Input Embedding → Positional Encoding →
→ Encoder Stack (N layers) → Encoder Output →
→ Decoder Stack (N layers, also fed the shifted output embeddings) → Linear + Softmax → Final Output
Each encoder layer in Python looks like this:
import torch
import torch.nn as nn


class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionwiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention with residual connection
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        # Feed-forward with residual connection
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x
Dimensionality and hyperparameters
The transformer model uses several key hyperparameters that define its architecture. The model dimension \(d_{model}\) represents the size of the embedding space, typically 512 in the base model. The number of attention heads \(h\) usually equals 8, and the feed-forward dimension \(d_{ff}\) typically measures 2048, which is four times the model dimension.
Developers commonly set the number of encoder and decoder layers \(N\) to 6, though modern transformers like GPT use many more layers. You can adjust these hyperparameters based on the specific task and available computational resources.
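To make these numbers concrete, the following sketch counts the learnable parameters in one encoder layer of the base configuration. It assumes a bias term on every linear projection and a scale and shift vector in each layer normalization, which matches common PyTorch defaults but is an assumption here; `encoder_layer_params` is a hypothetical helper name.

```python
def encoder_layer_params(d_model: int, d_ff: int) -> dict:
    """Count learnable parameters in one encoder layer (biases included)."""
    # Four d_model x d_model projections (W_q, W_k, W_v, W_o), each with a bias.
    attention = 4 * (d_model * d_model + d_model)
    # Two linear maps, d_model -> d_ff and d_ff -> d_model, each with a bias.
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms, each with a scale and a shift vector of size d_model.
    layer_norm = 2 * (2 * d_model)
    return {
        "attention": attention,
        "feed_forward": feed_forward,
        "layer_norm": layer_norm,
        "total": attention + feed_forward + layer_norm,
    }


counts = encoder_layer_params(d_model=512, d_ff=2048)
for name, value in counts.items():
    print(f"{name:>12}: {value:,}")
```

Under these assumptions the feed-forward network holds roughly twice as many parameters as the attention block, which is one reason \(d_{ff}\) is a natural knob to turn when resizing a model.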
3. Self-attention mechanism: The heart of transformers
Self-attention forms the core innovation that makes transformers powerful. It allows each position in a sequence to attend to all positions in the same sequence, computing a weighted representation where the weights indicate the relevance of each position to the current position.
How self-attention works
The self-attention mechanism computes attention using three learned linear transformations of the input: Query (Q), Key (K), and Value (V). For each input position, we compute these three vectors and use them to determine how much attention to pay to every other position in the sequence.
The mathematical formula for scaled dot-product attention is:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
Here, \(d_k\) represents the dimension of the key vectors. The scaling factor \(\frac{1}{\sqrt{d_k}}\) prevents the dot products from growing too large in magnitude, which would push the softmax function into regions with extremely small gradients.
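The effect of the scaling factor can be checked numerically. The sketch below assumes query and key entries are independent draws from a standard normal distribution (an idealization, not a property of trained models) and estimates the variance of their dot product for several values of \(d_k\); `dot_product_variance` is a hypothetical helper.

```python
import random
import statistics

random.seed(0)


def dot_product_variance(d_k: int, samples: int = 2000) -> float:
    """Sample variance of q . k when q and k have independent N(0, 1) entries."""
    dots = []
    for _ in range(samples):
        q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(qi * ki for qi, ki in zip(q, k)))
    return statistics.variance(dots)


for d_k in (4, 64, 256):
    raw = dot_product_variance(d_k)
    # Dividing each dot product by sqrt(d_k) divides its variance by d_k.
    print(f"d_k={d_k:>3}  raw variance ~ {raw:7.1f}  scaled variance ~ {raw / d_k:.2f}")
```

The raw variance grows roughly in proportion to \(d_k\), while the scaled variance stays near 1, so the softmax inputs keep a similar spread regardless of the head dimension.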
Let’s break down the computation step by step:
- Compute queries, keys, and values: Learned weight matrices \(W^Q\), \(W^K\), and \(W^V\) transform each input token into three vectors.
- Calculate attention scores: We compare the query of each token against all keys using dot products: \(\text{score}_{ij} = q_i \cdot k_j\).
- Scale the scores: We divide by \(\sqrt{d_k}\) to prevent vanishing gradients.
- Apply softmax: We convert scores to probabilities: $$\text{attention\_weights}_{ij} = \frac{e^{\text{score}_{ij}}}{\sum_{k} e^{\text{score}_{ik}}}$$
- Weighted sum of values: The output for each position comes from a weighted combination of all value vectors.
Here’s a Python implementation:
import torch
import torch.nn.functional as F
import math


def scaled_dot_product_attention(query, key, value, mask=None):
    """
    Compute scaled dot-product attention.

    Args:
        query: Query tensor of shape (batch_size, num_heads, seq_len, d_k)
        key: Key tensor of shape (batch_size, num_heads, seq_len, d_k)
        value: Value tensor of shape (batch_size, num_heads, seq_len, d_v)
        mask: Optional mask tensor

    Returns:
        output: Attention output
        attention_weights: Attention weight matrix
    """
    d_k = query.size(-1)
    # Compute attention scores
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    # Apply mask if provided (for padding or causal masking)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    # Apply softmax to get attention weights
    attention_weights = F.softmax(scores, dim=-1)
    # Compute weighted sum of values
    output = torch.matmul(attention_weights, value)
    return output, attention_weights
Self-attention example
Consider the sentence “The animal didn’t cross the street because it was too tired.” When processing the word “it,” self-attention helps the model determine that “it” refers to “animal” rather than “street.”
The attention scores might look like this (simplified):
- “it” attends to “The” (0.02)
- “it” attends to “animal” (0.67)
- “it” attends to “didn’t” (0.04)
- “it” attends to “street” (0.09)
- “it” attends to “tired” (0.18)
The high attention weight of 0.67 on “animal” indicates the model has learned that “it” likely refers to the animal in this context.
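As a rough illustration of where such weights come from, the sketch below applies a softmax to a handful of made-up scaled scores for the query “it.” The score values are hypothetical and chosen only so that “animal” dominates; a trained model would produce its own.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scaled scores q_it . k_token / sqrt(d_k); values are made up.
tokens = ["The", "animal", "didn't", "street", "tired"]
scores = [-1.0, 2.5, -0.3, 0.5, 1.2]

weights = softmax(scores)
for token, weight in zip(tokens, weights):
    print(f"{token:>8}: {weight:.2f}")
```

Whatever the raw scores are, the softmax turns them into non-negative weights that sum to 1, and the largest score receives the largest share.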
4. Multi-head attention: Capturing diverse relationships
While single-headed attention proves powerful, multi-head attention extends this capability by allowing the model to jointly attend to information from different representation subspaces at different positions. Instead of performing a single attention function, multi-head attention performs multiple attention operations in parallel.
The multi-head attention mechanism
Multi-head attention projects the queries, keys, and values \(h\) times using different learned linear projections. We call each projection an “attention head.” The system then applies the attention function in parallel to each of these projected versions, producing \(h\) outputs that are concatenated and projected once more to produce the final values.
The formula for multi-head attention is:
$$ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O $$
where the system computes each head as:
$$ \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) $$
Here, \(W_i^Q \in \mathbb{R}^{d_{model} \times d_k}\), \(W_i^K \in \mathbb{R}^{d_{model} \times d_k}\), \(W_i^V \in \mathbb{R}^{d_{model} \times d_v}\), and \(W^O \in \mathbb{R}^{hd_v \times d_{model}}\) represent learned parameter matrices.
Why multiple heads?
Different attention heads can capture different types of relationships. For example, in the sentence “The bank can guarantee deposits,” one head might focus on syntactic relationships (subject-verb agreement), another on semantic relationships (financial terms clustering), and yet another on long-range dependencies.
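In the original transformer, with 8 attention heads and \(d_{model} = 512\), each head operates in a dimension of \(d_k = d_v = d_{model}/h = 512/8 = 64\). Because each head works in this reduced dimension, the total cost is similar to that of a single head with full dimensionality, so the model can learn diverse patterns without significantly increasing computation.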
Here’s a complete implementation of multi-head attention:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Linear projections for Q, K, V
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        # Output projection
        self.W_o = nn.Linear(d_model, d_model)

    def split_heads(self, x, batch_size):
        """Split the last dimension into (num_heads, d_k)"""
        x = x.view(batch_size, -1, self.num_heads, self.d_k)
        return x.transpose(1, 2)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        # Linear projections
        Q = self.split_heads(self.W_q(query), batch_size)
        K = self.split_heads(self.W_k(key), batch_size)
        V = self.split_heads(self.W_v(value), batch_size)
        # Scaled dot-product attention
        attn_output, attention_weights = scaled_dot_product_attention(
            Q, K, V, mask
        )
        # Concatenate heads
        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.view(batch_size, -1, self.d_model)
        # Final linear projection
        output = self.W_o(attn_output)
        return output
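The reshaping in `split_heads` and the later concatenation step can be hard to picture, so here is a small standalone sketch of just those two tensor operations. It uses the base sizes from the article and a random input; no attention is computed.

```python
import torch

batch_size, seq_len, d_model, num_heads = 2, 5, 512, 8
d_k = d_model // num_heads  # 64 dimensions per head

x = torch.randn(batch_size, seq_len, d_model)

# Split: (batch, seq, d_model) -> (batch, heads, seq, d_k)
heads = x.view(batch_size, seq_len, num_heads, d_k).transpose(1, 2)

# Merge: undo the transpose, then flatten the heads back into d_model
merged = heads.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)

print(heads.shape)
print(merged.shape)
print(torch.equal(x, merged))
```

The round trip returns the original tensor, which shows that splitting only reinterprets the last dimension as `num_heads` chunks of size `d_k`. The call to `contiguous()` is needed because `view` cannot operate on the non-contiguous layout left by `transpose`.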
Different types of attention in transformers
The encoder-decoder transformer architecture uses three different types of attention:
Encoder self-attention: Each position in the encoder attends to all positions in the previous encoder layer. This allows the encoder to build rich, contextualized representations.
Decoder self-attention: Each position in the decoder attends to all positions up to and including that position. We implement this using a mask to prevent positions from attending to future positions (causal masking).
Encoder-decoder attention: Each position in the decoder attends to all positions in the encoder output. This allows the decoder to focus on relevant parts of the input while generating output.
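The causal mask used in decoder self-attention can be sketched in a few lines. The example below follows the article's convention that 1 means "may attend" and 0 means "blocked"; the random scores stand in for \(QK^T/\sqrt{d_k}\), and it fills blocked entries with negative infinity rather than the `-1e9` used earlier, which is a choice made for this sketch.

```python
import torch
import torch.nn.functional as F

seq_len = 4
# Lower-triangular matrix: row i has ones in columns 0..i (1 = may attend).
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

torch.manual_seed(0)
scores = torch.randn(seq_len, seq_len)  # stand-in for Q K^T / sqrt(d_k)

# Blocked positions get -inf so the softmax assigns them exactly zero weight.
masked_scores = scores.masked_fill(causal_mask == 0, float("-inf"))
weights = F.softmax(masked_scores, dim=-1)

print(causal_mask)
print(weights)
```

Every row still sums to 1, but all weight above the diagonal is zero, so position \(i\) only mixes information from positions \(0\) through \(i\).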
5. Positional encoding: Adding sequence order information
Since transformers process all positions simultaneously rather than sequentially, they lack inherent information about the order of elements in a sequence. Positional encoding addresses this limitation by injecting information about the position of tokens into the input embeddings.
Why positional encoding matters
In recurrent networks, position information comes implicitly through sequential processing—the model processes token 1, then token 2, then token 3, and so on. Transformers, however, see all tokens at once. Without positional encoding, the transformer would treat “The cat chased the mouse” identically to “The mouse chased the cat” because the same set of tokens appears in both sentences.
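This order-blindness can be demonstrated directly. The sketch below builds a bare single-head self-attention with random weights and no positional information, then checks that shuffling the input tokens merely shuffles the outputs in the same way. The sizes, the weights, and the `self_attention` helper are all illustrative assumptions.

```python
import torch

torch.manual_seed(0)
seq_len, d = 5, 8
x = torch.randn(seq_len, d)  # five token embeddings, no position information
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))


def self_attention(tokens):
    """Single-head scaled dot-product self-attention without positions."""
    q, k, v = tokens @ W_q, tokens @ W_k, tokens @ W_v
    weights = torch.softmax(q @ k.T / d ** 0.5, dim=-1)
    return weights @ v


perm = torch.tensor([4, 2, 0, 1, 3])  # an arbitrary reordering of the tokens
out = self_attention(x)
out_shuffled = self_attention(x[perm])

# Each token gets the same output vector regardless of where it sits.
print(torch.allclose(out[perm], out_shuffled, atol=1e-4))
```

Because each token's representation does not depend on its position, anything computed from the set of outputs cannot distinguish one word order from another until positional information is added to the embeddings.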
Sinusoidal positional encoding
The original transformer uses sinusoidal functions to generate positional encodings. For each position \(pos\) and each dimension index \(i\) in the embedding, the system computes the positional encoding as:
$$ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) $$
$$ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) $$
This formulation offers several desirable properties. First, it produces unique encodings for each position. Second, the authors hypothesized that the sinusoidal pattern may allow the model to extrapolate to sequence lengths longer than those it sees during training. Third, for any fixed offset \(k\), the positional encoding at position \(pos + k\) can be written as a linear function of the encoding at position \(pos\), which helps the model learn relative positions.
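The linear relationship between positions follows from the angle-addition identities: each (sine, cosine) pair is rotated by an angle that depends only on the offset \(k\). The sketch below checks this numerically in plain Python; `d_model = 8`, the sample positions, and the helper names are arbitrary choices for illustration.

```python
import math

d_model = 8


def positional_encoding(pos: int) -> list:
    """Sinusoidal encoding for one position, following the two formulas above."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe


def shift(pe: list, k: int) -> list:
    """Apply the offset-k linear map: one 2x2 rotation per (sin, cos) pair."""
    out = []
    for i in range(d_model // 2):
        theta = k / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        out.append(s * math.cos(theta) + c * math.sin(theta))  # sin(a + b)
        out.append(c * math.cos(theta) - s * math.sin(theta))  # cos(a + b)
    return out


pos, k = 10, 3
direct = positional_encoding(pos + k)
shifted = shift(positional_encoding(pos), k)
max_error = max(abs(a - b) for a, b in zip(direct, shifted))
print(f"max difference: {max_error:.2e}")
```

Note that `shift` never looks at `pos`: the same rotation maps any position to the one \(k\) steps later, which is what makes relative offsets easy to express.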
Here’s a Python implementation:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_len=5000):
        super(PositionalEncoding, self).__init__()
        # Create positional encoding matrix
        pe = torch.zeros(max_seq_len, d_model)
        position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                             (-math.log(10000.0) / d_model))
        # Apply sine to even indices
        pe[:, 0::2] = torch.sin(position * div_term)
        # Apply cosine to odd indices
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        """
        Args:
            x: Tensor of shape (batch_size, seq_len, d_model)
        """
        x = x + self.pe[:, :x.size(1)]
        return x
Learned positional embeddings
While the original transformer uses fixed sinusoidal encodings, many modern transformers (like BERT and GPT) use learned positional embeddings instead. These work as simple embedding vectors that the model learns during training, similar to token embeddings.
class LearnedPositionalEmbedding(nn.Module):
    def __init__(self, max_seq_len, d_model):
        super(LearnedPositionalEmbedding, self).__init__()
        self.position_embeddings = nn.Embedding(max_seq_len, d_model)

    def forward(self, x):
        batch_size, seq_len = x.size(0), x.size(1)
        positions = torch.arange(seq_len, device=x.device).unsqueeze(0)
        positions = positions.expand(batch_size, -1)
        position_embeddings = self.position_embeddings(positions)
        return x + position_embeddings
Both approaches have trade-offs. Sinusoidal encodings are defined for any position, so they can be applied to sequences longer than those seen during training, while learned embeddings are limited to the maximum length fixed at training time but can potentially capture more nuanced positional information specific to the task.
6. Feed-forward networks and other components
Beyond attention and positional encoding, transformers include several other crucial components that contribute to their effectiveness.
Position-wise feed-forward networks
Each encoder and decoder layer contains a fully connected feed-forward network that applies to each position separately and identically. This consists of two linear transformations with a ReLU activation in between:
$$ \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 $$
The inner layer typically has a dimensionality of \(d_{ff} = 2048\) (four times the model dimension), while the input and output have dimensionality \(d_{model} = 512\). This expansion and compression allow the network to learn complex transformations.
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(PositionwiseFeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.linear2(self.relu(self.linear1(x)))
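“Position-wise” means the same weights are applied to every position independently, with no mixing across the sequence. The sketch below checks this with a small stand-in network built from `nn.Sequential` (an assumption made so the example runs on its own) by comparing one batched call against a loop over positions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 8, 32
# Stand-in for the feed-forward block: Linear -> ReLU -> Linear.
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

x = torch.randn(2, 5, d_model)  # (batch_size, seq_len, d_model)

with torch.no_grad():
    whole = ffn(x)  # applied to the full sequence at once
    # Applied to one position at a time, then stacked back along seq_len.
    per_position = torch.stack(
        [ffn(x[:, t, :]) for t in range(x.size(1))], dim=1
    )

print(torch.allclose(whole, per_position, atol=1e-5))
```

Since the two results agree, the feed-forward block adds no communication between tokens; in each layer that job is left entirely to attention.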
Layer normalization and residual connections
A residual connection followed by layer normalization wraps each sub-layer (self-attention and feed-forward) in the transformer:
$$ \text{Output} = \text{LayerNorm}(x + \text{Sublayer}(x)) $$
Residual connections help with gradient flow during training, allowing information to bypass layers directly. Layer normalization stabilizes the learning process by normalizing the inputs to have mean zero and variance one across the features.
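A minimal sketch of the normalization step for a single token's feature vector is shown below. It omits the learned scale and shift parameters that a full layer normalization also applies, and the input values and `eps` constant are arbitrary choices for illustration.

```python
import statistics


def layer_norm(features, eps=1e-5):
    """Normalize one feature vector to zero mean and (nearly) unit variance."""
    mean = statistics.fmean(features)
    var = statistics.pvariance(features, mu=mean)  # population variance
    return [(f - mean) / (var + eps) ** 0.5 for f in features]


x = [2.0, 4.0, 6.0, 8.0]           # one token's features
sublayer_out = [0.5, -1.0, 0.0, 1.5]  # hypothetical sub-layer output

# Post-norm pattern from the formula above: LayerNorm(x + Sublayer(x))
normalized = layer_norm([a + b for a, b in zip(x, sublayer_out)])

print([round(v, 3) for v in normalized])
print(round(statistics.fmean(normalized), 6), round(statistics.pvariance(normalized), 4))
```

The statistics are computed across the features of a single position, not across the batch, so the result does not depend on which other sequences happen to be in the batch.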
Dropout for regularization
The architecture applies dropout at several points in the transformer to prevent overfitting. The original paper applies it to the output of each sub-layer before adding the residual connection and to the sums of the embeddings and positional encodings; many implementations additionally apply dropout to the attention weights.
Complete transformer model
Here’s how all components come together in a simplified complete transformer:
class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model=512,
                 num_heads=8, num_layers=6, d_ff=2048, max_seq_len=5000,
                 dropout=0.1):
        super(Transformer, self).__init__()
        # Stored because encode() and decode() scale embeddings by sqrt(d_model)
        self.d_model = d_model
        # Embeddings
        self.src_embedding = nn.Embedding(src_vocab_size, d_model)
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_len)
        # Encoder
        self.encoder_layers = nn.ModuleList([
            EncoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        # Decoder (each DecoderLayer adds encoder-decoder attention between
        # the self-attention and feed-forward sub-layers)
        self.decoder_layers = nn.ModuleList([
            DecoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        # Output projection
        self.output_projection = nn.Linear(d_model, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def encode(self, src, src_mask):
        x = self.dropout(self.positional_encoding(
            self.src_embedding(src) * math.sqrt(self.d_model)))
        for layer in self.encoder_layers:
            x = layer(x, src_mask)
        return x

    def decode(self, tgt, memory, tgt_mask, src_mask):
        x = self.dropout(self.positional_encoding(
            self.tgt_embedding(tgt) * math.sqrt(self.d_model)))
        for layer in self.decoder_layers:
            x = layer(x, memory, tgt_mask, src_mask)
        return x

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        encoder_output = self.encode(src, src_mask)
        decoder_output = self.decode(tgt, encoder_output, tgt_mask, src_mask)
        output = self.output_projection(decoder_output)
        return output
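The `Transformer` class refers to a `DecoderLayer` that the article describes in section 2 but does not implement. The following is one possible sketch, not the article's own code. To stay self-contained it uses PyTorch's built-in `nn.MultiheadAttention` instead of the `MultiHeadAttention` class above, and it assumes a particular mask layout (spelled out in the comments) that may differ from what the rest of the article's code expects.

```python
import torch
import torch.nn as nn


class DecoderLayer(nn.Module):
    """Post-norm decoder layer: masked self-attention, cross-attention, FFN."""

    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        # Assumption: the built-in attention module stands in for the
        # article's MultiHeadAttention so this sketch runs on its own.
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, memory, tgt_mask=None, src_mask=None):
        # Assumed mask layout for this sketch (1 = keep, 0 = block):
        #   tgt_mask: (tgt_len, tgt_len) causal matrix
        #   src_mask: (batch, src_len) padding indicator
        # nn.MultiheadAttention uses the opposite convention (True = block).
        attn_mask = None if tgt_mask is None else (tgt_mask == 0)
        pad_mask = None if src_mask is None else (src_mask == 0)

        # 1) Masked self-attention over the decoder's own positions
        sa, _ = self.self_attn(x, x, x, attn_mask=attn_mask, need_weights=False)
        x = self.norm1(x + self.dropout(sa))
        # 2) Encoder-decoder attention: queries from the decoder, keys and
        #    values from the encoder output (memory)
        ca, _ = self.cross_attn(x, memory, memory,
                                key_padding_mask=pad_mask, need_weights=False)
        x = self.norm2(x + self.dropout(ca))
        # 3) Position-wise feed-forward network
        return self.norm3(x + self.dropout(self.feed_forward(x)))


torch.manual_seed(0)
layer = DecoderLayer(d_model=16, num_heads=4, d_ff=32).eval()
tgt = torch.randn(2, 5, 16)     # (batch, tgt_len, d_model)
memory = torch.randn(2, 7, 16)  # (batch, src_len, d_model)
causal = torch.tril(torch.ones(5, 5))
out = layer(tgt, memory, tgt_mask=causal)
print(out.shape)
```

The three residual-plus-normalization steps mirror the two in `EncoderLayer`; the only structural addition is the cross-attention sub-layer in the middle.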
7. Modern transformers: BERT, GPT, and beyond
The original encoder-decoder transformer architecture has spawned numerous variants that have achieved state-of-the-art results across diverse AI tasks.
BERT: Encoder-only transformers
BERT (Bidirectional Encoder Representations from Transformers) uses only the encoder portion of the transformer architecture. Researchers pre-train it using masked language modeling, where they mask random tokens and the model learns to predict them based on bidirectional context. This makes BERT excellent for understanding tasks like classification, question answering, and named entity recognition.
BERT demonstrated that transformers could learn powerful general-purpose language representations that transfer well to downstream tasks. The key innovation was bidirectional training—unlike previous models that trained left-to-right, BERT can attend to context from both directions simultaneously.
GPT: Decoder-only transformers
GPT (Generative Pre-trained Transformer) uses only the decoder portion of the transformer architecture without the encoder-decoder attention component. Researchers train it using causal language modeling, predicting the next token given all previous tokens. This autoregressive approach makes GPT excellent for generation tasks like text completion, summarization, and dialogue.
The GPT series has demonstrated remarkable scaling properties—as model size increases (GPT-2 with 1.5 billion parameters, GPT-3 with 175 billion), the model’s capabilities improve dramatically, even enabling few-shot and zero-shot learning on tasks it didn’t explicitly train for.
Transformers beyond language
While researchers originally designed transformers for natural language processing, they’ve successfully adapted them to many other domains:
Vision Transformers (ViT): These apply transformers directly to image patches, treating them as sequence tokens. ViT has achieved competitive or superior results to convolutional neural networks on image classification tasks.
Audio transformers: Models like Whisper use transformers for speech recognition and audio processing, demonstrating the architecture’s versatility across modalities.
Protein structure prediction: AlphaFold 2 uses transformer-like attention mechanisms to predict protein structures with remarkable accuracy.
Multi-modal transformers: Models like CLIP learn joint representations of images and text, enabling powerful cross-modal understanding and generation.
Key innovations in modern transformers
Modern transformer implementations have introduced numerous improvements:
Efficient attention mechanisms: Techniques like sparse attention and linear attention reduce the quadratic complexity of standard attention, while FlashAttention computes exact attention with far less memory traffic. Both directions enable processing of longer sequences.
Better positional encodings: Relative positional encodings, rotary position embeddings (RoPE), and ALiBi provide improved ways to encode position information.
Architectural modifications: Changes like pre-layer normalization, GLU variants in feed-forward networks, and mixture-of-experts layers improve performance and efficiency.
Training strategies: Techniques like curriculum learning, learning rate warm-up, and careful initialization help train larger models more effectively.
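As one concrete example of learning rate warm-up, the sketch below implements the schedule described in the original transformer paper: the rate rises linearly for `warmup_steps` updates and then decays with the inverse square root of the step number. The function name `transformer_lr` is hypothetical, and the defaults are the base-model values from that paper.

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # guard against step 0, where step^-0.5 is undefined
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


for step in (1, 1000, 4000, 16000, 64000):
    print(f"step {step:>6}: lr = {transformer_lr(step):.6f}")
```

The two branches of the `min` meet exactly at `step == warmup_steps`, which is where the learning rate peaks before the decay phase takes over.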
8. Conclusion
The transformer architecture represents a paradigm shift in deep learning, replacing sequential processing with parallel attention mechanisms. By understanding self-attention, multi-head attention, positional encoding, and the encoder-decoder structure, you can grasp why transformers have become the foundation of modern AI systems. The architecture’s ability to capture long-range dependencies, process sequences in parallel, and scale effectively has made it the model of choice for everything from language understanding to computer vision.
Whether you work with BERT for natural language understanding, GPT for text generation, or explore transformers in new domains, the core principles remain the same. The attention mechanism allows models to dynamically focus on relevant information, multi-head attention captures diverse relationships, and positional encoding preserves sequence order. As transformers continue to evolve and expand into new applications, these fundamental concepts provide the foundation for understanding and building cutting-edge AI systems.