How do AI models understand relationships, context, and meaning in sequences? A key mechanism behind this is attention. It lets a model focus on, or in other words attend to, the tokens that matter most when processing or generating text, dynamically weighing their importance based on context. The process works with three components: Query (Q) → what this particular token is looking for, Key (K) → what each token offers, Value (V) → the actual information to pass along. The model compares Q with all Ks, then uses the resulting scores to weight the Vs.
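In code, this Q/K/V comparison is just a few matrix operations. Here is a minimal sketch with a single query and made-up numbers (purely illustrative, not from any particular model):

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy setup: one query vector and three tokens, each with a key and a value.
q = np.array([1.0, 0.0])                      # what this token is looking for
K = np.array([[1.0, 0.0],                     # what each token offers
              [0.0, 1.0],
              [0.9, 0.1]])
V = np.array([[10.0, 0.0],                    # the information each token carries
              [0.0, 10.0],
              [5.0, 5.0]])

scores = K @ q                 # compare Q against every K
weights = softmax(scores)      # turn scores into attention weights
output = weights @ V           # weighted mix of the Vs

print(weights)  # tokens whose keys match q get larger weights
print(output)
```

Tokens whose keys are similar to the query end up dominating the output mix.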
But which tokens to attend to more is determined by the specific attention mechanism. Here are the main ones:
Self-attention
Lets each token attend directly to other tokens in the same sequence, making it possible to model long-range dependencies without recurrence. → Explore more
Cross-attention
Allows tokens from one sequence to attend to a different sequence, bringing information from different sources or modalities together (for example, a description and an image). It uses queries from one module (usually the decoder) and keys and values from the other (the encoder). → Explore more
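A hedged sketch of the idea: queries come from one sequence (say, decoder states), keys and values from another (encoder states). Shapes, numbers, and projection matrices below are illustrative only, not from a real model:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 4
dec = rng.normal(size=(2, d))   # 2 decoder tokens (queries come from here)
enc = rng.normal(size=(5, d))   # 5 encoder tokens (keys/values come from here)

# Illustrative projection matrices (learned in a real model).
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = dec @ Wq, enc @ Wk, enc @ Wv
attn = softmax(Q @ K.T / np.sqrt(d))   # (2, 5): each decoder token over encoder tokens
out = attn @ V                         # (2, d): encoder info pulled into decoder tokens
print(out.shape)
```

The only difference from self-attention is where Q versus K/V originate.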
Causal Attention
An attention mechanism where each token can only attend to its past and present tokens, never future ones. This is enforced with a causal mask, which blocks access to tokens that come later in the sequence. → Explore more
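The mask itself is simple: set every score for a future position to negative infinity before the softmax, so its weight becomes exactly zero. A small NumPy sketch:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

N, d = 4, 8
rng = np.random.default_rng(1)
Q = rng.normal(size=(N, d))
K = rng.normal(size=(N, d))

scores = Q @ K.T / np.sqrt(d)
# Causal mask: positions j > i (future tokens) get -inf before the softmax,
# so their attention weight becomes exactly zero.
mask = np.triu(np.ones((N, N), dtype=bool), k=1)
scores[mask] = -np.inf

weights = softmax(scores)
print(np.round(weights, 2))  # strictly lower-triangular pattern
```

Token i's row has nonzero weights only at positions 0..i, which is what makes autoregressive generation consistent between training and inference.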
An upgraded version, Causal Attention with Lookahead Keys (CASTLE), in contrast gives tokens a way to use information carried by future tokens: it updates the keys (K) as more tokens are processed. → Explore more
Linear Attention
Linear attention is a faster version of self-attention that avoids comparing all token pairs. By restructuring the computation with kernel functions (a way to compute similarity between vectors as if they were transformed into another space, without actually performing that transformation), it reduces complexity from O(N²) to O(N), making it much more efficient for long sequences. → Explore more
Softmax Attention
Computes pairwise token similarities, normalizes them with softmax (a function that turns a list of numbers into probabilities), and uses the result to combine information from all tokens. Scaled dot-product attention is the standard implementation of softmax attention in Transformers, described in the legendary “Attention is all you need” paper. → Explore more
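The cost difference between softmax attention and linear attention comes down to where you put the parentheses: softmax attention computes (QKᵀ)V, materializing an N×N matrix, while linear attention applies a feature map φ and computes φ(Q)(φ(K)ᵀV) instead. A rough sketch, using element-wise exp as a stand-in feature map (my assumption for illustration, not the kernel from any particular paper):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 6, 4
Q = rng.normal(size=(N, d))
K = rng.normal(size=(N, d))
V = rng.normal(size=(N, d))

# Softmax attention: the (Q K^T) step builds an N x N matrix -> O(N^2).
scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out_softmax = weights @ V

# Linear attention: pick a positive feature map phi, then reassociate:
# phi(Q) @ (phi(K).T @ V) only ever builds d x d matrices -> O(N).
phi = lambda x: np.exp(x)            # illustrative choice of feature map
kv = phi(K).T @ V                    # (d, d), independent of N
z = phi(K).sum(axis=0)               # normalizer, shape (d,)
out_linear = (phi(Q) @ kv) / (phi(Q) @ z)[:, None]

print(out_softmax.shape, out_linear.shape)  # same shape, different cost profile
```

The two outputs are not numerically identical (linear attention is an approximation), but the reordered version never touches an N×N object.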
Sliding Window (Local Attention)
It is a form of sparse attention that limits attention to a subset of tokens instead of all pairs. In sliding window attention, each token attends only to nearby tokens within a fixed window. This mechanism is widely used in long-context models. → Explore more
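The sparsity pattern is just a banded mask. A small sketch of how such a mask might be built (window size chosen arbitrarily for illustration):

```python
import numpy as np

def sliding_window_mask(n, window):
    # True where attention is allowed:
    # token i sees tokens j with |i - j| <= window.
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = sliding_window_mask(6, window=1)
print(mask.astype(int))
# Each row has at most 3 ones: the token itself plus one neighbor on each side.
```

Applied before the softmax (disallowed positions set to -inf), this turns the dense N×N pattern into a narrow band whose cost grows linearly with sequence length.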
Global attention
It is also a type of sparse attention, in which a few designated tokens attend to every token in the sequence, while the rest have limited attention. Every token can also attend to these special tokens. It is often combined with local attention so the model keeps access to the full context. → Explore more
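One way such a combined local + global pattern might be built (which tokens are designated as global is an assumption here; models typically pick special tokens such as the first position):

```python
import numpy as np

def local_global_mask(n, window, global_tokens):
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window  # local band
    for g in global_tokens:
        mask[g, :] = True   # the global token attends to everyone
        mask[:, g] = True   # everyone attends to the global token
    return mask

mask = local_global_mask(8, window=1, global_tokens=[0])
print(mask.astype(int))  # banded matrix plus a full first row and column
```

The global rows and columns act as shortcuts that let information hop across the whole sequence even though most positions only see their neighbors.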
FlashAttention
An attention algorithm optimized for hardware (GPUs). Instead of storing the full attention matrix at once, it computes attention in small blocks (tiling) and keeps intermediate results in fast on-chip memory (SRAM). This reduces costly memory transfers to GPU HBM (High Bandwidth Memory), making attention much faster and more memory-efficient. → Explore more
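The trick that makes tiling possible is the "online" softmax: you can process keys block by block, keeping only a running max and running sum, and still get exactly the standard result. A simplified single-query sketch (real FlashAttention tiles queries too and handles the backward pass on-chip):

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, block = 16, 8, 4
q = rng.normal(size=(d,))
K = rng.normal(size=(N, d))
V = rng.normal(size=(N, d))

# Reference: full softmax attention for one query.
s = K @ q / np.sqrt(d)
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V

# Blockwise "online" version: never holds all N scores at once.
m = -np.inf          # running max of scores seen so far
l = 0.0              # running sum of exp(scores - m)
acc = np.zeros(d)    # running weighted sum of values
for start in range(0, N, block):
    Kb, Vb = K[start:start + block], V[start:start + block]
    sb = Kb @ q / np.sqrt(d)
    m_new = max(m, sb.max())
    scale = np.exp(m - m_new)          # rescale old accumulators
    p = np.exp(sb - m_new)
    l = l * scale + p.sum()
    acc = acc * scale + p @ Vb
    m = m_new

out = acc / l
print(np.allclose(out, ref))  # True: identical result, block-at-a-time memory
```

The rescaling step is what lets each block be folded in without ever seeing the global maximum up front.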
Multi-Head Attention (MHA)
Splits attention into multiple heads (each an independent attention mechanism), so the model can learn different interaction patterns in parallel. It became the defining feature of Transformers and LLMs, and it remains the baseline for novel attention variants. → Explore more
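Mechanically, the heads are created by splitting the feature dimension after a single projection, running attention per head, and concatenating the results. A compact sketch (dimensions chosen for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(4)
N, d_model, h = 5, 16, 4
d_head = d_model // h

x = rng.normal(size=(N, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

# Project once, then split the feature dimension into h independent heads.
def split_heads(t):
    return t.reshape(N, h, d_head).transpose(1, 0, 2)   # (h, N, d_head)

Q, K, V = (split_heads(x @ W) for W in (Wq, Wk, Wv))

attn = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_head))  # (h, N, N)
out = (attn @ V).transpose(1, 0, 2).reshape(N, d_model)     # concat heads back
print(out.shape)  # (5, 16)
```

Each of the h score matrices can learn a different relational pattern, which is the whole point of the split.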
Multi-Query Attention (MQA)
This one is about inference practicality. It shares one key-value set across all query heads, which greatly speeds up decoder inference and reduces KV-cache cost. → Explore more
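In code, "one key-value set across all query heads" is just broadcasting, and the KV-cache saving follows directly. A sketch (sizes are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(5)
N, h, d_head = 6, 8, 4

Q = rng.normal(size=(h, N, d_head))   # one query per head, as in MHA
k = rng.normal(size=(N, d_head))      # a SINGLE shared key head...
v = rng.normal(size=(N, d_head))      # ...and a single shared value head

# Broadcasting shares the same K/V across all h query heads.
attn = softmax(Q @ k.T / np.sqrt(d_head))   # (h, N, N)
out = attn @ v                              # (h, N, d_head)

# KV-cache comparison: MHA stores h key/value heads per token, MQA stores one.
mha_cache = 2 * h * N * d_head
mqa_cache = 2 * 1 * N * d_head
print(out.shape, f"cache reduced {mha_cache // mqa_cache}x")
```

With h = 8 query heads, the KV cache shrinks 8x, which is exactly what matters during autoregressive decoding.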
Grouped-Query Attention (GQA)
GQA is the compromise that won a lot of real-world adoption. Instead of one KV head for all queries as in MQA, it uses an intermediate number of KV heads, keeping much of MQA's speed while staying closer to full multi-head quality. In practice, this became one of the most important "LLM-era" attention choices. → Explore more
Multi-Head Latent Attention (MLA)
MLA is an upgrade to attention for large-scale inference. Introduced with the DeepSeek-V2 release, it compresses the KV cache into latent vectors, reducing memory overhead and improving throughput. → Explore more
Interleaved Head Attention (IHA)
Introduces cross-head mixing via pseudo-heads, where each pseudo-head is a combination of all heads. These pseudo-heads interact with each other, creating many more possible attention patterns (scaling roughly quadratically with the number of heads), so models can combine information from multiple places and handle multi-step reasoning better. → Explore more
Other interesting attention mechanisms you should learn about:
