How do AI models understand relationships, context, and meaning in sequences? A key mechanism behind this is attention. It lets a model focus on, or in other words attend to, the tokens that matter most when processing or generating text, dynamically weighting their importance based on context. The process works with three components: Query (Q) → what this particular token is looking for, Key (K) → what each token offers, Value (V) → the actual information to pass along. The model compares Q with all Ks, then uses the resulting scores to weight the Vs.
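The Q/K/V recipe above can be sketched in a few lines of NumPy. This is a minimal toy example on random vectors (the shapes and seed are arbitrary choices for illustration), not a production implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) @ V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # compare each query with every key
    weights = softmax(scores, axis=-1)  # each row: how much to attend to each token
    return weights @ V, weights         # weighted mix of values

# Toy sequence: 4 tokens, 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
# In self-attention Q, K, V are learned projections of the same sequence;
# here we use the raw embeddings for simplicity.
out, w = attention(x, x, x)
print(out.shape)        # (4, 8)
print(w.sum(axis=-1))   # each row of weights sums to 1
```

Each output row is a context-dependent blend of all value vectors, with the blend decided by query-key similarity.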

Which tokens get more attention, though, is determined by the specific attention mechanism. Here are the main ones:

  1. Self-attention
    Lets each token attend directly to other tokens in the same sequence, making it possible to model long-range dependencies without recurrence. → Explore more

  2. Cross-attention

    Allows tokens from one sequence to attend to a different sequence, bringing together information from different sources or modalities (for example, a description and an image). It uses queries from one module (usually the decoder) and keys and values from the other (the encoder). → Explore more
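A toy NumPy sketch of this: queries come from one sequence (here labeled "decoder"), keys and values from another (the "encoder"). Shapes and random data are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
decoder_states = rng.standard_normal((3, 8))  # 3 target-side tokens → queries
encoder_states = rng.standard_normal((5, 8))  # 5 source-side tokens → keys/values

# Cross-attention: Q from the decoder, K and V from the encoder.
scores = decoder_states @ encoder_states.T / np.sqrt(8)
weights = softmax(scores)           # (3, 5): each target token weighs all source tokens
context = weights @ encoder_states  # (3, 8): source information pulled into the decoder
print(context.shape)
```

Contrast with self-attention, where all three of Q, K, and V come from the same sequence.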

  3. Causal Attention

    An attention mechanism where each token can only attend to its past and present tokens, never future ones. This is enforced with a causal mask, which blocks access to tokens that come later in the sequence. → Explore more

    An upgraded version, Causal Attention with Lookahead Keys (CASTLE), in contrast gives tokens the opportunity to attend to information hidden in future tokens, updating the keys (K) as more tokens are processed. → Explore more
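The causal mask itself is simple to sketch. In this toy NumPy example all raw scores are equal, which makes the effect of the mask easy to see in the resulting weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 4
scores = np.zeros((n, n))  # pretend pairwise scores for a 4-token sequence
# Causal mask: positions j > i (future tokens) are set to -inf before softmax,
# so they receive exactly zero attention weight.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[mask] = -np.inf
weights = softmax(scores)
print(np.round(weights, 2))
# Row i spreads its weight over tokens 0..i only; the upper triangle is all zeros.
```

This is why a decoder can be trained on whole sequences at once while still generating left to right.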

  4. Linear Attention
    Linear attention is a faster version of self-attention that avoids comparing all token pairs. By restructuring the computation with kernel functions (kernel functions are a way to compute similarity between vectors as if they were transformed into another space, without actually performing that transformation), it reduces complexity from O(N²) to O(N), making it much more efficient for long sequences. → Explore more
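The trick is reordering the matrix products: instead of building the N×N similarity matrix, you first summarize keys and values into a small d×d matrix. A toy NumPy sketch, using elu(x)+1 as one common choice of positive feature map (sizes and data are illustrative):

```python
import numpy as np

def phi(x):
    # Positive feature map (elu(x) + 1), standing in for a kernel feature map.
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(2)
N, d = 6, 4
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

# Quadratic order: phi(Q) @ phi(K).T builds an N x N matrix -> O(N^2).
A = phi(Q) @ phi(K).T
out_quadratic = (A / A.sum(axis=-1, keepdims=True)) @ V

# Linear order: compute phi(K).T @ V (a d x d summary) first -> O(N * d^2).
KV = phi(K).T @ V          # (d, d) summary of keys and values
Z = phi(K).sum(axis=0)     # (d,) normalizer
out_linear = (phi(Q) @ KV) / (phi(Q) @ Z)[:, None]

print(np.allclose(out_quadratic, out_linear))  # True: same result, different cost
```

Because the d×d summary does not grow with sequence length, the cost is linear in N.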

  5. Softmax Attention

    Computes pairwise token similarities, normalizes them with softmax (a function that turns a list of numbers into probabilities), and uses the result to combine information from all tokens. Scaled dot-product attention is the standard implementation of softmax attention in Transformers, described in the legendary “Attention is all you need” paper. → Explore more

  6. Sliding Window (Local Attention)

    It is a form of sparse attention that limits attention to a subset of tokens instead of all pairs. In sliding window attention, each token attends only to nearby tokens within a fixed window. This mechanism is widely used in long-context models. → Explore more
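A toy NumPy sketch of the banded mask this produces (uniform scores, so the local structure is visible directly in the weights; window size is an arbitrary choice):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n, window = 6, 2  # each token sees itself and up to 2 neighbors on each side
i = np.arange(n)
# Allowed positions: |i - j| <= window; everything else is masked out.
allowed = np.abs(i[:, None] - i[None, :]) <= window
scores = np.zeros((n, n))
scores[~allowed] = -np.inf
weights = softmax(scores)
print(np.round(weights, 2))  # banded matrix: attention stays local
```

Each row has at most 2·window + 1 nonzero entries, so cost grows linearly with sequence length instead of quadratically.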

  7. Global attention

    It is also a type of sparse attention: a few designated tokens are allowed to attend to every token in the sequence, while the rest have limited attention. Every token can also attend to these special tokens. It is often combined with local attention so the model still has access to the full context. → Explore more
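Sketching just the mask pattern makes the local + global combination concrete. Here token 0 plays the role of a [CLS]-style global token, a common but illustrative choice:

```python
import numpy as np

n, window = 8, 1
i = np.arange(n)
# Local band: each token sees itself and its immediate neighbors.
local = np.abs(i[:, None] - i[None, :]) <= window
global_tokens = np.array([0])   # e.g. a [CLS]-style token with global reach
mask = local.copy()
mask[global_tokens, :] = True   # the global token attends to everyone
mask[:, global_tokens] = True   # everyone attends to the global token
print(mask.astype(int))         # banded matrix plus a full first row and column
```

Distant tokens that cannot see each other directly can still exchange information in two hops via the global token.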

  8. FlashAttention

    An attention algorithm optimized for GPU hardware. Rather than storing the full attention matrix at once, it computes attention in small blocks (tiling) and keeps intermediate results in fast on-chip memory (SRAM). This reduces costly memory transfers to GPU HBM (High Bandwidth Memory), making attention much faster and more memory-efficient. → Explore more

  9. Multi-Head Attention (MHA)
    Splits attention into multiple heads (each an independent attention mechanism), so the model can learn different interaction patterns in parallel. It became the defining feature of Transformers and LLMs, and it still remains the baseline for novel attention variants. → Explore more
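A toy NumPy sketch of the split-attend-concatenate flow. Real implementations use learned weights and per-head projections; the random matrices, sizes, and 0.1 scaling here are illustrative stand-ins:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
N, d_model, n_heads = 4, 16, 4
d_head = d_model // n_heads
x = rng.standard_normal((N, d_model))

# One projection per role; heads are slices of the projected vectors.
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))

def split_heads(t):  # (N, d_model) -> (n_heads, N, d_head)
    return t.reshape(N, n_heads, d_head).transpose(1, 0, 2)

Q, K, V = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, N, N)
heads = softmax(scores) @ V                          # each head attends independently
out = heads.transpose(1, 0, 2).reshape(N, d_model) @ Wo  # concatenate and mix
print(out.shape)  # (4, 16)
```

Each head works in its own d_head-dimensional subspace, so different heads are free to specialize in different relationships.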

  10. Multi-Query Attention (MQA)

    This one is about inference practicality. It shares one key-value set across all query heads, which greatly speeds up decoder inference and reduces KV-cache cost. → Explore more

  11. Grouped-Query Attention (GQA)
    GQA is the compromise that won a lot of real-world adoption. Instead of one KV head for all queries as in MQA, it uses an intermediate number of KV heads, keeping much of MQA’s speed while staying closer to full multi-head quality. In practice, this became one of the most important “LLM-era” attention choices. → Explore more
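The MHA → GQA → MQA spectrum comes down to how many query heads share each KV head. A small sketch of that mapping (8 query heads is an arbitrary illustrative choice):

```python
n_q_heads = 8
# MHA: one KV head per query head. MQA: a single shared KV head.
# GQA sits in between: each group of query heads shares one KV head.
for n_kv_heads in (8, 4, 2, 1):   # 8 = MHA, middle values = GQA, 1 = MQA
    group_size = n_q_heads // n_kv_heads
    # Query head h reads the KV head of its group:
    kv_for_head = [h // group_size for h in range(n_q_heads)]
    print(n_kv_heads, kv_for_head)
# Fewer KV heads -> a proportionally smaller KV cache during decoding.
```

With 2 KV heads, for example, query heads 0–3 share one KV head and heads 4–7 share the other, cutting the KV cache by 4× relative to full MHA.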

  12. Multi-Head Latent Attention (MLA)
    MLA is an upgrade to attention for large-scale inference. Introduced with the DeepSeek-V2 release, it compresses the KV cache into latent vectors, reducing memory overhead and improving throughput. → Explore more

  13. Interleaved Head Attention (IHA)
    Introduces cross-head mixing via pseudo-heads, where each pseudo-head is a combination of all heads. They interact with each other and create many more possible attention patterns (scaling ~quadratically with the number of heads) so models can combine information from multiple places and handle multi-step reasoning better. → Explore more

    Other interesting attention mechanisms you should learn about:
