How do AI models understand relationships, context, and meaning in sequences? A key mechanism behind this is attention. It lets a model focus on, or in other words attend to, the tokens that matter most when processing or generating text, dynamically weighing their importance based on context. The process works with three components: Query (Q) → what this particular token is looking for, Key (K) → what each token offers, Value (V) → the actual information to pass along. The model compares Q with all Ks, then uses the resulting scores to weight the Vs.
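In code, this Q/K/V comparison is just a few matrix operations. Here is a minimal sketch with a single query and made-up numbers (purely illustrative, not from any particular model):

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy setup: one query vector and three tokens, each with a key and a value.
q = np.array([1.0, 0.0])                      # what this token is looking for
K = np.array([[1.0, 0.0],                     # what each token offers
              [0.0, 1.0],
              [0.9, 0.1]])
V = np.array([[10.0, 0.0],                    # the information each token carries
              [0.0, 10.0],
              [5.0, 5.0]])

scores = K @ q                 # compare Q against every K
weights = softmax(scores)      # turn scores into attention weights
output = weights @ V           # weighted mix of the Vs

print(weights)  # tokens whose keys match q get larger weights
print(output)
```

Tokens whose keys are similar to the query end up dominating the output mix.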
But which tokens to attend to more is determined by the specific attention mechanism. Here are the main ones:
Self-attention
Lets each token attend directly to other tokens in the same sequence, making it possible to model long-range dependencies without recurrence. → Explore more
Cross-attention
Allows tokens from one sequence to attend to a different sequence, bringing information from different sources or modalities together (for example, a description and an image). It uses queries from one module (usually the decoder) and keys and values from the other (the encoder). → Explore more
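A hedged sketch of the idea: queries come from one sequence (say, decoder states), keys and values from another (encoder states). Shapes, numbers, and projection matrices below are illustrative only, not from a real model:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 4
dec = rng.normal(size=(2, d))   # 2 decoder tokens (queries come from here)
enc = rng.normal(size=(5, d))   # 5 encoder tokens (keys/values come from here)

# Illustrative projection matrices (learned in a real model).
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = dec @ Wq, enc @ Wk, enc @ Wv
attn = softmax(Q @ K.T / np.sqrt(d))   # (2, 5): each decoder token over encoder tokens
out = attn @ V                         # (2, d): encoder info pulled into decoder tokens
print(out.shape)
```

The only difference from self-attention is where Q versus K/V originate.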
Causal Attention
An attention mechanism where each token can only attend to its past and present tokens, never future ones. This is enforced with a causal mask, which blocks access to tokens that come later in the sequence. → Explore more
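The mask itself is simple: set every score for a future position to negative infinity before the softmax, so its weight becomes exactly zero. A small NumPy sketch:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

N, d = 4, 8
rng = np.random.default_rng(1)
Q = rng.normal(size=(N, d))
K = rng.normal(size=(N, d))

scores = Q @ K.T / np.sqrt(d)
# Causal mask: positions j > i (future tokens) get -inf before the softmax,
# so their attention weight becomes exactly zero.
mask = np.triu(np.ones((N, N), dtype=bool), k=1)
scores[mask] = -np.inf

weights = softmax(scores)
print(np.round(weights, 2))  # strictly lower-triangular pattern
```

Token i's row has nonzero weights only at positions 0..i, which is what makes autoregressive generation consistent between training and inference.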
An upgraded version, Causal Attention with Lookahead Keys (CASTLE), in contrast gives tokens a way to use information carried by future tokens: it updates the keys (K) as more tokens are processed. → Explore more
Linear Attention
Linear attention is a faster version of self-attention that avoids comparing all token pairs. By restructuring the computation with kernel functions (a way to compute similarity between vectors as if they were transformed into another space, without actually performing that transformation), it reduces complexity from O(N²) to O(N), making it much more efficient for long sequences. → Explore more
Softmax Attention
Computes pairwise token similarities, normalizes them with softmax (a function that turns a list of numbers into probabilities), and uses the result to combine information from all tokens. Scaled dot-product attention is the standard implementation of softmax attention in Transformers, described in the legendary “Attention is all you need” paper. → Explore more
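The cost difference between softmax attention and linear attention comes down to where you put the parentheses: softmax attention computes (QKᵀ)V, materializing an N×N matrix, while linear attention applies a feature map φ and computes φ(Q)(φ(K)ᵀV) instead. A rough sketch, using element-wise exp as a stand-in feature map (my assumption for illustration, not the kernel from any particular paper):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 6, 4
Q = rng.normal(size=(N, d))
K = rng.normal(size=(N, d))
V = rng.normal(size=(N, d))

# Softmax attention: the (Q K^T) step builds an N x N matrix -> O(N^2).
scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out_softmax = weights @ V

# Linear attention: pick a positive feature map phi, then reassociate:
# phi(Q) @ (phi(K).T @ V) only ever builds d x d matrices -> O(N).
phi = lambda x: np.exp(x)            # illustrative choice of feature map
kv = phi(K).T @ V                    # (d, d), independent of N
z = phi(K).sum(axis=0)               # normalizer, shape (d,)
out_linear = (phi(Q) @ kv) / (phi(Q) @ z)[:, None]

print(out_softmax.shape, out_linear.shape)  # same shape, different cost profile
```

The two outputs are not numerically identical (linear attention is an approximation), but the reordered version never touches an N×N object.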
Sliding Window (Local Attention)
It is a form of sparse attention that limits attention to a subset of tokens instead of all pairs. In sliding window attention, each token attends only to nearby tokens within a fixed window. This mechanism is widely used in long-context models. → Explore more
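The sparsity pattern is just a banded mask. A small sketch of how such a mask might be built (window size chosen arbitrarily for illustration):

```python
import numpy as np

def sliding_window_mask(n, window):
    # True where attention is allowed:
    # token i sees tokens j with |i - j| <= window.
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = sliding_window_mask(6, window=1)
print(mask.astype(int))
# Each row has at most 3 ones: the token itself plus one neighbor on each side.
```

Applied before the softmax (disallowed positions set to -inf), this turns the dense N×N pattern into a narrow band whose cost grows linearly with sequence length.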
Global attention
It is also a type of sparse attention, in which a few designated tokens attend to every token in the sequence, while the rest have limited attention. Every token can also attend to these special tokens. It is often combined with local attention so the model keeps access to the full context. → Explore more
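One way such a combined local + global pattern might be built (which tokens are designated as global is an assumption here; models typically pick special tokens such as the first position):

```python
import numpy as np

def local_global_mask(n, window, global_tokens):
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window  # local band
    for g in global_tokens:
        mask[g, :] = True   # the global token attends to everyone
        mask[:, g] = True   # everyone attends to the global token
    return mask

mask = local_global_mask(8, window=1, global_tokens=[0])
print(mask.astype(int))  # banded matrix plus a full first row and column
```

The global rows and columns act as shortcuts that let information hop across the whole sequence even though most positions only see their neighbors.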
FlashAttention
An attention algorithm optimized for hardware (GPUs). Instead of storing the full attention matrix at once, it computes attention in small blocks (tiling) and keeps intermediate results in fast on-chip memory (SRAM). This reduces costly memory transfers to GPU HBM (High Bandwidth Memory), making attention much faster and more memory-efficient. → Explore more
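The trick that makes tiling possible is the "online" softmax: you can process keys block by block, keeping only a running max and running sum, and still get exactly the standard result. A simplified single-query sketch (real FlashAttention tiles queries too and handles the backward pass on-chip):

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, block = 16, 8, 4
q = rng.normal(size=(d,))
K = rng.normal(size=(N, d))
V = rng.normal(size=(N, d))

# Reference: full softmax attention for one query.
s = K @ q / np.sqrt(d)
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V

# Blockwise "online" version: never holds all N scores at once.
m = -np.inf          # running max of scores seen so far
l = 0.0              # running sum of exp(scores - m)
acc = np.zeros(d)    # running weighted sum of values
for start in range(0, N, block):
    Kb, Vb = K[start:start + block], V[start:start + block]
    sb = Kb @ q / np.sqrt(d)
    m_new = max(m, sb.max())
    scale = np.exp(m - m_new)          # rescale old accumulators
    p = np.exp(sb - m_new)
    l = l * scale + p.sum()
    acc = acc * scale + p @ Vb
    m = m_new

out = acc / l
print(np.allclose(out, ref))  # True: identical result, block-at-a-time memory
```

The rescaling step is what lets each block be folded in without ever seeing the global maximum up front.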
Multi-Head Attention (MHA)
Splits attention into multiple heads (each an independent attention mechanism), so the model can learn different interaction patterns in parallel. It became the defining feature of Transformers and LLMs, and it remains the baseline for novel attention variants. → Explore more
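Mechanically, the heads are created by splitting the feature dimension after a single projection, running attention per head, and concatenating the results. A compact sketch (dimensions chosen for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(4)
N, d_model, h = 5, 16, 4
d_head = d_model // h

x = rng.normal(size=(N, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

# Project once, then split the feature dimension into h independent heads.
def split_heads(t):
    return t.reshape(N, h, d_head).transpose(1, 0, 2)   # (h, N, d_head)

Q, K, V = (split_heads(x @ W) for W in (Wq, Wk, Wv))

attn = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_head))  # (h, N, N)
out = (attn @ V).transpose(1, 0, 2).reshape(N, d_model)     # concat heads back
print(out.shape)  # (5, 16)
```

Each of the h score matrices can learn a different relational pattern, which is the whole point of the split.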
Multi-Query Attention (MQA)
This one is about inference practicality. It shares one key-value set across all query heads, which greatly speeds up decoder inference and reduces KV-cache cost. → Explore more
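In code, "one key-value set across all query heads" is just broadcasting, and the KV-cache saving follows directly. A sketch (sizes are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(5)
N, h, d_head = 6, 8, 4

Q = rng.normal(size=(h, N, d_head))   # one query per head, as in MHA
k = rng.normal(size=(N, d_head))      # a SINGLE shared key head...
v = rng.normal(size=(N, d_head))      # ...and a single shared value head

# Broadcasting shares the same K/V across all h query heads.
attn = softmax(Q @ k.T / np.sqrt(d_head))   # (h, N, N)
out = attn @ v                              # (h, N, d_head)

# KV-cache comparison: MHA stores h key/value heads per token, MQA stores one.
mha_cache = 2 * h * N * d_head
mqa_cache = 2 * 1 * N * d_head
print(out.shape, f"cache reduced {mha_cache // mqa_cache}x")
```

With h = 8 query heads, the KV cache shrinks 8x, which is exactly what matters during autoregressive decoding.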
Grouped-Query Attention (GQA)
GQA is the compromise that won a lot of real-world adoption. Instead of one KV head for all queries as in MQA, it uses an intermediate number of KV heads, keeping much of MQA's speed while staying closer to full multi-head quality. In practice, this became one of the most important "LLM-era" attention choices. → Explore more
Multi-Head Latent Attention (MLA)
MLA is an upgrade to attention for large-scale inference. Introduced with the DeepSeek-V2 release, it compresses the KV cache into latent vectors, reducing memory overhead and improving throughput. → Explore more
Interleaved Head Attention (IHA)
Introduces cross-head mixing via pseudo-heads, where each pseudo-head is a combination of all heads. These pseudo-heads interact with each other, creating many more possible attention patterns (scaling roughly quadratically with the number of heads), so models can combine information from multiple places and handle multi-step reasoning better. → Explore more
Other interesting attention mechanisms you should learn about:
