Written By: Qingyang Xu (et AI)

Last Modified: April 16, 2026

0. Taxonomy of attention mechanisms

  1. What score function do we use?

    Softmax attention, sigmoid attention, differential attention, gated attention.

  2. Which tokens may interact?

    Full attention, causal attention, sliding-window/local attention, structured sparse attention.

  3. How do we store and serve K/V at inference?

    Multi-head attention (MHA), multi-query attention (MQA), grouped-query attention (GQA), multi-head latent attention (MLA).

  4. How do we make it fast on hardware?

    FlashAttention and related kernels.

This split is useful because many papers change only one of these axes while keeping the others fixed.

1. Baseline: causal self-attention

For a decoder-only LLM with hidden states $X \in \mathbb{R}^{T \times d_{\text{model}}}$, the query, key, and value projections are

$$ Q = XW_Q,\qquad K = XW_K,\qquad V = XW_V $$

and the standard causal attention output is

$$ Y = \operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_h}} + M + B\right)V $$

where $M$ is the causal mask ($-\infty$ above the diagonal) and $B$ is an optional positional bias term. In multi-head form, each head $h$ gets its own $(Q_h,K_h,V_h)$, and the outputs are concatenated and projected:

$$ Y = \operatorname{Concat}(Y_1,\dots,Y_H)W_O,\quad Y_h = \operatorname{Attn}(Q_h,K_h,V_h). $$
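To make the shapes concrete, here is a minimal PyTorch sketch of these equations. The function name and the sizes are illustrative assumptions, and the positional bias $B$ is omitted; only the causal mask $M$ is applied.

```python
import math
import torch

def causal_multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    """Minimal causal MHA sketch: Y = Concat(Y_1, ..., Y_H) W_O.

    X:             (T, d_model) hidden states
    W_Q, W_K, W_V: (d_model, d_model) projection matrices
    W_O:           (d_model, d_model) output projection
    """
    T, d_model = X.shape
    d_h = d_model // n_heads  # per-head dimension

    # Q = X W_Q, K = X W_K, V = X W_V, then split into H heads: (H, T, d_h)
    Q = (X @ W_Q).view(T, n_heads, d_h).transpose(0, 1)
    K = (X @ W_K).view(T, n_heads, d_h).transpose(0, 1)
    V = (X @ W_V).view(T, n_heads, d_h).transpose(0, 1)

    # Scaled dot-product scores: (H, T, T)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_h)

    # Causal mask M: -inf above the diagonal, so token t attends only to positions <= t
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))

    # Row-wise softmax, then weighted sum of value vectors: (H, T, d_h)
    attn = torch.softmax(scores, dim=-1)
    Y_heads = attn @ V

    # Concatenate heads back to (T, d_model) and apply the output projection W_O
    return Y_heads.transpose(0, 1).reshape(T, d_model) @ W_O
```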

The key idea is that each token computes a similarity score against previous tokens, normalizes those scores row-wise with softmax, and takes a weighted sum of the value vectors. This gives strong content-based routing and is the main reason Transformers displaced recurrence for language modeling.
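As a quick sanity check on the causal mask (again a sketch with made-up sizes, reusing the function defined above), the outputs at earlier positions should not change when a later token is perturbed:

```python
torch.manual_seed(0)
T, d_model, H = 8, 64, 4  # illustrative sizes
X = torch.randn(T, d_model)
W_Q, W_K, W_V, W_O = (torch.randn(d_model, d_model) / d_model ** 0.5 for _ in range(4))

Y1 = causal_multi_head_attention(X, W_Q, W_K, W_V, W_O, H)

# Perturb only the last token; under causal masking, earlier outputs must be unchanged
X2 = X.clone()
X2[-1] += torch.randn(d_model)
Y2 = causal_multi_head_attention(X2, W_Q, W_K, W_V, W_O, H)

print(torch.allclose(Y1[:-1], Y2[:-1]))  # True: positions before T-1 never see the future
```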

Why multi-head attention matters