Written By: Qingyang Xu (et AI)
Last Modified: April 16, 2026
We focus on decoder-only LLMs. A useful way to think about “architecture beyond attention” is that modern LLM design changes one or more of four places in the block:
$$ h^{(\ell+1/2)}=h^{(\ell)}+\mathrm{TokenMixer}_\ell\big(\mathrm{Norm}(h^{(\ell)})\big) \\ h^{(\ell+1)}=h^{(\ell+1/2)}+\mathrm{ChannelMixer}_\ell\big(\mathrm{Norm}(h^{(\ell+1/2)})\big) $$
Innovation beyond attention either replaces the token mixer, replaces the channel mixer, adds conditional routing, or adds a separate memory / latent interface. That framing covers most of the important families below. (arXiv)
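To make the framing concrete, here is a minimal pre-norm block sketch in PyTorch. It is illustrative only: `token_mixer` and `channel_mixer` are placeholder modules (attention, an SSM, a dense or gated MLP, etc.), and the names are ours, not from any particular codebase.

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Generic decoder block in the form above (a sketch, not a specific model)."""

    def __init__(self, d_model: int, token_mixer: nn.Module, channel_mixer: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.token_mixer = token_mixer      # e.g. attention or a state-space layer
        self.channel_mixer = channel_mixer  # e.g. a dense or gated MLP

    def forward(self, h):
        # h^{(l+1/2)} = h^{(l)} + TokenMixer(Norm(h^{(l)}))
        h = h + self.token_mixer(self.norm1(h))
        # h^{(l+1)} = h^{(l+1/2)} + ChannelMixer(Norm(h^{(l+1/2)}))
        h = h + self.channel_mixer(self.norm2(h))
        return h
```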
In the original Transformer-style decoder block, the channel mixer is a dense feed-forward network
$$ \mathrm{FFN}(x)=W_2\phi(W_1x+b_1)+b_2, $$
applied independently at each token. A large fraction of post-2020 LLM gains came from improving this sublayer rather than changing attention itself. (arXiv)
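For reference, a sketch of that baseline channel mixer (assumed PyTorch; the activation here is GELU, though the original Transformer used ReLU):

```python
import torch.nn as nn

class DenseFFN(nn.Module):
    """FFN(x) = W2 * phi(W1 x + b1) + b2, applied position-wise."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # W1, b1
        self.w2 = nn.Linear(d_ff, d_model)   # W2, b2
        self.act = nn.GELU()                 # phi

    def forward(self, x):
        return self.w2(self.act(self.w1(x)))
```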
The dominant replacement is the gated MLP family:
$$ \mathrm{GLU\text{-}FFN}(x)=W_o\big(\phi(W_gx)\odot (W_vx)\big), $$
with variants such as ReGLU, GEGLU, and SwiGLU. The core idea is simple: instead of one expanded hidden representation, create two projections, let one gate the other multiplicatively, and then project back down. In practice, this gives a more expressive channel mixer because the model can modulate which features should pass through on a token-by-token basis. Shazeer’s paper showed these GLU variants consistently improve Transformer quality over standard ReLU/GELU FFNs, and SwiGLU/GEGLU became especially influential in later LLMs. (arXiv)
Best practices. If compute is fixed, do not just swap the FFN for SwiGLU at the same hidden width; the gated formulation has three projections instead of two, so practitioners usually shrink the intermediate width, typically to about two-thirds of the ungated width, to keep parameters and FLOPs comparable. Conceptually, gated MLPs are now the “safe default” channel mixer unless you specifically need MoE or extremely lightweight deployments. (arXiv)
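A quick parameter-budget check makes the two-thirds rule concrete; the numbers below are illustrative only (biases ignored, d_model chosen arbitrarily):

```python
# Matching a gated FFN's parameter count to a dense FFN's.
d_model = 4096
d_ff_dense = 4 * d_model                  # classic 4x expansion
dense_params = 2 * d_model * d_ff_dense   # W1 and W2

# A GLU-style FFN has three projections (W_g, W_v, W_o), so matching parameters
# means shrinking the hidden width to roughly two-thirds of the dense one
# (often rounded to a hardware-friendly multiple in practice).
d_ff_glu = int(2 / 3 * d_ff_dense)        # ~10922
glu_params = 3 * d_model * d_ff_glu

print(f"dense FFN params: {dense_params / 1e6:.1f}M")  # ~134.2M
print(f"gated FFN params: {glu_params / 1e6:.1f}M")    # ~134.2M
```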
Tradeoff. Dense gated MLPs improve quality and training stability with modest architectural risk, but they still scale compute linearly with all FFN parameters at every token. That is exactly the bottleneck MoE tries to escape. (arXiv)
A second non-attention axis is how the residual stream is normalized and scaled. The two main formulas are LayerNorm and RMSNorm. For a vector $x\in\mathbb{R}^d$,