Written By: Qingyang Xu (et AI)
Last Modified: April 16, 2026
We provide a summary of optimization methods for training AI models, LLMs in particular. A good way to understand optimizer history is through the general update
$$ \theta_{t+1}=\theta_t-\eta_t P_t\tilde g_t $$
where $g_t=\nabla_\theta \ell(\theta_t;\xi_t)$ is a stochastic gradient on a minibatch, $\tilde g_t$ is a temporally smoothed version of recent gradients, and $P_t$ is a geometry-aware rescaling or preconditioner. The whole story from SGD to Adam to Muon is really about three questions: how do we set the step size $\eta_t$, how do we smooth recent gradients into $\tilde g_t$, and how do we precondition with $P_t$?
That lens makes the family tree much cleaner.
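In code, the template above is a single line. Here is a minimal sketch (the function and variable names are ours for illustration, not from any optimizer library):

```python
import numpy as np

def update(theta, g_tilde, eta, P=None):
    """One step of the template: theta <- theta - eta * P @ g_tilde.

    P=None means the identity preconditioner; g_tilde is whatever
    smoothed gradient the optimizer produces.
    """
    step = g_tilde if P is None else P @ g_tilde
    return theta - eta * step

# Plain SGD is the special case g_tilde = g_t, P = identity.
theta = np.array([1.0, -2.0])
g = np.array([0.5, 0.5])
print(update(theta, g, eta=0.1))  # -> [ 0.95 -2.05]
```

Every optimizer below is a particular choice of `g_tilde` and `P` plugged into this one line.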
In AI/LLM training we usually minimize an empirical or population objective
$$ L(\theta)=\frac1N\sum_{i=1}^N \ell(\theta;x_i) \qquad\text{or}\qquad L(\theta)=\mathbb E_{\xi}[\ell(\theta;\xi)]. $$
A full gradient step uses $\nabla L(\theta_t)$, but that is too expensive at scale, so we use the minibatch estimator
$$ g_t=\frac1{|B_t|}\sum_{i\in B_t}\nabla \ell(\theta_t;x_i), \qquad \mathbb E[g_t]= \nabla L(\theta_t), $$
which is unbiased when the batch $B_t$ is sampled uniformly.
Everything below is a different way to turn $g_t$ into an update.
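As a concrete sketch on a toy least-squares problem (the problem setup and names are illustrative assumptions, not from the text):

```python
import numpy as np

# Toy objective: L(theta) = (1/N) * sum_i 0.5 * (x_i^T theta - y_i)^2.
rng = np.random.default_rng(0)
N, d = 1000, 3
X = rng.normal(size=(N, d))
y = rng.normal(size=N)

def full_grad(theta):
    """Exact gradient of L over all N examples."""
    return X.T @ (X @ theta - y) / N

def minibatch_grad(theta, batch):
    """Minibatch estimator g_t, averaged over the index set `batch`."""
    Xb, yb = X[batch], y[batch]
    return Xb.T @ (Xb @ theta - yb) / len(batch)

theta = np.zeros(d)
batch = rng.choice(N, size=64, replace=False)
g = minibatch_grad(theta, batch)  # unbiased: E[g] = full_grad(theta)
```

Averaging `minibatch_grad` over batches that cover the dataset reproduces `full_grad` exactly, which is the unbiasedness claim above.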
Vanilla SGD is
$$ \theta_{t+1}=\theta_t-\eta_t g_t. $$
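A minimal runnable sketch of the SGD loop on a synthetic noiseless least-squares problem (the setup and hyperparameters are our illustrative choices):

```python
import numpy as np

# Noiseless least-squares data, so SGD can recover theta_true exactly.
rng = np.random.default_rng(1)
N, d = 512, 4
X = rng.normal(size=(N, d))
theta_true = rng.normal(size=d)
y = X @ theta_true

theta = np.zeros(d)
eta = 0.1  # constant step size for simplicity
for t in range(500):
    batch = rng.choice(N, size=32, replace=False)
    g = X[batch].T @ (X[batch] @ theta - y[batch]) / 32  # stochastic gradient g_t
    theta = theta - eta * g                              # theta_{t+1} = theta_t - eta_t g_t

# theta is now close to theta_true
```

Because the labels are noiseless, every minibatch gradient vanishes at `theta_true`, so the iterates contract toward it; with noisy labels a decaying $\eta_t$ would be needed to converge.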