Written By: Qingyang Xu (et AI)
Last Modified: April 16, 2026
We provide a summary of optimization methods for training AI models, LLMs in particular. A good way to understand optimizer history is through the general update
$$ \theta_{t+1}=\theta_t-\eta_t P_t\tilde g_t $$
where $g_t=\nabla_\theta \ell(\theta_t;\xi_t)$ is a stochastic gradient on a minibatch, $\tilde g_t$ is a temporally smoothed version of recent gradients, and $P_t$ is a geometry-aware rescaling or preconditioner. The whole story from SGD to Adam to Muon is really about three questions: how do we set the step size $\eta_t$, how do we smooth recent gradients into $\tilde g_t$, and how do we precondition with $P_t$?
That lens makes the family tree much cleaner.
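In code, the template above is a single line. Here is a minimal sketch (the function and variable names are ours for illustration, not from any optimizer library):

```python
import numpy as np

def update(theta, g_tilde, eta, P=None):
    """One step of the template: theta <- theta - eta * P @ g_tilde.

    P=None means the identity preconditioner; g_tilde is whatever
    smoothed gradient the optimizer produces.
    """
    step = g_tilde if P is None else P @ g_tilde
    return theta - eta * step

# Plain SGD is the special case g_tilde = g_t, P = identity.
theta = np.array([1.0, -2.0])
g = np.array([0.5, 0.5])
print(update(theta, g, eta=0.1))  # -> [ 0.95 -2.05]
```

Every optimizer below is a particular choice of `g_tilde` and `P` plugged into this one line.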
In AI/LLM training we usually minimize an empirical or population objective
$$ L(\theta)=\frac1N\sum_{i=1}^N \ell(\theta;x_i) \qquad\text{or}\qquad L(\theta)=\mathbb E_{\xi}[\ell(\theta;\xi)]. $$
A full gradient step uses $\nabla L(\theta_t)$, but that is too expensive at scale, so we use the minibatch estimator
$$ g_t=\frac1{|B_t|}\sum_{i\in B_t}\nabla \ell(\theta_t;x_i), \qquad \mathbb E[g_t]= \nabla L(\theta_t), $$
which is unbiased when the batch $B_t$ is sampled uniformly.
Everything below is a different way to turn $g_t$ into an update.
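As a concrete sketch on a toy least-squares problem (the problem setup and names are illustrative assumptions, not from the text):

```python
import numpy as np

# Toy objective: L(theta) = (1/N) * sum_i 0.5 * (x_i^T theta - y_i)^2.
rng = np.random.default_rng(0)
N, d = 1000, 3
X = rng.normal(size=(N, d))
y = rng.normal(size=N)

def full_grad(theta):
    """Exact gradient of L over all N examples."""
    return X.T @ (X @ theta - y) / N

def minibatch_grad(theta, batch):
    """Minibatch estimator g_t, averaged over the index set `batch`."""
    Xb, yb = X[batch], y[batch]
    return Xb.T @ (Xb @ theta - yb) / len(batch)

theta = np.zeros(d)
batch = rng.choice(N, size=64, replace=False)
g = minibatch_grad(theta, batch)  # unbiased: E[g] = full_grad(theta)
```

Averaging `minibatch_grad` over batches that cover the dataset reproduces `full_grad` exactly, which is the unbiasedness claim above.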
Vanilla SGD is
$$ \theta_{t+1}=\theta_t-\eta_t g_t. $$
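A minimal runnable sketch of the SGD loop on a synthetic noiseless least-squares problem (the setup and hyperparameters are our illustrative choices):

```python
import numpy as np

# Noiseless least-squares data, so SGD can recover theta_true exactly.
rng = np.random.default_rng(1)
N, d = 512, 4
X = rng.normal(size=(N, d))
theta_true = rng.normal(size=d)
y = X @ theta_true

theta = np.zeros(d)
eta = 0.1  # constant step size for simplicity
for t in range(500):
    batch = rng.choice(N, size=32, replace=False)
    g = X[batch].T @ (X[batch] @ theta - y[batch]) / 32  # stochastic gradient g_t
    theta = theta - eta * g                              # theta_{t+1} = theta_t - eta_t g_t

# theta is now close to theta_true
```

Because the labels are noiseless, every minibatch gradient vanishes at `theta_true`, so the iterates contract toward it; with noisy labels a decaying $\eta_t$ would be needed to converge.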