Written By: Qingyang Xu (website)
Date Created: November 16, 2025
Last Modified: December 7, 2025
Detailed summary of the lecture notes for Stanford CME 295 (course website), supplemented with standardized mathematical notation, images, and code pointers!
Autoregressive LM (decoder-only LLMs)
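Given tokens $x_1,\dots,x_T$ from a vocabulary of size $V$, the model factorizes the joint probability with the chain rule and is trained by minimizing the negative log-likelihood of the observed sequence: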
$$ p_\theta(x_{1:T}) = \prod_{t=1}^T p_\theta(x_t \mid x_{<t}) $$
$$ \mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t^{\text{(true)}} \mid x_{<t}^{\text{(true)}}) $$
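The loss above can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's implementation: the helper names `log_softmax` and `sequence_nll` and the toy logits are assumptions made up for this example, and a real model would produce the per-position logits from a decoder rather than a hard-coded list.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of V logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def sequence_nll(logits_per_step, target_ids):
    """Sum of -log p(x_t | x_<t) over t = 1..T.

    logits_per_step[t] holds the V unnormalized scores that a
    (hypothetical) model emitted for position t given the earlier tokens.
    target_ids[t] is the index of the true token at position t.
    """
    if len(logits_per_step) != len(target_ids):
        raise ValueError("need one logit vector per target token")
    total = 0.0
    for logits, target in zip(logits_per_step, target_ids):
        total -= log_softmax(logits)[target]
    return total

# Toy example with T = 3 positions and V = 4 vocabulary entries.
# The logits are made-up numbers, not the output of any real model.
toy_logits = [
    [2.0, 0.5, 0.1, -1.0],
    [0.0, 0.0, 0.0, 0.0],
    [-0.3, 1.2, 0.4, 2.5],
]
toy_targets = [0, 2, 3]

nll = sequence_nll(toy_logits, toy_targets)
joint_prob = math.exp(-nll)  # equals the product of per-token probabilities
```

Exponentiating the negative loss recovers the joint probability $p_\theta(x_{1:T})$, which is why minimizing $\mathcal{L}(\theta)$ is the same as maximizing the likelihood of the training sequence. In practice the sum is usually averaged over tokens and computed in a batched way by a library cross-entropy routine.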
