Written By: Qingyang Xu (website)

Date Created: November 16, 2025

Last Modified: December 7, 2025

A detailed summary of the lecture notes for Stanford CME 295 (course website), supplemented with standardized mathematical notation, images, and code pointers!


Part 1. Transformer-based Models

1. Language Modeling Objective

Autoregressive LM (decoder-only LLMs)

Given a sequence of tokens $x_1,\dots,x_T$ from a vocabulary of size $V$, the model factorizes the joint probability autoregressively:

$$ p_\theta(x_{1:T}) = \prod_{t=1}^T p_\theta(x_t \mid x_{<t}) $$

Training minimizes the negative log-likelihood of the ground-truth token sequence:

$$ \mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t^{\text{(true)}} \mid x_{<t}^{\text{(true)}}) $$
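
For concreteness, here is a minimal PyTorch sketch of this objective, assuming a decoder-only model that returns per-position logits over the vocabulary. The shapes and random tensors below are purely illustrative, not part of the course materials:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: batch size B, sequence length T, vocab size V.
B, T, V = 2, 8, 100
logits = torch.randn(B, T, V)          # stand-in for model outputs: logits of p_theta(x_t | x_<t)
tokens = torch.randint(0, V, (B, T))   # ground-truth token IDs x_1, ..., x_T

# At position t the model predicts token t+1, so shift logits and targets by one.
shift_logits = logits[:, :-1, :]       # predictions for positions 2..T
shift_targets = tokens[:, 1:]          # true next tokens x_2..x_T

# Summed token-level cross-entropy = the negative log-likelihood L(theta) above.
loss = F.cross_entropy(
    shift_logits.reshape(-1, V),
    shift_targets.reshape(-1),
    reduction="sum",
)
print(loss)
```

In practice, `reduction="mean"` is often used instead so the loss is comparable across sequence lengths; the minimizer is the same up to a constant factor.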


2. Embeddings & Positional Encodings


2.1 Token Embeddings