Last Modified: December 7, 2025

Detailed summary of the lecture notes for Stanford CME 295 (course website), supplemented with standardized mathematical notation, images, and code pointers.

Part 1. Transformer-based Models

1. Language Modeling Objective

Autoregressive LM (decoder-only LLMs)

Given tokens $x_1,\dots,x_T$ from a vocabulary of size $V$, the model factorizes the sequence probability with the chain rule:

$$ p_\theta(x_{1:T}) = \prod_{t=1}^T p_\theta(x_t \mid x_{<t}) $$

Training loss (per example):

$$ \mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t^{\text{(true)}} \mid x_{<t}^{\text{(true)}}) $$

Think of it as multi-class classification over the $V$ vocabulary entries at each position, conditioned on the prefix.
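The loss above can be sketched in a few lines of NumPy. This is a minimal illustration, not a training loop: the random `logits` array stands in for a real model's outputs, and the helper names `log_softmax` and `autoregressive_nll` are hypothetical, chosen only for this example. Row `t` of `logits` is assumed to hold the model's scores for $x_t$ given $x_{<t}$.

```python
import numpy as np

def log_softmax(logits):
    # Subtract the row max before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def autoregressive_nll(logits, targets):
    """Negative log-likelihood of one sequence.

    logits:  (T, V) array; row t scores every vocab entry for position t.
    targets: (T,) integer array of the true token ids.
    """
    log_probs = log_softmax(logits)                       # (T, V)
    picked = log_probs[np.arange(len(targets)), targets]  # log p(x_t | x_<t)
    return -picked.sum()

T, V = 4, 10
rng = np.random.default_rng(0)
logits = rng.normal(size=(T, V))   # placeholder for model outputs (assumption)
targets = np.array([3, 1, 4, 1])   # arbitrary example token ids

loss = autoregressive_nll(logits, targets)

# Sanity check: a model that predicts uniformly pays log V per token.
uniform_loss = autoregressive_nll(np.zeros((T, V)), targets)
```

Each row is an independent $V$-way classification, so the sequence loss is just the sum of per-position cross-entropies. With uniform predictions the loss equals $T \log V$, which is a useful baseline when reading training curves.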

2. Embeddings & Positional Encodings


2.1 Token Embeddings

Vocab size $V$, model dim $d_{\text{model}}$
Embedding matrix: $E \in \mathbb{R}^{V \times d_{\text{model}}}$
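For token index $w_t$, the embedding is row $w_t$ of $E$: $x_t = E[w_t] \in \mathbb{R}^{d_{\text{model}}}$ (here $x_t$ is a vector, whereas in Section 1 $x_t$ denoted the token itself)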
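The lookup $E[w_t]$ is plain row indexing. The sketch below uses NumPy with a randomly initialized `E` and made-up token ids (both are assumptions for illustration; in a real model `E` is a learned parameter). It also checks the equivalent view of the lookup as a one-hot vector times $E$.

```python
import numpy as np

V, d_model = 10, 4
rng = np.random.default_rng(0)

# Embedding matrix E in R^{V x d_model}; random here, learned in practice.
E = rng.normal(size=(V, d_model))

# A sequence of T = 4 token ids (0-indexed, as in most code).
token_ids = np.array([3, 1, 4, 1])

# Row lookup: X[t] = E[w_t], giving a (T, d_model) array.
X = E[token_ids]

# Equivalent formulation: one-hot rows multiplied by E.
one_hot = np.eye(V)[token_ids]   # (T, V)
X_via_matmul = one_hot @ E       # (T, d_model)
```

Indexing and the one-hot product give the same result, but indexing avoids materializing the $T \times V$ one-hot matrix. Note that positions 1 and 3 hold the same token id and therefore get identical embeddings, which is why positional information has to be added separately.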