Last Modified: January 12, 2026

A detailed summary of the lecture notes for Stanford CS 224R (course website), supplemented with standardized mathematical notation, images, and code pointers!


1. Fundamentals of Sequential Decision Making

A Markov Decision Process (MDP) is defined by the tuple $(S, A, p, r, \gamma, \mu)$: a state space $S$, an action space $A$, transition dynamics $p(s_{t+1} \mid s_t, a_t)$, a reward function $r(s_t, a_t)$, a discount factor $\gamma \in (0, 1]$, and an initial state distribution $\mu(s_1)$.
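The tuple above can be made concrete for a finite MDP. A minimal sketch, assuming tabular $S$ and $A$ so that $p$ becomes a $|S| \times |A| \times |S|$ transition tensor; the `MDP` class and the toy numbers are illustrative, not from the notes:

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical container for the MDP tuple (S, A, p, r, gamma, mu).
# For finite S and A, p is a |S| x |A| x |S| tensor of transition probabilities.
@dataclass
class MDP:
    n_states: int     # |S|
    n_actions: int    # |A|
    p: np.ndarray     # p[s, a, s'] = Pr(s_{t+1} = s' | s_t = s, a_t = a)
    r: np.ndarray     # r[s, a] = reward for taking a in s
    gamma: float      # discount factor
    mu: np.ndarray    # initial state distribution mu(s_1)

# Tiny 2-state, 2-action example (made-up numbers for illustration).
p = np.zeros((2, 2, 2))
p[0, 0] = [0.9, 0.1]; p[0, 1] = [0.2, 0.8]
p[1, 0] = [0.5, 0.5]; p[1, 1] = [0.0, 1.0]
r = np.array([[0.0, 1.0], [0.0, 2.0]])
mdp = MDP(n_states=2, n_actions=2, p=p, r=r, gamma=0.99, mu=np.array([1.0, 0.0]))
```

Each row of the transition tensor is a probability distribution over next states, which is easy to sanity-check with `p.sum(axis=-1)`.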

The objective is to find policy parameters $\theta$ that maximize the expected discounted return over trajectories $\tau = (s_1, a_1, \dots, s_T, a_T)$ sampled from the policy:

$$ J(\theta)=\mathbb E_{\tau \sim p_\theta(\tau)} \Big[\sum_{t=1}^T \gamma^{t-1} r(s_t,a_t) \Big] $$
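Since the expectation is over trajectories, $J(\theta)$ can be estimated by Monte Carlo: roll out the policy, sum discounted rewards, and average. A sketch on a hypothetical tabular MDP with a uniform-random policy standing in for $\pi_\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, 2-action MDP (hypothetical numbers), finite horizon T.
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # p[s, a, s']
r = np.array([[0.0, 1.0], [0.0, 2.0]])     # r[s, a]
gamma, mu, T = 0.99, np.array([1.0, 0.0]), 50

def policy(s):
    # Uniform random stand-in; a learned pi_theta(a | s) would go here.
    return rng.integers(2)

def rollout_return():
    # One sampled trajectory tau ~ p_theta(tau); returns sum_t gamma^{t-1} r(s_t, a_t).
    s, ret = rng.choice(2, p=mu), 0.0
    for t in range(T):
        a = policy(s)
        ret += gamma**t * r[s, a]      # gamma^{t-1} with t starting at 1
        s = rng.choice(2, p=p[s, a])
    return ret

# Monte Carlo estimate of J(theta): average return over sampled trajectories.
J_hat = np.mean([rollout_return() for _ in range(1000)])
```

The estimate is unbiased, and its variance shrinks as more trajectories are averaged.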


Types of RL algorithms

  1. Imitation learning: mimic an expert policy that achieves high reward
  2. Policy gradients: directly estimate and ascend $\nabla_\theta J(\theta)$
  3. Actor-critic: estimate the value of the current policy and use it to improve the policy
  4. Value-based: estimate the value of the optimal policy
  5. Model-based: learn a model of the dynamics and use it for planning or policy improvement
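The second family above can be sketched concretely. A minimal REINFORCE implementation (the simplest policy-gradient estimator, $\nabla_\theta J \approx \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$ with returns-to-go $G_t$) on a hypothetical tabular MDP; the softmax parameterization and learning rate are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, 2-action MDP (hypothetical numbers).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # p[s, a, s']
r = np.array([[0.0, 1.0], [0.0, 2.0]])     # r[s, a]
gamma, mu, T = 0.99, np.array([1.0, 0.0]), 20

theta = np.zeros((2, 2))  # softmax logits per (state, action)

def pi(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def sample_trajectory():
    s, traj = rng.choice(2, p=mu), []
    for t in range(T):
        a = rng.choice(2, p=pi(s))
        traj.append((s, a, r[s, a]))
        s = rng.choice(2, p=p[s, a])
    return traj

for it in range(500):
    traj = sample_trajectory()
    G, grad = 0.0, np.zeros_like(theta)
    # Walk backward to accumulate returns-to-go G_t, then the score-function gradient.
    for (s, a, rew) in reversed(traj):
        G = rew + gamma * G
        g = -pi(s)            # d log pi(a | s) / d theta[s, :] for softmax
        g[a] += 1.0
        grad[s] += g * G
    theta += 0.01 * grad      # gradient ascent on J(theta)
```

In this toy MDP, action 1 earns strictly higher reward from both states, so the learned policy should come to prefer it; variance is high without a baseline, which is one motivation for the actor-critic family.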