Last Modified: January 12, 2026

A detailed summary of the lecture notes for Stanford CS 224R (course website), supplemented with standardized mathematical notation, images, and code pointers!


1. Fundamentals of Sequential Decision Making

A Markov Decision Process (MDP) is defined by the tuple $(S, A, p, r, \gamma, \mu)$: a state space $S$, an action space $A$, transition dynamics $p(s_{t+1} \mid s_t, a_t)$, a reward function $r(s_t, a_t)$, a discount factor $\gamma \in (0, 1]$, and an initial state distribution $\mu(s_1)$.
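The tuple above can be made concrete for a finite MDP. A minimal sketch, assuming tabular $S$ and $A$ so that $p$ becomes a $|S| \times |A| \times |S|$ transition tensor; the `MDP` class and the toy numbers are illustrative, not from the notes:

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical container for the MDP tuple (S, A, p, r, gamma, mu).
# For finite S and A, p is a |S| x |A| x |S| tensor of transition probabilities.
@dataclass
class MDP:
    n_states: int     # |S|
    n_actions: int    # |A|
    p: np.ndarray     # p[s, a, s'] = Pr(s_{t+1} = s' | s_t = s, a_t = a)
    r: np.ndarray     # r[s, a] = reward for taking a in s
    gamma: float      # discount factor
    mu: np.ndarray    # initial state distribution mu(s_1)

# Tiny 2-state, 2-action example (made-up numbers for illustration).
p = np.zeros((2, 2, 2))
p[0, 0] = [0.9, 0.1]; p[0, 1] = [0.2, 0.8]
p[1, 0] = [0.5, 0.5]; p[1, 1] = [0.0, 1.0]
r = np.array([[0.0, 1.0], [0.0, 2.0]])
mdp = MDP(n_states=2, n_actions=2, p=p, r=r, gamma=0.99, mu=np.array([1.0, 0.0]))
```

Each row of the transition tensor is a probability distribution over next states, which is easy to sanity-check with `p.sum(axis=-1)`.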

The objective is to find policy parameters $\theta$ that maximize the expected discounted return over trajectories $\tau = (s_1, a_1, \dots, s_T, a_T)$ sampled from the policy:

$$ J(\theta)=\mathbb E_{\tau \sim p_\theta(\tau)} \Big[\sum_{t=1}^T \gamma^{t-1} r(s_t,a_t) \Big] $$
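Since the expectation is over trajectories, $J(\theta)$ can be estimated by Monte Carlo: roll out the policy, sum discounted rewards, and average. A sketch on a hypothetical tabular MDP with a uniform-random policy standing in for $\pi_\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, 2-action MDP (hypothetical numbers), finite horizon T.
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # p[s, a, s']
r = np.array([[0.0, 1.0], [0.0, 2.0]])     # r[s, a]
gamma, mu, T = 0.99, np.array([1.0, 0.0]), 50

def policy(s):
    # Uniform random stand-in; a learned pi_theta(a | s) would go here.
    return rng.integers(2)

def rollout_return():
    # One sampled trajectory tau ~ p_theta(tau); returns sum_t gamma^{t-1} r(s_t, a_t).
    s, ret = rng.choice(2, p=mu), 0.0
    for t in range(T):
        a = policy(s)
        ret += gamma**t * r[s, a]      # gamma^{t-1} with t starting at 1
        s = rng.choice(2, p=p[s, a])
    return ret

# Monte Carlo estimate of J(theta): average return over sampled trajectories.
J_hat = np.mean([rollout_return() for _ in range(1000)])
```

The estimate is unbiased, and its variance shrinks as more trajectories are averaged.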


Types of RL algorithms

  1. Imitation learning: mimic an expert policy that achieves high reward
  2. Policy gradients: directly estimate and ascend $\nabla_\theta J(\theta)$
  3. Actor-critic: estimate the value of the current policy and use it to improve the policy
  4. Value-based: estimate the value of the optimal policy
  5. Model-based: learn a model of the dynamics and use it for planning or policy improvement
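The second family above can be sketched concretely. A minimal REINFORCE implementation (the simplest policy-gradient estimator, $\nabla_\theta J \approx \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$ with returns-to-go $G_t$) on a hypothetical tabular MDP; the softmax parameterization and learning rate are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, 2-action MDP (hypothetical numbers).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # p[s, a, s']
r = np.array([[0.0, 1.0], [0.0, 2.0]])     # r[s, a]
gamma, mu, T = 0.99, np.array([1.0, 0.0]), 20

theta = np.zeros((2, 2))  # softmax logits per (state, action)

def pi(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def sample_trajectory():
    s, traj = rng.choice(2, p=mu), []
    for t in range(T):
        a = rng.choice(2, p=pi(s))
        traj.append((s, a, r[s, a]))
        s = rng.choice(2, p=p[s, a])
    return traj

for it in range(500):
    traj = sample_trajectory()
    G, grad = 0.0, np.zeros_like(theta)
    # Walk backward to accumulate returns-to-go G_t, then the score-function gradient.
    for (s, a, rew) in reversed(traj):
        G = rew + gamma * G
        g = -pi(s)            # d log pi(a | s) / d theta[s, :] for softmax
        g[a] += 1.0
        grad[s] += g * G
    theta += 0.01 * grad      # gradient ascent on J(theta)
```

In this toy MDP, action 1 earns strictly higher reward from both states, so the learned policy should come to prefer it; variance is high without a baseline, which is one motivation for the actor-critic family.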