Written By: Qingyang Xu
Date Created: November 16, 2022
Last Modified: January 15, 2024
Chapter summary of “Elements of Statistical Learning” (Second Edition)
Chapter 3. Linear Methods for Regression
Geometry of OLS
- SVD $X=UDV^\top$ where $D$ is diagonal with singular values $d_1 \ge d_2 \ge \cdots \ge d_p \ge 0$
- $U \in \mathbb{R}^{n\times p}$ with orthonormal columns $U^\top U=I_p$; its columns span the column space of $X$
- $V \in \mathbb{R}^{p\times p}$ with orthonormal columns $V^\top V=I_p$; its columns span the row space of $X$
- Key: the OLS fit $\hat{Y}$ is the projection of $Y \in \mathbb{R}^{n}$ onto the column space of $X$, for which the columns of $U \in \mathbb{R}^{n\times p}$ form a $p$-dimensional ONB (verified numerically in the sketch after this list)
$$
\hat{Y}=SY=X(X^\top X)^{-1}X^\top Y=UU^\top Y= \sum_{i=1}^p u_i u_i^\top Y
$$
- Eigendecomposition $X^\top X=VD^2 V^\top =\sum_i d_i^2 v_i v_i^\top$ where $v_i$ is the $i$-th principal component direction of $X$
- Large $d_i$ means large sample variance in the direction of $v_i$
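A minimal numpy sketch of the projection view (the toy data and variable names are my own illustration, not from the book): the OLS fit computed from the normal equations coincides with $UU^\top Y$, and the SVD recovers the eigendecomposition of $X^\top X$.

```python
import numpy as np

# Toy centered data (illustrative only): n = 50 observations, p = 3 predictors
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)

# OLS fit via the normal equations: Y_hat = X (X^T X)^{-1} X^T Y
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ beta_ols

# Thin SVD: X = U D V^T with U (n x p), d the singular values, Vt = V^T (p x p)
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Projection of Y onto the column space of X using the orthonormal basis U
Y_hat_svd = U @ (U.T @ Y)
print(np.allclose(Y_hat, Y_hat_svd))  # True: both are the same projection

# Eigendecomposition of X^T X recovered from the SVD: X^T X = V D^2 V^T
print(np.allclose(X.T @ X, Vt.T @ np.diag(d**2) @ Vt))  # True
```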
Ridge
- Estimate the intercept as $\hat{\beta}_0 = \bar{y}=\frac{1}{n} \sum_i y_i$ and fit the remaining coefficients on centered inputs $x_{ij}-\bar{x}_j$, so the penalty does not apply to the intercept
- Ridge regression minimizes
$$
RSS(\lambda)=(Y-X\beta)^\top(Y-X\beta) + \lambda ||\beta||^2
$$
$$
\hat{\beta}^{\text{ridge}} = (X^\top X+\lambda I_p)^{-1}X^\top Y \implies \hat{Y}^{\text{ridge}} = X(X^\top X+\lambda I_p)^{-1}X^\top Y
$$
- SVD $X=UDV^\top \implies X^\top X+\lambda I_p=V(D^2+\lambda I_p)V^\top$, so the ridge fit shrinks the coordinate of $Y$ along each $u_i$ by the factor $\frac{d_i^2}{d_i^2+\lambda} \le 1$ (checked numerically in the sketch below):
$$
\hat{Y}^{\text{ridge}} = X(X^\top X+\lambda I_p)^{-1}X^\top Y = \sum_{i=1}^p u_i \frac{d_i^2}{d_i^2+\lambda} u_i^\top Y
$$
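A minimal numpy sketch of the shrinkage view (toy data, $\lambda$, and names are assumptions for illustration): the closed-form ridge fit agrees with the SVD form in which each coordinate $u_i^\top Y$ is scaled by $d_i^2/(d_i^2+\lambda)$.

```python
import numpy as np

# Toy centered data and an illustrative penalty level
rng = np.random.default_rng(1)
n, p, lam = 50, 3, 2.0
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)

# Closed-form ridge fit: Y_hat = X (X^T X + lambda I)^{-1} X^T Y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
Y_hat_ridge = X @ beta_ridge

# SVD shrinkage form: shrink each coordinate u_i^T Y by d_i^2 / (d_i^2 + lambda)
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d**2 / (d**2 + lam)
Y_hat_svd = U @ (shrink * (U.T @ Y))

print(np.allclose(Y_hat_ridge, Y_hat_svd))  # True
```

Directions with small $d_i$ (low sample variance) are shrunk the most, which is the geometric point of writing the ridge fit through the SVD.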