Written By: Qingyang Xu
Date Created: November 16, 2022
Last Modified: January 15, 2024
Chapter summary of “Elements of Statistical Learning” (Second Edition)
Chapter 3. Linear Methods for Regression
Geometry of OLS
- SVD $X=UDV^\top$ where $D$ is diagonal with singular values $d_1 \ge d_2 \ge \cdots \ge d_p \ge 0$
- $U \in \mathbb{R}^{n\times p}$ with orthonormal columns $U^\top U=I_p$; its columns span the column space of $X$
- $V \in \mathbb{R}^{p\times p}$ with orthonormal columns $V^\top V=I_p$; its columns span the row space of $X$
- Key: the OLS fit $\hat{Y}$ is the projection of $Y \in \mathbb{R}^{n}$ onto the column space of $X$, for which the columns of $U \in \mathbb{R}^{n\times p}$ form a $p$-dimensional ONB (verified numerically in the sketch after this list)
$$
\hat{Y}=SY=X(X^\top X)^{-1}X^\top Y=UU^\top Y= \sum_{i=1}^p u_i u_i^\top Y
$$
- Eigendecomposition $X^\top X=VD^2 V^\top =\sum_i d_i^2 v_i v_i^\top$ where $v_i$ is the $i$-th principal component direction of $X$
- Large $d_i$ means large sample variance in the direction of $v_i$
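A minimal numpy sketch of the projection view (the toy data and variable names are my own illustration, not from the book): the OLS fit computed from the normal equations coincides with $UU^\top Y$, and the SVD recovers the eigendecomposition of $X^\top X$.

```python
import numpy as np

# Toy centered data (illustrative only): n = 50 observations, p = 3 predictors
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)

# OLS fit via the normal equations: Y_hat = X (X^T X)^{-1} X^T Y
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ beta_ols

# Thin SVD: X = U D V^T with U (n x p), d the singular values, Vt = V^T (p x p)
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Projection of Y onto the column space of X using the orthonormal basis U
Y_hat_svd = U @ (U.T @ Y)
print(np.allclose(Y_hat, Y_hat_svd))  # True: both are the same projection

# Eigendecomposition of X^T X recovered from the SVD: X^T X = V D^2 V^T
print(np.allclose(X.T @ X, Vt.T @ np.diag(d**2) @ Vt))  # True
```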
Ridge
- Estimate the intercept as $\hat{\beta}_0 = \bar{y}=\frac{1}{n} \sum_i y_i$ and fit the remaining coefficients on centered inputs $x_{ij}-\bar{x}_j$, so the penalty does not apply to the intercept
- Ridge regression minimizes
$$
RSS(\lambda)=(Y-X\beta)^\top(Y-X\beta) + \lambda ||\beta||^2
$$
$$
\hat{\beta}^{\text{ridge}} = (X^\top X+\lambda I_p)^{-1}X^\top Y \implies \hat{Y}^{\text{ridge}} = X(X^\top X+\lambda I_p)^{-1}X^\top Y
$$
- SVD $X=UDV^\top \implies X^\top X+\lambda I_p=V(D^2+\lambda I_p)V^\top$, so the ridge fit shrinks the coordinate of $Y$ along each $u_i$ by the factor $\frac{d_i^2}{d_i^2+\lambda} \le 1$ (checked numerically in the sketch below):
$$
\hat{Y}^{\text{ridge}} = X(X^\top X+\lambda I_p)^{-1}X^\top Y = \sum_{i=1}^p u_i \frac{d_i^2}{d_i^2+\lambda} u_i^\top Y
$$
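A minimal numpy sketch of the shrinkage view (toy data, $\lambda$, and names are assumptions for illustration): the closed-form ridge fit agrees with the SVD form in which each coordinate $u_i^\top Y$ is scaled by $d_i^2/(d_i^2+\lambda)$.

```python
import numpy as np

# Toy centered data and an illustrative penalty level
rng = np.random.default_rng(1)
n, p, lam = 50, 3, 2.0
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)

# Closed-form ridge fit: Y_hat = X (X^T X + lambda I)^{-1} X^T Y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
Y_hat_ridge = X @ beta_ridge

# SVD shrinkage form: shrink each coordinate u_i^T Y by d_i^2 / (d_i^2 + lambda)
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d**2 / (d**2 + lam)
Y_hat_svd = U @ (shrink * (U.T @ Y))

print(np.allclose(Y_hat_ridge, Y_hat_svd))  # True
```

Directions with small $d_i$ (low sample variance) are shrunk the most, which is the geometric point of writing the ridge fit through the SVD.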