
Principal Component Analysis


The mathematical theory of PCA extends from linear algebra to random matrix theory and nonlinear dimension-reduction methods.

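Optimization and SVD

PCA is equivalent to solving max wᵀΣw s.t. ‖w‖₂ = 1, whose solution is the eigenvector of Σ associated with the largest eigenvalue. Equivalently, take the SVD of the centered data matrix, X = UDVᵀ: the PC scores are UD and the loadings are V. Truncating the SVD to the top k components gives the best rank-k approximation of X (Eckart-Young theorem).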
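The equivalence between the SVD route and the eigendecomposition of the sample covariance can be checked numerically. The following is a minimal sketch assuming NumPy and simulated toy data; the function name `pca_svd` and the toy matrix are illustrative assumptions, not part of the original text.

```python
import numpy as np

def pca_svd(X, k):
    """PCA of an (n, p) data matrix via the thin SVD of the centered data."""
    Xc = X - X.mean(axis=0)                          # center each column
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :k] * d[:k]                        # PC scores = U D
    loadings = Vt[:k].T                              # loadings = V (unit-norm columns)
    explained_var = d[:k] ** 2 / (X.shape[0] - 1)    # eigenvalues of the sample covariance
    return scores, loadings, explained_var

# Hypothetical toy data: 200 samples, 5 correlated variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
scores, loadings, explained_var = pca_svd(X, k=2)

# Cross-check against the eigendecomposition of the sample covariance matrix.
evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))   # ascending order
top_evals = evals[::-1][:2]
top_evec = evecs[:, -1]

# Rank-2 reconstruction; its squared Frobenius error equals the sum of the
# discarded squared singular values (Eckart-Young).
Xc = X - X.mean(axis=0)
recon_error = np.linalg.norm(Xc - scores @ loadings.T, "fro") ** 2
discarded = np.sum(np.linalg.svd(Xc, compute_uv=False)[2:] ** 2)
```

The signs of the loadings are arbitrary, so comparisons against an eigenvector should be made up to sign.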

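Random Matrix Theory

When p/n → γ > 0, the Marchenko-Pastur (MP) law describes the limiting eigenvalue distribution of the sample covariance matrix for null data (pure noise, no signal). After suitable centering and scaling, the largest eigenvalue follows the Tracy-Widom distribution. Johnstone (2001, Ann Stat): a significant PC should have an eigenvalue beyond the upper edge of the MP distribution, σ²(1 + √γ)² for noise variance σ². Parallel analysis (Horn, 1965) selects the number of PCs by using the eigenvalues of random data as the null baseline.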
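Both selection rules can be sketched in a few lines. The code below is an illustrative sketch assuming NumPy, Gaussian null data, and the correlation matrix as the object of analysis; the 95% quantile, the number of simulations, and the function names are assumptions rather than prescriptions from Horn (1965).

```python
import numpy as np

def mp_upper_edge(n, p, sigma2=1.0):
    """Upper edge of the Marchenko-Pastur bulk: sigma^2 * (1 + sqrt(p/n))^2."""
    return sigma2 * (1.0 + np.sqrt(p / n)) ** 2

def parallel_analysis(X, n_sims=50, quantile=0.95, seed=0):
    """Number of leading PCs whose eigenvalue exceeds the simulated null quantile."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    null = np.empty((n_sims, p))
    for s in range(n_sims):
        Z = rng.normal(size=(n, p))              # null data: independent Gaussian columns
        null[s] = np.sort(np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False)))[::-1]
    threshold = np.quantile(null, quantile, axis=0)
    keep = obs > threshold
    # Retain leading components up to the first one that fails the threshold.
    return p if keep.all() else int(np.argmin(keep))

# Hypothetical toy data: two latent factors, each driving 10 of 20 variables.
rng = np.random.default_rng(1)
n, p = 300, 20
F = rng.normal(size=(n, 2))
L = np.zeros((2, p))
L[0, :10] = 1.0
L[1, 10:] = 1.0
X = F @ L + rng.normal(size=(n, p))

n_keep = parallel_analysis(X)
edge = mp_upper_edge(n, p)
```

With standardized variables (σ² = 1), `edge` gives the asymptotic cutoff, while `parallel_analysis` estimates a finite-sample cutoff for each eigenvalue rank by simulation.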

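Sparse PCA and Regularization

The loadings of standard PCA are generally all nonzero, which makes them hard to interpret. Sparse PCA (Zou, Hastie & Tibshirani, 2006, J Comput Graph Stat) obtains sparse loadings through an L1 (lasso-type) penalty; a generic penalized form is max wᵀΣw − λ‖w‖₁ s.t. ‖w‖₂ = 1. SPC (Witten et al., 2009) and the penalized matrix decomposition (PMD) framework unify several sparse dimension-reduction methods. In genomics, sparse loadings are used to identify driver genes.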
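To see how an L1 penalty zeroes out loadings, one simple heuristic is a power iteration with a soft-thresholding step. The sketch below assumes NumPy and simulated data; it is not the elastic-net algorithm of Zou et al. nor a reference SPC/PMD implementation, and `sparse_pc1` and the value of `lam` are hypothetical choices for illustration.

```python
import numpy as np

def soft_threshold(a, lam):
    """Elementwise soft-thresholding, the proximal operator of the L1 norm."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def sparse_pc1(X, lam, n_iter=200, tol=1e-10):
    """First sparse loading vector via soft-thresholded power iteration (heuristic)."""
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / (X.shape[0] - 1)             # sample covariance
    w = np.linalg.eigh(S)[1][:, -1]              # start from the ordinary first loading
    for _ in range(n_iter):
        w_new = soft_threshold(S @ w, lam)       # shrink small entries to exactly zero
        norm = np.linalg.norm(w_new)
        if norm == 0.0:
            raise ValueError("lam is too large: all loadings were shrunk to zero")
        w_new /= norm                            # project back to the unit sphere
        converged = np.linalg.norm(w_new - w) < tol
        w = w_new
        if converged:
            break
    return w

# Hypothetical toy data: only the first 3 of 10 variables share a latent factor.
rng = np.random.default_rng(2)
n, p = 500, 10
f = rng.normal(size=n)
X = rng.normal(size=(n, p))
X[:, :3] += 2.0 * f[:, None]

w = sparse_pc1(X, lam=1.0)
n_nonzero = int(np.count_nonzero(w))
```

In practice λ would be tuned, for example by cross-validation, and the overall sign of `w` is arbitrary.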

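Probabilistic PCA

Tipping & Bishop (1999, JRSSB): x = Wz + μ + ε, with z ~ N(0, I) and ε ~ N(0, σ²I). The MLE of W spans the same subspace as standard PCA: its columns are the leading eigenvectors of the sample covariance, rescaled and determined only up to rotation, and standard PCA is recovered in the limit σ² → 0. The EM algorithm handles missing data. Mixtures of probabilistic PCA are used for clustering.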
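For complete data the maximum-likelihood estimates have a closed form: σ² is the average of the discarded eigenvalues of the sample covariance, and W = U_q(Λ_q − σ²I)^{1/2} up to an arbitrary rotation. The sketch below assumes NumPy, fixes the rotation to the identity, and uses simulated data; `ppca_mle` is an illustrative name, and the EM variant for missing data is not shown.

```python
import numpy as np

def ppca_mle(X, q):
    """Closed-form ML estimates (W, mu, sigma2) for probabilistic PCA, rotation R = I."""
    n, p = X.shape
    mu = X.mean(axis=0)
    S = (X - mu).T @ (X - mu) / n                    # ML covariance estimate (divides by n)
    evals, evecs = np.linalg.eigh(S)
    evals, evecs = evals[::-1], evecs[:, ::-1]       # sort in descending order
    sigma2 = evals[q:].mean()                        # average of the discarded eigenvalues
    W = evecs[:, :q] * np.sqrt(evals[:q] - sigma2)   # U_q (Lambda_q - sigma2 I)^(1/2)
    return W, mu, sigma2

# Hypothetical data simulated from the model with sigma^2 = 0.25.
rng = np.random.default_rng(3)
n, p, q = 2000, 6, 2
W_true = rng.normal(size=(p, q))
Z = rng.normal(size=(n, q))
X = Z @ W_true.T + 0.5 * rng.normal(size=(n, p))

W, mu, sigma2 = ppca_mle(X, q)
C_model = W @ W.T + sigma2 * np.eye(p)               # fitted covariance W W^T + sigma2 I
```

The fitted covariance `C_model` reproduces the top q eigenvalues of the sample covariance exactly and replaces the remaining ones with their average, `sigma2`.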

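Kernel PCA and Nonlinear Extensions

The kernel trick extends PCA to nonlinear feature maps (Schölkopf, Smola & Müller, 1998, Neural Comput). t-SNE (van der Maaten & Hinton, 2008) and UMAP (McInnes et al., 2018) are more recent nonlinear dimension-reduction methods that preserve local structure rather than global variance. In single-cell RNA-seq analysis, PCA is used for initial dimension reduction (the top 30-50 PCs), followed by UMAP for visualization.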
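Kernel PCA replaces the covariance eigenproblem with an eigendecomposition of the centered kernel (Gram) matrix. The following is a minimal sketch assuming NumPy and an RBF kernel; the function name, the value of `gamma`, and the concentric-circle toy data are illustrative assumptions.

```python
import numpy as np

def kernel_pca_rbf(X, k, gamma):
    """Kernel PCA scores of the training points using an RBF kernel."""
    sq = np.sum(X ** 2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)  # squared distances
    K = np.exp(-gamma * D2)                          # RBF Gram matrix
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                                   # center the data in feature space
    evals, evecs = np.linalg.eigh(Kc)
    evals, evecs = evals[::-1][:k], evecs[:, ::-1][:, :k]
    scores = evecs * np.sqrt(np.maximum(evals, 0.0)) # projections onto the kernel PCs
    return scores, evals

# Hypothetical toy data: two noisy concentric circles (radii 1 and 3).
rng = np.random.default_rng(4)
theta = rng.uniform(0.0, 2.0 * np.pi, size=200)
radius = np.repeat([1.0, 3.0], 100)
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])
X += 0.05 * rng.normal(size=X.shape)

scores, evals = kernel_pca_rbf(X, k=2, gamma=0.5)
```

How well the leading kernel components separate the two rings depends on `gamma`, so one would typically inspect the scores for several values. For an scRNA-seq-sized dataset the n × n Gram matrix becomes expensive, which is one reason linear PCA remains the usual first step.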

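Batch Effect Correction

PCA often reveals technical batch effects. ComBat (Johnson, Li & Rabinovic, 2007, Biostatistics) adjusts for known batches with an empirical Bayes model, while SVA (Leek & Storey, 2007, PLoS Genet) estimates surrogate variables to remove unwanted variation from unmodeled sources. For scRNA-seq, Harmony (Korsunsky et al., 2019) and LIGER integrate data from multiple batches.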
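A small simulation illustrates how a batch shift can dominate PC1. The sketch below assumes NumPy and simulated expression data; the per-batch mean-centering used as a correction is a naive location-only adjustment chosen for illustration. It is not ComBat, SVA, or Harmony, and it assumes batch is not confounded with the biology of interest.

```python
import numpy as np

def pc1_scores(X):
    """Scores on the first principal component of an (n, p) matrix."""
    Xc = X - X.mean(axis=0)
    U, d, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, 0] * d[0]

def center_within_batch(X, batch):
    """Subtract each batch's per-feature mean (naive location-only adjustment)."""
    X_adj = X.astype(float).copy()
    for b in np.unique(batch):
        idx = batch == b
        X_adj[idx] -= X_adj[idx].mean(axis=0)
    return X_adj

# Hypothetical data: 2 batches of 50 samples, 50 genes; batch 1 shifts the first 20 genes.
rng = np.random.default_rng(5)
n_per, p = 50, 50
batch = np.repeat([0, 1], n_per)
X = rng.normal(size=(2 * n_per, p))
X[batch == 1, :20] += 2.0

# Correlation between PC1 scores and the batch label, before and after adjustment.
corr_before = np.corrcoef(pc1_scores(X), batch)[0, 1]
corr_after = np.corrcoef(pc1_scores(center_within_batch(X, batch)), batch)[0, 1]
```

Before adjustment PC1 tracks the batch label almost perfectly; after within-batch centering the scores are uncorrelated with batch by construction. Real data usually call for the model-based methods named above.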

References: Johnstone, I.M. (2001). Ann Stat, 29, 295-327. / Zou, H. et al. (2006). J Comput Graph Stat, 15, 265-286. / Tipping, M.E. & Bishop, C.M. (1999). JRSSB, 61, 611-622.
