
Bioinformatics Statistics

Difficulty 3 · Advanced · statistics · bioinformatics

Advanced methods in bioinformatics statistics involve the empirical Bayes framework, network analysis, and multi-omics integration.

The Central Role of Empirical Bayes in High Dimensions

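A theme shared across omics analyses: the sample size per feature (gene, protein, or metabolite) is small (n = 3-10 per group), while the number of features is very large (p = 10³-10⁵). Empirical Bayes "borrows strength" across features: it builds a prior from the variance estimates of all genes and uses that prior to stabilize each individual gene's variance estimate. limma's moderated t-statistic (Smyth, 2004) shrinks gene-wise variances in this way, and DESeq2 (Love et al., 2014) applies the same idea to dispersion estimates and to log fold changes (shrunken LFC).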
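To make the borrowing of strength concrete, here is a minimal Python sketch of the variance shrinkage behind a moderated t-statistic. It assumes the prior degrees of freedom `d0` and the prior variance `s0_sq` are already known; limma estimates both from the variances of all genes, and that estimation step is not reproduced here. The function name and all numbers are illustrative, not taken from limma.

```python
import math

def moderated_t(effect, s_sq, df, d0, s0_sq, v):
    """Moderated t-statistic for a single gene.

    effect    : estimated log fold change for the gene
    s_sq      : the gene's own residual variance, on df degrees of freedom
    d0, s0_sq : prior degrees of freedom and prior variance (assumed given)
    v         : unscaled variance factor of the effect, e.g. 1/n1 + 1/n2
    """
    # Shrunken variance: weighted average of the prior and gene-wise variances.
    s_tilde_sq = (d0 * s0_sq + df * s_sq) / (d0 + df)
    t_mod = effect / math.sqrt(s_tilde_sq * v)
    return t_mod, s_tilde_sq, d0 + df  # statistic, shrunken variance, total df

# Hypothetical gene: 3 vs 3 samples with an unusually small sample variance.
v = 1 / 3 + 1 / 3
t_mod, s_tilde_sq, df_total = moderated_t(
    effect=1.0, s_sq=0.01, df=4, d0=4.0, s0_sq=0.09, v=v
)
t_ordinary = 1.0 / math.sqrt(0.01 * v)  # ordinary t uses the raw variance
print(t_mod, t_ordinary)
```

In this toy case the gene's tiny sample variance is pulled toward the prior, so the moderated statistic is smaller than the ordinary one. In Smyth (2004), the moderated statistic is referred to a t-distribution with `d0 + df` degrees of freedom under the null.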

Weighted Gene Co-expression Network Analysis (WGCNA)

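The workflow of Langfelder & Horvath (2008):

Build the gene co-expression correlation matrix.
Apply soft-thresholding (power adjacency) to obtain an approximately scale-free network.
Compute the topological overlap matrix (TOM) and apply hierarchical clustering to define gene modules.
Correlate each module eigengene (ME, the first principal component of the module) with the phenotype.
Identify disease-associated co-expression modules and hub genes.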
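The first three steps can be sketched in plain Python on a toy matrix. This is an illustration under stated assumptions rather than the WGCNA package itself: it uses an unsigned network, a fixed power `beta = 6` instead of choosing the power from a scale-free fit, and made-up expression values. Hierarchical clustering on `1 - TOM` and the module eigengene step are left out for brevity.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def soft_threshold_adjacency(expr, beta=6):
    """Unsigned adjacency a_ij = |cor(i, j)|**beta, with a zero diagonal."""
    g = len(expr)
    adj = [[0.0] * g for _ in range(g)]
    for i in range(g):
        for j in range(i + 1, g):
            adj[i][j] = adj[j][i] = abs(pearson(expr[i], expr[j])) ** beta
    return adj

def topological_overlap(adj):
    """TOM_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    g = len(adj)
    k = [sum(row) for row in adj]  # connectivity of each gene
    tom = [[1.0] * g for _ in range(g)]
    for i in range(g):
        for j in range(i + 1, g):
            # Shared neighbours; u = i and u = j drop out via the zero diagonal.
            shared = sum(adj[i][u] * adj[u][j] for u in range(g))
            tom[i][j] = tom[j][i] = (shared + adj[i][j]) / (
                min(k[i], k[j]) + 1.0 - adj[i][j]
            )
    return tom

# Toy data (hypothetical): rows are genes, columns are samples.
expr = [
    [1.0, 2.0, 3.0, 4.0, 5.0],  # gene 0
    [1.1, 2.3, 2.9, 4.2, 4.8],  # gene 1, co-expressed with gene 0
    [5.0, 1.0, 4.0, 2.0, 3.0],  # gene 2
    [4.5, 1.5, 3.8, 2.4, 3.1],  # gene 3, co-expressed with gene 2
]
adj = soft_threshold_adjacency(expr, beta=6)
tom = topological_overlap(adj)
dissimilarity = [[1.0 - t for t in row] for row in tom]  # clustering input
```

Raising correlations to a power keeps strong links and suppresses weak ones without a hard cutoff, which is why genes 0 and 1 (and genes 2 and 3) end up with high topological overlap while cross-pair overlap stays near zero.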

Multi-omics Integration

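Vertical integration: multiple omics layers measured on the same samples (e.g. mRNA + protein + methylation). Methods: Multi-Omics Factor Analysis (MOFA; Argelaguet et al., 2018), sparse CCA, DIABLO (mixOmics).
Horizontal integration: the same omics type measured across different studies or platforms. Methods: meta-analysis (fixed or random effects) or ComBat batch correction.
Causal integration: Mendelian Randomization combines GWAS and eQTL data to infer causal paths from gene to protein to disease.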
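For the meta-analysis route to horizontal integration, the sketch below pools one gene's effect across studies. The fixed-effect estimate uses inverse-variance weights; for the random-effects estimate it assumes the DerSimonian-Laird moment estimator of the between-study variance, which is one common choice and not something the text above prescribes. The effect sizes and standard errors are hypothetical.

```python
import math

def fixed_effect(betas, ses):
    """Inverse-variance weighted pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    return pooled, math.sqrt(1.0 / sum(weights))

def random_effects_dl(betas, ses):
    """DerSimonian-Laird random-effects estimate, its SE, and tau^2."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled_fe, _ = fixed_effect(betas, ses)
    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(w * (b - pooled_fe) ** 2 for w, b in zip(weights, betas))
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    tau_sq = max(0.0, (q - (len(betas) - 1)) / c)  # between-study variance
    w_star = [1.0 / (se ** 2 + tau_sq) for se in ses]
    pooled = sum(w * b for w, b in zip(w_star, betas)) / sum(w_star)
    return pooled, math.sqrt(1.0 / sum(w_star)), tau_sq

# Hypothetical log fold changes for one gene from three studies.
betas = [0.8, 1.2, 0.2]
ses = [0.2, 0.3, 0.25]
fe_est, fe_se = fixed_effect(betas, ses)
re_est, re_se, tau_sq = random_effects_dl(betas, ses)
print(fe_est, fe_se, re_est, re_se, tau_sq)
```

When the studies disagree more than their standard errors allow, `tau_sq` becomes positive and the random-effects standard error widens relative to the fixed-effect one; when they agree, the two estimates coincide.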

Bayesian and Machine Learning Frontiers

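scVI (Lopez et al., 2018): a variational autoencoder that models scRNA-seq counts with a zero-inflated negative binomial (ZINB) distribution.
Transfer learning methods such as CellTypist and scArches carry annotations from a reference atlas over to new datasets.
Graph Neural Networks (GNN) model the spatial relationships in spatial transcriptomics.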
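The ZINB likelihood mentioned for scVI can be written down directly. The sketch below evaluates the log-probability of a single count under a mean/inverse-dispersion parameterization (`mu`, `theta`) with zero-inflation probability `pi`; these parameter names are assumptions made for illustration. In scVI the parameters are produced by neural networks from a latent representation, and that part is not shown here.

```python
import math

def nb_log_pmf(y, mu, theta):
    """Log-probability of count y under a negative binomial
    with mean mu > 0 and inverse dispersion theta > 0."""
    return (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + mu))
        + y * math.log(mu / (theta + mu))
    )

def zinb_log_pmf(y, mu, theta, pi):
    """Log-probability under ZINB, where 0 < pi < 1 is the
    probability of a structural (excess) zero."""
    nb = nb_log_pmf(y, mu, theta)
    if y == 0:
        # A zero can be a structural zero or an ordinary NB zero.
        return math.log(pi + (1.0 - pi) * math.exp(nb))
    return math.log(1.0 - pi) + nb

# Hypothetical parameters for one gene in one cell.
mu, theta, pi = 5.0, 2.0, 0.3
p_zero = math.exp(zinb_log_pmf(0, mu, theta, pi))
total = sum(math.exp(zinb_log_pmf(y, mu, theta, pi)) for y in range(500))
print(p_zero, total)
```

The zero-inflation term raises the probability of observing a zero above what the negative binomial alone would give, which is the motivation for using ZINB on sparse scRNA-seq counts.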

References: Smyth, G.K. (2004). Stat Appl Genet Mol Biol, 3, Art 3. / Love, M.I. et al. (2014). Genome Biol, 15, 550. / Langfelder, P. & Horvath, S. (2008). BMC Bioinformatics, 9, 559.
