Efficient Penalized Generalized Linear Mixed Models for Variable Selection and Genetic Risk Prediction in High-Dimensional Data

Paper title

Efficient Penalized Generalized Linear Mixed Models for Variable Selection and Genetic Risk Prediction in High-Dimensional Data

Authors

Julien St-Pierre, Karim Oualkacha, Sahir Rai Bhatnagar

Abstract

Sparse regularized regression methods are now widely used in genome-wide association studies (GWAS) to address the multiple testing burden that limits discovery of potentially important predictors. Linear mixed models (LMMs) have become an attractive alternative to principal components (PC) adjustment to account for population structure and relatedness in high-dimensional penalized models. However, their use in binary trait GWAS relies on the invalid assumption that the residual variance does not depend on the estimated regression coefficients. Moreover, LMMs use a single spectral decomposition of the covariance matrix of the responses, which is no longer possible in generalized linear mixed models (GLMMs). We introduce a new method called pglmm, a penalized GLMM that simultaneously selects genetic markers and estimates their effects while accounting for between-individual correlations and the binary nature of the trait. We develop a computationally efficient algorithm based on penalized quasi-likelihood (PQL) estimation that scales regularized mixed models to high-dimensional binary trait GWAS (~300,000 SNPs). We show through simulations that, compared to pglmm, penalized LMM and logistic regression with PC adjustment fail to correctly select important predictors and/or lose prediction accuracy for a binary response when the dimensionality of the relatedness matrix is high. Further, we demonstrate through the analysis of two polygenic binary traits in the UK Biobank data that our method can achieve higher predictive performance while also selecting fewer predictors than a sparse regularized logistic lasso with PC adjustment. Our method is available as the Julia package PenalizedGLMM.jl.
