A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank.

IF 3.7 · Zone 2 (Biology) · Q1 Agricultural and Biological Sciences · PLoS Genetics · Pub Date: 2020-10-23 · eCollection Date: 2020-10-01 · DOI: 10.1371/journal.pgen.1009141

Junyang Qian, Yosuke Tanigawa, Wenfei Du, Matthew Aguirre, Chris Chang, Robert Tibshirani, Manuel A Rivas, Trevor Hastie

Citations: 0

Abstract

The UK Biobank is a very large, prospective population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with genome-wide association studies (GWAS), have already been shown to greatly improve prediction performance for a variety of phenotypes. In high-dimensional settings, the lasso, since its first proposal in statistics, has proven to be an effective method for simultaneous variable selection and estimation. However, the large scale and ultrahigh dimensionality of the UK Biobank pose new challenges for applying the lasso, as many existing algorithms and their implementations do not scale to large applications. In this paper, we propose a computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including data that are larger than the memory size. We introduce snpnet, an R package that implements the proposed algorithm on top of glmnet and is optimized for single nucleotide polymorphism (SNP) datasets. It currently supports the ℓ1-penalized linear model, logistic regression, and the Cox model, and also extends to the elastic net with an ℓ1/ℓ2 penalty. We demonstrate results on the UK Biobank dataset, where we achieve competitive predictive performance for all four phenotypes considered (height, body mass index, asthma, high cholesterol) using only a small fraction of the variants compared with other established polygenic risk score methods.
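The batch-screening idea in the abstract can be illustrated with a short Python sketch. This is a simplified illustration, not the snpnet implementation. It assumes a squared-error loss, uses a small coordinate descent routine (`lasso_cd`, a hypothetical stand-in for an existing solver such as glmnet), and keeps the full matrix in memory where a real pipeline would presumably read columns from disk in batches. The names `basil_sketch`, `batch_size`, and `kkt_tol` are illustrative assumptions. For each penalty value, the solver only ever sees a small working set of variables. A full scan of the gradient then checks the optimality (KKT) conditions for the excluded variables and, if any are violated, adds the next batch of top-ranked variables before refitting.

```python
import numpy as np

def lasso_cd(X, y, lam, beta0=None, n_iter=500, tol=1e-8):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1.

    Hypothetical stand-in for "any existing lasso solver".
    """
    n, p = X.shape
    beta = np.zeros(p) if beta0 is None else beta0.copy()
    col_sq = (X ** 2).sum(axis=0) / n
    r = y - X @ beta
    for _ in range(n_iter):
        max_delta = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Partial residual correlation for coordinate j.
            rho = X[:, j] @ r / n + col_sq[j] * beta[j]
            # Soft-thresholding update.
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                r -= X[:, j] * delta
                beta[j] = new
                max_delta = max(max_delta, abs(delta))
        if max_delta < tol:
            break
    return beta

def basil_sketch(X, y, lambdas, batch_size=10, kkt_tol=1e-6):
    """Fit a lasso path on a growing working set, verifying KKT on all columns.

    `lambdas` should be in decreasing order so each fit warm-starts the next.
    """
    n, p = X.shape
    active = np.zeros(p, dtype=bool)  # working set passed to the solver
    beta = np.zeros(p)
    path = []
    for lam in lambdas:
        while True:
            idx = np.flatnonzero(active)
            new_beta = np.zeros(p)
            if idx.size:
                # The solver only sees the working-set columns.
                new_beta[idx] = lasso_cd(X[:, idx], y, lam, beta0=beta[idx])
            beta = new_beta
            # Full scan: absolute gradient of the loss for every variable.
            grad = np.abs(X.T @ (y - X @ beta)) / n
            violators = (~active) & (grad > lam + kkt_tol)
            if not violators.any():
                break  # KKT holds everywhere, so this is a full-data solution
            # Screening: add the top-ranked excluded variables as one batch.
            cand = np.flatnonzero(~active)
            top = cand[np.argsort(-grad[cand])[:batch_size]]
            active[top] = True
        path.append(beta.copy())
    return np.array(path)
```

Calling `basil_sketch(X, y, lambdas)` returns one coefficient vector per penalty value. Because the KKT check covers every variable, each returned vector should agree, up to solver tolerance, with a lasso fit on the full matrix, even though the solver never receives more than the working set.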

Source journal: PLoS Genetics (Biology: Genetics)

CiteScore: 8.10

Self-citation rate: 2.20%

Articles published: 438

Review time: 1 month

About the journal: PLOS Genetics is run by an international Editorial Board, headed by the Editors-in-Chief, Greg Barsh (HudsonAlpha Institute of Biotechnology, and Stanford University School of Medicine) and Greg Copenhaver (The University of North Carolina at Chapel Hill). Articles published in PLOS Genetics are archived in PubMed Central and cited in PubMed.