A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank.

IF 4.5 | Zone 2, Biology | Q1, Agricultural and Biological Sciences | PLoS Genetics | Pub Date: 2020-10-23 | eCollection Date: 2020-10-01 | DOI: 10.1371/journal.pgen.1009141
Junyang Qian, Yosuke Tanigawa, Wenfei Du, Matthew Aguirre, Chris Chang, Robert Tibshirani, Manuel A Rivas, Trevor Hastie
Article number: e1009141 | PDF (PubMed Central): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7641476/pdf/
Citations: 68

Abstract

The UK Biobank is a very large, prospective, population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with genome-wide association studies (GWAS), have already been shown to greatly improve prediction performance for a variety of phenotypes. In high-dimensional settings, the lasso has, since it was first proposed in statistics, proven to be an effective method for simultaneous variable selection and estimation. However, the large scale and ultrahigh dimensionality of the UK Biobank pose new challenges for applying the lasso, as many existing algorithms and their implementations do not scale to such large applications. In this paper, we propose a computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data sets, including those that are larger than memory. We introduce snpnet, an R package that implements the proposed algorithm on top of glmnet and is optimized for single nucleotide polymorphism (SNP) datasets. It currently supports the ℓ1-penalized linear model, logistic regression, and the Cox model, and also extends to the elastic net with an ℓ1/ℓ2 penalty. We demonstrate results on the UK Biobank dataset, where we achieve competitive predictive performance for all four phenotypes considered (height, body mass index, asthma, high cholesterol) using only a small fraction of the variants compared with other established polygenic risk score methods.
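The abstract does not spell out the steps of BASIL, so the following is only a minimal sketch of a generic screen-fit-check loop of the kind the name "batch screening iterative lasso" suggests. It is not the snpnet implementation. Assumptions: a single penalty value `lam` rather than a full path, data held in memory, and a toy NumPy coordinate-descent routine standing in for "any existing lasso solver". The names `lasso_cd`, `screened_lasso`, `batch_size`, and `kkt_tol` are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, beta0=None, n_iter=500, tol=1e-8):
    """Toy coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1.
    Stand-in for an existing lasso solver (assumption, not glmnet)."""
    n, p = X.shape
    beta = np.zeros(p) if beta0 is None else beta0.copy()
    col_sq = (X ** 2).sum(axis=0) / n
    r = y - X @ beta
    for _ in range(n_iter):
        max_delta = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            old = beta[j]
            rho = X[:, j] @ r / n + col_sq[j] * old
            new = soft_threshold(rho, lam) / col_sq[j]
            if new != old:
                r -= X[:, j] * (new - old)  # keep residual in sync
                beta[j] = new
                max_delta = max(max_delta, abs(new - old))
        if max_delta < tol:
            break
    return beta

def screened_lasso(X, y, lam, batch_size=10, max_rounds=50, kkt_tol=1e-6):
    """Hypothetical screen-fit-check loop for one value of lam."""
    n, p = X.shape
    beta = np.zeros(p)
    active = np.zeros(p, dtype=bool)
    for _ in range(max_rounds):
        # Check: an excluded variable j is safe if |x_j^T r| / n <= lam.
        # For data larger than memory, this product could be computed
        # over column chunks read from disk (assumption).
        r = y - X[:, active] @ beta[active]
        score = np.abs(X.T @ r) / n
        violators = (~active) & (score > lam + kkt_tol)
        if not violators.any():
            return beta, active  # optimality conditions hold for all p
        # Screen: admit a batch of the strongest violators.
        cand = np.flatnonzero(violators)
        top = cand[np.argsort(score[cand])[::-1][:batch_size]]
        active[top] = True
        # Fit: solve the lasso on the admitted variables only.
        beta[active] = lasso_cd(X[:, active], y, lam, beta0=beta[active])
    raise RuntimeError("screening loop did not terminate")
```

In this sketch the solver only ever sees the admitted columns, and the full variable set is touched once per round through a single matrix-vector product, which is the property that would let such a loop scale when most coefficients are zero.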

Source journal: PLoS Genetics (Biology, Genetics)
CiteScore: 8.10
Self-citation rate: 2.20%
Articles published: 438
Review time: 1 month
About the journal: PLOS Genetics is run by an international Editorial Board, headed by the Editors-in-Chief, Greg Barsh (HudsonAlpha Institute of Biotechnology, and Stanford University School of Medicine) and Greg Copenhaver (The University of North Carolina at Chapel Hill). Articles published in PLOS Genetics are archived in PubMed Central and cited in PubMed.