Efficient storage and regression computation for population-scale genome sequencing studies.

Manuel A Rivas, Christopher Chang
Journal: Bioinformatics (Oxford, England)
Publication date: 2025-03-04
DOI: https://doi.org/10.1093/bioinformatics/btaf067
Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11893150/pdf/

Abstract

Motivation: The growing availability of large-scale population biobanks has the potential to significantly advance our understanding of human health and disease. However, the massive computational and storage demands of whole genome sequencing (WGS) data pose serious challenges, particularly for underfunded institutions or researchers in developing countries. This disparity in resources can limit equitable access to cutting-edge genetic research.

Results: We present novel algorithms and regression methods that dramatically reduce both computation time and storage requirements for WGS studies, with particular attention to rare variant representation. By integrating these approaches into PLINK 2.0, we demonstrate substantial gains in efficiency without compromising analytical accuracy. In an exome-wide association analysis of 19.4 million variants for the body mass index phenotype in 125 077 individuals (All of Us project data), we reduced runtime from 695.35 min (11.5 h) on a single machine to 1.57 min with 30 GB of memory and 50 threads (or 8.67 min with 4 threads). Additionally, the framework supports multi-phenotype analyses, further enhancing its flexibility.
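
For a sense of scale, the runtimes reported above imply roughly a 443-fold speedup with 50 threads and roughly an 80-fold speedup with 4 threads. The snippet below is only illustrative arithmetic on those quoted figures; the names `speedup`, `BASELINE_MIN`, and `OPTIMIZED_MIN` are hypothetical and are not part of PLINK 2.0.

```python
# Runtimes quoted in the Results paragraph, in minutes.
BASELINE_MIN = 695.35                 # original run on a single machine
OPTIMIZED_MIN = {50: 1.57, 4: 8.67}   # thread count -> optimized runtime


def speedup(baseline_min: float, optimized_min: float) -> float:
    """Return how many times faster the optimized run is than the baseline."""
    return baseline_min / optimized_min


if __name__ == "__main__":
    for threads, minutes in OPTIMIZED_MIN.items():
        factor = speedup(BASELINE_MIN, minutes)
        print(f"{threads} threads: {minutes} min, about {factor:.1f}x faster")
```

These ratios compare wall-clock times only; the abstract does not state whether the baseline and optimized runs used the same hardware, so they should be read as reported speedups rather than as a controlled benchmark.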

Availability and implementation: Our optimized methods are fully integrated into PLINK 2.0 and can be accessed at: https://www.cog-genomics.org/plink/2.0/.
