Efficient storage and regression computation for population-scale genome sequencing studies.

Bioinformatics (Oxford, England) (IF 5.4). Pub Date: 2025-03-04. DOI: 10.1093/bioinformatics/btaf067

Manuel A Rivas, Christopher Chang



Abstract

Motivation: The growing availability of large-scale population biobanks has the potential to significantly advance our understanding of human health and disease. However, the massive computational and storage demands of whole genome sequencing (WGS) data pose serious challenges, particularly for underfunded institutions or researchers in developing countries. This disparity in resources can limit equitable access to cutting-edge genetic research.

Results: We present novel algorithms and regression methods that dramatically reduce both computation time and storage requirements for WGS studies, with particular attention to rare variant representation. By integrating these approaches into PLINK 2.0, we demonstrate substantial gains in efficiency without compromising analytical accuracy. In an exome-wide association analysis of 19.4 million variants for the body mass index phenotype in 125 077 individuals (AllofUs project data), we reduced runtime from 695.35 min (11.5 h) on a single machine to 1.57 min with 30 GB of memory and 50 threads (or 8.67 min with 4 threads). Additionally, the framework supports multi-phenotype analyses, further enhancing its flexibility.
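The runtimes quoted above can be turned into speedup factors with simple division. The sketch below uses only the three figures from the Results paragraph. The variable and function names are illustrative, and because the abstract does not state how many threads the 695.35 min baseline used, the ratios are wall-clock comparisons only, not per-thread efficiency estimates.

```python
# Runtimes in minutes, taken from the Results paragraph.
BASELINE_MIN = 695.35        # original single-machine run
FAST_50_THREADS_MIN = 1.57   # optimized run, 50 threads, 30 GB memory
FAST_4_THREADS_MIN = 8.67    # optimized run, 4 threads


def speedup(baseline_min: float, optimized_min: float) -> float:
    """Return how many times faster the optimized run is than the baseline."""
    return baseline_min / optimized_min


speedup_50 = speedup(BASELINE_MIN, FAST_50_THREADS_MIN)
speedup_4 = speedup(BASELINE_MIN, FAST_4_THREADS_MIN)

# Wall-clock gain from moving the optimized run from 4 to 50 threads.
thread_scaling = FAST_4_THREADS_MIN / FAST_50_THREADS_MIN

print(f"50 threads: about {speedup_50:.0f}x faster than the baseline")
print(f" 4 threads: about {speedup_4:.0f}x faster than the baseline")
print(f"4 -> 50 threads: about {thread_scaling:.1f}x faster for 12.5x the threads")
```

Under these figures the optimized run is roughly 443 times faster than the baseline with 50 threads and roughly 80 times faster with 4 threads, so most of the gain is already available on a modest machine.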

Availability and implementation: Our optimized methods are fully integrated into PLINK 2.0 and can be accessed at: https://www.cog-genomics.org/plink/2.0/.


