Small-group originating model: Optimized individual-level GWAS simulation featured by SLiM and using open-access data

IF 2.6 · CAS Zone 4 (Biology) · JCR Q2 (Biology) · Computational Biology and Chemistry · Pub Date: 2024-07-15 · DOI: 10.1016/j.compbiolchem.2024.108147

Citations: 0

Abstract

The development of analytical methods for Genome-wide Association Studies (GWAS) has outpaced the evolution of simulation techniques and pipelines. This disparity underscores the importance of innovative simulation methods that can keep pace with the rapidly increasing scale of GWAS. The median sample size of GWAS over the past ten years has exceeded 50,000 individuals, a trend that emphasizes the need for simulation tools capable of generating data on a similar or larger scale. This paper introduces a novel method, the small-group originating (SGO) model, utilizing the SLiM software for simulating individual-level GWAS data. Our standardized protocol facilitates the generation of tens of thousands of pseudo-individuals with millions of variants from small (30−90) open-access datasets.

Compared with the widely used resampling method in HapGen, SGO simulates large samples (> 13,000) of unrelated individuals more efficiently. This capability matters given the current trajectory towards larger GWAS, which calls for tools that can simulate datasets of matching size. SGO also provides customization options and can model dynamic life cycles and mating across generations, making it a promising alternative for GWAS simulations.

In a case study, sensitivity analyses of chromosome-level principal component analysis and kinship coefficient estimation were conducted. The results highlighted the poor robustness of chromosome-level quality control (QC) indexes and the uneven distribution of population structure across chromosomes and ancestries, which argues for caution against relying solely on chromosome-level QC statistics.

With its flexible and efficient approach to generating pseudo GWAS data, our standardized SGO protocol emerges as a crucial asset for method development, power analysis, and benchmarking in GWAS research. It is especially vital in the context of accommodating the demands for large-scale simulations, aligning with the current and future scale of GWAS.
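
The case study's caution about chromosome-level PCA can be illustrated with a toy example. The sketch below is not the SGO protocol and does not use SLiM or HapGen; it is a minimal NumPy illustration under assumed conditions: two hypothetical populations, one chromosome whose allele frequencies have drifted apart and one whose frequencies have not. All function names, sample sizes, and the `drift_sd` parameter are assumptions made for illustration only.

```python
import numpy as np


def simulate_chromosome(rng, n_per_pop, n_variants, drift_sd):
    """Toy 0/1/2 genotypes for two populations on one chromosome.

    drift_sd (hypothetical parameter) sets how far each population's allele
    frequencies move away from shared ancestral frequencies; 0 means the
    chromosome carries no population structure.
    """
    ancestral = rng.uniform(0.1, 0.9, size=n_variants)
    blocks = []
    for _ in range(2):
        drifted = ancestral + rng.normal(0.0, drift_sd, n_variants)
        freqs = np.clip(drifted, 0.01, 0.99)
        blocks.append(rng.binomial(2, freqs, size=(n_per_pop, n_variants)))
    return np.vstack(blocks)


def first_pc(genotypes):
    """Scores of the first principal component, one value per individual."""
    g = np.asarray(genotypes, dtype=float)
    sd = g.std(axis=0)
    keep = sd > 0  # drop monomorphic variants before scaling
    z = (g[:, keep] - g[:, keep].mean(axis=0)) / sd[keep]
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, 0] * s[0]


def abs_corr(a, b):
    """Absolute Pearson correlation (the sign of a PC is arbitrary)."""
    return abs(np.corrcoef(a, b)[0, 1])


rng = np.random.default_rng(0)
n_per_pop = 100
labels = np.repeat([0.0, 1.0], n_per_pop)  # population membership

chr_structured = simulate_chromosome(rng, n_per_pop, 500, drift_sd=0.1)
chr_flat = simulate_chromosome(rng, n_per_pop, 500, drift_sd=0.0)
genome = np.hstack([chr_structured, chr_flat])

# How well does PC1 recover population membership on each input?
for name, g in [("structured chromosome", chr_structured),
                ("unstructured chromosome", chr_flat),
                ("both chromosomes", genome)]:
    print(name, round(abs_corr(first_pc(g), labels), 3))
```

In this toy setting, PC1 computed on a single chromosome can track population membership closely or barely at all, depending on which chromosome is chosen, which is the kind of unevenness the abstract warns about.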

Source journal

Computational Biology and Chemistry (Biology; Computer Science, Interdisciplinary Applications)

CiteScore: 6.10
Self-citation rate: 3.20%
Articles published: 142
Review time: 24 days

About the journal: Computational Biology and Chemistry publishes original research papers and review articles in all areas of computational life sciences. High quality research contributions with a major computational component in the areas of nucleic acid and protein sequence research, molecular evolution, molecular genetics (functional genomics and proteomics), theory and practice of either biology-specific or chemical-biology-specific modeling, and structural biology of nucleic acids and proteins are particularly welcome. Exceptionally high quality research work in bioinformatics, systems biology, ecology, computational pharmacology, metabolism, biomedical engineering, epidemiology, and statistical genetics will also be considered. Given their inherent uncertainty, protein modeling and molecular docking studies should be thoroughly validated. In the absence of experimental results for validation, molecular dynamics simulations along with detailed free energy calculations, for example, should be used as complementary techniques to support the major conclusions. Submissions of premature modeling exercises without additional biological insights will not be considered. Review articles will generally be commissioned by the editors and should not be submitted to the journal without explicit invitation. However, prospective authors are welcome to send a brief (one to three pages) synopsis, which will be evaluated by the editors.