Biostatistical Aspects of Whole Genome Sequencing Studies: Preprocessing and Quality Control

Biometrical Journal · Pub Date: 2024-07-11 · DOI: 10.1002/bimj.202300278 · IF 1.3 · Zone 3 (Biology) · JCR Q4 (Mathematical & Computational Biology)
Raphael O. Betschart, Cristian Riccio, Domingo Aguilera-Garcia, Stefan Blankenberg, Linlin Guo, Holger Moch, Dagmar Seidl, Hugo Solleder, Felix Thalén, Alexandre Thiéry, Raphael Twerenbold, Tanja Zeller, Martin Zoche, Andreas Ziegler

Abstract

Biostatistical Aspects of Whole Genome Sequencing Studies: Preprocessing and Quality Control

Rapid advances in high-throughput DNA sequencing technologies have enabled large-scale whole genome sequencing (WGS) studies. Before performing association analysis between phenotypes and genotypes, preprocessing and quality control (QC) of the raw sequence data need to be performed. Because many biostatisticians have not worked with WGS data so far, we first sketch Illumina's short-read sequencing technology. Second, we explain the general preprocessing pipeline for WGS studies. Third, we provide an overview of important QC metrics that are applied to WGS data at four stages: on the raw data, after mapping and alignment, after variant calling, and after multisample variant calling. Fourth, we illustrate the QC with the data from the GENEtic SequencIng Study Hamburg–Davos (GENESIS-HD), a study involving more than 9000 human whole genomes. All samples were sequenced on an Illumina NovaSeq 6000 with an average coverage of 35× using a PCR-free protocol. For QC, one Genome in a Bottle (GIAB) trio was sequenced in four replicates, and one GIAB sample was successfully sequenced 70 times in different runs. Fifth, we provide empirical data on the compression of raw data using the DRAGEN Original Read Archive (ORA). The most important quality metrics in the application were genetic similarity, sample cross-contamination, deviations from the expected Het/Hom ratio, relatedness, and coverage. The compression ratio of the raw files using DRAGEN ORA was 5.6:1, and compression time was linear in genome coverage. In summary, the preprocessing, joint calling, and QC of large WGS studies are feasible within a reasonable time, and efficient QC procedures are readily available.
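
Two of the QC quantities named in the abstract, the Het/Hom ratio and coverage, can be illustrated with a minimal Python sketch. This is not the authors' pipeline or the DRAGEN implementation. It assumes one common definition of the Het/Hom ratio (heterozygous calls divided by homozygous non-reference calls), VCF-style diploid genotype strings, and the usual approximation of mean coverage as total sequenced bases divided by genome size. The function names and all numbers in the usage lines are hypothetical.

```python
from collections import Counter


def het_hom_ratio(genotypes):
    """Heterozygous calls divided by homozygous non-reference calls.

    `genotypes` is an iterable of VCF-style diploid GT strings such as
    "0/1" or "1|1". Missing calls ("./.") and homozygous-reference
    calls ("0/0") are ignored. The definition is an assumption; the
    paper may define the metric differently.
    """
    counts = Counter()
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if len(alleles) != 2 or "." in alleles:
            continue  # skip missing or non-diploid calls
        if alleles[0] != alleles[1]:
            counts["het"] += 1
        elif alleles[0] != "0":
            counts["hom_alt"] += 1
    if counts["hom_alt"] == 0:
        raise ValueError("no homozygous non-reference calls")
    return counts["het"] / counts["hom_alt"]


def mean_coverage(n_reads, read_length, genome_size):
    """Approximate mean coverage: total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


# Hypothetical inputs for illustration only.
calls = ["0/1", "1|1", "0/0", "./.", "1/2", "0|1"]
print(het_hom_ratio(calls))
print(mean_coverage(700_000_000, 150, 3_000_000_000))
```

In a real study the genotype strings would come from a VCF parser rather than a hand-written list, and a sample whose ratio deviates strongly from the cohort's expected value would be flagged for review rather than dropped automatically.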

Source journal: Biometrical Journal (Biology: Mathematical & Computational Biology)
CiteScore: 3.20
Self-citation rate: 5.90%
Articles published: 119
Review time: 6-12 weeks
About the journal: Biometrical Journal publishes papers on statistical methods and their applications in life sciences including medicine, environmental sciences and agriculture. Methodological developments should be motivated by an interesting and relevant problem from these areas. Ideally the manuscript should include a description of the problem and a section detailing the application of the new methodology to the problem. Case studies, review articles and letters to the editors are also welcome. Papers containing only extensive mathematical theory are not suitable for publication in Biometrical Journal.