Impact of reference design on estimating SARS-CoV-2 lineage abundances from wastewater sequencing data.

IF 11.8 2区生物学 Q1 MULTIDISCIPLINARY SCIENCES GigaScience Pub Date : 2024-01-02 DOI:10.1093/gigascience/giae051

Eva Aßmann, Shelesh Agrawal, Laura Orschler, Sindy Böttcher, Susanne Lackner, Martin Hölzer

{"title":"Impact of reference design on estimating SARS-CoV-2 lineage abundances from wastewater sequencing data.","authors":"Eva Aßmann, Shelesh Agrawal, Laura Orschler, Sindy Böttcher, Susanne Lackner, Martin Hölzer","doi":"10.1093/gigascience/giae051","DOIUrl":null,"url":null,"abstract":"Background: Sequencing of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) RNA from wastewater samples has emerged as a valuable tool for detecting the presence and relative abundances of SARS-CoV-2 variants in a community. By analyzing the viral genetic material present in wastewater, researchers and public health authorities can gain early insights into the spread of virus lineages and emerging mutations. Constructing reference datasets from known SARS-CoV-2 lineages and their mutation profiles has become state-of-the-art for assigning viral lineages and their relative abundances from wastewater sequencing data. However, selecting reference sequences or mutations directly affects the predictive power.Results: Here, we show the impact of a mutation- and sequence-based reference reconstruction for SARS-CoV-2 abundance estimation. We benchmark 3 datasets: (i) synthetic \"spike-in\"' mixtures; (ii) German wastewater samples from early 2021, mainly comprising Alpha; and (iii) samples obtained from wastewater at an international airport in Germany from the end of 2021, including first signals of Omicron. The 2 approaches differ in sublineage detection, with the marker mutation-based method, in particular, being challenged by the increasing number of mutations and lineages. However, the estimations of both approaches depend on selecting representative references and optimized parameter settings. By performing parameter escalation experiments, we demonstrate the effects of reference size and alternative allele frequency cutoffs for abundance estimation. We show how different parameter settings can lead to different results for our test datasets and illustrate the effects of virus lineage composition of wastewater samples and references.Conclusions: Our study highlights current computational challenges, focusing on the general reference design, which directly impacts abundance allocations. We illustrate advantages and disadvantages that may be relevant for further developments in the wastewater community and in the context of defining robust quality metrics.","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"13 ","pages":""},"PeriodicalIF":11.8000,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11308188/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"GigaScience","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/gigascience/giae051","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}

引用次数: 0

Abstract

Background: Sequencing of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) RNA from wastewater samples has emerged as a valuable tool for detecting the presence and relative abundances of SARS-CoV-2 variants in a community. By analyzing the viral genetic material present in wastewater, researchers and public health authorities can gain early insights into the spread of virus lineages and emerging mutations. Constructing reference datasets from known SARS-CoV-2 lineages and their mutation profiles has become state-of-the-art for assigning viral lineages and their relative abundances from wastewater sequencing data. However, selecting reference sequences or mutations directly affects the predictive power.

Results: Here, we show the impact of a mutation- and sequence-based reference reconstruction for SARS-CoV-2 abundance estimation. We benchmark 3 datasets: (i) synthetic "spike-in"' mixtures; (ii) German wastewater samples from early 2021, mainly comprising Alpha; and (iii) samples obtained from wastewater at an international airport in Germany from the end of 2021, including first signals of Omicron. The 2 approaches differ in sublineage detection, with the marker mutation-based method, in particular, being challenged by the increasing number of mutations and lineages. However, the estimations of both approaches depend on selecting representative references and optimized parameter settings. By performing parameter escalation experiments, we demonstrate the effects of reference size and alternative allele frequency cutoffs for abundance estimation. We show how different parameter settings can lead to different results for our test datasets and illustrate the effects of virus lineage composition of wastewater samples and references.

Conclusions: Our study highlights current computational challenges, focusing on the general reference design, which directly impacts abundance allocations. We illustrate advantages and disadvantages that may be relevant for further developments in the wastewater community and in the context of defining robust quality metrics.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

参考设计对从废水测序数据中估计 SARS-CoV-2 世系丰度的影响。

背景：对废水样本中的严重急性呼吸系统综合征冠状病毒 2（SARS-CoV-2）RNA 进行测序，已成为检测社区中是否存在 SARS-CoV-2 变种及其相对丰度的重要工具。通过分析废水中的病毒遗传物质，研究人员和公共卫生机构可以及早了解病毒品系的传播和新出现的变异。从已知的 SARS-CoV-2 品系及其变异特征中构建参考数据集，已成为从废水测序数据中确定病毒品系及其相对丰度的最先进方法。然而，参考序列或突变的选择会直接影响预测能力：在此，我们展示了基于突变和序列的参考重建对 SARS-CoV-2 丰度估计的影响。我们以 3 个数据集为基准：(i) 合成的 "spike-in"'混合物；(ii) 2021 年初的德国废水样本，主要包括 Alpha；(iii) 2021 年底从德国一个国际机场的废水中获得的样本，包括 Omicron 的第一个信号。这两种方法在亚系检测方面存在差异，尤其是基于标记突变的方法面临着突变和亚系数量不断增加的挑战。不过，这两种方法的估计结果都取决于选择有代表性的参照物和优化的参数设置。通过参数升级实验，我们展示了参照物大小和替代等位基因频率截止值对丰度估计的影响。我们展示了不同的参数设置如何导致我们的测试数据集产生不同的结果，并说明了废水样本和参考文献的病毒谱系组成的影响：我们的研究突显了当前的计算挑战，重点是直接影响丰度分配的一般参考设计。我们说明了可能与废水领域进一步发展相关的优势和劣势，以及在定义稳健质量指标方面的优势和劣势。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

GigaScience MULTIDISCIPLINARY SCIENCES-

CiteScore

15.50

自引率

1.10%

发文量

119

审稿时长

1 weeks

期刊介绍： GigaScience seeks to transform data dissemination and utilization in the life and biomedical sciences. As an online open-access open-data journal, it specializes in publishing "big-data" studies encompassing various fields. Its scope includes not only "omic" type data and the fields of high-throughput biology currently serviced by large public repositories, but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology and other new types of large-scale shareable data.