Computational Study Protocol: Leveraging Synthetic Data to Validate a Benchmark Study for Differential Abundance Tests for 16S Microbiome Sequencing Data.

F1000Research (Q2, Pharmacology, Toxicology and Pharmaceutics). Publication date: 2025-01-02. eCollection date: 2024-01-01. DOI: 10.12688/f1000research.155230.2
Eva Kohnert, Clemens Kreutz

Abstract

Background: The utility of synthetic data in benchmark studies depends on its ability to closely mimic real-world conditions and to reproduce results obtained from experimental data. Building on the study by Nearing et al. (1), who assessed 14 differential abundance tests using 38 experimental 16S rRNA datasets in a case-control design, we are generating synthetic datasets that mimic the experimental data to verify their findings. We will employ statistical tests to rigorously assess the similarity between synthetic and experimental data and to validate the conclusions that Nearing et al. (1) drew about the performance of these differential abundance tests. This protocol adheres to the SPIRIT guidelines, demonstrating how established reporting frameworks can support robust, transparent, and unbiased study planning.

Methods: We replicate the methodology of Nearing et al. (1), incorporating synthetic data simulated with two distinct tools to mirror the 38 experimental datasets. Equivalence tests will be conducted on a non-redundant subset of 46 data characteristics to compare synthetic and experimental data, complemented by principal component analysis for an overall similarity assessment. The 14 differential abundance tests will be applied to both synthetic and experimental datasets to evaluate the consistency of significant feature identification and the number of significant features per tool. Correlation analysis and multiple regression will explore how differences between synthetic and experimental data characteristics may affect the results.
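The equivalence testing mentioned in the Methods can be illustrated with a small sketch. The abstract does not say which equivalence procedure, margin, or data characteristics are used, so everything below is an assumption made for illustration: the function name `tost_paired`, the choice of a paired two one-sided tests (TOST) procedure with a normal approximation, the margin of 0.05, and the sparsity values, which are made up and not taken from the study.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def tost_paired(synthetic, experimental, margin, alpha=0.05):
    """Paired TOST equivalence test using a normal approximation.

    Hypothetical helper, not the protocol's implementation.
    H0: the mean paired difference lies outside (-margin, +margin).
    Returns (p_value, is_equivalent).
    """
    if len(synthetic) != len(experimental):
        raise ValueError("inputs must be paired, one value per dataset")
    diffs = [s - e for s, e in zip(synthetic, experimental)]
    if len(diffs) < 2:
        raise ValueError("need at least two paired values")

    d = mean(diffs)
    se = stdev(diffs) / sqrt(len(diffs))
    if se == 0.0:
        # All differences identical: the decision is deterministic.
        equivalent = abs(d) < margin
        return (0.0 if equivalent else 1.0), equivalent

    z = NormalDist()
    # One-sided test against the lower bound: H0 is d <= -margin.
    p_lower = 1.0 - z.cdf((d + margin) / se)
    # One-sided test against the upper bound: H0 is d >= +margin.
    p_upper = z.cdf((d - margin) / se)
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha


# Made-up sparsity values for six dataset pairs (illustration only).
experimental = [0.80, 0.75, 0.90, 0.60, 0.85, 0.70]
synthetic = [0.81, 0.74, 0.91, 0.59, 0.86, 0.71]

p, equivalent = tost_paired(synthetic, experimental, margin=0.05)
print(f"p = {p:.3g}, equivalent within margin: {equivalent}")
```

Equivalence is declared only when both one-sided tests reject, which is why the larger of the two p-values is reported. With few dataset pairs, a t-distribution would be more appropriate than the normal approximation used here.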

Conclusions: Synthetic data enables the validation of findings through controlled experiments. We assess how well synthetic data replicates experimental data, attempt to validate previous findings with the most recent versions of the differential abundance (DA) methods, and delineate the strengths and limitations of synthetic data in benchmark studies. Moreover, to our knowledge, this is the first computational benchmark study to systematically incorporate synthetic data for validating differential abundance methods while strictly adhering to a pre-specified study protocol following the SPIRIT guidelines, contributing to transparency, reproducibility, and unbiased research.
