Computational Study Protocol: Leveraging Synthetic Data to Validate a Benchmark Study for Differential Abundance Tests for 16S Microbiome Sequencing Data.
{"title":"Computational Study Protocol: Leveraging Synthetic Data to Validate a Benchmark Study for Differential Abundance Tests for 16S Microbiome Sequencing Data.","authors":"Eva Kohnert, Clemens Kreutz","doi":"10.12688/f1000research.155230.2","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Synthetic data's utility in benchmark studies depends on its ability to closely mimic real-world conditions and reproduce results obtained from experimental data. Building on Nearing et al.'s study (1), who assessed 14 differential abundance tests using 38 experimental 16S rRNA datasets in a case-control design, we are generating synthetic datasets that mimic the experimental data to verify their findings. We will employ statistical tests to rigorously assess the similarity between synthetic and experimental data and to validate the conclusions on the performance of these tests drawn by Nearing et al. (1). This protocol adheres to the SPIRIT guidelines, demonstrating how established reporting frameworks can support robust, transparent, and unbiased study planning.</p><p><strong>Methods: </strong>We replicate Nearing et al.'s (1) methodology, incorporating synthetic data simulated using two distinct tools, mirroring the 38 experimental datasets. Equivalence tests will be conducted on a non-redundant subset of 46 data characteristics comparing synthetic and experimental data, complemented by principal component analysis for overall similarity assessment. The 14 differential abundance tests will be applied to synthetic and experimental datasets, evaluating the consistency of significant feature identification and the number of significant features per tool. Correlation analysis and multiple regression will explore how differences between synthetic and experimental data characteristics may affect the results.</p><p><strong>Conclusions: </strong>Synthetic data enables the validation of findings through controlled experiments. We assess how well synthetic data replicates experimental data, try to validate previous findings with the most recent versions of the DA methods and delineate the strengths and limitations of synthetic data in benchmark studies. Moreover, to our knowledge this is the first computational benchmark study to systematically incorporate synthetic data for validating differential abundance methods while strictly adhering to a pre-specified study protocol following SPIRIT guidelines, contributing to transparency, reproducibility, and unbiased research.</p>","PeriodicalId":12260,"journal":{"name":"F1000Research","volume":"13 ","pages":"1180"},"PeriodicalIF":0.0000,"publicationDate":"2025-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11757917/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"F1000Research","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.12688/f1000research.155230.2","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"Pharmacology, Toxicology and Pharmaceutics","Score":null,"Total":0}
引用次数: 0
Abstract
Background: Synthetic data's utility in benchmark studies depends on its ability to closely mimic real-world conditions and reproduce results obtained from experimental data. Building on Nearing et al.'s study (1), who assessed 14 differential abundance tests using 38 experimental 16S rRNA datasets in a case-control design, we are generating synthetic datasets that mimic the experimental data to verify their findings. We will employ statistical tests to rigorously assess the similarity between synthetic and experimental data and to validate the conclusions on the performance of these tests drawn by Nearing et al. (1). This protocol adheres to the SPIRIT guidelines, demonstrating how established reporting frameworks can support robust, transparent, and unbiased study planning.
Methods: We replicate Nearing et al.'s (1) methodology, incorporating synthetic data simulated using two distinct tools, mirroring the 38 experimental datasets. Equivalence tests will be conducted on a non-redundant subset of 46 data characteristics comparing synthetic and experimental data, complemented by principal component analysis for overall similarity assessment. The 14 differential abundance tests will be applied to synthetic and experimental datasets, evaluating the consistency of significant feature identification and the number of significant features per tool. Correlation analysis and multiple regression will explore how differences between synthetic and experimental data characteristics may affect the results.
Conclusions: Synthetic data enables the validation of findings through controlled experiments. We assess how well synthetic data replicates experimental data, try to validate previous findings with the most recent versions of the DA methods and delineate the strengths and limitations of synthetic data in benchmark studies. Moreover, to our knowledge this is the first computational benchmark study to systematically incorporate synthetic data for validating differential abundance methods while strictly adhering to a pre-specified study protocol following SPIRIT guidelines, contributing to transparency, reproducibility, and unbiased research.
背景:合成数据在基准研究中的效用取决于其密切模拟现实世界条件和再现实验数据所得结果的能力。在接近等人的研究(1)的基础上,我们正在生成模拟实验数据的合成数据集,以验证他们的发现。他们在病例对照设计中使用38个实验性16S rRNA数据集评估了14个差异丰度测试。我们将采用统计测试来严格评估合成数据和实验数据之间的相似性,并验证由near等人得出的关于这些测试性能的结论。(1)。本方案遵循SPIRIT指南,展示了已建立的报告框架如何支持稳健、透明和公正的研究计划。方法:我们复制了near et al.(1)的方法,结合了使用两种不同工具模拟的合成数据,反映了38个实验数据集。将对46个数据特征的非冗余子集进行等效性检验,比较合成数据和实验数据,并辅以主成分分析进行总体相似性评估。14种差异丰度测试将应用于合成和实验数据集,评估重要特征识别的一致性和每个工具的重要特征数量。相关分析和多元回归将探讨合成和实验数据特征之间的差异如何影响结果。结论:合成数据可以通过对照实验验证研究结果。我们评估了合成数据如何很好地复制实验数据,试图用最新版本的数据分析方法验证以前的发现,并描述了基准研究中合成数据的优势和局限性。此外,据我们所知,这是第一个系统地结合合成数据来验证差异丰度方法的计算基准研究,同时严格遵守预先指定的研究方案,遵循SPIRIT指南,有助于透明度,可重复性和无偏倚的研究。
F1000ResearchPharmacology, Toxicology and Pharmaceutics-Pharmacology, Toxicology and Pharmaceutics (all)
CiteScore
5.00
自引率
0.00%
发文量
1646
审稿时长
1 weeks
期刊介绍:
F1000Research publishes articles and other research outputs reporting basic scientific, scholarly, translational and clinical research across the physical and life sciences, engineering, medicine, social sciences and humanities. F1000Research is a scholarly publication platform set up for the scientific, scholarly and medical research community; each article has at least one author who is a qualified researcher, scholar or clinician actively working in their speciality and who has made a key contribution to the article. Articles must be original (not duplications). All research is suitable irrespective of the perceived level of interest or novelty; we welcome confirmatory and negative results, as well as null studies. F1000Research publishes different type of research, including clinical trials, systematic reviews, software tools, method articles, and many others. Reviews and Opinion articles providing a balanced and comprehensive overview of the latest discoveries in a particular field, or presenting a personal perspective on recent developments, are also welcome. See the full list of article types we accept for more information.