A semi-parametric multiple imputation method for high-sparse, high-dimensional, compositional data

Michael B Sohn, Kristin Scheible, Steven R Gill
Journal: bioRxiv - Bioinformatics
Publication date: 2024-09-10
DOI: 10.1101/2024.09.05.611521 (https://doi.org/10.1101/2024.09.05.611521)

Abstract

High sparsity (i.e., excessive zeros) in microbiome data, which are high-dimensional and compositional, is unavoidable and can significantly alter analysis results. However, efforts to address this sparsity have been very limited, in part because the validity of any such method is impossible to justify: zeros in microbiome data arise from multiple sources (e.g., true absence, the stochastic nature of sampling). The most common approach is to treat all zeros as structural zeros (i.e., true absence) or as rounded zeros (i.e., undetected because of the detection limit). However, this approach can underestimate the mean abundance while overestimating its variance, and thus lose power, because many zeros can arise from the stochastic nature of sampling and/or functional redundancy (i.e., different microbes can perform the same functions). In this manuscript, we argue that treating all zeros as missing values would not significantly alter analysis results if the proportion of structural zeros is similar for all taxa, and we propose a semi-parametric multiple imputation method for high-sparse, high-dimensional, compositional data. We demonstrate the merits of the proposed method and its beneficial effects on downstream analyses in extensive simulation studies. We also reanalyzed a type II diabetes (T2D) dataset to identify species that are differentially abundant between T2D patients and non-diabetic controls.
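
The point that zeros can arise from sampling alone, even when a taxon is truly present, can be illustrated with a minimal sketch. It assumes, purely for illustration, that each sequencing read is drawn independently and belongs to a given taxon with probability equal to that taxon's true relative abundance. This simple binomial model, the function names, and the parameter values are assumptions made here and are not taken from the manuscript; the sketch does not implement the proposed imputation method.

```python
import random


def prob_sampling_zero(rel_abundance: float, depth: int) -> float:
    """Probability of observing a zero count for a taxon that is truly
    present at `rel_abundance`, when `depth` reads are drawn independently
    (illustrative binomial assumption)."""
    return (1.0 - rel_abundance) ** depth


def simulate_zero_fraction(rel_abundance: float, depth: int,
                           n_samples: int, seed: int = 0) -> float:
    """Fraction of simulated samples in which the taxon gets no reads."""
    rng = random.Random(seed)
    zeros = 0
    for _ in range(n_samples):
        # The sample is a zero only if none of the `depth` reads hits the taxon.
        detected = any(rng.random() < rel_abundance for _ in range(depth))
        if not detected:
            zeros += 1
    return zeros / n_samples


if __name__ == "__main__":
    # Hypothetical values: a rare taxon (0.1% of the community) at 1,000 reads.
    p, depth = 0.001, 1000
    print("analytic  P(zero):", prob_sampling_zero(p, depth))
    print("simulated P(zero):", simulate_zero_fraction(p, depth, n_samples=2000))
```

Under these assumptions, a taxon that is present in every sample can still produce a substantial share of zeros, which is why labelling every zero as a structural zero may be misleading.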