{"title":"针对高稀疏、高维、组合数据的半参数多重估算方法","authors":"Michael B Sohn, Kristin Scheible, Steven R Gill","doi":"10.1101/2024.09.05.611521","DOIUrl":null,"url":null,"abstract":"High sparsity (i.e., excessive zeros) in microbiome data, which are high-dimensional and compositional, is unavoidable and can significantly alter analysis results. However, efforts to address this high sparsity have been very limited because, in part, it is impossible to justify the validity of any such methods, as zeros in microbiome data arise from multiple sources (e.g., true absence, stochastic nature of sampling). The most common approach is to treat all zeros as structural zeros (i.e., true absence) or rounded zeros (i.e., undetected due to detection limit). However, this approach can underestimate the mean abundance while overestimating its variance because many zeros can arise from the stochastic nature of sampling and/or functional redundancy (i.e., different microbes can perform the same functions), thus losing power. In this manuscript, we argue that treating all zeros as missing values would not significantly alter analysis results if the proportion of structural zeros is similar for all taxa, and we propose a semi-parametric multiple imputation method for high-sparse, high-dimensional, compositional data. We demonstrate the merits of the proposed method and its beneficial effects on downstream analyses in extensive simulation studies. We reanalyzed a type II diabetes (T2D) dataset to determine differentially abundant species between T2D patients and non-diabetic controls.","PeriodicalId":501307,"journal":{"name":"bioRxiv - Bioinformatics","volume":"157 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A semi-parametric multiple imputation method for high-sparse, high-dimensional, compositional data\",\"authors\":\"Michael B Sohn, Kristin Scheible, Steven R Gill\",\"doi\":\"10.1101/2024.09.05.611521\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"High sparsity (i.e., excessive zeros) in microbiome data, which are high-dimensional and compositional, is unavoidable and can significantly alter analysis results. However, efforts to address this high sparsity have been very limited because, in part, it is impossible to justify the validity of any such methods, as zeros in microbiome data arise from multiple sources (e.g., true absence, stochastic nature of sampling). The most common approach is to treat all zeros as structural zeros (i.e., true absence) or rounded zeros (i.e., undetected due to detection limit). However, this approach can underestimate the mean abundance while overestimating its variance because many zeros can arise from the stochastic nature of sampling and/or functional redundancy (i.e., different microbes can perform the same functions), thus losing power. In this manuscript, we argue that treating all zeros as missing values would not significantly alter analysis results if the proportion of structural zeros is similar for all taxa, and we propose a semi-parametric multiple imputation method for high-sparse, high-dimensional, compositional data. We demonstrate the merits of the proposed method and its beneficial effects on downstream analyses in extensive simulation studies. We reanalyzed a type II diabetes (T2D) dataset to determine differentially abundant species between T2D patients and non-diabetic controls.\",\"PeriodicalId\":501307,\"journal\":{\"name\":\"bioRxiv - Bioinformatics\",\"volume\":\"157 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"bioRxiv - Bioinformatics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1101/2024.09.05.611521\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"bioRxiv - Bioinformatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1101/2024.09.05.611521","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
摘要
微生物组数据具有高维性和组成性,其中的高稀疏性(即过多的零)是不可避免的,会严重改变分析结果。然而,解决这种高稀疏性的努力非常有限,部分原因是无法证明任何此类方法的有效性,因为微生物组数据中的零是由多种原因造成的(如真正的缺失、采样的随机性)。最常见的方法是将所有零点视为结构零点(即真正缺失)或四舍五入零点(即因检测限而未检测到)。然而,这种方法可能会低估平均丰度,同时高估其方差,因为许多零可能是由于取样的随机性和/或功能冗余(即不同微生物可以执行相同的功能)引起的,从而失去了研究的意义。在本手稿中,我们认为如果所有类群的结构零比例相似,那么将所有零作为缺失值处理并不会显著改变分析结果,我们还提出了一种针对高稀疏、高维、成分数据的半参数多重估算方法。我们在大量模拟研究中证明了所提方法的优点及其对下游分析的有利影响。我们重新分析了一个 II 型糖尿病(T2D)数据集,以确定 T2D 患者与非糖尿病对照组之间物种丰富度的差异。
A semi-parametric multiple imputation method for high-sparse, high-dimensional, compositional data
High sparsity (i.e., excessive zeros) in microbiome data, which are high-dimensional and compositional, is unavoidable and can significantly alter analysis results. However, efforts to address this high sparsity have been very limited because, in part, it is impossible to justify the validity of any such methods, as zeros in microbiome data arise from multiple sources (e.g., true absence, stochastic nature of sampling). The most common approach is to treat all zeros as structural zeros (i.e., true absence) or rounded zeros (i.e., undetected due to detection limit). However, this approach can underestimate the mean abundance while overestimating its variance because many zeros can arise from the stochastic nature of sampling and/or functional redundancy (i.e., different microbes can perform the same functions), thus losing power. In this manuscript, we argue that treating all zeros as missing values would not significantly alter analysis results if the proportion of structural zeros is similar for all taxa, and we propose a semi-parametric multiple imputation method for high-sparse, high-dimensional, compositional data. We demonstrate the merits of the proposed method and its beneficial effects on downstream analyses in extensive simulation studies. We reanalyzed a type II diabetes (T2D) dataset to determine differentially abundant species between T2D patients and non-diabetic controls.