Dirichlet-tree multinomial mixtures for clustering microbiome compositions.

IF 1.3 | Zone 4 (Mathematics) | Q2, Statistics & Probability | Annals of Applied Statistics | Pub Date: 2022-09-01 | Epub Date: 2022-07-19 | DOI: 10.1214/21-aoas1552

Jialiang Mao, Li Ma


Citations: 3

Abstract




The human microbiome has attracted substantial interest in recent years, and a common task in analyzing microbiome data is to cluster compositions into subtypes. This subdivision of samples into subgroups serves as an intermediate step toward personalized diagnosis and treatment. In applying existing clustering methods to modern microbiome studies, including the American Gut Project (AGP) data, we found that this seemingly standard task is very challenging in the microbiome composition context because of several key features of such data. Standard distance-based clustering algorithms generally do not produce reliable results because they do not account for the heterogeneity of cross-sample variability among bacterial taxa, while existing model-based approaches lack the flexibility to distinguish complex within-cluster variation from cross-cluster variation. Applied directly, such methods generally produce overly dispersed clusters in the AGP data, and this phenomenon is common in other microbiome data. To overcome these challenges, we introduce Dirichlet-tree multinomial mixtures (DTMM) as a Bayesian generative model for clustering amplicon sequencing data in microbiome studies. DTMM models the microbiome population with a mixture of Dirichlet-tree kernels that uses the phylogenetic tree to offer a more flexible covariance structure for characterizing within-cluster variation, and it provides a means of identifying a subset of signature taxa that distinguish the clusters. We perform extensive simulation studies to evaluate DTMM and compare it with state-of-the-art model-based and distance-based clustering methods in the microbiome context, and we carry out a validation study on a publicly available longitudinal data set to confirm the biological relevance of the clusters. Finally, we report a case study on the fecal data from the AGP to identify compositional clusters among individuals with inflammatory bowel disease and diabetes. Among our most interesting findings is that enterotypes (i.e., gut microbiome clusters) are not always defined by the most dominant species, as previous analyses had assumed, but can involve a number of less abundant OTUs, which cannot be identified with existing distance-based and model-based approaches.
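To make the generative idea in the abstract concrete, the following is a minimal sketch of sampling from a mixture of Dirichlet-tree multinomial kernels. It is an illustration under stated assumptions, not the authors' implementation: the four-OTU tree, the Beta pseudo-counts, the mixture weights, and the names `TREE`, `CLUSTERS`, `WEIGHTS`, `sample_counts`, and `sample_mixture` are all hypothetical, and the sketch omits the paper's posterior inference and signature-taxa selection.

```python
import random

# Hypothetical rooted binary tree over four OTUs: internal node -> (left, right).
# Parents are listed before their children so one pass can route reads downward.
TREE = {"root": ("A", "B"), "A": ("otu1", "otu2"), "B": ("otu3", "otu4")}
LEAVES = ("otu1", "otu2", "otu3", "otu4")

# Hypothetical cluster-specific Beta pseudo-counts (left, right) per internal node.
# Node "A" is identical in both clusters; only "root" and "B" separate them.
CLUSTERS = [
    {"root": (8.0, 2.0), "A": (5.0, 5.0), "B": (1.0, 9.0)},
    {"root": (2.0, 8.0), "A": (5.0, 5.0), "B": (9.0, 1.0)},
]
WEIGHTS = [0.5, 0.5]  # assumed mixture weights


def sample_counts(alphas, n_reads, rng):
    """Draw one sample's OTU counts from a Dirichlet-tree multinomial kernel."""
    # Sample-specific branching probabilities: one Beta draw per internal node
    # (a two-category Dirichlet), which gives each node its own dispersion.
    p_left = {node: rng.betavariate(a, b) for node, (a, b) in alphas.items()}
    counts = {"root": n_reads}
    for node, (left, right) in TREE.items():
        # Binomial split of the reads that reached this node.
        n_left = sum(rng.random() < p_left[node] for _ in range(counts[node]))
        counts[left] = n_left
        counts[right] = counts[node] - n_left
    return {leaf: counts[leaf] for leaf in LEAVES}


def sample_mixture(n_samples, n_reads, seed=0):
    """Draw (cluster label, OTU counts) pairs from the two-component mixture."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_samples):
        k = rng.choices(range(len(CLUSTERS)), weights=WEIGHTS)[0]
        data.append((k, sample_counts(CLUSTERS[k], n_reads, rng)))
    return data


if __name__ == "__main__":
    for label, otu_counts in sample_mixture(n_samples=5, n_reads=100):
        print(label, otu_counts)
```

Because each internal node carries its own pair of pseudo-counts, the within-cluster variability can differ from one part of the tree to another, which is the kind of added covariance flexibility the abstract attributes to Dirichlet-tree kernels.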


Source journal

Annals of Applied Statistics (Social Sciences: Statistics & Probability)

CiteScore: 3.10

Self-citation rate: 5.60%

Articles published: 131

Review time: 6-12 weeks

Journal description: Statistical research spans an enormous range, from direct subject-matter collaborations to pure mathematical theory. The Annals of Applied Statistics, the newest journal from the IMS, is aimed at papers in the applied half of this range. Published quarterly in both print and electronic form, the journal aims to provide a timely and unified forum for all areas of applied statistics.