PhyloMix: Enhancing microbiome-trait association prediction through phylogeny-mixing augmentation.
Authors: Yifan Jiang, Disen Liao, Qiyun Zhu, Yang Young Lu
Journal: Bioinformatics (Oxford, England)
Publication date: 2025-01-12
Publication type: Journal Article
DOI: 10.1093/bioinformatics/btaf014 (https://doi.org/10.1093/bioinformatics/btaf014)
Abstract
Motivation: Understanding the associations between traits and microbial composition is a fundamental objective in microbiome research. Recently, researchers have turned to machine learning (ML) models to achieve this goal with promising results. However, the effectiveness of advanced ML models is often limited by the unique characteristics of microbiome data, which are typically high-dimensional, compositional, and imbalanced. These characteristics can hinder the models' ability to fully explore the relationships among taxa in predictive analyses. To address this challenge, data augmentation has become crucial. It involves generating synthetic samples with artificial labels based on existing data and incorporating these samples into the training set to improve ML model performance.
Results: Here we propose PhyloMix, a novel data augmentation method specifically designed for microbiome data to enhance predictive analyses. PhyloMix uses the phylogenetic relationships among microbiome taxa as an informative prior to guide the generation of synthetic microbial samples. Specifically, PhyloMix creates new samples by removing a subtree from one sample and combining it with the corresponding subtree from another sample. Notably, PhyloMix is designed to address the compositional nature of microbiome data, handling both raw counts and relative abundances. This approach introduces sufficient diversity into the augmented samples, leading to improved predictive performance. We empirically evaluated PhyloMix on six real microbiome datasets across five commonly used ML models. PhyloMix significantly outperforms several baseline methods, including sample-mixing-based data augmentation techniques such as vanilla mixup and compositional cutmix, as well as the phylogeny-based method TADA. We also demonstrated the wide applicability of PhyloMix in both supervised learning and contrastive representation learning.
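The subtree exchange described above can be illustrated with a minimal Python sketch. It is not the authors' implementation (see the repository under Availability for that). The toy tree `SUBTREES`, the function names, the rule of rescaling the donor subtree to the receiver's subtree mass, and the label-mixing weight `lam` are all assumptions made for illustration.

```python
import random

# Hypothetical toy phylogeny: each node maps to the leaf (taxon) indices
# beneath it. A real tree would be parsed from a phylogeny file instead.
SUBTREES = {
    "root": [0, 1, 2, 3, 4, 5],
    "cladeA": [0, 1, 2],
    "cladeB": [3, 4, 5],
    "cladeA1": [0, 1],
}


def subtree_mix(receiver, donor, leaves):
    """Replace the receiver's abundances on `leaves` with the donor's.

    Assumption: the donor's values are rescaled so the subtree keeps the
    receiver's original mass, which leaves the sample total unchanged.
    """
    mixed = list(receiver)
    receiver_mass = sum(receiver[i] for i in leaves)
    donor_mass = sum(donor[i] for i in leaves)
    if donor_mass == 0:
        return mixed  # nothing to transplant; keep the receiver as-is
    scale = receiver_mass / donor_mass
    for i in leaves:
        mixed[i] = donor[i] * scale
    return mixed


def augment(receiver, donor, y_receiver, y_donor, rng):
    """Pick a random non-root subtree, swap it in, and mix the labels."""
    node = rng.choice([n for n in SUBTREES if n != "root"])
    leaves = SUBTREES[node]
    x_new = subtree_mix(receiver, donor, leaves)
    # Assumption: the label weight is the share of taxa taken from the donor.
    lam = len(leaves) / len(receiver)
    y_new = (1 - lam) * y_receiver + lam * y_donor
    return x_new, y_new, node


# Two made-up count vectors over six taxa, with binary trait labels 0 and 1.
x1 = [10, 0, 5, 20, 5, 10]
x2 = [2, 8, 0, 1, 1, 38]
mixed = subtree_mix(x1, x2, SUBTREES["cladeA"])
print(mixed, sum(mixed) == sum(x1))

x_new, y_new, node = augment(x1, x2, 0, 1, random.Random(0))
print(node, x_new, y_new)
```

Because the transplanted subtree is rescaled to the mass it replaces, the same routine applies to raw counts and to relative abundances: in both cases the sample total stays the same as the receiver's.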
Availability: The Apache-licensed source code is available at https://github.com/batmen-lab/phylomix.
Supplementary information: Supplementary data are available at Bioinformatics.