Enhancing supervised analysis of imbalanced untargeted metabolomics datasets using a CWGAN-GP framework for data augmentation

Computers in Biology and Medicine (IF 6.3; Zone 2, Medicine; Q1 Biology) | Pub Date: 2025-01-01 | Epub Date: 2024-11-14 | DOI: 10.1016/j.compbiomed.2024.109414

Francisco Traquete, Marta Sousa Silva, António E.N. Ferreira

Computers in Biology and Medicine, volume 184, Article 109414. Journal Article. https://www.sciencedirect.com/science/article/pii/S0010482524014999

Citations: 0

Abstract

Untargeted metabolomics is an extremely useful approach for discriminating between biological systems and identifying biomarkers. However, data analysis workflows are complex and face many challenges. Two of these challenges are the need for large sample sizes and the possibility of severe class imbalance, which is particularly common in clinical studies. Class imbalance can make statistical models less generalizable, increase the risk of overfitting and skew the analysis in favour of the majority class. One possible approach to mitigate this problem is data augmentation. However, the use of artificial data requires both adequate data augmentation methods and criteria for assessing the quality of the generated data.

In this work, we used Conditional Wasserstein Generative Adversarial Networks with Gradient Penalty (CWGAN-GPs) for data augmentation of metabolomics data. Using a set of benchmark datasets, we applied several criteria to evaluate the quality of the generated data and assessed the performance of supervised predictive models trained on datasets that included such data. CWGAN-GP models generated realistic data with identical characteristics to real samples, mostly avoiding mode collapse. Furthermore, in cases of class imbalance, the performance of predictive models improved when the minority class was supplemented with generated samples. This was evident for high-quality datasets with well-separated classes. Conversely, model improvements were quite modest for datasets with high class overlap. This trend was confirmed using synthetic datasets with different levels of class separation. Data augmentation is a viable procedure to alleviate class imbalance problems but is not universally beneficial in metabolomics.
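The idea of supplementing the minority class with generated samples can be illustrated with a minimal sketch. This is not the authors' implementation: the function `augment_minority`, the callable `generate(label, n)` and the `placeholder_generator` stub are hypothetical names introduced here, and the stub is not a GAN. It only stands in for sampling from a trained conditional generator so that the example runs end to end.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence, Tuple

Sample = List[float]


def augment_minority(
    X: Sequence[Sample],
    y: Sequence[Hashable],
    generate: Callable[[Hashable, int], List[Sample]],
) -> Tuple[List[Sample], List[Hashable]]:
    """Top up every class to the size of the largest class.

    `generate(label, n)` is a hypothetical stand-in for drawing n
    feature vectors of class `label` from a conditional generator.
    """
    counts = Counter(y)
    target = max(counts.values())  # size of the majority class
    X_aug, y_aug = list(X), list(y)  # real samples are kept unchanged
    for label, count in counts.items():
        deficit = target - count
        if deficit > 0:
            X_aug.extend(generate(label, deficit))
            y_aug.extend([label] * deficit)
    return X_aug, y_aug


def placeholder_generator(label: Hashable, n: int) -> List[Sample]:
    # ASSUMPTION: placeholder only, NOT a GAN. It returns constant
    # vectors where a trained conditional generator would be sampled.
    return [[0.0, 0.0] for _ in range(n)]


# Toy imbalanced dataset: three "control" samples, one "disease" sample.
X = [[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [5.0, 6.0]]
y = ["control", "control", "control", "disease"]

X_bal, y_bal = augment_minority(X, y, placeholder_generator)
print(Counter(y_bal))
```

In a supervised workflow, a sketch like this would normally be applied to the training portion only, so that model performance is still estimated on real samples.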

Source journal

Computers in Biology and Medicine (Engineering and Technology: Biomedical Engineering)

CiteScore: 11.70

Self-citation rate: 10.40%

Articles published: 1086

Review time: 74 days

Journal description: Computers in Biology and Medicine is an international forum for sharing groundbreaking advancements in the use of computers in bioscience and medicine. This journal serves as a medium for communicating essential research, instruction, ideas, and information regarding the rapidly evolving field of computer applications in these domains. By encouraging the exchange of knowledge, we aim to facilitate progress and innovation in the utilization of computers in biology and medicine.