Application of Generative Adversarial Networks on RNASeq data to uncover COVID-19 severity biomarkers
Yvette K. Kalimumbalo, Rosaline W. Macharia, Peter W. Wagacha
Advances in Biomarker Sciences and Technology, Volume 7, Pages 44-58. Published 2025-01-01.
DOI: 10.1016/j.abst.2025.01.002
https://www.sciencedirect.com/science/article/pii/S254310642500002X
Abstract
Background
The COVID-19 pandemic has highlighted the need for reliable biomarkers to predict disease severity and guide treatment strategies. However, machine-learning analysis of RNASeq data for biomarker discovery is constrained by limited sample sizes, primarily due to cost and privacy considerations. In this study, we applied Generative Adversarial Networks (GANs) to RNASeq data to identify biomarkers associated with COVID-19 severity.
Methods
RNASeq data from COVID-19 patients, along with severity metadata, were collected from the GEO database. Differential expression analysis was conducted, and GAN models were trained to augment the original dataset. The augmented data improved the robustness and accuracy of the downstream machine learning models used for biomarker discovery. Recursive Feature Elimination with Cross-Validation (RFECV) was then applied to the datasets augmented with conditional GAN (cGAN) and conditional Wasserstein GAN (cWGAN) models to select key biomarkers.
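The feature-selection step can be illustrated with a minimal sketch using scikit-learn's RFECV. This is not the authors' pipeline: the expression matrix is a synthetic stand-in generated with make_classification, the gene names are hypothetical placeholders, and the estimator, step size, fold count, and scoring metric are all assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for an (augmented) expression matrix:
# rows are samples, columns are genes, y is a binary severity label.
X, y = make_classification(
    n_samples=120, n_features=30, n_informative=5,
    n_redundant=0, random_state=0,
)
# Hypothetical gene identifiers, for illustration only.
gene_names = np.array([f"gene_{i}" for i in range(X.shape[1])])

# RFECV repeatedly drops the least important features and uses
# cross-validation to choose how many features to keep.
selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    step=5,                     # features removed per iteration (assumed)
    min_features_to_select=5,   # assumed lower bound
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring="roc_auc",
)
selector.fit(X, y)

# Boolean mask of retained features mapped back to gene names.
selected_genes = gene_names[selector.support_]
print(selector.n_features_, list(selected_genes))
```

When synthetic samples are used for training, it is generally safer to keep the cross-validation and test folds restricted to real samples, so that the reported performance is not inflated by generated data that resembles the training set.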
Results
Several key biomarkers significantly associated with disease severity were identified. Gene Ontology enrichment analysis revealed upregulation of neutrophil degranulation and downregulation of T-cell activity, consistent with previous findings. A Random Forest model using the five most important biomarkers (CCDC65, ZNF239, OTUD7A, CEP126, and TCTN2) predicted disease severity with high performance in ROC analysis (AUC: 0.98; accuracy: 0.94). These genes are associated with processes such as cilium assembly, interferon (IFN) activation, and NF-κB pathway suppression.
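The kind of evaluation reported above (a Random Forest scored by AUC and accuracy on a five-feature panel) can be sketched as follows. The data here are synthetic and the split ratio and hyperparameters are assumptions, so the printed values will not match the study's figures; the five columns merely stand in for a five-gene panel.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: five features play the role of a five-gene panel.
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3,
    n_redundant=0, random_state=0,
)

# Hold out a stratified test set so both severity classes are represented.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0,
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# AUC is computed from predicted probabilities of the positive class,
# accuracy from the hard class predictions.
proba = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"AUC={auc:.2f} accuracy={acc:.2f}")
```

AUC summarizes ranking quality across all decision thresholds, whereas accuracy depends on the default 0.5 threshold, which is one reason the two numbers can differ.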
Conclusions
Our results demonstrate that GANs can effectively augment RNASeq data, leading to consistent findings that align with known mechanisms and providing new insights into severe COVID-19 transcriptional responses. Further experimental validation is needed to confirm the applicability of these biomarkers in diverse populations.